'Collapsing duplicate lines using scores in particular column

I have a file that looks like this:

chr1    74347681    74347709    +   1   11
chr1    74347681    74347709    +   2   12
chr1    74347681    74347709    +   3   12
chr1    74347681    74347709    +   4   12
chr1    74347681    74347709    +   5   14
chr1    74347681    74347709    +   6   14
chr1    74347681    74347709    +   7   14
chr1    74347681    74347709    +   8   14
chr1    74347681    74347709    +   9   14
chr1    74347681    74347709    +   10  14
chr1    74347681    74347709    +   11  14
chr1    74347681    74347709    +   12  14
chr1    74347681    74347709    +   13  14
chr1    74347681    74347709    +   14  14
chr1    74347681    74347709    +   15  14
chr1    74347681    74347709    +   16  14
chr1    74347681    74347709    +   17  14
chr1    74347681    74347709    +   18  14
chr1    74347681    74347709    +   19  14
chr1    74347681    74347709    +   20  14
chr1    74347681    74347709    +   21  14
chr1    74347681    74347709    +   22  14
chr1    74347681    74347709    +   23  14
chr1    74347681    74347709    +   24  13
chr1    74347681    74347709    +   25  12
chr1    74347681    74347709    +   26  10
chr1    74347681    74347709    +   27  10
chr1    74347681    74347709    +   28  10
chr11   32284342    32284363    +   1   10
chr11   32284342    32284363    +   2   10
chr11   32284342    32284363    +   3   10
chr11   32284342    32284363    +   4   10
chr11   32284342    32284363    +   5   10
chr11   32284342    32284363    +   6   10
chr11   32284342    32284363    +   7   10
chr11   32284342    32284363    +   8   11
chr11   32284342    32284363    +   9   11
chr11   32284342    32284363    +   10  11
chr11   32284342    32284363    +   11  11
chr11   32284342    32284363    +   12  11
chr11   32284342    32284363    +   13  11
chr11   32284342    32284363    +   14  11
chr11   32284342    32284363    +   15  11
chr11   32284342    32284363    +   16  11
chr11   32284342    32284363    +   17  11
chr11   32284342    32284363    +   18  11
chr11   32284342    32284363    +   19  11
chr11   32284342    32284363    +   20  10
chr11   32284342    32284363    +   21  10

Where the columns represent

chromosome   start   stop   strand   position   coverage

I would like to collapse the file so that I have the coordinates of the maximum coverage for each cluster. The maximum coverage for the first cluster (chr1 74347681 74347709 +) is 14 and the maximum coverage of the second cluster (chr11 32284342 32284363 +) is 11:

chr1    74347681    74347709    +   5   14
chr1    74347681    74347709    +   6   14
chr1    74347681    74347709    +   7   14
chr1    74347681    74347709    +   8   14
chr1    74347681    74347709    +   9   14
chr1    74347681    74347709    +   10  14
chr1    74347681    74347709    +   11  14
chr1    74347681    74347709    +   12  14
chr1    74347681    74347709    +   13  14
chr1    74347681    74347709    +   14  14
chr1    74347681    74347709    +   15  14
chr1    74347681    74347709    +   16  14
chr1    74347681    74347709    +   17  14
chr1    74347681    74347709    +   18  14
chr1    74347681    74347709    +   19  14
chr1    74347681    74347709    +   20  14
chr1    74347681    74347709    +   21  14
chr1    74347681    74347709    +   22  14
chr1    74347681    74347709    +   23  14
chr11   32284342    32284363    +   8   11
chr11   32284342    32284363    +   9   11
chr11   32284342    32284363    +   10  11
chr11   32284342    32284363    +   11  11
chr11   32284342    32284363    +   12  11
chr11   32284342    32284363    +   13  11
chr11   32284342    32284363    +   14  11
chr11   32284342    32284363    +   15  11
chr11   32284342    32284363    +   16  11
chr11   32284342    32284363    +   17  11
chr11   32284342    32284363    +   18  11
chr11   32284342    32284363    +   19  11

So the maximum coverage for the first cluster starts at position 5 and ends at position 23 and for the second cluster it starts at position 8 and ends at position 19. How do I get a final output of:

chr1    74347685    74347703    +
chr11   32284349    32284360    +

The start position is position 1 so you are really adding (5-1) and (23-1) to the start position for the first cluster and adding (8-1) and (19-1) to the start position of the second cluster.

I could only think to start by sorting the file using:

sort -k1,1 -k2,2n -k6,6nr -k5,5n

Previously, I had then taken a list of unique clusters (columns 1-4) and used a for loop to one-by-one add the positions of maximum coverage to the original start positions of the cluster but this takes a very long time. I was wondering if there is a way to do it in the file without processing it line-by-line.

Solution 1:^[1]

Perl to the rescue!

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

sub output {
    my ($cluster, @buffer) = @_;
    for my $line (@buffer) {
        say join "\t", $cluster, @$line;
    }
}

my $cluster = "";
my @buffer;

while (<>) {
    my ($chromosome, $start, $stop, $strand, $position, $coverage) = split;
    if ($chromosome ne $cluster) {
        output($cluster, @buffer) if @buffer;
        @buffer = ([$start, $stop, $strand, $position, $coverage]);
        $cluster = $chromosome;
    } elsif ($coverage >= $buffer[-1][-1]) {
        @buffer = () if $coverage > $buffer[-1][-1];
        push @buffer, [$start, $stop, $strand, $position, $coverage];
    }
}

# Don't forget to output the last accumulated cluster!
output($cluster, @buffer);

It reads the input line by line. It stores the lines with the maximal coverage in a buffer: if the next line has a greater coverage, it will clear the buffer, if it's less, it's skipped (if coverage can oscillate, the code is wrong!) Once a new cluster starts, the previous one is printed.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	choroba

'Collapsing duplicate lines using scores in particular column

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]