'Collapsing duplicate lines using scores in particular column
I have a file that looks like this:
chr1 74347681 74347709 + 1 11
chr1 74347681 74347709 + 2 12
chr1 74347681 74347709 + 3 12
chr1 74347681 74347709 + 4 12
chr1 74347681 74347709 + 5 14
chr1 74347681 74347709 + 6 14
chr1 74347681 74347709 + 7 14
chr1 74347681 74347709 + 8 14
chr1 74347681 74347709 + 9 14
chr1 74347681 74347709 + 10 14
chr1 74347681 74347709 + 11 14
chr1 74347681 74347709 + 12 14
chr1 74347681 74347709 + 13 14
chr1 74347681 74347709 + 14 14
chr1 74347681 74347709 + 15 14
chr1 74347681 74347709 + 16 14
chr1 74347681 74347709 + 17 14
chr1 74347681 74347709 + 18 14
chr1 74347681 74347709 + 19 14
chr1 74347681 74347709 + 20 14
chr1 74347681 74347709 + 21 14
chr1 74347681 74347709 + 22 14
chr1 74347681 74347709 + 23 14
chr1 74347681 74347709 + 24 13
chr1 74347681 74347709 + 25 12
chr1 74347681 74347709 + 26 10
chr1 74347681 74347709 + 27 10
chr1 74347681 74347709 + 28 10
chr11 32284342 32284363 + 1 10
chr11 32284342 32284363 + 2 10
chr11 32284342 32284363 + 3 10
chr11 32284342 32284363 + 4 10
chr11 32284342 32284363 + 5 10
chr11 32284342 32284363 + 6 10
chr11 32284342 32284363 + 7 10
chr11 32284342 32284363 + 8 11
chr11 32284342 32284363 + 9 11
chr11 32284342 32284363 + 10 11
chr11 32284342 32284363 + 11 11
chr11 32284342 32284363 + 12 11
chr11 32284342 32284363 + 13 11
chr11 32284342 32284363 + 14 11
chr11 32284342 32284363 + 15 11
chr11 32284342 32284363 + 16 11
chr11 32284342 32284363 + 17 11
chr11 32284342 32284363 + 18 11
chr11 32284342 32284363 + 19 11
chr11 32284342 32284363 + 20 10
chr11 32284342 32284363 + 21 10
Where the columns represent
chromosome start stop strand position coverage
I would like to collapse the file so that I have the coordinates of the maximum coverage for each cluster. The maximum coverage for the first cluster (chr1 74347681 74347709 +) is 14 and the maximum coverage of the second cluster (chr11 32284342 32284363 +) is 11:
chr1 74347681 74347709 + 5 14
chr1 74347681 74347709 + 6 14
chr1 74347681 74347709 + 7 14
chr1 74347681 74347709 + 8 14
chr1 74347681 74347709 + 9 14
chr1 74347681 74347709 + 10 14
chr1 74347681 74347709 + 11 14
chr1 74347681 74347709 + 12 14
chr1 74347681 74347709 + 13 14
chr1 74347681 74347709 + 14 14
chr1 74347681 74347709 + 15 14
chr1 74347681 74347709 + 16 14
chr1 74347681 74347709 + 17 14
chr1 74347681 74347709 + 18 14
chr1 74347681 74347709 + 19 14
chr1 74347681 74347709 + 20 14
chr1 74347681 74347709 + 21 14
chr1 74347681 74347709 + 22 14
chr1 74347681 74347709 + 23 14
chr11 32284342 32284363 + 8 11
chr11 32284342 32284363 + 9 11
chr11 32284342 32284363 + 10 11
chr11 32284342 32284363 + 11 11
chr11 32284342 32284363 + 12 11
chr11 32284342 32284363 + 13 11
chr11 32284342 32284363 + 14 11
chr11 32284342 32284363 + 15 11
chr11 32284342 32284363 + 16 11
chr11 32284342 32284363 + 17 11
chr11 32284342 32284363 + 18 11
chr11 32284342 32284363 + 19 11
So the maximum coverage for the first cluster starts at position 5 and ends at position 23 and for the second cluster it starts at position 8 and ends at position 19. How do I get a final output of:
chr1 74347685 74347703 +
chr11 32284349 32284360 +
The start position is position 1 so you are really adding (5-1) and (23-1) to the start position for the first cluster and adding (8-1) and (19-1) to the start position of the second cluster.
I could only think to start by sorting the file using:
sort -k1,1 -k2,2n -k6,6nr -k5,5n
Previously, I had then taken a list of unique clusters (columns 1-4) and used a for loop to one-by-one add the positions of maximum coverage to the original start positions of the cluster but this takes a very long time. I was wondering if there is a way to do it in the file without processing it line-by-line.
Solution 1:[1]
Perl to the rescue!
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
sub output {
my ($cluster, @buffer) = @_;
for my $line (@buffer) {
say join "\t", $cluster, @$line;
}
}
my $cluster = "";
my @buffer;
while (<>) {
my ($chromosome, $start, $stop, $strand, $position, $coverage) = split;
if ($chromosome ne $cluster) {
output($cluster, @buffer) if @buffer;
@buffer = ([$start, $stop, $strand, $position, $coverage]);
$cluster = $chromosome;
} elsif ($coverage >= $buffer[-1][-1]) {
@buffer = () if $coverage > $buffer[-1][-1];
push @buffer, [$start, $stop, $strand, $position, $coverage];
}
}
# Don't forget to output the last accumulated cluster!
output($cluster, @buffer);
It reads the input line by line. It stores the lines with the maximal coverage in a buffer: if the next line has a greater coverage, it will clear the buffer, if it's less, it's skipped (if coverage can oscillate, the code is wrong!) Once a new cluster starts, the previous one is printed.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | choroba |
