Joining columns based on number of fields
I have a large workflow that gets tripped up by uncharacterized chromosomes: a process produces a count matrix in which lines for canonical chromosomes have n fields, while lines for uncharacterized chromosomes have n + 1 or n + 2 fields. This is a headache when using read.table() downstream.
My approach is to first identify what n is, and use this to isolate the n + 1 and n + 2 lines containing these uncharacterized chromosomes:
awk -v nf="$canon" 'NF!=nf{print}{}' matrix.txt | head
chr22 KI270733v1 random 123189 123362 + 6 4 8 0 0 10
chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0
The goal is for these lines to end up with n fields, by joining the 1st and 2nd columns where a line has n + 1 fields, and the 1st, 2nd and 3rd columns where it has n + 2, to produce:
chrUn-GL000220v1 105951 106963 - 0 0 0 0 10 0
chr22-KI270733v1-random 123189 123362 + 6 4 8 0 0 10
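The commands here assume $canon already holds n. One way to derive it, sketched under the assumption that canonical rows are the ones with the fewest fields (canonical_nf is a hypothetical helper name, not part of the original workflow):

```shell
# canonical_nf FILE: print the smallest field count among non-blank lines.
# Assumption: rows for canonical chromosomes have the fewest fields.
canonical_nf() {
  awk 'NF && (min == "" || NF < min) { min = NF } END { print min }' "$1"
}

# usage: canon=$(canonical_nf matrix.txt)
```

If the header or any other line can be shorter than a canonical row, this guess would be wrong and n should be set by hand instead.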
Attempt
I could subset the matrix and split it into 3 files, one for NF==n, NF==n+1 & NF==n+2 and join the columns:
awk -v n="$canon" 'NF==n{print}{}' matrix.txt | head
chr1 15534236 15536814 - 0 10 0 0 0 3
(^ no action needed)
awk -v n="$canon" 'NF==n+1{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2,$3,$4,$5,$6,$7,$8,$9,$10}' | head
chrUn-GL000220v1 105992 107309 - 0 0 0 0 0 4
and
awk -v n="$canon" 'NF==n+2{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2"-"$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' | head
chr22-KI270733v1-random 123189 123362 + 6 4 8 0 0 10
Unfortunately, this solution is not dynamic - I have to specify the range of columns. The workflow could contain any number of columns after the first four detailing Chr, Start, Stop, Strand.
Hopefully I have defined the problem well; any suggestions would be greatly appreciated.
Solution 1:[1]
Try:
awk -v n=13 '{ for (i = 2; i <= NF - n + 1; ++i) { $1 = $1"-"$i; $i=""; } } 1'
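This accumulates the surplus fields into $1 and blanks each merged field with $i="". Set n to the canonical field count (13 here is only an example value).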
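To see what the one-liner does, it can be run on a sample row from the question, where n is 10 rather than 13. The wrapper name merge_into_first is hypothetical and only there to make the command reusable:

```shell
# merge_into_first N: hypothetical wrapper around the one-liner above.
# Surplus leading fields are appended to $1 and then emptied.
merge_into_first() {
  awk -v n="$1" '{ for (i = 2; i <= NF - n + 1; ++i) { $1 = $1 "-" $i; $i = "" } } 1'
}

printf '%s\n' 'chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0' | merge_into_first 10
# prints: chrUn-GL000220v1  105951 106963 - 0 0 0 0 10 0
#         (two spaces after column 1, where the emptied field used to be)
```

The emptied field is still present in the record, so the output has a doubled separator. That is harmless for a reader that splits on runs of whitespace, such as read.table() with its default sep = "", but it would show up as an empty column if the output separator were a tab.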
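Alternatively, instead of blanking the merged fields, shift the remaining values to the left: if (NF != n) { k = NF - n; for (i = 2; i <= n; ++i) $i = $(i + k) }, and then set NF = n to drop the leftover fields at the end.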
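The shift-left idea can be sketched as a complete script. This is one possible reading of it, not necessarily the exact form the answer had in mind: the function name join_extra_fields, the tab output separator, and n = 10 (taken from the question's sample rows) are assumptions. It also relies on assigning to NF to truncate the record, which works in gawk and mawk but should be treated as an assumption for other awk implementations.

```shell
# join_extra_fields N: fold surplus leading fields into column 1 so that
# every line ends up with exactly N fields (hypothetical helper name).
join_extra_fields() {
  awk -v n="$1" -v OFS='\t' '
    NF > n {
      k = NF - n                      # number of surplus fields
      for (i = 2; i <= k + 1; i++)    # append them to $1
        $1 = $1 "-" $i
      for (i = 2; i <= n; i++)        # shift the remaining fields left
        $i = $(i + k)
      NF = n                          # drop the leftover tail
    }
    { $1 = $1; print }                # rebuild every record with OFS
  '
}

printf '%s\n' \
  'chr1 15534236 15536814 - 0 10 0 0 0 3' \
  'chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0' \
  'chr22 KI270733v1 random 123189 123362 + 6 4 8 0 0 10' |
  join_extra_fields 10
```

Because the number of surplus fields is computed per line as NF - n, nothing depends on how many count columns follow Chr, Start, Stop and Strand.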
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | KamilCuk |
