'create new column containing a substring of an existing column in data using bash
I have a large tsv.gz file (40GB) for which I want to extract a string from an existing variable col3, store it in a new variable New_var (placed at the beginning) and save everything the new file.
an example of the data "old_file.tsv.gz"
col1 col2 col3 col4
1 positive 12:1234A 100
2 negative 10:9638B 110
3 positive 5:0987A 100
4 positive 8:5678A 170
Desired data "new_file.tsv.gz"
New_var col1 col2 col3 col4
12 1 positive 12:1234A 100
10 2 negative 10:9638B 110
5 3 positive 5:0987A 100
8 4 positive 8:5678A 170
I am new in bash so I have tried multiple things but I get stuck, I have tried
zcat old_file.tsv.gz | awk '{print New_var=$3,$0 }' | awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'
I think I have multiple problems. {print New_var=$3,$0 } do create a duplicate of col3 but doesn't rename it. Then when I add the last part of the code awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'...well nothing comes up (I tried to look if I forget a parenthesis but cannot find the problem).
Also I am not sure if this way is the best way to do it.
Any idea how to make it work?
Solution 1:[1]
Make an AWK script in a separate file (for readability), say 1.awk:
{ if (NR > 1) {
# all data lines
split($3, a, ":");
print a[1], $1, $3, $3, $4;
} else {
# header line
print "new_var", $1, $2, $3, $4;
}
}
Now process the input (say 1.csv.gz) with the AWK file:
zcat 1.csv.gz | awk -f 1.awk | gzip -c > 1_new.csv.gz
Solution 2:[2]
I suggest to use one tab (\t) and : as input field separator:
awk 'BEGIN { FS="[\t:]"; OFS="\t" }
NR==1 { $1="New_var" OFS $1 }
NR>1 { $0=$3 OFS $0 }
{ print }'
As one line:
awk 'BEGIN{ FS="[\t:]"; OFS="\t" } NR==1{ $1="New_var" OFS $1 } NR>1{ $0=$3 OFS $0 } { print }'
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Michail Alexakis |
| Solution 2 | Cyrus |
