Remove pattern from a column if it is present in another one
I have this file:
>AX-89916436-Affx-G-[A/G]
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-[A/G]
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-[A/C]
GAAGTACGGTAACAT
>AX-89916440-Affx-T-[G/T]
AGTTGATGGTGTATGTGTGTCTTT
In the last field ([X/X]), I would like to remove the letter that is also present in the 4th field, to get something like this:
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT
So far I have:
awk -F'-' '
match($0, /\[[A-Z]\/[A-Z]]/) {m = substr($0, RSTART, RLENGTH); if(/^>/ && $NF~/m/); print ... }'
Solution 1:[1]
$ awk 'BEGIN{FS=OFS="-"} />/{gsub("[][/]",""); sub($(NF-1),"",$NF)}1' file
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX
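A step-by-step view of this one-liner, as a sketch run on a single header line taken from the question. The two `print` statements and the `steps` variable are added here for illustration only and are not part of the answer. Note that `sub()` treats the second-to-last field as a regular expression, which is harmless here because that field is a single letter.

```shell
steps=$(printf '%s\n' '>AX-89916436-Affx-G-[A/G]' |
awk 'BEGIN { FS = OFS = "-" }
/>/ {
    gsub("[][/]", "")           # acts on $0: drops "[", "]" and "/", then fields are re-split
    print "after gsub: " $0     # the last field now holds both alleles
    sub($(NF-1), "", $NF)       # removes the first match of field NF-1 from the last field
    print "after sub:  " $0
}')
printf '%s\n' "$steps"
```

This should print the header first as `>AX-89916436-Affx-G-AG` and then as `>AX-89916436-Affx-G-A`.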
Solution 2:[2]
Here is another awk:
awk 'BEGIN {FS=OFS="-"} NF>1 {gsub("[][/" $(NF-1) "]", "", $NF) } 1' file
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX
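The core of this answer is the bracket expression built by string concatenation. As a small sketch (the `classes` variable is introduced here only for the demonstration), the following prints the regex that `gsub` would receive for two of the question's header lines:

```shell
classes=$(printf '%s\n' '>AX-89916436-Affx-G-[A/G]' '>AX-89916440-Affx-T-[G/T]' |
awk 'BEGIN { FS = OFS = "-" }
# "[][/" + second-to-last field + "]" forms one bracket expression that
# matches "]", "[", "/" or the letter held in that field.
NF > 1 { print "[][/" $(NF-1) "]" }')
printf '%s\n' "$classes"
```

For these two lines it prints `[][/G]` and `[][/T]`, so a single `gsub` removes the brackets, the slash and the duplicated letter in one pass.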
Solution 3:[3]
With your shown samples, please try the following awk code. A simple explanation: set FS and OFS to - and, in the main section, check whether a line starts with > and its 5th field matches the regex \[[A-Z]\/[A-Z]]; if so, use gsub to remove from the 5th field the brackets, the slash and the character held in the 4th field. The trailing 1 is the idiomatic awk way of printing the current line, edited or not.
awk '
BEGIN{ FS=OFS="-" }
/^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{
gsub("[][/"$4"]", "", $5)
}
1' Input_file
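The `/^>/` condition is what keeps sequence lines untouched. As an illustration only, assume a hypothetical sequence line that contains a `-`, such as `ACGT-GATTACA` (nothing like it appears in the question's sample). The sketch below compares a rule that fires on any line with more than one field against the guarded rule; the variable names are made up for the demonstration.

```shell
# Hypothetical input: one header plus a sequence line containing "-".
input='>AX-89916436-Affx-G-[A/G]
ACGT-GATTACA'

# Unguarded: every line with more than one "-"-separated field is edited.
unguarded=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS="-"} NF>1 {gsub("[][/" $(NF-1) "]", "", $NF)} 1')

# Guarded: only header lines whose 5th field looks like [X/Y] are edited.
guarded=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/ {gsub("[][/" $4 "]", "", $5)} 1')

printf 'unguarded:\n%s\n\nguarded:\n%s\n' "$unguarded" "$guarded"
```

In the unguarded run the sequence line collapses to `ACGT-`, because its first field becomes the character class applied to the second; the guarded run leaves it as `ACGT-GATTACA`.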
Solution 4:[4]
Using sed
$ sed -E 's#([A-Z])-\[(\1|([A-Z]))/(\1|([A-Z]))]#\1-\3\5#' input_file
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT
Solution 5:[5]
You can use
awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{gsub("[][/"$4"]", "", $5);}1' file
Details:
BEGIN{FS=OFS="-"} - sets the input/output field separator to -
/^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/ - if the line starts with > and Field 5 contains a [ + uppercase letter + / + uppercase letter + ] substring...
{gsub("[][/"$4"]", "", $5);} - ...then remove ], [, / and the Field 4 character from Field 5
1 - triggers the default print action.
See the online demo:
#!/bin/bash
s='>AX-89916436-Affx-G-[A/G]
XXXXXXX
>AX-89916437-Affx-A-[A/G]
XXXXXXXXXXX
>AX-89916438-Affx-C-[A/C]
XXXXXXX
>AX-89916440-Affx-T-[G/T]
XXXXXXX'
awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{gsub("[][/"$4"]", "", $5);}1' <<< "$s"
Output:
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX
Solution 6:[6]
much better now :
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT
# gawk profile, created Thu May 12 05:05:48 2022
# Rule(s)
8 NF*=($_=(NF=NF)==!_?$!_:$!(NF-=($(_+=(_-=_)-+-++_-+-++_)=\
$((_+=_+=(_^=_<_)+_)-($--_!=$--_) ) )^(_-=_)+!_))~""'
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ed Morton |
| Solution 2 | anubhava |
| Solution 3 | RavinderSingh13 |
| Solution 4 | HatLess |
| Solution 5 | |
| Solution 6 | RARE Kpop Manifesto |
