Remove pattern from a column if it is present in another one

I have this file:

>AX-89916436-Affx-G-[A/G]
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-[A/G]
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-[A/C]
GAAGTACGGTAACAT
>AX-89916440-Affx-T-[G/T]
AGTTGATGGTGTATGTGTGTCTTT

In the last field, [X/X], I would like to remove the letter that is also present in the 4th field, along with the brackets and the slash, to get something like this:

>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT

So far I have:

 awk -F'-' '
    match($0, /\[[A-Z]\/[A-Z]]/) {m = substr($0, RSTART, RLENGTH); if(/^>/ && $NF~/m/); print ... }'
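
A note on the attempt above: in awk, /m/ is a regex literal that looks for the letter "m"; it does not read the variable m. To match against a variable's contents, put the bare variable on the right-hand side of ~. A minimal sketch of the difference, assuming the letter of interest is taken from the 4th field (the function name demo_dynamic_regex is made up for illustration):

```shell
# Compare a regex literal with a dynamic regex held in a variable.
demo_dynamic_regex() {
  awk -F'-' '/^>/ {
      m = $4                                  # letter to look for, e.g. "G"
      literal = ($NF ~ /m/) ? "yes" : "no"    # tests for the character "m"
      dynamic = ($NF ~ m)   ? "yes" : "no"    # tests for the contents of m
      print $NF, "literal=" literal, "dynamic=" dynamic
  }'
}

printf '%s\n' '>AX-89916436-Affx-G-[A/G]' | demo_dynamic_regex
# prints: [A/G] literal=no dynamic=yes
```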


Solution 1:[1]

$ awk 'BEGIN{FS=OFS="-"} />/{gsub("[][/]",""); sub($(NF-1),"",$NF)}1' file
>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX
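
The one-liner is dense, so here is the same logic spread over several lines with comments. This is only a sketch of how it appears to work; the wrapper function strip_ref_allele is a made-up name, and the short sequence line is a placeholder.

```shell
# Same steps as the one-liner, one per line.
strip_ref_allele() {
  awk '
  BEGIN { FS = OFS = "-" }
  />/ {
      gsub("[][/]", "")        # drop [, ] and / from the whole record; awk then re-splits the fields
      sub($(NF-1), "", $NF)    # use the 4th field as a regex and remove it once from the last field
  }
  1                            # print every line, edited or not
  ' "$@"
}

printf '%s\n' '>AX-89916436-Affx-G-[A/G]' 'TTGTCC' | strip_ref_allele
# prints:
# >AX-89916436-Affx-G-A
# TTGTCC
```

Because the second-to-last field is passed to sub as a regex, this relies on that field being a plain letter, as it is in the sample.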

Solution 2:[2]

Here is another awk:

awk 'BEGIN {FS=OFS="-"} NF>1 {gsub("[][/" $(NF-1) "]", "", $NF) } 1' file

>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX

Solution 3:[3]

With your shown samples, please try the following awk code. Simple explanation: set FS and OFS to -, then in the main section check whether the line starts with > and the 5th field matches the regex \[[A-Z]\/[A-Z]]. If so, use gsub to remove from the 5th field the brackets, the slash and whatever value the 4th field holds. The trailing 1 is the awk-ish way of printing the current line, edited or not.

awk '
BEGIN{ FS=OFS="-" }
/^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{
  gsub("[][/"$4"]", "", $5)
}
1' Input_file

Solution 4:[4]

Using sed

$ sed -E 's#([A-Z])-\[(\1|([A-Z]))/(\1|([A-Z]))]#\1-\3\5#' input_file
>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT
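
If the nested alternation is hard to follow, the same idea can be split into two simpler substitutions, one per position of the repeated letter. This is a variant sketch, not the command above. It assumes a sed, such as GNU sed, that accepts back-references inside -E patterns, and the function name strip_ref_allele_sed is made up for illustration.

```shell
# First expression: the 4th-field letter is the allele before the slash, keep the other one.
# Second expression: the 4th-field letter is the allele after the slash, keep the other one.
strip_ref_allele_sed() {
  sed -E \
    -e 's#([A-Z])-\[\1/([A-Z])\]#\1-\2#' \
    -e 's#([A-Z])-\[([A-Z])/\1\]#\1-\2#' "$@"
}

printf '%s\n' '>AX-89916437-Affx-A-[A/G]' '>AX-89916440-Affx-T-[G/T]' | strip_ref_allele_sed
# prints:
# >AX-89916437-Affx-A-G
# >AX-89916440-Affx-T-G
```

Lines where neither pattern matches, such as the sequence lines, pass through unchanged.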

Solution 5:[5]

You can use

awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{gsub("[][/"$4"]", "", $5);}1' file

Details:

  • BEGIN{FS=OFS="-"} - set input/output field separator to -
  • /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/ - if the string starts with > and Field 5 contains [ + uppercase letter + / + uppercase letter + ] substring...
  • {gsub("[][/"$4"]", "", $5);} - then remove from Field 5 ], [, / and Field 4 chars
  • 1 - fires the default print action.
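
To make the gsub step more concrete, the sketch below prints the bracket expression that gets built for each header line, plus what is left of Field 5 after applying it. The function name show_bracket_expr and the variables re and cleaned are made up for illustration; the substitution is done on a copy, so the record itself is not modified.

```shell
# Show the dynamic regex "[][/" $4 "]" and its effect on Field 5.
show_bracket_expr() {
  awk 'BEGIN { FS = "-" }
  /^>/ {
      re = "[][/" $4 "]"       # same string the gsub call receives, e.g. [][/G]
      cleaned = $5
      gsub(re, "", cleaned)    # delete every ], [, / and Field 4 letter from the copy
      print $5 " with " re " -> " cleaned
  }' "$@"
}

printf '%s\n' '>AX-89916436-Affx-G-[A/G]' '>AX-89916440-Affx-T-[G/T]' | show_bracket_expr
# prints:
# [A/G] with [][/G] -> A
# [G/T] with [][/T] -> G
```

Inside a bracket expression, a ] placed first is taken literally, which is why [][/G] means "any one of ], [, / or G".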

See the online demo:

#!/bin/bash
s='>AX-89916436-Affx-G-[A/G]
XXXXXXX
>AX-89916437-Affx-A-[A/G]
XXXXXXXXXXX
>AX-89916438-Affx-C-[A/C]
XXXXXXX
>AX-89916440-Affx-T-[G/T]
XXXXXXX'

awk 'BEGIN{FS=OFS="-"} /^>/ && $5 ~ /\[[A-Z]\/[A-Z]]/{gsub("[][/"$4"]", "", $5);}1' <<< "$s"

Output:

>AX-89916436-Affx-G-A
XXXXXXX
>AX-89916437-Affx-A-G
XXXXXXXXXXX
>AX-89916438-Affx-C-A
XXXXXXX
>AX-89916440-Affx-T-G
XXXXXXX

Solution 6:[6]

much better now :

>AX-89916436-Affx-G-A
TTGTCCGAGAGTGACGTCAATCCGCA
>AX-89916437-Affx-A-G
TGTGTGGAAACTCCG
>AX-89916438-Affx-C-A
GAAGTACGGTAACAT
>AX-89916440-Affx-T-G
AGTTGATGGTGTATGTGTGTCTTT

# gawk profile, created Thu May 12 05:05:48 2022

# Rule(s)

8 NF*=($_=(NF=NF)==!_?$!_:$!(NF-=($(_+=(_-=_)-+-++_-+-++_)=\
       $((_+=_+=(_^=_<_)+_)-($--_!=$--_) ) )^(_-=_)+!_))~""'

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ed Morton
Solution 2 anubhava
Solution 3 RavinderSingh13
Solution 4 HatLess
Solution 5
Solution 6 RARE Kpop Manifesto