Add prefix to fasta headers using partial match

I have a fasta file where the headers look like this:

>scf7180000349958_18140-5p
>scf7180000350303_40840-5p
>scf7180000349939_17296-5p
>scf7180000350072_24702-5p
>scf7180000347531_4577-3p
>scf7180000350345_46159-3p

I would like to add a prefix to these headers based on a key file. The problem is that the IDs in the key file are only partial (they lack the -5p or -3p part), which makes this a lot more difficult for me to solve.

Map file
IDs prefix
scf7180000349958_18140  mir-67
scf7180000350303_40840  let-7
scf7180000349939_17296  mir-252
scf7180000350072_24702  mir-11
scf7180000347531_4577   mir-124
scf7180000350345_46159  mir-449

#Expected results in fasta file
>mir-67_scf7180000349958_18140-5p
>let-7_scf7180000350303_40840-5p
>mir-252_scf7180000349939_17296-5p
>mir-11_scf7180000350072_24702-5p
>mir-124_scf7180000347531_4577-3p
>mir-449_scf7180000350345_46159-3p


Solution 1:[1]

Using awk:

$ awk 'NR == FNR { ids[sprintf(">%s", $1)]=$2; next }
       $1 in ids { $1 = sprintf(">%s_%s", ids[$1], substr($1, 2)); }
       1' map.txt FS=- OFS=- input.fasta
>mir-67_scf7180000349958_18140-5p
>let-7_scf7180000350303_40840-5p
>mir-252_scf7180000349939_17296-5p
>mir-11_scf7180000350072_24702-5p
>mir-124_scf7180000347531_4577-3p
>mir-449_scf7180000350345_46159-3p

First, store all the IDs and prefixes from the map file in an array, keyed by the ID with a leading >. Then set the input and output field separators to a dash, so that $1 is just the leading part of each header (everything before the first dash). For each line of the fasta file, if $1 exists in the array, prepend the matching prefix to it. Finally, print every line of the fasta file, whether it was modified or not.
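For comparison, here is a minimal sketch of a variant that does not change the field separator. It only looks at lines starting with > and strips the trailing -5p or -3p with sub() to build the lookup key. This is an alternative to the command above rather than a replacement for it, and it assumes every suffix is exactly -5p or -3p. The scratch directory, the sequence lines, and the unmatched scf0000000000000_1 header are made up for illustration.

```shell
# Build small sample inputs in a scratch directory (contents are illustrative).
workdir=$(mktemp -d)

cat > "$workdir/map.txt" <<'EOF'
IDs prefix
scf7180000349958_18140  mir-67
scf7180000347531_4577   mir-124
EOF

cat > "$workdir/input.fasta" <<'EOF'
>scf7180000349958_18140-5p
ACGT-ACGT
>scf7180000347531_4577-3p
TGCA
>scf0000000000000_1-5p
GGGG
EOF

result=$(awk '
    # First file (map): remember the prefix for each partial ID.
    NR == FNR { prefix[$1] = $2; next }

    # Header lines only: drop ">" and the trailing -5p/-3p to get the key.
    /^>/ {
        key = substr($0, 2)
        sub(/-[35]p$/, "", key)
        if (key in prefix)
            $0 = ">" prefix[key] "_" substr($0, 2)
    }

    # Print every line, changed or not.
    { print }
' "$workdir/map.txt" "$workdir/input.fasta")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Headers with no entry in the map file, and all sequence lines (including ones that contain a dash), pass through unchanged.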

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Shawn