'awk combine info from two files (fasta file header)

I know there are many similar questions, and I had read through many of them. But I still can't make my code work. Could somebody point the problem out for me please? Thanks!

(base) $ head Sample.pep2
>M00000032072 gene=G00000025773 seq_id=ChrM type=cds
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSRFGIY*
>M00000032073 gene=G00000025774 seq_id=ChrM type=cds
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGFSCGTI
EGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFSITMFSYA
GIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRWAAGRLPRISKFGGPKAVLRAP

$ head -n 3 mRNA.function
M00000032074 locus=g17091;makerName=TCONS_00021197.p2
M00000032073 Dbxref=MobiDBLite:mobidb-lite;locus=g17092;makerName=TCONS_00021198.p3
M00000032072 Dbxref=MobiDBLite:mobidb-lite;locus=g17093;makerName=TCONS_00021199.p1

I would like the output

>M00000032072 gene=G00000025773 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17093;makerName=TCONS_00021199.p1
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSRFGIY*
>M00000032073 gene=G00000025774 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17092;makerName=TCONS_00021198.p3
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGFSCGTI
EGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFSITMFSYA
GIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRWAAGRLPRISKFGGPKAVLRAP

and my command is awk 'NR==FNR{id[$1]=$2; next} /^>/ {print $0=$0,id[$1]}' mRNA.function Sample.pep2. But it doesn't do the job... I don't know where it is wrong...



Solution 1:[1]

Current code is close, just need to a) match the first field (Sample.pep2) with the corresponding array entry (if it exists) and b) make sure we print all input lines:

awk 'NR==FNR{id[$1]=$2; next} /^>/ {key=substr($1,2); if (key in id) $0=$0 OFS id[key]} 1' mRNA.function Sample.pep2

This generates:

>M00000032072 gene=G00000025773 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17093;makerName=TCONS_00021199.p1
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSRFGIY*
>M00000032073 gene=G00000025774 seq_id=ChrM type=cds Dbxref=MobiDBLite:mobidb-lite;locus=g17092;makerName=TCONS_00021198.p3
MFKQNPSPGWKECPPSSDKEGTTPERLDEGREMRRGKEKAFGDREISFLLHRKRRRPRIA
YGACYLKGARFFDRGAMIAGASPRSARWPIGIAACGLCLPIRIIIKNSGSARESAGNNRK
EGVHVAAAPAPLLSQWGSSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGFSCGTI
EGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFSITMFSYA
GIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRWAAGRLPRISKFGGPKAVLRAP

Solution 2:[2]

Here is a perl solution.

perl -lpe 'BEGIN { %id_to_function = map { /^(\S+)\s+(.*)/ } `cat mRNA.function`; } s{^>(\S+)(.*)}{>$1$2 $id_to_function{$1}};' sample.pep2

Prior to read the fasta file, the code executes the BEGIN { ... } block. There, the file with ids and functions is read into the hash %id_to_function.
In the main body of the code, the substitution operator s{...}{...} appends to the fasta header the function for the corresponding id using the hash lookup $id_to_function{$1}.
$1 and $2 are the first and second capture groups, correspondingly, that were captured with parentheses in the preceding regex: ^>(\S+)(.*).

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

Solution 3:[3]

awk '
NR==FNR {a[">"$1]=$2}
NR!=FNR && a[$1] != "" {$0=$0" "a[$1]}
NR!=FNR' mRNA.function Sample.pep2

You were on the right track with the array. In file2 you need to check that the line starts with the matching name from file1, before appending its data. That's what a[$1] != "" does.

This example assumes the first file only has two fields (no spaces in the data). If there are spaces I can post an edit.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 markp-fuso
Solution 2 Timur Shtatland
Solution 3 dan