Replace strings in file from a reference list

A few threads seem to ask the same question I have here, but some of the answers are tricky to generalise (or I'm not smart enough), e.g.:

how to replace strings in file based on values from another file? (example inside)

Replacing strings in file, using patterns from another file

I have some complicated files that look like this:

 ((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);

(These are Newick format phylogenetic trees in case anyone is interested)

I need to change all the ID keys (the bits that look like XXX_YYYYY) in this file and am not sure what the best approach would be.

They need to be replaced by the 'group' (operon) they belong to, and so I was thinking that making an index file of sorts would be the way to go, so for example, PLT_01696 gets replaced with group_1 say:

Keyfile:

PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
....
PAU_02074 group_2

So I think the best way would be to pass a file to sed or some equivalent, have it look for each entry in column one, and replace it with whatever I've paired it with in column two? This file will have about 350 individual keys in the end, sorted into around 12 groups.

And the file would end up looking like:

((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,group_1:0.08160284537473952438)98:0.04771898687201567291,.....

I'm open to alternative suggestions, this just seemed most apparent to me. This is on Ubuntu 14.04 so any solution is fair game really, but I'm much more au fait with bash (and a bit of perl).



Solution 1:[1]

I'll bite. Let's call the script phylo.awk:

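# First file (the key list): remember each ID and its replacement.
NR==FNR { pattern[NR] = $1; replacement[NR] = $2; count++; next }
# Second file (the data): apply every substitution to each line.
{
    for (i = 1; i <= count; i++) {
        # gsub replaces every occurrence on the line, not just the first
        gsub(pattern[i], replacement[i])
    }
    print $0
}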

Then say:

awk -f phylo.awk patterns data
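The loop above applies one substitution per key to every line. As an alternative, here is a minimal sketch that loads the key file into an associative array and swaps whole ID tokens in a single pass. The function name relabel_tree is made up for illustration, and it assumes every ID has the shape capitals, underscore, digits (as in the sample) and that the key file is not empty. Because whole tokens are looked up, a short ID cannot match inside a longer one.

```shell
# Hypothetical helper, not part of the original answer.
# Assumes IDs look like [A-Z]+_[0-9]+ and the key file holds "ID replacement" pairs.
relabel_tree() {
    # $1 = key file, $2 = tree file
    awk '
        # First file: load the lookup table (lines with fewer than two fields are ignored).
        NR == FNR { if (NF >= 2) map[$1] = $2; next }
        {
            out = ""
            rest = $0
            # Walk the line, one ID-shaped token at a time.
            while (match(rest, /[A-Z]+_[0-9]+/)) {
                id = substr(rest, RSTART, RLENGTH)
                # Keep the text before the token, then the mapped name (or the ID itself if unmapped).
                out = out substr(rest, 1, RSTART - 1) ((id in map) ? map[id] : id)
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$1" "$2"
}
```

With the file names used above, this would be called as: relabel_tree patterns data > relabelled.data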

Solution 2:[2]

One solution in such cases is to write a sed script that writes the sed script you want to execute. It appears that operons are preceded by either ( or , and are always followed by :. So, given your file containing mappings such as:

PLT_01736 group_1

then for each line in that file you want to create a sed operation that looks like:

s/\([,(]\)PLT_01736:/\1group_1:/g

where the g might not be necessary (I don't know if a given operon can appear more than once in a single line). The initial character class captures the ( or , and the \( and \) remember that, and it's followed by the specific ID key, and the colon; the replace operation outputs the remembered character, the replacement text and the colon. The advantage of tracking the preceding and following characters is that if by some mischance you have operons PLT_00100 and PLT_001001 (where one operon is a prefix of the other), tracking the surrounding characters ensures the correct match. Otherwise, you have to ensure that the longest matches appear first in the script, which is fiddlier (sort -r probably sorts that out, but …).
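To make the prefix problem concrete, here is a small sketch using a made-up one-line tree with the two hypothetical IDs mentioned above. It compares a plain substitution with the anchored form.

```shell
# Made-up tree containing two IDs where one is a prefix of the other.
tree='(PLT_00100:0.1,PLT_001001:0.2);'

# Unanchored: the shorter key also matches inside the longer one.
unanchored=$(printf '%s\n' "$tree" | sed 's/PLT_00100/group_1/g')

# Anchored on the preceding ( or , and on the following colon.
anchored=$(printf '%s\n' "$tree" | sed 's/\([,(]\)PLT_00100:/\1group_1:/g')

printf 'unanchored: %s\nanchored:   %s\n' "$unanchored" "$anchored"
# unanchored: (group_1:0.1,group_11:0.2);
# anchored:   (group_1:0.1,PLT_001001:0.2);
```

In the unanchored run the longer ID is silently turned into group_11, which is exactly the mismatch the surrounding characters guard against.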

Hence, assuming the mappings are in a file mapping.data, you can use:

sed 's%\([A-Z]*_[0-9]*\)  *\(.*\)%s/\\([,(]\\)\1:/\\1\2:/g%' mapping.data > script.sed
sed -f script.sed newick.phylogenetic.tree.data > transformed.data

This uses % in the generating s%%% operation, outputting s/// (it requires some care). The search part of the s%%% looks for zero or more upper-case letters, an underscore, and zero or more digits, capturing that with the \( and \); followed by one or more spaces, followed by some other characters which are also captured. If the ID keys can have a different structure, then change the matching regex appropriately. I assume that the input data is 'clean', so there's no need to worry about only processing lines with exactly three letters, an underscore and exactly five digits, and there are no trailing blanks. With the two parts (key ID and replacement) isolated, it is just necessary to generate the output s/// command, remembering to double up the backslashes that must appear in the output.
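If the mapping file is not guaranteed to be clean (for instance, it still contains a placeholder line such as ....), a stricter generator can help. This is only a sketch: the function name make_sed_script is made up, the -n ... p form is my own variation on the command above, and it assumes space-separated columns and replacement names that contain no /, & or backslash. Lines that do not look like "ID replacement" produce no sed command at all.

```shell
# Hypothetical wrapper around the generating command; not part of the original answer.
make_sed_script() {
    # $1 = mapping file with "ID replacement" pairs separated by spaces.
    # -n plus the trailing p prints only the lines where the substitution succeeded;
    # the pattern tolerates trailing blanks and requires at least one letter and one digit.
    sed -n 's%^\([A-Z][A-Z]*_[0-9][0-9]*\)  *\([^ ][^ ]*\) *$%s/\\([,(]\\)\1:/\\1\2:/g%p' "$1"
}
```

Usage would mirror the two commands above: make_sed_script mapping.data > script.sed, followed by the same sed -f script.sed step.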

Given your input data and list of keys, the output I get is:

((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);
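In the output above, IDs such as PLT_01716 are untouched because the sample key list only covers four keys. With around 350 keys it may be worth checking for IDs that were missed. A minimal sketch, assuming IDs are exactly three capitals, an underscore and five digits as in the sample; the function name leftover_ids is made up, and grep -o is a common extension (available in GNU grep on Ubuntu) rather than strict POSIX.

```shell
# Hypothetical check, not part of the original answer:
# list every ID-shaped token that survived the substitution, once each.
leftover_ids() {
    # $1 = transformed tree file
    grep -oE '[A-Z]{3}_[0-9]{5}' "$1" | sort -u
}
```

Running leftover_ids transformed.data prints one unmapped ID per line; empty output means every ID of that shape was replaced.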

Solution 3:[3]

#!/bin/bash

# Read each line of the reference file into two fields:
# a = the string to find, b = the string to replace it with
while read -r a b; do
    # -i edits input.txt in place; g replaces every occurrence on a line
    sed -i "s/$a/$b/g" input.txt
done < ref.txt

input:

((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);

ref:

PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
PAU_02074 group_2

output:

((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);
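The script in this solution rewrites input.txt itself and trusts every line of ref.txt. A more defensive sketch of the same loop is shown here, under my own assumptions: the function name relabel_copy is made up, the original tree is left untouched and the result goes to a separate file, reference lines that do not start with an ID-like token are skipped, and each ID is only replaced when it follows ( or , and is followed by a colon.

```shell
# Hypothetical variant of the loop; not part of the original answer.
relabel_copy() {
    # $1 = reference file, $2 = input tree, $3 = output file
    cp "$2" "$3" || return 1
    while read -r id group _; do
        # Skip blank lines and placeholders such as "....".
        case $id in
            [A-Z]*_[0-9]*) ;;
            *) continue ;;
        esac
        [ -n "$group" ] || continue
        # Write to a temporary file and move it back, which avoids relying on sed -i.
        sed "s/\([,(]\)$id:/\1$group:/g" "$3" > "$3.tmp" && mv "$3.tmp" "$3"
    done < "$1"
}
```

For the files shown above this would be called as: relabel_copy ref.txt input.txt output.txt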

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Michael Vehrs
Solution 2
Solution 3 Acrid_Soul