Fastest way to augment a WMT17 training set

I have a WMT17 training dataset with 3,961,179 lines.

I would like to augment 198,058 randomly chosen lines, e.g. by appending \tbewegen (where \t is a tab character) to the end of each chosen line that contains the word "move".

The word "move" can appear anywhere in the sentence, as in these examples:

1. There was more behind this move than simply wishing to expand their product portfolio .
2. move and collect miles
3. January 16 - Pro@@ hi@@ bition begins in USA . Many li@@ qu@@ or @-@ lo@@ ving Americans move to France .
.
.
.

If the substring "move" appears in a line, the line should then look like this:

1. There was more behind this move than simply wishing to expand their product portfolio .\tbewegen
2. move and collect miles\tbewegen
3. January 16 - Pro@@ hi@@ bition begins in USA . Many li@@ qu@@ or @-@ lo@@ ving Americans move to France .\tbewegen
.
.
.

I already wrote a script for this, but augmenting 10 lines takes about 2 minutes, so 198,058 lines would take about 39,611 minutes (roughly 27 days).

Here is my bash script:

sed -n '=' train.de | shuf | head -198058 > lines

cat lines | while IFS= read -r line ;
do 
sed -i.bak "${line}{/move/s/\$/\tbewegen/}" train.de
done

Is there a way to shorten the process so that I don't have to wait several days?

Update: Suppose I want to apply the insert-before/insert-after operations from https://www.golinuxhub.com/2017/06/sed-insert-word-after-match-in-middle/ instead. How would the awk code in the solution need to be rewritten?

Edit:

You can insert a word before or after a matched word on randomly selected lines with these commands:

awk -i inplace '(NR==FNR){a[$1];next}
    (FNR in a) { gsub(/\<the\>/,"Before &") }
     1
    ' <(shuf -n 198058 -i 1-$(wc -l < n_train)) n_train

awk -i inplace '(NR==FNR){a[$1];next}
    (FNR in a) { gsub(/\<the\>/,"& After") }
     1
    ' <(shuf -n 198058 -i 1-$(wc -l < n_train)) n_train
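To see what the two-file awk idiom does, here is a minimal sketch on a tiny sample. The file names data.txt and picked.txt are hypothetical, the list of line numbers is fixed instead of coming from shuf, and the result is written to a new file rather than edited in place, so it does not depend on gawk's -i inplace. The plain /the/ pattern is also an assumption made for portability; the \<the\> word-boundary form is a GNU awk feature.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Hypothetical sample data and a fixed list of "randomly chosen" line numbers.
printf '%s\n' 'see the cat' 'see the dog' 'see the bird' > "$workdir/data.txt"
printf '%s\n' 1 3 > "$workdir/picked.txt"

awk 'NR==FNR { picked[$1]; next }                  # first file: remember the line numbers
     (FNR in picked) { gsub(/the/, "Before &") }   # second file: edit only picked lines
     { print }                                     # print every line exactly once
    ' "$workdir/picked.txt" "$workdir/data.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Lines 1 and 3 come out as "see Before the cat" and "see Before the bird", while line 2 is left alone. On the real data the fixed list would be replaced by the shuf call shown above.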


Solution 1:[1]

This might work for you (GNU sed):

grep -n move file | shuf | head -198058 | sed 's/:.*/s#$#\\tbewegen#/' | sed -f - file

Use grep to find (with line number) all lines containing move.

Shuffle these lines using shuf.

Take the first 198058 line numbers.

Use sed to build a sed script from the line numbers that appends \tbewegen to each line identified in the file.

Pass the sed script into another invocation of sed using the -f option and play it out against the original file.
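The steps above can be tried on a small sample first. This is only a sketch: sample.txt, the count of 2, and the intermediate script.sed file are assumptions made for illustration (the one-liner pipes the script straight into sed -f - instead). Interpreting \t in the replacement as a tab relies on GNU sed.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Hypothetical four-line stand-in for the training file.
printf '%s\n' \
  'move and collect miles' \
  'nothing to see here' \
  'Americans move to France .' \
  'this move was planned' > "$workdir/sample.txt"

# Number the matching lines, shuffle them, keep 2, and turn each
# "N:text" into the sed command  Ns#$#\tbewegen#
grep -n move "$workdir/sample.txt" | shuf | head -n 2 |
  sed 's/:.*/s#$#\\tbewegen#/' > "$workdir/script.sed"

cat "$workdir/script.sed"

# Apply all generated commands in a single pass over the file.
sed -f "$workdir/script.sed" "$workdir/sample.txt" > "$workdir/augmented.txt"
cat "$workdir/augmented.txt"
```

The point of the design is that the data file is read once, whatever the number of selected lines, whereas the loop in the question rewrites the whole file for every selected line.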

If the 198058 lines may or may not contain the word move, use:

seq $(wc -l <file) | shuf | head -198058 | sed 's/$/s#$#\\tbewegen#/' | sed -f - file
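Note that this variant appends \tbewegen to every selected line, whether or not it contains move. If the selected lines should only be changed when they contain move, as the question describes, one possible variation (not part of the answer above, so treat it as an assumption) is to guard each generated command with a /move/ address. The file name sample.txt is hypothetical, and all 4 of its 4 lines are selected here only so that the outcome is predictable.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Hypothetical four-line stand-in for the training file.
printf '%s\n' \
  'move and collect miles' \
  'nothing to see here' \
  'Americans move to France .' \
  'this move was planned' > "$workdir/sample.txt"

total=$(wc -l < "$workdir/sample.txt")

# Pick line numbers at random (use 198058 instead of 4 on the real file).
# Each number N becomes the guarded command  N{/move/s#$#\tbewegen#;}
seq "$total" | shuf | head -n 4 |
  sed 's|$|{/move/s#$#\\tbewegen#;}|' > "$workdir/script.sed"

# One pass over the data; selected lines without "move" stay unchanged.
sed -f "$workdir/script.sed" "$workdir/sample.txt" > "$workdir/augmented.txt"
cat "$workdir/augmented.txt"
```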

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 potong