Bash command - how to grep and then truncate but keep grep-ed part?
I am trying to extract a particular piece of a string. I used:
myVar=$(grep --color 'GACCT[ATCG]*AGGTC' FILE.txt | cat)
Then I used the code below to remove everything before and after my desired portion:
myVar1=$(echo ${myVar##*GACCT})
echo ${myVar1%%AGGTC*}
The code works; however, it cuts off the GACCT and AGGTC at the beginning and end of the fragment that I want to keep. Is there any way to cut off everything before and after it while still keeping the GACCT and AGGTC?
Thank you!
Solution 1:[1]
If you have GNU grep, you can use:
myVar=$(grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' FILE.txt)
See the online demo:
#!/bin/bash
s='GACCTAAATTTGGGCCCAGGTC'
# Original script
myVar=$(grep --color 'GACCT[ATCG]*AGGTC' <<< "$s" | cat)
myVar1=$(echo ${myVar##*GACCT})
echo ${myVar1%%AGGTC*}
# => AAATTTGGGCCC
# My suggestion:
grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' <<< "$s"
# => AAATTTGGGCCC
With --color=never, your matches are not colored.
The -o option outputs only the matched text, and the -P option enables the PCRE regex engine. It is necessary here since the regex pattern contains PCRE-specific operators, like \K and (?=...).
More details
GACCT - a literal string
\K - operator that makes the regex engine "forget" what has been consumed
[ATCG]+ - one or more letters from the set
(?=AGGTC) - a positive lookahead that requires an AGGTC string immediately to the right of the current location.
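The question asks for the flanking GACCT and AGGTC to stay in the output, whereas \K and the lookahead deliberately leave them out of the reported match. A minimal sketch, assuming the same sample string as in the demo: dropping both operators makes the flanks part of the match itself, so plain -o prints them and -P is no longer needed. The variable names core and full are only illustrative.

```shell
#!/bin/bash
s='GACCTAAATTTGGGCCCAGGTC'   # sample string from the demo

# With \K and the lookahead, only the middle part is reported (needs GNU grep -P).
core=$(grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' <<< "$s")
echo "$core"

# Without them, GACCT and AGGTC belong to the match, so -o prints them as well.
full=$(grep --color=never -o 'GACCT[ATCG]*AGGTC' <<< "$s")
echo "$full"
```

Because [ATCG]* is greedy, a line that contains AGGTC more than once is matched up to the last occurrence.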
Note you can get the same result as the grep -P command with pcregrep, too, if you install it:
myVar=$(pcregrep -o 'GACCT\K[ATCG]+(?=AGGTC)' FILE.txt)
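If neither grep -P nor pcregrep is available, Bash's own [[ ... =~ ... ]] operator may be enough. This is only a sketch under that assumption, using the sample string from the demo; the names re, full and core are hypothetical. It uses capture groups rather than \K and a lookahead, because Bash matches with POSIX extended regular expressions, which do not support those operators.

```shell
#!/bin/bash
s='GACCTAAATTTGGGCCCAGGTC'   # sample string from the demo

# Three capture groups: left flank, middle part, right flank.
re='(GACCT)([ATCG]+)(AGGTC)'

if [[ $s =~ $re ]]; then
    full=${BASH_REMATCH[0]}   # whole match, flanks included
    core=${BASH_REMATCH[2]}   # only the part between the flanks
    echo "$full"
    echo "$core"
fi
```

To process FILE.txt this way, the same test could run inside a while IFS= read -r line loop, at the cost of handling only the first match on each line.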
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
