Bash command - how to grep and then truncate but keep grep-ed part?

I am trying to extract a particular piece of a string. I used:

myVar=$(grep --color 'GACCT[ATCG]*AGGTC' FILE.txt | cat) 

Then I used the code below to remove everything before and after my desired portion:

myVar1=$(echo ${myVar##*GACCT})
echo ${myVar1%%AGGTC*}

The code works; however, it cuts off the GACCT and AGGTC at the beginning and end of the fragment that I want to keep. Is there any way to cut off the text before and after the fragment while still keeping the GACCT and AGGTC?

Thank you!



Solution 1:[1]

If you have GNU grep, you can use

myVar=$(grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' FILE.txt)

See the online demo:

#!/bin/bash
s='GACCTAAATTTGGGCCCAGGTC'
 
# Original script
myVar=$(grep --color 'GACCT[ATCG]*AGGTC' <<< "$s" | cat)
myVar1=$(echo ${myVar##*GACCT})
echo ${myVar1%%AGGTC*}
# => AAATTTGGGCCC

# My suggestion:
grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' <<< "$s"
# => AAATTTGGGCCC

With --color=never, your matches are not colored.

The -o option outputs only the matched text, and the -P option enables the PCRE regex engine. PCRE is necessary here because the pattern uses PCRE-specific operators such as \K and (?=...).

More details

  • GACCT - a literal string
  • \K - a match-reset operator that makes the regex engine "forget" what has been consumed so far, so the preceding GACCT is left out of the reported match
  • [ATCG]+ - one or more letters from the set
  • (?=AGGTC) - a positive lookahead that requires an AGGTC string immediately to the right of the current location, without adding it to the match.
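
Because \K and the lookahead leave GACCT and AGGTC out of the reported match, the pattern above prints only the inner bases. If the flanking motifs should stay in the output, as the question asks, a plain -o match without \K or a lookahead may be enough. The following is a minimal sketch; the sample string and the variable names s and kept are made up for illustration:

```shell
#!/bin/bash
# Hypothetical sample line: the fragment of interest surrounded by other characters.
s='NNNGACCTAAATTTGGGCCCAGGTCNNN'

# -o prints only the matched text; with no \K or lookahead in the pattern,
# the match includes the GACCT and AGGTC motifs themselves.
kept=$(printf '%s\n' "$s" | grep -o 'GACCT[ATCG]*AGGTC')

echo "$kept"
# => GACCTAAATTTGGGCCCAGGTC
```

This variant needs no PCRE support, since the pattern is a basic regular expression. Note that [ATCG]* is greedy, so if AGGTC occurs more than once within an unbroken run of A/T/C/G characters, the match extends to the last occurrence.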

Note that you can get the same result with pcregrep, if it is installed:

myVar=$(pcregrep -o 'GACCT\K[ATCG]+(?=AGGTC)' FILE.txt)
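
The parameter-expansion approach from the question can also be kept by re-attaching the two motifs after trimming. This is only a sketch, assuming the line contains a single GACCT...AGGTC fragment (which the grep step already selects for); line, core and fragment are illustrative names, not part of the original script:

```shell
#!/bin/bash
# Hypothetical input, standing in for one line that grep has already selected.
line='NNNGACCTAAATTTGGGCCCAGGTCNNN'

core=${line##*GACCT}           # drop everything up to and including the last GACCT
core=${core%%AGGTC*}           # drop the first AGGTC and everything after it
fragment="GACCT${core}AGGTC"   # put the two motifs back around the inner part

echo "$fragment"
# => GACCTAAATTTGGGCCCAGGTC
```

Since both motifs are fixed strings, adding them back is safe here; with variable flanking sequences, a grep -o match would likely be the simpler choice.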

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1