Extract first position of a regex match grep

Good morning everyone,

I have a text file containing multiple lines. I want to find a regular pattern inside it and print its position using grep.

For example:

ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG

I want to find L[any_letter]T in the file and print the position of the L along with the three-letter code. In this case the result would be:

11 LIT
8 LAT
4 LKT

I wrote a grep command, but it doesn't return what I need. The command is:

grep -E -boe "L.T" file.txt

It returns:

11:LIT
21:LAT
30:LKT

Any help would be appreciated!!



Solution 1:[1]

Awk suits this better:

awk 'match($0, /L[[:alpha:]]T/) {
   print RSTART, substr($0, RSTART, RLENGTH)
}' file

11 LIT
8 LAT
4 LKT

This assumes at most one such match per line; only the first match on each line is reported.
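
To make the roles of RSTART and RLENGTH concrete, here is a small self-contained sketch that feeds inline sample lines to the same match() call. The function name first_match and the sample input are only illustrative and not part of the original answer.

```shell
# first_match is a hypothetical helper name used only for this illustration.
first_match() {
  awk 'match($0, /L[[:alpha:]]T/) {
    # RSTART is the 1-based index of the match within the current line,
    # RLENGTH is the length of the matched text.
    print RSTART, substr($0, RSTART, RLENGTH)
  }'
}

# "LATLOT" holds two matches (only the first is reported), "XYZ" holds none.
printf 'LATLOT\nXYZ\nGFRTLOPLATHLG\n' | first_match
# prints:
# 1 LAT
# 8 LAT
```

Since match() returns 0 when nothing matches, lines without a match print nothing, and the position restarts at 1 on every line.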


If there can be multiple overlapping matches per line then use:

awk '{
   n = 0
   while (match($0, /L[[:alpha:]]T/)) {
      n += RSTART
      print n, substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + 1)
   }
}' file
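
As a rough illustration of what "overlapping" means here, the sketch below runs the same loop on a made-up line, LLTT, where a second match starts inside the first one. The function name all_matches_overlapping and the input are assumptions for the demo only.

```shell
# all_matches_overlapping is a hypothetical helper name for this demo.
all_matches_overlapping() {
  awk '{
    n = 0
    while (match($0, /L[[:alpha:]]T/)) {
      n += RSTART                      # 1-based position in the original line
      print n, substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + 1)      # drop everything up to and including the matched L
    }
  }'
}

printf 'LLTT\n' | all_matches_overlapping
# prints:
# 1 LLT
# 2 LTT
```

Because the line is cut just one character past the start of each match, n stays equal to the number of characters already removed, so adding RSTART always gives the position in the original line. A loop that skipped the whole match instead (RSTART + RLENGTH) would report only 1 LLT for this input.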

Solution 2:[2]

With your shown samples, please try the following awk code. It uses only index() and substr(), so it should work in any awk.

awk '
{
  prev=0
  while(ind=index($0,"L")){
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){
      print prev+ind,substr($0,ind,3)
    }
    $0=substr($0,ind+1)
    prev+=ind
  }
}'  Input_file

Explanation: Adding detailed explanation for above code.

awk '                                                     ##Starting awk program from here.
{
  prev=0                                                  ##Resetting prev (number of characters already removed from the line) for each line.
  while(ind=index($0,"L")){                               ##Run while loop as long as letter L is found in the line (its index is stored in ind variable).
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){      ##Checking condition if character 2 positions after L is T AND character right after L is a letter.
      print prev+ind,substr($0,ind,3)                     ##Printing position in the original line (prev+ind) along with 3 letters from index of L eg:(LIT).
    }
    $0=substr($0,ind+1)                                   ##Setting line to everything after the L just found, so the next L can be searched.
    prev+=ind                                             ##Adding ind to prev, since ind characters were removed from the line.
  }
}'  Input_file                                            ##Mentioning Input_file name here.

Solution 3:[3]

Building on the answer of @anubhava, you might also use RSTART + RLENGTH as the start for the substr to get multiple (non-overlapping) matches per line and per word.

The while loop takes the current line, and on every iteration it replaces it with the part from right after the last match to the end of the string.

Note that a . in a regex matches any character, not just a letter, so L.T would also match something like L-T. A character class such as [a-zA-Z] restricts the middle character to letters.

awk '{
  pos = 0
  while (match($0, /L[a-zA-Z]T/)) {
    pos += RSTART                        # position of this match in the original line
    print pos, substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
    pos += RLENGTH - 1                   # pos now equals the number of characters removed so far
  }
}' file

If file contains:

ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
ARTGHFRHOPLITLOT LATTELET
LUT

The output is:

11 LIT
8 LAT
4 LKT
11 LIT
14 LOT
18 LAT
23 LET
1 LUT
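
The difference between . and a letter class mentioned above can be seen with a small sketch. The helper name first_pos and the sample line are made up for this illustration; the regex is passed in as a string through awk -v, which match() accepts as a dynamic regex.

```shell
# first_pos is a hypothetical helper: prints the first match of the regex
# given as $1 for every input line that contains one.
first_pos() {
  awk -v re="$1" 'match($0, re) { print RSTART, substr($0, RSTART, RLENGTH) }'
}

printf 'L-T LAT\n' | first_pos 'L.T'          # prints: 1 L-T
printf 'L-T LAT\n' | first_pos 'L[a-zA-Z]T'   # prints: 5 LAT
```

With L.T the hyphenated L-T at position 1 is reported first, while the letter class skips it and finds LAT at position 5.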

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2
Solution 3    The fourth bird