Extract first position of a regex match with grep
Good morning everyone,
I have a text file containing multiple lines. I want to find a regex pattern in each line and print its position within that line using grep.
For example:
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
I want to find L[any_letter]T in the file and print the position of L and the three-letter code. In this case the result would be:
11 LIT
8 LAT
4 LKT
I wrote a grep command, but it doesn't return what I need. The command is:
grep -E -boe "L.T" file.txt
It returns:
11:LIT
21:LAT
30:LKT
Any help would be appreciated!!
Solution 1:[1]
Awk suits this better:
awk 'match($0, /L[[:alpha:]]T/) {
  print RSTART, substr($0, RSTART, RLENGTH)
}' file
Output:
11 LIT
8 LAT
4 LKT
This assumes only one such match per line.
If there can be multiple, possibly overlapping, matches per line, use:
awk '{
  n = 0
  while (match($0, /L[[:alpha:]]T/)) {
    n += RSTART
    print n, substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + 1)
  }
}' file
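To see how the running offset works, here is a small sketch of the same loop wrapped in a shell function and fed a line with several matches. The function name all_matches and the sample line are made up for illustration, and [a-zA-Z] stands in for [[:alpha:]] on the assumption that the input is plain ASCII.

```shell
# Hypothetical helper name; wraps the loop so it can be reused on files or stdin.
all_matches() {
  awk '{
    n = 0                                  # position of the latest match start in the original line
    while (match($0, /L[a-zA-Z]T/)) {
      n += RSTART                          # RSTART is relative to the already-shortened line
      print n, substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + 1)          # continue searching one character past the matched L
    }
  }' "$@"
}

# Illustrative input with four matches on one line.
printf 'ARTGHFRHOPLITLOT LATTELET\n' | all_matches
```

This should print 11 LIT, 14 LOT, 18 LAT and 23 LET, one per line. Because each iteration drops exactly RSTART characters from the front of the line, adding RSTART to n keeps n equal to the position in the original line.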
Solution 2:[2]
With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any awk.
awk '
{
  prev=0
  while(ind=index($0,"L")){
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){
      print prev+ind,substr($0,ind,3)
    }
    $0=substr($0,ind+1)
    prev+=ind
  }
}' Input_file
Explanation: Adding detailed explanation for above code.
awk '                       ##Starting awk program from here.
{
  prev=0                    ##Resetting prev (count of characters already cut from the line) for each line.
  while(ind=index($0,"L")){ ##Run while loop as long as an L is found (its index is stored in ind).
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){ ##Checking that the character 2 positions after L is T AND the character right after L is a letter.
      print prev+ind,substr($0,ind,3) ##Printing position of L in the original line (prev+ind) along with 3 letters from index of L eg:(LIT).
    }
    $0=substr($0,ind+1)     ##Setting $0 to the rest of the line right after the L just found.
    prev+=ind               ##Adding ind to prev value.
  }
}' Input_file               ##Mentioning Input_file name here.
Solution 3:[3]
Building on the answer of @anubhava, you can also use RSTART + RLENGTH as the start for substr to get multiple non-overlapping matches per line and per word.
The while loop takes the current line, and on every iteration it replaces it with the part that runs from right after the last match to the end of the string.
Note that . in a regex matches any character, not only letters, which is why a bracket expression is used here.
awk '{
  pos = 0
  while (match($0, /L[a-zA-Z]T/)) {
    print pos + RSTART, substr($0, RSTART, RLENGTH)
    pos += RSTART + RLENGTH - 1
    $0 = substr($0, RSTART + RLENGTH)
  }
}' file
If file contains
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
ARTGHFRHOPLITLOT LATTELET
LUT
The output is
11 LIT
8 LAT
4 LKT
11 LIT
14 LOT
18 LAT
23 LET
1 LUT
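The remark about . matching any character can be checked with a quick comparison. This is only a sketch: the sample string is made up, and it assumes a grep that supports -o, as the command in the question already does.

```shell
# Made-up sample: "L T" and "L1T" have a non-letter between L and T.
sample='L TxL1TxLAT'

# "." accepts any single character, so all three candidates are reported.
dot_matches=$(printf '%s\n' "$sample" | grep -oE 'L.T')

# A bracket expression restricts the middle character to ASCII letters.
letter_matches=$(printf '%s\n' "$sample" | grep -oE 'L[a-zA-Z]T')

printf 'with dot:\n%s\nwith [a-zA-Z]:\n%s\n' "$dot_matches" "$letter_matches"
```

With the dot, the matches are "L T", "L1T" and "LAT"; with the bracket expression only "LAT" remains.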
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | The fourth bird |
