Regex match all word pairs
I am trying to get all the word pairs out of a piece of text.
I run the regular expression (\w+) +(\w+) on a piece of text with no punctuation. My issue is that it does not return all possible pairs:
$ echo "hello dear world" | grep -Eoi "(\w+) +(\w+)"
hello dear
I want the following:
$ echo "hello dear world" | grep -Eoi [some expression]
hello dear
dear world
Solution 1:[1]
Traditional grep won't return capture groups.
You can consider pcregrep with a lookahead and 2 capture groups:
echo "hello dear world" | pcregrep -o1 -o2 '(\w+)(?=(\h+\w+))'
hello dear
dear world
If you don't have pcregrep then you can use this simple awk:
awk '{for (i=1; i<NF; ++i) print $i OFS $(i+1)}' <<< "hello dear world"
hello dear
dear world
Solution 2:[2]
With your shown samples, here is one more way of doing this in awk (it should work with any version of awk).
echo "hello dear world" | awk '{for(i=2;i<NF;i++){$i=$i ORS $i}} 1'
Explanation: echo prints the text and sends it to awk as standard input. The awk program loops over every field except the first and the last, re-assigns each of those fields to its own value followed by a newline (ORS) and its own value again, and the trailing 1 then prints the line, edited or not.
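To make the effect of the field re-assignment concrete, here is a small sketch that runs the same awk program on a four-word line. The function name pairs_by_duplication is a hypothetical wrapper added only for illustration.

```shell
# pairs_by_duplication is a hypothetical wrapper name, not part of the answer.
pairs_by_duplication() {
  # Every field from 2 to NF-1 becomes "word<newline>word", so each inner
  # word ends one output line and starts the next one.
  awk '{ for (i = 2; i < NF; i++) $i = $i ORS $i } 1'
}

echo "hello dear world again" | pairs_by_duplication
# hello dear
# dear world
# world again
```

Because the trailing 1 prints every record, a line that contains only one word is printed unchanged rather than skipped.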
Solution 3:[3]
With GNU awk for multi-char RS and \s shorthand:
$ echo "hello dear world" | awk -v RS='\\s+' 'NR>1{print p OFS $0} {p=$0}'
hello dear
dear world
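If GNU awk is not available, a similar word-by-word approach can be sketched in portable awk by remembering the previous word across fields and lines. This is an assumption-laden variant rather than the answer's own method, and the name word_pairs_stream is hypothetical.

```shell
# word_pairs_stream is a hypothetical helper name, not part of the answer.
word_pairs_stream() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if (seen) print prev OFS $i   # pair the previous word with this one
      prev = $i                     # remember the word for the next pair
      seen = 1
    }
  }'
}

printf 'hello dear\nworld\n' | word_pairs_stream
# hello dear
# dear world
```

As with the multi-char RS version, a pair can span a line break here, since the last word of one line is paired with the first word of the next.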
Solution 4:[4]
Perl allows lookarounds, so you can use a common technique to match overlapping texts with a capturing group inside a positive lookahead:
perl -lne 'print "$1" while /\b(?=(\w+\s+\w+))/g' file
Demo:
s="hello dear world"
perl -lne 'print "$1" while /\b(?=(\w+\s+\w+))/g' <<< "$s"
Output:
hello dear
dear world
Regex details:
\b - a word boundary
(?=(\w+\s+\w+)) - a positive lookahead that requires (immediately to the right of the current position):
  (\w+\s+\w+) - Capturing group 1:
    \w+ - one or more word chars
    \s+ - one or more whitespaces
    \w+ - one or more word chars
Solution 5:[5]
Using ripgrep:
% echo "hello dear world" | rg '(\w+)\s(\w+)\s(\w+)' -r "$(printf '$1 $2\n$2 $3')"
hello dear
dear world
To print every two-word combination of the three words, you can combine it with the crunch command, e.g.:
% echo "hello dear world" | rg -o '(\w+)\s(\w+)\s(\w+)' -r "$(crunch 5 5 + + 123 -t '$% $%' 2>/dev/null)"
hello hello
hello dear
hello world
dear hello
dear dear
dear world
world hello
world dear
world world
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | anubhava |
| Solution 2 | RavinderSingh13 |
| Solution 3 | Ed Morton |
| Solution 4 | Wiktor Stribiżew |
| Solution 5 | |
