Regex match all word pairs
I am trying to get all the word pairs out of a piece of text.
I have the following regular expression, (\w+) +(\w+), which I run on a piece of text with no punctuation. My issue is that it does not return all possible pairs:
$ echo "hello dear world" | grep -Eoi "(\w+) +(\w+)"
hello dear
I want the following:
$ echo "hello dear world" | grep -Eoi [some expression]
hello dear
dear world
Solution 1:[1]
Traditional grep won't return capture groups.
You can consider pcregrep with a lookahead and 2 capture groups:
echo "hello dear world" | pcregrep -o1 -o2 '(\w+)(?=(\h+\w+))'
hello dear
dear world
If you don't have pcregrep, then you can use this simple awk:
awk '{for (i=1; i<NF; ++i) print $i OFS $(i+1)}' <<< "hello dear world"
hello dear
dear world
Solution 2:[2]
With your shown samples, here is one more way of doing this in an awk program (this should work fine with any version of awk).
echo "hello dear world" | awk '{for(i=2;i<NF;i++){$i=$i ORS $i}} 1'
Explanation: echo prints the text and sends it to the awk program as standard input. The awk program loops over the fields from the second to the second-to-last one, re-assigning each of them to its own value followed by a newline (ORS) and its own value again. The trailing 1 then prints the line, whether it was edited or not.
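To make the field duplication easier to follow, here is a minimal sketch on a four-word input. The function names dup_inner_fields and pairs_guarded are hypothetical and only wrap the one-liner for reuse. The NF > 1 guard in the second function is an assumption about the desired behaviour, not part of the original answer: without it, a line holding a single word is printed unchanged even though it contains no pair.

```shell
# The program from the answer, wrapped in a hypothetical helper function.
dup_inner_fields() {
  awk '{ for (i = 2; i < NF; i++) $i = $i ORS $i } 1'
}

# Variant with a guard (an assumption, not in the original answer):
# lines with fewer than two words produce no output at all.
pairs_guarded() {
  awk 'NF > 1 { for (i = 2; i < NF; i++) $i = $i ORS $i; print }'
}

# Fields 2 and 3 become "b\nb" and "c\nc", so the rebuilt line reads:
#   a b
#   b c
#   c d
echo "a b c d" | dup_inner_fields

# A single word is echoed back by the original program...
echo "solo" | dup_inner_fields
# ...but suppressed by the guarded variant.
echo "solo" | pairs_guarded
```

Assigning to a field rebuilds the line with OFS between fields, which is why the embedded newlines end up between the duplicated words.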
Solution 3:[3]
With GNU awk for multi-char RS and the \s shorthand:
$ echo "hello dear world" | awk -v RS='\\s+' 'NR>1{print p OFS $0} {p=$0}'
hello dear
dear world
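If GNU awk is not available, the same "remember the previous word" idea can be sketched by letting tr split the words first. This is only a sketch under assumptions: the function name adjacent_pairs is hypothetical, and it relies on the common tr behaviour of mapping every character of the [:space:] class to the single newline given as the second argument. Because newlines count as whitespace here too, pairs can span line boundaries.

```shell
# Hypothetical helper: print adjacent word pairs read from stdin.
adjacent_pairs() {
  # Put every whitespace-separated word on its own line (squeezing repeats)...
  tr -s '[:space:]' '\n' |
  # ...then print each non-empty word together with the word before it.
  awk 'NF { if (seen++) print prev, $0; prev = $0 }'
}

# Prints:
#   hello dear
#   dear world
echo "hello dear world" | adjacent_pairs
```

The NF test skips the empty first record that leading whitespace in the input would otherwise produce.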
Solution 4:[4]
Perl allows lookarounds, so you can use a common technique to match overlapping texts with a capturing group inside a positive lookahead:
perl -lne 'print "$1" while /\b(?=(\w+\s+\w+))/g' file
See an online demo:
s="hello dear world"
perl -lne 'print "$1" while /\b(?=(\w+\s+\w+))/g' <<< "$s"
Output:
hello dear
dear world
Details:
\b - a word boundary
(?=(\w+\s+\w+)) - a positive lookahead that requires (immediately to the right of the current position):
  (\w+\s+\w+) - capturing group 1:
    \w+ - one or more word chars
    \s+ - one or more whitespaces
    \w+ - one or more word chars
Solution 5:[5]
Using ripgrep:
% echo "hello dear world" | rg '(\w+)\s(\w+)\s(\w+)' -r "$(printf '$1 $2\n$2 $3')"
hello dear
dear world
To produce all two-word combinations of the three words, you can combine it with the crunch command, e.g.
% echo "hello dear world" | rg -o '(\w+)\s(\w+)\s(\w+)' -r "$(crunch 5 5 + + 123 -t '$% $%' 2>/dev/null)"
hello hello
hello dear
hello world
dear hello
dear dear
dear world
world hello
world dear
world world
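For systems without crunch, the same list of ordered pairs (including a word paired with itself) can be sketched with a nested loop in awk. This is an alternative technique, not what the answer above uses, and the function name all_ordered_pairs is hypothetical. Unlike the ripgrep pattern, it is not limited to exactly three words per line.

```shell
# Hypothetical helper: print every ordered pair of words on each input line.
all_ordered_pairs() {
  awk '{ for (i = 1; i <= NF; i++) for (j = 1; j <= NF; j++) print $i, $j }'
}

# Prints the same nine lines as the crunch example, in the same order.
echo "hello dear world" | all_ordered_pairs
```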
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | anubhava |
Solution 2 | RavinderSingh13 |
Solution 3 | Ed Morton |
Solution 4 | Wiktor Stribiżew |
Solution 5 |