Regex match all word pairs

I am trying to get all the word pairs out of a piece of text.

I run the regular expression (\w+) +(\w+) on a piece of text with no punctuation. My issue is that it does not return all possible pairs:

$ echo "hello dear world" | grep -Eoi "(\w+) +(\w+)"
hello dear 
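
The missing pair comes from how grep -o scans: matches do not overlap, so once "hello dear " has been consumed the search resumes at "world", which has no partner left. A minimal sketch with a made-up four-word sample makes the pattern easier to see (it assumes GNU grep for \w; the function name is only for illustration):

```shell
# non_overlapping_pairs: hypothetical helper showing what grep -o reports.
# Scanning resumes after the end of each match, so every word is used at most once.
non_overlapping_pairs() {
  grep -Eo '\w+ +\w+'
}

printf 'one two three four\n' | non_overlapping_pairs
# one two
# three four
```

The pair "two three" is never reported, for the same reason "dear world" is missing above.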

I want the following

$ echo "hello dear world" | grep -Eoi [some expression]
hello dear 
dear world


Solution 1:[1]

Traditional grep won't return capture groups.

You can consider pcregrep with a lookahead and 2 capture groups:

echo "hello dear world" | pcregrep -o1 -o2 '(\w+)(?=(\h+\w+))'

hello dear
dear world

If you don't have pcregrep then you can use this simple awk:

awk '{for (i=1; i<NF; ++i) print $i OFS $(i+1)}' <<< "hello dear world"

hello dear
dear world
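
The awk loop runs once per input line, so pairs never span a line break. As a rough sketch, the same loop can be wrapped in a shell function and fed several lines (the name word_pairs and the sample input are made up for illustration):

```shell
# word_pairs: print every adjacent pair of whitespace-separated words, line by line.
# The function name is hypothetical; the awk body is the loop shown above.
word_pairs() {
  awk '{ for (i = 1; i < NF; ++i) print $i OFS $(i + 1) }'
}

printf 'hello dear world\none two\n' | word_pairs
# hello dear
# dear world
# one two
```

A line with fewer than two words produces no output, because the loop body never runs.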

Solution 2:[2]

With your shown samples, here is one more way of doing this in awk (it should work with any version of awk).

echo "hello dear world" | awk '{for(i=2;i<NF;i++){$i=$i ORS $i}} 1'

Explanation: echo prints the text and pipes it to awk as standard input. The awk program loops over the inner fields (the second through the second-to-last) and reassigns each one to its own value, followed by a newline (ORS), followed by its own value again. The trailing 1 then prints the rebuilt line, or the unchanged line if it had no inner fields.
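
To see the effect on more than three words, here is a small sketch that runs the same program on a made-up four-word line (the function name is hypothetical). Fields 2 and 3 each become "value ORS value", so the rebuilt record prints as three pairs:

```shell
# inner_dup_pairs: hypothetical wrapper around the awk program above.
# Every inner field (2 .. NF-1) becomes "<field>\n<field>"; the first and last fields are untouched.
# Assigning to a field makes awk rebuild $0 with OFS (a space) between fields.
inner_dup_pairs() {
  awk '{ for (i = 2; i < NF; i++) { $i = $i ORS $i } } 1'
}

printf 'a b c d\n' | inner_dup_pairs
# a b
# b c
# c d
```

One difference from a print-per-pair loop: a one-word line is printed unchanged here, because the trailing 1 prints every record.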

Solution 3:[3]

With GNU awk for multi-char RS and \s shorthand:

$ echo "hello dear world" | awk -v RS='\\s+' 'NR>1{print p OFS $0} {p=$0}'
hello dear
dear world
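
If GNU awk is not available, a similar result can be sketched by putting one word per line with tr first and then pairing each line with the previous one. This is a swapped-in alternative to the regex RS, not part of the answer above, and the function name is made up:

```shell
# pairs_portable: hypothetical helper that avoids the GNU-only regex RS.
pairs_portable() {
  # tr turns every whitespace character into a newline and squeezes runs of them,
  # leaving one word per line. awk skips any empty line (NF == 0) and prints the
  # previous word (p) together with the current one from the second word onward.
  tr -s '[:space:]' '\n' |
    awk 'NF == 0 { next } seen { print p OFS $0 } { p = $0; seen = 1 }'
}

printf 'hello dear world\n' | pairs_portable
# hello dear
# dear world
```

Like the RS='\\s+' version, this treats newlines as ordinary separators, so the last word of one line is paired with the first word of the next.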

Solution 4:[4]

Perl allows lookarounds, so you can use a common technique to match overlapping texts with a capturing group inside a positive lookahead:

perl -lne 'print "$1" while /\b(?=(\w+\s+\w+))/g' file

Example:

s="hello dear world"
perl -lne 'print "$1" while /\b(?=(\w+\s+\w+))/g' <<< "$s"

Output:

hello dear
dear world

Regex details:

  • \b - a word boundary
  • (?=(\w+\s+\w+)) - a positive lookahead that requires (immediately to the right of the current position):
    • (\w+\s+\w+) - Capturing group 1:
      • \w+ - one or more word chars
      • \s+ - one or more whitespaces
      • \w+ - one or more word chars

Solution 5:[5]

Using ripgrep (the pattern below assumes exactly three words, each separated by a single whitespace character):

% echo "hello dear world" | rg '(\w+)\s(\w+)\s(\w+)' -r "$(printf '$1 $2\n$2 $3')"
hello dear
dear world

To produce every two-word combination of the three words (including repeats and both orders), you can build the replacement string with the crunch command, e.g.

% echo "hello dear world" | rg -o '(\w+)\s(\w+)\s(\w+)' -r "$(crunch 5 5 + + 123 -t '$% $%' 2>/dev/null)"
hello hello
hello dear
hello world
dear hello
dear dear
dear world
world hello
world dear
world world

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 anubhava
Solution 2 RavinderSingh13
Solution 3 Ed Morton
Solution 4 Wiktor Stribiżew
Solution 5