Heuristic sorting on text match

I'd like to order results by a goodness-of-match score computed over several text matches at once. The score should count partial matches against a collection of searches, e.g. specific characters, bigrams, or prefixes.

I want to use bash, awk, command line tools, or one-liners, without writing another script.

For example, say I want to sort by how many of the 5 most common English bigrams [th,he,in,er,an] each word contains.

With this example word list:

abashed
abashedly
abashedness
abhenry
abolisher
not

(from grep he /usr/share/dict/words | head -n5, with non-match added).

I want this output:

2 abolisher
1 abhenry
1 abashedness
1 abashedly
1 abashed
0 not


Solution 1:[1]

For the particular question "sort by the number of vowels", GNU awk is a fine choice:

produce_words |
gawk '
  {
    vowels = gensub(/[^aeiouy]/, "", "g", tolower($0))
    count[$0] = length(vowels)
  }
  END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (word in count) print count[word], word
  }
'

See Using Predefined Array Scanning Orders with gawk for the PROCINFO magic.

Solution 2:[2]

Awk can do this.

For each line, count how many of the patterns match. Because several patterns can match the same line, a single regex alternation (/in|er/) won't work: that rule fires at most once per line, so it would add at most 1 to the count.
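To illustrate the point about alternation, here is a small sketch using the word "the", which contains both "th" and "he". The input word is chosen purely for illustration; only standard awk features are used.

```shell
# One alternation rule: fires once for the line, so the count is 1.
printf 'the\n' | awk '/th|he/ { n++ } END { print n + 0 }'

# Two separate rules: each fires on its own, so the count is 2.
printf 'the\n' | awk '/th/ { n++ } /he/ { n++ } END { print n + 0 }'
```

The first command prints 1 and the second prints 2, which is why each bigram needs its own rule.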

You could write this on one line, though it's awfully repetitive.

<10-words.txt tr A-Z a-z |
awk '{tot[$0]=0}          # every word starts at 0, so non-matches are printed too
    /th/{tot[$0]++}       # one rule per bigram, each adding at most 1
    /he/{tot[$0]++}
    /in/{tot[$0]++}
    /er/{tot[$0]++}
    /an/{tot[$0]++}
  END{for (i in tot) print tot[i],i }' |
sort -rn
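The repetition can be avoided by passing the patterns in as data. This is a minimal sketch, assuming the patterns are plain substrings rather than regexes. The function name score_words is hypothetical, and unlike the array-based version it prints every input line, duplicates included.

```shell
# Hypothetical helper: score each input line by how many of the
# space-separated patterns in $1 it contains, highest score first.
score_words() {
  tr 'A-Z' 'a-z' |
  awk -v pats="$1" '
    BEGIN { n = split(pats, p, " ") }   # p[1..n] holds the patterns
    {
      score = 0
      for (i = 1; i <= n; i++)
        if (index($0, p[i])) score++    # literal substring test, at most 1 per pattern
      print score, $0
    }' |
  sort -rn
}

printf '%s\n' abashed abashedly abashedness abhenry abolisher not |
  score_words "th he in er an"
```

With the sample word list, "2 abolisher" comes first and "0 not" last. The order among the words that score 1 is left to sort and may vary with locale. Using index() keeps the patterns literal; switching to a regex match ($0 ~ p[i]) would be the natural change if prefixes such as ^ab are needed.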

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 glenn jackman
Solution 2