Heuristic sorting on text match
I'd like to order results by a goodness-of-match score computed from multiple concurrent text matches. I want to count partial matches against a collection of searches, e.g. specific characters, bigrams, prefixes.
I want to use bash, awk, command line tools, or one-liners, without writing another script.
For example, say I want to sort by how many of the 5 most common English bigrams [th,he,in,er,an] each word contains.
With the example wordlist
abashed
abashedly
abashedness
abhenry
abolisher
not
(from grep he /usr/share/dict/words | head -n5, with a non-match added),
I want the output
2 abolisher
1 abhenry
1 abashedness
1 abashedly
1 abashed
0 not
Solution 1:[1]
For the particular question "sort by the number of vowels", GNU awk is a fine choice:
produce_words |
gawk '
    {
        vowels = gensub(/[^aeiouy]/, "", "g", tolower($0))
        count[$0] = length(vowels)
    }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in count) print count[word], word
    }
'
See Using Predefined Array Scanning Orders with gawk for the PROCINFO magic.
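The same approach might carry over to the bigram example from the question. The following is only a sketch, not part of the original answer: the function name rank_by_bigrams is hypothetical, the bigram list is hard-coded from the question, and gawk is assumed to be installed because PROCINFO["sorted_in"] is a gawk extension.

```shell
# rank_by_bigrams: hypothetical helper name, not from the original answer.
# Reads one word per line on stdin and prints "count word", highest count first.
rank_by_bigrams() {
  gawk '
    BEGIN { n = split("th he in er an", bigram, " ") }
    {
      lower = tolower($0)
      count[$0] += 0                      # make sure zero-hit words are listed too
      for (i = 1; i <= n; i++)
        if (index(lower, bigram[i]))      # literal substring test, one point per bigram
          count[$0]++
    }
    END {
      PROCINFO["sorted_in"] = "@val_num_desc"   # gawk-only: scan by value, descending
      for (w in count) print count[w], w
    }
  '
}

# Usage: printf '%s\n' abashed abhenry abolisher not | rank_by_bigrams
```

Because each word is an array key, a word that appears on several input lines is printed once, with its hits added up across those lines.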
Solution 2:[2]
Awk can do this.
For each line, count how many of the patterns match. Because several patterns can match the same line, we can't use a single regex alternation (/in|er/) in the solution: that rule would fire at most once per line.
You could write this in one line, though it's awfully repetitive.
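<10-words.txt tr A-Z a-z |
awk '{tot[$0]=0}    # start every word at zero so non-matches are printed too
/th/{tot[$0]++}     # one rule per bigram, each adds a point when it matches
/he/{tot[$0]++}
/in/{tot[$0]++}
/er/{tot[$0]++}
/an/{tot[$0]++}
END{for (i in tot) print tot[i],i }' |
sort -rn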
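To cut the repetition, the pattern list could be passed in as a single string and looped over. This is a sketch rather than the answerer's code: score_words is a hypothetical helper name, the patterns are assumed to be space-separated regexes containing no backslashes, and the sort keys are chosen to reproduce the tie order shown in the question.

```shell
# score_words: hypothetical helper name, not from the original answer.
# $1 is a space-separated list of regexes (defaults to the bigrams from the question).
score_words() {
  tr 'A-Z' 'a-z' |
  awk -v pats="${1:-th he in er an}" '
    BEGIN { n = split(pats, pat, " ") }
    {
      hits = 0
      for (i = 1; i <= n; i++)
        if ($0 ~ pat[i]) hits++     # each pattern scores at most once per line
      print hits, $0
    }' |
  LC_ALL=C sort -k1,1nr -k2,2r      # count descending, then word in reverse order
}

printf '%s\n' abashed abashedly abashedness abhenry abolisher not | score_words
# 2 abolisher
# 1 abhenry
# 1 abashedness
# 1 abashedly
# 1 abashed
# 0 not
```

Since the entries are treated as regexes, a prefix test should fit in the same list, for example score_words '^ab he er'. Unlike the array-based version above, this prints one line per input line, so duplicate words are not merged.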
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | glenn jackman |
| Solution 2 | |
