Most efficient way to locate the presence of a substring in a dictionary (NLP)
I'm currently working on a speech recognition feature, where a user can say a command and have that command trigger an event.
The way I have it structured now does work, but mostly because the recognition dictionary is small. It likely will never be millions of commands, but that's no reason to be sloppy.
Here is how it works now:
```swift
@Published var speechText: String = ""  // observed string; processText is called on change

// Command phrases mapped to action IDs
private var matchables: [String: Int] = [
    "start the action": 0,  // formal
    "start action": 0,      // informal
    "star traction": 0,     // common misfire by NLU
    "stop the action": 1    // different action
]

// Call processText with the lowercased speechText whenever the observed
// string value changes. Assume the text is conversational, such as
// "Jenny, I like chicken, also device why don't you start the action".
func processText(text: String) {
    for (key, value) in matchables {
        if text.contains(key) {
            executeActionByID(value)
        }
    }
}
```
This will loop through the matchables collection and search for the contents of each key inside of the text value. This works fine on a small dictionary, but becomes cumbersome at scale.
I could theoretically break text into n-grams and then access the dictionary directly by key, but this is long-running recognition, and text might contain a substantial number of words (hundreds?), which may exceed the practical number of lookups per utterance.
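For reference, the n-gram idea could be sketched like this. This is only a sketch under the assumption that commands are at most `maxCommandWords` words long; `matchCommands` is a hypothetical helper (the dictionary is passed in as a parameter here for testability):

```swift
// Slide a window of 1...maxCommandWords words over the transcript and
// probe the dictionary directly by key, instead of scanning the whole
// string once per command. Cost is roughly O(words * maxCommandWords)
// dictionary lookups, each near-constant time.
func matchCommands(in text: String,
                   matchables: [String: Int],
                   maxCommandWords: Int = 3) -> [Int] {
    let words = text.lowercased()
        .split { !$0.isLetter && !$0.isNumber }  // split on spaces/punctuation
        .map(String.init)
    var ids: [Int] = []
    for start in 0..<words.count {
        for length in 1...maxCommandWords where start + length <= words.count {
            let key = words[start..<start + length].joined(separator: " ")
            if let id = matchables[key] {
                ids.append(id)
            }
        }
    }
    return ids
}
```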
Is there a third, better way to analyze long running streams of text and quickly pick out commands that match a small substring?
Solution 1:[1]
Here is my back-of-the-envelope thinking about this problem:
Searching for keys in a Dictionary is really fast (almost constant time). Searching strings for substrings using String.contains(_:) is slow. (Around O(n) where n is the length of the string.)
As your string length goes up and your number of keys goes up, your time to completion is going to go up by O(n*x) (n=string length, x = number of keys.)
That's likely to get slow for longer search strings, and total time will grow multiplicatively if both your number of keys and your string length increase.
I'd suggest breaking your string into discrete units to search for (the obvious way is to divide it with spaces and other separators like punctuation.) If you do that you could check to see if each word appears in your dictionary keys. That should get you roughly O(n) time performance, since each search for a key in a dictionary runs in nearly constant time.
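A minimal sketch of that suggestion, assuming single-word command keys (multi-word commands like "start the action" would need the sliding n-gram windows described in the question); `matchedActionIDs` is a hypothetical helper name:

```swift
// Split the transcript on spaces and punctuation, then probe the
// dictionary once per token. Each key lookup is near-constant time,
// so the whole pass is roughly O(n) in the number of words.
func matchedActionIDs(in text: String, commands: [String: Int]) -> [Int] {
    text.lowercased()
        .split { !$0.isLetter && !$0.isNumber }   // word tokenization
        .compactMap { commands[String($0)] }      // keep only matching IDs
}
```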
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Duncan C |
