'Algorithms and Data Structures best suited for a spell checker, dictionary and a thesaurus
Best way to implement a
- dictionary (is there any DS better than Trie for Dictionary)
- thesaurus (no idea, as match is made on meanings of the words, similar meanings)
- spell checker (something better than a hash map), if possible with correct spelling recommendations.
When asked in a one hour interview, are we expected to write a c/c++ code, for the algorithm?
Solution 1:[1]
See this for a 21-line Python 2.5 spelling corrector and a bit of background.
Solution 2:[2]
For the dictionary, there is indeed a data structure superior to the trie. Try a DAWG, or CDAWG: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph. Just to complicate matters, my favorite paper on the structure, Ciura and Deorowicz's "How to Squeeze a Lexicon" calls them "minimal ADFAs". Google around and you'll find many competing algorithms for constructing these beasts. Good luck!
Solution 3:[3]
I can see no better data structure than a trie for the dictionary and the thesaurus. Both can be fitted in one structure if needed with one link in the node pointing to the meaning of the word (dictionary) and one to synonyms (thesaurus). It can even offer some form of autocompletion (when there's only one link in the node).
Spelling corrector is a bit trickier - since one has to map fals input to some kind of correct input. You can take this link as a start: http://en.wikipedia.org/wiki/Spell_checker. At the end you'll find links to papers about different algorithms. According to the wikipedia article, this paper describes the most successfull algorithm: Andrew Golding and Dan Roth's "Winnow-based spelling correction algorithm"
Solution 4:[4]
Also check out the Bloom filter.
Solution 5:[5]
For a dictionary I would use the std::map (calling Dictionary in the .Net framework) collection with the word as key and a custom object (with all information about the word + the definition) as value.
For a thesaurus the best structure is a tree where each node is a section and where each branch finished with an object which contains all information about what you have to display.
Solution 6:[6]
In all three cases, you can construct a BK-Tree from your word set. BK-Trees let you find all words within a given edit distance of the entered word. See my blog post on BK-Trees for an explanation of how they work.
A dictionary and spell checker are more or less identical - the dictionary simply needs to provide definitions along with the words. For a thesaurus, words are clustered into 'synsets' - synonym sets - with all the elements inserted into the BK-Tree. When someone looks up one word in the synset, you display all the other ones as alternatives. A word can appear in multiple synsets, so you need to ensure your BK-Tree nodes can handle multiple values for a given key.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Anton Gogolev |
| Solution 2 | Grynszpan |
| Solution 3 | Glorfindel |
| Solution 4 | Anonymous |
| Solution 5 | Patrice Bernassola |
| Solution 6 | Nick Johnson |
