How to deal with compound words in an Indonesian tokenizer?
I want to use an Indonesian tokenizer. I've tried various tokenizers, but I can't find one whose output takes Indonesian compound words into account.
In Indonesian, a single word can carry a meaning on its own, but sometimes two words must be combined to express a single meaning.
For example:
(indonesian) kereta : (eng) carriage
(indonesian) api : (eng) fire
-> (indonesian) kereta api : (eng) train
And if the same word is repeated twice, the meaning may change.
(indonesian) mata : (eng) eye
(indonesian) mata mata : (eng) spy
There are quite a few cases like this, so it would be impractical to find them one by one and build a user dictionary by hand.
Is there a tokenizer that keeps common compound words together as a single token instead of splitting them?
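
To make the desired output concrete, here is a minimal sketch of the behaviour being asked for. It is not an existing Indonesian tokenizer: the `COMPOUNDS` set and the `tokenize` function are hypothetical names, and the two entries are just the examples from this question. It relies on exactly the kind of hand-built dictionary that is impractical at scale, so it only illustrates the target result.

```python
# Hypothetical, hand-written compound list used only for illustration.
# A real solution would need this list to come from a lexicon or corpus.
COMPOUNDS = {("kereta", "api"), ("mata", "mata")}
MAX_LEN = max(len(c) for c in COMPOUNDS)


def tokenize(text):
    """Split on whitespace, then greedily merge known compounds (longest first)."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest possible window first, down to two words.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in COMPOUNDS:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            # No compound starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize("Saya naik kereta api"))  # ['saya', 'naik', 'kereta api']
```

With this sketch, "kereta api" and "mata mata" each come out as one token, while a lone "mata" stays a single word.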
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
