How to deal with compound words in an Indonesian tokenizer?

I want to tokenize Indonesian text. I've tried various tokenizers, but I can't find one that handles Indonesian compound words correctly.

In Indonesian, a single word usually carries its own meaning, but sometimes two words must be combined to express a single meaning that differs from the meaning of either part.


An example is below.

(indonesian) kereta : (eng) carriage

(indonesian) api : (eng) fire

-> (indonesian) kereta api : (eng) train


And if the same word is repeated twice, the meaning may change.

(indonesian) mata : (eng) eye

(indonesian) mata mata : (eng) spy


There are quite a few cases like this, so it would be difficult to find them one by one and create a user dictionary.

Is there a tokenizer that outputs common compound words as a single token instead of splitting them?
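
To make the desired output concrete, here is a minimal sketch in plain Python that merges known compounds after whitespace tokenization. The names `COMPOUNDS` and `merge_compounds` are hypothetical, and the hand-written compound list is exactly the part I would like to avoid maintaining; it only illustrates the behavior I am after, not how an existing tokenizer works.

```python
# Hypothetical, hand-written compound list (illustration only).
# Each entry is a tuple of lowercase tokens that should stay together.
COMPOUNDS = {
    ("kereta", "api"),  # train
    ("mata", "mata"),   # spy
}


def merge_compounds(tokens, compounds=COMPOUNDS, sep=" "):
    """Greedily merge the longest known compound starting at each position."""
    # Longest compound length in the list; 1 means nothing can be merged.
    max_len = max((len(c) for c in compounds), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in compounds:
                merged.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_compounds("saya naik kereta api".split()))
```

With this list, "kereta api" comes out as one token, while a lone "kereta" or "mata" stays a separate token.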



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
