Why can't I tokenize text in languages other than English using NLTK?

I'm trying to tokenise strings in different languages using word_tokenize from nltk.tokenize. What I'm finding is that, no matter what language I select, and no matter what language the string I try to tokenise is in, the tokeniser defaults to English.

For example, when I try to tokenise some German text and specify that the language is German:

from nltk.tokenize import word_tokenize

test_de = "Das lange Zeit verarmte und daher von Auswanderung betroffene Irland " \
          "hat sich inzwischen zu einer hochmodernen, in manchen Gegenden multikulturellen " \
          "Industrie- und Dienstleistungsgesellschaft gewandelt."

print(word_tokenize(test_de, 'german'))

I get this output:

['Das', 'lange', 'Zeit', 'verarmte', 'und', 'daher', 'von', 'Auswanderung', 'betroffene', 'Irland', 'hat', 'sich', 'inzwischen', 'zu', 'einer', 'hochmodernen', ',', 'in', 'manchen', 'Gegenden', 'multikulturellen', 'Industrie-', 'und', 'Dienstleistungsgesellschaft', 'gewandelt', '.']

You can see that German compound words like 'Dienstleistungsgesellschaft' aren't split into their components, 'Dienstleistungs' and 'gesellschaft'.

When I try to tokenise English text, but specify that the language is German:

from nltk.tokenize import word_tokenize

test_en = "This is some test text. It's short. It doesn't say very much."

print(word_tokenize(test_en, 'german'))

I get this output:

['This', 'is', 'some', 'test', 'text', '.', 'It', "'s", 'short', '.', 'It', 'does', "n't", 'say', 'very', 'much', '.']

It's still clearly being tokenised like English text, even though I specified German. You can see it's splitting off English contractions like "n't" and "'s".

Am I doing something wrong? How can I tokenise other languages than English?



Solution 1:[1]

NLTK can tokenize several languages, including German (see a previous SO question). Note that the `language` argument of `word_tokenize` only selects the Punkt model used for sentence splitting; the word-level rules (e.g. splitting off "n't") are the same English-oriented rules for every language, which is why the output looks English-like. Compound splitting, however, is traditionally not part of tokenization at all. Although it is rather simple in most cases, it can be ambiguous, and you need context to resolve the split correctly. E.g., the word "Waldecke" has two possible segmentations, "Wald+ecke" and "Wal+decke", but most of the time only the first one makes sense.

What you probably want is to apply a compound splitter to the tokenized text. There are several options, including both rule-based and machine-learned tools.
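As an illustration, here is a minimal dictionary-based splitter sketch. The vocabulary and the list of linking elements below are toy assumptions made up for the example; real splitters use large lexicons, frequency-based scoring, and fuller handling of German linking elements (Fugenelemente) such as the "s" in "Dienstleistungsgesellschaft":

```python
# Toy vocabulary and linking elements -- illustrative assumptions only.
VOCAB = {"dienstleistung", "gesellschaft", "industrie", "wald", "ecke"}
LINKERS = ("", "s", "es", "n", "en")  # common German linking elements

def split_compound(word, vocab=VOCAB):
    """Recursively split a word into vocabulary items, allowing a
    linking element at the end of each non-final part."""
    word = word.lower()
    if word in vocab:
        return [word]
    for i in range(len(word) - 1, 2, -1):  # prefer the longest head
        head = word[:i]
        for link in LINKERS:
            stem = head[: len(head) - len(link)] if link else head
            if head.endswith(link) and stem in vocab:
                rest = split_compound(word[i:], vocab)
                if rest is not None:
                    return [stem] + rest
    return None  # no segmentation found

print(split_compound("Dienstleistungsgesellschaft"))
# ['dienstleistung', 'gesellschaft']
```

A greedy longest-head strategy like this already shows why context matters: with both "wald"/"ecke" and "wal"/"decke" in the lexicon, the splitter would need scores, not just dictionary lookups, to prefer the sensible reading.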

Note that most current NLP models based on neural networks use statistical subword segmentation (such as Byte-Pair Encoding or SentencePiece), which avoids the need for linguistically motivated segmentation.
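The core idea behind BPE can be sketched in a few lines of plain Python: repeatedly merge the most frequent adjacent pair of symbols. This is a toy demo of the mechanism, not SentencePiece's or subword-nmt's actual implementation (those add end-of-word markers, frequency-weighted corpora, and thousands of merges):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent symbol pair across the corpus."""
    corpus = [list(w) for w in words]  # each word as a symbol sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the corpus.
        for symbols in corpus:
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                else:
                    i += 1
    return merges, corpus

merges, segmented = learn_bpe(["lower", "lowest", "newer", "newest"], 4)
print(merges[0])   # first learned merge is ('w', 'e')
print(segmented)
```

Because the merges are learned from symbol statistics rather than from a lexicon, the resulting subwords (e.g. a shared "we" unit) need not correspond to linguistic morphemes, which is exactly the trade-off mentioned above.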

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jindřich