What languages are supported for nltk.word_tokenize and nltk.pos_tag?
I need to perform named entity extraction on text in multiple languages: Spanish, Portuguese, Greek, Czech, and Chinese.
Is there a list somewhere of all the languages supported by these two functions? And is there a way to use other corpora so that these languages can be included?
Solution 1:[1]
The list of the languages supported by the NLTK tokenizer is as follows:
- 'czech'
- 'danish'
- 'dutch'
- 'english'
- 'estonian'
- 'finnish'
- 'french'
- 'german'
- 'greek'
- 'italian'
- 'norwegian'
- 'polish'
- 'portuguese'
- 'russian'
- 'slovene'
- 'spanish'
- 'swedish'
- 'turkish'
It corresponds to the pickle files stored in C:\Users\XXX\AppData\Roaming\nltk_data\tokenizers\punkt (on Windows). This is the value you pass as the language keyword argument when tokenizing, e.g.

```python
nltk.word_tokenize(text, language='italian')
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stanislav Koncebovski |
