What languages are supported for nltk.word_tokenize and nltk.pos_tag?

I need to perform named entity extraction on text in multiple languages: Spanish, Portuguese, Greek, Czech, and Chinese.

Is there a list somewhere of all the languages supported by these two functions? And is there a way to use other corpora so that these languages can be included?



Solution 1:[1]

The list of the languages supported by the NLTK tokenizer is as follows:

  • 'czech'
  • 'danish'
  • 'dutch'
  • 'english'
  • 'estonian'
  • 'finnish'
  • 'french'
  • 'german'
  • 'greek'
  • 'italian'
  • 'norwegian'
  • 'polish'
  • 'portuguese'
  • 'russian'
  • 'slovene'
  • 'spanish'
  • 'swedish'
  • 'turkish'

This list corresponds to the Punkt pickles stored in C:\Users\XXX\AppData\Roaming\nltk_data\tokenizers\punkt (on Windows). These names are what you pass as the 'language' keyword argument when tokenizing, e.g.

nltk.word_tokenize(text, language='italian')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Stanislav Koncebovski