What languages are supported for nltk.word_tokenize and nltk.pos_tag?

I need to perform named entity extraction on text in multiple languages: Spanish, Portuguese, Greek, Czech, and Chinese.

Is there a list somewhere of all the languages supported by these two functions? And is there a way to use other corpora so that these languages can be included?



Solution 1:[1]

The list of the languages supported by the NLTK tokenizer is as follows:

  • 'czech'
  • 'danish'
  • 'dutch'
  • 'english'
  • 'estonian'
  • 'finnish'
  • 'french'
  • 'german'
  • 'greek'
  • 'italian'
  • 'norwegian'
  • 'polish'
  • 'portuguese'
  • 'russian'
  • 'slovene'
  • 'spanish'
  • 'swedish'
  • 'turkish'

This list corresponds to the Punkt pickles stored in C:\Users\XXX\AppData\Roaming\nltk_data\tokenizers\punkt (on Windows). These names are what you pass as the 'language' keyword argument when tokenizing, e.g.

nltk.word_tokenize(text, language='italian')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Stanislav Koncebovski