'How to extract out unique words and there pos tags in separate columns while working with Dataset

I am working through Indonesian Data to use data for NER and as I get to know, there is no pretrained NLTK model to help for this language. So, to do this manually I tried to extract all the unique words used in the entire data frame, I still don't know how to apply tags to the words but this is what I did so far.

please let me know if there is any other convenient way to do this, what I did in the following codes. also, let me know how to add tags to each row(if possible) and how to do NER for this.

(I am new to coding that's why I don't know how to ask, but I am trying my best to provide as much information as possible.)

Solution 1:^[1]

Depending on what you want to do if results is all that matters you could use a pretrained transfomer model from huggingface instead of NLTK. This will be more computionally heavy but also give you a better performance.

There is one fitting model I could find (I don't speak Indonesian obviously, so excuse eventual errors in the sample sentence):

https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.

The easiest way to use this would probably be either the API or using an inference-only pipeline, check out this guide, all you would have to do to get this running for the Indonesian model is to replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.

Note that this Indonesian model is quite large, so you need to have some decent hardware. If you don't you could alternatively use some (free) cloud computing service such as Google Colab.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	ewz93

'How to extract out unique words and there pos tags in separate columns while working with Dataset

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]