Loading a DataFrame into Rubrix for token classification (NER)
I am learning Rubrix and Docker and working my way through the Rubrix manual (https://rubrix.readthedocs.io/_/downloads/en/master/pdf/).
Using the example below as a starting point, I would like to load a pandas DataFrame column in place of the input_text string, so that each row becomes a record. Has anyone done this before? What I have tried so far has been unsuccessful.
import rubrix as rb
import spacy
input_text = "Paris a un enfant et la foret a un oiseau ; l’oiseau s’appelle le moineau ; l’enfant s’appelle le gamin"
# Loading spaCy model
nlp = spacy.load("fr_core_news_sm")
# Creating spaCy doc
doc = nlp(input_text)
# Creating the prediction entity as a list of tuples (entity, start_char, end_char)
prediction = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in doc],
    prediction=prediction,
    prediction_agent="spacy.fr_core_news_sm",
)
# Logging into Rubrix
rb.log(records=record, name="lesmiserables-ner")
Solution 1:[1]
For a pandas DataFrame named dataset with a text column called text and an integer column called id, you can iterate over the rows and extract the entities for each one:
# Assumes rb (rubrix) and nlp (the loaded spaCy model) are defined as in the question
records = []
for idx in dataset.index:
    text = dataset['text'][idx]
    # .tolist() turns the numpy int64 into a plain Python int (int64 is not allowed here)
    text_id = dataset['id'][idx].tolist()
    # spaCy Doc creation
    doc = nlp(text)
    # Entity annotations as (label, start_char, end_char) tuples
    entities = [
        (ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]
    # Pre-tokenized input text
    tokens = [token.text for token in doc]
    # Append one TokenClassificationRecord per row
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            metadata={'id': text_id},
            prediction=entities,
            prediction_agent="fr_core_news_sm"
        )
    )
# Log all records in a single call
rb.log(records=records, name="example_data")
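As a possible variation on the loop above, the rows can be passed through spaCy's nlp.pipe, which processes an iterable of texts and yields the Doc objects in input order. The sketch below is only an illustration: build_records and its parameter names are hypothetical, and the record class is passed in as an argument (rb.TokenClassificationRecord in practice) so that the function itself does not depend on a running Rubrix instance.

```python
def build_records(ids, texts, nlp, record_cls, agent="fr_core_news_sm"):
    """Build one token classification record per (id, text) pair.

    ids        -- plain Python ints (hypothetical parameter; not numpy int64)
    texts      -- the strings to annotate, in the same order as ids
    nlp        -- a loaded spaCy pipeline
    record_cls -- the record class to instantiate,
                  assumed here to be rb.TokenClassificationRecord
    """
    records = []
    # nlp.pipe streams the texts through the pipeline and keeps input order,
    # so each Doc can be paired with its id via zip
    for row_id, doc in zip(ids, nlp.pipe(texts)):
        entities = [
            (ent.label_, ent.start_char, ent.end_char)
            for ent in doc.ents
        ]
        records.append(
            record_cls(
                text=doc.text,
                tokens=[token.text for token in doc],
                metadata={"id": row_id},
                prediction=entities,
                prediction_agent=agent,
            )
        )
    return records
```

With the DataFrame from the answer, a call might look like build_records(dataset['id'].tolist(), dataset['text'].tolist(), nlp, rb.TokenClassificationRecord), followed by rb.log(records=..., name="example_data"). Calling .tolist() on a whole column returns plain Python values, so the int64 conversion no longer has to be done row by row.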
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | kabr |
