Loading a df into rubrix for token classification (NER)

I am in the process of learning Rubrix and Docker, working my way through the Rubrix manual (https://rubrix.readthedocs.io/_/downloads/en/master/pdf/).

Using the example below as a starting point, I would like to load a DataFrame column in place of the input_text string, with each row becoming its own record. What I have tried so far has been unsuccessful; has anyone done this before?

import rubrix as rb
import spacy

input_text = "Paris a un enfant et la foret a un oiseau ; l’oiseau s’appelle le moineau ; l’enfant s’appelle le gamin"

# Loading spaCy model
nlp = spacy.load("fr_core_news_sm")

# Creating spaCy doc
doc = nlp(input_text)

# Creating the prediction entity as a list of tuples (entity, start_char, end_char)
prediction = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in doc],
    prediction=prediction,
    prediction_agent="spacy.fr_core_news_sm",
)

# Logging into Rubrix
rb.log(records=record, name="lesmiserables-ner")
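To go from a single string to many rows, the texts can first be collected in a pandas DataFrame. The toy df below (the column names text and id are just an example) has the shape assumed by the solution that follows:

```python
import pandas as pd

# Toy DataFrame standing in for the real data: one text per row,
# plus an integer id that can later be attached as record metadata.
df = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "Paris a un enfant et la foret a un oiseau",
        "l'oiseau s'appelle le moineau",
    ],
})

texts = df["text"].tolist()  # each entry will become one record
```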



Solution 1:

For a pandas DataFrame named dataset, with a text column and an integer id column, you can iterate over the rows and extract the entities for each one:

import rubrix as rb
import spacy

nlp = spacy.load("fr_core_news_sm")

records = []

for record in dataset.index:

    text = dataset['text'][record]

    text_id = int(dataset['id'][record])  # metadata must be JSON-serializable, not numpy int64

    # spaCy Doc creation
    doc = nlp(text)
    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            metadata={'id': text_id}, 
            prediction=entities,
            prediction_agent="fr_core_news_sm"
        )
    )

rb.log(records=records, name="example_data")
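The per-row pattern above is independent of Rubrix and spaCy: each row yields a text, a list of (label, start, end) entity tuples, and some metadata. Below is a minimal, library-free sketch of that loop, where extract_entities and build_records are illustrative stand-ins (not part of either API) and the "NER" simply tags capitalized words:

```python
def extract_entities(text):
    # Stand-in for the spaCy NER call: tag each capitalized word as PER,
    # mirroring the (label, start_char, end_char) tuple shape used above.
    entities = []
    for word in text.split():
        if word[0].isupper():
            start = text.index(word)  # first occurrence is enough for this demo
            entities.append(("PER", start, start + len(word)))
    return entities

def build_records(rows):
    # rows: iterable of (id, text) pairs, e.g. df[['id', 'text']].itertuples(index=False)
    records = []
    for text_id, text in rows:
        records.append({
            "text": text,
            "tokens": text.split(),          # naive whitespace tokenization
            "metadata": {"id": int(text_id)},  # plain int, not numpy int64
            "prediction": extract_entities(text),
        })
    return records

rows = [(1, "Paris a un enfant"), (2, "la foret a un oiseau")]
records = build_records(rows)
```

Swapping the dict for rb.TokenClassificationRecord and extract_entities for the spaCy doc, as in the solution above, gives the real pipeline.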

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

Solution 1: kabr