'Text Classification on a custom dataset with spacy v3

I am really struggling to make things work with the new spacy v3 version. The documentation is full. However, I am trying to run a training loop in a script.

(I am also not able to perform text classification training with CLI approach).

Data are publically available here.


import pandas as pd
from spacy.training import Example
import random

TRAIN_DATA = pd.read_json('data.jsonl', lines = True)

nlp = spacy.load('en_core_web_sm')

config = {
    "threshold": 0.5,
}

textcat = nlp.add_pipe("textcat", config=config, last=True)

label = TRAIN_DATA['label'].unique()

for label in label:
    textcat.add_label(str(label))

nlp = spacy.blank("en")
nlp.begin_training()


# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    losses = {}
    
    TRAIN_DATA = TRAIN_DATA.sample(frac = 1)

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAIN_DATA.values, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]

        # uses an example object rather than text/annotation tuple
        print(texts)
        print(annotations)
        examples = [Example.from_dict(a)]
        
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source