spaCy: train NER using multiprocessing

I am trying to train a custom NER model using spaCy. Currently I have more than 2,000 records for training; each text is over 100 words long and contains at least two entities. I am running training for 50 iterations, and it takes more than 2 hours to complete.

Is there any way to train using multiprocessing? Will it improve the training time?



Solution 1:[1]

Short answer... probably not

It's very unlikely that you will be able to get this to work for a few reasons:

  • The network being trained is performing iterative optimization
    • Without knowing the results from the batch before, the next batch cannot be optimized
  • There is only a single network
    • Any parallel training would be creating divergent networks...
    • ...which you would then somehow have to merge

Long answer... there's plenty you can do!

That said, there are a few different things you can try:

  • Get GPU training working if you haven't
    • It's a pain to set up, but it can speed up training time a bit
    • It will also dramatically lower CPU usage
    • See the GPU sketch after this list
  • Try to use the spaCy command-line tools
    • The JSON format is a pain to produce, but...
    • ...the benefit is you get a well-optimised algorithm written by the experts
    • It can give dramatically faster / better results than hand-crafted methods (see the CLI sketch after this list)
  • If you have different entities, you can train multiple specialised networks
    • Each of these may train faster
    • These networks could be trained in parallel to each other (CPU permitting); see the multiprocessing sketch after this list
  • Optimise your Python and experiment with parameters
    • Speed and quality are very dependent on parameter tweaking (batch size, number of iterations, etc.)
    • Make sure the Python code that provides the batches is top notch
  • Pre-process your examples
    • spaCy NER extraction requires a surprisingly small amount of context to work
    • You could try pre-processing your snippets to contain only 10 or 15 surrounding words and see how your time and accuracy fare; a sketch follows below
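
For the GPU point, here's a minimal sketch of asking spaCy to use a GPU when one is available (this assumes a CUDA-capable card and the matching CuPy / `spacy[cuda*]` install):

```python
# Ask spaCy to use the GPU if one is available; falls back to CPU otherwise.
# Call this before creating or loading any pipeline.
import spacy

using_gpu = spacy.prefer_gpu()  # returns True if a GPU was activated
print("Training on GPU" if using_gpu else "Training on CPU")
```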
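For the command-line route: note that in spaCy v3 the old JSON training format has been replaced by the binary .spacy format plus a config file. Here's a hedged sketch of driving the CLI from Python; `config.cfg`, `train.spacy` and `dev.spacy` are assumed file names (see Solution 2 for producing the .spacy files):

```python
# Sketch: drive spaCy v3's optimised CLI training from Python.
import subprocess
import sys

# Generate a baseline config for an English NER-only pipeline
subprocess.run(
    [sys.executable, "-m", "spacy", "init", "config", "config.cfg",
     "--lang", "en", "--pipeline", "ner"],
    check=True,
)

# Run the optimised training loop; writes model-best / model-last to ./output
subprocess.run(
    [sys.executable, "-m", "spacy", "train", "config.cfg",
     "--output", "./output",
     "--paths.train", "train.spacy",
     "--paths.dev", "dev.spacy"],
    check=True,
)
```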
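For the specialised-networks idea, here's a hypothetical sketch of training one single-label model per entity type in parallel OS processes; the toy `grouped_records` layout is an assumption for illustration:

```python
# Hypothetical sketch: train one specialised NER model per entity label,
# with the training runs executing in parallel processes (CPU permitting).
import multiprocessing as mp

import spacy
from spacy.training import Example

def train_one(label, records, out_dir, n_iter=50):
    """Train a blank English pipeline on examples for a single label."""
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label(label)
    optimizer = nlp.initialize()
    for _ in range(n_iter):
        for text, annotations in records:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer)
    nlp.to_disk(out_dir)

if __name__ == "__main__":
    # Assumed layout: one record list per entity label
    grouped_records = {
        "ORG": [("Apple hired ten engineers.", {"entities": [(0, 5, "ORG")]})],
        "PERSON": [("John Smith joined in 2020.", {"entities": [(0, 10, "PERSON")]})],
    }
    jobs = [
        mp.Process(target=train_one, args=(label, recs, f"model_{label}"))
        for label, recs in grouped_records.items()
    ]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
```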
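And for the pre-processing idea, a hypothetical sketch that trims each text to a window of words around each entity and shifts the character offsets to match. The `(start, end, label)` annotation layout is an assumption, and note it emits one snippet per entity:

```python
# Hypothetical sketch: shrink each example to ~15 words of context around
# each entity, shifting the character offsets to match the shorter text.
def trim_context(text, entities, window=15):
    """entities: list of (start_char, end_char, label) tuples."""
    words = text.split()
    starts, pos = [], 0
    for word in words:               # character offset of each word
        pos = text.index(word, pos)
        starts.append(pos)
        pos += len(word)
    snippets = []
    for start, end, label in entities:
        first = max(i for i, s in enumerate(starts) if s <= start)
        last = max(i for i, s in enumerate(starts) if s < end)
        lo = max(0, first - window)
        hi = min(len(words) - 1, last + window)
        snip_start = starts[lo]
        snip_end = starts[hi] + len(words[hi])
        snippets.append((
            text[snip_start:snip_end],  # the trimmed snippet
            [(start - snip_start, end - snip_start, label)],
        ))
    return snippets
```

Each returned (snippet, entities) pair can then be fed to training in place of the full record, so you can compare timing and accuracy directly.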

Final thoughts... when is your network "done"?

I have trained networks with many entities on thousands of examples for longer than you describe, and the long and short of it is: sometimes training just takes time.

However, 90% of the increase in performance is captured in the first 10% of training.

  • Do you need to wait for all 50 iterations?
  • ...or are you looking for a specific level of performance?

If you monitor the quality every X iterations, you can bail out when you hit a pre-defined level of quality, as in the sketch below.
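
Here is a minimal sketch of that idea, assuming a hand-written spaCy v3 training loop; the toy data, target score and evaluation interval are all stand-ins for your own:

```python
# Sketch: evaluate every few iterations and stop once a target F-score
# on held-out data is reached, instead of always running 50 iterations.
import random

import spacy
from spacy.training import Example

TRAIN_DATA = [  # toy stand-in for your 2k+ annotated records
    ("Apple hired John Smith in 2020.",
     {"entities": [(0, 5, "ORG"), (12, 22, "PERSON")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("ORG", "PERSON"):
    ner.add_label(label)
optimizer = nlp.initialize()

examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in TRAIN_DATA]
dev_examples = examples  # use a real held-out split in practice

TARGET_F = 0.85   # pre-defined quality bar (assumed)
EVAL_EVERY = 5    # check the dev score every 5 iterations

for i in range(50):
    random.shuffle(examples)
    for example in examples:
        nlp.update([example], sgd=optimizer)
    if (i + 1) % EVAL_EVERY == 0:
        scores = nlp.evaluate(dev_examples)
        print(f"iteration {i + 1}: ents_f={scores['ents_f']:.3f}")
        if scores["ents_f"] >= TARGET_F:
            break  # good enough -- bail out early
```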

You can also keep old networks you have trained on previous batches and then "top them up" with new training (see the sketch below) to reach a level of performance you couldn't reach in the same time by starting from scratch.
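
A minimal sketch of that top-up idea, assuming an earlier run was saved to a hypothetical model_v1 directory:

```python
# Sketch: reload a previously trained pipeline and continue training it
# on fresh data, rather than starting from a blank model.
import spacy
from spacy.training import Example

nlp = spacy.load("model_v1")        # hypothetical path from an earlier run
optimizer = nlp.resume_training()   # keep the existing weights

new_records = [  # assumed stand-in for your newly annotated data
    ("Acme Corp opened an office.", {"entities": [(0, 9, "ORG")]}),
]
for text, annotations in new_records:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    nlp.update([example], sgd=optimizer)

nlp.to_disk("model_v2")
```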

Good luck!

Solution 2:[2]

Hi, I did a similar project where I created a custom NER model using spaCy 3 and extracted 26 entity types from a large dataset. It really depends on how you are passing your data. Follow the steps below; they might make training workable on CPU:

  1. Annotate your text files and save the annotations as JSON.

  2. Convert your JSON files into the .spacy format, since this is the format spaCy's training pipeline accepts.

  3. Now, the point to note is how you pass and serialize your data into spaCy Doc objects when producing the .spacy files.

Passing all of your JSON text at once will make training take longer. Split your data and pass it in chunks, iterating over them; don't pass one consolidated file. A sketch follows.
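
Here is a minimal sketch of steps 2 and 3 with chunked output; the annotations.json file name and the record layout inside it are assumptions:

```python
# Sketch: convert JSON annotations to the binary .spacy format in chunks,
# instead of serializing one consolidated file.
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

with open("annotations.json", encoding="utf8") as f:
    # assumed layout: [[text, {"entities": [[start, end, label], ...]}], ...]
    records = json.load(f)

CHUNK = 500  # records per output file
for n, i in enumerate(range(0, len(records), CHUNK)):
    db = DocBin()
    for text, ann in records[i:i + CHUNK]:
        doc = nlp.make_doc(text)
        spans = [
            doc.char_span(start, end, label=label, alignment_mode="contract")
            for start, end, label in ann["entities"]
        ]
        doc.ents = [span for span in spans if span is not None]
        db.add(doc)
    db.to_disk(f"train_{n}.spacy")
```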

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Jon Betts
Solution 2: Jeremy Caney