What's the most efficient way to load spaCy language models when working with multiple languages?

I'm working on a project extracting entities from tweets, where any individual tweet could be in one of a number of languages. I've written a function that takes the tweet's language code from Twitter as an argument and returns the name of the appropriate spaCy model.

def getSpacyModel(langCode):
    switcher = {
        'da': "da_core_news_lg",
        'de': "de_core_news_lg",
        'en': "en_core_web_trf",
        'es': "es_core_news_md",
        'fr': "fr_core_news_lg",
        'ja': "ja_core_news_lg",
        'nb': "nb_core_news_lg",
        'nl': "nl_core_news_lg",
        'pl': "pl_core_news_md",
        'pt': "pt_core_news_md",
        'ro': "ro_core_news_md",
        'ru': "ru_core_news_sm",
        'tr': "", # Turkish must be built from scratch
        'zh': "zh_core_web_trf"
    }
    return switcher.get(langCode, "Unknown") # note: spacy.load("Unknown") will raise OSError

The idea is that I'd then do something like this:

model = getSpacyModel(tweet.lang) # Generate model name based on language
nlp = spacy.load(model) # Load the appropriate spaCy model
text = tweet.full_text # Grab the text of the tweet
doc = nlp(text) # Transform the text into a spaCy doc
ent_dict = get_ent_dict(doc) # User defined function to create a dictionary of entities

In the past, I've processed cursored tweets in a simple for loop:

for tweet in tweets:
    model = getSpacyModel(tweet.lang) # Generate model name based on language
    nlp = spacy.load(model) # Load the appropriate spaCy model
    text = tweet.full_text # Grab the text of the tweet
    doc = nlp(text) # Transform the text into a spaCy doc
    ent_dict = get_ent_dict(doc) # User defined function to create a dictionary of entities
    # Do whatever I need to do with the entities afterward

But this seems like it will be a problem for this project, since I'd be loading a new model for every single tweet while processing batches of up to 200 tweets at once. These models range from roughly 40 MB to 500 MB depending on the language, so reloading them could slow things down considerably.

I thought about sorting the batch so that, for example, all English tweets are processed first, then all Spanish tweets, and so on. At only 200 tweets per batch (though batches will be pulled several times a day), the extra sorting loop probably won't add much time. But I wondered whether there's a more elegant way to handle this problem.
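For what it's worth, the grouping idea can be sketched without sorting at all: bucket the batch by language code, then load each language's model once per batch. The `batch` of plain dicts below is a stand-in for the real Tweepy objects (which would use `tweet.lang` / `tweet.full_text`), and the spaCy calls are left as comments so the grouping logic stands on its own:

```python
from collections import defaultdict

def group_by_lang(tweets):
    """Bucket a batch of tweets by language code so each
    spaCy model only needs to be loaded once per batch."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tweet["lang"]].append(tweet)
    return dict(groups)

# Stand-in batch; real code would read tweet.lang / tweet.full_text
batch = [
    {"lang": "en", "full_text": "Apple is buying a U.K. startup."},
    {"lang": "fr", "full_text": "Apple rachète une startup."},
    {"lang": "en", "full_text": "Another English tweet."},
]

for lang, group in group_by_lang(batch).items():
    # nlp = spacy.load(getSpacyModel(lang))  # one load per language
    for tweet in group:
        pass  # doc = nlp(tweet["full_text"]); ent_dict = get_ent_dict(doc)
```

This keeps the per-batch load count at one per language rather than one per tweet, without needing an explicit sort.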



Solution 1:[1]

It's quite late to answer this, but you can load multiple models at the same time using module imports. More information can be found in the spaCy documentation.


Here is my code.

import en_core_web_sm as en_model
import fr_core_news_sm as fr_model

nlp_en = en_model.load()
nlp_fr = fr_model.load()

doc_en = nlp_en("Apple is looking at buying U.K. startup for $1 billion")
doc_fr = nlp_fr("Apple envisage de racheter une startup Britannique pour 1 milliard de dollars")
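Building on this, a per-process cache also works when the set of languages isn't known up front: `functools.lru_cache` memoizes the loader, so each model is loaded at most once no matter how many tweets request it. The sketch below uses a stand-in loader body to make the caching behavior visible; in real code the function would simply `return spacy.load(model_name)`:

```python
import functools

load_count = {}  # tracks how many times each model is actually loaded

@functools.lru_cache(maxsize=None)
def get_nlp(model_name):
    """Return a spaCy pipeline, loading it at most once per process.
    Real code would `return spacy.load(model_name)`; this stand-in
    just records the load so the caching is observable."""
    load_count[model_name] = load_count.get(model_name, 0) + 1
    return f"<pipeline {model_name}>"  # stand-in for the nlp object

# Repeated requests for the same language hit the cache
for model in ["en_core_web_trf", "fr_core_news_lg", "en_core_web_trf"]:
    nlp = get_nlp(model)

print(load_count)  # → {'en_core_web_trf': 1, 'fr_core_news_lg': 1}
```

Unlike the explicit imports above, this scales to a dozen languages without a dozen `import`/`load` pairs, at the cost of keeping every touched model in memory for the life of the process.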

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Tai Le