'How to mock spacy models / Doc objects for unit tests?

Loading spacy models slows down running my unit tests. Is there a way to mock spacy models or Doc objects to speed up unit tests?

Example of a current slow tests

import spacy
nlp = spacy.load("en_core_web_sm")

def test_entities():
    text = u"Google is a company."
    doc = nlp(text)
    assert doc.ents[0].text == u"Google"

Based on the docs my approach is

Constructing the Vocab and Doc manually and setting the entities as tuples.

from spacy.vocab import Vocab
from spacy.tokens import Doc

def test()
    alphanum_words = u"Google Facebook are companies".split(" ")
    labels = [u"ORG"]
    words = alphanum_words + [u"."]
    spaces = len(words) * [True]
    spaces[-1] = False
    spaces[-2] = False
    vocab = Vocab(strings=(alphanum_words + labels))
    doc = Doc(vocab, words=words, spaces=spaces)

    def get_hash(text):
        return vocab.strings[text]

    entity_tuples = tuple([(get_hash(labels[0]), 0, 1)])
    doc.ents = entity_tuples
    assert doc.ents[0].text == u"Google"

Is there a cleaner more Pythonic solution for mocking spacy objects for unit tests for entities?



Solution 1:[1]

This is a great question actually! I'd say your instinct is definitely right: If all you need is a Doc object in a given state and with given annotations, always create it manually wherever possible. And unless you're explicitly testing a statistical model, avoid loading it in your unit tests. It makes the tests slow, and it introduces too much unnecessary variance. This is also very much in line with the philosophy of unit testing: you want to be writing independent tests for one thing at a time (not one thing plus a bunch of third-party library code plus a statistical model).

Some general tips and ideas:

  • If possible, always construct a Doc manually. Avoid loading models or Language subclasses.
  • Unless your application or test specifically needs the doc.text, you do not have to set the spaces. In fact, I leave this out in about 80% of the tests I write, because it really only becomes relevant when you're putting the tokens back together.
  • If you need to create a lot of Doc objects in your test suite, you could consider using a utility function, similar to the get_doc helper we use in the spaCy test suite. (That function also shows you how the individual annotations are set manually, in case you need it.)
  • Use (session-scoped) fixtures for the shared objects, like the Vocab. Depending on what you're testing, you might want to explicitly use the English vocab. In the spaCy test suite, we do this by setting up an en_vocab fixture in the conftest.py.
  • Instead of setting the doc.ents to a list of tuples, you can also make it a list of Span objects. This looks a bit more straightforward, is easier to read, and in spaCy v2.1+, you can also pass a string as a label:
def test_entities(en_vocab):
    doc = Doc(en_vocab, words=["Hello", "world"])
    doc.ents = [Span(doc, 0, 1, label="ORG")]
    assert doc.ents[0].text == "Hello"
  • If you do need to test a model (e.g. in the test suite that makes sure that your custom models load and run as expected) or a language class like English, put them in a session-scoped fixture. This means that they'll only be loaded once per session instead of once per test. Language classes are lazy-loaded and may also take some time to load, depending on the data they contain. So you only want to do this once.
# Note: You probably don't have to do any of this, unless you're testing your
# own custom models or language classes.

@pytest.fixture(scope="session")
def en_core_web_sm():
    return spacy.load("en_core_web_sm")

@pytest.fixture(scope="session")
def en_lang_class():
    lang_cls = spacy.util.get_lang_class("en")
    return lang_cls()

def test(en_lang_class):
    doc = en_lang_class("Hello world")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pypae