Thread safety on nested defaultdict

Using

import threading
from collections import defaultdict

Assume the following data structure:

amazing_dict = defaultdict(lambda: defaultdict(amazing_dict.default_factory))

which can also be written as:

generate_amazing_dict = lambda: defaultdict(generate_amazing_dict)
amazing_dict = generate_amazing_dict()
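As a minimal single-threaded sketch of how this structure behaves: missing keys are created on access at any depth. The helper to_plain is a hypothetical name added here only to display the result as plain dicts.

```python
from collections import defaultdict

# Each missing key produces another defaultdict with the same factory.
generate_amazing_dict = lambda: defaultdict(generate_amazing_dict)
amazing_dict = generate_amazing_dict()

amazing_dict["a"]["x"] = 0
amazing_dict["a"]["y"] = 0
amazing_dict["b"]["c"]["d"] = 1  # intermediate levels are created on access


def to_plain(d):
    """Recursively convert nested defaultdicts to plain dicts (display only)."""
    return {k: to_plain(v) if isinstance(v, defaultdict) else v
            for k, v in d.items()}


plain = to_plain(amazing_dict)
print(plain)  # {'a': {'x': 0, 'y': 0}, 'b': {'c': {'d': 1}}}
```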

amazing_dict must be thread-safe when new keys are created, meaning that no two threads may generate the default value for the same missing key.

For example, running amazing_dict["a"]["x"] = 0 and amazing_dict["a"]["y"] = 0 on two different threads must always result in amazing_dict == {"a": {"x": 0, "y": 0}}; one thread must never overwrite the amazing_dict["a"] created by the other.

According to this great answer, defaultdict on its own is thread-safe. However, once default_factory is implemented in Python code (as in amazing_dict), a thread switch can occur after the key is found missing but before the factory's result is stored.
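The window can be made visible with a small sketch. This is an illustration only, not code from the linked answer: slow_factory, worker, and the use of threading.Barrier are assumptions chosen to force both threads into the factory at the same time, which a plain defaultdict does not prevent.

```python
import threading
from collections import defaultdict

# Hypothetical setup: the barrier releases only when both threads are
# inside the factory, so the overlap happens on every run.
barrier = threading.Barrier(2, timeout=5)
calls = []


def slow_factory():
    barrier.wait()          # both threads are now inside the factory
    value = {}
    calls.append(value)
    return value


shared = defaultdict(slow_factory)
results = {}


def worker(name, inner_key):
    inner = shared["a"]     # key is missing, so the factory is called
    inner[inner_key] = 0
    results[name] = inner


threads = [
    threading.Thread(target=worker, args=("t1", "x")),
    threading.Thread(target=worker, args=("t2", "y")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls))                       # number of factory invocations
print(results["t1"] is results["t2"])   # did both threads get the same dict?
print(dict(shared))                     # only one of the two writes survives
```

In this setup the factory runs twice, each thread receives its own inner dict, and whichever is stored last replaces the other, which is exactly the overwrite described above.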

I want to make sure the call to the factory is always made under a lock. I have come up with two possible implementations that might make the factory call thread-safe.

Option A - override the __missing__ method of defaultdict so that it runs under a lock and checks again whether the key exists in self before calling the factory.

class threadsafe_defaultdict(defaultdict):

    def __init__(self, default_factory=None, **kwargs) -> None:
        # Forward keyword items to defaultdict as initial entries.
        super().__init__(default_factory, **kwargs)
        self._missing_lock = threading.Lock()

    def __missing__(self, key):
        with self._missing_lock:
            # Another thread may have created the key while we waited for the lock.
            if key in self:
                return self[key]
            return super().__missing__(key)

Option B - override the __getitem__ method of dict: if the key already exists, call the super method directly; otherwise, call the super method under the lock.

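class threadsafe_defaultdict(defaultdict):

    def __init__(self, default_factory=None, **kwargs) -> None:
        # Forward keyword items to defaultdict as initial entries.
        super().__init__(default_factory, **kwargs)
        self._missing_lock = threading.Lock()

    def __getitem__(self, key):
        # Fast path: the key already exists, so no lock is taken.
        if key in self:
            return super().__getitem__(key)
        # Slow path: a thread that waited here finds the key already present.
        with self._missing_lock:
            return super().__getitem__(key)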

In both cases amazing_dict would be generated as follows:

generate_amazing_dict = lambda: threadsafe_defaultdict(generate_amazing_dict)
amazing_dict = generate_amazing_dict()
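One way to compare the two options empirically is a small harness that counts how often the factory runs when several threads read the same missing key. This is a rough sketch under stated assumptions, not a proof of correctness: OptionA and OptionB are hypothetical names for the two classes above, and count_factory_calls and slow_factory are helpers invented for the test. It exercises only concurrent reads of a single key and does not cover other access paths such as get or setdefault.

```python
import threading
import time
from collections import defaultdict


class OptionA(defaultdict):
    """Option A: lock inside __missing__ and re-check the key."""

    def __init__(self, default_factory=None, **kwargs):
        super().__init__(default_factory, **kwargs)
        self._missing_lock = threading.Lock()

    def __missing__(self, key):
        with self._missing_lock:
            if key in self:
                return self[key]
            return super().__missing__(key)


class OptionB(defaultdict):
    """Option B: take the lock in __getitem__ when the key is absent."""

    def __init__(self, default_factory=None, **kwargs):
        super().__init__(default_factory, **kwargs)
        self._missing_lock = threading.Lock()

    def __getitem__(self, key):
        if key in self:
            return super().__getitem__(key)
        with self._missing_lock:
            return super().__getitem__(key)


def count_factory_calls(cls, n_threads=8):
    """Return how often the factory runs when n_threads read one missing key."""
    calls = []

    def slow_factory():
        calls.append(1)     # relies on list.append being safe under CPython's GIL
        time.sleep(0.05)    # widen the window in which other threads arrive
        return {}

    shared = cls(slow_factory)
    threads = [threading.Thread(target=lambda: shared["a"])
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)


calls_a = count_factory_calls(OptionA)
calls_b = count_factory_calls(OptionB)
print(calls_a, calls_b)
```

In this scenario both options call the factory once per missing key. Each instance holds its own threading.Lock, so creating a nested instance inside the factory does not try to re-acquire the parent's lock.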

Which of the two implementations is more correct, if either? Further suggestions on achieving thread safety for this case are also welcome.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
