'What's the best practice for Python multiprocessing to write to a shared dictionary?

I have millions of NumPy arrays that I wish to process, ultimately generating a collective histogram plotting the frequencies of seen integers. I wish to utilize parallelism to process this as quick as possible. After some reading, it seems that most people advocate multiprocessing with multiprocessing.Manager().dict() to do this. I have come up with the following code that works:

import pickle
import random
import numpy as np
import multiprocessing as mp


# XXX: Placeholder.
id_to_seq = {}
for i in range(100):
    id_to_seq[i] = np.random.randint(100, size=random.randint(1, 100))


def process_seq(id, num_frequency):
    seq = id_to_seq[id]
    for num in seq:
        if num in num_frequency:
            num_frequency[num] += 1
        else:
            num_frequency[num] = 1


if __name__ == '__main__':
    manager = mp.Manager()
    num_frequency = manager.dict()

    pool = mp.Pool(mp.cpu_count())
    for id in id_to_seq.keys():  # Execute in parallel.
        pool.apply_async(process_seq, args=(id, num_frequency))
    pool.close()
    pool.join()    

    with open('num_frequency.pkl', 'wb') as handle:
        pickle.dump(dict(num_frequency), handle, protocol=pickle.HIGHEST_PROTOCOL)

However, given how often I'm writing to my num_frequency dict, I wonder whether this is the best, quick practice. I'm worried about the overhead of sharing and writing to this same dict. (In fact, it seems that Manger().dict() doesn't even physically share the same memory, but semantically shares the memory through signaling changes to other copies.) Would someone please advise me the best multiprocessing practice that reduces overhead and runtime?

python

Solution 1:^[1]

The usual, simple approach is to make a local counting dictionary for each process and then merge them when each process exits. You might choose to reduce for each call to process_seq instead for simplicity if the number of such calls is much smaller than the total number of elements to count.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Davis Herring

'What's the best practice for Python multiprocessing to write to a shared dictionary?

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]