Same output in different workers in multiprocessing
I have very simple cases where the work to be done can be broken up and distributed among workers. I tried a very simple multiprocessing example from here:
```python
import multiprocessing
import numpy as np
import time

def do_calculation(data):
    rand = np.random.randint(10)
    print(data, rand)
    time.sleep(rand)
    return data * 2

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = list(range(10))
    print('Input :', inputs)
    pool_outputs = pool.map(do_calculation, inputs)
    print('Pool :', pool_outputs)
```
The above program produces the following output:

```
Input : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0 7
1 7
2 7
5 7
3 7
4 7
6 7
7 7
8 6
9 6
Pool : [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```
Why is the same random number getting printed? (I have 4 CPUs on my machine.) Is this the best/simplest way to go ahead?
Solution 1:[1]
I think you'll need to re-seed the random number generator using numpy.random.seed in your do_calculation function.
My guess is that the random number generator (RNG) gets seeded when you import the module. Then, when you use multiprocessing, you fork the current process with the RNG already seeded, so all your processes share the same seed value for the RNG and therefore generate the same sequences of numbers.
e.g.:
```python
def do_calculation(data):
    np.random.seed()  # reseed from OS entropy in each worker
    rand = np.random.randint(10)
    print(data, rand)
    return data * 2
```
Solution 2:[2]
This blog post provides an example of good and bad practice when using numpy.random with multiprocessing. The important thing is to understand when the seed of your pseudo-random number generator (PRNG) is created:
```python
import numpy as np
import pprint
from multiprocessing import Pool

pp = pprint.PrettyPrinter()

def bad_practice(index):
    return np.random.randint(0, 10, size=10)

def good_practice(index):
    return np.random.RandomState().randint(0, 10, size=10)

p = Pool(5)
pp.pprint("Bad practice: ")
pp.pprint(p.map(bad_practice, range(5)))
pp.pprint("Good practice: ")
pp.pprint(p.map(good_practice, range(5)))
```
output:

```
'Bad practice: '
[array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9])]
'Good practice: '
[array([8, 9, 4, 5, 1, 0, 8, 1, 5, 4]),
 array([5, 1, 3, 3, 3, 0, 0, 1, 0, 8]),
 array([1, 9, 9, 9, 2, 9, 4, 3, 2, 1]),
 array([4, 3, 6, 2, 6, 1, 2, 9, 5, 2]),
 array([6, 3, 5, 9, 7, 1, 7, 4, 8, 5])]
```
In the good practice a fresh seed is created on every call, because each RandomState() instance pulls fresh OS entropy when it is constructed, while in the bad practice the seed is created only once, when the numpy.random module is imported, and that state is then inherited by every forked worker process.
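The same point can be seen without multiprocessing at all; a small sketch (not from the answer) showing that each bare RandomState() gets its own independent stream:

```python
import numpy as np

# Each no-argument RandomState() pulls fresh OS entropy when it is
# constructed, which is why building one inside each worker call
# yields distinct streams even though the workers were forked.
a = np.random.RandomState().randint(0, 10, size=10)
b = np.random.RandomState().randint(0, 10, size=10)
print(a)
print(b)  # almost surely a different sequence than a
```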
Solution 3:[3]
Here's what I use (may require newer versions of NumPy):
```python
import numpy as np
from multiprocessing import Pool

entropy = 42
seed_sequence = np.random.SeedSequence(entropy)

number_processes = 5
seeds = seed_sequence.spawn(number_processes)

def good_practice(seed):
    rng = np.random.default_rng(seed)
    return rng.integers(0, 10, size=10)

pool = Pool(number_processes)
print(pool.map(good_practice, seeds))
```
Output:

```
[array([4, 9, 5, 9, 2, 8, 3, 3, 5, 9]),
 array([0, 4, 1, 0, 6, 5, 3, 1, 7, 9]),
 array([7, 0, 7, 7, 1, 0, 1, 3, 9, 6]),
 array([8, 7, 9, 9, 1, 7, 4, 0, 5, 2]),
 array([9, 0, 8, 9, 3, 8, 6, 6, 7, 9])]
```
The NumPy documentation on parallel random number generation was actually fairly helpful.
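One property of SeedSequence worth noting (a sketch, not part of the answer): spawning is deterministic, so re-running with the same entropy reproduces the entire set of per-worker streams, while the children within a run remain mutually independent:

```python
import numpy as np

def spawned_draws(entropy, n):
    # Spawn n child seeds from one parent and draw from each child stream.
    seeds = np.random.SeedSequence(entropy).spawn(n)
    return [np.random.default_rng(s).integers(0, 10, size=8) for s in seeds]

run_a = spawned_draws(42, 3)
run_b = spawned_draws(42, 3)
# run_a and run_b match element-wise: the whole parallel run is
# reproducible, yet the three streams within a run all differ.
```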
Solution 4:[4]
If you just want the legacy np.random generators to be distinct, then you can just pass np.random.seed as the Pool's initializer:
```python
from multiprocessing import Pool
import numpy as np

def foo(_):
    return np.random.random()

with Pool(initializer=np.random.seed) as pool:
    print(pool.map(foo, range(5)))
```
This will cause the random generator to be reseeded in each worker process by pulling fresh entropy from the OS.
If you're running Python 3.7+, you might want to use os.register_at_fork instead:
```python
from os import register_at_fork

register_at_fork(after_in_child=np.random.seed)

with Pool() as pool:
    print(pool.map(foo, range(5)))
```
This has the advantage of working whether or not multiprocessing is doing the forking.
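The "works outside multiprocessing" point can be sketched with a raw os.fork() (Unix only; this example is not from the answer, and the pipe plus 8-byte encoding is just plumbing to get the child's draw back to the parent):

```python
import os
import numpy as np

# Reseed the child's legacy global RNG immediately after every fork.
os.register_at_fork(after_in_child=np.random.seed)

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: the hook already reseeded us, so this draw is independent.
    os.close(r)
    os.write(w, int(np.random.randint(2**31)).to_bytes(8, "little"))
    os._exit(0)
os.close(w)
child_val = int.from_bytes(os.read(r, 8), "little")
os.waitpid(pid, 0)
parent_val = int(np.random.randint(2**31))
# Without the hook, child_val and parent_val would be the same draw
# from the same inherited state; with it, they almost surely differ.
print(parent_val, child_val)
```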
If you care about deterministically seeding worker processes then you likely want to use a SeedSequence as pointed out by @hasManyStupidQuestions. This also has the advantage of using the newer and faster RNGs.
Numpy issue 9650 has even more details.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | t_sic |
| Solution 3 | hasManyStupidQuestions |
| Solution 4 | Sam Mason |
