Create large random boolean matrix with numpy
I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:
import numpy as np

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
But sadly, it does not seem to terminate for this big N. So I tried to split it up and generate the matrix row by row:
N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if i % 100 == 0:
        print(i)
Now something strange happens (at least on my device): the first ~1100 rows are generated very quickly, but after that the code becomes horribly slow. Why is this happening? What am I missing here? Are there better ways to create a big matrix which has True entries with probability p and False entries with probability 1-p?
Edit: Since many of you assumed that RAM would be the problem: the device that will run the code has almost 500GB of RAM, so this won't be an issue.
Solution 1:[1]
The problem is your RAM: the values are stored in memory as the matrix is created. I just created this matrix using this command:
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
I used an AWS i3 instance with 64GB of RAM and 8 cores. To create this matrix, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:
%time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s
def mask_method(N, p):
    mask = np.empty((N, N))
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if i % 100 == 0:
            print(i)
    return mask
%time mask_method(N, p)
CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s
Note that the mask method only takes up ~9GB of RAM at its peak.
Edit: The first method flushes the RAM after the process is done, whereas the function method retains all of it.
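For scale, here is a rough back-of-the-envelope on those peaks (my own arithmetic, based on the float64 intermediate that Solution 2 describes below):

N = 30000
cells = N * N                 # 900,000,000 cells
print(cells * 8 / 1e9)        # 7.2 -> GB held as float64 (intermediates, or an untyped mask)
print(cells * 1 / 1e9)        # 0.9 -> GB for a plain bool array

The one-shot call materializes full-size float64 intermediates next to the result, while the loop holds its float64 mask (np.empty defaults to float64) plus only one row of intermediates at a time, which is roughly in line with the ~20GB vs ~9GB peaks above.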
Solution 2:[2]
> So I tried to split it up and generate the matrix row by row:
The way that np.random.choice works is by first generating a float64 in [0, 1) for every cell of your data, and then converting that into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!
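Roughly, that two-step mechanism looks like this (an illustrative sketch of the behavior just described, not numpy's actual source):

import numpy as np

N, p = 30000, 0.1
a = np.array([False, True])
cdf = np.cumsum([p, 1 - p])              # cumulative probabilities: [p, 1.0]
u = np.random.random_sample((N, N))      # float64 in [0, 1) -- 8x the bool size
mask = a[np.searchsorted(cdf, u, side='right')]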
Since your data is boolean, you can get a factor of two speedup with
np.random.rand(N, N) > p
which, naturally, you could also use inside your looping solution, as sketched below.
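For instance, a minimal sketch of that loop variant (my adaptation of the question's code; note the explicit bool dtype, so the mask itself stays at one byte per cell):

import numpy as np

N = 30000
p = 0.1
mask = np.empty((N, N), dtype=bool)    # 0.9GB as bool instead of 7.2GB as float64
for i in range(N):
    mask[i] = np.random.rand(N) > p    # float64 exists for only one row at a time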
It seems like np.random.choice could do with some buffering here - you might want to file an issue against numpy.
Another option would be to try and generate float32s instead of float64s. I'm not sure if numpy can do that right now, but you could request the feature.
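For what it's worth, newer NumPy releases (1.17+) do expose this through the Generator API; here is a sketch combining it with the comparison trick above (assumes a recent NumPy):

import numpy as np

N = 30000
p = 0.1
rng = np.random.default_rng()

# float32 uniforms halve the intermediate memory relative to float64;
# like the comparison above, this yields True with probability 1-p
mask = rng.random((N, N), dtype=np.float32) > p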
Solution 3:[3]
Another possibility is to generate the matrix in batches (i.e., compute many sub-arrays and stack them together at the very end). But avoid updating a single large array (mask) in a for loop as the OP is doing: this forces the whole array to be kept in main memory during every indexing update.
Instead, for example: to get 30000x30000, keep 90,000 separate 100x100 arrays (300 blocks per side), update each of these 100x100 arrays accordingly in a for loop, and finally stack them together into one giant array at the end. This should need no more than about 4GB of RAM and would be very fast as well.
Minimal Example:
In [8]: a = np.arange(4).reshape(2, 2)

In [9]: a
Out[9]:
array([[0, 1],
       [2, 3]])
In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])

In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
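The minimal example above tiles one fixed block; for the actual task, each block would be drawn independently. Here is a sketch of that assembly (my own illustration: np.block, the comparison trick from Solution 2, and the 100x100 block size are my choices, not from the answer):

import numpy as np

N, B, p = 30000, 100, 0.1
nb = N // B                        # 300 blocks per side -> 90,000 blocks in total

# generate each bool block independently, then assemble once at the end
blocks = [[np.random.rand(B, B) > p for _ in range(nb)]
          for _ in range(nb)]
mask = np.block(blocks)            # final (30000, 30000) bool array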
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
