Compare rows of a numpy array to all rows

I'm trying to compare each row of a numpy array with the whole numpy array without using iteration.

>>> sample = np.array([[1,2,3],[4,5,6]])
>>> sample
array([[1, 2, 3],
       [4, 5, 6]])

First I reshape the 2D-array to a 3D-array:

>>> sample2=sample.reshape(sample.shape[0],1,sample.shape[1])

And then with the following line of code I can compare the rows:

>>> sample2 == sample
array([[[ True,  True,  True],
        [False, False, False]],

       [[False, False, False],
        [ True,  True,  True]]])

...which is the result that I'm looking for.
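For reference, here is a minimal sketch of what the broadcast does on the small example. It uses sample[:, np.newaxis, :] in place of the reshape (the two give the same shape here), and the row_match reduction is an illustrative assumption about how the result might be used, not part of the original code:

```python
import numpy as np

sample = np.array([[1, 2, 3], [4, 5, 6]])

# Insert a length-1 axis in the middle: shape (2, 3) becomes (2, 1, 3).
sample2 = sample[:, np.newaxis, :]

# (2, 1, 3) compared with (2, 3) broadcasts to (2, 2, 3):
# result[i, j, k] is True when sample[i, k] == sample[j, k].
result = sample2 == sample
print(result.shape)

# Hypothetical follow-up step: collapse the last axis so that
# row_match[i, j] is True when row i equals row j in every column.
row_match = result.all(axis=2)
print(row_match)
```

The output has one entry per pair of rows per column, so its size grows with the square of the number of rows.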

But this does not work with large numpy arrays:

>>> sample3 = np.random.randint(low= 0, high = 2, size = 30000000).reshape(30000,1000)
>>> sample4 = sample3.reshape(sample3.shape[0],1,sample3.shape[1])
>>> sample4 == sample3  
<ipython-input-229-e1d55c6bb1ca>:1: DeprecationWarning: elementwise
comparison failed; this will raise an error in the future.
False

How can I solve this?



Solution 1:[1]

This may shed some light on your question. Here is my code sample, based on yours:

import numpy as np
n = 30000000
ny = 1000
sample3 = np.random.randint(low=0, high=2, size=n).reshape(n // ny, ny)
sample4 = sample3.reshape(sample3.shape[0], 1, sample3.shape[1])
print(sample3.shape, sample4.shape)
test = sample4 == sample3
print(test)
test = np.equal(sample4, sample3)
print(test)

Its output is:

(30000, 1000) (30000, 1, 1000)
C:\Users\XYZ\python\code_sample.py:7: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  test = sample4 == sample3
False
Traceback (most recent call last):
  File "code_sample.py", line 9, in <module>
    test = np.equal(sample4, sample3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 838. GiB for an array with shape (30000, 30000, 1000) and data type bool

Also, here are the docs for numpy.equal(), which is presumably what the == operator uses for numpy arrays. They state:

Input arrays. If x1.shape != x2.shape, they must be broadcastable to a common shape (which becomes the shape of the output).

So it looks like equal() is trying to allocate the full broadcast result, which has shape (30000, 30000, 1000) and needs about 838 GiB as a bool array. Perhaps == decides to fail and give the deprecation warning (rather than something more apt, such as an out-of-memory error) when it realizes there's not enough memory?
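The 838 GiB figure can be checked without allocating anything. The sketch below is illustrative only: broadcast_result_nbytes is a hypothetical helper name, and it assumes NumPy 1.20 or later, where np.broadcast_shapes is available:

```python
import numpy as np

def broadcast_result_nbytes(shape_a, shape_b, dtype=np.bool_):
    """Return the broadcast shape of two shapes and the bytes its result needs.

    Hypothetical helper for illustration; nothing is allocated here.
    """
    out_shape = np.broadcast_shapes(shape_a, shape_b)
    n_elements = 1
    for dim in out_shape:
        n_elements *= dim  # plain Python ints, so no integer overflow
    return out_shape, n_elements * np.dtype(dtype).itemsize

out_shape, nbytes = broadcast_result_nbytes((30000, 1, 1000), (30000, 1000))
print(out_shape)                      # (30000, 30000, 1000)
print(f"{nbytes / 2**30:.1f} GiB")    # about 838 GiB
```

A bool takes one byte per element in NumPy, so the result needs 30000 * 30000 * 1000 bytes.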

Also, if I reduce n from 30000000 to 3000000 (so the result has shape (3000, 3000, 1000), about 8.4 GiB as bool) and comment out the call to equal(), execution of the == statement takes 10 or 20 seconds before the following result is printed:

(3000, 1000) (3000, 1, 1000)
[[[ True  True  True ...  True  True  True]
  [False  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  ...
  [False  True  True ...  True False  True]
  [False False  True ...  True  True False]
  [ True False False ... False False False]]

 [[False  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  [False  True  True ...  True  True  True]
  ...
  [ True  True  True ...  True False  True]
  [ True False  True ...  True  True False]
  [False False False ... False False False]]

 [[ True  True  True ...  True  True  True]
  [False  True  True ...  True  True  True]
  [ True  True  True ...  True  True  True]
  ...
  [False  True  True ...  True False  True]
  [False False  True ...  True  True False]
  [ True False False ... False False False]]

 ...

 [[False  True  True ...  True False  True]
  [ True  True  True ...  True False  True]
  [False  True  True ...  True False  True]
  ...
  [ True  True  True ...  True  True  True]
  [ True False  True ...  True False False]
  [False False False ... False  True False]]

 [[False False  True ...  True  True False]
  [ True False  True ...  True  True False]
  [False False  True ...  True  True False]
  ...
  [ True False  True ...  True False False]
  [ True  True  True ...  True  True  True]
  [False  True False ... False False  True]]

 [[ True False False ... False False False]
  [False False False ... False False False]
  [ True False False ... False False False]
  ...
  [False False False ... False  True False]
  [False  True False ... False False  True]
  [ True  True  True ...  True  True  True]]

So it looks like the issue you've encountered is probably related to running out of memory.
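If the full 3D result does not fit in memory, one possible workaround is to compare a block of rows at a time and keep only a reduced summary of each block. The following is a rough sketch under assumptions, not a definitive fix: compare_rows_in_blocks and block_rows are hypothetical names, and it assumes a per-pair answer (whether whole rows match) is enough, rather than the full elementwise array:

```python
import numpy as np

def compare_rows_in_blocks(arr, block_rows=100):
    """Yield (start, block), where block equals arr[start:stop, None, :] == arr.

    Hypothetical helper: each block has shape (stop - start, n_rows, n_cols),
    so only one slice of the full comparison exists in memory at a time.
    """
    n_rows = arr.shape[0]
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        yield start, arr[start:stop, np.newaxis, :] == arr

# Small stand-in data so the sketch runs quickly.
rng = np.random.default_rng(0)
small = rng.integers(0, 2, size=(300, 20))

# Assumed goal: rows_equal[i, j] is True when row i equals row j.
rows_equal = np.empty((small.shape[0], small.shape[0]), dtype=bool)
for start, block in compare_rows_in_blocks(small, block_rows=64):
    rows_equal[start:start + block.shape[0]] = block.all(axis=2)
```

With the original 30000 x 1000 array, a block of 100 rows would still need 100 * 30000 * 1000 bytes (about 3 GB), so block_rows would have to be chosen to fit the available memory.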

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 constantstranger