Keep only sub-arrays with one unique value at position 0
Starting from a NumPy ndarray:
>>> arr
[
[
[10, 4, 5, 6, 7],
[11, 1, 2, 3, 4],
[11, 5, 6, 7, 8]
],
[
[12, 4, 5, 6, 7],
[12, 1, 2, 3, 4],
[12, 5, 6, 7, 8]
],
[
[15, 4, 5, 6, 7],
[15, 1, 2, 3, 4],
[15, 5, 6, 7, 8]
],
[
[13, 4, 5, 6, 7],
[13, 1, 2, 3, 4],
[14, 5, 6, 7, 8]
],
[
[10, 4, 5, 6, 7],
[11, 1, 2, 3, 4],
[12, 5, 6, 7, 8]
]
]
I would like to keep only the sequences of 3 sub-arrays which have only one unique value at position 0, so as to obtain the following:
>>> new_arr
[
[
[12, 4, 5, 6, 7],
[12, 1, 2, 3, 4],
[12, 5, 6, 7, 8]
],
[
[15, 4, 5, 6, 7],
[15, 1, 2, 3, 4],
[15, 5, 6, 7, 8]
]
]
From the initial array, arr[0], arr[3] and arr[4] were discarded because each of them had more than one unique value at position 0 (respectively [10, 11], [13, 14] and [10, 11, 12]).
I tried fiddling with numpy.unique() but could only get the global unique values at position 0 across all sub-arrays, which is not what's needed here.
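For example (a sketch of the kind of call I mean, with arr as the NumPy array above), np.unique() just pools position 0 across every sub-array:
>>> np.unique(arr[:, :, 0])
array([10, 11, 12, 13, 14, 15])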
-- EDIT
The following seems to get me closer to the solution:
>>> np.unique(arr[0, :, 0])
array([10, 11])
But I'm not sure how to get one level higher than this and apply such a condition to each sub-array of arr without using a Python loop.
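For reference, a plain loop does the job (a minimal sketch of the baseline I'd like to avoid):
>>> new_arr = np.array([sub for sub in arr if np.unique(sub[:, 0]).size == 1])
>>> new_arr.shape
(2, 3, 5)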
Solution 1:[1]
I was interested to see how these methods compared, so I benchmarked the answers here using a large dataset of shape (4000000, 4, 4).
Results:

| Name (time in ms) | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| test_np_arr_T | 128.3483 (1.0) | 130.5462 (1.0) | 129.0869 (1.0) | 0.9536 (1.01) | 128.5447 (1.0) | 1.5660 (1.83) | 2;0 | 7.7467 (1.0) | 8 | 1 |
| test_np_arr | 128.5017 (1.00) | 131.2399 (1.01) | 129.2841 (1.00) | 0.9414 (1.0) | 128.9724 (1.00) | 0.8553 (1.0) | 1;1 | 7.7349 (1.00) | 7 | 1 |
| test_pure_py_set | 2,840.2911 (22.13) | 2,849.0413 (21.82) | 2,844.4716 (22.04) | 3.8494 (4.09) | 2,846.1608 (22.14) | 6.4168 (7.50) | 3;0 | 0.3516 (0.05) | 5 | 1 |
| test_pure_py | 3,688.4772 (28.74) | 3,750.0933 (28.73) | 3,717.3411 (28.80) | 24.7294 (26.27) | 3,707.3502 (28.84) | 37.1902 (43.48) | 2;0 | 0.2690 (0.03) | 5 | 1 |
These benchmarks use pytest-benchmark, so I'd make a venv for running this:
python3 -m venv venv
. ./venv/bin/activate
pip install numpy pytest pytest-benchmark
Run the test:
pytest test_runs.py
test_runs.py
import numpy as np

# No guarantee this will produce sub-arrays with a shared value at
# position 0, but at this size some will, and every implementation is
# checked against the same reference result (RES) below.
ARR = np.random.randint(low=0, high=10, size=(4_000_000, 4, 4)).tolist()
# ARR = [
#     [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],
#     [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
#     [[15, 4, 5, 6, 7], [15, 1, 2, 3, 4], [15, 5, 6, 7, 8]],
#     [[13, 4, 5, 6, 7], [13, 1, 2, 3, 4], [14, 5, 6, 7, 8]],
#     [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
# ]

def pure_py(arr):
    # Plain Python: collect position 0 of every row, keep the sub-array
    # only if they are all equal to the first one.
    new_array = []
    for i, v in enumerate(arr):
        first_elems = [x[0] for x in v]
        if all(elem == first_elems[0] for elem in first_elems):
            new_array.append(arr[i])
    return new_array

def pure_py_set(arr):
    # Same idea, but "all equal" is tested by collapsing into a set.
    new_array = []
    for sub_arr in arr:
        if len(set(x[0] for x in sub_arr)) == 1:
            new_array.append(sub_arr)
    return new_array

def np_arr(arr):
    # Vectorized: compare every row's position 0 against the sub-array's
    # first row (arr[:, :1, 0] keeps a length-1 axis so the comparison
    # broadcasts), then keep sub-arrays where all rows match.
    return arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]

def np_arr_T(arr):
    # The same test written with transposes (Solution 2's approach).
    return arr[(arr[:, :, 0].T == arr[:, 0, 0]).T.all(axis=1)]

def np_not_arr(arr):
    # Includes the list -> ndarray conversion; used only to build the
    # reference result, not benchmarked.
    arr = np.array(arr)
    return arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]

RES = np_not_arr(ARR).tolist()

def test_pure_py(benchmark):
    res = benchmark(pure_py, ARR)
    assert res == RES

def test_pure_py_set(benchmark):
    res = benchmark(pure_py_set, ARR)
    assert res == RES

def test_np_arr(benchmark):
    ARR_ = np.array(ARR)  # conversion happens outside the timed call
    res = benchmark(np_arr, ARR_)
    assert res.tolist() == RES

def test_np_arr_T(benchmark):
    ARR_ = np.array(ARR)
    res = benchmark(np_arr_T, ARR_)
    assert res.tolist() == RES
Solution 2:[2]
Inspired by an attempt to reply in the form of an edit to the question (which I rejected, as it should have been an answer), here is something that worked:
>>> arr[(arr[:,:,0].T == arr[:,0,0]).T.all(axis=1)]
[
[
[12, 4, 5, 6, 7],
[12, 1, 2, 3, 4],
[12, 5, 6, 7, 8]
],
[
[15, 4, 5, 6, 7],
[15, 1, 2, 3, 4],
[15, 5, 6, 7, 8]
]
]
The trick was to transpose so that the comparison broadcasts correctly. Step by step:
# all 0-th positions of each subarray
arr[:,:,0].T
# the first 0-th position of each subarray
arr[:,0,0]
# whether each 0-th position equals the first one
(arr[:,:,0].T == arr[:,0,0]).T
# keep only the sub-arrays where the above is true at every position
(arr[:,:,0].T == arr[:,0,0]).T.all(axis=1)
# lastly, apply this indexing to the initial array
arr[(arr[:,:,0].T == arr[:,0,0]).T.all(axis=1)]
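Putting the pieces together (a minimal runnable sketch using the example array from the question, with shapes noted in the comments):
import numpy as np

arr = np.array([
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [11, 5, 6, 7, 8]],
    [[12, 4, 5, 6, 7], [12, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
    [[15, 4, 5, 6, 7], [15, 1, 2, 3, 4], [15, 5, 6, 7, 8]],
    [[13, 4, 5, 6, 7], [13, 1, 2, 3, 4], [14, 5, 6, 7, 8]],
    [[10, 4, 5, 6, 7], [11, 1, 2, 3, 4], [12, 5, 6, 7, 8]],
])

firsts = arr[:, :, 0]                            # shape (5, 3)
mask = (firsts.T == arr[:, 0, 0]).T.all(axis=1)  # shape (5,)
print(mask)       # [False  True  True False False]
print(arr[mask])  # only the sub-arrays with a single value at position 0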
Solution 3:[3]
OK, I've compared two solutions to this problem: with NumPy (the script by @rchome) and without it, in pure Python.
new_array = []
for i, v in enumerate(arr):
first_elems = [x[0] for x in v]
if all(elem == first_elems[0] for elem in first_elems):
new_array.append(arr[i])
Execution time for this code: about 0:00:00.000015.
arr = np.array(arr)
new_array = arr[np.all(arr[:, :, 0] == arr[:, :1, 0], axis=1)]
Execution time for this code: about 0:00:00.000060.
So with NumPy it took about 4 times longer. But keep in mind that this array is extremely small; with bigger arrays, NumPy should be faster. :)
-- EDIT: I've enlarged the array about 10 times; here are my results:
- python: 0:00:00.000205
- numpy: 0:00:00.002710
So maybe for a task this small, using NumPy is overkill.
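Solution 3's timing harness isn't shown; here is a minimal sketch of one way to get timedelta-style readouts like those above, assuming simple datetime.now() deltas (the toy data below is hypothetical; for micro-benchmarks this small, timeit or pytest-benchmark, as in Solution 1, gives more reliable numbers):
from datetime import datetime

# Hypothetical toy data: 2,000 small sub-arrays.
arr = [[[10, 4], [11, 1]], [[12, 4], [12, 1]]] * 1000

start = datetime.now()
new_array = [sub for sub in arr if len({row[0] for row in sub}) == 1]
print(datetime.now() - start)  # prints a timedelta like 0:00:00.000205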
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Alex |
| Solution 2 | |
| Solution 3 | |
