Deduplicate numpy array by another array
I have two numpy arrays:
a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
a holds the item indices and b holds the score of each corresponding item. I want to sort the items by score in descending order while keeping only the highest score for each item. The result should be the deduplicated item indices a_new and the scores of those items b_new.
In the example above, I need:
a_new = np.array([3, 0, 2, 1])
b_new = np.array([1.0, 0.9, 0.8, 0.6])
I know I can do this with scatter_max, but it is a little slow. Is there an easier and faster solution?
Note that I don't want to transform the array to a dictionary, which is a trivial solution. I need a batched solution because I have millions of such arrays.
Solution 1:[1]
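After sorting both arrays by score in descending order, duplicates can be removed with np.unique. Its return_index option gives the index of the first occurrence of each value, which after sorting is the highest-scoring one: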
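ordering = np.argsort(b)[::-1]  # indices that sort b in descending order
a = a[ordering]
b = b[ordering]
# index of the first (highest-scoring) occurrence of each unique item
undup_ind = np.unique(a, return_index=True)[1]
keep = np.sort(undup_ind)  # restore descending-score order
a_new = a[keep]
b_new = b[keep]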
This should be the fastest, or one of the fastest, ways to reach the goal; in my test it ran in 0.5 seconds on 1,000,000 elements.
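The sort-then-unique steps above can be wrapped in a small function that returns both arrays without overwriting the inputs. This is a minimal sketch, not the answerer's exact code: the function name dedup_keep_max is hypothetical, and it sorts on -b with a stable sort instead of reversing argsort(b), which assumes b is numeric.

```python
import numpy as np

def dedup_keep_max(a, b):
    """Return unique items of `a` with their highest score from `b`,
    ordered by score descending. (Hypothetical helper name.)"""
    # Stable sort on the negated scores gives descending order.
    order = np.argsort(-b, kind="stable")
    a_sorted, b_sorted = a[order], b[order]
    # return_index yields the first occurrence of each unique item,
    # which is its highest score after the descending sort.
    _, first = np.unique(a_sorted, return_index=True)
    keep = np.sort(first)  # restore descending-score order
    return a_sorted[keep], b_sorted[keep]

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
a_new, b_new = dedup_keep_max(a, b)
print(a_new.tolist())  # [3, 0, 2, 1]
print(b_new.tolist())  # [1.0, 0.9, 0.8, 0.6]
```

Because the inputs are left untouched, the same function can be called in a loop over many (a, b) pairs.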
Solution 2:[2]
Have you tried with pandas?
import numpy as np
import pandas as pd
a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
# build the frame from a dict so that a keeps its integer dtype (np.stack would cast it to float)
df = pd.DataFrame({'a': a, 'b': b})
df = df.groupby('a')['b'].max().to_frame().reset_index().sort_values(by=['b'], ascending=False)
a_new = df['a'].to_numpy()
b_new = df['b'].to_numpy()
If you want parallel processing, you can explore PySpark, Dask, and the like.
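As a variation on the pandas approach in this answer, the same result can be reached by sorting first and then dropping duplicates, which keeps the first (highest-scoring) row of each item. This is a sketch under the assumption that a and b are 1-D arrays of equal length; it is an alternative formulation, not the answerer's code, and its speed relative to groupby has not been measured here.

```python
import numpy as np
import pandas as pd

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])

# Building from a dict keeps the integer dtype of `a`.
df = pd.DataFrame({"a": a, "b": b})

# Sort by score descending, then keep the first row seen for each item.
best = df.sort_values("b", ascending=False).drop_duplicates("a")

a_new = best["a"].to_numpy()
b_new = best["b"].to_numpy()
print(a_new.tolist())  # [3, 0, 2, 1]
print(b_new.tolist())  # [1.0, 0.9, 0.8, 0.6]
```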
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Thu Ya Kyaw |
