numpy alternative to pd.factorize()
Does anyone know a numpy alternative to pd.factorize()?
I need an algorithm to run fast and would like to avoid pandas objects altogether.
So for instance,
test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])
should return the same codes as
pd.factorize(pd.Series(test))[0]
array([0, 1, 2, 1, 2, 0])
Solution 1:[1]
You can use numpy.unique with return_inverse=True:
test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])
np.unique(test, return_inverse=True)[1]
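output: array([2, 1, 0, 1, 0, 2])
Note that these codes differ from the pd.factorize result: numpy.unique sorts the unique values, so the codes follow sorted order ('up' = 0, 'whats' = 1, 'yo' = 2) rather than order of first appearance.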
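If the codes must follow order of first appearance, as in the pd.factorize output from the question, the result of numpy.unique can be remapped with return_index. The following is a minimal sketch for 1-D input. The function name factorize_np is made up for illustration and is not part of NumPy.

```python
import numpy as np


def factorize_np(values):
    """Return (codes, uniques) with codes numbered by first appearance."""
    values = np.asarray(values)
    # uniq is sorted; first_idx[i] is where uniq[i] first occurs in values;
    # inverse maps every element of values to its position in uniq.
    uniq, first_idx, inverse = np.unique(
        values, return_index=True, return_inverse=True
    )
    # Order the sorted uniques by where they first appear.
    order = np.argsort(first_idx)
    # rank[j] is the first-appearance code of the sorted unique uniq[j].
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], uniq[order]


test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])
codes, uniques = factorize_np(test)
print(codes)    # [0 1 2 1 2 0]
print(uniques)  # ['yo' 'whats' 'up']
```

The extra argsort only touches the unique values, so its cost should be small when there are few distinct values relative to the array length.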
Timing
numpy.unique is faster up to ~10k items; beyond that, pandas.factorize is actually faster.
A pure-Python alternative is only fast on small arrays (<100 items).
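These crossover points depend on the machine, the dtype, and the number of distinct values, so it is worth measuring on your own data. Below is a rough harness using timeit. The helper names (codes_unique, codes_dict, benchmark) and the default sizes are made up for illustration, and pandas.factorize is left out so the sketch needs only NumPy. It can be added as another entry in the same way.

```python
import timeit

import numpy as np


def codes_unique(arr):
    # Codes in sorted order of the unique values.
    return np.unique(arr, return_inverse=True)[1]


def codes_dict(arr):
    # Codes in order of first appearance, built with a plain dict.
    seen = {}
    return [seen.setdefault(w, len(seen)) for w in arr.tolist()]


def benchmark(sizes=(100, 10_000), repeats=5, number=10, seed=0):
    """Return {size: {method: best seconds per call}}."""
    rng = np.random.default_rng(seed)
    words = np.array(['yo', 'whats', 'up'])
    results = {}
    for n in sizes:
        arr = rng.choice(words, size=n)
        results[n] = {}
        for name, func in (('np.unique', codes_unique), ('dict', codes_dict)):
            # Take the best of several repeats to reduce noise.
            best = min(timeit.repeat(lambda: func(arr), number=number, repeat=repeats))
            results[n][name] = best / number
    return results


if __name__ == '__main__':
    for n, timings in benchmark().items():
        for name, seconds in timings.items():
            print(f'n={n:>6}  {name:<10} {seconds * 1e6:10.1f} µs')
```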
Solution 2:[2]
The answer above is already great. On this small input, though, the wall time of Option # 2 below was almost half that of Option # 1:
import numpy as np
test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])
Option # 1:
%%time
x, y = np.unique(test, return_inverse=True)
y
Output:
CPU times: user 103 µs, sys: 23 µs, total: 126 µs
Wall time: 110 µs
array([2, 1, 0, 1, 0, 2])
Option # 2:
d = {}
# use len(d) rather than the enumerate index so that codes stay consecutive (0, 1, 2, ...)
[d.setdefault(w, len(d)) for w in test]
Output:
CPU times: user 60 µs, sys: 1e+03 ns, total: 61 µs
Wall time: 64.1 µs
[0, 1, 2, 1, 2, 0]
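The dictionary approach can be wrapped in a function that returns both the codes and the unique values, similar in shape to what pd.factorize returns. This is only a sketch: the name factorize_dict is hypothetical, and it relies on dicts preserving insertion order (Python 3.7+). Each new value is assigned len(seen) as its code, so the codes are consecutive integers in order of first appearance.

```python
import numpy as np


def factorize_dict(values):
    """Return (codes, uniques) using a plain dict; codes follow first appearance."""
    seen = {}
    # setdefault stores len(seen) only for values not seen before, so each
    # new value gets the next unused integer as its code.
    codes = [seen.setdefault(v, len(seen)) for v in np.asarray(values).tolist()]
    # Dict keys are kept in insertion order, i.e. order of first appearance.
    return np.array(codes, dtype=np.intp), np.array(list(seen))


test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])
codes, uniques = factorize_dict(test)
print(codes)    # [0 1 2 1 2 0]
print(uniques)  # ['yo' 'whats' 'up']
```

Converting with tolist() first means the loop runs over plain Python strings instead of NumPy scalars, which is usually cheaper to hash.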
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Nilesh Ingle |


