numpy alternative to pd.factorize()

Does anyone know a numpy alternative to pd.factorize()?

I need speed in an algorithm and would rather not go through a pandas DataFrame or Series.

So for instance,

test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])

should return the same codes as

pd.factorize(pd.Series(test))[0]
array([0, 1, 2, 1, 2, 0])


Solution 1:[1]

You can use numpy.unique with return_inverse=True:

test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])

np.unique(test, return_inverse=True)[1]

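output: array([2, 1, 0, 1, 0, 2])

Note that numpy.unique sorts the unique values, so the codes follow sorted order ('up' = 0, 'whats' = 1, 'yo' = 2) rather than the order of first appearance that pd.factorize uses. The grouping is identical; only the integer labels differ.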
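If the codes must match pd.factorize exactly (numbered by first appearance rather than by sorted order), they can be remapped using return_index. This is a minimal sketch; the function name factorize_first_seen is made up for illustration and is not part of NumPy.

```python
import numpy as np

def factorize_first_seen(values):
    """Hypothetical helper: codes numbered by first appearance."""
    values = np.asarray(values).ravel()
    uniques, first_idx, inverse = np.unique(
        values, return_index=True, return_inverse=True
    )
    # Order the sorted uniques by where each one first occurs in the input.
    order = np.argsort(first_idx)
    # rank[j] is the first-appearance code of the j-th sorted unique value.
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], uniques[order]

test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])
codes, uniques = factorize_first_seen(test)
print(codes)    # [0 1 2 1 2 0]
print(uniques)  # ['yo' 'whats' 'up']
```

The extra argsort is over the unique values only, so the overhead should be small when there are few distinct labels.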

timing

numpy.unique is faster up to ~10k items; beyond that, pandas.factorize is actually faster.

The pure-Python alternative (a dict-based loop) is only fast on small arrays (<100 items).

[Figure: comparison 1, timings on 1 to ~8M rows with 8 factors]

[Figure: comparison 2, timings on 1 to ~33M rows with 52 factors]
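To check where the crossover falls on your own machine, a rough timing harness along these lines could be used. It is only a sketch: the helper names (dict_codes, benchmark), the label strings, and the row counts are assumptions chosen for illustration, and pandas is treated as optional.

```python
import timeit
import numpy as np

try:
    import pandas as pd  # optional; only needed for the pd.factorize timing
except ImportError:
    pd = None

def dict_codes(values):
    # Pure-Python approach: each new label gets the next unused integer.
    seen = {}
    return [seen.setdefault(v, len(seen)) for v in values]

def benchmark(n_rows, n_factors=8, repeats=3, seed=0):
    """Hypothetical harness: best-of-N wall time in seconds per method."""
    rng = np.random.default_rng(seed)
    labels = np.array([f"label_{i}" for i in range(n_factors)])
    data = labels[rng.integers(0, n_factors, size=n_rows)]
    candidates = {
        "np.unique": lambda: np.unique(data, return_inverse=True)[1],
        "dict": lambda: dict_codes(data),
    }
    if pd is not None:
        candidates["pd.factorize"] = lambda: pd.factorize(data)[0]
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeats))
        for name, fn in candidates.items()
    }

for n in (100, 10_000, 100_000):
    print(n, benchmark(n))
```

Absolute numbers will vary with hardware, library versions, and the number of distinct labels, so treat the output as indicative only.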

Solution 2:[2]

The answer above is already great. On this six-element array, the wall time for Option # 2 below was roughly half that of Option # 1 (64.1 µs vs. 110 µs):

import numpy as np
test = np.array(['yo', 'whats', 'up', 'whats', 'up', 'yo'])

Option # 1:

%%time
x, y = np.unique(test, return_inverse=True)
y

Output:

CPU times: user 103 µs, sys: 23 µs, total: 126 µs
Wall time: 110 µs

array([2, 1, 0, 1, 0, 2])

Option # 2:

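%%time
d = {}
# len(d) gives each new label the next unused code; using the enumerate
# index instead would leave gaps whenever a value repeats before every
# label has been seen (e.g. ['a', 'a', 'b'] would give [0, 0, 2]).
[d.setdefault(w, len(d)) for w in test]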

Output:

CPU times: user 60 µs, sys: 1e+03 ns, total: 61 µs
Wall time: 64.1 µs

[0, 1, 2, 1, 2, 0]
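For the dict-based approach to reproduce pd.factorize on arbitrary input, each new label needs the next unused integer, which len() of the dict provides. The sketch below wraps that idea in a function that returns NumPy arrays; the name factorize_dict is hypothetical.

```python
import numpy as np

def factorize_dict(values):
    """Hypothetical helper: first-appearance codes via a plain dict."""
    seen = {}
    codes = np.fromiter(
        # len(seen) is evaluated before the insert, so a new label
        # receives the count of labels seen so far.
        (seen.setdefault(v, len(seen)) for v in values),
        dtype=np.intp,
        count=len(values),
    )
    # Dicts preserve insertion order, so keys are in first-appearance order.
    uniques = np.array(list(seen))
    return codes, uniques

codes, uniques = factorize_dict(np.array(['b', 'b', 'a', 'b', 'c']))
print(codes)    # [0 0 1 0 2]
print(uniques)  # ['b' 'a' 'c']
```

This stays a Python-level loop, so it is likely to be competitive only on small inputs.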

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Nilesh Ingle