Unexpected error in `pandas.cut` internals related to indexing
I've isolated some behavior within pd.cut (https://github.com/pandas-dev/pandas/blob/1.3.x/pandas/core/reshape/tile.py#L424) whose failure mode I don't understand. It seems to be indexing-related: with the default RangeIndex on the Series everything runs as expected, but when every index label is set to 0, the code fails with KeyError: 1 (trace at the bottom of the question).
I'd like to understand exactly why the line ids[x == bins[0]] = 1 raises KeyError: 1 in this case.
A short, self-contained reproduction of the failure:
import numpy as np
import pandas as pd
ids = np.array([3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 5])
x = pd.Series([1.0839999914,1.1859999895,1.2050000429,1.2250000238,1.3220000267,1.3999999762,1.5529999733,1.7710000277,1.1859999895,1.2259999514,1.2669999599,1.3400000334,1.4579999447,1.6599999666,1.9950000048,1.1859999895,1.2269999981,1.3680000305,1.5249999762,1.7079999447,2.0309998989,2.5539999008])
bins = np.array([-np.inf, 0.25, 1.0, 2.0, 3.0, np.inf])
# long story short, dask.dataframe.read_parquet will effectively do this to the indices (down to the index type & dtypes)
x.index = [0] * len(x)
# this raises KeyError: 1 unless the index assignment above is removed
ids[x == bins[0]] = 1
Trace:
problem.py:9: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
ids[x == bins[0]] = 1
Traceback (most recent call last):
File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 160, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 195, in pandas._libs.index.IndexEngine._get_loc_duplicates
KeyError: 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "problem.py", line 9, in <module>
ids[x == bins[0]] = 1
File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/series.py", line 959, in __getitem__
return self._get_value(key)
File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/series.py", line 1070, in _get_value
loc = self.index.get_loc(label)
File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 1
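The second traceback shows the KeyError being raised from Series.__getitem__, which means something asked the boolean Series for the label 1. If, as the FutureWarning suggests, NumPy is probing the Series element by element (seq[0], seq[1], ...), the difference between the two index layouts can be sketched in isolation. The three-element mask and the lookup helper below are made up for illustration and are not part of the original reproduction.

```python
import pandas as pd


def lookup(series, label):
    """Return series[label], or the KeyError raised by the lookup."""
    try:
        return series[label]
    except KeyError as err:
        return err


# Hypothetical stand-in for the boolean result of `x == bins[0]`.
mask = pd.Series([False, False, False])

# Default RangeIndex: labels 0, 1, 2 all exist, so each lookup yields a scalar.
first_default = lookup(mask, 0)
second_default = lookup(mask, 1)

# Every label set to 0, as in the question.
mask.index = [0] * len(mask)
first_zeroed = lookup(mask, 0)   # label 0 matches every row: a Series comes back
second_zeroed = lookup(mask, 1)  # label 1 does not exist: KeyError(1)

print(type(first_default).__name__, type(first_zeroed).__name__, repr(second_zeroed))
```

With the zeroed index, the lookup for label 0 returns a whole Series rather than a scalar, and the lookup for label 1 fails, which matches the KeyError: 1 in the trace.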
Solution 1:[1]
It seems this indexing is going to be interpreted as an array index, and you should be using a tuple, as the warning says:
Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
So instead of
ids[x == bins[0]] = 1
use
ids[tuple(x == bins[0])] = 1
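One caveat: NumPy reads a tuple index as one index per element, not as a single boolean mask. If the intent is a mask (an assumption about what the pd.cut line is meant to do), an alternative is to convert the comparison result to a plain NumPy array before indexing, so the Series index labels never come into play. The sketch below is not from the answer and uses shortened, made-up data, with the first value set to -inf so that the mask selects something.

```python
import numpy as np
import pandas as pd

# Shortened, made-up stand-ins for the arrays in the question.
ids = np.array([3, 3, 4, 5])
x = pd.Series([-np.inf, 1.186, 1.205, 2.554])
bins = np.array([-np.inf, 0.25, 1.0, 2.0, 3.0, np.inf])

# Same index layout that triggers the KeyError in the question.
x.index = [0] * len(x)

# Series.to_numpy() returns a plain boolean ndarray and drops the labels.
mask = (x == bins[0]).to_numpy()
ids[mask] = 1

print(ids)
```

np.asarray(x == bins[0]) would serve the same purpose; either way NumPy receives an ndarray and does not need to look anything up on the Series.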
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | eshirvana |
