Unexpected error in `pandas.cut` internals related to indexing

I've isolated some behavior inside pd.cut (https://github.com/pandas-dev/pandas/blob/1.3.x/pandas/core/reshape/tile.py#L424) whose failure mode I don't understand. It seems to be indexing-related: with the default RangeIndex on the Series, everything runs as expected; when every index label is set to 0, the code fails with KeyError: 1 (trace at the bottom of the question).

I'd like to understand exactly why the line ids[x == bins[0]] = 1 raises KeyError: 1 in this case.

A short, self-contained reproduction of the failure:

import numpy as np
import pandas as pd

ids = np.array([3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 5])
x = pd.Series([1.0839999914,1.1859999895,1.2050000429,1.2250000238,1.3220000267,1.3999999762,1.5529999733,1.7710000277,1.1859999895,1.2259999514,1.2669999599,1.3400000334,1.4579999447,1.6599999666,1.9950000048,1.1859999895,1.2269999981,1.3680000305,1.5249999762,1.7079999447,2.0309998989,2.5539999008])

bins = np.array([-np.inf, 0.25, 1.0, 2.0, 3.0, np.inf])

# long story short, dask.dataframe.read_parquet will effectively do this to the indices (down to the index type & dtypes)
x.index = [0] * len(x)

# this raises KeyError: 1 unless the index assignment above is removed
ids[x == bins[0]] = 1

Trace:

problem.py:9: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  ids[x == bins[0]] = 1
Traceback (most recent call last):
  File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 160, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 195, in pandas._libs.index.IndexEngine._get_loc_duplicates
KeyError: 1
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "problem.py", line 9, in <module>
    ids[x == bins[0]] = 1
  File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/series.py", line 959, in __getitem__
    return self._get_value(key)
  File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/series.py", line 1070, in _get_value
    loc = self.index.get_loc(label)
  File "/Users/user/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 1
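
The last frames of the trace go through Series.__getitem__ and index.get_loc with the key 1, so something is asking the boolean Series for the label 1. A minimal sketch of just that lookup, using a small made-up mask rather than the data above: with a default RangeIndex the label 1 exists, but once every label is 0 it does not. That NumPy reaches this lookup by reading the Series element by element with integer keys is an assumption based on the trace and the FutureWarning, not something confirmed here.

```python
import pandas as pd

# Hypothetical three-element masks, only to isolate the label lookup.
mask_default = pd.Series([False, True, False])                  # labels 0, 1, 2
mask_zeroed = pd.Series([False, True, False], index=[0, 0, 0])  # labels 0, 0, 0

# With an integer index, an integer key is looked up as a label.
print(mask_default[1])  # label 1 exists, so this returns a value

try:
    mask_zeroed[1]      # no label 1 in an index made only of zeros
    zeroed_error = None
except KeyError as err:
    zeroed_error = err
    print("KeyError:", err)
```

The same integer key that works on the default index fails on the zeroed one, which matches the KeyError: 1 at the bottom of the trace.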


Solution 1:[1]

It seems this form of indexing is deprecated and will in the future be interpreted as an array index; the warning suggests using a tuple instead:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  ids[x == bins[0]] = 1

ids[tuple(x == bins[0])] = 1
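
A possible alternative, sketched under the assumption that the goal is an ordinary element-wise boolean mask: convert the comparison result to a plain NumPy array with Series.to_numpy() before indexing, so the Series index labels never take part in the lookup. The short ids and x arrays below are made up for illustration and are not the data from the question; whether the tuple form above selects element-wise is not verified here.

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs; two values equal bins[0] so the mask is not empty.
ids = np.array([3, 3, 4, 5])
x = pd.Series([-np.inf, 1.2, -np.inf, 2.5])
bins = np.array([-np.inf, 0.25, 1.0, 2.0, 3.0, np.inf])

# Same index manipulation as in the question: every label becomes 0.
x.index = [0] * len(x)

# Drop the pandas index by converting the boolean Series to an ndarray.
mask = (x == bins[0]).to_numpy()
ids[mask] = 1

print(ids)  # [1 3 1 5]
```

Because mask is a NumPy boolean array of the same length as ids, the assignment is positional and does not depend on what the Series index looks like.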

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 eshirvana