pandas.core.indexing.IndexingError: Too many indexers in scikit-learn agglomerative clustering

I have this data set:

col_index   Sample  FID SNP1    SNP2    SNP3    SNP4    SNP5    LiverCysts  ESRD_Aug2020    Renal_Survival_Aug2020  Group
1   23  0   1   0   0   0   1   1   1   23  9
2   24  0   1   1   0   1   1   0   1   45  9
3   25  0   1   0   0   0   1   0   0   34  9
4   26  1   1   0   1   1   0   0   1   2   8
5   27  0   2   1   0   0   0   1   1   38  8
6   28  1   2   1   0   0   1   0   1   3   8
7   29  0   2   0   1   0   0   0   1   1   7
8   30  1   2   1   0   0   0   1   0   0   7
9   31  1   3   1   0   1   1   1   0   1   7
10  32  1   3   0   0   0   1   0   0   2   0

All of the data is categorical, except for Renal_Survival_Aug2020 which is an integer.

I want to cluster this data to find where row 10 falls, because the 0 in the Group column for row 10 means I don't know what group that person is in, and I want to work out whether they're close to group 7,8,9 or a separate cluster.

I do not want to use the col_index (column index) or Sample (a sample ID) in the clustering process (as they are just identifiers).

I ran this code:

import pandas as pd
import numpy as np
from numpy import where
from numpy import unique
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plot


df = pd.read_csv('test.txt',sep='\t',dtype='category')
df = df.drop(['col_index','Sample'],axis=1)
df['Renal_Survival_Aug2020'] = df['Renal_Survival_Aug2020'].astype(float)

print(df.dtypes)
print(df.shape)
print(df)

df1 = df[df.isna().any(axis=1)]
print(df1)

agg_mdl = AgglomerativeClustering(n_clusters=3) #three, to force the unknown to go to one of the three?
agg_results = agg_mdl.fit_predict(df)
agg_clusters = unique(agg_results)


for agg_cluster in agg_clusters:
        index = where(agg_results == agg_cluster)
        plot.scatter(df.iloc[:,index,0],df.iloc[:,index,1])

I get the error:

FID                       category
SNP1                      category
SNP2                      category
SNP3                      category
SNP4                      category
SNP5                      category
LiverCysts                category
ESRD_Aug2020              category
Renal_Survival_Aug2020     float64
Group                     category
dtype: object
(10, 10)
  FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020  Renal_Survival_Aug2020 Group
0   0    1    0    0    0    1          1            1                    23.0     9
1   0    1    1    0    1    1          0            1                    45.0     9
2   0    1    0    0    0    1          0            0                    34.0     9
3   1    1    0    1    1    0          0            1                     2.0     8
4   0    2    1    0    0    0          1            1                    38.0     8
5   1    2    1    0    0    1          0            1                     3.0     8
6   0    2    0    1    0    0          0            1                     1.0     7
7   1    2    1    0    0    0          1            0                     0.0     7
8   1    3    1    0    1    1          1            0                     1.0     7
9   1    3    0    0    0    1          0            0                     2.0     0
Empty DataFrame
Columns: [FID, SNP1, SNP2, SNP3, SNP4, SNP5, LiverCysts, ESRD_Aug2020, Renal_Survival_Aug2020, Group]
Index: []
/Users/slowatkela/anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:63: FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
  return f(*args, **kwargs)
Traceback (most recent call last):
  File "unsupervised_clustering.py", line 32, in <module>
    plot.scatter(df.iloc[:,index,0],df.iloc[:,index,1])
  File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 1450, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 720, in _has_valid_tuple
    self._validate_key_length(key)
  File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 761, in _validate_key_length
    raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers

I know the problem is in how I call the scatter plot, but not how to fix it. Looking around, the code here writes that line slightly differently, so when I change my line to plot.scatter(df[index,0],df[index,1]), I get this error instead:

FID                       category
SNP1                      category
SNP2                      category
SNP3                      category
SNP4                      category
SNP5                      category
LiverCysts                category
ESRD_Aug2020              category
Renal_Survival_Aug2020     float64
Group                     category
dtype: object
(10, 10)
  FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020  Renal_Survival_Aug2020 Group
0   0    1    0    0    0    1          1            1                    23.0     9
1   0    1    1    0    1    1          0            1                    45.0     9
2   0    1    0    0    0    1          0            0                    34.0     9
3   1    1    0    1    1    0          0            1                     2.0     8
4   0    2    1    0    0    0          1            1                    38.0     8
5   1    2    1    0    0    1          0            1                     3.0     8
6   0    2    0    1    0    0          0            1                     1.0     7
7   1    2    1    0    0    0          1            0                     0.0     7
8   1    3    1    0    1    1          1            0                     1.0     7
9   1    3    0    0    0    1          0            0                     2.0     0
Empty DataFrame
Columns: [FID, SNP1, SNP2, SNP3, SNP4, SNP5, LiverCysts, ESRD_Aug2020, Renal_Survival_Aug2020, Group]
Index: []
/Users/slowatkela/anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:63: FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
  return f(*args, **kwargs)
Traceback (most recent call last):
  File "unsupervised_clustering.py", line 33, in <module>
    plot.scatter(df[index,0],df[index,1])
  File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '((array([3, 5, 6, 7, 8]),), 0)' is an invalid key

Could someone show me how to fix this? My overall idea is I want a dual output - a picture of the plot with the unknown sample highlighted somehow so I can see where it is, and also ideally an output telling me what cluster each row went into, so I can tell where row 10 went, for example:

Sample  Cluster
23     1
24     1
25     1
26     2
27     2
28     2
29     3
30     3
31     3
32     1 ----> so I can see the unknown sample is in cluster 1

p.s. I have seen this question on SO, but I don't understand what they did and also there's some confusion in the comments - so I'm unclear how to implement this.



Solution 1:[1]

I'm not quite sure what you're trying to accomplish with the loop/plot, but here is how to fix your error, print the expected result to the terminal, and show the clusters in a scatter chart. I added some explanations directly in the code comments below.

Note 1: You're currently using the Group for clustering. Not sure if this is intentional so I did not change it.
Note 2: Some values/indexes are hardcoded. This can be improved but it's not the point here.

from pandas import read_csv
from numpy import where
from numpy import unique
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

dfx = read_csv('test.txt',sep='\t',dtype='category') # keep original dataset with index column
df = dfx.drop(['col_index','Sample'],axis=1)
df['Renal_Survival_Aug2020'] = df['Renal_Survival_Aug2020'].astype(float)
df1 = df[df.isna().any(axis=1)]

agg_mdl = AgglomerativeClustering(n_clusters=3)
agg_results = agg_mdl.fit_predict(df)
agg_clusters = unique(agg_results)

# Add the cluster result as a new column of the original dataframe.
# Select it by name (a hardcoded position can point at the wrong column)
# and print the result to the console.
dfx['Cluster'] = agg_results
res = dfx[['Cluster']]
print(res)

# Manually define the colors, with a different one for the unknown sample.
# A smarter way is needed if you want "unknown" to mean "all rows
# where Group=0" in your dataset.
colors = ['blue'] * 9 + ['red']

# Plot the result showing the cluster index for each row.
# You can do this directly from the dataframe (at least with pandas 1.4.2).
# The unknown sample is shown in red.
dfx.plot.scatter(x='col_index', y='Cluster', c=colors)
plt.show()

Result (console):

   Cluster
0        2
1        0
2        0
3        1
4        0
5        1
6        1
7        1
8        1
9        1 <--- cluster where unknown sample goes.
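
The console output above lists only the cluster labels. As a minimal sketch of the Sample/Cluster table asked for in the question, the block below keeps the Sample column next to the labels and flags unknown rows from Group == 0 instead of a hardcoded position. It makes some assumptions that are not in the original code: the ten rows are embedded inline so it runs without test.txt, the columns are read as plain numbers rather than dtype='category', and the names features, result and unknown_clusters are only illustrative. Group is still part of the clustering input, as in the code above.

```python
import io

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Inline copy of the question's ten rows (assumption: stands in for test.txt).
DATA = """\
col_index Sample FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020 Renal_Survival_Aug2020 Group
1 23 0 1 0 0 0 1 1 1 23 9
2 24 0 1 1 0 1 1 0 1 45 9
3 25 0 1 0 0 0 1 0 0 34 9
4 26 1 1 0 1 1 0 0 1 2 8
5 27 0 2 1 0 0 0 1 1 38 8
6 28 1 2 1 0 0 1 0 1 3 8
7 29 0 2 0 1 0 0 0 1 1 7
8 30 1 2 1 0 0 0 1 0 0 7
9 31 1 3 1 0 1 1 1 0 1 7
10 32 1 3 0 0 0 1 0 0 2 0
"""

# Read everything as numbers (assumption: the category codes are used as-is).
dfx = pd.read_csv(io.StringIO(DATA), sep=r"\s+")

# Identifier columns are excluded from clustering; Group is kept, as above.
features = dfx.drop(columns=["col_index", "Sample"])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)

# One row per sample: its ID, its cluster, and whether its group is unknown.
result = dfx[["Sample"]].copy()
result["Cluster"] = labels
result["Unknown"] = dfx["Group"].eq(0)
print(result.to_string(index=False))

# Only the samples whose group is unknown (Group == 0).
unknown_clusters = result.loc[result["Unknown"], ["Sample", "Cluster"]]
print(unknown_clusters.to_string(index=False))
```

The cluster numbers are arbitrary labels, so they will not necessarily line up with groups 7, 8 and 9. What matters is which known samples share a label with the unknown one.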

Result (plot):

[Image: scatter plot of Cluster against col_index, with the unknown sample shown in red]
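
For reference, the two errors in the question come from the indexing itself. On a DataFrame, .iloc accepts at most two indexers (rows, columns), so df.iloc[:,index,0] raises "Too many indexers". numpy's where returns a tuple of arrays, which is why df[index,0] reports a key like ((array([...]),), 0). A hedged sketch of the per-cluster loop with both points addressed is below. The choice of the two plotted columns (Renal_Survival_Aug2020 and SNP1) is arbitrary and only for illustration, the data is embedded inline and read as numbers so the block runs without test.txt, and the ring marker for the unknown sample is just one possible way to highlight it.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Inline copy of the question's ten rows (assumption: stands in for test.txt).
DATA = """\
col_index Sample FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020 Renal_Survival_Aug2020 Group
1 23 0 1 0 0 0 1 1 1 23 9
2 24 0 1 1 0 1 1 0 1 45 9
3 25 0 1 0 0 0 1 0 0 34 9
4 26 1 1 0 1 1 0 0 1 2 8
5 27 0 2 1 0 0 0 1 1 38 8
6 28 1 2 1 0 0 1 0 1 3 8
7 29 0 2 0 1 0 0 0 1 1 7
8 30 1 2 1 0 0 0 1 0 0 7
9 31 1 3 1 0 1 1 1 0 1 7
10 32 1 3 0 0 0 1 0 0 2 0
"""

dfx = pd.read_csv(io.StringIO(DATA), sep=r"\s+")
df = dfx.drop(columns=["col_index", "Sample"])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(df)

# Positions of the two columns to plot (arbitrary choice for illustration).
x_col = df.columns.get_loc("Renal_Survival_Aug2020")
y_col = df.columns.get_loc("SNP1")

fig, ax = plt.subplots()
for cluster in np.unique(labels):
    # where() returns a tuple of arrays; [0] extracts the row positions.
    rows = np.where(labels == cluster)[0]
    # .iloc takes exactly two indexers here: row positions, column position.
    ax.scatter(df.iloc[rows, x_col], df.iloc[rows, y_col],
               label=f"cluster {cluster}")

# Draw a red ring around every row whose group is unknown (Group == 0).
unknown = np.flatnonzero(df["Group"].to_numpy() == 0)
ax.scatter(df.iloc[unknown, x_col], df.iloc[unknown, y_col],
           s=200, facecolors="none", edgecolors="red",
           label="unknown (Group 0)")

ax.set_xlabel("Renal_Survival_Aug2020")
ax.set_ylabel("SNP1")
ax.legend()
```

Call plt.show() afterwards to display the figure, or fig.savefig(...) to write it to a file.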

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 evilmandarine