pandas.core.indexing.IndexingError: Too many indexers in scikit-learn agglomerative clustering
I have this data set:
col_index Sample FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020 Renal_Survival_Aug2020 Group
1 23 0 1 0 0 0 1 1 1 23 9
2 24 0 1 1 0 1 1 0 1 45 9
3 25 0 1 0 0 0 1 0 0 34 9
4 26 1 1 0 1 1 0 0 1 2 8
5 27 0 2 1 0 0 0 1 1 38 8
6 28 1 2 1 0 0 1 0 1 3 8
7 29 0 2 0 1 0 0 0 1 1 7
8 30 1 2 1 0 0 0 1 0 0 7
9 31 1 3 1 0 1 1 1 0 1 7
10 32 1 3 0 0 0 1 0 0 2 0
All of the data is categorical, except for Renal_Survival_Aug2020 which is an integer.
I want to cluster this data to find where row 10 falls, because the 0 in the Group column for row 10 means I don't know what group that person is in, and I want to work out whether they're close to group 7,8,9 or a separate cluster.
I do not want to use the col_index (column index) or Sample (a sample ID) in the clustering process (as they are just identifiers).
I ran this code:
import pandas as pd
import numpy as np
from numpy import where
from numpy import unique
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plot
df = pd.read_csv('test.txt',sep='\t',dtype='category')
df = df.drop(['col_index','Sample'],axis=1)
df['Renal_Survival_Aug2020'] = df['Renal_Survival_Aug2020'].astype(float)
print(df.dtypes)
print(df.shape)
print(df)
df1 = df[df.isna().any(axis=1)]
print(df1)
agg_mdl = AgglomerativeClustering(n_clusters=3) #three, to force the unknown to go to one of the three?
agg_results = agg_mdl.fit_predict(df)
agg_clusters = unique(agg_results)
for agg_cluster in agg_clusters:
    index = where(agg_results == agg_cluster)
    plot.scatter(df.iloc[:,index,0],df.iloc[:,index,1])
I get the error:
FID category
SNP1 category
SNP2 category
SNP3 category
SNP4 category
SNP5 category
LiverCysts category
ESRD_Aug2020 category
Renal_Survival_Aug2020 float64
Group category
dtype: object
(10, 10)
FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020 Renal_Survival_Aug2020 Group
0 0 1 0 0 0 1 1 1 23.0 9
1 0 1 1 0 1 1 0 1 45.0 9
2 0 1 0 0 0 1 0 0 34.0 9
3 1 1 0 1 1 0 0 1 2.0 8
4 0 2 1 0 0 0 1 1 38.0 8
5 1 2 1 0 0 1 0 1 3.0 8
6 0 2 0 1 0 0 0 1 1.0 7
7 1 2 1 0 0 0 1 0 0.0 7
8 1 3 1 0 1 1 1 0 1.0 7
9 1 3 0 0 0 1 0 0 2.0 0
Empty DataFrame
Columns: [FID, SNP1, SNP2, SNP3, SNP4, SNP5, LiverCysts, ESRD_Aug2020, Renal_Survival_Aug2020, Group]
Index: []
/Users/slowatkela/anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:63: FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
return f(*args, **kwargs)
Traceback (most recent call last):
File "unsupervised_clustering.py", line 32, in <module>
plot.scatter(df.iloc[:,index,0],df.iloc[:,index,1])
File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 889, in __getitem__
return self._getitem_tuple(key)
File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 1450, in _getitem_tuple
self._has_valid_tuple(tup)
File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 720, in _has_valid_tuple
self._validate_key_length(key)
File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexing.py", line 761, in _validate_key_length
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
I know what the problem is, just not how to fix it: I'm not building the scatter plot properly. Looking around, the code here writes that line slightly differently, so when I change that line to plot.scatter(df[index,0],df[index,1]), I get the error:
Traceback (most recent call last):
File "unsupervised_clustering.py", line 33, in <module>
plot.scatter(df[index,0],df[index,1])
File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "/Users/slowatkela/anaconda/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '((array([3, 5, 6, 7, 8]),), 0)' is an invalid key
Could someone show me how to fix this? My overall idea is I want a dual output - a picture of the plot with the unknown sample highlighted somehow so I can see where it is, and also ideally an output telling me what cluster each row went into, so I can tell where row 10 went, for example:
Sample Cluster
23 1
24 1
25 1
26 2
27 2
28 2
29 3
30 3
31 3
32 1 ----> so I can see the unknown sample is in cluster 1
P.S. I have seen this question on SO, but I don't understand what they did, and there is some confusion in the comments, so I'm unclear how to implement this.
Solution 1:[1]
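I'm not quite sure what you're trying to accomplish with the loop and plot, but here is how to fix your error, print the expected result to the terminal, and show the clusters in a scatter chart. The error itself arises because a DataFrame has only two axes, so df.iloc accepts at most a row indexer and a column indexer, whereas df.iloc[:,index,0] passes three. I added some explanations directly in the code comments below.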
Note 1: You're currently using the Group for clustering. Not sure if this is intentional so I did not change it.
Note 2: Some values/indexes are hardcoded. This can be improved but it's not the point here.
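Before the full script, here is a minimal sketch of the indexing problem on its own. The toy frame and the `labels` array are made up for illustration and are not the question's data. It shows that `where` returns a tuple, that three indexers are rejected by `.iloc`, and that unwrapping the tuple and indexing as `[rows, column]` works.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two columns of the real frame.
df = pd.DataFrame({"x": [23.0, 45.0, 34.0, 2.0], "y": [1, 1, 0, 1]})
labels = np.array([0, 0, 1, 1])  # made-up cluster labels, one per row

# np.where(condition) returns a tuple holding one array of positions.
index = np.where(labels == 1)

# A DataFrame has two axes, so .iloc accepts at most [rows, columns].
try:
    df.iloc[:, index, 0]
    error_name = None
except Exception as exc:  # IndexingError in the question's traceback
    error_name = type(exc).__name__
print(error_name)

# Unwrap the tuple and pass row positions first, column position second.
rows = index[0]
xs = df.iloc[rows, 0]
ys = df.iloc[rows, 1]
print(xs.tolist(), ys.tolist())
```

Applied to the loop in the question, this would amount to `plot.scatter(df.iloc[index[0], 0], df.iloc[index[0], 1])`. Note that the first two columns of that frame are FID and SNP1, so such a plot would only show those two features.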
from pandas import read_csv
from numpy import where
from numpy import unique
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
dfx = read_csv('test.txt',sep='\t',dtype='category') # keep original dataset with index column
df = dfx.drop(['col_index','Sample'],axis=1)
df['Renal_Survival_Aug2020'] = df['Renal_Survival_Aug2020'].astype(float)
df1 = df[df.isna().any(axis=1)]
agg_mdl = AgglomerativeClustering(n_clusters=3)
agg_results = agg_mdl.fit_predict(df)
agg_clusters = unique(agg_results)
# Add the cluster labels as a new column and print them to the console.
# Selecting the column by name avoids depending on its position.
dfx['Cluster'] = agg_results
res = dfx[['Cluster']]
print(res)
# Manually define the colors: blue for the nine known rows, red for the unknown one.
# A smarter way is needed if you want unknown to be "all rows
# where Group=0" in your dataset.
colors = ['blue'] * 9 + ['red']
# Plot the cluster index for each row.
# You can do this directly from the dataframe (at least with pandas 1.4.2).
# The unknown sample is shown in red.
dfx.plot.scatter(x='col_index', y='Cluster', c=colors)
plt.show()
Result (console):
Cluster
0 2
1 0
2 0
3 1
4 0
5 1
6 1
7 1
8 1
9 1 <--- cluster where unknown sample goes.
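To get the Sample/Cluster table asked for in the question without hardcoded positions, a sketch along the following lines might help. It makes several assumptions that differ from the script above: the ten rows are inlined and read as integers rather than categories, the features are converted to float explicitly (which should also avoid the FutureWarning), and Group is left out of the features, as hinted at in Note 1. Cluster numbers are arbitrary labels, so no particular output is claimed here.

```python
from io import StringIO

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# The ten rows from the question, inlined so the sketch is self-contained.
raw = """col_index Sample FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020 Renal_Survival_Aug2020 Group
1 23 0 1 0 0 0 1 1 1 23 9
2 24 0 1 1 0 1 1 0 1 45 9
3 25 0 1 0 0 0 1 0 0 34 9
4 26 1 1 0 1 1 0 0 1 2 8
5 27 0 2 1 0 0 0 1 1 38 8
6 28 1 2 1 0 0 1 0 1 3 8
7 29 0 2 0 1 0 0 0 1 1 7
8 30 1 2 1 0 0 0 1 0 0 7
9 31 1 3 1 0 1 1 1 0 1 7
10 32 1 3 0 0 0 1 0 0 2 0
"""
dfx = pd.read_csv(StringIO(raw), sep=r"\s+")  # columns are read as integers here

# Assumption: identifiers and the Group label are excluded from the features.
features = dfx.drop(columns=["col_index", "Sample", "Group"]).astype(float)
dfx["Cluster"] = AgglomerativeClustering(n_clusters=3).fit_predict(features)

# One row per sample with its assigned cluster.
result = dfx[["Sample", "Cluster"]]
print(result.to_string(index=False))

# Unknown samples are the rows with Group == 0, found without hardcoded positions.
unknown = dfx["Group"] == 0
print(dfx.loc[unknown, ["Sample", "Cluster"]])
```

Because the features are unscaled, the Euclidean distances are likely dominated by Renal_Survival_Aug2020, whose range is far larger than that of the 0/1 columns. Scaling or a different distance measure may be worth considering, depending on what the clusters should reflect.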
Result (plot):
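A comparable figure can be regenerated with a sketch like the one below, which highlights every row with Group equal to 0 instead of relying on a fixed list of colors. Assumptions: the data is inlined and read as integers, Group is kept among the features as in the script above, the non-interactive Agg backend is selected so that the figure is written to a file, and `clusters.png` is a hypothetical output name.

```python
from io import StringIO

import matplotlib
matplotlib.use("Agg")  # file output only; drop this line and use plt.show() for a window
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# The ten rows from the question, inlined so the sketch is self-contained.
raw = """col_index Sample FID SNP1 SNP2 SNP3 SNP4 SNP5 LiverCysts ESRD_Aug2020 Renal_Survival_Aug2020 Group
1 23 0 1 0 0 0 1 1 1 23 9
2 24 0 1 1 0 1 1 0 1 45 9
3 25 0 1 0 0 0 1 0 0 34 9
4 26 1 1 0 1 1 0 0 1 2 8
5 27 0 2 1 0 0 0 1 1 38 8
6 28 1 2 1 0 0 1 0 1 3 8
7 29 0 2 0 1 0 0 0 1 1 7
8 30 1 2 1 0 0 0 1 0 0 7
9 31 1 3 1 0 1 1 1 0 1 7
10 32 1 3 0 0 0 1 0 0 2 0
"""
dfx = pd.read_csv(StringIO(raw), sep=r"\s+")

# Identifiers are dropped; Group stays in the features, mirroring the script above.
features = dfx.drop(columns=["col_index", "Sample"]).astype(float)
dfx["Cluster"] = AgglomerativeClustering(n_clusters=3).fit_predict(features)

# Boolean mask of the rows whose group is unknown.
unknown = dfx["Group"] == 0

fig, ax = plt.subplots()
ax.scatter(dfx.loc[~unknown, "col_index"], dfx.loc[~unknown, "Cluster"],
           c="blue", label="known group")
ax.scatter(dfx.loc[unknown, "col_index"], dfx.loc[unknown, "Cluster"],
           c="red", label="unknown (Group 0)")
ax.set_xlabel("col_index")
ax.set_ylabel("Cluster")
ax.set_yticks(sorted(dfx["Cluster"].unique()))
ax.legend()
fig.savefig("clusters.png")  # hypothetical output filename
```

Drawing the known and unknown rows in two separate `scatter` calls keeps the legend meaningful and works for any number of unknown samples.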
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | evilmandarine |

