'Compare column value with many column attributes [closed]
I have machine learning excel file (>30k rows, 845 columns) and would like to check a column value (the training value). If a different training value is assigned to same attributes columns 12 - 849 it would mess up the machine learning.
Since a dictionary can only have 1 key and I would like to have multiple entries with the same training value assignment it is not an option.
What are efficient options iterating through such a big file, check for wrong entries and color code the wrong assignments (different name with same attributes)?
Table looks like this:
columns before are info-columns, columns after are till 840 similar, after that info-columns again
Solution 1:[1]
Having sample as below:
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3, 4, 5, 6, 3, 5],
'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'c', 'e' ],
'label': ['1', '2', '1', '2', '1', '2', '1', '2']}
df = pd.DataFrame(data=d)
df
And assuming that you want:
- detect as the incorrect case rows 4 and 7 as those have the same values in col1 and col2 but different labels
- detect as the correct case rows 2 and 6 because those have the same values in col1, col2 and label
- detect as the correct cases remining rows because those have the different values in col1, col2 and label
I will suggest to:
Group by columns 0:2 (in your case 12:850) and create the set of the labels (only unique values will appear):
aggregated = df.groupby(list(df.columns[0:2]))['label'].apply(set).reset_index(name='label')
aggregated
and filter out for cases where set size is greater than one:
aggregated[aggregated.label.str.len() > 1]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | annabednarska |



