'Compare column value with many column attributes [closed]

I have machine learning excel file (>30k rows, 845 columns) and would like to check a column value (the training value). If a different training value is assigned to same attributes columns 12 - 849 it would mess up the machine learning.

Since a dictionary can only have 1 key and I would like to have multiple entries with the same training value assignment it is not an option.

What are efficient options iterating through such a big file, check for wrong entries and color code the wrong assignments (different name with same attributes)?

Table looks like this:

columns before are info-columns, columns after are till 840 similar, after that info-columns again



Solution 1:[1]

Having sample as below:

import pandas as pd
import numpy as np

d = {'col1': [1, 2, 3, 4, 5, 6, 3, 5], 
     'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'c', 'e' ], 
     'label': ['1', '2', '1', '2', '1', '2', '1', '2']}
df = pd.DataFrame(data=d)
df

enter image description here

And assuming that you want:

  • detect as the incorrect case rows 4 and 7 as those have the same values in col1 and col2 but different labels
  • detect as the correct case rows 2 and 6 because those have the same values in col1, col2 and label
  • detect as the correct cases remining rows because those have the different values in col1, col2 and label

I will suggest to:

Group by columns 0:2 (in your case 12:850) and create the set of the labels (only unique values will appear):

aggregated = df.groupby(list(df.columns[0:2]))['label'].apply(set).reset_index(name='label')
aggregated

enter image description here

and filter out for cases where set size is greater than one:

aggregated[aggregated.label.str.len() > 1]

enter image description here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 annabednarska