Chaining logical operators - ValueError: The truth value of a Series is ambiguous

I have a dictionary dataframe_dict consisting of over 1000 dataframes (dataframe_dict.items()). Each dataframe represents data collected from one location (i.e. one dataframe per location), and every dataframe has the same data columns (key).

Each dataframe looks like this

import pandas as pd 
import numpy as np 
df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd')) 
df
          a         b         c         d
0  0.325799  0.731273  0.467031  0.177742
1  0.084133  0.271076  0.761092  0.067709
2  0.946860  0.606838  0.260437  0.094640
3  0.076870  0.450473  0.693679  0.760893

For each dataframe, I want to find out which column(s) have over 30% missing values, and store those columns in reject_list.

This is how I currently identify these columns

    reject_list =[]
    for key, item in dataframe_dict.items():
        if ((item[key].isnull().sum()) > (0.3*(len(item)))):
            reject_list.append(item[key])
            
            print('rejected due to more than 30% nulls: {}'.format(item[key]))
            
        item.dropna(inplace=True)
        item.reset_index(drop=True, inplace=True)

Python threw me this error on the logic

if ((item[key].isnull().sum()) > (0.3*(len(item)))):
File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 1535, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Looking at previous posts, I think I have created multiple Series in this code, to which a single Boolean test does not apply. How do I get this logic to work in the loop?



Solution 1:[1]

In your code:

  1. for key, item in dataframe_dict.items(): assigns to key the key of a dataframe_dict element and to item the corresponding dataframe.

  2. In the loop body, you use key as if it were the name of a column of the dataframe. Nothing guarantees that key is a column name, and you did not show how you build dataframe_dict.

It looks like the for statement in your code is not the one you intended, and that the loop over columns is missing: the correct statement could be something like for col in item.columns. In short, there seems to be a confusion about what key refers to.
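As a side note on the error message itself, here is a minimal sketch of how it can arise. It assumes that the selection returns several columns instead of one, which is only a guess since the construction of dataframe_dict is not shown. The frame demo and the other variable names are hypothetical and used for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only; not the asker's data.
demo = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                     "b": [1.0, 2.0, 3.0, np.nan]})

# Selecting ONE column gives a Series, so isnull().sum() is a single number
# and the comparison yields a single boolean.
single = demo["a"].isnull().sum() > 0.3 * len(demo)

# Selecting SEVERAL columns gives a DataFrame, so isnull().sum() is a Series
# with one count per column, and the comparison yields a boolean Series.
several = demo[["a", "b"]].isnull().sum() > 0.3 * len(demo)

error_message = None
try:
    if several:  # a Series has no single truth value
        pass
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Reducing explicitly with any() or all() removes the ambiguity.
any_rejected = several.any()
```

Whether any() or all() is the right reduction depends on what the condition is meant to express; for a per-column rejection test, checking each column separately is usually clearer.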

The code below tries to resolve the confusion.

One question is whether reject_list should be built from a single dataframe that merges all the dataframes in dataframe_dict, or separately for each element of dataframe_dict, as your code implies. In the code below, reject_list is built at the level of the dataframe_dict elements. Note that if the rejected columns are later removed, the dataframe_dict elements will probably no longer all have the same columns.

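reject_list = []
for key, item in dataframe_dict.items():
    # key is the dictionary key (the location); item is that location's dataframe
    for col in item.columns:
        # isnull().sum() on a single column is a scalar, so the comparison is a plain bool
        if item[col].isnull().sum() > 0.3 * len(item):
            reject_list.append((key, col))
            print(f"In dataframe '{key}', column '{col}' rejected due to more than 30% nulls.")

    # Drop rows with nulls only after every column has been checked,
    # otherwise the null counts of the remaining columns would change
    item.dropna(inplace=True)
    item.reset_index(drop=True, inplace=True)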
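The same per-column check can also be written without the inner loop, since isnull().mean() gives the fraction of nulls in every column at once. This is only a sketch: the two small dataframes and the keys loc_1 and loc_2 are hypothetical stand-ins for dataframe_dict, whose real content is not shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataframe_dict; the real one is not shown.
dataframe_dict = {
    "loc_1": pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                           "b": [1.0, 2.0, 3.0, 4.0]}),
    "loc_2": pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                           "b": [np.nan, np.nan, 3.0, 4.0]}),
}

reject_list = []
for key, item in dataframe_dict.items():
    # Fraction of nulls per column: a Series indexed by column name.
    null_fraction = item.isnull().mean()
    # Boolean indexing keeps only the columns above the 30% threshold.
    rejected_cols = null_fraction[null_fraction > 0.3].index
    reject_list.extend((key, col) for col in rejected_cols)

print(reject_list)
```

Here the comparison null_fraction > 0.3 is used as a mask rather than inside an if statement, so the boolean Series never has to be reduced to a single truth value.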

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1