PySpark dataframe: get columns that have mostly null values, efficiently

I have a huge Spark DataFrame living on a cluster. A count shows 24 million rows, and there are 900+ columns.

Most of these columns are empty. I'd like to drop the columns that are mostly empty, or equivalently get a list of the columns that are not mostly empty.

I'm currently looping over columns:

for c in ALL_COLUMNS[1:]:  # renamed from `col` to avoid shadowing pyspark.sql.functions.col
    test_df = df.select(c)
    # isin(None) never matches actual nulls in Spark, so check isNull() explicitly
    NNcount = test_df.filter(~(test_df[c].isNull() | test_df[c].isin("NULL", ""))).count()
    # more logic ..

I then select the surviving columns afterwards. The problem is that each iteration of this loop takes minutes.

Is there a faster way to drop columns based on null counts? Preferably one that doesn't scan the entire DataFrame once per column, and is obviously more elegant than this.

Perhaps the answer is already out there but I'm failing to find the match after some searching. Thanks!



Solution 1:[1]

If you want a list of column names that are mostly (say, at least 20%) null, try this -

import pyspark.sql.functions as F

na_pct = 0.2
total = df.count()  # compute the row count once, not once per column
cols_to_drop = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / total >= na_pct]

However, if you want to remove those columns and keep the rest, try -

total = df.count()
cols_to_keep = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / total < na_pct]
df = df.select(*cols_to_keep)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
