PySpark dataframe: get columns that have mostly null values, efficiently
I have a huge Spark dataframe living in a cluster. The count shows 24 million rows, and there are also 900+ columns.
Most of these columns are empty. I'm thinking of dropping the columns that are mostly empty, or getting a list of the columns that are not mostly empty.
I'm currently looping over columns:
for col in ALL_COLUMNS[1:]:
    test_df = df.select(col)
    NNcount = test_df.filter(test_df[col].isin(["NULL", "", None]) == False).count()
    # more logic ...
and then selecting the remaining columns afterwards. The problem is that each iteration of this loop takes minutes.
Is there a faster way to drop columns based on nulls? Preferably one that doesn't need to loop over the entire column, and is more elegant than this.
Perhaps the answer is already out there, but I'm failing to find a match after some searching. Thanks!
Solution 1:[1]
If you want to get a list of column names that are mostly (say, at least 20%) null, try this -
from pyspark.sql import functions as F

na_pct = 0.2
cols_to_drop = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / df.count() >= na_pct]
However, if you want to remove those columns and keep the rest, then try -
cols_to_keep = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / df.count() < na_pct]
df = df.select(*cols_to_keep)
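Note that this still launches one count job per column, which can be slow with 900+ columns. If that is still too expensive, a common alternative (shown here only as a minimal sketch, reusing the df and na_pct names from above) is to compute every column's null count in a single aggregation pass and do the thresholding on the driver:

from pyspark.sql import functions as F

total = df.count()

# One job that returns the null count of every column at once:
# F.when(...) yields the column name only for null rows, and F.count ignores nulls.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()

# If blank strings should also count as empty (as in the question), extend the
# condition, e.g. F.when(F.col(c).isNull() | (F.col(c) == ""), c).
na_pct = 0.2
cols_to_keep = [c for c, n in null_counts.items() if n / total < na_pct]
df = df.select(*cols_to_keep)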
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
