Filter out null values of a list of columns in PySpark
I have a list of columns like this: check_columns = ["col1", "col2"], and I want to filter out rows where the columns in check_columns are null or empty, using this rule:
df = df.filter(
F.col(check_columns[0]).isNotNull() & (F.size(check_columns[0]) > 0) |
F.col(check_columns[1]).isNotNull() & (F.size(check_columns[1]) > 0)
)
If my list has more elements, like check_columns = ["col1", "col2", "col3", "col4"],
how can I build this filter dynamically without writing out every condition? One thing I find difficult is that each filter condition must be joined with OR.
I have tried this code, but it does not work (str.join expects strings, whereas the list comprehension yields Column objects):
df = df.filter(
(
' | '.join([F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns])
)
)
Solution 1:[1]
You can use greatest with a list comprehension and then use it in filter:
Method 1:
df_filtered = df.filter(F.greatest(*[F.col(c).isNotNull() & (F.size(c) > 0)
for c in check_columns]))
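Why this works: each element of the comprehension is a boolean Column, and greatest returns the largest value per row; since true sorts above false, greatest over booleans acts as a row-wise OR. A minimal pure-Python sketch of the per-row logic (plain booleans stand in for the Column conditions):

```python
# Pure-Python analogue of F.greatest over boolean conditions:
# max() picks the largest value, and True > False, so the result
# is True exactly when at least one condition holds.
row_conditions = [False, True, False]  # e.g. col1 fails, col2 passes, col3 fails
keep_row = max(row_conditions)
print(keep_row)  # True
```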
Method 2, with reduce (build the per-column conditions first, then fold them with |; note that applying F.col inside the lambda would fail after the first step, because the accumulator is already a Column, not a column name):
from functools import reduce
expr = reduce(lambda x, y: x | y,
              [F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns])
df_filtered = df.filter(expr)
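To see how reduce folds the list into a single OR chain, here is a pure-Python sketch with plain booleans standing in for the Column expressions (the values are illustrative only):

```python
from functools import reduce

# Plain booleans stand in for the per-column conditions.
conditions = [False, False, True]

# reduce folds left to right: ((False | False) | True) -> True
combined = reduce(lambda x, y: x | y, conditions)
print(combined)  # True
```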
Method 3, with join and a SQL expression:
cond = "(" + ") or (".join(f"{c} is not null and size({c}) > 0" for c in check_columns) + ")"
df_filtered = df.filter(F.expr(cond))
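Since Method 3 only manipulates strings before handing them to F.expr, the generated SQL condition can be inspected without a Spark session. An equivalent string-building sketch for a hypothetical three-column list:

```python
check_columns = ["col1", "col2", "col3"]  # illustrative column names

# Build one "(cond) or (cond) or (cond)" string for F.expr.
cond = "(" + ") or (".join(
    f"{c} is not null and size({c}) > 0" for c in check_columns
) + ")"
print(cond)
# (col1 is not null and size(col1) > 0) or (col2 is not null and size(col2) > 0) or (col3 is not null and size(col3) > 0)
```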
Quick test (note that the test columns are named A and B, so check_columns must be set accordingly):
import pandas as pd

check_columns = ["A", "B"]
d = {'A': {0: [13, 1], 1: [], 2: [1, 2, 3], 3: [5, 6], 4: []},
     'B': {0: [9, 10], 1: None, 2: [1, 2, 3], 3: [5, 6], 4: []}}
df = spark.createDataFrame(pd.DataFrame(d))
df.show()
def manual(data):
return data.filter(
F.col(check_columns[0]).isNotNull() & (F.size(check_columns[0]) > 0) |
F.col(check_columns[1]).isNotNull() & (F.size(check_columns[1]) > 0)
)
def loop(data):
return data.filter(F.greatest(*[F.col(c).isNotNull() & (F.size(c) > 0)
for c in check_columns]))
manual(df).subtract(loop(df)).show()
+---+---+
| A| B|
+---+---+
+---+---+
We see no difference between the DataFrames returned by the two methods.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
