Filter out null values of a list of columns in PySpark

I have a list of columns like check_columns = ["col1", "col2"], and I want to keep only the rows where at least one of these columns is a non-null, non-empty array, with this rule:

import pyspark.sql.functions as F

df = df.filter(
    F.col(check_columns[0]).isNotNull() & (F.size(check_columns[0]) > 0) |
    F.col(check_columns[1]).isNotNull() & (F.size(check_columns[1]) > 0)
)

What if my list has more elements, like check_columns = ["col1", "col2", "col3", "col4"]?

How can I create this filter dynamically without creating new rows? One thing I find difficult is that the conditions must be combined with OR.

I have tried this code, but it does not work because str.join expects strings, not Column objects:

df = df.filter(
    (
        ' | '.join([F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns])
    )
)


Solution 1:[1]

You can use greatest over the per-column conditions and pass the result to filter. Since false < true, the greatest of the boolean conditions is true exactly when at least one of them holds, which gives the required OR:

Method 1:

df_filtered = df.filter(
    F.greatest(*[F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns])
)
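
A minimal check of that boolean behaviour (assuming spark is an active SparkSession):

import pyspark.sql.functions as F

# greatest over booleans acts as OR: false < true, so the result is
# true as soon as any argument is true
spark.range(1).select(
    F.greatest(F.lit(False), F.lit(True), F.lit(False)).alias("any_true")
).show()
# should print a single row with any_true = true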

Method 2 with reduce:

from functools import reduce

# build one boolean condition per column, then OR-fold them into a single Column
conditions = [F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns]
expr = reduce(lambda x, y: x | y, conditions)
df_filtered = df.filter(expr)
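
An equivalent spelling folds with operator.or_ instead of a lambda; for check_columns = ["col1", "col2", "col3"] the fold expands to (cond1 | cond2) | cond3. A sketch under the same assumptions:

import operator
from functools import reduce

import pyspark.sql.functions as F

# Column implements __or__, so operator.or_ combines the conditions with OR
conditions = [F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns]
expr = reduce(operator.or_, conditions)
df_filtered = df.filter(expr)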

Method 3 with join and expression:

cond = "(" + ") or (".join(f"{c} is not null and size({c}) > 0" for c in check_columns) + ")"
df_filtered = df.filter(F.expr(cond))
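
With check_columns = ["col1", "col2"], the generated condition string is:

(col1 is not null and size(col1) > 0) or (col2 is not null and size(col2) > 0)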

Quick test:

import pandas as pd

check_columns = ["A", "B"]

d = {'A': {0: [13, 1], 1: [], 2: [1, 2, 3], 3: [5, 6], 4: []},
     'B': {0: [9, 10], 1: None, 2: [1, 2, 3], 3: [5, 6], 4: []}}
df = spark.createDataFrame(pd.DataFrame(d))
df.show()
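
Expected output (null rendering and row order may vary by Spark version):

+---------+---------+
|        A|        B|
+---------+---------+
|  [13, 1]|  [9, 10]|
|       []|     null|
|[1, 2, 3]|[1, 2, 3]|
|   [5, 6]|   [5, 6]|
|       []|       []|
+---------+---------+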

def manual(data):
    return data.filter(
        F.col(check_columns[0]).isNotNull() & (F.size(check_columns[0]) > 0) |
        F.col(check_columns[1]).isNotNull() & (F.size(check_columns[1]) > 0)
    )

def loop(data):
    return data.filter(
        F.greatest(*[F.col(c).isNotNull() & (F.size(c) > 0) for c in check_columns])
    )

manual(df).subtract(loop(df)).show()
+---+---+
|  A|  B|
+---+---+
+---+---+

The subtraction returns an empty DataFrame, so both methods keep exactly the same rows.
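
For reference, the filtered result itself should keep only the rows where at least one of A and B is a non-null, non-empty array:

loop(df).show()
+---------+---------+
|        A|        B|
+---------+---------+
|  [13, 1]|  [9, 10]|
|[1, 2, 3]|[1, 2, 3]|
|   [5, 6]|   [5, 6]|
+---------+---------+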

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1