PySpark DataFrame Null Value Logic Operation

In Python, None != 1 returns True. But why does "Null_column" != 1 behave as false in PySpark? Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

data = [(1, 5), (2, 5)]
columns = ["id", "test"]
df_null = spark.createDataFrame(data, columns)
df_null = df_null.withColumn("nul_val", lit(None))  # column that is NULL in every row
df_null.printSchema()
df_null.show()

(Output of df_null.printSchema() and df_null.show(): nul_val is NULL in both rows.)

But df_null.filter(df_null.nul_val != 1).count() returns 0.



Solution 1:[1]

Please check NULL Semantics - Spark 3.0.0 for how comparisons with NULL are handled in Spark.

To summarize: in Spark, NULL means undefined, so any comparison with NULL evaluates to undefined (NULL) rather than True or False, and such comparisons should be avoided to prevent unwanted results. In your case, the filter only keeps rows where the predicate is True; since NULL is not True, the count is 0.

Apache Spark supports the standard comparison operators such as ‘>’, ‘>=’, ‘=’, ‘<’ and ‘<=’. The result of these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL.
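
You can see this three-valued logic directly by selecting the comparison as a column instead of filtering on it (a minimal sketch against the DataFrame built above); the result is NULL, not False:

df_null.select(
    (df_null.nul_val != 1).alias("neq_1"),  # NULL != 1 evaluates to NULL
    (df_null.nul_val == 1).alias("eq_1")    # NULL == 1 evaluates to NULL
).show()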

If you want to compare against a column that might contain NULL, use the null-safe equality operator ‘<=>’: it returns False (instead of NULL) when exactly one operand is NULL, and True when both operands are NULL.
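
In the PySpark DataFrame API the same operator is exposed as Column.eqNullSafe. As a sketch, if you want every row whose nul_val is not 1, NULLs included, you can negate it:

df_null.filter(~df_null.nul_val.eqNullSafe(1)).count()  # NULL <=> 1 is False, so ~False keeps the NULL rows; returns 2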

So, back to your problem. To solve it, I would combine a null check with the comparison against 1:

df_null.filter((df_null.nul_val.isNull()) | (df_null.nul_val != 1)).count()
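With the sample data above, nul_val is NULL in both rows, so the null check matches everything and this returns 2.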

Another solution would be to replace NULL with 0, if that does not break any other logic:

df_null.fillna(value=0, subset=["nul_val"]).filter("nul_val != 1").count()  # fillna, not fill
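
One caveat, based on how fillna dispatches on column types (worth verifying on your Spark version): a column created with a bare lit(None) has the void/NullType, which is not numeric, so fillna(0) may skip it entirely. Casting the literal to a concrete type when creating the column avoids this:

df_null = df_null.withColumn("nul_val", lit(None).cast("int"))  # IntegerType instead of void
df_null.fillna(value=0, subset=["nul_val"]).filter("nul_val != 1").count()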

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Cleared