Replace "null" strings from a CSV file with the SQL value NULL in a PySpark DataFrame - Databricks

I have a DataFrame that contains "null" as a string, and I want to replace it with the SQL value NULL in a PySpark DataFrame on Databricks.

Could someone please help with that? I am relatively new to Spark.

Thank you.



Solution 1:[1]

Try this:

from pyspark.sql.functions import expr
# Write SQL NULL wherever col holds the literal string 'null'; keep every other value as-is
df = df.withColumn('string_col', expr("CASE WHEN col = 'null' THEN NULL ELSE col END"))

This replaces the bad strings with SQL NULL and leaves every other value unchanged. Replace 'string_col' and 'col' with your actual column names (use the same name for both to overwrite the column in place).
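If several columns need the same treatment, the CASE expression above can be generated per column instead of typed by hand. This is only a sketch: the helper names null_if_marker and clean_string_columns are made up for illustration, and it assumes column names contain no backticks and the marker contains no single quotes.

```python
def null_if_marker(column, marker="null"):
    """Build a Spark SQL expression that turns `marker` into NULL for one column."""
    return (
        f"CASE WHEN `{column}` = '{marker}' THEN NULL "
        f"ELSE `{column}` END AS `{column}`"
    )


def clean_string_columns(df, marker="null"):
    """Apply the CASE expression to every string column; pass other columns through."""
    exprs = [
        null_if_marker(name, marker) if dtype == "string" else f"`{name}`"
        for name, dtype in df.dtypes  # df.dtypes is a list of (name, type) pairs
    ]
    # selectExpr accepts SQL expression strings, so no extra import is needed
    return df.selectExpr(*exprs)
```

Usage would be df = clean_string_columns(df). Non-string columns are selected unchanged and the column order is preserved.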

Solution 2:[2]

You can try this:

>>> from pyspark.sql.functions import when
>>> data = (['A', 1], ['B', 2], ['null', 3])
>>> df = spark.createDataFrame(data, schema=['str', 'id'])
>>> df.show()
+----+---+
| str| id|
+----+---+
|   A|  1|
|   B|  2|
|null|  3|
+----+---+

>>> df.withColumn("new_col", when(df.str == "null", None).otherwise(df.str)).show()
+----+---+-------+
| str| id|new_col|
+----+---+-------+
|   A|  1|      A|
|   B|  2|      B|
|null|  3|   null|
+----+---+-------+

Solution 3:[3]

Extending @Hegde's answer above, I tried the .replace() API of PySpark (it also works on Databricks). It works like this:

newDF = DF.replace("null", None)

As a result, newDF contains a SQL NULL in every place where "null" (as a string) was present.

Tip:
To replace empty strings with SQL NULL in the DataFrame as well, try:
.replace("", None)
i.e. pass an empty pair of quotes as the value to be replaced with None (SQL NULL).
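Both replacements can also be done in one call, since .replace() accepts a list of values to replace and an optional subset of columns. A minimal sketch, where the helper name nullify_markers and its defaults are my own choices and not part of PySpark:

```python
def nullify_markers(df, markers=("null", ""), subset=None):
    """Replace each marker string with SQL NULL via DataFrame.replace.

    subset=None considers all columns; a list of names restricts the
    replacement to those columns.
    """
    # A scalar value (here None) is used as the replacement for every marker
    return df.replace(list(markers), None, subset=subset)
```

For example, newDF = nullify_markers(DF, subset=["str"]) would only touch the str column. Since the markers are strings, only string columns are affected.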

Happy Coding!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Assaf Segev
Solution 2 Hegde
Solution 3 Rishabh