Any "null" strings in the CSV file should be replaced with the SQL value NULL in a PySpark DataFrame - Databricks
I have a DataFrame that contains "null" as a string, and I want to replace it with the SQL NULL value in a PySpark DataFrame on Databricks.
Could someone please help with that? I am relatively new to Spark.
Thank you.
Solution 1:[1]
Try this -
from pyspark.sql.functions import expr
df = df.withColumn('string_col', expr("CASE WHEN col = 'null' THEN NULL ELSE col END"))
This replaces the bad "null" strings with a SQL NULL and leaves every other value unchanged. Replace 'string_col' and 'col' with your actual column names; use the same name for both to overwrite the column in place.
Solution 2:[2]
You can try this:
>>> from pyspark.sql.functions import when
>>> data = (['A', 1], ['B', 2], ['null', 3])
>>> df = spark.createDataFrame(data, schema=['str', 'id'])
>>> df.show()
+----+---+
| str| id|
+----+---+
|   A|  1|
|   B|  2|
|null|  3|
+----+---+
>>> df.withColumn("new_col", when(df.str == "null", None).otherwise(df.str)).show()
+----+---+-------+
| str| id|new_col|
+----+---+-------+
|   A|  1|      A|
|   B|  2|      B|
|null|  3|   null|
+----+---+-------+
Solution 3:[3]
Extending @Hegde's answer above, I tried the .replace() API of PySpark (which also works on Databricks), and it works as follows:
newDF = DF.replace("null", None)
As a result, newDF contains a SQL NULL wherever the string "null" was present.
A tip:
To replace empty strings with SQL NULL in the DataFrame as well, use .replace("", None),
i.e. pass an empty pair of quotes as the value to replace and None (SQL NULL) as the replacement.
Happy Coding!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Assaf Segev |
| Solution 2 | Hegde |
| Solution 3 | Rishabh |
