'PySpark: How do I replace Null values based on Datatype of the column?
I have a table, which contains 257 columns. I am trying to figure out how to change null values which are in the table based on the data type. I am trying to do this in PySpark. And the dataframe name is called df.
So for example, if a column contains a date it's a timestamp data type and the nulls within that column need to be 1900-01-01
If it is any integer or float type it needs to be a 0
If it's a string type it should be a n/a
Thank you
Solution 1:[1]
df.dtypes will give you a declaration of each column. Something like this:
df = spark.createDataFrame([
('str', '2022-01-01', 1, 2.2, False),
(None, None, None, None, None),
], ['text','date','int','float', 'bool'])
df = df.withColumn('date', F.col('date').cast('date'))
df.show()
df.printSchema()
print(df.dtypes)
+----+----------+----+-----+-----+
|text| date| int|float| bool|
+----+----------+----+-----+-----+
| str|2022-01-01| 1| 2.2|false|
|null| null|null| null| null|
+----+----------+----+-----+-----+
root
|-- text: string (nullable = true)
|-- date: date (nullable = true)
|-- int: long (nullable = true)
|-- float: double (nullable = true)
|-- bool: boolean (nullable = true)
[('text', 'string'),
('date', 'date'),
('int', 'bigint'),
('float', 'double'),
('bool', 'boolean')]
With this information, you can dynamically cast any default value you want depends on its types
for t in df.dtypes:
if t[1] == 'string':
df = df.withColumn(t[0], F.coalesce(t[0], F.lit('n/a')))
elif t[1] == 'date':
df = df.withColumn(t[0], F.coalesce(t[0], F.lit('1900-01-01')))
elif t[1] == 'boolean':
df = df.withColumn(t[0], F.coalesce(t[0], F.lit(False)))
else:
df = df.withColumn(t[0], F.coalesce(t[0], F.lit(0)))
df.show()
+----+----------+---+-----+-----+
|text| date|int|float| bool|
+----+----------+---+-----+-----+
| str|2022-01-01| 1| 2.2|false|
| n/a|1900-01-01| 0| 0.0|false|
+----+----------+---+-----+-----+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |
