PySpark: How do I replace Null values based on Datatype of the column?

I have a table which contains 257 columns. I am trying to figure out how to replace the null values in the table based on each column's data type. I am trying to do this in PySpark, and the DataFrame is called df.

So for example, if a column contains a date (a timestamp data type), the nulls within that column need to become 1900-01-01.

If it is any integer or float type, they need to become 0.

If it is a string type, they should become n/a.

Thank you



Solution 1:[1]

df.dtypes returns the name and data type of each column as a list of (name, type) tuples. For example:

from pyspark.sql import functions as F

df = spark.createDataFrame([
    ('str', '2022-01-01', 1, 2.2, False),
    (None, None, None, None, None),
], ['text', 'date', 'int', 'float', 'bool'])
df = df.withColumn('date', F.col('date').cast('date'))
df.show()
df.printSchema()
print(df.dtypes)

+----+----------+----+-----+-----+
|text|      date| int|float| bool|
+----+----------+----+-----+-----+
| str|2022-01-01|   1|  2.2|false|
|null|      null|null| null| null|
+----+----------+----+-----+-----+

root
 |-- text: string (nullable = true)
 |-- date: date (nullable = true)
 |-- int: long (nullable = true)
 |-- float: double (nullable = true)
 |-- bool: boolean (nullable = true)

[('text', 'string'),
 ('date', 'date'),
 ('int', 'bigint'),
 ('float', 'double'),
 ('bool', 'boolean')]
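With 257 columns, it can help to first group the column names by their dtype before deciding on defaults. A minimal pure-Python sketch using the dtypes list above (the list is hard-coded here for illustration; in practice it comes from df.dtypes):

```python
from collections import defaultdict

# Result of df.dtypes, hard-coded for illustration
dtypes = [('text', 'string'), ('date', 'date'), ('int', 'bigint'),
          ('float', 'double'), ('bool', 'boolean')]

# Group column names by their Spark dtype string
by_type = defaultdict(list)
for name, dtype in dtypes:
    by_type[dtype].append(name)
```

This gives you, e.g., by_type['string'] == ['text'], which is convenient when you want to apply one fill rule per type rather than per column.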

With this information, you can dynamically fill in any default value you want, depending on each column's type:

for t in df.dtypes:
    if t[1] == 'string':
        df = df.withColumn(t[0], F.coalesce(t[0], F.lit('n/a')))
    elif t[1] == 'date':
        df = df.withColumn(t[0], F.coalesce(t[0], F.lit('1900-01-01')))
    elif t[1] == 'boolean':
        df = df.withColumn(t[0], F.coalesce(t[0], F.lit(False)))
    else:
        df = df.withColumn(t[0], F.coalesce(t[0], F.lit(0)))
    
df.show()

+----+----------+---+-----+-----+
|text|      date|int|float| bool|
+----+----------+---+-----+-----+
| str|2022-01-01|  1|  2.2|false|
| n/a|1900-01-01|  0|  0.0|false|
+----+----------+---+-----+-----+
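The same dispatch can be wrapped in a reusable helper. This is a sketch, not part of the original answer: the function names default_for_dtype and fill_defaults are illustrative, and it assumes casting the default back to the column's own dtype is acceptable (which also lets it cover timestamp columns, not just date):

```python
def default_for_dtype(dtype: str):
    """Map a Spark dtype string (as returned by df.dtypes) to a fill default."""
    if dtype == 'string':
        return 'n/a'
    if dtype in ('date', 'timestamp'):
        return '1900-01-01'  # cast back to date/timestamp below
    if dtype == 'boolean':
        return False
    return 0  # covers int, bigint, float, double, decimal, ...

def fill_defaults(df):
    """Replace every null with a type-appropriate default."""
    # imported here so default_for_dtype above stays dependency-free
    from pyspark.sql import functions as F
    for name, dtype in df.dtypes:
        default = default_for_dtype(dtype)
        df = df.withColumn(name, F.coalesce(name, F.lit(default).cast(dtype)))
    return df
```

Calling df = fill_defaults(df) then applies the rule to all 257 columns in one pass.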

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pltc