'Replace a column value with NULL in PySpark

How to replace incorrect column values (values with characters like * or #) with null?

Screenshot here!



Solution 1:[1]

Test dataset:

df = spark.createDataFrame(
    [(10, '2021-08-16 00:54:43+01', 0.15, 'SMS'),
     (11, '2021-08-16 00:04:29+01', 0.15, '*'),
     (12, '2021-08-16 00:39:05+01', 0.15, '***')],
    ['_c0', 'Timestamp', 'Amount','Channel']
)
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp             |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15  |SMS    |
# |11 |2021-08-16 00:04:29+01|0.15  |*      |
# |12 |2021-08-16 00:39:05+01|0.15  |***    |
# +---+----------------------+------+-------+

Script:

from pyspark.sql import functions as F

df = df.withColumn('Channel', F.when(~F.col('Channel').rlike(r'[\*#]+'), F.col('Channel')))

df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp             |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15  |SMS    |
# |11 |2021-08-16 00:04:29+01|0.15  |null   |
# |12 |2021-08-16 00:39:05+01|0.15  |null   |
# +---+----------------------+------+-------+

Solution 2:[2]

So You have multiple choices:

First option is the use the when function to condition the replacement for each character you want to replace:

example: when function

Second option is to use the replace function.

example: replace function

third option is to use regex_replace to replace all the characters with null value

example: regex_replace function

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 korilium