Replace a column value with NULL in PySpark
Solution 1:[1]
Test dataset:
df = spark.createDataFrame(
    [(10, '2021-08-16 00:54:43+01', 0.15, 'SMS'),
     (11, '2021-08-16 00:04:29+01', 0.15, '*'),
     (12, '2021-08-16 00:39:05+01', 0.15, '***')],
    ['_c0', 'Timestamp', 'Amount', 'Channel']
)
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |* |
# |12 |2021-08-16 00:39:05+01|0.15 |*** |
# +---+----------------------+------+-------+
Script:
from pyspark.sql import functions as F

# Keep Channel only when it contains no '*' or '#' characters;
# `when` without an `otherwise` clause returns null for the rest.
df = df.withColumn('Channel', F.when(~F.col('Channel').rlike(r'[\*#]+'), F.col('Channel')))
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |null |
# |12 |2021-08-16 00:39:05+01|0.15 |null |
# +---+----------------------+------+-------+
Solution 2:[2]
You have multiple options.
The first option is to use the when function to condition the replacement on each value you want to replace, as sketched below.
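A minimal sketch of that approach, assuming the same df and masked values ('*' and '***') as in Solution 1:
from pyspark.sql import functions as F

# Replace the masked values with null; keep everything else unchanged
df = df.withColumn(
    'Channel',
    F.when(F.col('Channel').isin('*', '***'), None).otherwise(F.col('Channel'))
)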
The second option is to use the replace function, which maps specific values to a replacement, including None.
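A sketch using DataFrame.replace, again assuming the masked values from the test data; a scalar None is used as the replacement for every listed value:
# Map '*' and '***' to null in the Channel column only
df = df.replace(['*', '***'], None, subset=['Channel'])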
The third option is to use regexp_replace. Note that regexp_replace substitutes a string rather than a SQL null, so you strip the unwanted characters first and then convert the resulting empty strings to null.
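A sketch of that two-step approach, assuming '*' and '#' are the masking characters:
from pyspark.sql import functions as F

# Strip the masking characters, then turn empty strings into null
df = df.withColumn('Channel', F.regexp_replace('Channel', r'[\*#]', ''))
df = df.withColumn('Channel', F.when(F.col('Channel') == '', None).otherwise(F.col('Channel')))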
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | korilium |

