Removing punctuation from dataframe does not work in PySpark

I am new to Spark and text processing. Could someone give me a suggestion on the following issue? I want to remove all the punctuation in a column of a dataframe. I saw some posts online related to this topic, but I still cannot figure out why my code does not work. If I remove a single punctuation character, for example the period, it seems to work:

from pyspark.sql.functions import udf
commaRep = udf(lambda x: x.replace('.', ' '))
df=df.withColumn('RD',commaRep('DELAY_REASON'))
df.display()

Before: Late inbound FA crew from F_29. DD
After: Late inbound FA crew from F_29 DD

However, if I loop over all the punctuation characters that I want to remove:

from pyspark.sql.functions import udf
punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
for ch in punc:
  commaRep = udf(lambda x: x.replace(ch, ' '))
  
df=df.withColumn('RD',commaRep('DELAY_REASON'))
df.display()

Then none of the punctuation gets removed. For example, a string like "Ramp headset not working. Had to get a new one. sh" remains the same. I wonder what is wrong with the loop.

Thanks for any help! Daisy



Solution 1:[1]

Actually you don't need a UDF. You can use Spark's built-in regexp_replace:

from pyspark.sql.types import StringType
from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame(
  ["a:;,|/dlk\\", "jnh'.-lk", "ldkc!o?@"],
  StringType()
).toDF("text")
df.show()

df.withColumn("no_punctuation", regexp_replace(
  "text",
  r"""[!\"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]""",
  " "
)).show()

I had to escape a few characters (like -, /, [, ]) so that they are handled correctly inside the regex character class.
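As for why the loop in the question removes nothing: each iteration only rebinds the commaRep variable, so after the loop finishes commaRep holds a UDF that replaces only the last character of punc (the tilde), and withColumn is called once with that UDF. If you did want a UDF approach, one UDF that strips everything in a single pass with str.translate works; here is a sketch (the plain-Python replacement function is shown, and strip_punctuation is a name chosen for illustration):

```python
# Sketch: remove all punctuation in one pass with str.translate,
# instead of re-creating a UDF per character in a loop.
import string

punc = string.punctuation  # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
# Map every punctuation character to a space.
table = str.maketrans(punc, ' ' * len(punc))

def strip_punctuation(s):
    # Guard against null values in the column.
    return s.translate(table) if s is not None else None

print(strip_punctuation("Ramp headset not working. Had to get a new one."))
# -> 'Ramp headset not working  Had to get a new one '
```

To use it in Spark, wrap it once: `udf(strip_punctuation)` and then `df.withColumn('RD', strip_punctuation_udf('DELAY_REASON'))` — but the built-in regexp_replace above avoids the Python serialization overhead of a UDF entirely.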

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 leleogere