Removing punctuation from a DataFrame does not work in PySpark
I am new to Spark and text processing. Could someone give me some suggestions on the following issue? I want to remove all punctuation from a column of a DataFrame. I saw some posts online on this topic, but I still cannot figure out why my code does not work. If I remove a single punctuation mark, for example a period, it seems to work:
from pyspark.sql.functions import udf
commaRep = udf(lambda x: x.replace('.', ' '))
df = df.withColumn('RD', commaRep('DELAY_REASON'))
df.display()
Before: Late inbound FA crew from F_29. DD
After: Late inbound FA crew from F_29 DD
However, if I loop over all the punctuation marks that I want to remove:
from pyspark.sql.functions import udf
punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
for ch in punc:
    commaRep = udf(lambda x: x.replace(ch, ' '))
    df = df.withColumn('RD', commaRep('DELAY_REASON'))
df.display()
Then none of the punctuation gets removed. For example, a string like "Ramp headset not working. Had to get a new one. sh" remains unchanged. I wonder what is wrong with the loop.
Thanks for any help! Daisy
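(Editor's note, not part of the original post: the loop likely fails for two reasons. First, Python closures capture the *variable* ch, not its value, and Spark evaluates UDFs lazily, so by the time display() runs the loop has finished and every lambda replaces only the final character '~'. Second, each iteration recomputes 'RD' from the original 'DELAY_REASON' column, discarding the previous replacements. The late-binding part can be demonstrated without Spark; binding the value as a default argument is one common fix:)

from pyspark.sql.functions import udf  # not needed for the plain-Python sketch below

# Plain-Python sketch of the late-binding problem (no Spark needed).
punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Every lambda shares the same ch; when they are finally called
# (Spark would call them lazily, at display time), ch is '~',
# so each function replaces only '~'.
broken = [lambda x: x.replace(ch, ' ') for ch in punc]
print(broken[0]('F_29. DD'))   # the "." function no longer removes "."

# Fix: freeze the current value per iteration with a default argument,
# and chain the result instead of restarting from the source string.
fixed = [lambda x, c=ch: x.replace(c, ' ') for ch in punc]
text = 'F_29. DD'
for f in fixed:
    text = f(text)
print(text)                    # all punctuation replaced by spaces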
Solution 1:[1]
Actually, you don't need a UDF. You can use Spark's built-in regexp_replace:
from pyspark.sql.types import StringType
from pyspark.sql.functions import regexp_replace
df = spark.createDataFrame(
    ["a:;,|/dlk\\", "jnh'.-lk", "ldkc!o?@"],
    StringType()
).toDF("text")
df.show()

df.withColumn("no_punctuation", regexp_replace(
    "text",
    r"""[!\"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]""",
    " "
)).show()
I had to escape a few characters (such as -, /, [ and ]) so that they are handled correctly inside the regex character class.
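(Editor's note, not from the original answer: the character class can also be generated instead of hand-escaped, using Python's re.escape on string.punctuation. The sketch below uses plain Python's re.sub; the same pattern string could plausibly be passed to regexp_replace, assuming the escapes it emits are acceptable to Spark's Java regex engine.)

import re
import string

# Build the character class programmatically; re.escape inserts backslashes
# for characters that are special inside a regex (e.g. ], \, ^, -).
pattern = "[" + re.escape(string.punctuation) + "]"

cleaned = re.sub(pattern, " ", "Ramp headset not working. Had to get a new one. sh")
print(cleaned)   # every punctuation mark replaced by a space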
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | leleogere |
