Apply a normal Python function to a PySpark DataFrame
import re

def cleanTweets(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove the mentions
    text = re.sub(r'#', '', text)  # remove the #
    text = re.sub(r'RT[\s]+', '', text)  # remove the RT
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlinks
    return text
I tried the following, but it does not work:
tweets_df_cleaned = tweets_df.withColumn('Tweets',col(udf(cleanTweets(Text))))
How can I apply this function to tweets_df, which has a column Text that I want to clean? In pandas this can be done with apply.
Solution 1:[1]
Solution 2:[2]
Use a pandas UDF (user-defined function). Check your Spark version first, because this solution targets Spark 3.x.
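import re
import pandas as pd
from pyspark.sql.functions import pandas_udf

def cleanTweets(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove the mentions
    text = re.sub(r'#', '', text)  # remove the #
    text = re.sub(r'RT[\s]+', '', text)  # remove the RT
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlinks
    return text

# A pandas UDF receives a whole pandas Series per batch, so apply the
# per-string function to each element. Null values would need separate handling.
@pandas_udf("string")
def clean_tweets_udf(texts: pd.Series) -> pd.Series:
    return texts.apply(cleanTweets)

tweets_df_cleaned = tweets_df.withColumn("Tweets", clean_tweets_udf("Text"))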
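Before wiring the function into Spark, it may help to check the cleaning logic on a plain Python string, since that needs no cluster. The following is only a minimal sketch: the sample tweet is made up, and clean_or_none is a hypothetical helper (not part of the original answer) for null values, which typically reach Python as None and would make re.sub raise a TypeError.

```python
import re


def cleanTweets(text):
    """Apply the same four substitutions to a single plain string."""
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # remove the mentions
    text = re.sub(r'#', '', text)               # remove the # sign, keep the word
    text = re.sub(r'RT[\s]+', '', text)         # remove the RT marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # remove hyperlinks
    return text


def clean_or_none(text):
    # Hypothetical null-safe wrapper: pass None through instead of failing.
    return None if text is None else cleanTweets(text)


# Made-up sample tweet, for illustration only.
sample = "RT @user1 loving #pyspark https://example.com/x"
cleaned = cleanTweets(sample)
print(repr(cleaned))
```

The substitutions can leave stray whitespace behind, so a final strip() may be worth adding if that matters for your data.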
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ??????? ??????? |
| Solution 2 | |
