Can't I change a string containing "+" to another string in PySpark?

I want to change "75+ years" to "old age" in the age column.

columns = ["age", "any"]
data = [("middle age", "male"), ("75+ years", "female"), ("10-20 years", "male")]
dfFromData = spark.createDataFrame(data).toDF(*columns)
test.show()

test.withColumn("age", regexp_replace("age", "75+ years", "old age")).show()

I can replace a string containing "-" this way ("10-20 years"), but the same method does not change a string containing "+" ("75+ years").

What is the difference between those two?



Solution 1:[1]

Try using the DataFrame replace() method instead of withColumn() with regexp_replace():

dfFromData.replace("75+ years", "old age").show()

This replaces every value equal to "75+ years" in your DataFrame with "old age". It compares whole values literally, so no regex escaping is needed.

+-----------+------+
|        age|   any|
+-----------+------+
| middle age|  male|
|    old age|female|
|10-20 years|  male|
+-----------+------+

Add a subset if you want to apply the replace on a specific column:

dfFromData.replace("75+ years", "old age", subset='age').show()

By the way, there is a small typo in your code: you define dfFromData but then operate on another object called test.

Hope it helps!

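EDIT: To answer your specific question about why regexp_replace doesn't work with the '+' symbol: '+' is a special character in regular expressions (see the regex documentation for more details). It is a quantifier meaning "one or more of the preceding character", so the pattern "75+ years" matches "75 years" or "755 years" but not the literal text "75+ years". The '-' in "10-20 years" has no special meaning outside a character class, which is why that replacement worked. Note that Spark's regexp_replace uses Java regular expressions rather than Python's, but '+' behaves the same in both. If you want to match a literal '+', escape it as '\+' and use a raw string so Python leaves the backslash alone:

from pyspark.sql.functions import regexp_replace
dfFromData.withColumn("age", regexp_replace("age", r"75\+ years", "old age")).show()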

It will give you the desired result.
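
If you want to see the difference without starting Spark, here is a minimal sketch using Python's built-in re module. This is an assumption-laden stand-in: Spark evaluates the pattern with Java's regex engine, not Python's, but for this simple pattern the two should agree. The value "755 years" is not in your data; it is a hypothetical extra added only to show what the unescaped pattern actually matches.

```python
import re

# Values from the question, plus one hypothetical value ("755 years")
# added only to show what the unescaped pattern really matches.
ages = ["middle age", "75+ years", "10-20 years", "755 years"]

unescaped = "75+ years"   # '+' is a quantifier: "7", one or more "5", then " years"
escaped = r"75\+ years"   # '\+' matches a literal plus sign

# Apply each pattern to every value, the way regexp_replace would per row.
unescaped_result = [re.sub(unescaped, "old age", age) for age in ages]
escaped_result = [re.sub(escaped, "old age", age) for age in ages]

for age, u, e in zip(ages, unescaped_result, escaped_result):
    print(f"{age!r}: unescaped -> {u!r}, escaped -> {e!r}")
```

Only the escaped pattern changes "75+ years". The unescaped one leaves it alone and rewrites "755 years" instead.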

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1