Can't I change a string containing "+" to another string in PySpark?
I want to change "75+ years" to "old age" in the age column.
from pyspark.sql.functions import regexp_replace
columns = ["age", "any"]
data = [("middle age", "male"), ("75+ years", "female"), ("10-20 years", "male")]
dfFromData = spark.createDataFrame(data).toDF(*columns)
test.show()
test.withColumn("age", regexp_replace("age", "75+ years", "old age")).show()
I can replace a string containing "-" (such as "10-20 years") without any problem, but the same method does not work on a string containing "+" (such as "75+ years").
What is the difference between those two?
Solution 1:[1]
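Try using the replace() function instead of withColumn() with regexp_replace(). It compares values literally rather than as a regular expression, so "+" needs no escaping: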
dfFromData.replace("75+ years", "old age").show()
This will replace every "75+ years" value in your dataframe with "old age".
+-----------+------+
|        age|   any|
+-----------+------+
| middle age|  male|
|    old age|female|
|10-20 years|  male|
+-----------+------+
Pass subset if you want to apply the replacement to a specific column only:
dfFromData.replace("75+ years", "old age", subset='age').show()
By the way, there is a small typo in your code: you define dfFromData, but then you call show() and withColumn() on a different object called test.
Hope it helps!
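EDIT: To answer your specific question about why regexp_replace doesn't work with the "+" symbol: "+" is a special character in regular expressions. It is a quantifier meaning "one or more of the preceding element", so the pattern "75+ years" matches "75 years" or "755 years" but never a literal "+". (Spark's regexp_replace uses Java regular expressions rather than Python's, but "+" means the same thing in both.) "-" has no special meaning outside a character class, which is why "10-20 years" worked. If you want to use a regexp with "+", escape it as "\+" and write the pattern as a raw string (the r prefix) so that Python leaves the backslash alone, as follows:
dfFromData.withColumn("age", regexp_replace("age", r"75\+ years", "old age")).show()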
It will give you the desired result.
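If you want to check the escaping without starting a Spark session, here is a minimal sketch using Python's built-in re module. It is a stand-in, not Spark itself: regexp_replace uses Java regular expressions, and the assumption here is only that "+" and "\+" behave the same way in both engines for this simple pattern. The variable names and the "youth" replacement are made up for illustration.

```python
import re

value = "75+ years"

# Unescaped: "5+" means "one or more 5s", so the pattern expects
# text like "75 years" or "755 years", never a literal "+".
unescaped = re.sub("75+ years", "old age", value)

# Escaped: "\+" matches a literal plus sign. The raw string (r"...")
# keeps Python from treating the backslash as a string escape.
escaped = re.sub(r"75\+ years", "old age", value)

# "-" is only special inside a character class such as [0-9],
# so "10-20 years" works as a pattern without any escaping.
hyphen = re.sub("10-20 years", "youth", "10-20 years")

print(unescaped)  # 75+ years (unchanged)
print(escaped)    # old age
print(hyphen)     # youth
```

The first call leaves the value untouched, which is the same symptom as in the question: no error is raised, the pattern simply never matches.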
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |