Remove the last few characters from a PySpark DataFrame column
I have a PySpark DataFrame. How can I chop off/remove the last 5 characters from the name column below?
from pyspark.sql.functions import substring, length, col
valuesCol = [('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)]
df = sqlContext.createDataFrame(valuesCol,['name'])
df.show()
+--------------+
| name|
+--------------+
| rose_2012|
| jasmine_2013|
| lily_2014|
| daffodil_2017|
|sunflower_2016|
+--------------+
I want to create 2 columns, the flower and year column.
Expected output:
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
I have already created the year column:
df = df.withColumn("year", substring(col("name"),-4,4))
df.show()
+--------------+----+
| name|year|
+--------------+----+
| rose_2012|2012|
| jasmine_2013|2013|
| lily_2014|2014|
| daffodil_2017|2017|
|sunflower_2016|2016|
+--------------+----+
I don't know how to chop off the last 5 characters so that I am left with only the flower name. I tried the following, invoking length, but it doesn't work because substring expects plain integer arguments, not a Column:
df = df.withColumn("flower",substring(col("name"),0,length(col("name"))-5))
How can I create flower column with only flower names?
Solution 1:[1]
You can use the expr function, which evaluates a SQL expression so that length(name) is computed per row:
>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
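The per-row string arithmetic that the SQL expression performs can be sanity-checked in plain Python (a local sketch, not Spark code): substring(name, 1, length(name)-5) keeps everything except the trailing "_YYYY" suffix of 5 characters.

```python
# Plain-Python equivalent of substring(name, 1, length(name)-5):
# drop the last 5 characters ("_" plus a 4-digit year).
names = ["rose_2012", "jasmine_2013", "lily_2014",
         "daffodil_2017", "sunflower_2016"]

flowers = [name[:len(name) - 5] for name in names]
print(flowers)  # ['rose', 'jasmine', 'lily', 'daffodil', 'sunflower']
```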
Solution 2:[2]
You can use the split function. This code does what you want:
import pyspark.sql.functions as f
newDF = df.withColumn("year", f.split(df['name'], '_')[1]).\
    withColumn("flower", f.split(df['name'], '_')[0])
newDF.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
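The split-on-underscore logic can be checked locally with Python's own str.split (a sketch of the per-row behaviour, not Spark code): element 0 of the resulting array is the flower and element 1 is the year, which is exactly what the two getItem-style index lookups above select.

```python
# Local equivalent of f.split(name, '_')[0] and [1].
names = ["rose_2012", "jasmine_2013", "lily_2014",
         "daffodil_2017", "sunflower_2016"]

flowers = [n.split("_")[0] for n in names]
years = [n.split("_")[1] for n in names]
print(flowers)  # ['rose', 'jasmine', 'lily', 'daffodil', 'sunflower']
print(years)    # ['2012', '2013', '2014', '2017', '2016']
```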
Solution 3:[3]
Since we want to extract only alphabetical characters in this case, a regular expression will also work.
from pyspark.sql.functions import regexp_extract
df = df.withColumn("flower",regexp_extract(df['name'], '[a-zA-Z]+',0))
df.show()
+--------------+----+---------+
| name|year| flower|
+--------------+----+---------+
| rose_2012|2012| rose|
| jasmine_2013|2013| jasmine|
| lily_2014|2014| lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
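The pattern passed to regexp_extract can be verified locally with Python's re module (a sketch; Spark's regex dialect agrees with this for such a simple pattern). Group index 0 is the whole match, i.e. the leading run of letters; a capture-group variant could pull out both fields in one pattern.

```python
import re

names = ["rose_2012", "jasmine_2013", "lily_2014",
         "daffodil_2017", "sunflower_2016"]

# Same pattern and group index as regexp_extract(name, '[a-zA-Z]+', 0).
flowers = [re.search(r"[a-zA-Z]+", n).group(0) for n in names]
print(flowers)  # ['rose', 'jasmine', 'lily', 'daffodil', 'sunflower']

# With capture groups, group 1 would be the flower and group 2 the year,
# e.g. regexp_extract(name, '([a-zA-Z]+)_(\\d+)', 1) in Spark.
m = re.match(r"([a-zA-Z]+)_(\d+)", "daffodil_2017")
print(m.group(1), m.group(2))  # daffodil 2017
```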
Solution 4:[4]
A little tweak to avoid hard-coding: identify the split point dynamically by locating the underscore ('_') with the instr function.
from pyspark.sql.functions import expr

df = spark.createDataFrame([('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)],['name'])
df.withColumn("flower", expr("substr(name, 1, instr(name,'_')-1)")).\
    withColumn("year", expr("substr(name, instr(name,'_')+1, length(name))")).show()
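The instr arithmetic is easy to get wrong because Spark SQL's instr is 1-based while Python's str.find is 0-based. A local sketch of what the two substr expressions compute, using find('_') + 1 as the stand-in for instr(name, '_'):

```python
name = "daffodil_2017"

pos = name.find("_") + 1   # 1-based position of '_', like instr(name, '_')
flower = name[:pos - 1]    # substr(name, 1, instr(name,'_')-1)
year = name[pos:]          # substr(name, instr(name,'_')+1, length(name))
print(flower, year)        # daffodil 2017
```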
Solution 5:[5]
Same idea as Solution 1; note that Spark SQL treats a starting position of 0 the same as 1, so this produces identical results:
>>> from pyspark.sql.functions import substring, length, expr
>>> df = df.withColumn("flower",expr("substring(name, 0, length(name)-5)"))
>>> df.show()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | cph_sto |
| Solution 4 | Shantanu Sharma |
| Solution 5 | |
