How to create a list of all elements present in a single cell of a dataframe?
Let's say I have a dataframe where the column NAME contains the string "s a c h i n".
Now I want a list of the elements present in the column NAME, like this:
['s', 'a', 'c', 'h', 'i', 'n']
How can I do this in PySpark? Doing this
LIST = df.select('NAME').rdd.flatMap(lambda x: x ).collect()
print(LIST)
yields
['s a c h i n']
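(For reference, flatMap(lambda x: x) only flattens the Row objects into their column values, so the one cell comes back as a single unsplit string. A minimal sketch of splitting it on the driver side instead, assuming the NAME column really holds the space-separated string shown above:
# Collect the single cell and split it in plain Python (illustration only).
name_value = df.select('NAME').first()['NAME']  # 's a c h i n'
LIST = name_value.split(' ')                    # ['s', 'a', 'c', 'h', 'i', 'n']
print(LIST)
)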
Solution 1:[1]
You should be able to just split on the space.
from pyspark.sql.functions import split
df = spark.createDataFrame([(1, "s a c h i n")],["id", "name"])
df.withColumn('split_name', split('name', ' ')).show()
Output
+---+-----------+------------------+
| id| name| split_name|
+---+-----------+------------------+
| 1|s a c h i n|[s, a, c, h, i, n]|
+---+-----------+------------------+
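If the goal is the plain Python list from the question rather than an array column, one way is to collect the split value back to the driver. A minimal sketch, assuming the single-row dataframe created above:
# Split into an array column, then pull the single cell back as a Python list.
row = df.withColumn('split_name', split('name', ' ')).select('split_name').first()
name_list = row['split_name']  # ['s', 'a', 'c', 'h', 'i', 'n']
print(name_list)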
Solution 2:[2]
I believe this covers the requirements you mentioned.
from pyspark.sql.functions import split, concat_ws, regexp_replace, col, udf
from pyspark.sql.types import ArrayType, StringType

# UDF for the second requirement: join the first four elements into one string
# and use it to replace the first element, keeping the rest of the list as-is.
def custom_replace(col_values):
    first_four = ''.join(col_values[:4])
    col_values = col_values[1:]
    col_values.insert(0, first_four)
    return col_values

# The function returns a list of strings, so declare an array return type.
custom_replaceUDF = udf(custom_replace, ArrayType(StringType()))

# Creating the dataframe
df = spark.createDataFrame([(1, "s a c h i n")], ["id", "name"])

# Required transformations
df1 = df.withColumn('Splited_name', split('name', ' ')) \
    .withColumn('Back_to_orig_name', concat_ws(" ", "Splited_name")) \
    .withColumn('reaplced_name_hard_code', regexp_replace('Back_to_orig_name', 's', 'sach')) \
    .withColumn("Replaced_Name", custom_replaceUDF(col("Splited_name")))
df1.show(truncate=False)
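For the single-row example above, the new columns can also be pulled back and inspected on the driver; tracing the code by hand suggests values like the ones in the comments below (an assumption, not captured output):
row = df1.first()
print(row['Splited_name'])             # ['s', 'a', 'c', 'h', 'i', 'n']
print(row['Back_to_orig_name'])        # 's a c h i n'
print(row['reaplced_name_hard_code'])  # 'sach a c h i n'
print(row['Replaced_Name'])            # ['sach', 'a', 'c', 'h', 'i', 'n']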
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Chris |
| Solution 2 | Python Learner |