How to create a list of all elements present in a single cell of a dataframe?

Let's say I have a dataframe:

[dataframe snapshot]

Now I want a list of the elements present in the column NAME,

like this:

['s', 'a', 'c', 'h', 'i', 'n']

How can we do this in PySpark?

Doing this:

LIST = df.select('NAME').rdd.flatMap(lambda x: x).collect()
print(LIST)

yields:

['s a c h i n']
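The flatMap here iterates over Rows, and each Row holds the entire NAME value as one string, which is why a single-element list comes back. A minimal sketch of one possible fix, assuming every NAME value is a space-separated string, is to split inside the flatMap:

# Assumed variant: split each NAME value on spaces so that collect()
# returns the individual elements instead of one combined string.
LIST = df.select('NAME').rdd.flatMap(lambda row: row[0].split(' ')).collect()
print(LIST)  # ['s', 'a', 'c', 'h', 'i', 'n']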


Solution 1:[1]

You should be able to just split on the space.

from pyspark.sql.functions import split
df = spark.createDataFrame([(1, "s a c h i n")],["id", "name"])
df.withColumn('split_name', split('name', ' ')).show()

Output

+---+-----------+------------------+
| id|       name|        split_name|
+---+-----------+------------------+
|  1|s a c h i n|[s, a, c, h, i, n]|
+---+-----------+------------------+
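If the goal is a plain Python list rather than an array column, one possible follow-up (assuming the same df as above) is to collect the split column and flatten it:

# Collect the split values into a flat Python list on the driver.
chars = df.withColumn('split_name', split('name', ' ')) \
    .select('split_name').rdd.flatMap(lambda row: row[0]).collect()
print(chars)  # ['s', 'a', 'c', 'h', 'i', 'n']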

Solution 2:[2]

I believe this covers the requirements you mentioned.

from pyspark.sql.functions import split, concat_ws, regexp_replace, col, udf
from pyspark.sql.types import ArrayType, StringType

# UDF for the second requirement: join the first four elements into one
# string and put that string in front of the remaining elements.
def custom_replace(col_values):
  first_four = col_values[:4]
  first_four = ''.join(first_four)
  col_values = col_values[1:]
  col_values.insert(0, first_four)
  return col_values

# The function returns a list of strings, so declare an array return type.
custom_replaceUDF = udf(custom_replace, ArrayType(StringType()))

# Creating the dataframe
df = spark.createDataFrame([(1, "s a c h i n")], ["id", "name"])

# Required transformations
df1 = df.withColumn('Splited_name', split('name', ' '))\
.withColumn('Back_to_orig_name', concat_ws(" ", "Splited_name"))\
.withColumn('reaplced_name_hard_code', regexp_replace('Back_to_orig_name', 's', 'sach'))\
.withColumn("Replaced_Name", custom_replaceUDF(col("Splited_name")))

df1.show(truncate=False)
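With the sample row, Splited_name should come out as [s, a, c, h, i, n], Back_to_orig_name should restore "s a c h i n", reaplced_name_hard_code should be "sach a c h i n", and Replaced_Name should be [sach, a, c, h, i, n].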

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Chris
Solution 2: Python Learner