Coalescing rows in pyspark with string operation

I am looking to coalesce duplicate rows of a pyspark dataframe from this:

[image: input dataframe with duplicate rows]

to this:

[image: desired dataframe with coalesced rows]

I need a period after each sentence in the coalesced rows. I tried coalesce() and collect_set(), but I can't perform the string operation within the collected window/group.



Solution 1:[1]

Another way:

from pyspark.sql.functions import collect_list, concat, array_join, lit

df = (df2.groupby('ID', 'fname', 'lname', 'Score')
         .agg(collect_list('Activity').alias('Activity'))  # collect Activity values into an array
         .withColumn('Activity', concat(array_join('Activity', '.'), lit('.')))  # join the array with dots, then append a final dot
     )
df.show(truncate=False)
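For reference, the string step here (join the collected sentences with periods, then append a trailing period) can be sketched in plain Python; the sample sentences below are made up, standing in for whatever collect_list('Activity') returns for one group:

```python
# Plain-Python sketch of Solution 1's string logic, with hypothetical sample data.
# Spark's array_join('Activity', '.') followed by concat(..., lit('.')) is
# equivalent to joining the list with '.' and appending one more '.'.
collected = ["Ran a race", "Swam a lap"]  # stand-in for a collected Activity array

activity = ".".join(collected) + "."
print(activity)  # -> Ran a race.Swam a lap.
```

Note that array_join with '.' inserts no space after the dot; use '. ' as the separator if you want a space between sentences.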

Solution 2:[2]

You can use concat_ws with collect_list. Note that this won't guarantee the order of the sentences, since Spark tables are unordered; if the order matters, you need a sequential column to sort by.

from pyspark.sql import functions as F

(df.groupBy('ID')
 .agg(*[F.first(x).alias(x) for x in ['fname', 'lname', 'Score']],  # non-grouped columns need an aggregate
      F.concat_ws(' ', F.collect_list(F.col('Activity'))).alias('Activity'))
)
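To illustrate the "order upon a sequential column" point, here is a plain-Python sketch with hypothetical (seq, activity) rows; in Spark you would achieve the same by collecting over a Window ordered by the sequence column:

```python
# Hypothetical rows for one ID group: (seq, activity).
rows = [(2, "Swam a lap."), (1, "Ran a race.")]

# Sort by the sequence column, then join with spaces, mirroring concat_ws(' ', ...).
ordered = [activity for seq, activity in sorted(rows)]
result = " ".join(ordered)
print(result)  # -> Ran a race. Swam a lap.
```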

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: wwnde
Solution 2: Emma