Create a new PySpark DataFrame, or add multiple columns to an existing DataFrame, by iterating over a column and applying a function

I'm new to pyspark so apologies if my knowledge/terminology is lacking.

I need to be able to iterate over a PySpark DataFrame column similar to the example below.

+--------------------------+
|                    s3_url|
+--------------------------+
|s3://bucket/path/file_1.pb|
|s3://bucket/path/file_2.pb|
|s3://bucket/path/file_3.pb|
|s3://bucket/path/file_4.pb|
|s3://bucket/path/file_5.pb|
|...                       |
+--------------------------+

For each row, I need to pass the value into a Python function that yields a nested dictionary. This dictionary has a structure similar to the example below.

{
  "id": 1,
  "user_id": 2,
  "sales_quarters": {
    "q1_2020": {
      "revenue": 3464,
      "expenses": 213,
      ...
    },
    ...
  }
}
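
For concreteness, here is a hypothetical stub of that function (the real implementation parses the protobuf file at the given S3 URL; the field values here are made up for illustration):

# Hypothetical stub for illustration only; the real function would
# download and parse the protobuf file at the given S3 URL.
def get_results(s3_url: str) -> dict:
    return {
        "id": 1,
        "user_id": 2,
        "sales_quarters": {
            "q1_2020": {"revenue": 3464, "expenses": 213},
        },
    }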

With pandas, I would collect the column values into a list, pass each value into the function, gather the results into another list, and then call pd.DataFrame() to combine everything into a single DataFrame, as shown below. However, I am unsure how to achieve a similar result using PySpark.

import pandas

s3_urls = [
  "s3://bucket/path/file_1.pb",
  "s3://bucket/path/file_2.pb",
  ...
]
# Apply the function to every URL, collecting the resulting
# dictionaries into a list.
s3_url_results = [
  get_results(url) for url in s3_urls
]
# Build a single DataFrame from the list of dictionaries.
df = pandas.DataFrame(s3_url_results)

My desired result would be a PySpark DataFrame similar to the example below.

+---+--------+---------------+----+---------------------------+
| id| user_id| sales_quarters| ...|                     s3_url|
+---+--------+---------------+----+---------------------------+
|  1|       2|        q1_2020|    | s3://bucket/path/file_1.pb|
|  2|      23|        q1_2020|    | s3://bucket/path/file_2.pb|
|  3|      35|        q1_2020|    | s3://bucket/path/file_3.pb|
|  4|      34|        q1_2020|    | s3://bucket/path/file_4.pb|
|  5|      40|        q1_2020|    | s3://bucket/path/file_5.pb|
|...|        |               |    |                           |
+---+--------+---------------+----+---------------------------+
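
For context, here is a minimal sketch of the only direction I can think of: collecting the column to the driver and rebuilding a DataFrame from the results. I suspect this defeats the point of Spark, which is why I am asking. (get_results is my function from above, and df is the DataFrame holding the s3_url column.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pull the URL column back to the driver as a plain Python list.
urls = [row.s3_url for row in df.select("s3_url").collect()]

# Apply the function to each URL, keeping the URL alongside the result.
results = [{**get_results(url), "s3_url": url} for url in urls]

# Build a new DataFrame from the list of dictionaries.
result_df = spark.createDataFrame(results)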

Any help/explanation would be greatly appreciated.


