Create new pyspark Dataframe or add multiple columns to an existing Dataframe by iterating over a column and applying a function
I'm new to pyspark so apologies if my knowledge/terminology is lacking.
I need to be able to iterate over a pyspark dataframe column similar to the example below.
+--------------------------+
|                    s3_url|
+--------------------------+
|s3://bucket/path/file_1.pb|
|s3://bucket/path/file_2.pb|
|s3://bucket/path/file_3.pb|
|s3://bucket/path/file_4.pb|
|s3://bucket/path/file_5.pb|
|                       ...|
+--------------------------+
On each row I need to pass the value into a python function that yields a nested dictionary. This dictionary has a structure similar to the example below.
{
    "id" : 1,
    "user_id" : 2,
    "sales_quarters" : {
        "q1_2020" : {
            "revenue" : 3464,
            "expenses" : 213,
            ...
        }
        ...
    }
}
With pandas, I would take the column values as a list, pass each value into the function, collect the results in another list, and then call pandas.DataFrame() on that list to write all the results into a single DataFrame, as shown below. However, I am unsure how to achieve a similar result using pyspark.
import pandas

s3_urls = [
    "s3://bucket/path/file_1.pb",
    "s3://bucket/path/file_2.pb",
    ...
]
# get_results() returns one nested dictionary per URL
s3_url_results = [
    get_results(url) for url in s3_urls
]
df = pandas.DataFrame(s3_url_results)
My desired result would be a pyspark dataframe similar to the example below.
+---+--------+---------------+----+---------------------------+
| id| user_id| sales_quarters| ...|                     s3_url|
+---+--------+---------------+----+---------------------------+
|  1|       2|        q1_2020|    | s3://bucket/path/file_1.pb|
|  2|      23|        q1_2020|    | s3://bucket/path/file_2.pb|
|  3|      35|        q1_2020|    | s3://bucket/path/file_3.pb|
|  4|      34|        q1_2020|    | s3://bucket/path/file_4.pb|
|  5|      40|        q1_2020|    | s3://bucket/path/file_5.pb|
...
+---+--------+---------------+----+---------------------------+
Any help/explanation would be greatly appreciated.
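One possible approach, offered as a minimal sketch rather than a definitive answer: map over the dataframe's underlying RDD, turn each nested dictionary into a JSON string, and let Spark infer the schema with spark.read.json. The helper names result_to_json and build_results_df are hypothetical, get_results stands for the function described in the question, and spark is assumed to be an active SparkSession. The sketch also assumes the dictionary is JSON-serializable and that get_results can be pickled and run on the executors.

```python
import json


def result_to_json(url, get_results):
    """Run get_results on one URL and return its nested dict as a JSON string,
    with the originating URL added under the key "s3_url"."""
    record = dict(get_results(url))  # shallow copy so the returned dict is not mutated
    record["s3_url"] = url
    return json.dumps(record)


def build_results_df(spark, urls_df, get_results):
    """Return a new DataFrame with one row per URL in urls_df.

    Assumes spark is an active SparkSession and urls_df has a string
    column named "s3_url", as in the question.
    """
    json_rdd = urls_df.rdd.map(
        lambda row: result_to_json(row["s3_url"], get_results)
    )
    # Schema inference scans the data before the actual read, so persisting
    # the RDD should avoid calling get_results twice per URL.
    json_rdd.cache()
    return spark.read.json(json_rdd)
```

Calling build_results_df(spark, df, get_results) should then give one row per URL. With this approach the nested sales_quarters dictionary would come back as a struct column, so individual values could be selected with a dotted path such as sales_quarters.q1_2020.revenue. If the schema is known up front, a UDF with an explicit struct return type is an alternative that skips the inference pass.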
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
