Is there an optimal way in pyspark to write the same dataframe to multiple locations?
I have a dataframe in pyspark and I want to write the same dataframe to two locations in AWS S3. Currently I have the following code running on AWS EMR.
# result is the name of the dataframe
result = result.repartition(repartition_value, 'col1').sortWithinPartitions('col1')
result.write.partitionBy("col2") \
    .mode("append") \
    .parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")
result.write.partitionBy("col2") \
    .mode("append") \
    .parquet(f"{OUTPUT_LOCATION_2}/processed_date={current_date_str}")
The inclusion of this additional write step has increased the runtime of the job significantly (almost double). Could it be that Spark's lazy evaluation runs the same steps twice?
I have tried caching the data beforehand with result.cache() and forcing an action afterwards, e.g. result.count(), but this hasn't provided any benefit.
What would be the most efficient way to do a double dataframe output write?
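For context, the persist-then-fan-out pattern the question describes would look roughly like the sketch below. The function names and the MEMORY_AND_DISK storage level are illustrative choices, not from the question; the pyspark import is deferred into the function so the path-building helper can be used on its own.

```python
def output_paths(loc1, loc2, event_end_date, current_date_str):
    # Build the two destination prefixes exactly as the snippet above does.
    return [
        f"{loc1}/end_date={event_end_date}",
        f"{loc2}/processed_date={current_date_str}",
    ]

def double_write(result, loc1, loc2, event_end_date, current_date_str):
    # Sketch only: `result` is assumed to be a pyspark DataFrame.
    from pyspark import StorageLevel  # deferred so the module parses without Spark

    result = result.persist(StorageLevel.MEMORY_AND_DISK)
    result.count()  # action: materialize the upstream plan exactly once
    for path in output_paths(loc1, loc2, event_end_date, current_date_str):
        result.write.partitionBy("col2").mode("append").parquet(path)
    result.unpersist()
```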
Solution 1:[1]
You could copy it inside S3, even across buckets, at 6+ MB/s.
There is nothing in the Hadoop filesystem APIs to do this, but if you can invoke the AWS CLI, you can use "aws s3 cp --recursive" to do the job. This does all its I/O within S3 and will be faster than any other mechanism.
See https://docs.aws.amazon.com/cli/latest/reference/s3/#directory-and-s3-prefix-operations
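Invoked from the driver after the first write completes, that could be sketched as follows. The helper names are hypothetical; it assumes the AWS CLI is on the PATH (it is by default on EMR nodes).

```python
import subprocess

def s3_copy_cmd(src_prefix, dst_prefix):
    # "aws s3 cp --recursive" copies object-by-object within S3 itself,
    # so none of the data flows back through the Spark cluster.
    return ["aws", "s3", "cp", src_prefix, dst_prefix, "--recursive"]

def copy_output(src_prefix, dst_prefix):
    # Run the copy and raise if the CLI reports a failure.
    subprocess.run(s3_copy_cmd(src_prefix, dst_prefix), check=True)
```

With this, the job writes the Parquet output once to OUTPUT_LOCATION_1 and then duplicates it to OUTPUT_LOCATION_2 with a single `copy_output(...)` call, instead of re-running the write stage.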
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | stevel |
