How to export DynamoDB to S3 as a single file?
I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using AWS Data Pipeline. The exported data will in turn be queried by a Spark job.
The problem is that whenever I set up a Data Pipeline to do this, the output in S3 consists of multiple partitioned files.
Is there a way to ensure that the entire table is exported as a single file in S3? If not, is there a way in Spark to read the partitioned files using the manifest and combine them into one for querying?
Solution 1:[1]
You have two options here (call the function on the DataFrame just before writing):
- `repartition(1)`
- `coalesce(1)`
As the Spark docs emphasize, `repartition` is the better choice in your case:
However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
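Here is a minimal PySpark sketch of the idea: read the partitioned export from S3, collapse it into one partition, and write it back as a single file. The bucket names and paths are hypothetical, and it assumes the export is in a format Spark can read directly (e.g. JSON); the standard Data Pipeline DynamoDB export format may require a different reader, and depending on your Spark/Hadoop setup you may need `s3a://` instead of `s3://`.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combine-dynamodb-export").getOrCreate()

# Read all partitioned export files produced by the pipeline run
# (hypothetical path; adjust the reader to match your export format).
df = spark.read.json("s3://my-bucket/dynamodb-export/")

# repartition(1) shuffles the data into a single partition so the write
# below produces exactly one part file. coalesce(1) would skip the shuffle,
# but may force the whole computation onto a single node, as the docs warn.
(df.repartition(1)
   .write
   .mode("overwrite")
   .json("s3://my-bucket/dynamodb-export-single/"))
```

Note that even with `repartition(1)`, Spark writes a directory containing a single `part-*` file (plus a `_SUCCESS` marker) rather than one bare object; if you need a single named file, copy or rename the part file afterwards, e.g. with `aws s3 cp`.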
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Netanel Malka |
