How to export DynamoDB to S3 as a single file?

I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using AWS Data Pipeline. The export will in turn be queried by a Spark job.

The problem is that whenever I set up a Data Pipeline for this activity, the output in S3 consists of multiple partitioned files.

Is there a way to ensure that the entire table is exported as a single file in S3? If not, is there a way in Spark to read the partitioned files using the manifest and combine them into one to query the data?



Solution 1:[1]

You have two options here (call the function on the DataFrame just before writing):

  1. repartition(1)
  2. coalesce(1)

But as the docs emphasize, repartition is the better choice in your case:

However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

Docs:

repartition

coalesce

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

Solution 1: Netanel Malka