How to Partition so all SubIds get processed by same worker in Glue
I am new to AWS Glue/Spark processing, so bear with me if this is a dumb question.
S3 structure: `orders/year=xxxx/month=xx/day=xx/transactionnumber.json`
I have data in S3 laid out as shown above. I would like to use AWS Glue to process it and output 3 or 4 files that can be pushed to Redshift.
The transaction JSON has the following information:
- Item information
- Customer information
- License key information
- Additional parameters
- Subscription ID
An individual could have bought items every month or every year. The SubId and Sku fields in the JSON would help us group them together.
What I am trying to understand is the following: how do I properly set up the Glue job so that, when it reads and processes the data, each worker has all the records for a specific subscription?
I would like to count all the transactions per subscription to calculate how many times the person has transacted.
My understanding is that when we set up the Glue job, it will split the data into chunks.
Does this mean that when I do a groupby(subscription) on the complete data set and then perform my action, a specific worker is guaranteed to have all the records for that subscription, so it can process them accordingly?
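For concreteness, here is a minimal sketch of the kind of job I have in mind (the bucket name, output path, and column names such as `subscription_id` are placeholders, not my actual schema):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Standard Glue setup: a GlueContext wraps the SparkContext.
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Because the S3 paths use the Hive-style key=value layout,
# Spark infers year/month/day as partition columns automatically.
# "my-bucket" is a placeholder.
orders = spark.read.json("s3://my-bucket/orders/")

# One row per subscription with its transaction count.
# "subscription_id" stands in for the field in my JSON.
counts = orders.groupBy("subscription_id").count()

# Coalesce to a handful of files that can be COPY'd into Redshift.
counts.coalesce(4).write.mode("overwrite").parquet(
    "s3://my-bucket/output/subscription_counts/"
)
```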
Hope my question makes sense. Thanks
Solution 1:[1]
AWS Glue itself does little here; it relies on Spark under the hood.
With a groupBy:
- data is first aggregated locally (by a given worker, for a given task, for a given partition),
- and the partial results are then transferred and aggregated globally.
In that sense the question is a little off the mark: the groupBy applies an optimization on the mapper side, so no single worker needs to start out holding all the records for a subscription; see the sketch below. Hope this helps.
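To make the two phases concrete, here is a minimal PySpark sketch (column names are illustrative, not from the question's actual schema). Each input partition pre-aggregates its own rows first; the shuffle then brings the partial counts for a given subscription_id together on one executor for the final sum:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupby-demo").getOrCreate()

# Toy data spread across 3 partitions, mimicking how input files get split.
orders = spark.createDataFrame(
    [("sub-1", "sku-a"), ("sub-1", "sku-b"), ("sub-2", "sku-a"), ("sub-1", "sku-a")],
    ["subscription_id", "sku"],
).repartition(3)

# groupBy + count: partial counts are computed per partition (map side),
# then shuffled by key and summed globally. No worker ever needs all the
# raw rows for a subscription to produce a correct total.
counts = orders.groupBy("subscription_id").agg(F.count("*").alias("transactions"))
counts.show()

# If downstream logic really does need the raw rows for each subscription
# co-located in one partition, an explicit repartition by the key does that:
co_located = orders.repartition("subscription_id")
```

For a plain count, then, co-location does not have to be arranged up front; the shuffle takes care of it. An explicit repartition by the key is only needed when later per-subscription logic must see the raw rows together.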
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | thebluephantom |
