AWS Glue: Is event-driven architecture impossible to achieve? Problems with job concurrency and triggering
I have been researching for the past few days what the best solution to my use case is, yet it seems that AWS barely supports it. I will try to describe it as clearly as possible while avoiding an XY problem.
I have some Glue jobs that run on a schedule. These jobs pull data from, say, the Facebook, Google, and LinkedIn APIs and sink whatever they get into my staging bucket. As a very last step, they write a 'complete.json' file under a conventional prefix, indicating that the job has finished execution.
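For illustration, that marker write at the end of each extraction job looks roughly like the sketch below (the bucket name, prefix layout, and payload are hypothetical):

```python
import json

import boto3

s3 = boto3.client("s3")

STAGING_BUCKET = "my-staging-bucket"  # hypothetical bucket name


def write_completion_marker(source: str, run_id: str) -> None:
    """Write 'complete.json' under the source's conventional prefix to
    signal that the extraction job has finished."""
    s3.put_object(
        Bucket=STAGING_BUCKET,
        Key=f"{source}/{run_id}/complete.json",
        Body=json.dumps({"source": source, "run_id": run_id}),
        ContentType="application/json",
    )
```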
I have written a Glue job that should be triggered next and move the data to a 'consolidated' stage in my bucket, starting as soon as something arrives in the staging bucket. More specifically, the job (i) reads the data that has just been dumped, (ii) performs some column checks, (iii) computes a 'row_hash' column by hashing each row, and (iv) deduplicates the dataframe by comparing the current 'row_hash' column with the one already stored (if any) in the 'consolidated' stage.
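In PySpark terms, steps (i)-(iv) look roughly like this minimal sketch (the paths, storage format, and expected column list are placeholders, not my actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; the real job derives them from the trigger payload.
STAGING_PATH = "s3://my-staging-bucket/google/2024-01-01/"
CONSOLIDATED_PATH = "s3://my-bucket/consolidated/google/"

EXPECTED_COLUMNS = ["campaign_id", "clicks", "impressions"]  # assumed schema


def consolidate() -> None:
    staged = spark.read.json(STAGING_PATH)  # (i) read the fresh dump

    # (ii) basic column checks
    missing = set(EXPECTED_COLUMNS) - set(staged.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # (iii) hash each row into a 'row_hash' column
    staged = staged.withColumn(
        "row_hash",
        F.sha2(
            F.concat_ws("||", *[F.col(c).cast("string") for c in EXPECTED_COLUMNS]),
            256,
        ),
    )

    # (iv) keep only rows whose hash is not already in the consolidated stage
    try:
        existing = spark.read.parquet(CONSOLIDATED_PATH).select("row_hash")
        fresh = staged.join(existing, on="row_hash", how="left_anti")
    except Exception:  # first run: nothing consolidated yet
        fresh = staged

    fresh.write.mode("append").parquet(CONSOLIDATED_PATH)
```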
The idea is to have at most 10 of the 'consolidation' Glue jobs running at the same time, and never more than 1 per data source (acting as a sort of 'lock' to avoid concurrent writes). For example, if a job is already processing some Google data and more Google data arrives in the meantime, the new data should wait until that run has finished (successfully or not) before starting.
My current implementation:
- An S3 event triggers a Lambda whenever something is written under the 'complete.json' prefix; the Lambda adds that object's key to an SQS FIFO queue. From the prefix, the function derives the data source and sets it as the MessageGroupId, which lets me process one message at a time per data source. The queue then triggers a second Lambda, which in turn starts a Step Function, passing all required parameters. The state machine contains the Glue Job state, which effectively consolidates my data. The first Lambda looks roughly like the sketch below.
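A sketch of that enqueueing Lambda (the queue URL and key layout are hypothetical; MessageGroupId only applies to FIFO queues):

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical FIFO queue URL.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/consolidation.fifo"


def handler(event, context):
    """Triggered by S3 ObjectCreated events on the 'complete.json' prefix."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]  # e.g. "google/2024-01-01/complete.json"
        source = key.split("/")[0]           # data source derived from the prefix
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=key,
            MessageGroupId=source,       # one in-flight group per data source
            MessageDeduplicationId=key,  # keys are unique per run
        )
```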
Now the problems:
- I cannot trigger a Glue job or a Step Function directly from the SQS queue, which forces me to put a Lambda in between; that is just a waste of resources.
- As a direct result of point 1: as soon as the Lambda triggers my Glue job/Step Function, the message is successfully consumed, so the next one is picked up, potentially triggering an additional Glue job for the same source. This is far from ideal.
- In any case, since the Lambda performs a task that takes milliseconds to run, pretty much all available messages are passed straight to my Step Function, however many there are, triggering countless executions.
- As a result of the previous point, I almost always trigger more jobs than my configured concurrency (= 10), and sometimes my StartJobRun requests get throttled by AWS because I am trying to start TOO MANY jobs (concurrency exceptions turn into API throttling exceptions).
- I am forced to handle these errors by putting the message back in the queue (otherwise the message is gone for good), creating a not-so-nice-looking loop. (One possible alternative is sketched after this list.)
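One way to avoid the manual re-enqueue loop might be to let the invocation fail so SQS redelivers the message after the visibility timeout. A rough, untested sketch, assuming for simplicity that the Lambda starts the Glue job directly (the job name and concurrency limit are placeholders):

```python
import boto3

glue = boto3.client("glue")

JOB_NAME = "consolidation-job"  # hypothetical
MAX_CONCURRENCY = 10


def handler(event, context):
    """SQS-triggered Lambda: raising an exception makes SQS keep the
    messages and redeliver them after the visibility timeout, replacing
    the manual 're-enqueue on error' loop."""
    runs = glue.get_job_runs(JobName=JOB_NAME, MaxResults=50)["JobRuns"]
    active = [r for r in runs if r["JobRunState"] in ("STARTING", "RUNNING", "STOPPING")]
    if len(active) >= MAX_CONCURRENCY:
        raise RuntimeError("Concurrency limit reached; let SQS retry later")

    for record in event["Records"]:
        # If StartJobRun raises (ConcurrentRunsExceededException, throttling),
        # the invocation fails and SQS redelivers the message later.
        glue.start_job_run(
            JobName=JOB_NAME,
            Arguments={"--complete_key": record["body"]},
        )
```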
Something I have tried: I implemented an atomic lock using DynamoDB in the Lambda function that triggers my Step Function. It simply does not work (concurrency is far from being handled), for reasons unknown to me, but probably due to network latency and Glue having some invisible state after 'FAILED' or 'SUCCEEDED'. You can see the details of the implementation in this other post I have written here
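For completeness, the lock itself relied on DynamoDB conditional writes, roughly like this (the table name and schema are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

LOCK_TABLE = "consolidation-locks"  # hypothetical table, partition key 'source'


def acquire_lock(source: str, expires_at: int) -> bool:
    """Atomically claim the per-source lock; returns False if already held."""
    try:
        dynamodb.put_item(
            TableName=LOCK_TABLE,
            Item={"source": {"S": source}, "expires_at": {"N": str(expires_at)}},
            # Alias needed because SOURCE is a DynamoDB reserved word.
            ConditionExpression="attribute_not_exists(#s)",
            ExpressionAttributeNames={"#s": "source"},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # somebody else holds the lock
        raise


def release_lock(source: str) -> None:
    dynamodb.delete_item(TableName=LOCK_TABLE, Key={"source": {"S": source}})
```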
Other solutions I have researched:
I have seen that I can use EventBridge as a Step Function trigger, so the flow would be: ObjectCreated -> Lambda{PutEvent} -> EventBus -> Step Function. But I could not find in the docs whether event buses and rules have a feature similar to MessageGroupId. The pluses are that you can set up a DLQ and a retry policy. The PutEvent step itself would be trivial, roughly like the sketch below.
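A minimal sketch of that PutEvent step (the bus name, event source, and detail-type are made-up placeholders):

```python
import json

import boto3

events = boto3.client("events")


def publish_completion(source: str, key: str) -> None:
    """Forward the S3 completion marker onto a custom event bus."""
    events.put_events(
        Entries=[{
            "EventBusName": "consolidation-bus",  # hypothetical custom bus
            "Source": "my.staging.pipeline",      # hypothetical event source
            "DetailType": "SourceDumpCompleted",
            "Detail": json.dumps({"source": source, "key": key}),
        }]
    )
```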
The other solution would be EventBridge to Glue Workflow, but again: I can specify how many events to accumulate before pulling the trigger, and the batch window, but it does not let me control things per data source the way MessageGroupId does. I could create as many "consolidation" workflows as there are data sources, each listening on its own 'complete.json' prefix, but all of these workflows would do pretty much the exact same thing. Also, I am not sure this setup can handle concurrency or even has a retry policy. The per-source duplication would look roughly like the sketch below.
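To make that duplication concrete, the setup would mean one near-identical EventBridge rule per source (the bucket name and rule names are hypothetical, and this assumes EventBridge notifications are enabled on the bucket):

```python
import json

import boto3

events = boto3.client("events")

SOURCES = ["facebook", "google", "linkedin"]

for source in SOURCES:
    # One near-identical rule per data source, differing only in the prefix.
    events.put_rule(
        Name=f"consolidate-{source}",
        EventPattern=json.dumps({
            "source": ["aws.s3"],
            "detail-type": ["Object Created"],
            "detail": {
                "bucket": {"name": ["my-staging-bucket"]},     # hypothetical
                "object": {"key": [{"prefix": f"{source}/"}]}, # per-source prefix
            },
        }),
        State="ENABLED",
    )
```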
My questions:
- Is there any service in AWS that would fit my use case?
- Does EventBridge have a feature similar to MessageGroupId?
- Is there any mechanism in AWS that allows reliable concurrency handling?
Thanks a lot for taking the time to read this and, if you can, to reply.
Source: Stack Overflow, licensed under CC BY-SA 3.0.
