'How to capture data change in aws glue?
We have source data in on premise sql-server. We are using AWS glue to fetch data from sql-server and place it to the S3. Could anyone please help how can we implement change data capture in AWS Glue?
Note- We don't want to use AWS DMS.
Solution 1:[1]
I'm only aware of Glue Bookmarks. They will help you with the new records (Inserts), but won't help you with the Updates and Deletes that you typically get with a true CDC solution.
Not sure of your use case, but you could check out the following project. It has a pretty efficient diff feature and, with the right options, can give you a CDC-like output
https://github.com/G-Research/spark-extension/blob/master/DIFF.md
Solution 2:[2]
It's not possible to implement a change data capture through direct glue data extraction. While a Job bookmark can help you identify inserts and updates if your table contains an update_at timestamp column, it won't cover delete cases. You actually need a CDC solution.
While AWS glue direct connection to a database source is a great solution, I strongly discourage using it for incremental data extraction due to the cost implication. It's like using a Truck to ship one bottle of table water.
As you already commented, I am not also a fan of AWS DMS, but for a robust CDC solution, a tool like Debezium could be a perfect solution. It integrates with kafka and Kinesis. You can easily sink the stream to s3 directly. Debezium gives you the possibility to capture deletes and append a special boolean __delete column to your data, so your glue etl can manage the removal of these deleted records with this field.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bob Haffner |
| Solution 2 | Trust Okoroego |
