Duplicates when reading data in AWS Glue
We are reading data from a table in an IBM DB2/400 database. The table holds a large volume of data and is transactional, so its rows are updated frequently.
The script used to read it is as follows:

```python
spark.read.format("jdbc") \
    .option("driver", "com.ibm.as400.access.AS400JDBCDriver") \
    .option("url", "jdbc:as400://IPadress;libraries=" + Library_Name + ";") \
    .option("dbtable", Table_name) \
    .option("isolationLevel", "REPEATABLE_READ") \
    .option("user", db_username) \
    .option("password", db_password) \
    .load()
```
Once the data is pulled into S3, we see a few duplicate records when compared on the key column: one record from before an update and one from after it. We cross-checked the source system and it does not contain any duplicates, so our assumption is that the read is capturing the before and after images of an update as two separate records.
How can this issue be addressed?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
