Duplicate record keys in Apache Hudi

Hudi does not seem to deduplicate records in some cases. Below is the configuration that we use. We partition the data by customer_id, so our expectation is that Hudi will enforce uniqueness within each partition, i.e. within each customer_id folder. However, we are noticing that some customer_id folders contain two parquet files, and when we query the data in these partitions, we see duplicate unique_user_id values within the same customer_id. The _hoodie_record_key is identical for the two duplicate records, but the _hoodie_file_name is different, which makes me suspect that Hudi is enforcing uniqueness not within the customer_id folder but within the individual parquet files. Can someone explain this behavior?

  op: "INSERT"
  target-base-path: "s3_path"
  target-table: "some_table_name"

  source-ordering-field: "created_at"
  transformer-class: "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"

  filter-dupes: ""
  hoodie_conf:
  # source table base path
  hoodie.deltastreamer.source.dfs.root: "s3_path"

  # record key, partition paths and keygenerator
  hoodie.datasource.write.recordkey.field: "user_id,customer_id"
  hoodie.datasource.write.partitionpath.field: "customer_id"
  hoodie.datasource.write.keygenerator.class: "org.apache.hudi.keygen.ComplexKeyGenerator"

  # hive sync properties
  hoodie.datasource.hive_sync.enable: true
  hoodie.datasource.hive_sync.table: "table_name"
  hoodie.datasource.hive_sync.database: "database_name"
  hoodie.datasource.hive_sync.partition_fields: "customer_id"
  hoodie.datasource.hive_sync.partition_extractor_class: "org.apache.hudi.hive.MultiPartKeysValueExtractor"
  hoodie.datasource.write.hive_style_partitioning: true

  # sql transformer
  hoodie.deltastreamer.transformer.sql: "SELECT user_id, customer_id, updated_at as created_at FROM <SRC> a"

  # since there is no dt partition, the following config from default has to be overridden
  hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth: 0
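
To confirm the symptom described above (the same _hoodie_record_key showing up under different _hoodie_file_name values within one partition), a small check like the following might help. It is only a sketch: it assumes the queried rows have already been loaded as plain Python dicts that include the Hudi meta columns, and the function name, sample rows, file names, and key strings are hypothetical placeholders, not values from the actual table.

```python
from collections import defaultdict


def find_cross_file_duplicates(rows):
    """Return {(partition_path, record_key): sorted file names}
    for every record key that appears in more than one file."""
    files_by_key = defaultdict(set)
    for row in rows:
        key = (row["_hoodie_partition_path"], row["_hoodie_record_key"])
        files_by_key[key].add(row["_hoodie_file_name"])
    # Keep only keys that were found in at least two distinct files.
    return {
        key: sorted(files)
        for key, files in files_by_key.items()
        if len(files) > 1
    }


# Hypothetical sample rows (illustrative values only).
sample_rows = [
    {"_hoodie_partition_path": "customer_id=1",
     "_hoodie_record_key": "user_id:u1,customer_id:1",
     "_hoodie_file_name": "file-a.parquet"},
    {"_hoodie_partition_path": "customer_id=1",
     "_hoodie_record_key": "user_id:u1,customer_id:1",
     "_hoodie_file_name": "file-b.parquet"},
    {"_hoodie_partition_path": "customer_id=1",
     "_hoodie_record_key": "user_id:u2,customer_id:1",
     "_hoodie_file_name": "file-a.parquet"},
]

duplicates = find_cross_file_duplicates(sample_rows)
for (partition, record_key), files in duplicates.items():
    print(partition, record_key, files)
```

Grouping by partition path as well as record key keeps the check aligned with the expectation stated in the question, namely that uniqueness should hold within each customer_id folder.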


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
