Beam Dataflow ReadFromPubSub id_label for GCS Notification

Currently, I'm running a streaming Dataflow pipeline that moves uploaded blobs from GCS into BigQuery. However, I found several Pub/Sub messages pointing to the same objectId, which causes duplication at the BigQuery end.

I'm aware that GCS notifications guarantee at-least-once delivery, so it is possible to receive duplicate Pub/Sub messages for a given blob file.

I read in the Beam Python documentation that ReadFromPubSub can use id_label as a unique record identifier. To deduplicate based on the blob filename, I tried passing attributes.objectId as the id_label argument. Nevertheless, it doesn't seem to work (I still receive duplicate messages):

| "Read from pub/sub topic" >> ReadFromPubSub( topic=config["prod"]["input_pubsub"], with_attributes=True, id_label="attributes.objectId")

Note: I'm using Beam Python SDK 2.38.0
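
As a side note, the Cloud Storage notification docs list the blob name under a message attribute literally named objectId, and Beam's id_label parameter is documented as the name of an incoming message attribute. A throwaway sketch for inspecting which attribute keys actually arrive (the topic name below is a placeholder; substitute your own):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder topic; substitute config["prod"]["input_pubsub"] here.
TOPIC = "projects/my-project/topics/gcs-notifications"

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic=TOPIC, with_attributes=True)
        # GCS notifications carry attributes such as objectId, bucketId and
        # eventType; printing them shows the exact keys available to id_label.
        | "Print attributes" >> beam.Map(lambda msg: print(msg.attributes))
    )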



Solution 1:

This is because deduplication on the Dataflow side is best-effort.

Dataflow only deduplicates messages published to Pub/Sub within 10 minutes of each other, so duplicate notifications delivered further apart still come through.

See https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#efficient_deduplication.
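
Given that, one workaround is to deduplicate explicitly inside the pipeline instead of relying only on id_label, for example with Beam's DeduplicatePerKey transform keyed on the objectId attribute. A minimal sketch, assuming a placeholder topic name and a 30-minute dedup horizon (tune it to how far apart your duplicates arrive):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.deduplicate import DeduplicatePerKey
from apache_beam.utils.timestamp import Duration

# Placeholder topic; substitute config["prod"]["input_pubsub"] here.
TOPIC = "projects/my-project/topics/gcs-notifications"

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read from pub/sub topic" >> beam.io.ReadFromPubSub(
            topic=TOPIC, with_attributes=True)
        # GCS notifications carry the blob name in the objectId attribute.
        | "Key by objectId" >> beam.Map(
            lambda msg: (msg.attributes["objectId"], msg))
        # Keep only the first message seen per key within the duration;
        # later duplicates of the same objectId are dropped.
        | "Dedup per objectId" >> DeduplicatePerKey(
            processing_time_duration=Duration(seconds=30 * 60))
        | "Drop key" >> beam.Values()
        # ... parse the notification and WriteToBigQuery as before ...
    )

DeduplicatePerKey is backed by per-key state and timers, so it is not tied to Dataflow's built-in 10-minute window; the trade-off is that the runner keeps state for every objectId seen within the chosen duration.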

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: ningk