How do you ensure it works with Google Cloud Pub/Sub?

I am currently working on a distributed crawling service, and I have run into a few issues that need to be addressed.

First, let me explain how the crawler works and the problems that need to be solved.

The crawler needs to save all posts on each and every bulletin board on a particular site.

To do this, it automatically discovers crawling targets and publishes several messages to Pub/Sub. Each message looks like this:

{ "boardName": "test", "targetDate": "2020-01-05" }

When such a message is delivered, a Cloud Run service is triggered and crawls the data described by the JSON payload.
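Assuming the trigger is a Pub/Sub push subscription, the Cloud Run service receives an HTTP POST whose body wraps the payload in a Pub/Sub envelope with the data base64-encoded. A minimal Flask sketch (crawl() is a hypothetical placeholder for the actual crawling logic):

import base64
import json
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_message():
    envelope = request.get_json()
    # Push deliveries wrap the payload as {"message": {"data": "<base64>"}}.
    payload = base64.b64decode(envelope["message"]["data"]).decode("utf-8")
    target = json.loads(payload)
    crawl(target["boardName"], target["targetDate"])  # crawl() is a placeholder
    # Returning 2xx acknowledges the message; any other status triggers a retry.
    return ("", 204)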

However, if a duplicate of the same message is published, the same data is crawled again and duplicate records are stored. How can I ignore subsequent deliveries when an identical message comes in?

Also, are there Pub/Sub features, or other tools, that I can refer to for a stable implementation of a distributed crawler?



Solution 1:[1]

Because Pub/Sub is, by default, designed for at-least-once delivery, it's better to make your processing idempotent. (Exactly-once delivery is coming.)

In any case, the two situations are equivalent: the same message delivered twice, or two different messages with the same content, cause the same problem. There is no magic feature in Pub/Sub for that. You need an external tool, such as a database, to store the information that has already been received.

Firestore/Datastore is a good, serverless place for that. If you need low latency, Memorystore and its in-memory database is the fastest.
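As a sketch of that approach with Firestore, you could derive a deterministic key from the message content and use create(), which fails if the document already exists, as the dedup gate (the collection name and key scheme here are assumptions):

import hashlib
from google.cloud import firestore
from google.api_core.exceptions import AlreadyExists

db = firestore.Client()

def is_first_delivery(board_name, target_date):
    # Hash the identifying fields so identical content maps to the same document ID.
    key = hashlib.sha256(f"{board_name}|{target_date}".encode("utf-8")).hexdigest()
    try:
        # create() is atomic: it raises if a document with this ID already exists.
        db.collection("processed_targets").document(key).create(
            {"boardName": board_name, "targetDate": target_date}
        )
        return True   # First time this content has been seen: process it.
    except AlreadyExists:
        return False  # Duplicate: skip the crawl.

Because the write is atomic, two workers racing on the same content cannot both pass the gate; exactly one create() succeeds and the other sees AlreadyExists.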

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: guillaume blaquiere