Postgres concurrent batch processing

I have one Postgres table with n rows that I want to process chunk by chunk. Processing a chunk means selecting k rows, sending them to an external service, and then deleting them from the table. An independent service writes new rows to the table at random times.
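For concreteness, one chunk in the single-process case might look like the sketch below. The table name `items`, primary key `id`, `payload` column, and chunk size k = 100 are assumptions for illustration, not part of the original setup:

```sql
-- Minimal sketch of processing one chunk, assuming a table "items"
-- with primary key "id" and a "payload" column; k = 100.
BEGIN;

SELECT id, payload
FROM items
ORDER BY id
LIMIT 100;

-- ...send the selected rows to the external service...

DELETE FROM items
WHERE id IN (/* the ids returned by the SELECT above */);

COMMIT;
```

With a single worker this is safe; the problems below appear only once several workers run this loop concurrently.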

This works fine with a single process, but with multiple concurrent processes, two workers can select the same rows; after one of them deletes the batch, the other fails.

I looked into some approaches:

  1. Use a random OFFSET in the SELECT to reduce the probability that two processes pick the same chunk. This still fails when the table holds only a few chunks, because the collision probability becomes high.
  2. Add a timestamp column updated_at. Each process sets updated_at to "now" for all rows in its chunk, and the SELECT only takes rows whose updated_at is older than 5 minutes. This prevents other processes from taking rows that are currently being worked on. However, between one process's SELECT and its UPDATE of updated_at, another process can still select the same rows, so collisions remain possible (just less likely).
  3. Use a single process to iterate over the chunks and push them onto a queue, then parallelize over the queue instead of the database. This avoids collisions, but the one reader that drains the database and feeds the queue becomes a bottleneck.
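For approach 2, the select-then-stamp steps can at least be collapsed into a single statement, which narrows (though does not provably eliminate) the window between selecting the rows and claiming them. This is a sketch under the same assumed schema (`items`, `id`, `payload`), with a hypothetical `updated_at` column that is NULL for rows no worker has claimed yet:

```sql
-- Claim up to 100 unclaimed (or stale) rows in one statement,
-- assuming "updated_at" is NULL until a worker first claims a row.
UPDATE items
SET updated_at = now()
WHERE id IN (
    SELECT id
    FROM items
    WHERE updated_at IS NULL
       OR updated_at < now() - interval '5 minutes'
    LIMIT 100
)
RETURNING id, payload;
```

The caller would then send the returned rows externally and delete them by id; rows whose worker dies mid-chunk become eligible again once their timestamp ages past the 5-minute threshold.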

This feels like a common use case, so I suspect there is a cleaner way to achieve this than these semi-solutions.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source