'How to stop schedulers to run parallel while the services are distributed?
The application we are working on has some schedulers actually cron jobs which are running at a specific time. The services are written in PHP where we have these schedulers, which are deployed in servers with multiple instances. Suppose we have 3 instances of a worker-service where we have these schedulers. So when a scheduler runs, it actually runs on all three instances with some seconds gap. which means the scheduler is performing the same action 3 times. So how can I stop running the same schedulers in multiple instances? I want to run a scheduler in one instance.
Solution 1:[1]
You're going to need some means of coordination between the service instances. The method you choose is going to entail tradeoffs, so be sure about why you don't want the action happening 3 times. More concretely, you'll need to answer the question:
Which is worse: the action happening twice or not happening at all?
Whichever means of coordination you choose, there's a failure mode which leads to the action happening multiple times or not happening. There's no coordination method which involves communication over a network which can guarantee exactly once (something close to exactly once is possible if the action is idempotent (doing it twice is not different from doing it once), but if the action is idempotent, happening twice likely isn't bad* at all, thus not happening at all is likely to be worse).
If happening twice is preferable, something as simple as assigning each instance a unique (or even unique-ish) identifier, and having each run (when the scheduler fires) a DB query like INSERT INTO leases (id, holder, timestamp) VALUES ('scheduler', identifier, now()) (id being the primary key). If all the instances' clocks are roughly synchronized, only one of these inserts will succeed: if it succeeds, the lease is acquired and the scheduled action can be performed; after this, after some period of time, you DELETE FROM leases WHERE id = 'scheduler' AND holder = identifier. At some period more frequent than the scheduler and less frequent than that period of time, every node does a DELETE FROM leases WHERE id = 'scheduler' AND timestamp before now() - period to cancel leases which are too old.
If not happening at all is preferable, something like etcd, consul, or zookeeper can be used to implement the leases. This is a reasonably common use for those systems, so it's battle-tested and well-documented. Note that in the maximal safety case (where the lease never expires until explicitly released), you'll want to have some external monitoring to determine whether or not the scheduled action happened when it should have. If the monitoring says it hasn't, alert a human operator (the ultimate coordination layer) who can investigate and manually expire the lease.
*: I suppose something like launching enough nuclear missiles at random targets that the earth becomes uninhabitable is for all intents and purposes idempotent, even though I hope we agree that it's better for it not to happen at all...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Levi Ramsey |
