'Race condition for Microservice architecture [CosmosDB]
We have a micro service based architecture. Let's say we have front and backend completely isolated. The backend microserviceA exposes a rest endpoint which basically calls a thirdParty service and updates a record in cosmosDB. Now, this micro service is deployed over kubernetes cluster and hence can have multiple replication factor for load balancing. As mentioned before, the frontEnd is isolated and it consumes the exposed endpoint.
Problem :
FrontEnd has been written in such a manner that if the response is not obtained within a certain time frame or if a network failure occurs, it retries the endpoint. It has been observed that in some rare scenarios(doesn't matter what) UI makes multiple calls (mostly 2) one after another with time difference in milliseconds. Now here comes the race condition at the backend logic.
If the first call goes to ThirdParty first and obtained a success response, the second call will get a failure(bcz the first one was already a success). We can not change the behaviour of ThirdParty.
Taking above scenario as base, Now if the second call(failure one) updates the DB first and reaches the UI. UI takes this as a failure(whereas the first call was already a success) and take failure actions.
If the success calls makes it to the UI first, everything works fine.
Possible solution I can think of:
1)
Put a cache as source of truth.
apiCall : Status
If (entry not present in cache) {
Put Entry in cache With Status NULL or Something with specific TTL
(acquire lock on specific entry) {
If (status is success) return successResponse.
MAKE ThirdParty Call
Update DB
Update cache
Release LOCK
}
} else {
(acquire lock on specific entry) {
MAKE ThirdParty Call
Update DB
Update cache
Release LOCK
}
}
Else block will never be executed. seems like.
Only in case of failure, instead of updating the DB, put a thread.sleep(10000) for couple of times in hope that another thread will update the DB with success response.
If still not success, return a failure update and update DB.
Put a poller on UI side. If it is a failure. Try to poll couple of times more in hope that the status changes. If not, take the failure actions.
Optimistic locking for cosmos record.
https://cosmosdb.github.io/labs/dotnet/labs/10-concurrency-control.html
Not sure how this can help. Let's say, both api calls read the record when the version was 0. Now the second api call update the the DB record, as the version was not changed, it will be a successful update. Now the DB holds Failure as value. The first api call tries to update it and it found a version mismatch, the update will not go through and another attempt will be made to update the DB as it was a success. In case of failure, no attempts to update DB will be made.
Now, the second API call will appear to UI first and UI will again take the failure action. UI require a poller in such cases. But if the UI requires a poller, why do we need the optimistic locking in first place. :)
I don't know cosmosDB functionality much. If there is some functionality cosmos provides to handle, Please be kind enough to share.
What will be the best way to handle such kind of scenarios.
Solution 1:[1]
It seems in your application design you have made it necessary to wait for each execution to finish before you fire the next one, I am not debating if this is good or bad that's a different discussion, but it seems the only option you have to fire all your DB Updates in a synchronous manner in this case.
Optimistic locking is very good to ensure that the document you are updating have not been updated while your code did other things but it will not help your UI issue here.
I think you need to abstract the UI in order to make this work properly otherwise you are stuck running things in synchronous mode
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Douhan |
