Maintain a distributed incremental counter in Azure Cosmos DB

I am fairly new to Cosmos DB and was trying to understand the increment operation that the Azure Cosmos DB Java SDK provides for patching a document. I have a requirement to maintain an incremental counter in one of the documents in the container. The document looks like this-

{"counter": 1}

Now, from my application, I want to increment this counter by 1 every time an action happens. For this I am using CosmosPatchOperations: I add an increment like cosmosPatch.increment("/counter", 1), which works fine.
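For context, the call looks roughly like this with the Java SDK v4 (a minimal sketch only; the class and method names are illustrative, the container is an ordinary CosmosContainer obtained from a CosmosClient, and it is assumed to be partitioned on /id so the document id doubles as the partition key):

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.CosmosPatchOperations;
import com.azure.cosmos.models.PartitionKey;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class CounterIncrement {
    // Sketch: server-side increment of /counter via a partial document update (patch).
    public static long increment(CosmosContainer container, String id) {
        CosmosPatchOperations patch = CosmosPatchOperations.create()
                .increment("/counter", 1);

        // Assumes the container is partitioned on /id, so the document id is also the partition key.
        CosmosItemResponse<ObjectNode> response =
                container.patchItem(id, new PartitionKey(id), patch, ObjectNode.class);

        // The response body is the updated document, so the new counter value comes back with it.
        return response.getItem().get("counter").asLong();
    }
}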

Now this application can have multiple instances running, all of them talking to the same document in the Cosmos container, so App1 and App2 could both trigger an increment at the same time. The SDK method returns the updated document, and I need to use that updated value.

My question is: does Cosmos DB employ some locking mechanism to make sure the two patches happen one after another, and in that case, what updated value would I get back in App1 and App2 (the SDK method returns the updated document)? Will it be 2 in one of them and 3 in the other?

Couchbase supports such a counter at the cluster level, as explained here, and it has been working perfectly for me without any concurrency issues. I am now migrating to Cosmos DB and have been struggling to find how this can be achieved.

Update 1:

I decided to test this. I set up the Cosmos emulator on my local Mac and created a database and a container with RUs set to automatically scale from 1 to 10K. Then in this container I added a document like this -

{
"id": "randomId",
"counter": 0
}

After this I created a simple API whose only responsibility is to increment the counter by 1 every time it is invoked. Then I used Locust to invoke this API multiple times to mimic a small load-like scenario. Initially the test ran fine, with each invocation receiving the counter as expected (in an incremental manner). On increasing the load I saw some errors, namely RequestTimeoutException with status code 408. Other requests were still working fine and getting the correct counter value. I do not understand what caused the RequestTimeout exceptions here. The stack trace hints at something to do with concurrency, but I am not able to get my head around it. Here's the stack trace-

[stack trace screenshot]
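A rough sketch of how the status code and diagnostics can be pulled off the SDK exception around the same patch call, in case it helps with debugging (the class name and structure are illustrative only; "randomId" is the test document above):

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;
import com.azure.cosmos.models.CosmosPatchOperations;
import com.azure.cosmos.models.PartitionKey;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class CounterDebug {
    // Sketch: log status code and diagnostics when the counter patch fails (408, 429, 449, ...).
    public static void incrementOnce(CosmosContainer container) {
        CosmosPatchOperations patch = CosmosPatchOperations.create().increment("/counter", 1);
        try {
            long counter = container
                    .patchItem("randomId", new PartitionKey("randomId"), patch, ObjectNode.class)
                    .getItem().get("counter").asLong();
            System.out.println("counter = " + counter);
        } catch (CosmosException e) {
            // The diagnostics string carries the client-side request timeline and retry details.
            System.err.println("status=" + e.getStatusCode()
                    + " retryAfter=" + e.getRetryAfterDuration());
            System.err.println(e.getDiagnostics());
        }
    }
}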

Update 2: The test run in Update 1 was done on my local machine, and I realised that resource constraints on my machine might have led to those errors. I decided to test this in a pre-prod environment with an actual Cosmos DB account instead of the emulator.

Test configuration-

  1. Cosmos DB container with RUs set to automatically scale from 400 to 4000.
  2. Two instances of the application sharing the load.
  3. A Locust script to generate load on the application.

Findings-

Up until ~170 TPS, everything was running smoothly. Beyond that I noticed errors belonging to 2 different buckets-

  1. "exception": "["Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429"]".

I am not sure how 170-odd patch operations per second would have exhausted 4000 RUs, but that's a different discussion altogether.

  2. "exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449.

This error suggests that Cosmos DB does not serialize concurrent writes to the same document for you. I want to understand whether it maintains an internal queue to handle some of these requests, or whether it doesn't handle concurrent writes at all and expects the client to retry.
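If the answer turns out to be that the client is expected to retry, the pattern would presumably look something like this (just a sketch of the idea, not production code; it reuses the same patch call and honours the server's retry-after hint on 429/449):

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;
import com.azure.cosmos.models.CosmosPatchOperations;
import com.azure.cosmos.models.PartitionKey;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.time.Duration;

public class CounterRetry {
    // Sketch: retry the patch when Cosmos answers 429 (throttled) or 449 (transient write conflict).
    public static long incrementWithRetry(CosmosContainer container, int maxAttempts)
            throws InterruptedException {
        CosmosPatchOperations patch = CosmosPatchOperations.create().increment("/counter", 1);
        for (int attempt = 1; ; attempt++) {
            try {
                return container
                        .patchItem("randomId", new PartitionKey("randomId"), patch, ObjectNode.class)
                        .getItem().get("counter").asLong();
            } catch (CosmosException e) {
                boolean retryable = e.getStatusCode() == 429 || e.getStatusCode() == 449;
                if (!retryable || attempt >= maxAttempts) {
                    throw e;
                }
                // Honour the server's retry-after hint when present, otherwise back off linearly.
                Duration backoff = e.getRetryAfterDuration() != null
                        ? e.getRetryAfterDuration()
                        : Duration.ofMillis(100L * attempt);
                Thread.sleep(backoff.toMillis());
            }
        }
    }
}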



Solution 1:[1]

PATCH is no different from other operations here. Fundamentally, Cosmos DB implements Optimistic Concurrency Control (OCC), unlike relational databases, which offer locking mechanisms for this. OCC allows you to prevent lost updates and to keep your data correct, and it can be implemented by using the etag of a document: each document within Azure Cosmos DB has an _etag property.
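A minimal sketch of that etag flow (illustrative only, not taken from an official sample): read the document, modify it, and replace it with the etag you read set as an If-Match condition; a 412 Precondition Failed response means another writer got in first, so you re-read and retry.

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class EtagCounter {
    // Sketch of OCC with _etag: read, modify, replace with If-Match; 412 means a concurrent write won.
    public static long incrementWithEtag(CosmosContainer container, String id) {
        while (true) {
            CosmosItemResponse<ObjectNode> read =
                    container.readItem(id, new PartitionKey(id), ObjectNode.class);
            ObjectNode doc = read.getItem();
            doc.put("counter", doc.get("counter").asLong() + 1);

            CosmosItemRequestOptions options = new CosmosItemRequestOptions()
                    .setIfMatchETag(read.getETag()); // only apply if the document is unchanged
            try {
                container.replaceItem(doc, id, new PartitionKey(id), options);
                return doc.get("counter").asLong();
            } catch (CosmosException e) {
                if (e.getStatusCode() != 412) {   // 412 = precondition failed (etag mismatch)
                    throw e;
                }
                // Someone else updated the document first; loop to re-read and retry.
            }
        }
    }
}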

In your scenario, yes, it will return 2 in one of them and 3 in the other, provided both succeed, because the SDK has a retry mechanism, as explained here. Also have a look at this sample.

If your Azure Cosmos DB account is configured with multiple write regions, conflicts and conflict resolution policies are applicable at the document level, with Last Write Wins (LWW) being the default conflict resolution policy.
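For completeness, the conflict resolution policy is declared on the container; a minimal sketch of creating a container with an explicit LWW policy (the database object, container name, partition key and autoscale value are placeholders):

import com.azure.cosmos.CosmosDatabase;
import com.azure.cosmos.models.ConflictResolutionPolicy;
import com.azure.cosmos.models.CosmosContainerProperties;
import com.azure.cosmos.models.ThroughputProperties;

public class ContainerSetup {
    public static void create(CosmosDatabase database) {
        // Placeholder container name and /id partition key.
        CosmosContainerProperties props =
                new CosmosContainerProperties("counters", "/id");
        // Explicit Last Writer Wins policy resolved on the _ts (last modified) property.
        props.setConflictResolutionPolicy(
                ConflictResolutionPolicy.createLastWriterWinsPolicy("/_ts"));
        database.createContainerIfNotExists(
                props, ThroughputProperties.createAutoscaledThroughput(4000));
    }
}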

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sajeetharan