Shuffle data in MongoDB with pymongo
I have a MongoDB database with 1 million entries (documents), which is approximately 20 GB of data. I'd like to iterate through the data randomly in batches (using Python and pymongo), with, say, 10 batches of 100K each. If I had a small amount of data that fit in memory, I would simply load all of it, shuffle it randomly, and split it into 10 batches. But in this case I cannot fit it all into memory, so that option is out. How can I accomplish this without loading everything into memory?
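For reference, the small-data version I have in mind looks roughly like this (the database and collection names are placeholders for my actual setup):

```python
import random

from pymongo import MongoClient

# Placeholder connection/database/collection names; adjust to your setup.
client = MongoClient()
coll = client["mydb"]["mycoll"]

# Small-data version: load everything, shuffle in memory, split into batches.
docs = list(coll.find())
random.shuffle(docs)

batch_size = 100_000
batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
```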
One idea I had was to add a counter field to my MongoDB collection called "count", which labels the entries 1, 2, 3, …, 1M. Then I use Python to shuffle those numbers and write them back, so I can extract each batch with a simple range filter (see the sketch below). Does this seem reasonable? It seems pretty slow to me because of all the filters, and it doesn't look like it scales efficiently.
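Concretely, the sketch I have in mind is something like this (again with placeholder names; the point is that a permutation of 1M integers fits in memory even though the documents don't):

```python
import random

from pymongo import MongoClient, UpdateOne

# Placeholder names again; adjust to your setup.
client = MongoClient()
coll = client["mydb"]["mycoll"]

# A permutation of ~1M integers fits in memory even when the documents don't.
n = coll.count_documents({})
labels = list(range(n))
random.shuffle(labels)

# Attach a shuffled "count" label to every document, in bulk.
ops = [
    UpdateOne({"_id": doc["_id"]}, {"$set": {"count": label}})
    for doc, label in zip(coll.find({}, {"_id": 1}), labels)
]
coll.bulk_write(ops)
coll.create_index("count")  # so the range filters below aren't full scans

# Pull batch k with a simple range filter.
batch_size = 100_000
k = 0
for doc in coll.find({"count": {"$gte": k * batch_size, "$lt": (k + 1) * batch_size}}):
    pass  # process doc
```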
This seems like a pretty standard problem. Does someone have a better solution than mine?