Fastest and most memory-efficient way to access a large amount of data in Python?
So I have a text file of words, each with a corresponding vector. The file is ~4GB (1.9m entries), and I need to quickly access the vector for a given word using Python. I have some code that iterates through the file and, for each line, retrieves the word as a string and the vector as a numpy array. At the moment I'm using this to generate a dictionary, which I'm then pickling.
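For reference, here's a minimal sketch of roughly what that code does (file names are placeholders, and I'm assuming each line is the word followed by space-separated floats):

```python
import pickle
import numpy as np

# Build {word: np.ndarray} from the text file, then pickle it to disk.
vectors = {}
with open("vectors.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")          # word followed by its floats
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

with open("vectors.pkl", "wb") as f:
    pickle.dump(vectors, f, protocol=pickle.HIGHEST_PROTOCOL)

# At query time, the whole dict is loaded back into RAM and used directly.
with open("vectors.pkl", "rb") as f:
    vectors = pickle.load(f)
vec = vectors.get("some_word")                    # lookup is then a plain dict access
```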
When the program is run, the pickle file is loaded back into Python as a dictionary so I can query it. This works fine and is reasonably fast once it's loaded.
However, it uses ~3GB of RAM once loaded, and around 4.5GB while initially generating and pickling the dict, which isn't ideal. So would using e.g. sqlite improve anything? I've done a lot of digging on this, but other answers don't quite fit what I'm trying to do. It seems like a dictionary would be faster for smaller amounts of data, but it's unclear for a file this size.
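To make the sqlite option concrete, this is roughly what I have in mind (the table/column names are just placeholders; each vector would be stored as a BLOB of float32 bytes keyed by word):

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("vectors.db")
conn.execute("CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec BLOB)")

def insert_vector(word, vec):
    # Store the vector as raw float32 bytes so only the queried row
    # needs to be read into memory later.
    conn.execute(
        "INSERT OR REPLACE INTO vectors (word, vec) VALUES (?, ?)",
        (word, np.asarray(vec, dtype=np.float32).tobytes()),
    )

def lookup_vector(word):
    # Fetch a single row by primary key and rebuild the numpy array from its bytes.
    row = conn.execute("SELECT vec FROM vectors WHERE word = ?", (word,)).fetchone()
    return np.frombuffer(row[0], dtype=np.float32) if row else None

insert_vector("some_word", np.random.rand(300))
conn.commit()
print(lookup_vector("some_word"))
```

My understanding is that this would keep only the queried rows in RAM rather than the whole 1.9m-entry dict, at the cost of a disk read per lookup.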
Am I right to use a dictionary, or should I use SQL instead? Or a completely different thing?
Thanks for any help
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow