Fastest and most memory-efficient way to access a large amount of data in Python?

So I have a text file full of words with corresponding vectors. The file is ~4GB (1.9m entries), and I need to quickly access the vector for a given word using Python. I have some code that iterates through this file and retrieves the word as a string and the vector as a numpy array for each line. At the moment I'm using this to build a dictionary, which I'm then pickling.
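For reference, the build step currently looks roughly like this (the file names, the space-separated "word v1 v2 ... vn" line format, and the float32 dtype are just placeholders approximating my actual code):

    import pickle

    import numpy as np

    # Build step: each line is assumed to be "word v1 v2 ... vn", space-separated.
    vectors = {}
    with open("vectors.txt", "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # Word is the first token, the rest are the vector components.
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

    # Dump the whole dict to disk so it can be reloaded later.
    with open("vectors.pkl", "wb") as out:
        pickle.dump(vectors, out, protocol=pickle.HIGHEST_PROTOCOL)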

When the program is run, the pickle file is loaded back into Python as a dictionary so I can query it, which works fine and is reasonably fast once it's loaded.
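The lookup side is just this (same placeholder file name as above):

    import pickle

    # Load the pickled dict once at startup, then look words up from it.
    with open("vectors.pkl", "rb") as f:
        vectors = pickle.load(f)

    vec = vectors.get("apple")  # numpy array for the word, or None if it's missing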

However, it's using ~3GB of RAM, and around 4.5GB while initially generating and pickling the dict, which isn't ideal. So would using e.g. SQLite improve anything? I've done a lot of digging on this, but other answers don't quite fit what I'm trying to do. It seems like a dictionary would be faster for smaller datasets, but it's unclear at this file size.
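For context, this is roughly what I had in mind for the SQLite version, with each vector stored as raw float32 bytes in a BLOB keyed by word (the schema, file name, and helper names are just my sketch, not anything I've benchmarked):

    import sqlite3

    import numpy as np

    conn = sqlite3.connect("vectors.db")  # placeholder file name
    conn.execute("CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec BLOB)")

    def insert_vector(word, vec):
        # Store the vector as raw float32 bytes in a BLOB column.
        conn.execute(
            "INSERT OR REPLACE INTO vectors VALUES (?, ?)",
            (word, np.asarray(vec, dtype=np.float32).tobytes()),
        )

    def lookup_vector(word):
        # Return the vector for a word, or None if it isn't in the table.
        row = conn.execute("SELECT vec FROM vectors WHERE word = ?", (word,)).fetchone()
        return np.frombuffer(row[0], dtype=np.float32) if row else None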

Am I right to use a dictionary, or should I use SQL instead? Or something completely different?

Thanks for any help


