Best way to store a huge matrix of ~10 billion elements

I'm building an image-based recommendation algorithm that computes a similarity score between every pair of images and stores the scores in a symmetric matrix. In total I have 101551 images. First I extract 4096 features from each image and store them in the following data frame, where the index is the image name and every other column is a float64 feature, i.e. each row contains an image name and its features:

                         0         1         2  ...      4093      4094      4095
el_1543473-01_Fs.PNG   0.0  0.026052  0.000000  ...  0.000000  3.152112  0.000000
ell_1000003_Fs.PNG     0.0  0.000000  0.000000  ...  0.000000  0.000000  4.506009
ell_1000013_Fs.PNG     0.0  0.000000  0.000000  ...  3.915738  0.000000  0.000000
ell_1000018_Fs.PNG     0.0  0.000000  4.413001  ...  1.993990  0.000000  0.349481
ell_1000029_Fs.PNG     0.0  0.000000  1.500841  ...  5.759455  0.000000  0.371602
                   ...       ...       ...  ...       ...       ...       ...
sth_1571843-01_Fs.PNG  0.0  0.000000  0.000000  ...  0.000000  0.000000  0.000000
sth_1571844-01_Fs.PNG  0.0  0.000000  0.000000  ...  3.314346  0.000000  2.769834
sth_1571850-01_Fs.PNG  0.0  0.000000  0.000000  ...  2.261965  0.000000  2.754012
sth_1617824_Fs.PNG     0.0  0.313721  0.000000  ...  0.000000  0.000000  3.919592
sth_1617973_Fs.PNG     0.0  0.519381  0.000000  ...  0.000000  0.000000  1.407738

[101551 rows x 4096 columns]

Now I simply want to compute the cosine similarity between all pairs of images. For this I use the cosine_similarity function from sklearn, like this:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# calculate pairwise cosine similarity scores
cos_similarities = cosine_similarity(full_df)
# store the results in a pandas DataFrame indexed by image name
cos_similarities_df = pd.DataFrame(cos_similarities, columns=imagenames, index=imagenames)

Needless to say, a 101551 x 101551 matrix of float64 takes up a lot of RAM, so I get this error:

MemoryError: Unable to allocate 76.8 GiB for an array with shape (101551, 101551) and data type float64
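The reported size checks out; a quick back-of-the-envelope calculation:

```python
# A dense n x n float64 matrix costs n * n * 8 bytes
n = 101551
size_gib = n * n * 8 / 2**30  # bytes -> GiB
print(round(size_gib, 1))  # 76.8, matching the MemoryError message
```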

Is there a way to circumvent this problem, for example by compressing the matrix or storing it in a database? The similarity scores need to be quickly accessible: given one image, the algorithm should find the five images with the highest similarity scores and recommend them. I checked this post, but I don't think it's a viable option for me, since I can't work with one segment at a time; I need to search across the entire matrix when recommending.
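For concreteness, this is the kind of lookup I need. The sketch below (names and the chunk size are just illustrative, and it only uses numpy, not my actual data) shows how a single query could be answered without ever materializing the full n x n matrix, by normalizing the feature rows and scanning them in chunks:

```python
import numpy as np

def top5_similar(features, query_idx, chunk_size=4096):
    """Return indices of the 5 rows most cosine-similar to row query_idx.

    features: (n, d) float array, one feature vector per image.
    Works chunk by chunk, so memory stays O(n) instead of O(n^2).
    """
    # L2-normalize rows so a plain dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    q = unit[query_idx]

    sims = np.empty(len(unit))
    for start in range(0, len(unit), chunk_size):
        block = unit[start:start + chunk_size]
        sims[start:start + chunk_size] = block @ q

    sims[query_idx] = -np.inf                 # exclude the query image itself
    top = np.argpartition(sims, -5)[-5:]      # unordered top 5
    return top[np.argsort(sims[top])[::-1]]   # sorted best-first
```

This answers one query quickly, but it recomputes similarities every time, which is why I'm asking about precomputing and storing them.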



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
