Best way to store a huge matrix of ~10 billion elements
I'm building an image-based recommendation algorithm that computes a similarity score between every pair of images and stores the scores in a symmetric matrix. In total I have 101551 images. First I extract 4096 features from each image and store them in the following data frame, where the index is the image name and every other column is a float64 feature, i.e. each row holds an image name and its features:
0 1 2 ... 4093 4094 4095
el_1543473-01_Fs.PNG 0.0 0.026052 0.000000 ... 0.000000 3.152112 0.000000
ell_1000003_Fs.PNG 0.0 0.000000 0.000000 ... 0.000000 0.000000 4.506009
ell_1000013_Fs.PNG 0.0 0.000000 0.000000 ... 3.915738 0.000000 0.000000
ell_1000018_Fs.PNG 0.0 0.000000 4.413001 ... 1.993990 0.000000 0.349481
ell_1000029_Fs.PNG 0.0 0.000000 1.500841 ... 5.759455 0.000000 0.371602
... ... ... ... ... ... ...
sth_1571843-01_Fs.PNG 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.000000
sth_1571844-01_Fs.PNG 0.0 0.000000 0.000000 ... 3.314346 0.000000 2.769834
sth_1571850-01_Fs.PNG 0.0 0.000000 0.000000 ... 2.261965 0.000000 2.754012
sth_1617824_Fs.PNG 0.0 0.313721 0.000000 ... 0.000000 0.000000 3.919592
sth_1617973_Fs.PNG 0.0 0.519381 0.000000 ... 0.000000 0.000000 1.407738
[101551 rows x 4096 columns]
Now I simply want to compute the cosine similarity between every pair of images. For this I use the cosine_similarity function from sklearn, like this:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# calculate pairwise cosine similarity scores
cos_similarities = cosine_similarity(full_df)
# store the results in a pandas dataframe
cos_similarities_df = pd.DataFrame(cos_similarities, columns=imagenames, index=imagenames)
Needless to say, a 101551 x 101551 matrix takes up a lot of RAM, so I get the error:
MemoryError: Unable to allocate 76.8 GiB for an array with shape (101551, 101551) and data type float64
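For reference, the figure in the error message is just n² values at 8 bytes per float64; a quick sanity check:

```python
n = 101551
# a dense float64 matrix holds n * n values at 8 bytes each
bytes_needed = n * n * 8
print(f"{bytes_needed / 2**30:.1f} GiB")  # 76.8 GiB, matching the error
```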
Is there a way to circumvent this problem by means of compression or by storing the matrix in a database? The similarity scores need to be quickly accessible: given one image, the algorithm should find the five images with the highest similarity scores and recommend them. I checked this post, but I don't think that's a viable option for me, since I can't work with one segment at a time; I need to search the entire matrix at once when recommending.
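To make the access pattern concrete, here is a minimal sketch of how the top-5 lookup could work per query without ever materializing the full n x n matrix (the function name top_k_similar, the chunk size, and the NumPy-only implementation are my own illustration, not a tested solution):

```python
import numpy as np

def top_k_similar(features, query_idx, k=5, chunk=10_000):
    """Return the indices of the k rows most cosine-similar to row
    query_idx, computed chunk by chunk instead of all at once."""
    # normalize rows so that a plain dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-12, None)
    query = normed[query_idx]
    sims = np.empty(len(features))
    # one chunk of rows at a time keeps peak memory at chunk * 4096 floats
    for start in range(0, len(features), chunk):
        block = normed[start:start + chunk]
        sims[start:start + chunk] = block @ query
    sims[query_idx] = -np.inf  # exclude the query image itself
    best = np.argpartition(sims, -k)[-k:]
    return best[np.argsort(sims[best])[::-1]]  # sorted, most similar first
```

This trades storage for per-query compute (one pass over the 101551 x 4096 feature matrix), which may or may not be fast enough for my use case.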
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow