Determining cosine similarity for large datasets

I am currently working with a dataset of over 2.5 million images, in which each image is compared against every other image, for use in a content-based recommendation engine.

I use the following code to calculate cosine similarity from some precomputed embeddings:

from sklearn.metrics import pairwise_distances
cosine_similarity = 1 - pairwise_distances(embeddings, metric='cosine')

However, my issue is that I have estimated that creating this similarity matrix would require around 11,000 GB of memory.

Are there any alternatives for getting a similarity metric between every pair of data points in my dataset, or is there another way to go about this whole process?



Solution 1:[1]

You have 2,500,000 entries, so the resulting matrix has 6.25e+12 entries. You need to ask yourself what you plan to do with this data and compute only what you need; the storage requirement will follow from that. Computing a cosine distance is almost free (it is literally a dot product), so you can always do it "on the fly" with no need to precompute, and the question really boils down to how much time/compute you can afford.
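As a minimal sketch of this on-the-fly idea (not part of the original answer; the chunk size and k below are arbitrary assumptions), you can L2-normalize the embeddings once and then compute similarities block by block, keeping only the top-k neighbours per image instead of the full 2.5M x 2.5M matrix:

import numpy as np

def top_k_cosine_neighbours(embeddings, k=10, chunk_size=500):
    # L2-normalize once so cosine similarity becomes a plain dot product.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    emb = emb.astype(np.float32)
    n = emb.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float32)

    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # One chunk of rows against everything: shape (chunk, n).
        # With 2.5M float32 columns this is roughly chunk_size * 10 MB.
        sims = emb[start:stop] @ emb.T
        # Mask out self-similarity so an image is not its own neighbour.
        np.fill_diagonal(sims[:, start:stop], -np.inf)
        # Keep only the k largest similarities per row, sorted descending.
        idx = np.argpartition(sims, -k, axis=1)[:, -k:]
        part = np.take_along_axis(sims, idx, axis=1)
        order = np.argsort(-part, axis=1)
        top_idx[start:stop] = np.take_along_axis(idx, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part, order, axis=1)

    return top_idx, top_sim

This never holds more than one chunk of the similarity matrix in memory at a time, and the result (2.5M x k indices and scores) is small enough to store.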

Solution 2:[2]

If you have a recommendation business problem built on these 2.5 million images, you may want to check TF Recommenders, which basically uses around 30% of the data for a retrieval stage, and you can run a second ranking classifier on top of the initial model to explore more. This two-step approach is key under memory constraints and has already been battle-tested by Instagram and others.
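As a library-agnostic sketch of the same retrieve-then-rank idea (the answer itself points to TF Recommenders; the use of FAISS for retrieval, the candidate counts, and the rank_model placeholder here are assumptions for illustration only):

import numpy as np
import faiss  # pip install faiss-cpu

# --- Stage 1: retrieval ---
# Inner-product index over L2-normalized embeddings, so the inner product
# equals cosine similarity. IndexFlatIP is exact; an approximate index
# could be swapped in for faster retrieval at this scale.
emb = np.ascontiguousarray(embeddings, dtype=np.float32)
faiss.normalize_L2(emb)                  # in-place L2 normalization
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

def recommend(query_ids, n_candidates=100, n_final=10, rank_model=None):
    # Retrieve a small candidate set per query image, then re-rank it.
    sims, cand_ids = index.search(emb[query_ids], n_candidates + 1)
    results = []
    for row_sims, row_ids, qid in zip(sims, cand_ids, query_ids):
        # Drop the query itself from its own candidate list.
        mask = row_ids != qid
        row_ids = row_ids[mask][:n_candidates]
        row_sims = row_sims[mask][:n_candidates]
        # --- Stage 2: ranking ---
        # rank_model is a hypothetical second classifier returning a score
        # per candidate (e.g. trained on engagement data); fall back to the
        # retrieval similarity when no ranker is supplied.
        scores = rank_model(qid, row_ids) if rank_model is not None else row_sims
        order = np.argsort(-scores)[:n_final]
        results.append(row_ids[order])
    return results

For example, recommend([0, 1, 2]) would return the top-10 ranked candidate image ids for the first three images, having scored only ~100 candidates each rather than all 2.5 million.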

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: lejlot
[2] Solution 2: ibozkurt79