Alternatives to computing pairwise comparisons for large-scale datasets
What I would like to do is assign customers to clusters based on their shopping behaviour. My initial plan was to compute the cosine similarity across candidate shopping behaviours, reduce the dimensions with something like principal component analysis, and then run k-means clustering.
The problem is that I have over 1 million customers to analyse, so the pairwise similarity matrix takes up a huge amount of RAM. Is there a way I can either avoid calculating the pairwise similarity entirely, or calculate the similarity scores and perform the principal component analysis at the same time?
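One common way to sidestep the n × n similarity matrix entirely is to L2-normalise each customer's feature vector, since Euclidean distance between unit vectors is a monotone function of cosine similarity; k-means on the normalised rows then approximates cosine-based clustering. A minimal sketch of this idea, assuming scikit-learn and a dense customer × behaviour matrix (the specific shapes, component counts, and cluster counts here are illustrative placeholders, not values from the question):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

# Hypothetical customer x behaviour matrix; in practice this
# would have 1M+ rows (and could be sparse).
rng = np.random.default_rng(0)
X = rng.random((1000, 50))

# L2-normalise rows: squared Euclidean distance between unit
# vectors equals 2 - 2*cos(similarity), so clustering the
# normalised rows reflects cosine similarity without ever
# materialising the pairwise similarity matrix.
X_unit = normalize(X)

# Reduce dimensionality directly on the feature matrix; unlike
# kernel/pairwise approaches, TruncatedSVD works on the raw
# (even sparse) n x d data, so memory scales with n*d, not n*n.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_unit)

# MiniBatchKMeans fits on small random batches, so memory stays
# flat as the number of customers grows.
km = MiniBatchKMeans(n_clusters=8, random_state=0, n_init=3)
labels = km.fit_predict(X_reduced)
print(labels.shape)  # one cluster label per customer
```

For out-of-core workloads, `MiniBatchKMeans.partial_fit` can consume the data chunk by chunk, so the full matrix never needs to sit in RAM at once.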
Source: Stack Overflow, licensed under CC BY-SA 3.0 in accordance with its attribution requirements.
