Is there a faster way to get correlation coefficients?

I have two large datasets, A (100,000 x 10,000) and B (100,000 x 250), and I want to calculate the Pearson correlation between their columns.

The scipy.spatial.distance.cdist function with metric='correlation' does exactly what I want.

from scipy.spatial.distance import cdist

corr = 1 - cdist(A.T, B.T, 'correlation')

But it takes about five times as long as numpy.corrcoef, even though corrcoef computes the full correlation matrix and I have to discard the large blocks of correlations calculated within each single dataset.

import numpy as np

corr = np.corrcoef(np.hstack((A, B)).T)[len(A.T):, :len(A.T)].T

Is there a faster way to do this?
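For reference, the comparison can be sketched with a minimal timing harness like the one below. The shapes are much smaller stand-ins chosen for illustration (the full 100,000 x 10,000 arrays would need roughly 8 GB each in float64), not the actual data:

import time

import numpy as np
from scipy.spatial.distance import cdist

# Smaller stand-in shapes so the comparison runs in reasonable time/memory
rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 1_000))
B = rng.standard_normal((10_000, 25))

t0 = time.perf_counter()
corr_cdist = 1 - cdist(A.T, B.T, 'correlation')
t1 = time.perf_counter()
corr_np = np.corrcoef(np.hstack((A, B)).T)[len(A.T):, :len(A.T)].T
t2 = time.perf_counter()

print(f"cdist:    {t1 - t0:.3f} s")
print(f"corrcoef: {t2 - t1:.3f} s")
print("max abs diff:", np.abs(corr_cdist - corr_np).max())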



Solution 1:[1]

You can try this implementation; I don't have enough memory to test it with your input size.

It looks like the cdist implementation internally uses Python loops for this metric, which would explain the slowdown.

import numpy as np

def pairwise_correlation(A, B):
    # Center each column (variable) around its mean
    am = A - np.mean(A, axis=0, keepdims=True)
    bm = B - np.mean(B, axis=0, keepdims=True)
    # Cross-products in one matrix multiplication, normalized by
    # the column norms of the centered data
    return am.T @ bm / (np.sqrt(np.sum(am**2, axis=0, keepdims=True)).T
                        * np.sqrt(np.sum(bm**2, axis=0, keepdims=True)))
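A quick sanity check against np.corrcoef on small random inputs (the shapes below are arbitrary toy values):

import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((100, 8))  # toy stand-ins for the large datasets
B = rng.standard_normal((100, 3))

result = pairwise_correlation(A, B)  # shape (8, 3)
# np.corrcoef stacks the variables of both inputs, so the A-vs-B
# correlations sit in the off-diagonal block of the full matrix.
expected = np.corrcoef(A.T, B.T)[:8, 8:]

print(np.allclose(result, expected))  # True

The heavy lifting, am.T @ bm, is a single matrix multiplication that NumPy dispatches to BLAS, which avoids the per-pair looping that appears to slow down cdist.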

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bob