Calculating the covariance matrix fast in Python with some minor customizing

I have a pandas DataFrame and I'm trying to find the covariance of the percentage change of each column. For each pair of columns, I want rows with missing values to be dropped first, and the percentage change to be calculated afterwards. That is, I want something like this:

import pandas as pd
import numpy as np

# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan

cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
        cov_df.loc[col_i, col_j] = cov.iloc[0, 1]

The thing is, this is super slow. The code below gives me results that are similar to (but not exactly) what I want, and it runs quite fast:

df.dropna(how='any', axis=0).pct_change().cov()

I am not sure why the second one runs so much faster. I want to speed up my code in the first, but I can't figure out how.
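
To make the "similar but not exactly" part concrete, here is a small sketch on a toy frame. The values and the column names a, b, c are made up for illustration. Dropping rows globally removes a row from every column as soon as any column is missing there, so the percentage changes for a and b are taken over different rows than in the pairwise version. As for speed, the one-shot call is a single vectorized computation, whereas the double loop builds a small DataFrame for each of the 3535 * 3535 (about 12.5 million) pairs, which is likely where most of the time goes.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): only column "c" has a missing value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 5.0, 8.0],
    "b": [1.0, 3.0, 4.0, 6.0, 9.0],
    "c": [2.0, np.nan, 3.0, 5.0, 4.0],
})

# Global drop: row 1 is removed for every column because "c" is missing there.
global_cov = toy.dropna(how="any", axis=0).pct_change().cov()

# Pairwise drop: "a" and "b" have no missing rows, so row 1 is kept for this pair.
pair_cov = toy[["a", "b"]].dropna(how="any", axis=0).pct_change().cov()

# The two covariances for the pair (a, b) differ.
print(global_cov.loc["a", "b"], pair_cov.loc["a", "b"])
```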

I have tried using combinations from itertools to avoid repeating the calculation for (col_i, col_j) and (col_j, col_i), and using map from multiprocessing to do the computations in parallel, but it still hasn't finished running after 90+ minutes.
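
For reference, the combinations idea can be sketched roughly as below. This is not the exact code that was run; the function name pairwise_pct_cov is a hypothetical helper introduced here. It computes each unordered pair once and mirrors the result, which halves the work but still builds a small DataFrame per pair, so it would probably remain slow for 3535 columns.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_pct_cov(df):
    """Covariance of percentage changes, dropping missing rows per column pair.

    Hypothetical helper: a sketch of the combinations approach, not a tuned solution.
    """
    cols = list(df.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    # Diagonal: variance of each column's own percentage change.
    for col in cols:
        out.loc[col, col] = df[col].dropna().pct_change().var()
    # Off-diagonal: compute each unordered pair once and mirror it.
    for col_i, col_j in combinations(cols, 2):
        pair = df[[col_i, col_j]].dropna(how="any", axis=0).pct_change()
        cov_ij = pair[col_i].cov(pair[col_j])
        out.loc[col_i, col_j] = cov_ij
        out.loc[col_j, col_i] = cov_ij
    return out
```

Calling pairwise_pct_cov(df) on a small frame should reproduce the off-diagonal entries of the double loop above.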



Solution 1:[1]

Somehow this works fast enough, although I am not sure why. Note that it computes the correlation rather than the covariance:

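import numpy as np
from scipy.stats import pearsonr

# Assumption: x is the data as a 2-D float array taken from the DataFrame
x = df.to_numpy()

corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        y = x[:, [i, j]]
        y = y[~np.isnan(y).any(axis=1)]     # drop rows missing in either column
        y = np.diff(y, axis=0) / y[:-1, :]  # percentage change
        if len(y) < 2:                      # pearsonr needs at least 2 points
            corr[i, j] = np.nan
            continue
        corr[i, j] = pearsonr(y[:, 0], y[:, 1])[0]
corr = corr + corr.T                        # mirror the upper triangle
np.fill_diagonal(corr, 1)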

This finishes within 8 minutes, which is fast enough for my use case.

On the other hand, the pandas version below has been running for 30 minutes and still isn't done:

corr = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
        corr.loc[col_i, col_j] = corr_ij

I don't know why this is, but the first one is a good enough solution for me for now.
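
Since the question asked for covariance rather than correlation, the NumPy loop could be adapted along the following lines. This is a sketch, not something benchmarked on the full data. The function name pairwise_pct_cov_numpy is hypothetical, and it assumes x is a 2-D float array such as df.to_numpy(). It uses np.cov, whose default normalization (N - 1) matches the pandas cov default.

```python
import numpy as np


def pairwise_pct_cov_numpy(x):
    """Pairwise-complete covariance of percentage changes for a 2-D float array.

    Hypothetical helper; assumes x has shape (n_rows, n_cols) with NaN for gaps.
    """
    n_cols = x.shape[1]
    cov = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            y = x[:, [i, j]]
            y = y[~np.isnan(y).any(axis=1)]     # drop rows missing in either column
            if len(y) < 3:                      # fewer than 2 percentage changes
                continue                        # leave the entry as NaN
            r = np.diff(y, axis=0) / y[:-1, :]  # percentage change
            c = np.cov(r[:, 0], r[:, 1])[0, 1]  # sample covariance (N - 1)
            cov[i, j] = cov[j, i] = c
    return cov
```

The result can be wrapped back into a labelled frame with pd.DataFrame(cov, index=df.columns, columns=df.columns) if the column labels are needed.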

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1