Cannot manually reproduce CCA loadings
For my current project I am using sklearn.cross_decomposition.CCA. In several references (such as 1, 2 and 3) it is stated that canonical loadings can be computed as correlations between variables and canonical variates. However, I could not reproduce this using scipy.stats.pearsonr. Am I doing something wrong?
Here's an example:
from sklearn.cross_decomposition import CCA
import numpy as np
from scipy.stats import pearsonr
# compute CCA
X = np.array([[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [3.,5.,4.]])
Y = np.array([[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]])
cca = CCA(n_components=1)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)
# obtain loadings for X variable set
x_loadings = cca.x_loadings_
print(f"The first variable has a loading of {x_loadings[0]} on the first canonical variate")
# try to manually calculate loadings using Pearson correlation
r, _ = pearsonr(np.squeeze(X[:, 0]), np.squeeze(X_c))
print(f"Correlation between first variable and first canonical variate: {r}")
which gives:
The first variable has a loading of [0.61454275] on the first canonical variate
Correlation between first variable and first canonical variate: 0.9610851804703184
As you can see, those numbers are totally different.
Solution 1:[1]
You are probably misinterpreting what sklearn stores in x_loadings_. Despite the name, these are not the correlation-based loadings (structure coefficients) described in your references: sklearn computes them as the coefficients of a regression of the standardized variables on the canonical variate, which is exactly what the manual computation at the end of the snippet below reproduces. The two definitions only coincide when the canonical variate itself has unit variance.
To see why, recall that the slope of a simple regression of y on x equals r * std(y) / std(x), where r is the Pearson correlation. Here y is a standardized variable, so std(y) = 1, and the loading reduces to the correlation divided by the standard deviation of the canonical variate. (Uncommenting the Y = 2 * X line in the snippet lets you explore the perfectly correlated case.)
In your example, the first variable has an x_loadings_ value of 0.61454275, while its Pearson correlation with the first canonical variate is 0.9610851804703184; dividing the correlation by the sample standard deviation of the canonical variate recovers sklearn's value.
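As a quick sanity check, here is a minimal sketch that re-runs the numbers from your question; the only new step is the division by the sample standard deviation of the canonical variate:
from sklearn.cross_decomposition import CCA
import numpy as np
from scipy.stats import pearsonr
X = np.array([[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [3., 5., 4.]])
Y = np.array([[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]])
cca = CCA(n_components=1).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)
r, _ = pearsonr(X[:, 0], np.squeeze(X_c))
# divide the correlation by the sample std of the canonical variate
print(r / np.std(X_c, ddof=1))  # ~0.6145, matches the value below
print(cca.x_loadings_[0, 0])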
I adapted your snippet to demonstrate all this:
from sklearn.cross_decomposition import CCA
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
# Dataset
X = np.array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
Y = np.array([[0.1, -0.2],
[0.9, 1.1],
[6.2, 5.9],
[11.9, 12.3]])
# Y = 2 * X # uncomment to see the effect of perfect correlation
cca = CCA(n_components=1)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)
# obtain loadings for X variable set
x_loadings = cca.x_loadings_
# Reconstruct X from the canonical variate (useful when exploring the perfect-correlation case)
# first manually, so you can see what inverse_transform does under the hood
X_r = np.matmul(X_c, x_loadings.T)
X_r *= np.std(X, axis=0, ddof=1)  # undo sklearn's internal scaling
X_r += np.mean(X, axis=0)  # undo sklearn's internal centering
# then, using sklearn:
X_r_cca = cca.inverse_transform(X_c)
assert np.allclose(X_r, X_r_cca)
# Plot the first canonical variate of X against the first original variable
plt.figure(figsize=(8, 8))
plt.plot(X_c, X.T[0], 'ro')
plt.xlabel('Canonical variate of X')
plt.ylabel('First variable of X')
plt.show()
print(f"The first variable has a loading of {x_loadings[0]} on the first canonical variate")
# Manually recover the loading from the Pearson correlation
# standardize X the way sklearn does internally (scale=True by default)
X_std = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)  # ddof=1: divide by N-1, as sklearn does
# divide r by the sample standard deviation of the canonical variate
r_loading = pearsonr(X_c.T[0], X_std.T[0])[0] / np.std(X_c.T, axis=1, ddof=1)
# equivalently, the loadings are the regression coefficients of X_std on the canonical variate
x_loadings_manual = np.dot(X_c.T[0], X_std) / np.dot(X_c.T[0], X_c.T[0])
# check that all three computations agree
# (note: np.allclose takes only two arrays; a third positional argument would be silently treated as rtol)
assert np.allclose(x_loadings.reshape(-1)[0], x_loadings_manual[0])
assert np.allclose(x_loadings_manual[0], r_loading)
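If I read sklearn's implementation correctly, y_loadings_ is computed the same way, so the same division should recover it on the Y side as well (a short check continuing the snippet above):
# same check for the Y variable set
Y_std = (Y - np.mean(Y, axis=0)) / np.std(Y, axis=0, ddof=1)
r_y = pearsonr(Y_c.T[0], Y_std.T[0])[0]
assert np.allclose(r_y / np.std(Y_c, ddof=1), cca.y_loadings_[0, 0])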
I hope this helps answer your question!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Robin Thibaut |
