'Finding R^2 in Python, varying answers corrcoef vs math
I'm taking an intro to data sci class and we were taught to find r^2 using the r_squared() function below. However, I wanted to try out different ways of doing this and ended up with r_squaredcorrcoeff(). I was surprised when I ended up with two different returns. Does anyone know why this is happening?
The "correct" answer according to my class is the number that I get when using r_squared(). Am I using the numpy.corrcoef method correctly? Thank you all for the time and help!
def r_squaredcorrcoeff(y, estimated):
y, estimated = np.array(y), np.array(estimated)
correlation_matrix = np.corrcoef(y, estimated)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2
return r_squared
def r_squared(y, estimated):
y, estimated = np.array(y), np.array(estimated)
error = ((estimated - y)**2).sum()
meanError = error/len(y)
return 1 - (meanError/np.var(y))
print(r_squaredcorrcoeff([-3.1, -4.1, -9.2, 10.1, 9.1, 4.5], [-1.1, -2.1, -7.2, 11.1, 11.1, 5.5]))
print(r_squared([-3.1, -4.1, -9.2, 10.1, 9.1, 4.5], [-1.1, -2.1, -7.2, 11.1, 11.1, 5.5])
Solution 1:[1]
The function np.cov() and np.corrcoeff() return the covariance and the normalized covariance matrices of the input sequences.
As per their definition, they can be used to compute Pearson's correlation coefficient ?_ab (or, more precisely, ?²_ab in rho2_*() below).
The same coefficient can be computed with means and standard deviations.
Additionally, one can compute the Coefficient of determination R² (r2_*() below).
import numpy as np
def rho2_ms(a, b):
return (
np.mean((a - np.mean(a)) * (b - np.mean(b)))
/ np.std(a) / np.std(b)) ** 2
def rho2_cov(a, b):
c = np.sqrt(np.cov(a, b))
return (c[0, 1] * c[1, 0] / c[0, 0] / c[1, 1]) ** 2
def rho2_corr(a, b):
# from OP, equivalent to: `r_squaredcorrcoeff()`
c = np.sqrt(np.corrcoef(a, b))
return (c[0, 1] * c[1, 0] / c[0, 0] / c[1, 1]) ** 2
def r2_ss(a, b):
ss_res = np.sum((a - b) ** 2)
ss_tot = np.sum((a - np.mean(a)) ** 2)
return 1 - ss_res / ss_tot
def r2_var(a, b):
# from OP, equivalent to: `r_squared()`
mean_err = np.mean((a - b) ** 2)
var_a = np.var(a)
return 1 - mean_err / var_a
x = np.array([-3.1, -4.1, -9.2, 10.1, 9.1, 4.5])
y = np.array([-1.1, -2.1, -7.2, 11.1, 11.1, 5.5])
funcs = rho2_ms, rho2_cov, rho2_corr, r2_ss, r2_var
base = funcs[0](x, y)
print(f"{'Name':<24} {'Same':<6} {'Value':<10}")
print(f"{'-' * 24} {'-' * 6} {'-' * 10}")
for func in funcs:
res = func(x, y)
print(f"{func.__name__:<24} {np.isclose(base, res)!s:<6} {res:.6f}")
# Name Same Value
# ------------------------ ------ ----------
# rho2_ms True 0.997004
# rho2_cov True 0.997004
# rho2_corr True 0.997004
# r2_ss False 0.941415
# r2_var False 0.941415
However, R² and ?_ab values are not the same in general.
This is evident from their definition.
For example, while rho2_*() implementations are all symmetrical (f(a, b) == f(b, a)), r2_*() implementations are not.
However, reading from the Wikipedia page for the Coefficient of determination R²:
In linear least squares multiple regression with an estimated intercept term, R² equals the square of the Pearson correlation coefficient between the observed y and modeled (predicted) f data values of the dependent variable.
There is no such relationship between your x and y.
However, if you compute the linear regression z on x:
t = np.arange(len(x))
p = np.polyfit(t, x, 1)
z = np.polyval(p, t)
print(z)
# [-5.7047619 -2.93619048 -0.16761905 2.60095238 5.36952381 8.13809524]
then R² and ?_ab (however being computed) get identical:
base = funcs[0](x, z)
print(f"{'Name':<24} {'Same':<6} {'Value':<10}")
print(f"{'-' * 24} {'-' * 6} {'-' * 10}")
for func in funcs:
res = func(x, z)
print(f"{func.__name__:<24} {np.isclose(base, res)!s:<6} {res:.6f}")
# Name Same Value
# ------------------------ ------ ----------
# rho2_ms True 0.436576
# rho2_cov True 0.436576
# rho2_corr True 0.436576
# r2_ss True 0.436576
# r2_var True 0.436576
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
