'Finding R^2 in Python, varying answers corrcoef vs math

I'm taking an intro to data sci class and we were taught to find r^2 using the r_squared() function below. However, I wanted to try out different ways of doing this and ended up with r_squaredcorrcoeff(). I was surprised when I ended up with two different returns. Does anyone know why this is happening?

The "correct" answer according to my class is the number that I get when using r_squared(). Am I using the numpy.corrcoef method correctly? Thank you all for the time and help!

def r_squaredcorrcoeff(y, estimated):
    
    y, estimated = np.array(y), np.array(estimated)
    correlation_matrix = np.corrcoef(y, estimated)
    correlation_xy = correlation_matrix[0,1]
    r_squared = correlation_xy**2
    return r_squared

def r_squared(y, estimated):

    y, estimated = np.array(y), np.array(estimated)
    error = ((estimated - y)**2).sum()
    meanError = error/len(y)
    return 1 - (meanError/np.var(y))

print(r_squaredcorrcoeff([-3.1, -4.1, -9.2, 10.1, 9.1, 4.5], [-1.1, -2.1, -7.2, 11.1, 11.1, 5.5]))

print(r_squared([-3.1, -4.1, -9.2, 10.1, 9.1, 4.5], [-1.1, -2.1, -7.2, 11.1, 11.1, 5.5])


Solution 1:[1]

The function np.cov() and np.corrcoeff() return the covariance and the normalized covariance matrices of the input sequences.

As per their definition, they can be used to compute Pearson's correlation coefficient ?_ab (or, more precisely, ?²_ab in rho2_*() below). The same coefficient can be computed with means and standard deviations.

Additionally, one can compute the Coefficient of determination R² (r2_*() below).

import numpy as np


def rho2_ms(a, b):  
    return (
        np.mean((a - np.mean(a)) * (b - np.mean(b)))
        / np.std(a) / np.std(b)) ** 2


def rho2_cov(a, b):
    c = np.sqrt(np.cov(a, b))
    return (c[0, 1] * c[1, 0] / c[0, 0] / c[1, 1]) ** 2


def rho2_corr(a, b):
    # from OP, equivalent to: `r_squaredcorrcoeff()`
    c = np.sqrt(np.corrcoef(a, b))
    return (c[0, 1] * c[1, 0] / c[0, 0] / c[1, 1]) ** 2


def r2_ss(a, b):
    ss_res = np.sum((a - b) ** 2)
    ss_tot = np.sum((a - np.mean(a)) ** 2)
    return 1 - ss_res / ss_tot


def r2_var(a, b):
    # from OP, equivalent to: `r_squared()`
    mean_err = np.mean((a - b) ** 2)
    var_a = np.var(a)
    return 1 - mean_err / var_a
x = np.array([-3.1, -4.1, -9.2, 10.1, 9.1, 4.5])
y = np.array([-1.1, -2.1, -7.2, 11.1, 11.1, 5.5])


funcs = rho2_ms, rho2_cov, rho2_corr, r2_ss, r2_var
base = funcs[0](x, y)
print(f"{'Name':<24}  {'Same':<6}  {'Value':<10}")
print(f"{'-' * 24}  {'-' * 6}  {'-' * 10}")
for func in funcs:
    res = func(x, y)
    print(f"{func.__name__:<24}  {np.isclose(base, res)!s:<6}  {res:.6f}")
# Name                      Same    Value     
# ------------------------  ------  ----------
# rho2_ms                   True    0.997004
# rho2_cov                  True    0.997004
# rho2_corr                 True    0.997004
# r2_ss                     False   0.941415
# r2_var                    False   0.941415

However, R² and ?_ab values are not the same in general.

This is evident from their definition. For example, while rho2_*() implementations are all symmetrical (f(a, b) == f(b, a)), r2_*() implementations are not.

However, reading from the Wikipedia page for the Coefficient of determination R²:

In linear least squares multiple regression with an estimated intercept term, R² equals the square of the Pearson correlation coefficient between the observed y and modeled (predicted) f data values of the dependent variable.

There is no such relationship between your x and y.

However, if you compute the linear regression z on x:

t = np.arange(len(x))
p = np.polyfit(t, x, 1)
z = np.polyval(p, t)
print(z)
# [-5.7047619  -2.93619048 -0.16761905  2.60095238  5.36952381  8.13809524]

then R² and ?_ab (however being computed) get identical:

base = funcs[0](x, z)
print(f"{'Name':<24}  {'Same':<6}  {'Value':<10}")
print(f"{'-' * 24}  {'-' * 6}  {'-' * 10}")
for func in funcs:
    res = func(x, z)
    print(f"{func.__name__:<24}  {np.isclose(base, res)!s:<6}  {res:.6f}")
# Name                      Same    Value     
# ------------------------  ------  ----------
# rho2_ms                   True    0.436576
# rho2_cov                  True    0.436576
# rho2_corr                 True    0.436576
# r2_ss                     True    0.436576
# r2_var                    True    0.436576

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1