Difference between Excel's and Python's R2 for a power fit

I'm trying to do a power fit of the x and y arrays below and also estimate the R2. I have already done this in Excel, which gives me an R2 of 0.5389, while the following Python code gives 0.00498. The coefficients from both methods are exactly the same, so I don't know what I'm doing wrong.

import numpy as np
import matplotlib.pyplot as plt

def equation(a, b):
    """Evaluate the power law a[0] * b**a[1]."""
    return a[0] * b**a[1]

def plot_ci_manual(t, s_err, n, x, x2, y2, ax=None):
    """Plot the confidence-interval band around the fitted curve."""
    if ax is None:
        ax = plt.gca()

    ci = t * s_err * np.sqrt(1/n + (x2 - np.mean(x))**2 / np.sum((x - np.mean(x))**2))
    ax.fill_between(x2, y2 + ci, y2 - ci, color="k", edgecolor="None", alpha=0.05)
    ax.plot(x2, y2 + ci, color="k", linestyle="dashed", linewidth=0.5)
    ax.plot(x2, y2 - ci, color="k", linestyle="dashed", linewidth=0.5)
    return ax

def get_rsq(y, y_m):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.dot(y - y_m, y - y_m)
    ss_tot = np.dot(y - np.mean(y), y - np.mean(y))
    return 1 - ss_res / ss_tot

x = [1650.5771, 3501.2242, 2407.0916, 1234.2959,  566.3745,  566.3745,
        566.3745, 1699.1235,  583.5374, 2625.9181, 3501.2242, 4084.7616,
       4814.1833, 4814.1833, 2625.9181, 3209.4555,  802.3639, 6230.1195,
       1132.749 ,  566.3745,  740.6436, 5616.5472, 2310.808 , 2027.0245,
       2567.5644]                                                               
y = [1.919756e+02, 6.116000e-01, 4.800000e-02, 8.400240e+01,
       2.912018e+02, 2.917935e+02, 7.396960e+01, 8.994700e+00,
       9.468660e+01, 4.830000e-02, 1.657940e+01, 1.659410e+01,
       2.595400e+00, 6.450000e-02, 1.751150e+01, 3.669700e+00,
       2.703720e+01, 8.559800e+00, 2.910000e-02, 1.759050e+02,
       7.205200e+00, 5.440000e-02, 4.890000e-02, 9.739000e-01,
       1.276700e+00]                                                           

p, var = np.polyfit(np.log(x),np.log(y), 1, cov=True)
p[1] = round(np.exp(p[1]),4)
p[0] = round(p[0],4)
p = p[::-1]
y_model = equation(p, np.sort(x))                          

# R2 calculation
r2 = get_rsq(y, y_model)


fig,ax = plt.subplots(1, figsize=(2.2, 2.5), dpi=1000)     
ax.scatter(x,y,linewidth=0.5,color='r',zorder=3, s = 15.0, edgecolors = 'k')
funlabel = r'$y = %.4f\cdot x^{%.4f}$' '\n' r'$ r^{2} = %.4f$' '\n'% (p[0],p[1],r2)
plt.plot(np.sort(x),y_model,linewidth=1,color='k',zorder=3)

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

ax.set_xlabel('Observations',fontsize=10)
ax.set_ylabel('Modeled',fontsize=10)

ax.annotate(funlabel, xy=(0.02, 0.98), xycoords='axes fraction', fontsize=8,
            ha='left', va='top')
plt.show()




Solution 1:[1]

I find this concerning, given the popularity of Excel for such calculations. I can think of two ways to cast the y-variable prior to the R2 calculation for this use case.

Option y1: fit in the log domain, then transform back to the linear domain and use y and y_model to calculate R2. This is what you've done (after fixing the sorting issue), and it gives R2 = 0.1243.

Option y2: fit in the log domain and also calculate R2 in the log domain, using y_log = np.log(y) and y_model_log. This is likely what Excel is doing (I don't have Excel or I'd check), but it gives R2 = 0.3257, NOT the 0.5389 you report above.
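As an aside, Option y2 can be cross-checked without hand-rolling the R2 at all: scipy.stats.linregress fits the same log-log line as np.polyfit(..., 1) and returns rvalue directly, so rvalue**2 is the log-domain R2. A minimal sketch with made-up data that follows an exact power law (so the log-domain R2 should come out as 1):

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data obeying y = 2 * x**1.5 exactly
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 2.0 * x ** 1.5

# Fit the log-log line; rvalue**2 is the R2 in the log domain
res = linregress(np.log(x), np.log(y))
print(res.slope)               # power-law exponent -> 1.5
print(np.exp(res.intercept))   # power-law coefficient -> 2.0
print(res.rvalue ** 2)         # log-domain R2 -> 1.0
```

On real data, comparing res.rvalue**2 against a manual log-domain calculation is a quick way to confirm which domain a given R2 was computed in.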

I added the following code to what you have in the question:

def equation_linear(a, b):
    return a[0] + b * a[1]

def get_rsq_nomean(y, y_m):
    ss_res = np.dot((y - y_m),(y - y_m))
    ss_tot = np.dot(y, y)
    return 1 - ss_res / ss_tot

sort_inds = np.argsort(x)
x = np.array(x)[sort_inds]
y = np.array(y)[sort_inds]

p_log, var = np.polyfit(np.log(x), np.log(y), 1, cov=True)
p = np.zeros(p_log.shape)
p[1] = round(np.exp(p_log[1]), 4)
p[0] = round(p_log[0], 4)
p = p[::-1]

y_model = equation(p, x)  # x is already sorted above
y_model_log = equation_linear(p_log[::-1], np.log(x))

# Linear R2 calculation
r2 = get_rsq(y, y_model)

# Log R2 calculation
r2_log = get_rsq(np.log(y), y_model_log)

There is another, less defensible way to calculate R2, which generally gives higher values, shown in the get_rsq_nomean function above: essentially, just assume the mean is zero. Even then, you get R2 values of 0.3564 and 0.4544 for the linear and log calculations, respectively, so that doesn't match Excel either. I can't think of another option, and since the coefficients are the same, I'm really not sure what is going on. Hopefully someone else knows more about this issue.
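To make the gap between the with-mean and no-mean definitions concrete, here is a tiny hand-checkable example (toy numbers, not the question's data). With y = [1, 2, 3] and a model y_m = [1, 2, 4], the residual sum of squares is 1; the usual definition gives R2 = 1 - 1/2 = 0.5, while the no-mean version gives 1 - 1/14 ≈ 0.93:

```python
import numpy as np

def get_rsq(y, y_m):
    # standard R2: residuals vs. variation about the mean
    ss_res = np.dot(y - y_m, y - y_m)
    ss_tot = np.dot(y - np.mean(y), y - np.mean(y))
    return 1 - ss_res / ss_tot

def get_rsq_nomean(y, y_m):
    # "no-mean" R2: normalizes by the raw sum of squares instead
    ss_res = np.dot(y - y_m, y - y_m)
    ss_tot = np.dot(y, y)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])
y_m = np.array([1.0, 2.0, 4.0])   # a single residual of 1

print(get_rsq(y, y_m))         # 1 - 1/2  = 0.5
print(get_rsq_nomean(y, y_m))  # 1 - 1/14 ≈ 0.9286
```

The no-mean denominator is always at least as large, so its R2 is always at least as high, which is why it's the less defensible of the two.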

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: frederick-douglas-pearce