How do I correctly implement curve fitting using scipy's curve_fit() function in Python and properly compare different curve equations?

I am trying to understand whether I am carrying out curve fitting correctly and appropriately using the curve_fit() function from scipy.optimize in Python. I have a script that iterates through each column of my dataframe, plots the data, fits a curve to it, and then judges how well that equation describes the entire data set by averaging the root mean square error (RMSE) across all of the factory columns.

First, here is my dataframe: [screenshot of the dataframe]

I would have provided the actual data here, but I am not sure how. To describe it: I am looking at how pollution levels decay as distance from a factory site increases. There is a distance column, and each subsequent column represents a different factory (factory "0", factory "1", factory "2", factory "3", etc.). Pollution levels are recorded at each distance moving away from each factory in 30-meter increments, with 0 meters representing the location at the factory site itself. My working hypothesis is that as we move away from a factory site, the pollution levels decay. The research question is: from my specific data set, overall, across all of the studied factories, how does pollution decay with increasing distance? Is it linear decay? Exponential decay? Logarithmic decay? Polynomial decay? Inverse-logistic decay? That is what I am trying to find out.
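For reference, a minimal synthetic frame with the same layout (a distance column in 30-meter steps plus one column per factory; all column names and values here are made up for illustration) could be built like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
distance = np.arange(0, 300, 30)  # 0 m to 270 m in 30 m increments

# One column per factory, named "0", "1", ...; values decay with distance
# plus a little noise (purely illustrative numbers).
data = pd.DataFrame({'Distance': distance})
for i in range(4):
    data[str(i)] = 100 * np.exp(-0.01 * distance) + rng.normal(0, 2, distance.size)

print(data.head())
```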

And so I am trying to fit different curve equations to my data to see which type of curve best represents the decay I am looking at. Here is my code:

import numpy as np  # needed for np.log, np.exp, and the RMSE math
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# (`data` is the pandas DataFrame described above)

#Un-comment each of these individual curve functions to try them out individually

# def func(x, a, b):
#     return a * x + b

# def func(x, a, b):
#     return -a *np.log(x) + b

# def func(x, a, b, c):
#     return a * np.exp(-b * x) + c

# def func(x, a, b, c):
#     return a / (1 + b*c**(-x))

# def func(x, a, b, c):
#     return a*x**2+b*x+c

RMSE_list = []

xdata = data['Distance'].to_numpy()
for i in range(8):  # one column per factory, named "0" .. "7"
#     plt.figure(figsize=(3, 3))
    factory = str(i)
    ydata = data[factory].to_numpy()
    plt.plot(xdata, ydata, 'b-', label='data')
    
    popt, pcov = curve_fit(func, xdata, ydata)
    plt.plot(xdata, func(xdata, *popt), 'g--')
    
    # Calculate RMSE using the fitted model itself, so the same code
    # works for every candidate func regardless of parameter count
    yfit = func(xdata, *popt)
    rmse = np.sqrt(np.mean((yfit - ydata) ** 2))
    RMSE_list.append(rmse)

RMSE_Average = np.mean(RMSE_list)
print("RMSE Average:")
print(RMSE_Average)

And so here I am trying out curves for linear decay, logarithmic decay, exponential decay, inverse-logistic decay, and second-degree polynomial decay. I try each of these curve equations individually as func and fit it to the data from each factory. I then find the RMSE of each fit for each factory, and average them to get a single RMSE describing all factories for each curve type. That is all my code does. What I would like to know is whether I am actually doing this right, or whether my approach at least sounds like it is on the right track. This whole field of curve fitting is very new to me, and I am going about it as far as my beginner level allows. The code runs, I suppose, but I get some strange outcomes.
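A subtle trap when swapping models in and out this way: if the RMSE is computed from a hard-coded formula (e.g. a * xdata + b) instead of from func itself, the script silently reports the linear model's error for every fit, and the averaged RMSE never changes between curve types. A model-agnostic helper avoids that (a sketch; the exponential model and starting guess below are just illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    # Example model: exponential decay (swap in any candidate here)
    return a * np.exp(-b * x) + c

def fit_rmse(func, xdata, ydata, **kwargs):
    """Fit func and return (popt, rmse), predicting with func itself."""
    popt, _ = curve_fit(func, xdata, ydata, **kwargs)
    yfit = func(xdata, *popt)  # valid for any model, any parameter count
    rmse = np.sqrt(np.mean((yfit - ydata) ** 2))
    return popt, rmse

# Illustrative noiseless data: a clean exponential, so RMSE should be ~0
x = np.linspace(0, 300, 11)
y = 100 * np.exp(-0.01 * x) + 5
popt, rmse = fit_rmse(func, x, y, p0=(100, 0.01, 5))
print(rmse)
```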

When I try this all out I get the following results for each curve fitting attempt: [plots and average RMSE for each curve type]. Overall the idea makes sense to me, but there are specific issues I encountered that I do not understand, particularly with the logarithmic, exponential, and inverse-logistic curve fittings. For the logarithmic curve attempt I received the following warning for each factory:

C:\Users\MyName\AppData\Local\Temp/ipykernel_7788/37186127.py:8: RuntimeWarning: divide by zero encountered in log return -a *np.log(x) + b

Also, just looking at the plot, the fitted logarithmic curve sits way below the actual data; it does not even touch it!
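That RuntimeWarning comes from evaluating np.log(0) on the 0-meter row: the log is undefined at zero, so the optimizer is handed an infinite value there and the fit can land far from the data. Two common workarounds, sketched below, are to exclude the x = 0 row or to fit a shifted logarithm (the +30 offset is an assumption matching the 30-meter grid spacing, not something curve_fit requires):

```python
import numpy as np
from scipy.optimize import curve_fit

def log_func(x, a, b):
    return -a * np.log(x) + b

def shifted_log_func(x, a, b):
    # log(x + 30) is finite at x = 0; the offset is a modeling choice
    return -a * np.log(x + 30) + b

xdata = np.arange(0, 300, 30)
ydata = -5 * np.log(xdata + 30) + 40  # illustrative noiseless data

# Option 1: exclude the x = 0 row before fitting the plain log model
mask = xdata > 0
popt1, _ = curve_fit(log_func, xdata[mask], ydata[mask])

# Option 2: fit the shifted model on all rows, including x = 0
popt2, _ = curve_fit(shifted_log_func, xdata, ydata)
print(popt2)  # recovers (5, 40) for this noiseless data
```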

And for the exponential curve attempt I received the following warning for each factory:

C:\Users\MyName\AppData\Roaming\Python\Python39\site-packages\scipy\optimize\minpack.py:833: OptimizeWarning: Covariance of the parameters could not be estimated

I am thinking that perhaps a lot of this might be resolved by setting bounds. I set the parameter bounds for the exponential attempt to bounds=(0, [3., 1., 0.5]), based on the curve_fit() documentation from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

I get the same warning for each factory, but I now receive this exponential curve fitting result: [plot of the bounded exponential fit]. The visual fit looks better, but the average RMSE is unchanged, which confuses me. And for the inverse-logistic fitting, the fit looks good visually, but the average RMSE is very high, which would suggest the fit is actually bad.
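For the exponential model, the "covariance of the parameters could not be estimated" warning usually means the optimizer started from the default guess of all ones and stalled, not that bounds are required. Supplying a rough p0 scaled to the data often resolves it; the guesses below are heuristic assumptions, not values curve_fit prescribes:

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_func(x, a, b, c):
    return a * np.exp(-b * x) + c

xdata = np.arange(0, 300, 30)
ydata = 80 * np.exp(-0.02 * xdata) + 10  # illustrative noiseless data

# Rough, data-driven starting guesses:
#   a ~ drop from first to last value, c ~ final value, b ~ 1 / x-range
p0 = (ydata[0] - ydata[-1], 1.0 / xdata[-1], ydata[-1])
popt, pcov = curve_fit(exp_func, xdata, ydata, p0=p0)
print(popt)
```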

And so my question here is: how can I improve my approach and code to most accurately and effectively fit these curves to my data and compare them using curve_fit() in Python? I suspect it may be a matter of methodically setting the bounds for each curve fit, but I am not sure.
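One way to compare the candidate equations in a single pass, rather than un-commenting each func by hand, is a dict mapping model names to (function, starting-guess) pairs, averaging RMSE per model across all factories. A sketch with synthetic columns standing in for the dataframe (the model names, guesses, and data here are all illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate models keyed by name; p0 guesses are rough assumptions
models = {
    'linear':      (lambda x, a, b: a * x + b,                  (1, 1)),
    'quadratic':   (lambda x, a, b, c: a * x**2 + b * x + c,    (1, 1, 1)),
    'exponential': (lambda x, a, b, c: a * np.exp(-b * x) + c,  (50, 0.01, 10)),
}

rng = np.random.default_rng(1)
xdata = np.arange(0, 300, 30)
# Illustrative "factories": exponential decay plus noise
columns = [60 * np.exp(-0.015 * xdata) + 12 + rng.normal(0, 1, xdata.size)
           for _ in range(4)]

avg_rmse = {}
for name, (f, p0) in models.items():
    rmses = []
    for ydata in columns:
        popt, _ = curve_fit(f, xdata, ydata, p0=p0, maxfev=10000)
        rmses.append(np.sqrt(np.mean((f(xdata, *popt) - ydata) ** 2)))
    avg_rmse[name] = np.mean(rmses)  # one average RMSE per curve type

best = min(avg_rmse, key=avg_rmse.get)
print(avg_rmse, best)
```

Note that RMSE alone favors models with more free parameters, so when comparing equations of different complexity it can help to also look at the fits visually, as you are already doing.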



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
