Linear regression in NumPy that takes account of data point standard deviations

Importing the libraries needed:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

I have the following DataFrame:

volfit=pd.DataFrame({'Osmolarity':[100,200,300,400,600,800],'avgvol':[1.27,1.06,1,0.92,0.84,0.74],'stdev':[0.132,0.053,0,0.037,0.040,0.077]})

Fitting the data:

p,cov=np.polyfit(volfit['Osmolarity'], volfit['avgvol'],1,cov=True)
slope=p[0]
intercept=p[1]
std_err=np.sqrt(np.diag(cov))
slope_std_err= std_err[0]
int_std_err = std_err[1]

Plotting the fit:

fig,ax=plt.subplots(1,1,figsize=(5,4))
plt.scatter(volfit['Osmolarity'], volfit['avgvol'],color="black")
plt.plot(volfit['Osmolarity'],np.polyval(p,volfit['Osmolarity']),color='red')
plt.errorbar(volfit['Osmolarity'], volfit['avgvol'],yerr=volfit['stdev'],ls='none',color='black')
plt.title('y={:.4f}x + {:.1f} \nslope \u03C3 \u00B1 {:.4f} \n  b-int \u03C3 \u00B1 {:.2f}'.format(slope, intercept,slope_std_err,int_std_err),size=18,horizontalalignment='center',x=0.6, y=0.6)
ax.set_xlabel("\u03A0 (mOsm)",size=18)
ax.set_ylabel(r'Cell Volume Change ($\tilde{v}$)',size=18)

This gives me a plot of the data with an unweighted fit line (image not shown). What I want is to use the standard deviations so that polyfit gives more weight to the points with the lowest standard deviations than to the ones with very high standard deviations. The resulting line should pass through (300, 1), because that point has a standard deviation of 0.



Solution 1:[1]

Use the keyword argument w (weights) of np.polyfit. You then have to choose how the weights should depend on the standard deviation; a simple choice is

coeff, cov = np.polyfit(x, y, 1, cov=True, w=(1/sigma))

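With this choice the weight is inversely proportional to the standard deviation, which is also what the NumPy documentation recommends for Gaussian uncertainties (1/sigma, not 1/sigma**2). Because 1/0 is undefined, set the standard deviation of (300, 1) to 1e-16 or some other small, non-zero number.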
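As a minimal sketch of the weighted fit, here it is applied to the data from the question. The names sigma_floor and sigma_safe are introduced here for illustration, and the floor of 1e-6 is an assumption: it is small enough to pull the line essentially through (300, 1), while a much smaller value such as 1e-16 produces an enormous weight and may leave the least-squares problem poorly conditioned.

```python
import numpy as np

# Data from the question
x = np.array([100, 200, 300, 400, 600, 800], dtype=float)
y = np.array([1.27, 1.06, 1.00, 0.92, 0.84, 0.74])
sigma = np.array([0.132, 0.053, 0.0, 0.037, 0.040, 0.077])

# Assumption: replace a zero standard deviation with a small positive floor,
# because 1/0 is not a usable weight.
sigma_floor = 1e-6
sigma_safe = np.where(sigma > 0, sigma, sigma_floor)

# Weighted linear fit; polyfit minimizes sum((w * residual)**2)
coeff, cov = np.polyfit(x, y, 1, w=1.0 / sigma_safe, cov=True)
slope, intercept = coeff
slope_std_err, int_std_err = np.sqrt(np.diag(cov))

# Value of the fitted line at the point with (nominally) zero uncertainty
fit_at_300 = np.polyval(coeff, 300.0)
print(slope, intercept, fit_at_300)
```

The same coeff can be passed to np.polyval in the plotting code from the question to draw the weighted line.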

Edit: A fairly common "hack", if there is a single point that the fit must pass through to be physically meaningful, is to add many copies of that point. That is, add 100, 1000, or some other sufficiently large number of copies of (300, 1) to the data set to force the line through that point.
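The duplication trick might look like the sketch below, again using the data from the question. The names n_copies, x_aug and y_aug are hypothetical, and 1000 is an arbitrary choice. Since polyfit minimizes the sum of squared residuals, n unweighted copies of a point act like a single point with weight sqrt(n), so this is a weighted fit in disguise; the covariance returned for the augmented data would not be a meaningful error estimate, so it is not requested here.

```python
import numpy as np

# Data from the question
x = np.array([100, 200, 300, 400, 600, 800], dtype=float)
y = np.array([1.27, 1.06, 1.00, 0.92, 0.84, 0.74])

# Assumption: 1000 extra copies is "large enough" for this data set
n_copies = 1000
x_aug = np.concatenate([x, np.full(n_copies, 300.0)])
y_aug = np.concatenate([y, np.full(n_copies, 1.0)])

# Ordinary (unweighted) fit on the augmented data
coeff_aug = np.polyfit(x_aug, y_aug, 1)
fit_at_300_aug = np.polyval(coeff_aug, 300.0)
print(coeff_aug, fit_at_300_aug)
```

The line only approaches (300, 1) as n_copies grows; it does not pass through the point exactly.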

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Vegard Jervell