'extract formula from OLS Regression Results

My Goal is: Extracting the formula (not only the coefs) after a linear regression done with statsmodel.

Context : I have a pandas dataframe ,

df
      x    y     z 

0   0.0  2.0    54.200
1   0.0  2.2    70.160
2   0.0  2.4    89.000
3   0.0  2.6    110.960

i 'am doing a linear regression using statsmodels.api (2 variables, polynomial degree=3) , i'am happy with this regression.

OLS Regression Results                            
==============================================================================
Dep. Variable:                      z   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.193e+29
Date:                Sun, 31 May 2020   Prob (F-statistic):               0.00
Time:                        22:04:49   Log-Likelihood:                 9444.6
No. Observations:                 400   AIC:                        -1.887e+04
Df Residuals:                     390   BIC:                        -1.883e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2000   3.33e-11   6.01e+09      0.000       0.200       0.200
x1             2.0000   1.16e-11   1.72e+11      0.000       2.000       2.000
x2             1.0000   2.63e-11    3.8e+10      0.000       1.000       1.000
x3             4.0000   3.85e-12   1.04e+12      0.000       4.000       4.000
x4            12.0000   4.36e-12   2.75e+12      0.000      12.000      12.000
x5             3.0000   6.81e-12   4.41e+11      0.000       3.000       3.000
x6             6.0000   5.74e-13   1.05e+13      0.000       6.000       6.000
x7            13.0000   4.99e-13    2.6e+13      0.000      13.000      13.000
x8            14.0000   4.99e-13   2.81e+13      0.000      14.000      14.000
x9             5.0000   5.74e-13   8.71e+12      0.000       5.000       5.000
==============================================================================
Omnibus:                       25.163   Durbin-Watson:                   0.038
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               28.834
Skew:                          -0.655   Prob(JB):                     5.48e-07
Kurtosis:                       2.872   Cond. No.                     6.66e+03
==============================================================================


I need to implement that outside of python , (ms excel) , I would like to know the formula.

I know it is polynomial deg3 , but I wondering how to know which coeff apply to which term of the equation. Something like that :

Equation_image

For exemple : x7 coeef is the coeff for x³ ,y², x²y , ... ?

Note: this is a simplify version of my problem , in reallity I have 3 variables , deg:3 so 20 coefs.

This is a simpler exemple of code to get started with my case:


# %% Question extract formula from linear regresion coeff
#Import
import numpy as np   # version : '1.18.1'
import pandas as pd  # version'1.0.0'

import statsmodels.api as sm   # version : '0.10.1'
from sklearn.preprocessing import PolynomialFeatures

from itertools import product


#%% Creating the dummies datas
def function_for_df(row):
    x= row['x']
    y= row['y']
    return  unknow_function(x,y)

def unknow_function(x,y):
    """ 
     This is to generate the datas , of  course in reality I do not know the formula
    """
    r =0.2+ \
       6*x**3+4*x**2+2*x+ \
       5*y**3+3*y**2+1*y+ \
       12*x*y + 13*x**2*y+ 14*x*y**2
    return r

# input data
x_input = np.arange(0, 4 , 0.2)
y_input = np.arange(2, 6 , 0.2)

# create a simple dataframe with dummies datas
df = pd.DataFrame(list(product(x_input, y_input)), columns=['x', 'y'])
df['z'] = df.apply(function_for_df, axis=1)

# In the reality I start from there !

#%%  creating the model
X = df[['x','y']].astype(float) #
Y = df['z'].astype(float) 

polynomial_features_final= PolynomialFeatures(degree=3)
X3 = polynomial_features_final.fit_transform(X)

model = sm.OLS(Y, X3).fit()
predictions = model.predict(X3) 

print_model = model.summary()
print(print_model)

#%% using the model to make predictions, no problem
def model_predict(x_sample, y_samples):
    df_input = pd.DataFrame({  "x":x_sample, "y":y_samples }, index=[0])
    X_input = polynomial_features_final.fit_transform(df_input)
    prediction = model.predict(X_input)
    return prediction

print("prediction for x=2, y=3.2 :" ,model_predict(2 ,3.2))

# How to extract the formula for the "model" ?
#Thanks

Side notes:

A desciption like the one given by pasty ModelDesc will be fine:


from patsy import ModelDesc
ModelDesc.from_formula("y ~ x")

# or even better :

desc = ModelDesc.from_formula("y ~ (a + b + c + d) ** 2")
desc.describe()

But i 'am not able to make the bridge between my model and patsy.ModelDesc. Thanks for your help.



Solution 1:[1]

As Josef said in the comment, i had to look at : sklearn PolynomialFeature .

Then I found this answer :

    PolynomialFeatures(degree=3).get_feature_names()

In the context :


    #%%  creating the model
    X = df[['x','y']].astype(float) #
    Y = df['z'].astype(float) 

    polynomial_features_final= PolynomialFeatures(degree=3)
    #X3 = polynomial_features_final.fit_transform(X)

    X3 = polynomial_features_final.fit_transform(df[['x', 'y']].to_numpy())

    model = sm.OLS(Y, X3).fit()
    predictions = model.predict(X3) 

    print_model = model.summary()
    print(print_model)

    print("\n-- ONE SOLUTION --\n Coef and Term name :")
    results = list(zip(model.params, polynomial_features_final.get_feature_names()))
    print(results)

Solution 2:[2]

You can fit the model using Patsy formula language using statsmodels.formula.api

For instance, you can have:

import statsmodels.formula.api as smf

# You do not need fit_transform to generate poly features in df
# You can specify the model using vectorized functions, many transformations are supported
model = smf.ols(formula='z ~ x*y + I(x**2)*I(y**2) + I(x**3)*I(y**3)', data=df).fit()
print_model = model.summary()

print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      z   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.193e+05
Date:                Tue, 15 Feb 2022   Prob (F-statistic):               0.00
Time:                        16:53:19   Log-Likelihood:                -1478.1
No. Observations:                 400   AIC:                             2976.
Df Residuals:                     390   BIC:                             3016.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept             229.3425     23.994      9.558      0.000     182.168     276.517
x                    -180.0081     11.822    -15.226      0.000    -203.251    -156.765
y                    -136.2619     19.039     -7.157      0.000    -173.694     -98.830
x:y                   118.0431      2.864     41.210      0.000     112.411     123.675
I(x ** 2)              31.2537      4.257      7.341      0.000      22.884      39.624
I(y ** 2)              22.5973      4.986      4.532      0.000      12.795      32.399
I(x ** 2):I(y ** 2)     1.4176      0.213      6.671      0.000       1.000       1.835
I(x ** 3)               4.7562      0.595      7.991      0.000       3.586       5.926
I(y ** 3)               4.7601      0.423     11.250      0.000       3.928       5.592
I(x ** 3):I(y ** 3)     0.0166      0.006      2.915      0.004       0.005       0.028
==============================================================================
Omnibus:                       28.012   Durbin-Watson:                   0.182
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               71.076
Skew:                          -0.311   Prob(JB):                     3.68e-16
Kurtosis:                       4.969   Cond. No.                     1.32e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

print(model.params)
Intercept              229.342451
x                     -180.008082
y                     -136.261886
x:y                    118.043098
I(x ** 2)               31.253705
I(y ** 2)               22.597298
I(x ** 2):I(y ** 2)      1.417551
I(x ** 3)                4.756205
I(y ** 3)                4.760144
I(x ** 3):I(y ** 3)      0.016611
dtype: float64

Then by simple print(model.params), you will have a natural bridge between the model and patsy.ModelDesc (here that is analogous to the formula definition)

Notes: Raw polynomial used here in the demo. You may switch to using orthogonal polynomials. That will help explain the contribution of each term to variance in the outcome, ref to StackExchange

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 fredvol
Solution 2 bkahya