'Regression coefficients do not match conditional means
You can download the following data set from this repo.
| Y | CONST | T | X1 | X1T | X2 | X2T | |
|---|---|---|---|---|---|---|---|
| 0 | 2.31252 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | -0.836074 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | -0.797183 | 1 | 0 | 0 | 0 | 1 | 0 |
I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:
X1 == 0andX2 == 0X1 == 0andX2 == 1X1 == 1andX2 == 0X1 == 1andX2 == 1
Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.
I can do so with the following code:
# Libraries
import pandas as pd
# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()
# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']
# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']
> X1 X2
> 0 0 0.116175
> 1 0.168791
> 1 0 -0.027278
> 1 -0.147601
Problem
I want to recreate these results with the following regression model (m).
# Libraries
from statsmodels.api import OLS
# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()
# Estimated values
m.params[['T','X1T','X2T']]
> T 0.162198
> X1T -0.230372
> X2T -0.034303
I was expecting the coefficients:
T= 0.116175T + X1T= 0.168791T + X2T= -0.027278T + X1T + X2T= -0.147601
Question
Why don't the regression coefficients match the results from the first chunk's output (t['Teq1'] - t['Teq0'])?
Solution 1:[1]
Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters).
Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.
This can be done by declaring the missing interaction columns and fitting the model:
# Declare missing columns
df = df.assign(X1X2 = df['X1'].multiply(df['X2']),
X1X2T = df['X1'].multiply(df['X2T']))
# List of independent variables
cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']
# Fit model
m = OLS.fit(endog=df['Y'], exog=df[cols]).fit()
Alternatively, we can use the formula interface:
# Declare formula
f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'
# Fit model
m = OLS.from_formula(formula=f, data=df).fit()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Arturo Sbr |
