Opposite coefficient "sign" for two logistic regressions
I am trying to build an xG model using Distance (from goal) as the feature; the target is a dummy variable indicating whether the shot resulted in a goal or not, so this is a simple logistic regression. I tried to replicate a model fitted with the statsmodels package, which gave a positive coefficient of 0.16 and an intercept of -0.5.
When I fitted the same model with scikit-learn, the coefficient was -0.16, and the intercept was likewise flipped, at around 0.5. So somehow the coefficients have "flipped".
Dataset example:
| Goal | X | Y | C | Distance | Angle |
|---|---|---|---|---|---|
| 1 | 12 | 41 | 9.0 | 13.891814 | 0.474451 |
| 0 | 15 | 52 | 2.0 | 15.803560 | 0.453823 |
| 0 | 19 | 33 | 17.0 | 22.805811 | 0.280597 |
| 0 | 25 | 30 | 20.0 | 29.292704 | 0.223680 |
| 0 | 10 | 39 | 11.0 | 12.703248 | 0.479051 |
scikit-learn code:
```python
from sklearn.linear_model import LogisticRegression

feature_cols = ['Distance']
X = shots_model[feature_cols]  # Features
y = shots_model['Goal']        # Target
y = y.astype('category')

m1 = LogisticRegression()
m1.fit(X_train, y_train)
```
statsmodels code:
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 'model' is a string of formula terms, e.g. "Distance"
test_model = smf.glm(formula="Goal ~ " + model, data=shots_model,
                     family=sm.families.Binomial()).fit()
print(test_model.summary())
b = test_model.params
```
I am probably missing something simple, as I am pretty new to Machine Learning, and this has been puzzling me for some time now. Please help.
Solution 1:[1]
I am not sure what your outputs are. However, what you can do now is test your models on new test data. The predictions obtained are fractional values (between 0 and 1) that denote the probability of the shot resulting in a goal. Round these values to obtain the discrete values 1 or 0. After that, you can use a confusion matrix or the accuracy_score function to check the accuracy of your models; a sketch of this check is below. For more detailed code, you can refer to this article: https://www.geeksforgeeks.org/logistic-regression-using-statsmodels/
I think if you can get corresponding binary outcomes from your two models, and their accuracies are close, then you do not need to worry much about the flipped coefficient. Basically, my idea is that if you get accurate predictions (1 or 0) from both methods, then everything is fine. Hope my answer is helpful to you!
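A minimal sketch of that check, assuming held-out data X_test / y_test (not shown in the question) and the two fitted models m1 (scikit-learn) and test_model (statsmodels) from above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# scikit-learn: probability of the positive class (Goal = 1), rounded at 0.5
p_skl = m1.predict_proba(X_test)[:, 1]
pred_skl = (p_skl >= 0.5).astype(int)

# statsmodels GLM: predict() already returns fitted probabilities
p_sm = test_model.predict(X_test)
pred_sm = (p_sm >= 0.5).astype(int)

print("sklearn accuracy:    ", accuracy_score(y_test, pred_skl))
print("statsmodels accuracy:", accuracy_score(y_test, pred_sm))
print(confusion_matrix(y_test, pred_skl))
print(confusion_matrix(y_test, pred_sm))
```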
Solution 2:[2]
The default fitting objectives for logistic regression differ between statsmodels and sklearn. Sklearn applies an L2 penalty by default, which adds a quadratic term on the coefficients to the loss function and shrinks them toward zero.
Regarding why your coefficients have flipped, this can happen when you invert the encoding of your target variable.
For example, the statsmodels model may have been trained with 0 meaning a miss and 1 meaning a goal, while the sklearn model was trained with 0 meaning a goal and 1 meaning a miss.
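A small self-contained illustration of that flip, using synthetic data and names made up for this example (not the question's data): fitting the same unpenalized logistic regression on y and on 1 - y negates both the slope and the intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
distance = rng.uniform(5, 35, size=(500, 1))
# simulate goals that become less likely with distance
prob_goal = 1.0 / (1.0 + np.exp(-(2.0 - 0.15 * distance[:, 0])))
goal = rng.binomial(1, prob_goal)

# penalty=None gives plain maximum likelihood (older sklearn versions: penalty='none')
m_goal = LogisticRegression(penalty=None).fit(distance, goal)
m_miss = LogisticRegression(penalty=None).fit(distance, 1 - goal)  # inverted encoding

print(m_goal.coef_, m_goal.intercept_)  # negative slope, positive intercept
print(m_miss.coef_, m_miss.intercept_)  # same magnitudes, opposite signs
```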
To be honest, it's hard to tell from the info you have given us. However, here are a couple of things to keep in mind regarding the code you posted:

- m1 is being trained with objects that do not exist in the code you posted (X_train and y_train are not declared).
- Check that an intercept is actually being fitted in both models. The statsmodels formula API adds one automatically and sklearn's LogisticRegression fits one by default, but a hand-built design matrix would need a column of ones in X, X_train and X_test.
- You do not need the step y.astype('category').

Basically, make sure that y, y_train and y_test are encoded the same way for both models, and set m1 = LogisticRegression(penalty='none') (on newer scikit-learn versions, penalty=None); see the sketch below.
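Putting those points together, here is a sketch (using the frame and column names from the question, shots_model / Goal / Distance, and assuming Distance is the only feature) of how the two fits could be made directly comparable; penalty=None is the spelling for recent scikit-learn, older versions take penalty='none':

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

# statsmodels: the formula API adds the intercept automatically
glm_fit = smf.glm("Goal ~ Distance", data=shots_model,
                  family=sm.families.Binomial()).fit()
print(glm_fit.params)  # Intercept and Distance

# scikit-learn: same 0/1 target encoding, no regularization
X = shots_model[['Distance']]
y = shots_model['Goal'].astype(int)
skl_fit = LogisticRegression(penalty=None).fit(X, y)
print(skl_fit.intercept_, skl_fit.coef_)  # should now closely match the GLM estimates
```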
In my opinion, it makes sense that the coefficient for Distance is negative because scoring a goal should become less likely the farther away you are.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jujie YANG |
| Solution 2 | |
