Does sklearn LogisticRegressionCV use all data for the final model?

I was wondering how the final model (i.e. the decision boundary) of LogisticRegressionCV in sklearn is calculated. Say I have some Xdata and ylabels such that

Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary

and now I run

from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)

This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with a single key whose value is an array of shape (n_folds, 1). With these five folds you can get a better idea of how the model performs.
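For concreteness, here is a minimal sketch of what those attributes look like after fitting. The synthetic data from make_classification is my own stand-in for Xdata/ylabels, not something from the original question:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for Xdata / ylabels (any binary dataset would do)
Xdata, ylabels = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegressionCV(Cs=[1.0], cv=5)
clf.fit(Xdata, ylabels)

print(list(clf.scores_.keys()))             # a single key for the binary problem
print(list(clf.scores_.values())[0].shape)  # (5, 1): one score per fold per C value
print(clf.coef_.shape)                      # (1, n_features): the weights used by clf.predict
print(clf.C_)                               # the chosen C (trivially 1.0 here)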

However, I'm confused about what you get from clf.coef_ (I'm assuming the parameters in clf.coef_ are the ones used by clf.predict). A few possibilities come to mind:

  1. The parameters in clf.coef_ are from training the model on all the data.
  2. The parameters in clf.coef_ are from the best-scoring fold.
  3. The parameters in clf.coef_ are averaged across the folds in some way.

I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:

  1. GridSearchCV final model
  2. scikit-learn LogisticRegressionCV: best coefficients
  3. Using cross validation and AUC-ROC for a logistic regression model in sklearn
  4. Evaluating Logistic regression with cross validation


Solution 1:[1]

You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune hyper-parameters.

Hyper-parameters are not learnt by training on the data. They are set prior to learning, on the assumption that they will lead to optimal learning. The scikit-learn documentation describes them as follows:

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.

In the case of LogisticRegression, C is a hyper-parameter which describes the inverse of the regularization strength. The higher the C, the less regularization is applied during training. C is not changed during training; it stays fixed.
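As a quick illustration of that point (my own example, not part of the original answer), fitting plain LogisticRegression with a small and a large C on the same data shows how stronger regularization shrinks the learnt weights:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C = strong regularization, large C = weak regularization
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

print(np.abs(strong.coef_).sum())  # typically much smaller weight magnitudes
print(np.abs(weak.coef_).sum())    # typically larger weight magnitudes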

Now coming to coef_. coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters passed to the constructor), these can vary during training.

There is a separate topic of how to choose good initial values of coef_ so that training is faster and better. That's optimization. Some solvers start with random weights between 0 and 1, others start with 0, and so on. But for the scope of your question, that is not relevant; LogisticRegressionCV is not used for that.

This is what LogisticRegressionCV does:

  1. Get the different C values from the constructor (in your example you passed only 1.0).
  2. For each value of C, cross-validate on the supplied data: a LogisticRegression is fit() on the training data of the current fold and scored on that fold's test data. The test scores from all folds are averaged, and that average becomes the score for the current C. This is done for every C you provided, and the C with the highest average score is chosen.
  3. The chosen C is set as the final C, and LogisticRegression is trained again (by calling fit()) on the whole data (Xdata, ylabels here), as sketched in the code below.
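A rough sketch of that procedure, written against plain LogisticRegression and StratifiedKFold (the function name, variable names, and accuracy scoring are my own choices, not sklearn's internals):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def manual_logreg_cv(Xdata, ylabels, Cs=(1.0,), n_folds=5):
    """Mimic LogisticRegressionCV(refit=True): pick the C with the best
    mean fold accuracy, then refit once on all the data.
    Xdata and ylabels are assumed to be numpy arrays."""
    skf = StratifiedKFold(n_splits=n_folds)
    mean_scores = []
    for C in Cs:
        fold_scores = []
        for train_idx, test_idx in skf.split(Xdata, ylabels):
            model = LogisticRegression(C=C)
            model.fit(Xdata[train_idx], ylabels[train_idx])
            fold_scores.append(model.score(Xdata[test_idx], ylabels[test_idx]))
        mean_scores.append(np.mean(fold_scores))

    best_C = Cs[int(np.argmax(mean_scores))]
    # Step 3: final refit on ALL the data with the chosen C
    final_model = LogisticRegression(C=best_C).fit(Xdata, ylabels)
    return final_model, best_C, mean_scores

With Cs=[1.0] there is nothing to choose between, so this reduces to a single 5-fold evaluation followed by one fit on all the data; coef_ of the returned model plays the role of clf.coef_.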

That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, LassoCV, etc.

The initializing and updating of the coef_ feature weights is done inside the fit() function of the algorithm, which is out of scope for the hyper-parameter tuning. That optimization part depends on the internal optimization algorithm, for example the solver param in the case of LogisticRegression.

Hope this makes things clear. Feel free to ask if you still have any doubts.

Solution 2:[2]

You have the parameter refit=True by default. In the docs you can read:

If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.

So if refit=True, the CV model is retrained using all the data. When it says "a final refit is done using these parameters", it is talking about the C regularization parameter: the refit uses the C that gives the best average score across the K folds.
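One way to convince yourself of this (a quick check of my own, not part of the original answer) is to compare the refitted coefficients against a plain LogisticRegression trained on all the data with the chosen C:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

cv_clf = LogisticRegressionCV(Cs=[1.0], cv=5, refit=True).fit(X, y)
plain_clf = LogisticRegression(C=cv_clf.C_[0]).fit(X, y)

# With refit=True the final model is simply a fit on all the data with the
# chosen C, so the two coefficient vectors should agree up to solver tolerance.
print(np.abs(cv_clf.coef_ - plain_clf.coef_).max())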

When refit=False it gives you the best model from cross-validation. So if you trained 5 folds, you will get the model (coefficients + C + intercept), trained on 4 folds of data, which gave the best score on its fold's test set. I agree that the documentation here is not very clear, but averaging C values and coefficients does not really make much sense.

Solution 3:[3]

I just took a look at the source code. It seems that for refit=True they just select the best hyper-parameters (C and l1_ratio) and retrain the model on all the data.

For refit=False:

It seems they do average the hyper-parameters; see the source code below:

best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))
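So with refit=False each fold contributes its own best C (and the coefficients it learnt at that C), and the reported C_ is the mean of those per-fold values. A small demonstration (the data and the C grid are my own, purely illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegressionCV(Cs=[0.01, 1.0, 100.0], cv=5, refit=False).fit(X, y)

# C_ is the mean of the per-fold best C values, so it may not even be
# one of the Cs you passed in; coef_ is likewise a fold average.
print(clf.Cs_)
print(clf.C_)
print(clf.coef_.shape)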

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2    user1269942
Solution 3    Xin Niu