Why is there a difference in MSE when training an XGBoost model incrementally on batches vs. training on the entire data?
X_train shape: (8500, 4)
X_val shape: (637, 4)
X_test shape: (200, 4)
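The split itself isn't shown in the post; here is a minimal sketch of a chronological split that would produce these sizes (the DataFrame df, its column names, and the target column are assumptions for illustration):

n_train, n_val, n_test = 8500, 637, 200

feature_cols = ["% Usage", "hour", "dayofweek", "dayofmonth"]
X, y = df[feature_cols], df["target"]

# Keep the rows in time order rather than shuffling
X_train, y_train = X.iloc[:n_train], y.iloc[:n_train]
X_val, y_val = X.iloc[n_train:n_train + n_val], y.iloc[n_train:n_train + n_val]
X_test, y_test = X.iloc[n_train + n_val:n_train + n_val + n_test], y.iloc[n_train + n_val:n_train + n_val + n_test]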
My training data (X_train) has four columns (% Usage, hour, dayofweek, dayofmonth) and looks like this:
% Usage hour dayofweek dayofmonth
0 14.265347 22 0 24
1 14.265347 22 0 24
2 13.996887 22 0 24
3 13.775730 22 0 24
4 13.775730 22 0 24
and the target (y_train):
0 14.265347
1 13.996887
2 13.775730
3 13.775730
4 14.269257
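Judging from the sample rows, y_train appears to be % Usage shifted one step ahead (a next-step forecast). A minimal sketch of how such a target could be built; this is inferred from the data above, not stated in the post:

import pandas as pd

# Inferred target construction: predict the next % Usage value
df["target"] = df["% Usage"].shift(-1)
df = df.dropna(subset=["target"])  # the final row has no next value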
The XGBoost model is trained incrementally. What I'm doing here is loading the model from the previous batch (if it exists) and continuing training on the current batch. Once that's done, I save the model for use with the next batch. I'm trying to mimic checkpoint-like behaviour:
import os

import xgboost as xgb
from sklearn.metrics import mean_squared_error

batch_size = 850
xgb_model = xgb.XGBRegressor(n_estimators=1000)

for start in range(0, len(X_train), batch_size):
    # Skip batches that already have a saved checkpoint
    if f'xgb_model_{start}.model' in os.listdir():
        print(f"Skipping for batch {start}:{start+batch_size}")
        continue
    if start == 0:
        # First batch: train from scratch
        xgb_model.fit(
            X_train[start:start+batch_size],
            y_train[start:start+batch_size],
            eval_set=[(X_val, y_val)],
            early_stopping_rounds=50,
            verbose=False
        )
    else:
        # Later batches: continue training from the previous checkpoint
        xgb_model.fit(
            X_train[start:start+batch_size],
            y_train[start:start+batch_size],
            eval_set=[(X_val, y_val)],
            early_stopping_rounds=50,
            verbose=False,
            xgb_model=f'xgb_model_{start-batch_size}.model'
        )
    # Save a checkpoint for the next batch, then evaluate on the test set
    xgb_model.save_model(f'xgb_model_{start}.model')
    y_pred = xgb_model.predict(X_test)
    print(f"MSE : {mean_squared_error(y_test, y_pred)}")
Output:
MSE : 0.8678093773264584
MSE : 2.046533869862948
MSE : 1.1568086137077669
MSE : 2.291347951272582
MSE : 1.5389712184418989
MSE : 1.4457848862752014
MSE : 1.7740441472551185
MSE : 4.179429599396931
MSE : 6.211388954159769
MSE : 4.753687392359755
Not only is the MSE higher, it also increases as training proceeds through the batches. But if I train a model with the same parameters on the entire dataset, the MSE is much lower:
xgb_model = xgb.XGBRegressor(n_estimators=1000)
xgb_model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=50,
    verbose=False
)
y_pred = xgb_model.predict(X_test)
print(f"MSE : {mean_squared_error(y_test, y_pred)}")
Output:
MSE : 0.4356189240236812
There's a noticeable difference in the mean squared error: 0.435 vs. 4.753. Why is this so? Shouldn't the results be almost similar, if not equal?
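As a sanity check (not part of the original post), one could count the trees in each saved checkpoint to verify that boosting actually continues across batches rather than restarting; a minimal sketch, assuming the checkpoint files produced by the loop above:

import xgboost as xgb

# If continuation works, the tree count should grow from batch to batch
for start in range(0, 8500, 850):
    booster = xgb.Booster()
    booster.load_model(f'xgb_model_{start}.model')
    print(f"checkpoint {start}: {len(booster.get_dump())} trees")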
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0 per Stack Overflow's attribution requirements.
