R - figuring out which columns an xgboost model is expecting in new data for predictions

We have a .model file that has an xgboost model. Here's a snippet of our code loading the model:

> xg_model <- xgb.load("../model_outputs/our_saved_model.model")
> xg_model
##### xgb.Booster
raw: 1.6 Mb 
xgb.attributes:
  niter
niter: 149

I didn't create this model, but I am tasked with passing new data to the model in order to make predictions. Unfortunately, I am hitting this error:

Error in predict.xgb.Booster(xg_model, xgb.DMatrix(as.matrix(our_dataframe_of_data))) : 
  [01:34:01] amalgamation/../src/learner.cc:1183: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (38 vs. 40) : Number of columns does not match number of features in booster.

... so it's clear that our dataframe has 40 columns, but this model was trained to expect at most 38. What's unclear is exactly which 38 columns our xg_model is expecting. Is there a function to call, a plot to draw, or anything else that would show which 38 columns the model was trained on? We currently have only the trained model, not the R code that trained it.



Solution 1:[1]

What's your XGBoost version? It matters, because the way XGBoost stores its "schema specification" has changed significantly across versions.

For now, check which attributes are available on your xgb.Booster object, in particular whether nfeatures and feature_names are defined:

print(xg_model$nfeatures)
print(xg_model$feature_names)

I believe your xgb.Booster object has these attributes available, because how else would it know to demand 38 features?
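
As a rough sketch of that check, the snippet below prints the installed xgboost version (if the package is present) and summarises the two fields. `describe_schema` and `fake_model` are hypothetical names introduced here for illustration, and the helper assumes a list-like booster whose fields can be read with `[[`; boosters from other xgboost versions may be structured differently.

```r
# Report the installed xgboost version, if the package is available.
if (requireNamespace("xgboost", quietly = TRUE)) {
  print(utils::packageVersion("xgboost"))
}

# Hypothetical helper: summarise the schema fields of a list-like booster.
# Assumes the fields can be read with `[[`; a missing field comes back as NULL.
describe_schema <- function(model) {
  nfeat  <- model[["nfeatures"]]
  fnames <- model[["feature_names"]]
  list(
    nfeatures     = nfeat,
    feature_names = fnames,
    has_names     = !is.null(fnames)
  )
}

# Stand-in object for illustration only; pass xg_model instead.
fake_model <- list(nfeatures = 3L, feature_names = NULL)
schema <- describe_schema(fake_model)
str(schema)
```

If `has_names` comes back FALSE, the feature count may still be available, but the names would have to be recovered some other way.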

Solution 2:[2]

I had the same issue and was able to solve it by extracting the model features like this:

ModelVars <- xgb.importance(feature_names = colnames(our_dataframe_of_data), model = xg_model)

After this it was just a matter of subsetting my dataframe to the features listed in ModelVars. The predict function then returned scores, even though, as expected, the number of features I passed was smaller than the number of features in the training dataset.
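
The subsetting step might look something like the sketch below, written in base R so it runs without the model. The `expected` vector and the toy `new_data` frame are made-up stand-ins; in practice the names would come from the `Feature` column of `ModelVars`. Reordering the columns to match the recovered names is a cautious default here, on the assumption that a plain matrix may be matched to the model by position.

```r
# Hypothetical stand-in for the names recovered from the model
# (for example, ModelVars$Feature).
expected <- c("age", "income", "score")

# Toy data with one extra column and a different column order.
new_data <- data.frame(
  score  = c(0.2, 0.9),
  extra  = c(1, 2),
  age    = c(31, 45),
  income = c(52000, 61000)
)

# Report which columns are surplus and fail early if any are missing.
missing_cols <- setdiff(expected, colnames(new_data))
extra_cols   <- setdiff(colnames(new_data), expected)
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Keep only the expected columns, in the expected order, as a numeric matrix.
aligned <- as.matrix(new_data[, expected, drop = FALSE])
print(colnames(aligned))
```

The `aligned` matrix is what would then be wrapped in `xgb.DMatrix()` and passed to `predict()`, as in the question.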

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 prasanna sundar