Missing values in Categorical Variables in CatBoost (Python)

CatBoost can encode categorical variables, which is great. However, when categorical features contain missing values in the form of np.nan, they can't be processed. This is stated in the CatBoost documentation on missing values.

However, I read in a GitHub thread that CatBoost can in fact handle categorical variables with missing values.

I tried a mini example to test it:

import numpy as np
from catboost import CatBoostClassifier
# Initialize data
cat_features = [0, 1]
train_data = [["a", np.nan, 1, 4, 5, 6],
              ["a", "b", 4, 5, 6, 7],
              ["c", "d", 30, 40, 50, 60]]
train_labels = [1, 1, -1]
eval_data = [["a", "b", 2, 4, 6, 8],
             ["a", "d", 1, 4, 50, 60]]

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2,
                           learning_rate=1,
                           depth=2)
# Fit model
model.fit(train_data, train_labels, cat_features)

This raises the following error, because categorical column 1 contains a NaN:

CatBoostError: Invalid type for cat_feature[non-default value idx=0,feature_idx=1]=nan : cat_features must be integer or string, real number values and NaN values should be converted to string.

How can I make this code work without manually filling the null value?



Solution 1:[1]

It actually all works fine if you use CatBoost's recommended Pool class to wrap the data.

import numpy as np
from catboost import CatBoostClassifier, Pool

train_data = Pool(data=[[1, np.nan, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])

model = CatBoostClassifier(iterations=10)

model.fit(train_data)

Learning rate set to 0.058839
0:  learn: 0.6879920    total: 2.32ms   remaining: 20.8ms
1:  learn: 0.6815428    total: 2.63ms   remaining: 10.5ms
2:  learn: 0.6765119    total: 2.86ms   remaining: 6.67ms
3:  learn: 0.6715373    total: 3.86ms   remaining: 5.8ms
4:  learn: 0.6653022    total: 4.24ms   remaining: 4.24ms
5:  learn: 0.6591482    total: 5.83ms   remaining: 3.88ms
6:  learn: 0.6543562    total: 6.11ms   remaining: 2.62ms
7:  learn: 0.6496176    total: 6.34ms   remaining: 1.59ms
8:  learn: 0.6436669    total: 6.53ms   remaining: 725us
9:  learn: 0.6377932    total: 6.75ms   remaining: 0us
<catboost.core.CatBoostClassifier at 0x14d60bdd8>
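
Note that the Pool above declares no cat_features, so its NaN sits in a numerical column, which CatBoost accepts as-is. For the categorical columns in the question, the error message itself says that NaN values should be converted to string. Below is a minimal sketch of that conversion using only the standard library. The helper name stringify_missing_categoricals and the "nan" placeholder are assumptions made for illustration, not part of the CatBoost API.

```python
import math


def stringify_missing_categoricals(rows, cat_features, placeholder="nan"):
    """Return a copy of rows where missing values in categorical columns
    are replaced by a string placeholder (hypothetical helper)."""

    def is_missing(value):
        # np.nan is a plain Python float, so math.isnan covers it as well.
        return value is None or (isinstance(value, float) and math.isnan(value))

    cleaned = []
    for row in rows:
        new_row = list(row)  # copy so the input rows are left untouched
        for idx in cat_features:
            if is_missing(new_row[idx]):
                new_row[idx] = placeholder
        cleaned.append(new_row)
    return cleaned


cat_features = [0, 1]
train_data = [["a", float("nan"), 1, 4, 5, 6],
              ["a", "b", 4, 5, 6, 7],
              ["c", "d", 30, 40, 50, 60]]

# Only the categorical columns (0 and 1) are rewritten; numeric columns keep their values.
train_data_clean = stringify_missing_categoricals(train_data, cat_features)
print(train_data_clean[0])
```

The cleaned rows can then be passed to model.fit(train_data_clean, train_labels, cat_features) as in the question. The missing value should then simply be treated as one more category.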

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user4718221