RandomForest classifier does not learn over generations

I am trying to use Random Forest to classify heart disease using GASearchCV from sklearn-genetic.

This is the original dataset; I have pre-processed it using LabelEncoder, BinaryEncoder, OneHotEncoder and MaxAbsScaler. I have omitted the pre-processing code for brevity (a hedged reconstruction is sketched after the column list), but the resulting 38 feature columns are:

'BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime', 'Smoking_0', 'Smoking_1', 'AlcoholDrinking_0', 'AlcoholDrinking_1', 'Stroke_0', 'Stroke_1', 'DiffWalking_0', 'DiffWalking_1', 'Sex_0', 'Sex_1', 'PhysicalActivity_0', 'PhysicalActivity_1', 'Asthma_0', 'Asthma_1', 'SkinCancer_0', 'SkinCancer_1', 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other', 'Race_White', 'KidneyDisease_No', 'KidneyDisease_Yes', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Diabetic_No', 'Diabetic_No, borderline diabetes', 'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)'

'HeartDisease' is the target.
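
For context, here is a minimal sketch of preprocessing that could produce these column names. This is my reconstruction, not the actual omitted code: the encoder-to-column assignments are inferred from the naming patterns (BinaryEncoder yields <col>_0 / <col>_1, one-hot encoding yields <col>_<category>), BinaryEncoder comes from the category_encoders package, and pd.get_dummies stands in for the OneHotEncoder step for brevity.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from category_encoders import BinaryEncoder  # pip install category_encoders

dataset = pd.read_csv('data/heart_2020_cleaned.csv')

# Ordered string category -> integer codes
dataset['AgeCategory'] = LabelEncoder().fit_transform(dataset['AgeCategory'])

# Two-category columns that appear above as <col>_0 / <col>_1
binary_cols = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
               'Sex', 'PhysicalActivity', 'Asthma', 'SkinCancer']
dataset = BinaryEncoder(cols=binary_cols).fit_transform(dataset)

# Multi-category columns that appear above as <col>_<category>
dataset = pd.get_dummies(dataset,
                         columns=['Race', 'KidneyDisease', 'GenHealth', 'Diabetic'])

# Scale the numeric columns into [-1, 1]
numeric_cols = ['BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime']
dataset[numeric_cols] = MaxAbsScaler().fit_transform(dataset[numeric_cols])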

CODE:

# Imports
import pandas as pd
from time import time
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Continuous, Integer

# Loading the dataset
dataset = pd.read_csv('data/heart_2020_cleaned.csv')


# Slice the dataset to the first 10,000 rows to keep computation manageable
dataset = dataset.iloc[:10000]

# Separating target from features
features = dataset.drop(columns='HeartDisease')     # X
target = dataset['HeartDisease']                    # y

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=42, stratify=target)

# Undersample the training data
sampler = RandomUnderSampler(random_state=42)
balanced_features, balanced_target = sampler.fit_resample(
    X_train, y_train)
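
# NOTE: balanced_features / balanced_target are created here but never used;
# the fit() call further down runs on the imbalanced X_train / y_train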

# Classification starts here
clf = RandomForestClassifier(n_jobs=5)

start = time()

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}

cv = StratifiedKFold(n_splits=3, shuffle=True)

# Using GASearchCV to search for best parameters
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=10,
                               generations=35,
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4)

# Train and optimise the estimator
evolved_estimator.fit(X_train, y_train)

end = time()
result = end - start
print('%.3f seconds' % result)

# Best parameters found
print(evolved_estimator.best_params_)

# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)

OUTPUT:

gen nevals  fitness     fitness_std fitness_max fitness_min
0   10      0.901343    1.11022e-16 0.901343    0.901343   
1   20      0.901343    1.11022e-16 0.901343    0.901343   
2   20      0.901343    1.11022e-16 0.901343    0.901343   
3   20      0.901343    1.11022e-16 0.901343    0.901343   
4   20      0.901343    1.11022e-16 0.901343    0.901343   
5   20      0.901343    1.11022e-16 0.901343    0.901343   
6   20      0.901343    1.11022e-16 0.901343    0.901343   
7   20      0.901343    1.11022e-16 0.901343    0.901343   
8   20      0.901343    1.11022e-16 0.901343    0.901343   
9   20      0.901343    1.11022e-16 0.901343    0.901343   
10  20      0.901343    1.11022e-16 0.901343    0.901343   
11  20      0.901343    1.11022e-16 0.901343    0.901343   
12  20      0.901343    1.11022e-16 0.901343    0.901343   
13  20      0.901343    1.11022e-16 0.901343    0.901343   
14  20      0.901343    1.11022e-16 0.901343    0.901343   
15  20      0.901343    1.11022e-16 0.901343    0.901343   
16  20      0.901343    1.11022e-16 0.901343    0.901343   
17  20      0.901343    1.11022e-16 0.901343    0.901343   
18  20      0.901343    1.11022e-16 0.901343    0.901343   
19  20      0.901343    1.11022e-16 0.901343    0.901343   
20  20      0.901343    1.11022e-16 0.901343    0.901343   
21  20      0.901343    1.11022e-16 0.901343    0.901343   
22  20      0.901343    1.11022e-16 0.901343    0.901343   
24  20      0.901343    1.11022e-16 0.901343    0.901343   

sklearn-genetic-opt closed prematurely. Will use the current best model.
INFO: Stopping the algorithm
235.563 seconds
{'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}
0.9015151515151515
Stats achieved in each generation:  {'gen': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24], 'fitness': [0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636], 'fitness_std': [1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16], 'fitness_max': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638], 'fitness_min': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638]}
Best k solutions:  {0: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}, 1: {'min_weight_fraction_leaf': 0.22091581038404914, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 26, 'n_estimators': 142}, 2: {'min_weight_fraction_leaf': 0.09793187151751966, 'bootstrap': True, 'max_depth': 3, 'max_leaf_nodes': 28, 'n_estimators': 177}, 3: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}}

ISSUE: The classifier does not seem to be learning over generations: every individual in every generation reports the identical fitness of 0.901343. What is the cause?



Solution 1:

There are a couple of things you can try to improve it:

  • If the class imbalance is not too severe, remove the balancing step; a random forest can often handle moderate imbalance on its own. Either way, make sure to change the metric to something like the F1 score: with a heavily imbalanced target, a model that always predicts the majority class already scores around 0.90 accuracy, which would explain the flat 0.901343 fitness in your log.
  • Judging by the log, GASearchCV got stuck at a single point. You can counter this by increasing the population size (10 is probably too low) and by raising the mutation_probability parameter so the algorithm explores more diverse solutions. A sketch of both changes follows this list.
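
A minimal sketch of both changes, reusing clf, cv and param_grid from the question. The parameter values are illustrative starting points rather than tuned settings, and pos_label='Yes' assumes the target still carries its original string labels:

from sklearn.metrics import f1_score, make_scorer
from sklearn_genetic import GASearchCV

# Score the minority ('Yes') class with F1 instead of accuracy, so a model
# that always predicts 'No' no longer looks near-optimal to the search
f1_yes = make_scorer(f1_score, pos_label='Yes')

evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring=f1_yes,
                               population_size=50,        # up from 10
                               generations=35,
                               mutation_probability=0.4,  # explore more diverse candidates
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4)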

I hope it helps

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
Solution 1: Rodrigo A