RandomForest classifier does not learn over generations
I am trying to use a Random Forest to classify heart disease, tuned with GASearchCV from sklearn-genetic-opt.
This is the original dataset; I have pre-processed it (using LabelEncoder, BinaryEncoder, OneHotEncoder, and MaxAbsScaler). I have omitted the pre-processing code for brevity, but the resulting 38 feature columns are:
'BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime', 'Smoking_0', 'Smoking_1', 'AlcoholDrinking_0', 'AlcoholDrinking_1', 'Stroke_0', 'Stroke_1', 'DiffWalking_0', 'DiffWalking_1', 'Sex_0', 'Sex_1', 'PhysicalActivity_0', 'PhysicalActivity_1', 'Asthma_0', 'Asthma_1', 'SkinCancer_0', 'SkinCancer_1', 'Race_American Indian/Alaskan Native', 'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other', 'Race_White', 'KidneyDisease_No', 'KidneyDisease_Yes', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good', 'Diabetic_No', 'Diabetic_No, borderline diabetes', 'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)'
'HeartDisease' is the target.
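For context, here is a rough sketch of how that pre-processing could look. This is an illustrative reconstruction inferred from the feature names above (using pd.get_dummies in place of OneHotEncoder for brevity), not my exact code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from category_encoders import BinaryEncoder

dataset = pd.read_csv('data/heart_2020_cleaned.csv')

# Ordinal column and target: LabelEncoder
for col in ['AgeCategory', 'HeartDisease']:
    dataset[col] = LabelEncoder().fit_transform(dataset[col])

# Two-level categorical columns: BinaryEncoder (yields e.g. Smoking_0, Smoking_1)
binary_cols = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex',
               'PhysicalActivity', 'Asthma', 'SkinCancer']
dataset = BinaryEncoder(cols=binary_cols).fit_transform(dataset)

# Multi-level categorical columns: one-hot encoding (yields e.g. Race_White)
dataset = pd.get_dummies(dataset, columns=['Race', 'KidneyDisease', 'GenHealth', 'Diabetic'])

# Numeric columns: MaxAbsScaler
numeric_cols = ['BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime']
dataset[numeric_cols] = MaxAbsScaler().fit_transform(dataset[numeric_cols])
```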
CODE:

```python
import pandas as pd
from time import time

from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Continuous, Integer

# Loading the dataset
dataset = pd.read_csv('data/heart_2020_cleaned.csv')

# Slicing the dataset to the first 10000 rows to ease computation
dataset = dataset.iloc[:10000]

# Separating target from features
features = dataset.drop(columns='HeartDisease')  # X
target = dataset['HeartDisease']                 # y

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42, stratify=target)

# Undersample the training data
sampler = RandomUnderSampler(random_state=42)
balanced_features, balanced_target = sampler.fit_resample(X_train, y_train)

# Classification starts here
clf = RandomForestClassifier(n_jobs=5)
start = time()

param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
              'bootstrap': Categorical([True, False]),
              'max_depth': Integer(2, 30),
              'max_leaf_nodes': Integer(2, 35),
              'n_estimators': Integer(100, 300)}

cv = StratifiedKFold(n_splits=3, shuffle=True)

# Using GASearchCV to search for the best parameters
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=10,
                               generations=35,
                               param_grid=param_grid,
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4)

# Train and optimise the estimator
# (note that this fits on the original X_train/y_train; the undersampled
# balanced_features/balanced_target from above are never used)
evolved_estimator.fit(X_train, y_train)
end = time()
result = end - start
print('%.3f seconds' % result)

# Best parameters found
print(evolved_estimator.best_params_)

# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)
```
OUTPUT:

```
gen nevals fitness fitness_std fitness_max fitness_min
0 10 0.901343 1.11022e-16 0.901343 0.901343
1 20 0.901343 1.11022e-16 0.901343 0.901343
2 20 0.901343 1.11022e-16 0.901343 0.901343
3 20 0.901343 1.11022e-16 0.901343 0.901343
4 20 0.901343 1.11022e-16 0.901343 0.901343
5 20 0.901343 1.11022e-16 0.901343 0.901343
6 20 0.901343 1.11022e-16 0.901343 0.901343
7 20 0.901343 1.11022e-16 0.901343 0.901343
8 20 0.901343 1.11022e-16 0.901343 0.901343
9 20 0.901343 1.11022e-16 0.901343 0.901343
10 20 0.901343 1.11022e-16 0.901343 0.901343
11 20 0.901343 1.11022e-16 0.901343 0.901343
12 20 0.901343 1.11022e-16 0.901343 0.901343
13 20 0.901343 1.11022e-16 0.901343 0.901343
14 20 0.901343 1.11022e-16 0.901343 0.901343
15 20 0.901343 1.11022e-16 0.901343 0.901343
16 20 0.901343 1.11022e-16 0.901343 0.901343
17 20 0.901343 1.11022e-16 0.901343 0.901343
18 20 0.901343 1.11022e-16 0.901343 0.901343
19 20 0.901343 1.11022e-16 0.901343 0.901343
20 20 0.901343 1.11022e-16 0.901343 0.901343
21 20 0.901343 1.11022e-16 0.901343 0.901343
22 20 0.901343 1.11022e-16 0.901343 0.901343
24 20 0.901343 1.11022e-16 0.901343 0.901343
sklearn-genetic-opt closed prematurely. Will use the current best model.
INFO: Stopping the algorithm
235.563 seconds
{'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}
0.9015151515151515
Stats achieved in each generation: {'gen': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24], 'fitness': [0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636, 0.9013433237339636], 'fitness_std': [1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16, 1.1102230246251565e-16], 'fitness_max': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638], 'fitness_min': [0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638, 0.9013433237339638]}
Best k solutions: {0: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}, 1: {'min_weight_fraction_leaf': 0.22091581038404914, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 26, 'n_estimators': 142}, 2: {'min_weight_fraction_leaf': 0.09793187151751966, 'bootstrap': True, 'max_depth': 3, 'max_leaf_nodes': 28, 'n_estimators': 177}, 3: {'min_weight_fraction_leaf': 0.023058297512734277, 'bootstrap': True, 'max_depth': 5, 'max_leaf_nodes': 33, 'n_estimators': 300}}
```

ISSUE: The fitness is identical in every generation (0.901343) and the reported standard deviation sits at machine epsilon, meaning every individual in every population scored exactly the same. The classifier does not seem to be learning over generations. What is the cause?
Solution 1:

There are a couple of things you can try to improve it:

- If the data is not too heavily imbalanced, remove the balancing; a random forest should be able to handle it on its own, but make sure to change the metric to something imbalance-aware, such as the F1 score.
- Judging by the log, GASearchCV got stuck at a single point. You can change this by increasing the population size (10 might be too low), and you can also increase the mutation_probability parameter so that the search explores more diverse solutions. A sketch of both changes follows this list.
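A minimal sketch of those two suggestions applied to the original call (clf, cv, and param_grid as defined in the question; the scoring='f1', population_size=30, and mutation_probability=0.4 values are illustrative assumptions, not tuned recommendations):

```python
# Same GASearchCV call with an imbalance-aware metric, a larger population,
# and a higher mutation probability (all three values are illustrative)
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='f1',              # instead of accuracy
                               population_size=30,        # 10 may be too low
                               generations=35,
                               param_grid=param_grid,
                               mutation_probability=0.4,  # explore more diverse candidates
                               n_jobs=5,
                               verbose=True,
                               keep_top_k=4)
```

Note that roughly 90% of the rows are in the majority class, so the constant fitness of 0.901343 is likely just what an always-predict-the-majority-class model scores; with accuracy as the metric, the GA has nothing to distinguish candidates by.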
I hope it helps
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rodrigo A |
