Random Forest further improvement [closed]

Following Jason Brownlee's tutorials, I developed my own random forest classifier code. I paste it below; I would like to know what further improvements I can make to increase its accuracy.

from numpy import mean
from numpy import std
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle = True, random_state=0)

scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)

x_test = scaler.transform(X_test)



# get a list of models to evaluate
def get_models():
    models = dict()
    # consider tree depths from 1 to 7 and None=full
    depths = [i for i in range(1,8)] + [None]
    for n in depths:
        models[str(n)] = RandomForestClassifier(max_depth=n)
    return models

# evaluate a model using cross-validation
def evaluate_model(model, X, y):
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the results
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores


# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X, y)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

The data: X is a matrix of shape (140, 20000) and y is a (140,) vector of categorical labels.

I got the following results, but would like to explore how to improve the accuracy further.

>1 0.573 (0.107)
>2 0.650 (0.089)
>3 0.647 (0.118)
>4 0.676 (0.101)
>5 0.708 (0.103)
>6 0.698 (0.124)
>7 0.726 (0.121)
>None 0.700 (0.107)


Solution 1:

Here's what stands out to me:

  • You split the data but do not use the splits.
  • You're scaling the data, but tree-based methods like random forests do not need this step.
  • You are writing your own tuning loop instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to sweep over a second hyperparameter).
  • If you use GridSearchCV you don't need to do your own cross validation.
  • You're using accuracy for evaluation, which is usually not a great evaluation metric for multi-class classification. Weighted F1 is better.
  • If you're doing cross-validation, you need to put the scaler inside the CV loop (e.g. using a pipeline), because otherwise the scaler has already seen the validation data... but you don't need a scaler for this learning algorithm, so the point is moot (a minimal sketch follows this list anyway).
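
For completeness, here's a minimal sketch of that pipeline idea, in case you ever switch to a scale-sensitive model. The SVC here is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

# Scaler and model bundled together: the scaler is refit on the
# training folds only during each CV split, so no leakage.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC()),
])
print(cross_val_score(pipe, X, y, scoring='f1_weighted', cv=5).mean())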

I would probably do something like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X, y = make_classification()

# Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)

# Make things for the cross validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
param_grid = {'max_depth': np.arange(3, 8)}
model = RandomForestClassifier(random_state=1)

# Create and train the cross validation.
clf = GridSearchCV(model, param_grid,
                   scoring='f1_weighted',
                   cv=cv, verbose=3)

clf.fit(X_train, y_train)

Take a look at clf.cv_results_ for the scores and so on, which you can plot if you want. By default GridSearchCV refits a final model on the best hyperparameters, so you can make predictions with clf directly.
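
For example, continuing from the code above (the exact printing is just illustrative):

from sklearn.metrics import f1_score

# Best hyperparameters found by the grid search.
print(clf.best_params_)

# Mean cross-validated score for each max_depth tried.
for depth, score in zip(clf.cv_results_['param_max_depth'],
                        clf.cv_results_['mean_test_score']):
    print(depth, score)

# GridSearchCV refits the best model, so clf predicts directly;
# evaluate on the held-out test split.
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred, average='weighted'))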

Almost forgot... you asked about improving the model :) Here are some ideas:

  • The above will help you tune more hyperparameters (e.g. max_features, n_estimators, and min_samples_leaf). But don't get too carried away with hyperparameter tuning.
  • You could try transforming some features (columns in X), or adding new ones.
  • Look for more data, e.g. more rows, higher quality labels, etc.
  • Address any issues with class imbalance.
  • Try a more sophisticated algorithm, like gradient boosted trees (there are models in sklearn, or take a look at xgboost). A sketch of this and of a wider tuning grid follows this list.
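
To make the first and last of those ideas concrete, here's a sketch with a wider (purely illustrative) parameter grid and scikit-learn's HistGradientBoostingClassifier as a gradient-boosted alternative; treat the specific values as starting points, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(random_state=0)

# Illustrative starting values only -- tune to your own data.
param_grid = {
    'max_depth': [3, 5, 7, None],
    'n_estimators': [100, 300],
    'max_features': ['sqrt', 0.1],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Gradient boosted trees (stable in scikit-learn >= 1.0); often a
# strong drop-in alternative to a random forest.
gbt = HistGradientBoostingClassifier(random_state=1)
print(cross_val_score(gbt, X, y, scoring='f1_weighted', cv=5).mean())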

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
