Why does null_model() perform better than other regression methods in R tidymodels?
I am testing several regression models using tidymodels' parsnip. Initially the best-performing model is rand_forest(), but after I add null_model(), the null model is the best in terms of RMSE.
All results come from cross-validated resampling after parameter tuning.
Here is the result for null_model():
```r
> show_best(null_model_grid_results, metric = "rmse")
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   0.421    10  0.0701 Preprocessor1_Model1
> collect_metrics(null_model_grid_results) %>%
+   filter(.metric == "rmse") %>%
+   pull(mean) %>% mean()
[1] 0.4209793
```
And this is the random forest:
```r
> show_best(random_forest_grid_results, metric = "rmse")
# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1  2971    28 rmse    standard   0.420    10  0.0700 Preprocessor1_Model15
2   945    21 rmse    standard   0.420    10  0.0703 Preprocessor1_Model16
3  1090    40 rmse    standard   0.420    10  0.0701 Preprocessor1_Model25
4  2074    32 rmse    standard   0.420    10  0.0702 Preprocessor1_Model13
5  1650    27 rmse    standard   0.420    10  0.0698 Preprocessor1_Model10
> collect_metrics(random_forest_grid_results) %>%
+   filter(.metric == "rmse") %>%
+   pull(mean) %>% mean()
[1] 0.4369285
```
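Note that averaging `mean` over all rows of `collect_metrics()` compares the null model's single candidate against the average of 25 random-forest candidates, including badly tuned ones, so the two numbers above are not like-for-like. A minimal sketch of a fairer comparison, assuming the two tuning results shown above are still in memory:

```r
library(tidymodels)

# Compare only the single best candidate of each model,
# rather than averaging over the whole tuning grid.
best_rmse <- function(res) {
  show_best(res, metric = "rmse", n = 1) %>%
    select(mean, std_err)
}

bind_rows(
  null = best_rmse(null_model_grid_results),
  rf   = best_rmse(random_forest_grid_results),
  .id  = "model"
)
```

On that comparison the best random forest (0.420 ± 0.070) and the null model (0.421 ± 0.070) are essentially tied.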
Here is the code I used for null_model():
```r
library(tidyverse)
library(tidymodels)
library(rules)
library(baguette)
tidymodels_prefer()
library(doParallel)

# Skipping the steps that create:
#   prolif_feat_outcome_dat_train
#   prolif_feat_outcome_dat_folds

null_model_spec <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression") %>%
  translate()

null_model_feature_preproc_rec <-
  recipe(prolif_outcome ~ ., data = prolif_feat_outcome_dat_train) %>%
  step_zv(all_predictors())

null_model_wflow <- workflow() %>%
  add_model(null_model_spec) %>%
  add_recipe(null_model_feature_preproc_rec)

null_model_set <- extract_parameter_set_dials(null_model_wflow)

grid_ctrl <- control_grid(
  verbose = TRUE,
  save_pred = TRUE,
  parallel_over = "everything",
  save_workflow = TRUE
)

nof_grid <- 25

ptm <- proc.time()
cls <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cls)
set.seed(999)
null_model_grid_results <- null_model_wflow %>%
  tune_grid(
    param_info = null_model_set,
    resamples = prolif_feat_outcome_dat_folds,
    grid = nof_grid,
    control = grid_ctrl
  )
stopCluster(cls)
proc.time() - ptm

show_best(null_model_grid_results, metric = "rmse")
collect_metrics(null_model_grid_results) %>%
  filter(.metric == "rmse") %>%
  pull(mean) %>% mean()
```
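Since null_model() exposes no tunable parameters, tune_grid() fits only one candidate here regardless of `grid = 25`. A sketch of the simpler equivalent using `fit_resamples()`, reusing the workflow and folds defined above:

```r
# null_model() has nothing to tune, so resampling the single
# candidate is equivalent to (and cheaper than) tune_grid().
set.seed(999)
null_model_resamp_results <- null_model_wflow %>%
  fit_resamples(
    resamples = prolif_feat_outcome_dat_folds,
    control   = control_resamples(save_pred = TRUE)
  )
collect_metrics(null_model_resamp_results)
```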
And this is the code for rand_forest():
```r
random_forest_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression") %>%
  translate()

random_forest_feature_preproc_rec <-
  recipe(prolif_outcome ~ ., data = prolif_feat_outcome_dat_train) %>%
  step_zv(all_predictors())

random_forest_wflow <- workflow() %>%
  add_model(random_forest_spec) %>%
  add_recipe(random_forest_feature_preproc_rec)

random_forest_set <- extract_parameter_set_dials(random_forest_wflow)

grid_ctrl <- control_grid(
  verbose = TRUE,
  save_pred = TRUE,
  parallel_over = "everything",
  save_workflow = TRUE
)

nof_grid <- 25

ptm <- proc.time()
cls <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cls)
set.seed(999)
random_forest_grid_results <- random_forest_wflow %>%
  tune_grid(
    param_info = random_forest_set,
    resamples = prolif_feat_outcome_dat_folds,
    grid = nof_grid,
    control = grid_ctrl
  )
stopCluster(cls)
proc.time() - ptm

saveRDS(random_forest_grid_results,
        file = paste0("/home/ubuntu/storage1/find_best_model_for_prolif_predictions_tidymodels/data/",
                      wanted_dose, ".random_forest_grid_results.rds"))

show_best(random_forest_grid_results, metric = "rmse")
collect_metrics(random_forest_grid_results) %>%
  filter(.metric == "rmse") %>%
  pull(mean) %>% mean()
```
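Because both models were tuned with `save_pred = TRUE`, the saved resampling predictions can show whether the difference between the best random forest and the null model exceeds per-fold noise. A sketch of such a paired per-fold comparison, assuming `prolif_outcome` is the outcome column and the two tuning results above are in memory:

```r
# Paired per-fold RMSE: best random-forest candidate vs. null model.
best_cfg <- select_best(random_forest_grid_results, metric = "rmse")

rf_fold_rmse <- collect_predictions(random_forest_grid_results) %>%
  semi_join(best_cfg, by = ".config") %>%   # keep only the best candidate
  group_by(id) %>%                          # one RMSE per resample fold
  rmse(truth = prolif_outcome, estimate = .pred) %>%
  select(id, rf = .estimate)

null_fold_rmse <- collect_predictions(null_model_grid_results) %>%
  group_by(id) %>%
  rmse(truth = prolif_outcome, estimate = .pred) %>%
  select(id, null = .estimate)

inner_join(rf_fold_rmse, null_fold_rmse, by = "id") %>%
  mutate(diff = rf - null)                  # positive = forest worse on that fold
```

If `diff` hovers around zero across folds, the predictors carry little usable signal at this outcome's scale, which would explain why the null model ties or beats the forest.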
I expected null_model() to perform far worse than rand_forest(). Why does null_model() perform best?
Is my approach correct? If not, what is the correct way to implement it?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow