Linear Regression with Caret

I could really use your help. I am trying to write an R script that takes some data and fits a glm using the caret package. Here is my code:

set.seed(4000)
# Create training and test data with 80%-20% ratio
new_values$gender <- as.factor(new_values$gender)
trainingRows= createDataPartition(new_values$gender, p= .8, list= FALSE, times= 1)
training_data_set= new_values[trainingRows,]
test_data_set= new_values[-trainingRows,]
# Set up 10-fold cross-validation
fitness_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# Train a logistic regression model via glm with a binomial family (it takes about 5-10 minutes)
linear_regression <-train(gender~ ., data=training_data_set,method="glm",family=binomial(), trControl=fitness_control)
linear_regression

Here is the data table: new_data table

When I run this script, R takes a really long time, and then I get this error message:

Something is wrong; all the Accuracy metric values are missing:

    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1    
Error: Stopping
In addition: There were 11 warnings (use warnings() to see them)

The warning messages are:

Warning messages:

1: model fit failed for Fold01: parameter=none Error : protect(): protection stack overflow

2: model fit failed for Fold02: parameter=none Error : protect(): protection stack overflow

3: model fit failed for Fold03: parameter=none Error : protect(): protection stack overflow

4: model fit failed for Fold04: parameter=none Error : protect(): protection stack overflow

5: model fit failed for Fold05: parameter=none Error : protect(): protection stack overflow

6: model fit failed for Fold06: parameter=none Error : protect(): protection stack overflow

7: model fit failed for Fold07: parameter=none Error : protect(): protection stack overflow

8: model fit failed for Fold08: parameter=none Error : protect(): protection stack overflow

9: model fit failed for Fold09: parameter=none Error : protect(): protection stack overflow

10: model fit failed for Fold10: parameter=none Error : protect(): protection stack overflow

11: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, ... : There were missing values in resampled performance measures.

Can you please help?



Solution 1:[1]

Fitting with glmnet seems to work OK, although I haven't looked to see if the answers actually make sense! I had to sort out some data issues, which might have been what was getting in your way ...
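One data issue of the kind mentioned above is numeric columns that were imported as character. The sketch below uses a small made-up data frame (the columns x1 and x2 are hypothetical, not taken from the original spreadsheet) to show how to inspect column classes, and how character predictors expand into one dummy column per distinct value under the formula interface. Whether this is what triggered the protection stack overflow in the question is an assumption, not something confirmed here.

```r
# Toy data standing in for the spreadsheet (hypothetical columns).
# The numeric values are stored as text, as can happen after an Excel import.
toy <- data.frame(
  gender = c("f", "m", "f", "m", "f", "m"),
  x1 = c("0.12", "1.50", "2.31", "0.77", "3.02", "1.10"),
  x2 = c("10", "11", "12", "13", "14", "15"),
  stringsAsFactors = FALSE
)

# Inspect how each column is stored
col_classes <- sapply(toy, class)
print(col_classes)

# Character predictors are treated as factors by the formula interface,
# so each distinct value becomes its own dummy column
n_cols_chr <- ncol(model.matrix(~ x1 + x2, data = toy))

# Convert the outcome to a factor and every predictor to numeric
toy_num <- toy
toy_num$gender <- factor(toy_num$gender)
pred_cols <- setdiff(names(toy_num), "gender")
toy_num[pred_cols] <- lapply(toy_num[pred_cols], as.numeric)

# After conversion each predictor contributes a single column
n_cols_num <- ncol(model.matrix(~ x1 + x2, data = toy_num))

print(c(as_character = n_cols_chr, as_numeric = n_cols_num))
```

On a real table with many rows, a character column can expand into a very large number of dummy columns, so checking the column classes before calling train() is a cheap first step.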

library(readxl)
library(caret)
library(glmnet)
library(dplyr)
dd <- (read_excel("thema3_results1.xlsx")
    |> select(-1)  ## drop row names
    |> mutate(across(gender, factor))
    |> mutate(across(-gender, as.numeric))  ## convert character to numeric!
)

set.seed(4000)

trainingRows <- createDataPartition(dd$gender, p= .8, list= FALSE, times= 1)
training_data_set <-  dd[trainingRows,]
test_data_set <- dd[-trainingRows,]
# Set up 10-fold cross-validation
fitness_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
system.time(logistic_reg <- train(gender~ ., 
                          data=training_data_set,
                          method="glmnet",
                          family="binomial", ## not binomial() for glmnet ...
                          trControl=fitness_control))

The training step took about 2 seconds on my machine.

This seems to be getting accuracy == 1, which probably means the model is still overfitting.
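One way to probe a suspiciously perfect accuracy is to score the model on the held-out test rows, which cross-validation never touched. The following is a minimal sketch on simulated data (the data frame sim and its columns are made up for illustration); it assumes the caret and glmnet packages are installed and is not a diagnosis of the original data set.

```r
library(caret)  # method = "glmnet" also needs the glmnet package installed

set.seed(4000)

# Simulated data (hypothetical): two numeric predictors, noisy binary outcome
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$gender <- factor(ifelse(sim$x1 + rnorm(n) > 0, "m", "f"))

# Same 80%-20% split as in the code above
idx <- createDataPartition(sim$gender, p = 0.8, list = FALSE, times = 1)
train_set <- sim[idx, ]
test_set <- sim[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)
fit <- train(gender ~ .,
             data = train_set,
             method = "glmnet",
             family = "binomial",
             trControl = ctrl)

# Predict on rows the model never saw and compare with the true labels
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$gender)
print(cm$overall["Accuracy"])
```

With the objects from the answer, the equivalent calls would be predict(logistic_reg, newdata = test_data_set) followed by confusionMatrix() against test_data_set$gender. If the test accuracy is also 1, it may be worth checking whether one of the predictors encodes the outcome directly.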

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ben Bolker