SMOTE_NC function in R: error in the output

thank you in advance for your time!

I'm having trouble with the SMOTE_NC function in R (https://rdrr.io/github/dongyuanwu/RSBID/man/SMOTE_NC.html). In short, I have a dataset with continuous and categorical (binary only) variables, and I would like to use SMOTE to undersample the majority class and oversample the minority class before fitting a random forest. Since plain SMOTE does not work well when categorical and continuous variables are mixed, I used SMOTE_NC. Below is a short script that reproduces my problem using the iris dataset.

data(iris)

View(iris)

iris <- iris[-c(70:150),] # Shrinking the dataset to have only two species 
                          # with one majority and one minority class

iris <- droplevels(iris)

levels(iris$Species) <- c(0,1)

str(iris)

iris$rng <- sample(c(0,1), replace=TRUE, size=nrow(iris)) # Adding a new random categorical 
                                                          # column

iris$rng <- as.factor(iris$rng)
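As a quick sanity check on the imbalance this subsetting creates, the class counts can be tabulated with base R. This is only a sketch: `iris_small` and `class_counts` are hypothetical names used so the original `iris` object is left untouched.

```r
# Rebuild the two-class subset under a separate (hypothetical) name.
data(iris)
iris_small <- droplevels(iris[-c(70:150), ])  # keep rows 1-69 only
levels(iris_small$Species) <- c(0, 1)         # setosa -> 0, versicolor -> 1

# Count observations per class to see the majority/minority split.
class_counts <- table(iris_small$Species)
print(class_counts)
```

Rows 1 to 50 are one species and rows 51 to 69 the other, so class 0 is the majority and class 1 the minority.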

I installed the RSBID package, which provides SMOTE_NC, as described here (https://rdrr.io/github/dongyuanwu/RSBID/f/README.md):

install.packages("devtools")
devtools::install_github("dongyuanwu/RSBID", build_vignettes=TRUE)

library(RSBID)

The problem comes when I try to use the function:

irisSMO <- SMOTE_NC(iris, iris$Species, 100, 5)

And this error shows up:

Error in if (outcome < 1 | outcome > ncol(data)) { : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In Ops.factor(outcome, 1) : ‘<’ makes no sense for factor variables
2: In Ops.factor(outcome, ncol(data)) :
  ‘>’ makes no sense for factor variables
3: In if (outcome < 1 | outcome > ncol(data)) { :
  the condition has length > 1 and only the first element will be used

A note on the warnings: my R session prints them in Italian (I don't know why), so "makes no sense for factor variables" is my own translation of the original "è senza senso per variabili factor".
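The mechanism behind the error can be reproduced in base R without RSBID. The sketch below assumes only what the error message shows, namely that the function evaluates a condition of the form `outcome < 1`; the variable names `outcome`, `cmp`, and `res` are illustrative.

```r
# A small factor standing in for iris$Species (illustrative values).
outcome <- factor(c(0, 0, 1))

# Ordering comparisons are undefined for factors: R warns and returns NA.
cmp <- suppressWarnings(outcome < 1)
print(cmp)

# if() cannot branch on NA, which raises the "missing value" error.
res <- tryCatch(
  suppressWarnings(if (outcome[1] < 1) "low" else "high"),
  error = function(e) conditionMessage(e)
)
print(res)
```

So a factor passed where a single number or name is expected turns the comparison into `NA`, and `if ()` then fails on it.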

Anyway, I don't know how to proceed. I looked for solutions online but could not find any. (I read that SMOTE_NC is better implemented in Python, but I don't know that language and would prefer to stay in R if possible.) Like the example above, my original dataset has no missing values; it has almost 50 continuous and 50 categorical (binary) variables.

Thank you for the help, sorry for the newbie mistakes, and have a nice day!



Solution 1:[1]

Pass the name of the outcome variable as a string ("Species"), not the variable's values (iris$Species).

set.seed(519)
data(iris)
iris <- iris[-c(70:150),] # Shrinking the dataset to have only two species 
# with one majority and one minority class
iris <- droplevels(iris)
levels(iris$Species) <- c(0,1)
str(iris)
#> 'data.frame':    69 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
iris$rng <- sample(c(0,1), replace=TRUE, size=nrow(iris)) # Adding a new random categorical 
# column

iris$rng <- as.factor(iris$rng)

remotes::install_github("dongyuanwu/RSBID", build_vignettes=FALSE)
#> Skipping install of 'RSBID' from a github remote, the SHA1 (e640b85b) has not changed since last install.
#>   Use `force = TRUE` to force installation

library(RSBID)
#> Loading required package: FNN
#> Loading required package: clustMixType
#> Loading required package: klaR
#> Loading required package: MASS

irisSMO <- SMOTE_NC(iris, "Species", 100, 5)
#> Variables are continous and categorical, SMOTE_NC could be used.
#>   |======================================================================| 100%
head(irisSMO)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species rng
#> 1          5.1         3.5          1.4         0.2       0   1
#> 2          4.9         3.0          1.4         0.2       0   1
#> 3          4.7         3.2          1.3         0.2       0   0
#> 4          4.6         3.1          1.5         0.2       0   0
#> 5          5.0         3.6          1.4         0.2       0   0
#> 6          5.4         3.9          1.7         0.4       0   1

Created on 2022-02-13 by the reprex package (v2.0.1)
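The range check quoted in the error message (`outcome < 1 | outcome > ncol(data)`) suggests that a column position may also be accepted for the outcome. That is an inference from the message, not something confirmed here. A minimal base-R sketch for looking up the name and position of the outcome column, with `outcome_name`, `outcome_pos`, and `in_range` as hypothetical names:

```r
data(iris)

# The outcome is identified by its column name, given as a string.
outcome_name <- "Species"

# Position of that column in the data frame.
outcome_pos <- which(names(iris) == outcome_name)

# The same kind of range check as in the error message, on a plain number.
in_range <- outcome_pos >= 1 && outcome_pos <= ncol(iris)
print(outcome_pos)
print(in_range)
```

Either way, the argument is a single identifier for the column rather than the column's contents.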

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DaveArmstrong