MLR3 Basics for Categorical Variables

I'm (extremely) new to using mlr3, and am using it to model flight delays. I have some numerical variables, like Z, and some categorical variables, like X. Let's just say I want to fit a very simple model predicting delays based on both X and Z. From a theoretical perspective, we would usually encode the levels of X as dummy variables and then fit a linear regression. It looks like mlr3 is doing this itself, though: when I create a task and run the learner, I can see that it has created coefficients for the different factor levels, i.e. it treats them as separate dummy variables.

However, I can see that many other programmers still one-hot encode their categorical variables into dummies first. So my question is: is one-hot encoding necessary, or does mlr3 do it for you?

edit: Below is a sample of my data. My predictor variables are X (categorical) and Z (numerical). Y is the dependent variable and is numerical.

  Y    X    Z
 -3    M    7.5
  5    W    9.2
 10    T    3.1
  4    T    2.2
-13    M   10.1
  2    M    1.7
  4    T    4.5

This is the code I use

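library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
# df2 is the data frame with columns Y, X (factor) and Z shown above
task <- TaskRegr$new('apples', backend = df2, target = 'Y')
# assumed definition of the learner: plain linear regression ("regr.lm")
glrn_lm <- lrn('regr.lm')
set.seed(38)
# 99% of the rows for training, the rest for testing
train_set <- sample(task$nrow, 0.99 * task$nrow)
test_set <- setdiff(seq_len(task$nrow), train_set)
glrn_lm$train(task, row_ids = train_set)
glrn_lm$predict(task, row_ids = test_set)$score()
# fit the same model directly with lm() to inspect the coefficients
summary(lm(formula = task$formula(), data = task$data()))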

And the results of that line will be something like:

Call:
lm(formula = task$formula(), data = task$data())

Residuals:
   Min     1Q Median     3Q    Max 
-39.62  -8.71  -4.77   0.27 537.12 

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                4.888e+00  3.233e+00   1.512 0.130542    
XT                                         4.564e-03  3.776e-04  12.087  < 2e-16 ***
XW                                         4.564e-03  3.776e-04  12.087  < 2e-16 ***
Z                                         -4.259e+00  6.437e-01  -6.616 3.78e-11 ***
 

(The numbers up here are all way off - please don't mind that)

So as you can see, it derives two new variables called XT and XW, denoting the level T of X and the level W of X. I assume that, as in dummy coding, M is the reference level here. So, like I said earlier, regr.lm seems to already be doing the dummy coding for us. Is that really the case?



Solution 1:[1]

In general, mlr3 doesn't automatically encode your categorical factors for you. Whether using categorical features works out of the box depends on the learner you're using -- some, like the linear regression you're using, can work with categorical features directly, while others can't (and if you try to use those you'd get an error message indicating that).
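One way to check this up front, rather than waiting for an error, is to look at the feature types a learner declares. This is a minimal sketch; regr.glmnet is only my choice of a second learner for comparison and is not part of the question.

```r
library(mlr3)
library(mlr3learners)

# Each learner lists the feature types it accepts
lm_types <- lrn("regr.lm")$feature_types
glmnet_types <- lrn("regr.glmnet")$feature_types

print(lm_types)
print(glmnet_types)

# TRUE means factor columns can be passed to the learner as they are
"factor" %in% lm_types
"factor" %in% glmnet_types
```

If "factor" is missing from the list, the factor columns need to be encoded before training that learner.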

In general, there's no downside to one-hot-encoding your categorical features, so if you want to try many different learners I'd recommend doing that so that you don't have to worry about whether a particular learner requires it.
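As a rough sketch of how that could look with mlr3pipelines: the encoding step is put in front of the learner, so the same pipeline can be reused with a different learner. The data frame below simply reuses the seven example rows from the question, and the names `graph`, `glrn` and `encoded` are only illustrative.

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# Toy data: the example rows from the question, with X stored as a factor
df2 <- data.frame(
  Y = c(-3, 5, 10, 4, -13, 2, 4),
  X = factor(c("M", "W", "T", "T", "M", "M", "T")),
  Z = c(7.5, 9.2, 3.1, 2.2, 10.1, 1.7, 4.5)
)
task <- as_task_regr(df2, target = "Y", id = "apples")

# Look at what the encoding step does to the task on its own
encoded <- po("encode", method = "one-hot")$train(list(task))[[1]]
print(encoded$feature_names)

# Encoding followed by the learner, wrapped so it behaves like one learner
graph <- po("encode", method = "one-hot") %>>% lrn("regr.lm")
glrn <- as_learner(graph)
glrn$train(task)
```

For a linear model with an intercept, full one-hot encoding leaves one redundant column, so `method = "treatment"` (which drops the reference level) may be the better fit there. Swapping `lrn("regr.lm")` for another regression learner keeps the rest of the pipeline unchanged.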

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Lars Kotthoff