Converting categorical data into numerical values

I have a dataset with a lot of categorical variables mixed with numerical ones. I am trying to run a regression about obesity where the variables I want to include are state, age, and sex. There is also, for example, a question that asks whether the respondent has exercised in the last 30 days, with the answer options 1 = yes, 2 = no, 7 = not sure/don't know, BLANK = not answered or missing.

How can I get this dataset into the correct form to run a regression? In other words, how do I create a smaller dataframe that includes only the variables I need from the very large dataframe?

Here are the first 10 rows of only the data I need:

[Image of the first 10 rows not included]

Also, out of 50 states I only want to use three specific states. How do I filter so that I only use the data from those three states? Each state has a code, for example Kentucky = 21, Colorado = 8, New York = 36.



Solution 1:[1]

You ask to convert from categorical to numeric but your data is already in numeric form! (Even though it actually contains categorical information, from what you describe).

We are trying to achieve three tasks:

  1. Coerce the data type from double (continuous) to factor (categorical). This is required to include a variable as a categorical variable in a regression model. It can be done with the function factor().

  2. Add labels so that it's easier to understand which integers code which value.

  3. Subset your data so that only observations that meet a certain criterion are kept. We can do this with subset in base R or filter from dplyr.

First I create a small reproducible example to showcase the solution:

dat <- data.frame(
  BMI = c(1660, 1918, NA, NA, 2034),
  Exercise = c(1, 1, 2, 7, NA),
  state = c(21, 8, 36, 17, 3)
)

Solution in base R

dat$Exercise <- factor(dat$Exercise,
                       levels = c(2, 1, 7),
                       labels = c("no", "yes", "not sure"))

dat$state <- factor(dat$state,
                    levels = c(3, 8, 17, 21, 36),
                    labels = c("ohio", "colorado", "california", "kentucky", "new york"))

dat_subset <- dat |> subset(state %in% c("colorado", "kentucky", "new york"))
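
The question also asks how to pull only the needed variables out of a very large data frame. A minimal base R sketch, assuming a hypothetical wide data frame called big in which id and height stand in for columns you do not need: subset() can filter rows by the numeric state codes and pick columns in one call, and declaring only the three retained codes as levels avoids carrying unused factor levels into the model.

```r
# Hypothetical wide survey data; `id` and `height` are placeholder columns
big <- data.frame(
  id = 1:5,
  BMI = c(1660, 1918, NA, NA, 2034),
  Exercise = c(1, 1, 2, 7, NA),
  state = c(21, 8, 36, 17, 3),
  height = c(170, 165, 180, 175, 160)
)

# Keep only the three states (by numeric code) and only the needed columns
small <- subset(big,
                state %in% c(21, 8, 36),
                select = c(BMI, Exercise, state))

# Label the codes that remain; only the three retained codes become levels
small$state <- factor(small$state,
                      levels = c(8, 21, 36),
                      labels = c("colorado", "kentucky", "new york"))

names(small)
nlevels(small$state)
```

If the factor was created before subsetting, as in the solution above, droplevels(dat_subset) removes the levels that no longer occur in the data.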

Solution in dplyr

library(dplyr)

dat <- dat |> mutate(
  Exercise = factor(Exercise,
                    levels = c(2, 1, 7),
                    labels = c("no", "yes", "not sure")),
  state = factor(state,
                 levels = c(3, 8, 17, 21, 36),
                 labels = c("ohio", "colorado",  "california", "kentucky", "new york")))

dat_subset <- dat |> filter(state %in% c("colorado", "kentucky", "new york"))

Output (with either solution)

dat_subset
#>    BMI Exercise    state
#> 1 1660      yes kentucky
#> 2 1918      yes colorado
#> 3   NA       no new york
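
To see why the factor conversion matters for a regression, it can help to inspect the dummy columns R would build from a factor. The sketch below is only an illustration on the toy data, not a fitted obesity model: it uses model.matrix() with the default treatment contrasts, where the first level ("no", because the levels were given as c(2, 1, 7)) acts as the reference category.

```r
# Rebuild the toy example with Exercise as a labelled factor
dat <- data.frame(
  BMI = c(1660, 1918, NA, NA, 2034),
  Exercise = factor(c(1, 1, 2, 7, NA),
                    levels = c(2, 1, 7),
                    labels = c("no", "yes", "not sure"))
)

# The first level is the reference category under default contrasts
levels(dat$Exercise)

# Design matrix a regression on Exercise would use: an intercept plus
# one 0/1 column for each non-reference level
mm <- model.matrix(~ Exercise, data = dat)
colnames(mm)

# relevel() changes the reference category, here to "yes"
dat$Exercise2 <- relevel(dat$Exercise, ref = "yes")
levels(dat$Exercise2)
```

Rows where Exercise is NA are dropped from the design matrix under the default na.action, which is the same thing lm() would do when fitting.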

Solution 2:[2]

@Adar, first it's best to use dput(mydata[1:10, ]) to give us the first 10 rows of your data so we can see it exactly.

I would use explicit recoding like below to create a new variable:

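myvar <- c("Yes", "1", "Unsure", "2", "2", "1", "Unsure")

# The fallback value (myvar) is character, so the nested ifelse() returns
# character; convert to numeric at the end
newvar <- as.numeric(
  ifelse(myvar == "Unsure", -1,    # code Unsure to -1
  ifelse(myvar == "Yes", 1,        # code Yes to 1
  ifelse(myvar == "No", 2,         # code No to 2
         myvar)))                  # otherwise use original value
)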

Of course, you can recode these in any way. If you use factor(), any value not listed in levels becomes NA, so you might lose values you intend to keep. Always check the results of the recoding.
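
One way to do that check is to cross-tabulate the original values against the recoded ones. The sketch below also swaps the nested ifelse() for a named lookup vector, which is an alternative technique rather than the approach above; the names lookup and recoded are made up for illustration.

```r
myvar <- c("Yes", "1", "Unsure", "2", "2", "1", "Unsure")

# Hypothetical lookup table: text answers mapped to the codes to store
lookup <- c(Yes = "1", No = "2", Unsure = "-1")

# Replace values found in the lookup; leave everything else unchanged
recoded <- ifelse(myvar %in% names(lookup), lookup[myvar], myvar)
newvar <- as.numeric(recoded)

# Cross-tabulate original against recoded values to verify the mapping
table(original = myvar, recoded = newvar, useNA = "ifany")
```

Each original value should land in exactly one recoded column, and an unexpected NA column would point to a value the lookup did not anticipate.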

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrea M
Solution 2