Converting categorical data into numerical values
I have a dataset with many categorical variables mixed with numerical ones. I am trying to run a regression on obesity that includes state, age, sex, and survey responses such as a question asking whether the respondent has exercised in the last 30 days, where the answer options are 1 = yes, 2 = no, 7 = not sure/don't know, BLANK = not answered or missing.
How can I put this dataset into the correct form to run a regression? In other words, how do I create a smaller data frame that includes only the variables I need from the very large data frame?
Here are the first 10 rows of only the data I need:

Out of 50 states I'm only trying to use three specific states. How do I filter so that I only use the data from those three states? (Each state has a code, for example Kentucky = 21, Colorado = 8, New York = 36.)
Solution 1:[1]
You ask to convert from categorical to numeric but your data is already in numeric form! (Even though it actually contains categorical information, from what you describe).
We are trying to achieve three tasks:
1. Coerce the data type from double (continuous) to factor (categorical). This is required to include a variable as a categorical variable in a regression model. It can be done with the function factor().
2. Add labels so that it's easier to understand which integers code which value.
3. Subset your data so that only observations that meet a certain criterion are kept. We can do this with subset in base R or filter from dplyr.
First I create a small reproducible example to showcase the solution:
dat <- data.frame(
  BMI = c(1660, 1918, NA, NA, 2034),
  Exercise = c(1, 1, 2, 7, NA),
  state = c(21, 8, 36, 17, 3)
)
Solution in base R
dat$Exercise <- factor(dat$Exercise,
                       levels = c(2, 1, 7),
                       labels = c("no", "yes", "not sure"))
dat$state <- factor(dat$state,
                    levels = c(3, 8, 17, 21, 36),
                    labels = c("ohio", "colorado", "california", "kentucky", "new york"))
dat_subset <- dat |> subset(state %in% c("colorado", "kentucky", "new york"))
Solution in dplyr
library(dplyr)
dat <- dat |> mutate(
  Exercise = factor(Exercise,
                    levels = c(2, 1, 7),
                    labels = c("no", "yes", "not sure")),
  state = factor(state,
                 levels = c(3, 8, 17, 21, 36),
                 labels = c("ohio", "colorado", "california", "kentucky", "new york")))
dat_subset <- dat |> filter(state %in% c("colorado", "kentucky", "new york"))
Output (with either solution)
dat_subset
#>    BMI Exercise    state
#> 1 1660      yes kentucky
#> 2 1918      yes colorado
#> 3   NA       no new york
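The question also asks how to keep only the variables that are needed, and a subset of a factor still carries the levels of the states that were filtered out. The following is a minimal base R sketch of both points, assuming the same toy data and state labels as the example above; the name `small` and the decision to drop unused levels before modelling are illustrative assumptions, not part of the original answer.

```r
# Toy data and state labels copied from the example above
dat <- data.frame(
  BMI = c(1660, 1918, NA, NA, 2034),
  Exercise = c(1, 1, 2, 7, NA),
  state = c(21, 8, 36, 17, 3)
)
dat$state <- factor(dat$state,
                    levels = c(3, 8, 17, 21, 36),
                    labels = c("ohio", "colorado", "california", "kentucky", "new york"))

# Keep rows for the three states and, via `select`, only the columns needed
small <- subset(dat,
                state %in% c("colorado", "kentucky", "new york"),
                select = c(BMI, Exercise, state))

# The factor still remembers all five levels after subsetting
levels(small$state)

# droplevels() removes the levels that no longer occur in the data
small <- droplevels(small)
levels(small$state)
```

In a very large data frame the same `select` argument can list just the handful of columns the regression uses, so the model is fitted on a much smaller object.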
Solution 2:[2]
@Adar, first it's best to use dput(mydata[1:10, ]) to give us the first 10 rows of your data so we can see it exactly.
I would use explicit recoding like below to create a new variable:
myvar <- c("Yes", "1", "Unsure", "2", "2", "1", "Unsure")
newvar <- ifelse(myvar == "Unsure", -1,   # code Unsure to -1
          ifelse(myvar == "Yes", 1,       # code Yes to 1
          ifelse(myvar == "No", 2,        # code No to 2
                 myvar)))                 # otherwise use original value
newvar <- as.numeric(newvar)              # ifelse() returns character here, so convert to numeric
Of course, you can recode these in any way. If you use factor you might overwrite values you intend to keep. Always check the results of the algorithm.
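One way to check the result is to cross-tabulate the original values against the recoded ones. This is a minimal sketch that reuses the `myvar` example above; the name `check` is an illustrative assumption.

```r
# Same toy vector and recoding as in the example above
myvar <- c("Yes", "1", "Unsure", "2", "2", "1", "Unsure")
newvar <- ifelse(myvar == "Unsure", -1,
          ifelse(myvar == "Yes", 1,
          ifelse(myvar == "No", 2,
                 myvar)))

# Rows are original values, columns are recoded values;
# useNA = "ifany" adds a row/column for missing values if there are any
check <- table(original = myvar, recoded = newvar, useNA = "ifany")
print(check)
```

Each original value should land in exactly one recoded column; a row spread over several columns, or an unexpected NA column, points to a recoding mistake.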
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andrea M |
| Solution 2 | |
