'How to deal with column containing multiple comma-separated values in R language

There is a column in my data set (CSV file) called amenities. It contains values like

{ "Wireless Internet", "Wheelchair accessible", Kitchen, Elevator, "Buzzer/wireless intercom", Heating, Washer, Dryer, Essentials, Shampoo, Hangers, "Laptop friendly workspace" }, 
{ TV, "Cable TV", Internet, "Wireless Internet", "Air conditioning", Kitchen, "Smoking allowed", "Pets allowed", "Buzzer/wireless intercom", Heating, "Family/kid friendly", "Smoke detector", "Carbon monoxide, etc.

There are around 10k columns like this. I want to convert each amenity into new column and want to create values as 0 or 1 against each entry.

For example - columns should be created as below:

Wireless Internet      WheelChair accesible     Kitchen      Elivator     Buzzer/Wireless
1                      0                        1             1             0

Basically, each element of the column create a new column and their values should come like 0 and 1 based on whether the amenity is present in the column or not.

I have input as below:

enter image description here

and I want to get an output like this:

enter image description here

r one-hot-encoding

Solution 1:^[1]

Here's a (rather complex) solution (to a rather complex problem):

Data:

df <- data.frame(
  id = 1:2,
  amenities = c('{"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}',
                 '{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed","Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide}'))

Prepare data:

amenities_clean <- gsub('[{}"]', '', df$amenities) # remove unwanted stuff 
amenities_split <- strsplit(amenities_clean, ",") # split rows into individual amenities
amenities_unique <- unique(unlist(strsplit(amenities_clean, ","))) # get a list of unique amenities 
df[amenities_unique] <- NA # set up the columns for each amenity

Now for the meat of the analysis, using str_detect from the package stringr:

# record presence/absence of individual amenities in each new column:

library(stringr)
for(i in 1:ncol(df[amenities_unique])){
  for(j in 1:nrow(df)){
    df[amenities_unique][j,i] <- 
      ifelse(str_detect(amenities_split[j], names(df[amenities_unique][i])), 1, 0)
  }
}

This will prompt warnings but they seem to be negligible as the result is correct:

df
  id
1  1
2  2
                                                                                                                                                                                            amenities
1                               {"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}
2 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed","Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide}
  Wireless Internet Wheelchair accessible Kitchen Elevator Buzzer/wireless intercom Heating Washer Dryer
1                 1                     1       1        1                        1       1      1     1
2                 1                     0       1        0                        1       1      0     0
  Essentials Shampoo Hangers Laptop friendly workspace TV Cable TV Internet Air conditioning Smoking allowed
1          1       1       1                         1  0        0        1                0               0
2          0       0       0                         0  1        1        1                1               1
  Pets allowed Family/kid friendly Smoke detector Carbon monoxide
1            0                   0              0               0
2            1                   1              1               1

EDIT:

Alternatively, and maybe more economically, instead of the nested forloop you can use an apply function like this (based on the vectors amenities_split and amenities_unique from the Preparation phase of the first solution):

cbind(df, t(sapply(amenities_split, function(x) 
  table(factor(x, levels = amenities_unique)))))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'How to deal with column containing multiple comma-separated values in R language

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]