'How to deal with column containing multiple comma-separated values in R language
There is a column in my data set (CSV file) called amenities
. It contains values like
{ "Wireless Internet", "Wheelchair accessible", Kitchen, Elevator, "Buzzer/wireless intercom", Heating, Washer, Dryer, Essentials, Shampoo, Hangers, "Laptop friendly workspace" },
{ TV, "Cable TV", Internet, "Wireless Internet", "Air conditioning", Kitchen, "Smoking allowed", "Pets allowed", "Buzzer/wireless intercom", Heating, "Family/kid friendly", "Smoke detector", "Carbon monoxide, etc.
There are around 10k columns like this. I want to convert each amenity into new column and want to create values as 0 or 1 against each entry.
For example - columns should be created as below:
Wireless Internet WheelChair accesible Kitchen Elivator Buzzer/Wireless
1 0 1 1 0
Basically, each element of the column create a new column and their values should come like 0 and 1 based on whether the amenity is present in the column or not.
I have input as below:
and I want to get an output like this:
Solution 1:[1]
Here's a (rather complex) solution (to a rather complex problem):
Data:
df <- data.frame(
id = 1:2,
amenities = c('{"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}',
'{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed","Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide}'))
Prepare data:
amenities_clean <- gsub('[{}"]', '', df$amenities) # remove unwanted stuff
amenities_split <- strsplit(amenities_clean, ",") # split rows into individual amenities
amenities_unique <- unique(unlist(strsplit(amenities_clean, ","))) # get a list of unique amenities
df[amenities_unique] <- NA # set up the columns for each amenity
Now for the meat of the analysis, using str_detect
from the package stringr
:
# record presence/absence of individual amenities in each new column:
library(stringr)
for(i in 1:ncol(df[amenities_unique])){
for(j in 1:nrow(df)){
df[amenities_unique][j,i] <-
ifelse(str_detect(amenities_split[j], names(df[amenities_unique][i])), 1, 0)
}
}
This will prompt warnings but they seem to be negligible as the result is correct:
df
id
1 1
2 2
amenities
1 {"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}
2 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed","Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide}
Wireless Internet Wheelchair accessible Kitchen Elevator Buzzer/wireless intercom Heating Washer Dryer
1 1 1 1 1 1 1 1 1
2 1 0 1 0 1 1 0 0
Essentials Shampoo Hangers Laptop friendly workspace TV Cable TV Internet Air conditioning Smoking allowed
1 1 1 1 1 0 0 1 0 0
2 0 0 0 0 1 1 1 1 1
Pets allowed Family/kid friendly Smoke detector Carbon monoxide
1 0 0 0 0
2 1 1 1 1
EDIT:
Alternatively, and maybe more economically, instead of the nested for
loop you can use an apply
function like this (based on the vectors amenities_split
and amenities_unique
from the Preparation phase of the first solution):
cbind(df, t(sapply(amenities_split, function(x)
table(factor(x, levels = amenities_unique)))))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |