'How to parse specific substring to multiple columns
I have a data frame consisting of two columns. Product name and product specifications.
The product specifications consist of a long concatenated string separated by a comma as the delimiter. The dataset is produced through web scraping.
This is the sample of the dataset:
sample <- structure(list(Model = c("27GQ50F", "24GQ50F", "22MP41W", "22MP410"),
Specs = c("Display: 27 in, VA, W-LED, 1920 x 1080 pixels",
"Display: 23.8 in, VA, OLED, Viewing angles (H/V): 170 / 160",
"Display: 21.5 in, VA, Edge LED (Local Dimming), 1920 x 1080 pixels, Viewing angles (H/V): 178 / 178",
"Display: 21.5 in, Direct LED, 1920 x 1080 pixels, Viewing angles (H/V): 178 / 178, Brightness: 250 cd/m"
)), row.names = c(NA, 4L), class = "data.frame")
The objective: I wanted to create a tidy data frame of product names and multiple columns of product specifications from this concatenated string.
The challenge: The values of the concatenated strings are not aligned based on the specification category. Missing specifications value is not filled with NA, rather it's directly concatenated to the next available specifications.
Meaning that if I simply parse the concatenated data into multiple columns, many values are in the wrong column.
I use cSplit to parse so that I don't need to define the resulting columns (which can vary by row, can reach up to 27 specifications). As you can see,
#messy output by parsing
library(splitstackshape)
sample$Specs <- cSplit(sample, 'Specs', ',')
Model Specs_1 Specs_2 Specs_3 Specs_4
1: 27GQ50F Display: 27 in VA W-LED 1920 x 1080 pixels
2: 24GQ50F Display: 23.8 in VA OLED Viewing angles (H/V): 170 / 160
3: 22MP41W Display: 21.5 in VA Edge LED (Local Dimming) 1920 x 1080 pixels
4: 22MP410 Display: 21.5 in Direct LED 1920 x 1080 pixels Viewing angles (H/V): 178 / 178
Specs_5
1: <NA>
2: <NA>
3: Viewing angles (H/V): 178 / 178
4: Brightness: 250 cd/m
While a clean one should look like this
Model Specs_1 Specs_2 Specs_3 Specs_4
1: 27GQ50F Display: 27 in VA W-LED 1920 x 1080 pixels
2: 24GQ50F Display: 23.8 in VA OLED <NA>
3: 22MP41W Display: 21.5 in VA Edge LED (Local Dimming) 1920 x 1080 pixels
4: 22MP410 Display: 21.5 in <NA> Direct LED 1920 x 1080 pixels
Specs_5 Specs_6
1: <NA> <NA>
2: Viewing angles (H/V): 170 / 160 <NA>
3: Viewing angles (H/V): 178 / 178 <NA>
4: Viewing angles (H/V): 178 / 178 Brightness: 250 cd/m
I'm not even sure where to begin to handle this.
But there is a known pattern for each specification. For Specs 3, there are substring that contains the words 'LED'. For Specs 4, ends with pixels. Others are all in the format of 'Specification Name: value'.
Edited: As suggested, I use mutate, str_extract and regex to extract each specification into its columns. Some example:
sample %>% mutate(Display = str_extract(Specs, "Display:\\s*[0-9.]+\\s*in"),
`LCD Panel` = str_extract(Specs, "VA|IPS"),
`Refresh rate` = str_extract(Specs,
"\\d+\\sHz\\s\\-\\s\\d+\\sHz"),
Resolution = str_extract(Specs,
"\\d+\\sx\\s\\d+\\spixels"))
and so on for each column.
The only one that I'm not sure how to extract is the 3rd specification. I'm not sure how many possible variations are there but all contain LED in it.
The example pattern is "W-LED", "OLED", "Edge LED (Local Dimming)", "Direct LED" and I want to extract all the substring in between the delimiter.
Meaning the regex should be able to extract "Edge LED (Local Dimming)" and not just "Edge LED" etc.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
