'How to parse specific substring to multiple columns

I have a data frame consisting of two columns. Product name and product specifications.

The product specifications consist of a long concatenated string separated by a comma as the delimiter. The dataset is produced through web scraping.

This is the sample of the dataset:

sample <- structure(list(Model = c("27GQ50F", "24GQ50F", "22MP41W", "22MP410"), 
Specs = c("Display: 27 in, VA, W-LED, 1920 x 1080 pixels",  
"Display: 23.8 in, VA, OLED, Viewing angles (H/V): 170  / 160", 
"Display: 21.5 in, VA, Edge LED (Local Dimming), 1920 x 1080 pixels, Viewing angles (H/V): 178  / 178", 
"Display: 21.5 in, Direct LED, 1920 x 1080 pixels, Viewing angles (H/V): 178  / 178, Brightness: 250 cd/m"
)), row.names = c(NA, 4L), class = "data.frame")

The objective: I wanted to create a tidy data frame of product names and multiple columns of product specifications from this concatenated string.

The challenge: The values of the concatenated strings are not aligned based on the specification category. Missing specifications value is not filled with NA, rather it's directly concatenated to the next available specifications.

Meaning that if I simply parse the concatenated data into multiple columns, many values are in the wrong column.

I use cSplit to parse so that I don't need to define the resulting columns (which can vary by row, can reach up to 27 specifications). As you can see,

#messy output by parsing
library(splitstackshape)
sample$Specs <- cSplit(sample, 'Specs', ',')

 Model          Specs_1    Specs_2                  Specs_3                          Specs_4
1: 27GQ50F   Display: 27 in         VA                    W-LED               1920 x 1080 pixels
2: 24GQ50F Display: 23.8 in         VA                     OLED Viewing angles (H/V): 170  / 160
3: 22MP41W Display: 21.5 in         VA Edge LED (Local Dimming)               1920 x 1080 pixels
4: 22MP410 Display: 21.5 in Direct LED       1920 x 1080 pixels Viewing angles (H/V): 178  / 178
                            Specs_5
1:                             <NA>
2:                             <NA>
3: Viewing angles (H/V): 178  / 178
4:             Brightness: 250 cd/m 

While a clean one should look like this

 Model          Specs_1    Specs_2                  Specs_3                          Specs_4
1: 27GQ50F   Display: 27 in         VA                    W-LED               1920 x 1080 pixels
2: 24GQ50F Display: 23.8 in         VA                     OLED                             <NA>
3: 22MP41W Display: 21.5 in         VA  Edge LED (Local Dimming)              1920 x 1080 pixels
4: 22MP410 Display: 21.5 in        <NA>              Direct LED               1920 x 1080 pixels 
                            Specs_5                  Specs_6
1:                             <NA>                       <NA>
2: Viewing angles (H/V): 170  / 160                       <NA>
3: Viewing angles (H/V): 178  / 178                       <NA>
4: Viewing angles (H/V): 178  / 178            Brightness: 250 cd/m

I'm not even sure where to begin to handle this.

But there is a known pattern for each specification. For Specs 3, there are substring that contains the words 'LED'. For Specs 4, ends with pixels. Others are all in the format of 'Specification Name: value'.

Edited: As suggested, I use mutate, str_extract and regex to extract each specification into its columns. Some example:

sample %>% mutate(Display = str_extract(Specs, "Display:\\s*[0-9.]+\\s*in"),
                    `LCD Panel` = str_extract(Specs, "VA|IPS"),
                    `Refresh rate` = str_extract(Specs,                                 
                     "\\d+\\sHz\\s\\-\\s\\d+\\sHz"),
                     Resolution = str_extract(Specs, 
                     "\\d+\\sx\\s\\d+\\spixels"))

and so on for each column.

The only one that I'm not sure how to extract is the 3rd specification. I'm not sure how many possible variations are there but all contain LED in it.

The example pattern is "W-LED", "OLED", "Edge LED (Local Dimming)", "Direct LED" and I want to extract all the substring in between the delimiter.

Meaning the regex should be able to extract "Edge LED (Local Dimming)" and not just "Edge LED" etc.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source