Splitting data in column based on a word

Is there code to create a column containing only the speed number? The Cpu column (shown in the original post's image) includes too much unnecessary information for me. I only want the "GHz" number (e.g. 2.3, 1.8 and 2.5).


Solution 1:[1]

You can do something like this:

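library(dplyr)    # provides %>% and mutate()
library(stringr)  # provides str_extract()

# Extract the number immediately before a trailing "GHz" and convert it to numeric
data %>%
  mutate(speed = as.numeric(str_extract(Cpu, "\\d*[.]?\\d+(?=GHz$)")))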

Solution 2:[2]

A slightly simpler regex is this:

library(dplyr)
library(stringr)
df %>%
  mutate(CPU_new = str_extract(Cpu, "[0-9.]+(?=GHz)"))

Without dplyr (this still uses str_extract from stringr):

df$CPU_new <- str_extract(df$Cpu, "[0-9.]+(?=GHz)")

How this works:

  • [0-9.]+: character class allowing digits and the period, occurring one or more times
  • (?=GHz): positive lookahead asserting that the match to be extracted must be followed by the literal string GHz
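
The two pieces above can be tried on a small vector without any packages. This is only a minimal sketch: the sample strings are made up to resemble the Cpu column in the question, and base R's regexpr() and regmatches() are used in place of str_extract(). Base R needs perl = TRUE for the lookahead to work.

```r
# Hypothetical sample values modelled on the Cpu column in the question
cpu <- c("Intel Core i5 2.3GHz", "Intel Core i5 1.8GHz", "Intel Core i7 2.5GHz")

# perl = TRUE enables the lookahead (?=GHz) in base R regular expressions
m <- regexpr("[0-9.]+(?=GHz)", cpu, perl = TRUE)

# regmatches() returns only the matched substrings; convert them to numbers
speed <- as.numeric(regmatches(cpu, m))
speed
```

Note that regmatches() drops elements with no match, so the result can be shorter than the input if some rows lack "GHz"; str_extract() returns NA for those rows instead.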

Solution 3:[3]

I think the other answer is better, but an alternative to a complicated regex is to extract just the 3 characters immediately before "GHz" using the stringr package. This assumes the speed is always written with exactly three characters (such as 2.3):

Data:

df <- data.frame(ScreenResolution = paste("Test",LETTERS[1:3]),
                 Cpu = c("Intel Core i5 2.3GHz","Intel Core i5 1.8GHz",
                         "Intel Core i5 72000U 2.3GHz"),
                 Ram = "8GB")

Code:

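library(stringr)

# Start position of "GHz" in every row (indexing with [1] would reuse the first row's position)
ghz_start <- str_locate(df$Cpu, pattern = "GHz")[, "start"]

# Keep the 3 characters immediately before "GHz"
df$Cpu_new <- str_sub(df$Cpu, ghz_start - 3, ghz_start - 1)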

Output:

#   ScreenResolution                         Cpu Ram Cpu_new
# 1           Test A        Intel Core i5 2.3GHz 8GB     2.3
# 2           Test B        Intel Core i5 1.8GHz 8GB     1.8
# 3           Test C Intel Core i5 72000U 2.3GHz 8GB     2.3

If you want the result to be numeric, wrap the call: as.numeric(str_sub(...)).
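
The fixed-width idea and the numeric conversion can be combined in one step. The following is a minimal sketch in base R, with substr() and regexpr() swapped in for str_sub() and str_locate() so that it runs without packages. It reuses the sample Cpu values from the Data block and assumes every speed is exactly three characters wide.

```r
# Sample Cpu values from the Data block above (only this column is needed here)
df <- data.frame(Cpu = c("Intel Core i5 2.3GHz", "Intel Core i5 1.8GHz",
                         "Intel Core i5 72000U 2.3GHz"),
                 stringsAsFactors = FALSE)

# Position where "GHz" starts in each row; fixed = TRUE treats it as a literal string
ghz_start <- as.integer(regexpr("GHz", df$Cpu, fixed = TRUE))

# Take the 3 characters before "GHz" and convert them to numbers
df$Cpu_new <- as.numeric(substr(df$Cpu, ghz_start - 3, ghz_start - 1))
df$Cpu_new
```

A value such as 2GHz or 2.25GHz would not fit the three-character window, which is where the regex-based answers are more robust.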

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   langtang
Solution 2   Chris Ruehlemann
Solution 3   jpsmith