How do I create new column data based on a regex match in R?
I have some tweet author location data that I'm looking to reclassify to country. For example, given a vector of United States state codes, I want to check each location for a regex match and add a "United States" entry to the country column.
Example data:
states = c("CA", "OH", "FL", "TX", "MN") # all the states
tweets = data.frame(location = c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA"))
What I've tried:
# This seems to do the matching part well
filter(str_detect(location, paste(usa_data$Code, collapse = "|")))
# nested for loop
for (i in length(tweets$location)){
for (state in states){
if (grepl(state, tweets$location[i])){
tweets$country[i] = "USA"
break
}
}
}
Desired output (based on example input):
tweets$country = c(NA, "USA", NA, "USA")
I'm relatively new to R, therefore any help will be greatly appreciated.
Solution 1:[1]
We can use grepl along with ifelse for a base R solution:
states = c("CA", "OH", "FL", "TX", "MN") # all the states
tweets <- data.frame(location = c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA"))
regex <- paste0("\\b(?:", paste(states, collapse="|"), ")\\b")
tweets$country <- ifelse(grepl(regex, tweets$location), "USA", NA)
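To see what the word boundaries in that pattern buy you, here is a minimal sketch comparing the bare alternation with the bounded version. The extra "CANADA" location and the `perl = TRUE` argument are my additions for illustration, not part of the answer above.

```r
states <- c("CA", "OH", "FL", "TX", "MN")
# "CANADA" is a made-up extra location that contains "CA" inside a longer word
locations <- c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA", "CANADA")

# Bare alternation: matches a state code anywhere, even inside another word
plain <- paste(states, collapse = "|")
# Bounded alternation: the code must stand alone as a whole word
bounded <- paste0("\\b(?:", plain, ")\\b")

plain_hits <- grepl(plain, locations, perl = TRUE)
bounded_hits <- grepl(bounded, locations, perl = TRUE)

print(plain_hits)    # FALSE TRUE FALSE TRUE TRUE
print(bounded_hits)  # FALSE TRUE FALSE TRUE FALSE
```

Without `\\b`, "CANADA" would be tagged as a US location because it starts with "CA". Note that matching is still case-sensitive, so a lowercase "ca" would not match either pattern.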
Solution 2:[2]
If you prefer a dplyr solution, this is very similar to Tim's answer:
library(dplyr)
states <- c("CA", "OH", "FL", "TX", "MN") # all the states
tweets <- tibble(location = c(
"my bed", "Minneapolis, MN", "Paris, France",
"Los Angeles, CA"
))
tweets %>%
mutate(country = if_else(stringr::str_detect(
string = location,
pattern = paste0(
"\\b(?:", paste(states,
collapse = "|"
),
")\\b"
)
),
"United States", NA_character_
))
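As a side note on the nested loop in the question: `for (i in length(tweets$location))` iterates exactly once, with `i` set to the last index, so only the final row is ever checked. A minimal sketch of a corrected loop, assuming the same example data, uses `seq_along()` instead. The word-boundary wrapping and `perl = TRUE` are my additions, borrowed from the pattern in the answers.

```r
states <- c("CA", "OH", "FL", "TX", "MN")
tweets <- data.frame(
  location = c("my bed", "Minneapolis, MN", "Paris, France", "Los Angeles, CA")
)

# Start with a character column of missing values
tweets$country <- NA_character_

# seq_along() yields 1, 2, ..., n; length() alone would yield only n
for (i in seq_along(tweets$location)) {
  for (state in states) {
    # Require the state code to appear as a whole word
    if (grepl(paste0("\\b", state, "\\b"), tweets$location[i], perl = TRUE)) {
      tweets$country[i] <- "USA"
      break
    }
  }
}

print(tweets$country)  # NA "USA" NA "USA"
```

The vectorised `grepl` and `str_detect` versions above do the same job without an explicit loop and are usually the more idiomatic choice in R.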
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tim Biegeleisen |
| Solution 2 | Julian |
