How to make a new variable that takes 1 if the string in another column contains a word with varying punctuation and letter case?

I have a column that looks something like this:

col1 
"business"
"BusinesS"
"education"
"some BUSINESS ."
"business of someone, that is cool"
" not the b word"
"busi ness"
"busines." 
"businesses"
"something else"

I need an efficient way of turning this string data into a new column, like this:

col1                col2
NA                  1
NA                  1
"education"         NA
NA                  1
NA                  1
" not the b word"   NA
NA                  1
NA                  1
NA                  1
"something else"    NA

So the common denominator is "busines", but I don't know how to efficiently handle all the spaces, punctuation, upper/lower case, other words, etc. in one mutate that creates a new column.



Solution 1:[1]

You can remove all non-word characters using gsub and then use grepl to detect busines:

+grepl("busines", gsub("\\W+", "", s), ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
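
To see why this works, it can help to look at what gsub leaves behind before grepl runs. This is only a small sketch: it repeats the s vector from the Data section so that it runs on its own, and the name cleaned is purely illustrative.

```r
s <- c("business", "BusinesS", "education", "some BUSINESS .",
       "business of someone, that is cool", " not the b word",
       "busi ness", "busines.", "businesses", "something else")

# Strip every run of non-word characters (spaces, punctuation);
# letters, digits and underscores are kept
cleaned <- gsub("\\W+", "", s)
cleaned[c(4, 7, 8)]
# [1] "someBUSINESS" "business"     "busines"

# Unary + turns the logical result of grepl into 0/1 integers
+grepl("busines", cleaned, ignore.case = TRUE)
```

Because the space in "busi ness" is removed before matching, a plain substring search is enough to catch it.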

Another way would be to use agrepl for approximate string matching, where 1L is the maximum distance allowed between the pattern and a match.

+agrepl("busines", s, 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
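
One thing to keep in mind is that approximate matching is more permissive than an exact substring search, so words that are merely close to the pattern may be flagged as well. The comparison below is a sketch on hypothetical inputs (the vector x and the names exact and fuzzy are not from the question); I would expect "busyness" to be picked up by agrepl but not by grepl, so it is worth checking the result on your own data.

```r
# Hypothetical inputs: one intended match, one near-miss, one unrelated word
x <- c("busi ness", "busyness", "education")

# Exact substring match after stripping non-word characters
exact <- grepl("busines", gsub("\\W+", "", x), ignore.case = TRUE)

# Approximate match allowing at most one edit
fuzzy <- agrepl("busines", x, max.distance = 1L, ignore.case = TRUE)

# Side-by-side view of the two approaches
data.frame(x, exact, fuzzy)
```

If such near-misses matter, the gsub plus grepl version is probably the safer default.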

agrepl can also be a solution if you are looking for business instead of busines:

+agrepl("business", gsub("\\W+", "", s), 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
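
To get the col1/col2 layout from the question, the match can be used to fill both columns. This is a minimal base R sketch, assuming the data sit in a data frame with a character column col1; the names df and hit are illustrative, and s is repeated from the Data section so the snippet is self-contained.

```r
s <- c("business", "BusinesS", "education", "some BUSINESS .",
       "business of someone, that is cool", " not the b word",
       "busi ness", "busines.", "businesses", "something else")
df <- data.frame(col1 = s, stringsAsFactors = FALSE)

# TRUE where the cleaned string contains "busines" in any case
hit <- grepl("busines", gsub("\\W+", "", df$col1), ignore.case = TRUE)

# col2 is 1 for matches and NA otherwise
df$col2 <- ifelse(hit, 1L, NA_integer_)

# col1 keeps its value only where there was no match
df$col1[hit] <- NA

df
```

The same ifelse() expressions should also work inside dplyr::mutate(), provided col2 is computed before col1 is overwritten.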

Data:

s <- c("business","BusinesS","education","some BUSINESS .",
       "business of someone, that is cool"," not the b word",
       "busi ness","busines." ,"businesses","something else")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1