Dictionary style replace multiple items

I have a large data.frame of character data that I want to convert based on what is commonly called a dictionary in other languages.

Currently I am going about it like so:

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), snp2 = c("AA", "AT", "AG", "AA"), snp3 = c(NA, "GG", "GG", "GC"), stringsAsFactors=FALSE)
foo <- replace(foo, foo == "AA", "0101")
foo <- replace(foo, foo == "AC", "0102")
foo <- replace(foo, foo == "AG", "0103")

This works fine, but it is obviously not pretty and seems silly to repeat the replace statement each time I want to replace one item in the data.frame.

Is there a better way to do this since I have a dictionary of approximately 25 key/value pairs?



Solution 1:[1]

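If you're open to using packages, plyr is a very popular one and has a handy mapvalues() function that does just what you're looking for. It works on a vector or factor, so apply it to each column of the data.frame:

library(plyr)
foo[] <- lapply(foo, mapvalues, from=c("AA", "AC", "AG"), to=c("0101", "0102", "0103"), warn_missing=FALSE)

Note that it works for data types of all kinds, not just strings. Setting warn_missing=FALSE suppresses the message about from values (here "AC") that do not occur in a column.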

Solution 2:[2]

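Here is a quick solution:

dict <- list(AA = '0101', AC = '0102', AG = '0103')
foo2 <- foo
# use [[ ]] so that a plain string, not a one-element list, is written into the columns
for (i in seq_along(dict)){foo2 <- replace(foo2, foo2 == names(dict)[i], dict[[i]])}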

Solution 3:[3]

One of the most readable ways to replace values in a string or a vector of strings with a dictionary is stringr::str_replace_all, from the stringr package. Beware: this method is based on regex, so each key is treated as a regular expression and matches anywhere in the string. The pattern needed by str_replace_all can be a dictionary, expressed as a named character vector: c("regex" = "desired value").

# 1. Build your dictionary
dictio_replace <- c("AA" = "0101",
                    "AC" = "0102",
                    "AG" = "0103") # short example of a dictionary

# 2. Replace all patterns according to the dictionary values (works on a single string or a single vector of strings)
foo$snp1 <- stringr::str_replace_all(string = foo$snp1,
                                     pattern = dictio_replace)  # only 'pattern' is needed: 'replacement' is unnecessary since we provide a dictionary

Repeat step 2 with foo$snp2 and foo$snp3. If you have more vectors to transform, it's a good idea to use another function (such as lapply) to replace values in each column of the data frame without repeating yourself.
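
Because the dictionary keys are regular expressions, an unanchored key such as "AA" would also match inside a longer value. As a minimal sketch of covering every column while replacing whole values only, the following swaps in base R's sub() with anchored patterns in place of str_replace_all. The helper name replace_whole is hypothetical and not part of any package.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

dictio_replace <- c("AA" = "0101", "AC" = "0102", "AG" = "0103")

# Hypothetical helper: apply each dictionary entry as an anchored regex,
# so only values that equal the key exactly are replaced.
replace_whole <- function(x, dict) {
  for (key in names(dict)) {
    x <- sub(paste0("^", key, "$"), dict[[key]], x)
  }
  x
}

# Run the helper over every column while keeping the data.frame structure.
foo[] <- lapply(foo, replace_whole, dict = dictio_replace)
foo
```

Values that are not in the dictionary, and NA, are left unchanged.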

Solution 4:[4]

Note this answer started as an attempt to solve the much simpler problem posted in How to replace all values in data frame with a vector of values?. Unfortunately, this question was closed as duplicate of the actual question. So, I'll try to suggest a solution based on replacing factor levels for both cases, here.


In case there is only a vector (or one data frame column) whose values need to be replaced, and there are no objections to using a factor, we can coerce the vector to factor and change the factor levels as required:

x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
x <- factor(x)
x
#[1] 1 1 4 4 5 5 1 1 2
#Levels: 1 2 4 5
replacement_vec <- c("A", "T", "C", "G")
levels(x) <- replacement_vec
x
#[1] A A C C G G A A T
#Levels: A T C G

Using the forcats package, this can be done in a one-liner:

x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
forcats::lvls_revalue(factor(x), replacement_vec)
#[1] A A C C G G A A T
#Levels: A T C G
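
Both variants above map levels by position, so the replacement vector has to follow the sorted order of the existing levels. As a hedged alternative sketch in base R, levels<- also accepts a named list, which spells out each old-to-new pair explicitly:

```r
x <- factor(c(1, 1, 4, 4, 5, 5, 1, 1, 2))

# Each list name is a new level; each value is the old level it replaces.
levels(x) <- list(A = "1", T = "2", C = "4", G = "5")
x
```

With this form the order of the pairs no longer depends on how the original levels were sorted.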

In case all values of multiple columns of a data frame need to be replaced, the approach can be extended.

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), 
                  snp2 = c("AA", "AT", "AG", "AA"), 
                  snp3 = c(NA, "GG", "GG", "GC"), 
                  stringsAsFactors=FALSE)

level_vec <- c("AA", "AC", "AG", "AT", "GC", "GG")
replacement_vec <- c("0101", "0102", "0103", "0104", "0302", "0303")
foo[] <- lapply(foo, function(x) forcats::lvls_revalue(factor(x, levels = level_vec), 
                                                       replacement_vec))
foo
#  snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103 0104 0303
#3 0101 0103 0303
#4 0101 0101 0302

Note that level_vec and replacement_vec must have equal lengths.

More importantly, level_vec should be complete, i.e., include all possible values in the affected columns of the original data frame. (Use unique(sort(unlist(foo))) to verify.) Otherwise, any values not listed in level_vec will be coerced to <NA>. Note that this is also a requirement for Martin Morgan's answer.

So, if there are only a few different values to be replaced, you will probably be better off with one of the other answers, e.g., Ramnath's.
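
The completeness requirement can be checked before recoding. The following is a minimal sketch that lists the values present in the data but absent from the level vector; the deliberately incomplete level_vec and the name missing_levels are assumptions made for illustration.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

# Deliberately incomplete level vector, to show what the check reports.
level_vec <- c("AA", "AC", "AG")

# Distinct non-missing values found anywhere in the data frame.
vals <- unique(unlist(foo, use.names = FALSE))
vals <- vals[!is.na(vals)]

# Values that would silently turn into NA when recoding via factor levels.
missing_levels <- setdiff(vals, level_vec)
missing_levels
```

An empty result means every value in the data has a level and nothing will be lost.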

Solution 5:[5]

We can also use dplyr::case_when

library(dplyr)

foo %>%
   mutate_all(~case_when(. == "AA" ~ "0101", 
                         . == "AC" ~ "0102", 
                         . == "AG" ~ "0103", 
                         TRUE ~ .))

#  snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103   AT   GG
#3 0101 0103   GG
#4 0101 0101   GC

It checks each condition and replaces the value with the corresponding replacement if the condition is TRUE. We can add more conditions if needed, and with TRUE ~ . we keep the values as they are if none of the conditions is matched. If we want to change them to NA instead, we can remove the last line.

foo %>%
  mutate_all(~case_when(. == "AA" ~ "0101", 
                        . == "AC" ~ "0102", 
                        . == "AG" ~ "0103"))

#  snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103 <NA> <NA>
#3 0101 0103 <NA>
#4 0101 0101 <NA>

This changes the values to NA if none of the above conditions is satisfied.


Another option using only base R is to create a lookup dataframe with old and new values, unlist the dataframe, match them with old values, get the corresponding new values and replace. Values without a match in the lookup become NA.

lookup <- data.frame(old_val = c("AA", "AC", "AG"), 
                     new_val = c("0101", "0102", "0103"), 
                     stringsAsFactors = FALSE)

foo[] <- lookup$new_val[match(unlist(foo), lookup$old_val)]
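
If values that are missing from the lookup should stay as they are, the same match() idea can be restricted to the positions that actually matched. This is a sketch under that assumption; the intermediate names vals and idx are introduced here for illustration.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

lookup <- data.frame(old_val = c("AA", "AC", "AG"),
                     new_val = c("0101", "0102", "0103"),
                     stringsAsFactors = FALSE)

# Flatten the data frame column by column and find each value in the lookup.
vals <- unlist(foo, use.names = FALSE)
idx <- match(vals, lookup$old_val)

# Overwrite only the positions with a match; everything else is untouched.
hit <- !is.na(idx)
vals[hit] <- lookup$new_val[idx[hit]]

# Write the values back into the original shape.
foo[] <- vals
foo
```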

Solution 6:[6]

Here's something simple that will do the job:

key <- c('AA','AC','AG')
val <- c('0101','0102','0103')

lapply(1:3,FUN = function(i){foo[foo == key[i]] <<- val[i]})
foo

 snp1 snp2 snp3
1 0101 0101 <NA>
2 0103   AT   GG
3 0101 0103   GG
4 0101 0101   GC

lapply will output a list in this case that we don't actually care about. You could assign the result to something if you like and then just discard it. I'm iterating over the indices here, but you could just as easily place the key/vals in a list themselves and iterate over them directly. Note the use of global assignment with <<-.
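
For comparison, a plain for loop gives the same result without the discarded list and without global assignment, because the loop body runs in the calling environment. A minimal sketch:

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

key <- c("AA", "AC", "AG")
val <- c("0101", "0102", "0103")

# Replace one key at a time; ordinary <- is enough inside a for loop.
for (i in seq_along(key)) {
  foo[foo == key[i]] <- val[i]
}
foo
```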

I tinkered with a way to do this with mapply but my first attempt didn't work, so I switched. I suspect a solution with mapply is possible, though.

Solution 7:[7]

Using dplyr::recode:

library(dplyr)

# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
mutate_all(foo, ~ recode(., "AA" = "0101", "AC" = "0102", "AG" = "0103",
                         .default = NA_character_))

#   snp1 snp2 snp3
# 1 0101 0101 <NA>
# 2 0103 <NA> <NA>
# 3 0101 0103 <NA>
# 4 0101 0101 <NA>

Solution 8:[8]

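Used @Ramnath's answer above, but made it read the mapping (what is to be replaced and what it is to be replaced with) from a file and use gsub rather than replace.

# colClasses="character" keeps values such as "0101" as strings instead of converting them to numbers
hrw <- read.csv("hgWords.txt", header=TRUE, colClasses="character", encoding="UTF-8", sep="\t") 

# document is the character vector to be modified
for (i in seq_len(nrow(hrw))) 
{
document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case=TRUE)
}

hgWords.txt contains the following tab-separated data: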

"from"  "to"
"AA"    "0101"
"AC"    "0102"
"AG"    "0103" 

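The file-based approach can be tried end to end without an existing file. The sketch below writes the mapping to a temporary file first; the temporary path and the sample document vector are assumptions made for illustration.

```r
# Write the tab-separated mapping to a temporary file (stand-in for hgWords.txt).
path <- tempfile(fileext = ".txt")
writeLines(c('"from"\t"to"',
             '"AA"\t"0101"',
             '"AC"\t"0102"',
             '"AG"\t"0103"'), path)

# Read every column as character so that "0101" keeps its leading zero.
hrw <- read.delim(path, colClasses = "character")

# Hypothetical input text; gsub replaces every occurrence, ignoring case.
document <- c("aa and AG", "ac")
for (i in seq_len(nrow(hrw))) {
  document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case = TRUE)
}
document
```

Note that gsub replaces substrings, so this variant suits free text better than whole-cell recoding.
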
Solution 9:[9]

Since it's been a few years since the last answer, and a new question came up tonight on this topic and a moderator closed it, I'll add it here. The poster has a large data frame containing 0, 1, and 2, and wants to change them to AA, AB, and BB respectively.

Use plyr:

> df <- data.frame(matrix(sample(c(NA, c("0","1","2")), 100, replace = TRUE), 10))
> df
     X1   X2   X3 X4   X5   X6   X7   X8   X9  X10
1     1    2 <NA>  2    1    2    0    2    0    2
2     0    2    1  1    2    1    1    0    0    1
3     1    0    2  2    1    0 <NA>    0    1 <NA>
4     1    2 <NA>  2    2    2    1    1    0    1
... to 10th row

> df[] <- lapply(df, as.character)

Create a function over the data frame using revalue to replace multiple terms:

> library(plyr)
> apply(df, 2, function(x) {x <- revalue(x, c("0"="AA","1"="AB","2"="BB")); x})
      X1   X2   X3   X4   X5   X6   X7   X8   X9   X10 
 [1,] "AB" "BB" NA   "BB" "AB" "BB" "AA" "BB" "AA" "BB"
 [2,] "AA" "BB" "AB" "AB" "BB" "AB" "AB" "AA" "AA" "AB"
 [3,] "AB" "AA" "BB" "BB" "AB" "AA" NA   "AA" "AB" NA  
 [4,] "AB" "BB" NA   "BB" "BB" "BB" "AB" "AB" "AA" "AB"
... and so on
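
Note that apply() returns a character matrix rather than a data frame. If the data frame structure should be kept, one option is a named-vector lookup in base R, used here in place of revalue. This is a sketch on a small hand-made data frame; unlike revalue, any value missing from the map becomes NA.

```r
df <- data.frame(X1 = c("1", "0", NA),
                 X2 = c("2", "2", "1"),
                 stringsAsFactors = FALSE)

# Names are the old values, elements are the new values.
map <- c("0" = "AA", "1" = "AB", "2" = "BB")

# Index the map by each column; df[] keeps the data.frame structure.
df[] <- lapply(df, function(x) unname(map[x]))
df
```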

Solution 10:[10]

Not overly original, but should provide an intuitive interface to accomplish replacing multiple values in Base R:

# Function performing a mapping replacement:
# replaceMultipleValues => function() 
replaceMultipleValues <- function(df, mapFrom, mapTo){
  # Extract the values in the data.frame: 
  # dfVals => named character vector
  dfVals <- unlist(df)
  
  # Get all values in the mapping & data 
  # and assign a name to them: tmp1 => named character vector 
  tmp1 <- c(
    setNames(mapTo, mapFrom), 
    setNames(dfVals, dfVals)
  )
  
  # Extract the unique values: 
  # valueMap => named character vector
  valueMap <- tmp1[!(duplicated(names(tmp1)))]
  
  # Recode the values in data.frame: res => data.frame
  res <- data.frame(
      matrix(
        valueMap[dfVals], 
        nrow = nrow(df),
        ncol = ncol(df),
        dimnames = dimnames(df)
    )
  )
  
  # Explicitly define the returned object: data.frame => env
  return(res)
}

# Recode values in data.frame: 
# res => data.frame
res <- replaceMultipleValues(
  foo, 
  c("AA", "AC", "AG"), 
  c("0101", "0102", "0103")
)

# Print data.frame to console: 
# data.frame => stdout(console)
res

Data:

# Import data: foo => data.frame
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), snp2 = c("AA", "AT", "AG", "AA"), snp3 = c(NA, "GG", "GG", "GC"), stringsAsFactors=FALSE)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 c.gutierrez
Solution 2 Ramnath
Solution 3
Solution 4 Community
Solution 5 Ronak Shah
Solution 6 joran
Solution 7 zx8754
Solution 8 Vinay Prajapati
Solution 9 mysteRious
Solution 10