More efficient way to create a frequency column based on different groupings?

I have the code below, which calculates a frequency for each column element (relative to its own column) and adds all five frequencies together in a new column. The code works, but it is very slow and the majority of the processing time is spent in this step. Any ideas on how to accomplish the same goal more efficiently?


Create_Freq <- function(Word_List) {

 library(dplyr)

 # For each letter-position column, add_count() adds an "n" column with the
 # number of rows sharing that value; the five per-row counts are then summed.
 Word_List$AvgFreq <- (Word_List %>% add_count(FirstLet))[, "n"] +
                      (Word_List %>% add_count(SecLet))[, "n"] +
                      (Word_List %>% add_count(ThirdtLet))[, "n"] +
                      (Word_List %>% add_count(FourLet))[, "n"] +
                      (Word_List %>% add_count(FifthLet))[, "n"]

 return(Word_List)
}

Edit:

To provide a word list for example

Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))


Word_List$FirstLet <- substr(Word_List$Word,1,1)
Word_List$SecLet <- substr(Word_List$Word,2,2)
Word_List$ThirdtLet <- substr(Word_List$Word,3,3)
Word_List$FourLet <- substr(Word_List$Word,4,4)
Word_List$FifthLet <- substr(Word_List$Word,5,5)

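For reference, calling Create_Freq() on this small example should add an AvgFreq column whose totals match the output shown in Solution 1 below:

Word_List <- Create_Freq(Word_List)
Word_List[, c("Word", "AvgFreq")]

#    Word AvgFreq
# 1 final       7
# 2 first       7
# 3 lover       5
# 4 thing       5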
 

For context, I have another function that then chooses the word with the highest "average" frequency. (It used to be a true average, but dividing by 5 was pointless since it doesn't change which word has the maximum.)
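A minimal sketch of that selection step (an illustration, not the asker's actual function) could be as simple as:

# Pick the word with the largest AvgFreq; which.max() returns the first
# maximum, so ties are broken by row order.
best_word <- Word_List$Word[which.max(Word_List$AvgFreq)]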



Solution 1:[1]

Here is one possible approach, defining a small auxiliary function f to access a list of counts. When tested, it is roughly 15 times faster on my machine.

f <- function(x, tbl){
  # x: the letters of one word; tbl: a list of per-position count tables
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]  # count of this word's letter at position i
  }
  sum(res)
}

Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
w <- unlist(Word_List, use.names = FALSE)
# Matrix with one row per letter position and one column per word
m <- matrix(unlist(strsplit(w, ""), use.names = FALSE), ncol = 4)
# One table of letter counts per position (row)
lookup <- apply(m, 1, table)

Word_List$AvgFreq <- apply(m, 2, f, lookup)

   Word AvgFreq
1 final       7
2 first       7
3 lover       5
4 thing       5

Further optimizations are possible, especially using a vectorized approach.
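As a rough sketch of that vectorized idea (assuming the asker's FirstLet ... FifthLet column layout; this is not code from either answer), each letter-position column can be tabulated once and the counts indexed by the column values themselves, which avoids the per-word loop entirely:

Create_Freq_vec <- function(Word_List,
                            cols = c("FirstLet", "SecLet", "ThirdtLet",
                                     "FourLet", "FifthLet")) {
  # table() each position column once, look up each row's count by name,
  # then add the five per-position counts row-wise.
  per_position <- sapply(Word_List[cols],
                         function(col) as.integer(table(col)[col]))
  Word_List$AvgFreq <- rowSums(per_position)
  Word_List
}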

Solution 2:[2]

In response to Donald: using your approach ended up being much slower for me, though I had to make a couple of changes to get it to work with a large word list, so let me know if I messed up your methodology:

f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]
  } 
  sum(res)
}  

Word_List <- data.frame(read.delim("Word List.txt"))
Word_List <- Turn_Vector_List(Word_List)    # helper (not shown) that adds the letter-position columns
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)

# Start the clock! Time the adapted table/apply approach
ptm <- proc.time()

m <- data.matrix(Word_List[2:6])               # the five letter-position columns
lookup <- apply(m, 2, table, simplify = FALSE) # one count table per position
Word_List$AvgFreq <- apply(m, 1, f, lookup)

# Stop the clock
ptm2 <- proc.time() - ptm

 
# Re-read the word list and time the original Create_Freq() approach
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
Word_List2 <- Create_Freq(Word_List_Vector)

# Stop the clock
ptm3 <- proc.time() - ptm - ptm2

ptm2   # adapted table/apply approach

# user  system elapsed 
# 0.89    0.78    1.69

ptm3   # original Create_Freq approach

# user  system elapsed 
# 0.06    0.00    0.06

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1
Solution 2: Curt