More efficient way to create frequency column based on different groupings?
I have code below that calculates a frequency for each column element (relative to its own column) and adds all five frequencies together in a new column. The code works but is very slow, and the majority of the processing time is spent in this function. Any ideas on how to accomplish the same goal more efficiently?
Create_Freq <- function(Word_List) {
  library(dplyr)
  # add_count(col) appends a column n holding the frequency of each value
  # of col; the five per-position counts are summed into AvgFreq
  Word_List$AvgFreq <- (Word_List %>% add_count(FirstLet))[, "n"] +
    (Word_List %>% add_count(SecLet))[, "n"] +
    (Word_List %>% add_count(ThirdtLet))[, "n"] +
    (Word_List %>% add_count(FourLet))[, "n"] +
    (Word_List %>% add_count(FifthLet))[, "n"]
  return(Word_List)
}
Edit: to provide an example word list:
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
Word_List$FirstLet <- substr(Word_List$Word,1,1)
Word_List$SecLet <- substr(Word_List$Word,2,2)
Word_List$ThirdtLet <- substr(Word_List$Word,3,3)
Word_List$FourLet <- substr(Word_List$Word,4,4)
Word_List$FifthLet <- substr(Word_List$Word,5,5)
For context, I have another function that then chooses the word with the highest "average" frequency. (It used to be an average, but dividing by 5 was pointless since it didn't affect which word was the maximum.)
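As a baseline for comparison, the five `add_count()` joins can be collapsed into single-pass column lookups with base R's `ave()`; this is a sketch of one alternative (my suggestion, not from the original post), run on the sample data above with no dplyr dependency:

```r
# Sketch: ave(seq_along(x), x, FUN = length) returns, for each element,
# the size of its group, i.e. the frequency of that value in the column.
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
Word_List$FirstLet  <- substr(Word_List$Word, 1, 1)
Word_List$SecLet    <- substr(Word_List$Word, 2, 2)
Word_List$ThirdtLet <- substr(Word_List$Word, 3, 3)
Word_List$FourLet   <- substr(Word_List$Word, 4, 4)
Word_List$FifthLet  <- substr(Word_List$Word, 5, 5)

cnt <- function(x) ave(seq_along(x), x, FUN = length)
Word_List$AvgFreq <- with(Word_List,
  cnt(FirstLet) + cnt(SecLet) + cnt(ThirdtLet) + cnt(FourLet) + cnt(FifthLet))
Word_List$AvgFreq  # 7 7 5 5
```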
Solution 1:[1]
Here is one possible approach: define a small auxiliary function f that looks up each letter's count in a list of per-position tables. When tested, it was roughly 15 times faster on my machine.
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]
  }
  sum(res)
}
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
w <- unlist(Word_List, use.names = FALSE)
# one row per letter position, one column per word
m <- matrix(unlist(strsplit(w, "")), nrow = 5)
lookup <- apply(m, 1, table)  # per-position letter counts
Word_List$AvgFreq <- apply(m, 2, f, lookup)
   Word AvgFreq
1 final       7
2 first       7
3 lover       5
4 thing       5
Further optimizations are possible, especially using a vectorized approach.
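One such vectorized variant (my sketch, not tested against a full word list) replaces the per-word loop in f with a table lookup per letter position, indexing each position's table by the letters themselves:

```r
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
# one row per letter position, one column per word
m <- matrix(unlist(strsplit(Word_List$Word, "")), nrow = 5)
# for each position, map every letter to its count in one vectorized
# subset, then sum across the five positions per word
Word_List$AvgFreq <- rowSums(sapply(seq_len(nrow(m)), function(i) {
  tbl <- table(m[i, ])
  as.integer(tbl[m[i, ]])
}))
Word_List$AvgFreq  # 7 7 5 5
```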
Solution 2:[2]
In response to Donald: using your approach ended up being much slower for me, but I had to make a couple of changes to get it to work with a large word list, so let me know if I misapplied your methodology:
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]
  }
  sum(res)
}
Word_List <- data.frame(read.delim("Word List.txt"))
Word_List <- Turn_Vector_List(Word_List)

# Start the clock!
ptm <- proc.time()
# lookup-table approach from Solution 1, with words as rows
m <- data.matrix(Word_List[2:6])
lookup <- apply(m, 2, table, simplify = FALSE)
Word_List$AvgFreq <- apply(m, 1, f, lookup)
# Stop the clock
ptm2 <- proc.time() - ptm

# original Create_Freq() approach
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
Word_List2 <- Create_Freq(Word_List_Vector)
ptm3 <- proc.time() - ptm - ptm2

ptm2
#  user  system elapsed
#  0.89    0.78    1.69
ptm3
#  user  system elapsed
#  0.06    0.00    0.06
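The manual proc.time() start/stop arithmetic above can also be expressed with system.time(), which handles the bookkeeping itself; here is a minimal sketch on the in-memory sample data (my simplification, not the original benchmark against "Word List.txt"):

```r
# f from Solution 1, repeated so this snippet is self-contained
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)) res[i] <- tbl[[i]][x[i]]
  sum(res)
}
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
m <- matrix(unlist(strsplit(Word_List$Word, "")), nrow = 5)

# system.time() returns user/system/elapsed for the braced expression,
# replacing the manual ptm / proc.time() differences
timing <- system.time({
  lookup <- apply(m, 1, table)
  Word_List$AvgFreq <- apply(m, 2, f, lookup)
})
timing["elapsed"]  # elapsed seconds for the lookup-table approach
```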
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Curt |
