Optimize lapply for distance matrix function R
I am trying to find the number of clusters from an HDBSCAN analysis of cell coordinates, grouped by image ID in a data frame. My approach so far is to split the data frame (columns ID, X, and Y) by ID and then use lapply to run a function on each element, like so:
library(dplyr)   # %>% and bind_rows()
library(dbscan)  # hdbscan(); assumed to come from the dbscan package

dlist <- split(d[, -c(1)], d$ID) # split data frame "d" into a list by ID and drop the ID column (column 1)
cls <- function(x) {
  dm <- dist(x, method = "euclidean", p = 2) %>% as.matrix() # distance matrix for each image ID's X,Y coordinates
  cl <- hdbscan(dm, minPts = 3) # run unsupervised cluster analysis on the matrix
  lv <- length(cl$cluster_scores)
  return(lv) # return the number of clusters for each image ID
}
ClusterNumbers <- lapply(dlist, FUN = cls) %>% bind_rows()
I know the clustering methodology may not be the most robust, but it is just a proof of concept at present. My issue is that this method is painfully slow, so I am looking for a way (short of submitting it to the university HPCC) to make it more efficient and faster. I have tried generating the distance matrices before the cluster analysis, but the amount of data prohibits this: R cannot allocate vectors that large.
Any help would be awesome.
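One possible direction, sketched under assumptions rather than offered as a definitive fix: as far as the dbscan documentation describes it, hdbscan() accepts either a data matrix (Euclidean distances are used) or a dist object, so the coordinates can be passed directly and the dense n x n copy made by as.matrix() can be skipped. The example data, the helper name count_clusters, and its min_pts argument are hypothetical stand-ins, and hdbscan() is assumed to come from the dbscan package.

```r
library(dbscan)  # assumed source of hdbscan()

# Hypothetical example data standing in for the data frame "d":
# 20 image IDs with 200 cells each, plus X and Y coordinates.
set.seed(42)
d <- data.frame(
  ID = rep(sprintf("img%02d", 1:20), each = 200),
  X  = runif(4000),
  Y  = runif(4000)
)

# Split only the coordinate columns by image ID.
dlist <- split(d[, c("X", "Y")], d$ID)

# Count clusters for one image. The coordinates go straight to hdbscan(),
# so no explicit dense distance matrix is built with as.matrix(dist(...)).
count_clusters <- function(xy, min_pts = 3) {
  cl <- hdbscan(as.matrix(xy), minPts = min_pts)
  length(cl$cluster_scores)
}

# vapply() returns a named integer vector, one entry per image ID.
cluster_numbers <- vapply(dlist, count_clusters, integer(1))
```

Because each image is processed independently, the same function could also be handed to parallel::mclapply() (on Unix-alikes) or parallel::parLapply() in place of vapply(); whether that helps depends on the number of images and available cores.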
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
