Optimize lapply for distance matrix function R

I am trying to find the number of clusters from an HDBSCAN analysis of cell coordinates, grouped by image ID in a data frame. My approach so far is to split the data frame (columns ID, X, and Y) by ID and then use lapply to run a function on each element, like so:

library(dplyr)  # %>% and bind_rows()
library(dbscan) # hdbscan()

# split data frame "d" into a list by image ID, dropping the ID column (column 1)
dlist <- split(d[, -1], d$ID)

cls <- function(x) {
  # full Euclidean distance matrix for one image's X,Y coordinates
  dm <- dist(x, method = "euclidean") %>% as.matrix()
  # unsupervised cluster analysis on that matrix
  cl <- hdbscan(dm, minPts = 3)
  # number of clusters found for this image ID
  lv <- length(cl$cluster_scores)
  return(lv)
}

ClusterNumbers <- lapply(dlist, FUN = cls) %>% bind_rows()

I know the clustering methodology may not be the most robust, but it is only a proof of concept at present. My issue is that this method is painfully slow, so I am looking for a way (short of submitting it to the university HPCC) to make the process more efficient and faster. I have tried generating the distance matrices before the cluster analysis, but the amount of data prohibits this: R cannot allocate vectors that large.

Any help would be awesome.
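
One direction that may be worth trying, offered as a sketch rather than a tested fix for the real data: as far as the dbscan documentation describes, hdbscan() accepts either a data matrix (Euclidean distances are then used) or a dist object, so the dense n-by-n matrix from as.matrix() may not be needed at all. The example below passes the X,Y coordinates directly and counts the non-noise cluster labels (0 marks noise in the dbscan package). The toy data frame d, the helper name count_clusters, and the min_pts argument are all hypothetical, and it assumes every image has more than minPts points.

```r
library(dbscan)  # provides hdbscan()

# Hypothetical stand-in for the data frame "d": columns ID, X, Y
set.seed(1)
d <- data.frame(
  ID = rep(c("img1", "img2"), each = 60),
  X  = c(rnorm(30, 0, 0.3), rnorm(30, 5, 0.3), rnorm(60, 0, 0.3)),
  Y  = c(rnorm(30, 0, 0.3), rnorm(30, 5, 0.3), rnorm(60, 0, 0.3))
)

# Hypothetical helper: cluster one image's coordinates and count clusters.
# The coordinates go to hdbscan() directly, so no n-by-n matrix is built here.
count_clusters <- function(xy, min_pts = 3) {
  cl <- hdbscan(as.matrix(xy), minPts = min_pts)
  # cluster label 0 denotes noise points, so exclude it from the count
  length(setdiff(unique(cl$cluster), 0L))
}

# Split only the coordinate columns by image ID
dlist <- split(d[, c("X", "Y")], d$ID)

# vapply returns a named integer vector, one cluster count per image ID
cluster_numbers <- vapply(dlist, count_clusters, integer(1))
print(cluster_numbers)
```

Because each image is processed independently, the vapply call could in principle be swapped for parallel::mclapply on Linux or macOS, though whether that helps depends on the number and size of the images.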

r matrix cluster-analysis

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
