How to connect components in Spark when data is too large
When computing connected components over big data, I find it very difficult to merge them in Spark.
The data structure in my research can be simplified to RDD[Array[Int]]. For example:
RDD[Array(1,2,3), Array(1,4), Array(5,6), Array(5,6,7,8), Array(9), Array(1)]
The objective is to merge any two arrays that have a non-empty intersection, ending up with arrays that are pairwise disjoint. Therefore, after merging it should be:
RDD[Array(1,2,3,4), Array(5,6,7,8), Array(9)]
The problem is essentially connected components, as in the Pregel framework for graph algorithms. One solution is to first find the edges between arrays using a cartesian product and then merge them. However, in my case there are 300K arrays with a total size of 1 GB, so the time and memory cost would be roughly 300K * 300K pairs. When I run the program on my Mac Pro in Spark, it gets completely stuck.
Basically, it is like the sketch below (an illustrative reconstruction; the names `overlapEdges` and `arrays` are placeholders, not from the original post):
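```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch of the quadratic approach: index every array, take the
// cartesian product, and keep index pairs whose arrays overlap. With 300K
// arrays this materializes on the order of 9 * 10^10 pairs, which is why it stalls.
def overlapEdges(arrays: RDD[Array[Int]]): RDD[(Long, Long)] = {
  val indexed = arrays.zipWithIndex().map(_.swap)   // (id, array)
  indexed.cartesian(indexed)
    .filter { case ((i, a), (j, b)) => i < j && a.intersect(b).nonEmpty }
    .map    { case ((i, _), (j, _)) => (i, j) }
}
```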
Thanks
Solution 1 [1]:
Here is my solution. It might not be elegant, but it works for small data; whether it scales to large data needs further testing.
```scala
import org.apache.spark.rdd.RDD

def mergeCanopy(canopies: RDD[Array[Int]]): Array[Array[Int]] = {
  // Try to merge canopies: fold one cluster into a set of pairwise-disjoint
  // clusters, unioning it with every cluster it overlaps (not just the first),
  // otherwise appending it as a new cluster.
  def mergeOrAppend(disjoint: Set[Set[Int]], cluster: Set[Int]): Set[Set[Int]] = {
    val (overlapping, rest) = disjoint.partition(c => (c & cluster).nonEmpty)
    rest + overlapping.foldLeft(cluster)(_ ++ _)
  }

  val zero = Set[Set[Int]]()
  val merged = canopies.aggregate(zero)(
    (acc, arr) => mergeOrAppend(acc, arr.toSet), // seqOp: merge within each partition
    (a, b) => b.foldLeft(a)(mergeOrAppend)       // combOp: re-merge partition results
  )
  merged.map(_.toArray).toArray
}
```
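For reference, a quick run on the example from the question (assuming an active SparkContext named `sc`):

```scala
val rdd = sc.parallelize(Seq(
  Array(1, 2, 3), Array(1, 4), Array(5, 6),
  Array(5, 6, 7, 8), Array(9), Array(1)))
mergeCanopy(rdd).map(_.sorted.mkString(",")).foreach(println)
// expected (in some order): 1,2,3,4   5,6,7,8   9
```

For data at the scale in the question, a common route for the Pregel-style connected components mentioned above is GraphX's built-in `connectedComponents`. The sketch below is one way to wire it up; the helper name `mergeViaGraphX` and the trick of chaining each array's consecutive elements into path edges are assumptions for illustration, not from the original post. It avoids the cartesian product entirely:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Sketch: treat every Int as a vertex and chain each array's consecutive
// elements into edges, so each array becomes a connected path. GraphX's
// Pregel-based connectedComponents then labels every vertex with the
// smallest vertex id in its component.
def mergeViaGraphX(arrays: RDD[Array[Int]]): RDD[Array[Int]] = {
  val edges = arrays.flatMap { a =>
    a.sliding(2).collect { case Array(u, v) => Edge(u.toLong, v.toLong, 1) }
  }
  val vertices = arrays.flatMap(_.map(x => (x.toLong, x)))
  Graph(vertices, edges)
    .connectedComponents()
    .vertices                                   // (vertexId, componentId)
    .map { case (v, comp) => (comp, v.toInt) }
    .groupByKey()
    .map { case (_, members) => members.toArray.sorted }
}
```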
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Siyu Leng |

