How to connect components in Spark when the data is too large

When dealing with connected components on big data, I find it very difficult to merge them in Spark.

The data structure in my research can be simplified to RDD[Array[Int]]. For example:

RDD[Array(1,2,3), Array(1,4), Array(5,6), Array(5,6,7,8), Array(9), Array(1)]

The objective is to merge two arrays whenever they have a non-empty intersection, ending up with arrays that are pairwise disjoint. Therefore, after merging, it should be:

RDD[Array(1,2,3,4), Array(5,6,7,8), Array(9)]

The problem is essentially connected components, as in the Pregel framework for graph algorithms. One solution is to first find the edge connections between pairs of arrays using a Cartesian product and then merge them. However, in my case there are 300K arrays with a total size of 1 GB, so the time and memory complexity would be roughly 300K × 300K. When I run the program on my Mac Pro in Spark, it gets completely stuck.
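For reference, one way to avoid the full Cartesian product is to treat every integer as a vertex, emit one edge per element (linking each element of an array to that array's first element), and let a connected-components algorithm do the merging. Below is a minimal sketch using GraphX's connectedComponents; the helper name mergeByConnectedComponents is only illustrative, and it assumes GraphX is available and that the values fit into Long vertex ids.

import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Graph, VertexId}

def mergeByConnectedComponents(data: RDD[Array[Int]]): RDD[Array[Int]] = {
  // one edge per element: link every element to the first element of its array
  val edges: RDD[(VertexId, VertexId)] = data
    .filter(_.nonEmpty)
    .flatMap(arr => arr.map(x => (arr.head.toLong, x.toLong)))
  val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)
  // connectedComponents labels every vertex with the smallest vertex id in its component
  graph.connectedComponents()
    .vertices
    .map { case (vertex, label) => (label, vertex.toInt) }
    .groupByKey()
    .map { case (_, members) => members.toArray.distinct.sorted }
}

This builds one edge per element (plus the component join) instead of 300K × 300K candidate pairs.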


Thanks



Solution 1:[1]

Here is my solution. It may not be elegant, but it works for small data. Whether it scales to large data still needs to be verified.

import org.apache.spark.rdd.RDD

def mergeCanopy(canopies: RDD[Array[Int]]): Array[Array[Int]] = {
  /* try to merge two canopies */
  // fold every array into a set of pairwise-disjoint clusters,
  // then union the per-partition results
  val s = Set[Array[Int]]()
  val c = canopies.aggregate(s)(mergeOrAppend, _ ++ _)
  c.toArray
}

def mergeOrAppend(disjoint: Set[Array[Int]], cluster: Array[Int]): Set[Array[Int]] = {
  var disjoints = disjoint
  // merge the incoming array into the first existing cluster it overlaps with
  for (clus <- disjoint) {
    if (clus.toSet.intersect(cluster.toSet).nonEmpty) {
      disjoints += (clus.toSet ++ cluster.toSet).toArray
      disjoints -= clus
      return disjoints
    }
  }
  // no overlap found: keep it as a new cluster
  disjoints += cluster
  disjoints
}
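A quick way to try this, assuming an existing SparkContext named sc (the variable names below are only illustrative):

val canopies = sc.parallelize(Seq(
  Array(1, 2, 3), Array(1, 4), Array(5, 6),
  Array(5, 6, 7, 8), Array(9), Array(1)))

// prints three disjoint groups: {1,2,3,4}, {5,6,7,8}, {9}
mergeCanopy(canopies).foreach(a => println(a.sorted.mkString(",")))

Note that the combine step _ ++ _ only unions the per-partition results, and mergeOrAppend merges an incoming array with the first overlapping cluster it finds, so on larger, multi-partition data the output may still contain overlapping clusters and the pass may need to be repeated until it stabilizes.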

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution source:
Solution 1: Siyu Leng