Scala Apache Spark: Filter DF Using an Arbitrary Number of Bounding Boxes Read From File
Is there a scalable way to perform rectangular bounding-box filters on a large data set without additional frameworks like Apache Sedona or GeoMesa?
Suppose a toy data set of:
import spark.implicits._ // needed for toDF() and the $-column syntax

val data = Seq(
  (1, 1646113023, 34.073071, -118.257962),
  (2, 1646199423, 34.074715, -118.263144),
  (3, 1646285823, 34.032621, -118.224268),
  (4, 1646285823, 33.718508, -117.808853),
  (5, 1646372223, 33.716589, -117.804304))
val df = spark.sparkContext.parallelize(data).toDF()
val new_col = Seq("id", "time", "lat", "long")
val columnList = df.columns.zip(new_col).map(f => col(f._1).as(f._2))
val df2 = df.select(columnList: _*)
df2.show()
+---+----------+---------+-----------+
| id| time| lat| long|
+---+----------+---------+-----------+
| 1|1646113023|34.073071|-118.257962|
| 2|1646199423|34.074715|-118.263144|
| 3|1646285823|34.032621|-118.224268|
| 4|1646285823|33.718508|-117.808853|
| 5|1646372223|33.716589|-117.804304|
+---+----------+---------+-----------+
I know that I can use filter with multiple conditions to subset a df to a single bounding box:
df2.filter(($"long" >= -117.858217) && ($"lat" >= 33.711760) && ($"long" <= -117.765176) && ($"lat" <= 33.759725)).show()
+---+----------+---------+-----------+
| id| time| lat| long|
+---+----------+---------+-----------+
| 4|1646285823|33.718508|-117.808853|
| 5|1646372223|33.716589|-117.804304|
+---+----------+---------+-----------+
But what if there is an arbitrary number of bounding boxes read in from a file, such that:
val bb = Seq((-117.858217, 33.711760, -117.765176, 33.759725), (-118.294084, 34.054451, -118.211515, 34.104072))
val bbDF = spark.sparkContext.parallelize(bb).toDF()
bbDF.show()
+-----------+---------+-----------+---------+
| _1| _2| _3| _4|
+-----------+---------+-----------+---------+
|-117.858217| 33.71176|-117.765176|33.759725|
|-118.294084|34.054451|-118.211515|34.104072|
+-----------+---------+-----------+---------+
How can I filter df2 using bbDF?
Open to pyspark answers, but first trying it out in Scala.
Solution 1
The first option is to try combining all the filters; if the performance is not good enough, you can try another strategy. If the amount of data is too big and the number of filters is not the problem, first filter with the outer envelope of all the boxes to delete all positions that for sure won't be in any of them, something like:
val minLong = bb.map(_._1).min
val minLat = bb.map(_._2).min
val maxLong = bb.map(_._3).max
val maxLat = bb.map(_._4).max

// one condition per box, OR-ed together
val filterBB: Column = bb.map { case (minLo, minLa, maxLo, maxLa) =>
  $"long" >= minLo && $"long" <= maxLo && $"lat" >= minLa && $"lat" <= maxLa
}.reduce(_ || _)

df2.filter($"long" >= minLong && $"long" <= maxLong && $"lat" >= minLat && $"lat" <= maxLat)
  .filter(filterBB)
The first filter is very easy to push down to the source file, and if the file format allows it, Spark will read only chunks of the data that can contain elements in that area.
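The OR-combined predicate can also be assembled as a SQL string on the driver and passed to `filter` via `expr`. This is a minimal sketch of that alternative, assuming the boxes are available as the question's `bb` sequence (or collected from `bbDF` with `.as[(Double, Double, Double, Double)].collect()`); `bbPredicate` is a hypothetical helper name:

```scala
// Hypothetical helper: turn (minLong, minLat, maxLong, maxLat) tuples
// into one OR-combined SQL predicate string.
def bbPredicate(boxes: Seq[(Double, Double, Double, Double)]): String =
  boxes.map { case (minLong, minLat, maxLong, maxLat) =>
    s"(long >= $minLong AND long <= $maxLong AND lat >= $minLat AND lat <= $maxLat)"
  }.mkString(" OR ")

val bb = Seq(
  (-117.858217, 33.711760, -117.765176, 33.759725),
  (-118.294084, 34.054451, -118.211515, 34.104072))

// In Spark this would be used as: df2.filter(expr(bbPredicate(bb))).show()
println(bbPredicate(bb))
```

The string form trades type safety for easy logging of the exact predicate; the `Column`-based version above is otherwise equivalent.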
If the problem is the number of bounding boxes, you can try a strategy of clustering them into bigger boxes and, when an element falls inside one, inspecting it only with the smaller boxes it contains. For example, taking the previous code, we can divide the envelope (minLong, minLat, maxLong, maxLat) into 4 boxes and record which bb's overlap each one. (some pseudocode)
val halfLong = (minLong + maxLong) / 2
val halfLat = (minLat + maxLat) / 2

// simpler filter built only from the boxes overlapping the left-down square
val leftDownBB: Column = bb.filter(...) // keep only boxes overlapping that square
  .map { case (minLo, minLa, maxLo, maxLa) =>
    $"long" >= minLo && $"long" <= maxLo && $"lat" >= minLa && $"lat" <= maxLa
  }.reduce(_ || _)

df2.filter($"long" >= minLong && $"long" <= maxLong && $"lat" >= minLat && $"lat" <= maxLat)
  .filter(
    when($"long" < halfLong && $"lat" < halfLat, leftDownBB)
    .when($"long" >= halfLong && $"lat" < halfLat, rightDownBB)
    .when($"long" < halfLong && $"lat" >= halfLat, leftUpBB)
    .when($"long" >= halfLong && $"lat" >= halfLat, rightUpBB)
  )
This is an example with 4 clusters, but make as many as you need. You can also make the code recursive and divide each square into another 4 if you need more layers of the tree. The `when`s can be nested.
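The routing logic behind this clustering can be sketched in plain Scala (no Spark). `quadrant` and `overlaps` are hypothetical helper names: the first picks the sub-square a point falls into, the second decides which boxes must be kept for a given sub-square (a box straddling a midline lands in more than one bucket):

```scala
// Hypothetical helper: index of the quadrant a point falls into,
// given the envelope midpoints.
// 0 = left-down, 1 = right-down, 2 = left-up, 3 = right-up.
def quadrant(long: Double, lat: Double, halfLong: Double, halfLat: Double): Int =
  (if (long < halfLong) 0 else 1) + (if (lat < halfLat) 0 else 2)

// Hypothetical helper: does a (minLong, minLat, maxLong, maxLat) box
// overlap the rectangle of a quadrant? Used to build each quadrant's bucket.
def overlaps(box: (Double, Double, Double, Double),
             qMinLong: Double, qMinLat: Double,
             qMaxLong: Double, qMaxLat: Double): Boolean = {
  val (minLong, minLat, maxLong, maxLat) = box
  minLong <= qMaxLong && maxLong >= qMinLong &&
    minLat <= qMaxLat && maxLat >= qMinLat
}
```

Applying `quadrant` recursively to midpoints of ever-smaller envelopes gives the quadtree layering described above.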
Hope I've explained myself.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Alfilercio |
