'How to cluster rows in dataset with two geospatial attributes
Title might be misleading as I wasn't sure how to properly summarize the problem.
I have a dataset of trips with two locations (source and destination) and also other attributes (about customer, cargo, equipment, etc).
Are there any algorithms that I could apply in order to cluster those trips, given that I want to use both spatial points (source and destination) for clustering, not just one.
Let's say if I have following trips:
A1 -> B1
A2 -> B2
A1 -> C1
A2 -> C2
I want to get clusters like:
A -> B
A -> C
Solution 1:[1]
A very simple solution I can think about is to cluster each location independently, and then use cluster ids to group by both clusters.
Something like (this was tested in Google BigQuery):
with data as (
select st_geogpoint(100, 50) a, st_geogpoint(101, 51) b
union all
select st_geogpoint(100.01, 50) a, st_geogpoint(101.01, 51) b
union all
select st_geogpoint(100, 50.01) a, st_geogpoint(90, 51) b
union all
select st_geogpoint(100.01, 50.01) a, st_geogpoint(90, 51.01) b
),
clusters as (
select
a, b,
st_clusterdbscan(a, 1e4, 1) OVER() a_id,
st_clusterdbscan(b, 1e4, 1) OVER() b_id
from data
)
select
a_id, b_id,
st_centroid_agg(a) a_center,
st_centroid_agg(b) b_center
from clusters
group by a_id, b_id
a_id b_id a_center b_center
0 0 POINT(100.005 50.0000001074259) POINT(101.005 51.0000001066994)
0 1 POINT(100.005 50.0100001074192) POINT(90 51.005)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Michael Entin |