'fast way to identify pair of linestring geodataframes that are nearest neighbors
I have a list of 90 geodataframes, all containing LineStrings that are connected to each other (imagine a MultiLineString).
From this list, I would like to identify the two GDFs that are in closest proximity to each other (closest as in considering the extents of the combined linestrings of each GDF).
A manual way i can imagine doing this is to populate a 90x90 matrix and call the distance function as in:
matrix = np.zeros((90, 90))
gdfs = [gdf1, gdf2, gdf3, gdf4, ..., gdf90]
for i, gdf_init in enumerate(gdfs):
for j, gdf_pair in enumerate(gdfs):
min_dist = gdf_init.distance(gdf_pair).min()
matrix[i, j] = min_dist
And then use np.where to get the (i, j) values of the smallest min_dist value in the matrix.
However, perhaps nested for loops are not the most pythonic way to go about things. Wondering if anyone has an optimized implementation recommendation for this task?
Solution 1:[1]
- you have not provided sample data, so have used osmnx to get line strings. Each source data frame will be of this structure:
| osmid | oneway | name | highway | maxspeed | length | geometry |
|---|---|---|---|---|---|---|
| 122233552 | False | Three Elms Road | primary | 30 mph | 6.899 | LINESTRING (-2.7428406 52.0653426, -2.7428844 52.0653985) |
| 34414510 | False | Moor Park Road | residential | 30 mph | 270.368 | LINESTRING (-2.7428406 52.0653426, -2.7433504 52.0651847, -2.7434275 52.0651448, -2.7445906 52.0638372, -2.7450142 52.0633776) |
| 122233552 | False | Three Elms Road | primary | 30 mph | 126.662 | LINESTRING (-2.7428406 52.0653426, -2.7426267 52.0649723, -2.7423405 52.0642472) |
| 33840333 | False | nan | residential | nan | 21.267 | LINESTRING (-2.7417117 52.0629806, -2.7419991 52.0629074) |
| 122233552 | False | Three Elms Road | primary | 30 mph | 90.536 | LINESTRING (-2.7417117 52.0629806, -2.7414841 52.0626687, -2.7412595 52.0623927, -2.7411227 52.0622522) |
- have used dict instead of list for holding source geo data frames
- with dict of source geo data frames, construct a geo data frame that is the convex hull of the union of line strings
- core solution geopandas
sjoin_nearest()- for each convex hull find nearest convex hull (exclude self)
- result is a data frame, sort by distance and you have the answer of which are the two closest source data frames
sourcing line strings
import osmnx as ox
import geopandas as gpd
import pandas as pd
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
cities = ["Hereford", "Worcester", "Gloucester", "Ledbury", "Newent", "Malvern", "Tewkesbury"]
# constituent geo data frames of line strings
# use a dict instead of a list
gdfs = {
c: ox.graph_to_gdfs(
ox.graph_from_place({"city": c, "country": "UK"}, network_type="drive"),
edges=True,
)[1].pipe(lambda d: d.dropna(axis=1, thresh=len(d) / 4))
for c in cities
}
convex hull
# generate a geo data frame of convex hulls of all linestring in constituent dataframes
gdf_ch = (
gpd.GeoDataFrame(
pd.DataFrame({"place": gdfs.keys()}),
geometry=[gdfs[c]["geometry"].unary_union.convex_hull for c in gdfs.keys()],
crs=list(gdfs.values())[0].crs,
)
.set_index("place", drop=False)
.to_crs("EPSG:3857")
)
nearest
gdf_nearest = pd.concat(
[
gdf_ch.loc[[c]].sjoin_nearest(gdf_ch.drop(c), distance_col="distance")
for c in gdfs.keys()
]
).sort_values("distance")
gdf_nearest
| place | place_left | index_right | place_right | distance |
|---|---|---|---|---|
| Worcester | Worcester | Malvern | Malvern | 9346.6 |
| Malvern | Malvern | Worcester | Worcester | 9346.6 |
| Ledbury | Ledbury | Newent | Newent | 11135.1 |
| Newent | Newent | Ledbury | Ledbury | 11135.1 |
| Gloucester | Gloucester | Newent | Newent | 14500.8 |
| Tewkesbury | Tewkesbury | Gloucester | Gloucester | 17150.6 |
| Hereford | Hereford | Ledbury | Ledbury | 22696.7 |
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rob Raymond |
