'fast way to identify pair of linestring geodataframes that are nearest neighbors

I have a list of 90 geodataframes, all containing LineStrings that are connected to each other (imagine a MultiLineString).

From this list, I would like to identify the two GDFs that are in closest proximity to each other (closest as in considering the extents of the combined linestrings of each GDF).

A manual way i can imagine doing this is to populate a 90x90 matrix and call the distance function as in:

matrix = np.zeros((90, 90))
gdfs = [gdf1, gdf2, gdf3, gdf4, ..., gdf90]

for i, gdf_init in enumerate(gdfs):
   for j, gdf_pair in enumerate(gdfs):
      min_dist = gdf_init.distance(gdf_pair).min()
      matrix[i, j] = min_dist

And then use np.where to get the (i, j) values of the smallest min_dist value in the matrix.

However, perhaps nested for loops are not the most pythonic way to go about things. Wondering if anyone has an optimized implementation recommendation for this task?



Solution 1:[1]

  • you have not provided sample data, so have used osmnx to get line strings. Each source data frame will be of this structure:
osmid oneway name highway maxspeed length geometry
122233552 False Three Elms Road primary 30 mph 6.899 LINESTRING (-2.7428406 52.0653426, -2.7428844 52.0653985)
34414510 False Moor Park Road residential 30 mph 270.368 LINESTRING (-2.7428406 52.0653426, -2.7433504 52.0651847, -2.7434275 52.0651448, -2.7445906 52.0638372, -2.7450142 52.0633776)
122233552 False Three Elms Road primary 30 mph 126.662 LINESTRING (-2.7428406 52.0653426, -2.7426267 52.0649723, -2.7423405 52.0642472)
33840333 False nan residential nan 21.267 LINESTRING (-2.7417117 52.0629806, -2.7419991 52.0629074)
122233552 False Three Elms Road primary 30 mph 90.536 LINESTRING (-2.7417117 52.0629806, -2.7414841 52.0626687, -2.7412595 52.0623927, -2.7411227 52.0622522)
  • have used dict instead of list for holding source geo data frames
  • with dict of source geo data frames, construct a geo data frame that is the convex hull of the union of line strings
  • core solution geopandas sjoin_nearest()
    1. for each convex hull find nearest convex hull (exclude self)
    2. result is a data frame, sort by distance and you have the answer of which are the two closest source data frames

sourcing line strings

import osmnx as ox
import geopandas as gpd
import pandas as pd
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

cities = ["Hereford", "Worcester", "Gloucester", "Ledbury", "Newent", "Malvern", "Tewkesbury"]

# constituent geo data frames of line strings
# use a dict instead of a list
gdfs = {
    c: ox.graph_to_gdfs(
        ox.graph_from_place({"city": c, "country": "UK"}, network_type="drive"),
        edges=True,
    )[1].pipe(lambda d: d.dropna(axis=1, thresh=len(d) / 4))
    for c in cities
}

convex hull

# generate a geo data frame of convex hulls of all linestring in constituent dataframes
gdf_ch = (
    gpd.GeoDataFrame(
        pd.DataFrame({"place": gdfs.keys()}),
        geometry=[gdfs[c]["geometry"].unary_union.convex_hull for c in gdfs.keys()],
        crs=list(gdfs.values())[0].crs,
    )
    .set_index("place", drop=False)
    .to_crs("EPSG:3857")
)

nearest

gdf_nearest = pd.concat(
    [
        gdf_ch.loc[[c]].sjoin_nearest(gdf_ch.drop(c), distance_col="distance")
        for c in gdfs.keys()
    ]
).sort_values("distance")

gdf_nearest
place place_left index_right place_right distance
Worcester Worcester Malvern Malvern 9346.6
Malvern Malvern Worcester Worcester 9346.6
Ledbury Ledbury Newent Newent 11135.1
Newent Newent Ledbury Ledbury 11135.1
Gloucester Gloucester Newent Newent 14500.8
Tewkesbury Tewkesbury Gloucester Gloucester 17150.6
Hereford Hereford Ledbury Ledbury 22696.7

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rob Raymond