'fast way to identify pair of linestring geodataframes that are nearest neighbors

I have a list of 90 geodataframes, all containing LineStrings that are connected to each other (imagine a MultiLineString).

From this list, I would like to identify the two GDFs that are in closest proximity to each other (closest as in considering the extents of the combined linestrings of each GDF).

A manual way i can imagine doing this is to populate a 90x90 matrix and call the distance function as in:

matrix = np.zeros((90, 90))
gdfs = [gdf1, gdf2, gdf3, gdf4, ..., gdf90]

for i, gdf_init in enumerate(gdfs):
   for j, gdf_pair in enumerate(gdfs):
      min_dist = gdf_init.distance(gdf_pair).min()
      matrix[i, j] = min_dist

And then use np.where to get the (i, j) values of the smallest min_dist value in the matrix.

However, perhaps nested for loops are not the most pythonic way to go about things. Wondering if anyone has an optimized implementation recommendation for this task?

Solution 1:^[1]

you have not provided sample data, so have used osmnx to get line strings. Each source data frame will be of this structure:

osmid	oneway	name	highway	maxspeed	length	geometry
122233552	False	Three Elms Road	primary	30 mph	6.899	LINESTRING (-2.7428406 52.0653426, -2.7428844 52.0653985)
34414510	False	Moor Park Road	residential	30 mph	270.368	LINESTRING (-2.7428406 52.0653426, -2.7433504 52.0651847, -2.7434275 52.0651448, -2.7445906 52.0638372, -2.7450142 52.0633776)
122233552	False	Three Elms Road	primary	30 mph	126.662	LINESTRING (-2.7428406 52.0653426, -2.7426267 52.0649723, -2.7423405 52.0642472)
33840333	False	nan	residential	nan	21.267	LINESTRING (-2.7417117 52.0629806, -2.7419991 52.0629074)
122233552	False	Three Elms Road	primary	30 mph	90.536	LINESTRING (-2.7417117 52.0629806, -2.7414841 52.0626687, -2.7412595 52.0623927, -2.7411227 52.0622522)

have used dict instead of list for holding source geo data frames
with dict of source geo data frames, construct a geo data frame that is the convex hull of the union of line strings
core solution geopandas sjoin_nearest()
1. for each convex hull find nearest convex hull (exclude self)
2. result is a data frame, sort by distance and you have the answer of which are the two closest source data frames

sourcing line strings

import osmnx as ox
import geopandas as gpd
import pandas as pd
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

cities = ["Hereford", "Worcester", "Gloucester", "Ledbury", "Newent", "Malvern", "Tewkesbury"]

# constituent geo data frames of line strings
# use a dict instead of a list
gdfs = {
    c: ox.graph_to_gdfs(
        ox.graph_from_place({"city": c, "country": "UK"}, network_type="drive"),
        edges=True,
    )[1].pipe(lambda d: d.dropna(axis=1, thresh=len(d) / 4))
    for c in cities
}

convex hull

# generate a geo data frame of convex hulls of all linestring in constituent dataframes
gdf_ch = (
    gpd.GeoDataFrame(
        pd.DataFrame({"place": gdfs.keys()}),
        geometry=[gdfs[c]["geometry"].unary_union.convex_hull for c in gdfs.keys()],
        crs=list(gdfs.values())[0].crs,
    )
    .set_index("place", drop=False)
    .to_crs("EPSG:3857")
)

nearest

gdf_nearest = pd.concat(
    [
        gdf_ch.loc[[c]].sjoin_nearest(gdf_ch.drop(c), distance_col="distance")
        for c in gdfs.keys()
    ]
).sort_values("distance")

gdf_nearest

place	place_left	index_right	place_right	distance
Worcester	Worcester	Malvern	Malvern	9346.6
Malvern	Malvern	Worcester	Worcester	9346.6
Ledbury	Ledbury	Newent	Newent	11135.1
Newent	Newent	Ledbury	Ledbury	11135.1
Gloucester	Gloucester	Newent	Newent	14500.8
Tewkesbury	Tewkesbury	Gloucester	Gloucester	17150.6
Hereford	Hereford	Ledbury	Ledbury	22696.7

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Rob Raymond