'Slow spatial join on a single table with PostGIS
My goal is to calculate if a building have at least one shared wall with a building of another estate. I used a PostGIS query to do so but it is really slow. I have tweaked this for two weeks with some success but no breakthrough.
I have two tables:
Estate (a piece of land)
CREATE TABLE IF NOT EXISTS public.front_estate
(
id integer NOT NULL DEFAULT nextval('front_estate_id_seq'::regclass),
perimeter geometry(Polygon,4326),
CONSTRAINT front_estate_pkey PRIMARY KEY (id),
)
CREATE INDEX IF NOT EXISTS front_estate_perimeter_idx
ON public.front_estate USING spgist
(perimeter);
Building
CREATE TABLE IF NOT EXISTS public.front_building
(
id integer NOT NULL DEFAULT nextval('front_building_id_seq'::regclass),
type character varying(255) COLLATE pg_catalog."default",
footprint integer,
polygon geometry(Polygon,4326),
shared_wall integer,
CONSTRAINT front_building_pkey PRIMARY KEY (id)
)
CREATE INDEX IF NOT EXISTS front_building_polygon_idx
ON public.front_building USING spgist
(polygon)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_building_type_124fcf82
ON public.front_building USING btree
(type COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_building_type_124fcf82_like
ON public.front_building USING btree
(type COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST)
TABLESPACE pg_default;
The m2m relation:
CREATE TABLE IF NOT EXISTS public.front_estate_buildings
(
id integer NOT NULL DEFAULT nextval('front_estate_buildings_id_seq'::regclass),
estate_id integer NOT NULL,
building_id integer NOT NULL,
CONSTRAINT front_estate_buildings_pkey PRIMARY KEY (id),
CONSTRAINT front_estate_buildings_estate_id_building_id_863b3358_uniq UNIQUE (estate_id, building_id),
CONSTRAINT front_estate_buildin_building_id_fc5c4235_fk_front_bui FOREIGN KEY (building_id)
REFERENCES public.front_building (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT front_estate_buildings_estate_id_2c28ec2a_fk_front_estate_id FOREIGN KEY (estate_id)
REFERENCES public.front_estate (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED
)
CREATE INDEX IF NOT EXISTS front_estate_buildings_building_id_fc5c4235
ON public.front_estate_buildings USING btree
(building_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_estate_buildings_estate_id_2c28ec2a
ON public.front_estate_buildings USING btree
(estate_id ASC NULLS LAST)
TABLESPACE pg_default;
To have a shared wall a building must touch another building which is not the same estate.
The final data set will have around 100 millions rows. Right now my developpement building table has 2 millions rows.
Here is the query I used to get all relations between buildings and estates:
SELECT b.id as b_id, rel.estate_id as e_id, swb.id as swb_id, sw_rel.estate_id as swe_id
FROM front_building b
JOIN front_building swb ON swb.id < b.id AND ST_Intersects(b.polygon, swb.polygon)
JOIN front_estate_buildings rel ON rel.building_id = b.id
JOIN front_estate_buildings sw_rel ON sw_rel.building_id = swb.id
ORDER BY b.id ASC;
Here is the EXPLAIN ANALYZE given by pgAdmin:
1. Limit (rows=500 loops=1)
2. Nested Loop Inner Join (rows=500 loops=1)
3. Nested Loop Inner Join (rows=695 loops=1)
4. Nested Loop Inner Join (rows=2985 loops=1)
5. Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as rel (rows=2985 loops=1) 2985 1
6. Memoize (rows=1 loops=2985)
Buckets: Batches: Memory Usage: 715 kB
7. Index Scan using front_building_pkey on front_building as b (rows=1 loops=2751)
Index Cond: (id = rel.building_id)
8. Index Scan using front_building_polygon_idx on front_building as swb (rows=0 loops=2985)
Filter: ((id < b.id) AND st_intersects(b.polygon, polygon))
Index Cond: (polygon && b.polygon)
Rows Removed by Filter: 2
9. Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as sw_rel (rows=1 loops=695)
Index Cond: (building_id = swb.id)
On my dev machine (MBP M1 - 16GB RAM) it completes in 20 minutes for 2M rows which is not good but is ok. On my production machine (Linode - 8 CPU Cores - 16 GB RAM) the CPU goes throufh the roof (continuous 250% capacity) and the query seems to never end.
Do you have any clue on how to proceed ? Change the query ? The db struct ? Use multiprocessing ?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
