Slow spatial join on a single table with PostGIS

My goal is to determine whether a building has at least one shared wall with a building belonging to another estate. I use a PostGIS query to do so, but it is really slow. I have tweaked it for two weeks with some success but no breakthrough.

I have two tables:

Estate (a piece of land)

CREATE TABLE IF NOT EXISTS public.front_estate
(
    id integer NOT NULL DEFAULT nextval('front_estate_id_seq'::regclass),
    perimeter geometry(Polygon,4326),
    CONSTRAINT front_estate_pkey PRIMARY KEY (id)
)

CREATE INDEX IF NOT EXISTS front_estate_perimeter_idx
    ON public.front_estate USING spgist
    (perimeter);

Building

CREATE TABLE IF NOT EXISTS public.front_building
(
    id integer NOT NULL DEFAULT nextval('front_building_id_seq'::regclass),
    type character varying(255) COLLATE pg_catalog."default",
    footprint integer,
    polygon geometry(Polygon,4326),
    shared_wall integer,
    CONSTRAINT front_building_pkey PRIMARY KEY (id)
)

CREATE INDEX IF NOT EXISTS front_building_polygon_idx
    ON public.front_building USING spgist
    (polygon)
    TABLESPACE pg_default;

CREATE INDEX IF NOT EXISTS front_building_type_124fcf82
    ON public.front_building USING btree
    (type COLLATE pg_catalog."default" ASC NULLS LAST)
    TABLESPACE pg_default;

CREATE INDEX IF NOT EXISTS front_building_type_124fcf82_like
    ON public.front_building USING btree
    (type COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST)
    TABLESPACE pg_default;

The m2m relation:

CREATE TABLE IF NOT EXISTS public.front_estate_buildings
(
    id integer NOT NULL DEFAULT nextval('front_estate_buildings_id_seq'::regclass),
    estate_id integer NOT NULL,
    building_id integer NOT NULL,
    CONSTRAINT front_estate_buildings_pkey PRIMARY KEY (id),
    CONSTRAINT front_estate_buildings_estate_id_building_id_863b3358_uniq UNIQUE (estate_id, building_id),
    CONSTRAINT front_estate_buildin_building_id_fc5c4235_fk_front_bui FOREIGN KEY (building_id)
        REFERENCES public.front_building (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED,
    CONSTRAINT front_estate_buildings_estate_id_2c28ec2a_fk_front_estate_id FOREIGN KEY (estate_id)
        REFERENCES public.front_estate (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED
)

CREATE INDEX IF NOT EXISTS front_estate_buildings_building_id_fc5c4235
    ON public.front_estate_buildings USING btree
    (building_id ASC NULLS LAST)
    TABLESPACE pg_default;

CREATE INDEX IF NOT EXISTS front_estate_buildings_estate_id_2c28ec2a
    ON public.front_estate_buildings USING btree
    (estate_id ASC NULLS LAST)
    TABLESPACE pg_default;

To have a shared wall, a building must touch another building that belongs to a different estate.
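To make the condition concrete, here is a sketch of how I understand it, written as an `UPDATE` that marks matching buildings (untested at scale; whether `ST_Intersects` or `ST_Touches` better captures "shared wall" is an open question, and the estate-inequality check is how I interpret "another estate"):

```sql
-- Sketch: flag a building when some building on a *different* estate
-- touches it. Column and table names are the ones from the schema above.
UPDATE front_building b
SET shared_wall = 1
WHERE EXISTS (
    SELECT 1
    FROM front_building swb
    JOIN front_estate_buildings rel    ON rel.building_id    = b.id
    JOIN front_estate_buildings sw_rel ON sw_rel.building_id = swb.id
    WHERE swb.id <> b.id
      AND sw_rel.estate_id <> rel.estate_id
      AND ST_Intersects(b.polygon, swb.polygon)
);
```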

The final data set will have around 100 million rows. Right now my development building table has 2 million rows.

Here is the query I used to get all relations between buildings and estates:

SELECT b.id as b_id, rel.estate_id as e_id, swb.id as swb_id, sw_rel.estate_id as swe_id 
FROM front_building b
JOIN front_building swb ON swb.id < b.id AND ST_Intersects(b.polygon, swb.polygon)
JOIN front_estate_buildings rel ON rel.building_id = b.id
JOIN front_estate_buildings sw_rel ON sw_rel.building_id = swb.id
ORDER BY b.id ASC;

Here is the EXPLAIN ANALYZE given by pgAdmin:

1.  Limit (rows=500 loops=1)
    2.  Nested Loop Inner Join (rows=500 loops=1)
        3.  Nested Loop Inner Join (rows=695 loops=1)
            4.  Nested Loop Inner Join (rows=2985 loops=1)
                5.  Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as rel (rows=2985 loops=1)
                6.  Memoize (rows=1 loops=2985)
                    Buckets:  Batches:  Memory Usage: 715 kB
                    7.  Index Scan using front_building_pkey on front_building as b (rows=1 loops=2751)
                    Index Cond: (id = rel.building_id)
            8.  Index Scan using front_building_polygon_idx on front_building as swb (rows=0 loops=2985)
            Filter: ((id < b.id) AND st_intersects(b.polygon, polygon))
            Index Cond: (polygon && b.polygon)
            Rows Removed by Filter: 2
        9.  Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as sw_rel (rows=1 loops=695)
        Index Cond: (building_id = swb.id)

On my dev machine (MBP M1, 16 GB RAM) it completes in 20 minutes for 2M rows, which is not good but is OK. On my production machine (Linode, 8 CPU cores, 16 GB RAM) the CPU goes through the roof (a continuous 250% load) and the query seems to never end.

Do you have any clue on how to proceed? Change the query? The DB structure? Use multiprocessing?
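Regarding multiprocessing, the only idea I have so far is to split the work into `id` ranges so that several connections could each process one batch in parallel. A sketch of one batch (the 100000 bound is an arbitrary batch size I picked for illustration):

```sql
-- Same join as above, restricted to one id range; each worker process
-- would run this with different BETWEEN bounds on its own connection.
SELECT b.id AS b_id, rel.estate_id AS e_id,
       swb.id AS swb_id, sw_rel.estate_id AS swe_id
FROM front_building b
JOIN front_building swb
  ON swb.id < b.id AND ST_Intersects(b.polygon, swb.polygon)
JOIN front_estate_buildings rel    ON rel.building_id    = b.id
JOIN front_estate_buildings sw_rel ON sw_rel.building_id = swb.id
WHERE b.id BETWEEN 1 AND 100000
ORDER BY b.id;
```

I don't know whether this would actually help, since each batch still probes the spatial index for every candidate pair.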



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
