PostgreSQL max() query on big indexed table has slow performance

I have a table in my PostgreSQL database called consumer_actions. It contains all the actions performed by consumers registered in my app. At the moment, this table has ~500 million records. What I'm trying to do is get the maximum id, filtered by the system the action came from.

The definition of the table is:

CREATE TABLE public.consumer_actions (
    id int4 NOT NULL,
    system_id int4 NOT NULL,
    consumer_id int4 NOT NULL,
    action_id int4 NOT NULL,
    payload_json jsonb NULL,
    external_system_date timestamptz NULL,
    local_system_date timestamptz NULL,
    CONSTRAINT consumer_actions_pkey PRIMARY KEY (id, system_id)
);
CREATE INDEX consumer_actions_ext_date ON public.consumer_actions USING btree (external_system_date);
CREATE INDEX consumer_actions_system_consumer_id ON public.consumer_actions USING btree (system_id, consumer_id);

Running select max(id) from consumer_actions where system_id = 1 takes less than one second, but the same query for system_id = 2, select max(id) from consumer_actions where system_id = 2, takes more than an hour, even though it goes through the same index (consumer_actions_pkey, per the plans below).

I have also checked the query plans; they look similar for both queries. I also reran VACUUM ANALYZE on the table and a REINDEX. Neither helped. Any idea what I can do to improve the second query's time?

Here are the query plans for both queries, and the current size of the table:

explain analyze 
select max(id) from consumer_actions where system_id = 1;
Result  (cost=1.49..1.50 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.57..1.49 rows=1 width=4) (actual time=0.057..0.057 rows=1 loops=1)
          ->  Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca  (cost=0.57..524024735.49 rows=572451344 width=4) (actual time=0.055..0.055 rows=1 loops=1)
                Index Cond: ((id IS NOT NULL) AND (system_id = 1))
                Heap Fetches: 1
Planning Time: 0.173 ms
Execution Time: 0.092 ms


explain analyze
select max(id) from consumer_actions where system_id = 2;
Result  (cost=6.46..6.47 rows=1 width=4) (actual time=7099484.855..7099484.858 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.57..6.46 rows=1 width=4) (actual time=7099484.839..7099484.841 rows=1 loops=1)
           ->  Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca  (cost=0.57..20205843.58 rows=3436129 width=4) (actual time=7099484.833..7099484.834 rows=1 loops=1)
                 Index Cond: ((id IS NOT NULL) AND (system_id = 2))
                 Heap Fetches: 1
 Planning Time: 3.078 ms
 Execution Time: 7099484.992 ms
(8 rows)

select count(*) from consumer_actions; --result is 577408504


Solution 1:[1]

I think you should create an index like this one, so the planner can descend the btree directly to the highest id for a given system_id instead of walking backward through the whole primary-key index:

CREATE INDEX consumer_actions_system_id_id ON public.consumer_actions USING btree (system_id, id);
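On a table of this size, a plain CREATE INDEX holds a lock that blocks writes for the entire build. PostgreSQL's CREATE INDEX CONCURRENTLY avoids that at the cost of a slower build (the index name here is just a suggestion):

```sql
-- Build the index without blocking inserts/updates on the ~500M-row
-- table; the build takes longer, but the table stays writable.
CREATE INDEX CONCURRENTLY consumer_actions_system_id_id
    ON public.consumer_actions USING btree (system_id, id);
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, and if the build fails it leaves an invalid index behind that must be dropped manually.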

Solution 2:[2]

Instead of using an aggregate function like max(), which potentially has to scan and aggregate a large number of rows on a table like yours, you could get the same result with a query designed to return the fewest rows possible:

SELECT id FROM consumer_actions WHERE system_id = ? ORDER BY id DESC LIMIT 1;

This should still benefit significantly from the existing indexes.
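If you ever need the highest id for every system at once, the same LIMIT 1 trick can be applied per system with a LATERAL join (a sketch; it assumes an index on (system_id, id) such as the one suggested in Solution 1, and the DISTINCT subquery itself may be slow on a table this size unless system ids live in a small lookup table):

```sql
-- Highest id per system: each LATERAL lookup is a single backward
-- descent of a (system_id, id) index, like the LIMIT 1 query above.
SELECT s.system_id, m.id AS max_id
FROM (SELECT DISTINCT system_id FROM consumer_actions) AS s
CROSS JOIN LATERAL (
    SELECT ca.id
    FROM consumer_actions AS ca
    WHERE ca.system_id = s.system_id
    ORDER BY ca.id DESC
    LIMIT 1
) AS m;
```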

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Sources

Solution 1: Philippe
Solution 2: DigitalHermit