PostgreSQL internals: Why is COUNT slow?

Why does the time complexity of COUNT in PostgreSQL (v9.6) appear to be as bad as it could get? The following shows that, even with my machine under no resource pressure, roughly 3x more records takes roughly 9x as long,

i.e. O(n²).

   (397.0ms)   SELECT COUNT(*) FROM "cases"
=> 850219

   (3289.1ms)  SELECT COUNT(*) FROM "cases"
=> 2577661

I would have expected it to take advantage of one of the table's indexes, but I guess those would only help estimate the count rather than determine it exactly.
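For comparison, an approximate count is available directly from the planner's statistics. A minimal sketch (reltuples is only as fresh as the last VACUUM or ANALYZE on the table):

    -- Cheap, approximate row count from the planner's statistics.
    SELECT reltuples::bigint AS estimated_rows
    FROM pg_class
    WHERE relname = 'cases';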

I imagine it's hard to improve, but I'd be interested to find out why, and whether it's tunable. In particular, for tables that are rarely written to, I'm surprised the count can't simply be cached. What in the database's internal structure makes it hard to keep a last_edited timestamp in a table's metadata?



Solution 1:[1]

The reason it looks worse than linear is probably hint-bit writing on the first touch of the data (in PostgreSQL, read queries can perform writes), or it could simply be down to caching effects.

Counting data without a suitable index is normally linear in cost. Try a properly repeated benchmark with warm-up runs.
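For example, a rough warm-up benchmark in psql could look like the following (\timing is a psql meta-command; the repeated runs let hint-bit writes and cache warming settle before you measure):

    \timing on
    SELECT COUNT(*) FROM cases;   -- first run: may pay for hint bits and a cold cache
    SELECT COUNT(*) FROM cases;   -- later runs: steady-state cost
    SELECT COUNT(*) FROM cases;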

PostgreSQL doesn't try to keep accurate counts on tables because it doesn't care that much. It keeps approximate statistics and figures that's good enough; if you want an exact count, you perform one at the time you need it. A count that's sometimes fast and sometimes slow would likely just be more confusing.

It can't keep a running total, because the table has a different number of rows in it for different sessions at the same time. Session 1 sees a table with 100 rows, but session 2 has inserted some so it has 205, and session 3 deleted some so it has 50. Session 6 started after session 5 committed a delete, so it sees fewer rows than session 1 does at the same instant.
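As a sketch of that, two psql sessions can see different counts at the same instant (the INSERT assumes "cases" has defaults for all required columns; the counts shown are illustrative):

    -- Session A
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT COUNT(*) FROM cases;        -- say 100; A's snapshot is fixed here

    -- Session B (autocommit)
    INSERT INTO cases DEFAULT VALUES;  -- committed immediately

    -- Session A, same transaction
    SELECT COUNT(*) FROM cases;        -- still 100: B's row isn't in A's snapshot
    COMMIT;
    SELECT COUNT(*) FROM cases;        -- now 101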

It can't keep a "committed counter" + per-session counters, because rollbacks can occur in unpredictable orders, and what's "committed" varies depending on the session's snapshot. It'd have to do MVCC on a sort of side-table of table row counts, where it added a row with suitable xmin for each commit, and pruned them as transaction snapshots aged out. This would cost a fair bit in performance, for little benefit for most workloads.

Above all else, though, nobody's wanted it enough to add support for it, either by funding the work or by doing the coding directly. It'd be a big job for little benefit.

What has been added is support for counting via index-only scans, which greatly speeds up counts on the PRIMARY KEY or other indexed columns. These rely on VACUUM and the visibility map to skip heap visibility re-checks on pages that haven't been written to recently.
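As a sketch of what that looks like (cases_pkey is the conventional default primary-key index name; whether the planner actually chooses an index-only scan depends on table size, statistics, and how much of the visibility map is set):

    VACUUM cases;  -- sets the visibility map so index-only scans can skip the heap
    EXPLAIN (ANALYZE) SELECT COUNT(*) FROM cases;
    -- Look for "Index Only Scan using cases_pkey on cases" and a low
    -- "Heap Fetches" figure in the plan output.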

Solution 2:[2]

Another possible cause is a cache (shared buffers) that is too small to hold all of the table's or index's pages in memory, forcing PostgreSQL to read them from disk instead.
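One way to sanity-check this is to compare buffer hits against disk reads for the table and look at the shared_buffers setting; a rough sketch (pg_statio_user_tables is cumulative since the last statistics reset):

    SELECT heap_blks_hit,
           heap_blks_read,
           round(heap_blks_hit::numeric
                 / NULLIF(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
    FROM pg_statio_user_tables
    WHERE relname = 'cases';

    SHOW shared_buffers;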

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Craig Ringer
Solution 2: SQLpro