'full table scan in snowflake when using order by and limit by cluster key
I have Snowflake table with around 450M rows, the table contains only 2 fields, _date which is DATE type and Data which is VARIANT type. the cluster key is Date and the events are ~equally distributed around each day
| Name | LINEAR(_DATE) | rows | bytes | automation clustering |
|---|---|---|---|---|
| DATEDEVENTS | LINEAR(_DATE) | 444,087,723 | 129228379136 | ON |
I'm trying to run the following simple query
select *
from datedevents
order by _DATE
limit 200
snowflake is doing a full table scan, I can't just query first day second day, etc.' because the use case is more complicated, But why can't snowflake use his cluster key to perform this operation efficiently and don't scan all the data? I would except it to go through the first date, second date etc' until he fulfilled the 200 limit
Solution 1:[1]
Update with a great fix
Ok, there's a way to get good pruning with one query.
Setup:
create or replace transient table test_prune
cluster by (creation_date)
as
select creation_date, body
from temp.public.stackoverflow_posts
Slow query:
select *
from test_prune
order by creation_date
limit 10
-- 10s on a S-WH
Fast query:
select *
from test_prune
where creation_date in (select creation_date from test_prune order by 1 limit 10)
order by creation_date
limit 10
-- 0.2s on a S-WH
What's the difference, why is this in hint faster without needing the separate query here?
Well, I created a transient table instead of a temp table. The optimizer pruning works way better with more "permanent" tables.
Previous answer
We will need to help the optimizer here. I created a similar table for my experiments:
create or replace temp table test_prune
cluster by (creation_date)
as
select creation_date, body
from temp.public.stackoverflow_posts
order by creation_date
Now let's run your query on it:
select *
from test_prune
order by creation_date
limit 10
As you say, this needs to be optimized:
I got the best results diving that query in two:
- First create a table with the dates you are looking for:
create or replace temp table top_dates
as
select distinct creation_date
from (
select creation_date
from test_prune
order by creation_date
limit 10
); --687ms
- Then every other query can use those results:
select *
from test_prune
where creation_date in (select creation_date from top_dates)
order by creation_date
limit 10
; --308ms
With this separation we can take the original query from 7.9s to 0.5s (0.3+0.25).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |




