'full table scan in snowflake when using order by and limit by cluster key

I have Snowflake table with around 450M rows, the table contains only 2 fields, _date which is DATE type and Data which is VARIANT type. the cluster key is Date and the events are ~equally distributed around each day

Name LINEAR(_DATE) rows bytes automation clustering
DATEDEVENTS LINEAR(_DATE) 444,087,723 129228379136 ON

I'm trying to run the following simple query

select *
from datedevents 
order by _DATE
limit 200

enter image description here

snowflake is doing a full table scan, I can't just query first day second day, etc.' because the use case is more complicated, But why can't snowflake use his cluster key to perform this operation efficiently and don't scan all the data? I would except it to go through the first date, second date etc' until he fulfilled the 200 limit



Solution 1:[1]

Update with a great fix

Ok, there's a way to get good pruning with one query.

Setup:

create or replace transient table test_prune
cluster by (creation_date)
as
select creation_date, body
from temp.public.stackoverflow_posts

Slow query:

select *
from test_prune
order by creation_date
limit 10
-- 10s on a S-WH

Fast query:


select *
from test_prune
where creation_date in (select creation_date from test_prune order by 1 limit 10) 
order by creation_date
limit 10

-- 0.2s on a S-WH

What's the difference, why is this in hint faster without needing the separate query here?

Well, I created a transient table instead of a temp table. The optimizer pruning works way better with more "permanent" tables.

enter image description here


Previous answer

We will need to help the optimizer here. I created a similar table for my experiments:

create or replace temp table test_prune
cluster by (creation_date)
as
select creation_date, body
from temp.public.stackoverflow_posts
order by creation_date

Now let's run your query on it:

select *
from test_prune
order by creation_date
limit 10

As you say, this needs to be optimized:

enter image description here

I got the best results diving that query in two:

  • First create a table with the dates you are looking for:
create or replace temp table top_dates
as 
select distinct creation_date
from (
    select creation_date 
    from test_prune
    order by creation_date
    limit 10
);  --687ms
  • Then every other query can use those results:
select *
from test_prune
where creation_date in (select creation_date from top_dates)
order by creation_date
limit 10
;  --308ms

With this separation we can take the original query from 7.9s to 0.5s (0.3+0.25).

enter image description here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1