'Clustering using lower case

I tried to cluster events table by lower(event_name), but when I checked the number of partitions using for the query after clustering by lower case of the event name it seems like it doesn't work - it uses the same numbers of partitions like it was before the clustering. Any ideas why and what can I do?



Solution 1:[1]

It's very hard to help you, if you don't give us your queries and how you created the clustered table. Nevertheless, let me try these queries to show off how clustering could help you (or not):

Unclustered table: 64 of 64 partitions scanned.

select *
from wikimedia.public.wikidata_modern_people
where name like 'Alex%';

Now let me cluster that table by name:

create or replace temp table people_name_cluster
cluster by (name)
as
select *
from wikimedia.public.wikidata_modern_people

Now this query scans only 1 of 16 partitions:

select *
from people_name_cluster
where name like 'Alex%';

That was expected, but what if we cluster by lower(name):

create or replace temp table people_name_cluster_lower
cluster by (lower(name))
as
select *
from wikimedia.public.wikidata_modern_people

Then this query scans 2 of 16 partitions:

select *
from people_name_cluster_lower
where name like 'Alex%';

But when I try to play with lower(name) or regular expressions, nothing gets pruned. The following queries scan 16 of 16 partitions:

select *
from people_name_cluster_lower
where lower(name) like 'alex%';

select *
from people_name_cluster_lower
where name regexp('^Alex.*');

select *
from people_name_cluster_lower
where regexp_like(name, 'Alex.*');

And then this query scans 2 of 16:

select *
from people_name_cluster_lower
where regexp_like(name, 'Alex.*')
and name between 'A' and 'B';

Lesson learned through these experiments: Clustering by lower(string) might not make things much better. Trying to prune by regular expressions might not be optimized either.

Solution 2:[2]

Adding a clustering key, does not make the data instantly clustered. It can takes hours for a table to be reordered.

And if you want to test is the reorder will have an impact the fastest method is to create a new (temporary) table with an ORDER BY cluster_key_terms and test if the order improves the query pruning.

Given the credits it can cost to incrementally reshape/cluster the data is more than the compute to do it in a single operation, and the "rewrite" will use more disk. the truely cheapest way to test, is do it in one go.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Felipe Hoffa
Solution 2 Simeon Pilgrim