Efficiency of Querying 10 Billion Rows (with High Cardinality) in ScyllaDB

Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:

PK1 PK2 PK3 V1 V2

Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
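
For concreteness, a sketch of the schema I have in mind looks roughly like this (the table name and column types are just illustrative placeholders):

CREATE TABLE my_table (
    pk1 bigint,
    pk2 bigint,
    pk3 bigint,
    v1 text,
    v2 text,
    PRIMARY KEY ((pk1, pk2, pk3))   -- all three columns together form the composite partition key
);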

Scylla requires you to specify all columns of the partition key in the WHERE clause.

If you execute a query that specifies only some of those columns, you get a warning, because the query requires a full table scan:

SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y

In the above query, we specify only 2 of the 3 partition-key columns. Suppose the query matches 1 billion out of the 10 billion rows - what is a good mental model for the cost/performance of this query?
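
As far as I understand, Scylla will in fact refuse to run such a partial-partition-key query unless ALLOW FILTERING is appended, which is what makes the full-scan cost explicit. Against the sketch schema above, and with placeholder values, it would look like:

SELECT v1, v2 FROM my_table WHERE pk1 = 17 AND pk2 = 42 ALLOW FILTERING;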

My assumption is that the cost is high: it is roughly equivalent to executing ten billion separate queries against the data set, since 1) there is no logical association between the rows in the way they are stored on disk, as each row has a different partition key (high cardinality), and 2) in order for Scylla to determine which rows match the query, it has to scan all 10 billion rows (even though the result set contains only 1 billion of them).

Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB) and the data resides on 100 servers, the estimated time to process this query can be calculated as: 100K * 100 = 10 million queries per second across the cluster, and 10 billion / 10 million = 1,000 seconds. So it would take the cluster roughly 1,000 seconds to process the query (consuming all of the cluster's resources).

Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?

Thanks



Solution 1:[1]

As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps partitions hashed by the full partition key - in this case all three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions and check, for each of them, whether its partition key matches the request.

However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. A lot of work goes into a random-access read: Scylla needs to reach a coordinator, which then sends the request to replicas; each replica needs to find the specific position in its on-disk data files (often multiple files) and often needs to over-read from the disk (as disk and compression alignments require); and so on. Compare this to a scan, which can read long contiguous swathes of data sorted by token (partition-key hash) from disk and can return many rows fairly quickly with fewer I/O operations and less CPU work.
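
To make the contrast concrete, here is a rough sketch against the schema from the question (illustrative only). The first statement is a random-access read: the full partition key is known, so the coordinator routes it straight to the owning replicas, which seek to a single partition. The second is a full scan, where rows stream back in token (hash) order and each replica can read long contiguous runs from disk while the driver pages through the results:

-- random-access read of a single partition
SELECT v1, v2 FROM my_table WHERE pk1 = ? AND pk2 = ? AND pk3 = ?;

-- full scan in token order, paged by the driver
SELECT pk1, pk2, pk3, v1, v2 FROM my_table;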

So if your example setup can do 100,000 random-access reads per second per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know the exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's about 12 million rows per second on each node, instead of your estimate of 100,000. So your example scan could potentially be over in just 8.3 seconds, instead of the 1,000 seconds you calculated.

Finally, please don't forget (this is also mentioned in the aforementioned blog post) that if you do a large scan you should explicitly parallelize it, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more or less unavoidable. Second, a scan returns partitions in token order, and consecutive tokens (as I explained above) sit contiguously on individual replicas - which is great for peak throughput, but it also means that only one node (or even one CPU) would be active at any given moment of a non-parallel scan. So it's important to split the scan into pieces and run all of them in parallel, as sketched below. We also had a blog post about the importance of parallel scans, and how to do them: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
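
A minimal sketch of that pattern, with made-up sub-range boundaries (a real client computes N disjoint ranges covering the whole ring, -2^63 to 2^63-1, and runs one such query per range from parallel workers):

-- worker 1
SELECT pk1, pk2, pk3, v1, v2 FROM my_table
 WHERE token(pk1, pk2, pk3) >= -9223372036854775808
   AND token(pk1, pk2, pk3) < -9151314442816847872;

-- worker 2
SELECT pk1, pk2, pk3, v1, v2 FROM my_table
 WHERE token(pk1, pk2, pk3) >= -9151314442816847872
   AND token(pk1, pk2, pk3) < -9079256848778919936;

-- ... and so on, one sub-range per worker, until the full token range is covered.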

Solution 2:[2]

Another option is to move one of the partition-key columns into the clustering key. That way, if you know the first two keys, you can locate the partition and just search within it.
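
A rough sketch of what that could look like (names mirror the question, types are placeholders): pk3 moves out of the partition key and becomes a clustering column, so pk1 and pk2 locate a single partition and pk3 only orders the rows inside it:

CREATE TABLE my_table_v2 (
    pk1 bigint,
    pk2 bigint,
    pk3 bigint,
    v1 text,
    v2 text,
    PRIMARY KEY ((pk1, pk2), pk3)   -- (pk1, pk2) is the partition key, pk3 is a clustering column
);

-- the two known keys now select exactly one partition, no full scan needed
SELECT v1, v2 FROM my_table_v2 WHERE pk1 = ? AND pk2 = ?;

Whether this is viable depends on how many rows share each (pk1, pk2) pair - if that number is huge, the resulting partitions become very large, which brings its own problems.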

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Nadav Har'El
[2] Solution 2: dor laor