How does Apache Cassandra perform on a single read of millions of records?

Much has been written about how Cassandra's redundancy provides good performance for thousands of incoming requests from different locations, but I haven't found anything on the throughput of a single big request. That's what this question is about.

I am assessing Apache Cassandra's potential as a database solution to the following problem:

The client would be a single-server application with exclusive access to the Cassandra database, co-located in the same datacentre. The Cassandra instance might be a few nodes, but likely not more than 5.

When a certain feature runs on the application (triggered occasionally by a human) it will populate Cassandra with up to 5M records representing short arrays of float data, as well as delete such records. The records will not be updated and we never need to access individual elements of an array. The arrays can be of different lengths, but will typically have around 100 elements, and each row might represent 0-20 arrays.

For example:

id   array1                  array2
123  [1.0, 2.5, ..., 10.8]   [0.0, 0.5, ..., 1.0]

Bonus question: Should I use a list of doubles to represent this, or should I serialize the arrays to JSON?
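To make the trade-off in the bonus question concrete, here is a minimal sketch comparing the byte size of one array encoded as JSON text against a fixed-width binary packing. The packed encoding is not one of the two options asked about; it is included only as a size baseline, and the helper names are hypothetical. It says nothing about how Cassandra itself stores a list column.

```python
import json
import struct


def encode_json(values):
    """Serialize a list of floats as UTF-8 JSON text."""
    return json.dumps(values).encode("utf-8")


def encode_packed(values):
    """Baseline for comparison: little-endian 8-byte doubles, no framing."""
    return struct.pack(f"<{len(values)}d", *values)


def decode_packed(blob):
    """Inverse of encode_packed; assumes the blob length is a multiple of 8."""
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))


if __name__ == "__main__":
    array = [i / 3 for i in range(100)]  # illustrative 100-element array
    print("json bytes:  ", len(encode_json(array)))
    print("packed bytes:", len(encode_packed(array)))
```

The packed form is always 8 bytes per value, while the JSON size depends on how many digits each value prints with, so which one is smaller depends on the data.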

At some point the user requests a report and the server should read all 5M records, interpret the arrays, do some aggregation, and plot some data on the screen. Might the read operation take <1s, <10s, <100s? How can I estimate the throughput in this case, assuming it is the bottleneck?
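As a rough starting point for that estimate, the raw payload size can be computed from the figures above and divided by an assumed sustained read throughput. The sketch below does only that arithmetic. The 10 arrays per row, 8 bytes per element, and 100 MB/s are placeholder assumptions, not measured Cassandra numbers, and the calculation ignores storage overhead, compression, and paging.

```python
def estimate_read(rows, arrays_per_row, elements_per_array,
                  bytes_per_element, throughput_bytes_per_s):
    """Return (total_bytes, seconds) for reading the whole data set once.

    All inputs are assumptions supplied by the caller; nothing here is
    measured from a real cluster.
    """
    total_bytes = rows * arrays_per_row * elements_per_array * bytes_per_element
    seconds = total_bytes / throughput_bytes_per_s
    return total_bytes, seconds


if __name__ == "__main__":
    total, secs = estimate_read(
        rows=5_000_000,              # from the problem statement
        arrays_per_row=10,           # assumption: midpoint of the 0-20 range
        elements_per_array=100,      # from the problem statement
        bytes_per_element=8,         # assumption: values stored as doubles
        throughput_bytes_per_s=100e6,  # assumption: 100 MB/s sustained
    )
    print(f"{total / 1e9:.0f} GB, about {secs:.0f} s")
```

With those placeholder inputs the payload is 40 GB and the estimate is 400 s; replacing the throughput figure with one measured on the actual cluster is what would make the number meaningful.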



Solution 1:[1]

Let me start with your second use case. Because your data is distributed across the nodes, a broad range query that is not narrowed down to a partition will be slow in Cassandra.

Cassandra is well suited to queries and lookups where you know the partition key.

  • Even with only 5M records, assuming they are scattered across 5 different nodes, Cassandra has to go through all the nodes and aggregate the results for your reporting use case. Eventually the query times out.

  • This specific use case is not viable in Cassandra as a single query,
    but if you aggregate in your service and make multiple calls, one per
    partition or bucket, it will perform very fast.

    Generally, the access pattern is what matters: design for reads. The data can be stored in any format, but reading it wisely is what matters to Cassandra. That answers your second part. Thank you.
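The "multiple calls per partition or bucket, aggregate in your service" idea can be sketched as follows. This is only an illustration under stated assumptions: the table is an in-memory stand-in, `fetch_bucket` is a hypothetical placeholder for a single-partition read, and the bucket count of 8 and the modulo bucketing rule are arbitrary choices, not something prescribed by Cassandra.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 8  # assumption: number of buckets the records are spread over


def bucket_for(record_id):
    """Map a record id to a bucket, which plays the role of the partition key."""
    return record_id % NUM_BUCKETS


def build_fake_table(num_records):
    """In-memory stand-in for the table: bucket -> list of (id, array)."""
    table = {bucket: [] for bucket in range(NUM_BUCKETS)}
    for record_id in range(num_records):
        table[bucket_for(record_id)].append((record_id, [float(record_id), 1.0]))
    return table


def fetch_bucket(table, bucket):
    """Hypothetical single-partition read. A real client would issue one
    query restricted to this bucket's partition key instead."""
    return table[bucket]


def aggregate_all(table):
    """Read every bucket concurrently and sum all array elements client-side."""
    def partial_sum(bucket):
        return sum(sum(array) for _, array in fetch_bucket(table, bucket))

    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        return sum(pool.map(partial_sum, range(NUM_BUCKETS)))


if __name__ == "__main__":
    print(aggregate_all(build_fake_table(100)))
```

Each worker touches exactly one bucket, so no single request has to scan data across all nodes; the service combines the partial results itself.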

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Imran