

Erick Ramirez asked:

Why is COUNT() bad in Cassandra?

This is a frequently asked question from users new to distributed systems and high-velocity, internet-scale environments.

This post is republished from a DataStax Support blogpost by Erick Ramirez, "Counting keys? Might as well be counting stars" (October 4, 2017).


1 Answer

Erick Ramirez answered:

Introduction

One of the issues we constantly come across is users trying to perform a COUNT() in Cassandra. More often than not, the question they ask is "why does it timeout?". Why indeed?

For a DBA coming from relational databases (RDBMS), counting the records in a table is something you do regularly. The answer to the question is simple enough, but it requires understanding the fundamentals of distributed architectures and how Cassandra works. Let us begin...

A star is born

Cassandra uses a log-structured merge-tree (LSM tree) for storing data. Briefly, this means that writes (INSERT, UPDATE and DELETE statements) are written to disk at intervals in a sequential manner, like a log file. These files, known as SSTables, are immutable: they do not change once written to disk, which avoids a read-before-write and makes writes very fast.

Consider a simple table with multiple columns. In the normal course of a day, a new partition is inserted into the table with values for 2 columns. At some point, the memtable is flushed to disk and written as an SSTable. This newly inserted partition is the equivalent of one record.
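To make this concrete, here is a minimal sketch in CQL. The table and column names are hypothetical, purely for illustration:

```sql
-- Hypothetical example table
CREATE TABLE users_by_id (
    id int PRIMARY KEY,
    name text,
    email text,
    city text
);

-- Day 1: a new partition with values for 2 columns;
-- once the memtable is flushed, this is one record in one SSTable
INSERT INTO users_by_id (id, name) VALUES (1, 'Alice');
```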

Some time later, another column in the same partition is inserted into the table. This too counts as one record. After a few seconds, another column in the same partition is inserted then the memtable is flushed to disk. Since these 2 fragments for the same partition were in the memtable at the same time, they are merged together before getting written to an SSTable thus only "counting" as one record.

Later that day, one of the existing columns for the same partition is updated with a new value. Since Cassandra does not perform a read before it writes to disk, it does not know whether the column for the partition already exists in other SSTables, so the update is just inserted as another record which eventually makes it to another SSTable. (Yes, updates in Cassandra are just inserts under the hood, aka "upserts".)

In this example, 3 records exist in 3 different SSTables which are in fact fragments of the same single partition. But Cassandra does not know that these 3 records are for the same partition until it reads that partition. And really, this is what COUNT() does -- it reads all the partitions across all the SSTables (and memtables) across all the nodes in the Cassandra ring in order to come up with the result.
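The bookkeeping above can be sketched in a few lines of Python. This is not how Cassandra is implemented, just a toy model (with made-up keys and flush points) showing why 3 on-disk records can still be only 1 partition:

```python
# Toy model of memtable flushes producing SSTable fragments of one partition.
# Each write is (partition_key, column, value); names are illustrative only.
writes = [
    ("user:1", "name", "Alice"),    # new partition, flushed alone
    ("user:1", "email", "a@x.io"),  # these two share a memtable...
    ("user:1", "city", "Paris"),    # ...so they merge into one SSTable record
    ("user:1", "city", "Lyon"),     # "update" is just another insert
]

# Flushes happen after write 1 and write 3, producing three SSTables.
flush_points = [1, 3, 4]
sstables, start = [], 0
for end in flush_points:
    memtable = {}
    for key, col, val in writes[start:end]:
        # same-partition fragments in one memtable merge before flushing
        memtable.setdefault(key, {})[col] = val
    sstables.append(memtable)
    start = end

records_on_disk = sum(len(t) for t in sstables)
distinct_partitions = len({k for t in sstables for k in t})
print(records_on_disk, distinct_partitions)  # 3 1
```

Only by reading every fragment back (the merge on read) does the count collapse from 3 records to 1 partition, which is exactly the work COUNT() has to do across every SSTable on every node.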

On a small table in a small cluster, sure it's not a big deal running a count query. But imagine if you had hundreds of SSTables worth over 500GB per node on a cluster with dozens [if not hundreds] of nodes. See how such an innocent query becomes so expensive? It has to perform a full table scan across your cluster which is why your query will timeout before it has the chance to complete.

A galaxy full of stars

To complicate matters, consider that while the count is in progress, partitions are constantly being created and updated -- this is big data after all and you're using Cassandra because you have a scale problem. Even if you were lucky enough to get a result, is that count valid? You might as well be counting stars. Even given enough time to count half the stars in the night sky, chances are your count would be obsolete by the time you finished, because the universe is not static -- new stars are born and old stars die all the time. Similarly, your table is not static -- new partitions are created, existing partitions are updated and some are even deleted.

"Can I use a counter?", you ask. Maybe. If your use case is such that you only write a partition once and never overwrite it again, ever. If your use case involves overwrites then you have no way of working out whether it is the first write or not so your counter becomes useless. You might come up with some sort of algorithm and store a count somehow but in all likelihood, your "counter" solution won't scale since it will probably involve (a) a read-before-write, (b) an additional write for every operation, or (c) both.

If you must

You might as well get an estimate from nodetool tablestats but that's exactly what you would get -- an estimate. This is because it suffers from the same problem. When Cassandra flushes the memtable to disk, it knows how many keys are in that SSTable and updates the table stats accordingly [1]. While we're here, be aware that deletes are also inserts under-the-hood, i.e. a tombstone is inserted into the table so it's counted in tablestats. It only gets decremented after the tombstones get compacted out after gc_grace_seconds.
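For example (keyspace, table and the number shown are illustrative; the field name is what tablestats reports):

```shell
# Run on any node in the cluster
$ nodetool tablestats my_keyspace.users_by_id
...
        Number of keys (estimate): 104912
...
```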

Finally if you really must do a count, do it with DSE Analytics or DSE Search but those are for another blog post. I'll get around to it sometime after I finish counting stars, maybe...

[1] I've oversimplified what happens with the table stats updates to make it easier to digest for the intended audience. There are in fact algorithms used, such as HyperLogLog, which improve the accuracy of the aggregated values.


Update

2 years after I first published this blogpost, along came DataStax Bulk Loader (aka DSBulk). It is a tool for efficiently loading and unloading data from Apache Cassandra, though that is not the extent of its abilities.

DSBulk is the recommended tool for loading or unloading data in CSV or JSON format, performing up to 4x faster than the cqlsh COPY command. It also has a nice feature for counting data in large tables in a distributed manner.

And yes, DataStax made it freely available to open-source Apache Cassandra users. For details, see Counting data in tables with DSBulk. Cheers!
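A minimal count invocation looks like this (keyspace and table names are hypothetical; connection options omitted):

```shell
$ dsbulk count -k my_keyspace -t users_by_id
```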
