fwy_187020 asked · Erick Ramirez answered

Limit on number of Cassandra tables

In a recent thread (https://community.datastax.com/questions/12565/why-is-a-truncate-causing-nodes-to-go-down.html), the following point was made about limiting the number of Cassandra tables:

We recommend a maximum of 200 tables total per cluster across all keyspaces (regardless of the number of keyspaces). Each table uses 1MB of memory to hold metadata about the tables so in your case where 1GB is allocated to the heap, 500-600MB is used just for table metadata with hardly any heap space left for other operations.

I find a recommended limit of 200 tables astoundingly small. Given that Cassandra is used for storing very high volumes of data across dozens of nodes in large commercial companies, how is such a small upper limit even feasible? It seems like a single significant use case for storing data in Cassandra for analytical, archival, or historical purposes could add this many tables to a cluster.

I understand that Cassandra is a 'query-first', wide-column database that should not consist of many normalized tables as would be the case in an RDBMS. Even so, how can an entire cluster spanning many servers and sites support a diverse set of users and applications with so few tables? Any other documentation, guidelines, or examples you can reference would help.

Or is this a case where the recommended limit is governed by memory requirements (~1 MB of heap memory per table, as mentioned above), so that it is really a memory sizing problem rather than an actual upper limit?
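For context, here is a rough back-of-envelope check of that figure (a minimal sketch using the DataStax Python driver; the contact point and the ~1 MB-per-table estimate are assumptions taken from the quote above, not measured values):

```python
# Minimal sketch: count tables across all keyspaces and estimate the heap
# used for table metadata, assuming ~1 MB per table as quoted above.
from cassandra.cluster import Cluster  # DataStax Python driver (cassandra-driver)

cluster = Cluster(["127.0.0.1"])       # assumed contact point
session = cluster.connect()

rows = session.execute("SELECT keyspace_name FROM system_schema.tables")
table_count = sum(1 for _ in rows)

est_metadata_mb = table_count * 1      # rough figure: ~1 MB of heap per table
print(f"{table_count} tables -> roughly {est_metadata_mb} MB of heap for table metadata")

cluster.shutdown()
```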


steve.lacerda answered

The reason we currently have these guidelines on the overall number of tables in the cluster is that each CQL table needs a memtable on each node, which requires memory (on-heap or off-heap, depending on your configuration). Each node's SSTables also take up some off-heap memory - not a lot, but when you have a high number of tables it all adds up.

A couple of notable things:

All tables matter, whether they are empty or contain data, and whether they are actively written to or not. This is important because DBAs sometimes leave obsolete tables hanging around "just in case", or, even worse, create multiple copies of the same keyspace (for example, for testing or migration purposes) and never get rid of them. These will sit around and still take up space; the sketch below shows one way to spot them.
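For example, a per-keyspace table count makes duplicated or forgotten keyspaces easy to spot (a minimal sketch using the DataStax Python driver; the contact point is an assumption):

```python
# Minimal sketch: list table counts per keyspace so obsolete or duplicated
# keyspaces (e.g. test/migration copies) are easy to spot.
from collections import Counter
from cassandra.cluster import Cluster  # cassandra-driver

cluster = Cluster(["127.0.0.1"])       # assumed contact point
session = cluster.connect()

counts = Counter(
    row.keyspace_name
    for row in session.execute("SELECT keyspace_name FROM system_schema.tables")
)

for keyspace, n_tables in counts.most_common():
    print(f"{keyspace}: {n_tables} tables")

cluster.shutdown()
```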

Now, with that said, it does become a memory constraint, but with hundreds of tables that memory constraint can easily turn into a performance issue. That's not to say we don't have customers with over 400 tables, but be cognizant of the possible implications when going beyond the 200-table recommendation.


Erick Ramirez answered

To clarify, at no point did I mention the word "limit". It is a recommendation and there is no hard limit on the number of tables you can create in a cluster. You could create thousands if you were so inclined.

What matters is that, operationally, best practice is to stick to a reasonable range. Our experience from assisting thousands of customers and users on thousands of production clusters is that, for optimal operation, the total number of tables in a cluster should be kept to the "low hundreds".

We've picked 200 tables as a general recommendation, but a range of 300 to 400+ will still work. It does become problematic for day-to-day operations when you have to keep track of hundreds of tables to repair and back up.

More importantly, applications take a long time to start up because the drivers request the cluster metadata (including the schema) during the initialisation/discovery phase. Retrieving the schema for 200 tables takes significantly less time than loading 500, 1,000 or 3,000. This may not be important to you, but there are lots of use cases where short startup times are crucial, most notably short-lived serverless functions where execution time costs money and shaving it wherever possible results in thousands of dollars in savings.
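To illustrate the startup cost (a minimal sketch using the DataStax Python driver; the contact point is an assumption, and disabling schema metadata is only appropriate for clients that don't rely on driver-side schema awareness):

```python
# Minimal sketch: time two connection setups, one with schema metadata
# enabled (the default) and one with it disabled, to see the startup cost
# of loading the schema for a large number of tables.
import time
from cassandra.cluster import Cluster  # cassandra-driver

def timed_connect(**cluster_kwargs):
    start = time.perf_counter()
    cluster = Cluster(["127.0.0.1"], **cluster_kwargs)  # assumed contact point
    cluster.connect()
    elapsed = time.perf_counter() - start
    cluster.shutdown()
    return elapsed

print(f"default startup:          {timed_connect():.2f}s")
# Skipping schema metadata trades driver-side schema awareness for a
# faster startup, which can matter for short-lived serverless functions.
print(f"schema metadata disabled: {timed_connect(schema_metadata_enabled=False):.2f}s")
```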

Feel free to disagree with our recommendations. We are just experts after all. ;) Cheers!
