Our database is growing fast, and we are looking at archiving strategies for Cassandra. Any input would be appreciated.
The simplest way to achieve this is by setting a time-to-live (TTL) on your data. This can be done in two ways: with a `USING TTL` clause on individual writes, or by setting a `default_time_to_live` on the table so it applies to all new writes.
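For illustration, assuming a hypothetical table `ks.events` with partition key `sensor_id`, the two options look like this in CQL:

```sql
-- Option 1: per-write TTL (86400 seconds = 1 day); only this row expires
INSERT INTO ks.events (sensor_id, ts, reading)
VALUES ('s-001', toTimestamp(now()), 21.5)
USING TTL 86400;

-- Option 2: table-level default TTL, applied to all writes made after this point
ALTER TABLE ks.events WITH default_time_to_live = 86400;
```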
Note that setting a default TTL on a table will not set an expiry on existing data (recall that SSTables are immutable). You will need to iterate through all the partitions in the table and delete them.
It's very important that you only delete whole partitions, not individual rows within a partition. Otherwise, reads will have to scan over row tombstones, which can seriously degrade performance when you have wide partitions.
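Using the same hypothetical `ks.events` table (partition key `sensor_id`, clustering column `ts`), the difference between the two kinds of delete is just the WHERE clause:

```sql
-- Whole-partition delete: a single partition tombstone (preferred for purging)
DELETE FROM ks.events WHERE sensor_id = 's-001';

-- Row-level delete: a row tombstone; avoid this for bulk purges
DELETE FROM ks.events
WHERE sensor_id = 's-001' AND ts = '2023-01-01 00:00:00+0000';
```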
I also recommend that you delete partitions in small batches. In this context, "small" is subjective: it could be anywhere from a few hundred to a couple of thousand partitions per day. There isn't a magic number that fits all scenarios since it depends on your use case, access patterns, data model, cluster capacity, etc. You can only really figure it out on your own through testing. Cheers!
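A throttled purge loop could be sketched as follows. This is only a sketch, not code from the thread: `purge_partitions` and `execute_delete` are hypothetical names, and in a real script `execute_delete` would wrap something like `session.execute("DELETE FROM ks.events WHERE sensor_id = %s", (key,))` from the DataStax Python driver.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def purge_partitions(keys, batch_size=500, execute_delete=print):
    """Delete whole partitions in small batches; returns the number deleted.

    `execute_delete` is a stand-in for issuing a whole-partition
    DELETE through your driver session (hypothetical, for illustration).
    """
    deleted = 0
    for chunk in batches(keys, batch_size):
        for key in chunk:
            execute_delete(key)  # one whole-partition delete per key
        deleted += len(chunk)
        # In production you would sleep/throttle between chunks here,
        # and tune batch_size based on testing against your cluster.
    return deleted
```

The batch size is deliberately a parameter: as noted above, the right value depends on your cluster and can only be found by testing.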
For exporting the data first, one option is the DataStax Bulk Loader (DSBulk), which can unload table data to CSV or JSON files.
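As a sketch, an unload of the hypothetical `ks.events` table to local CSV files (keyspace, table, path, and host are all placeholders) would look like:

```bash
# Unload ks.events to CSV files under ./archive/events before purging
dsbulk unload -k ks -t events -url ./archive/events -h 10.0.0.1
```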
For purging existing data from the cluster, Erick has already outlined a detailed plan in his earlier answer, which will help you reclaim space.