question

vkayanala_42513 asked

How do we archive data to reduce the size of our database?

Hi,

Our database is growing fast, and we are looking for archiving strategies in Cassandra.

Any input would help us.

Thanks,

-Varun

cassandra

smadhavan answered

@vkayanala_42513, if you want to extract data out of your Apache Cassandra cluster and archive it elsewhere, there are several options you could explore depending on your needs.

For purging existing data from the cluster, Erick has already laid out a detailed plan in his earlier response, which will help you reclaim space.


Erick Ramirez answered

The simplest way to achieve this is by setting a time-to-live (TTL) on your data. This can be done in two ways:

  1. Set a default TTL on the table using the default_time_to_live option.
  2. Set a TTL when you insert/update data with the USING TTL clause.
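As a rough sketch of the two options above, here is how they might look when issued through a Python driver session. The keyspace `metrics`, the table `readings`, and its columns are hypothetical, and `session` is assumed to be any object with an `execute(query, params)` method in the style of the DataStax Python driver.

```python
from datetime import datetime


def default_ttl_statement(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Option 1: build an ALTER TABLE statement that sets a table-wide default TTL.

    Identifiers and table options cannot be bound as parameters,
    so they are formatted into the statement text.
    """
    if ttl_seconds < 0:
        raise ValueError("TTL must be non-negative")
    return (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH default_time_to_live = {int(ttl_seconds)}"
    )


def insert_with_ttl(session, sensor_id: str, ts: datetime, reading: float,
                    ttl_seconds: int):
    """Option 2: write one row that expires ttl_seconds after the write."""
    # Hypothetical table: metrics.readings (sensor_id, ts, reading)
    query = (
        "INSERT INTO metrics.readings (sensor_id, ts, reading) "
        "VALUES (%s, %s, %s) USING TTL %s"
    )
    return session.execute(query, (sensor_id, ts, reading, ttl_seconds))
```

For example, `default_ttl_statement("metrics", "readings", 7776000)` produces a statement that expires newly written rows after 90 days, while `insert_with_ttl` overrides that default for a single write.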

Note that setting a default TTL on a table will not set an expiry on existing data (recall that SSTables are immutable). You will need to iterate through the partitions in the table and explicitly delete the ones holding data you no longer want.

It's very important that you only delete whole partitions, not just individual rows in the partition. Otherwise, reads will have to scan over row tombstones, which can degrade performance when you have wide partitions.

I also recommend that you delete partitions in small batches. In this context, "small" is a subjective quantity which can either be a few hundred partitions or a couple thousand per day. There isn't a magic number that fits all scenarios since it depends on your use case, access patterns, data model, cluster capacity, etc. You can only really figure it out on your own through testing. Cheers!
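To illustrate the idea, a minimal sketch of a throttled purge might look like the following. It assumes a hypothetical table `metrics.readings` whose partition key is the single column `sensor_id`, that the caller already has the list of partition keys to purge, and that `session` behaves like a DataStax Python driver session. "Batch" here just means a group of individual deletes, not a CQL `BATCH` statement.

```python
import time
from typing import Callable, Iterable, Iterator, List


def chunked(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_partitions(session, partition_keys: Iterable,
                      batch_size: int = 500,
                      pause_seconds: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Delete whole partitions in small groups, pausing between groups.

    The WHERE clause restricts only the partition key, so each delete
    writes a single partition tombstone rather than row tombstones.
    """
    # Hypothetical table and partition key column.
    query = "DELETE FROM metrics.readings WHERE sensor_id = %s"
    deleted = 0
    for batch in chunked(list(partition_keys), batch_size):
        for key in batch:
            session.execute(query, (key,))
            deleted += 1
        # Give the cluster some breathing room before the next group.
        sleep(pause_seconds)
    return deleted
```

The `batch_size` and `pause_seconds` defaults are placeholders, not recommendations. You would tune them through testing against your own cluster.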

2 comments

@Erick Ramirez We don't want to delete the data from the database entirely. We want to archive it somewhere like S3/Glacier so that we can retrieve it in the future if needed.



Then what you need to do is create snapshots and archive them to an object store like S3. An open-source backup tool like Cassandra Medusa will help you manage this, including archiving backups to S3.

But you will still need to delete the data to reduce the size of your database. This is unavoidable. You can't just arbitrarily pick SSTables to remove from nodes. Cheers!
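For reference, the snapshot step on its own can be sketched as below, assuming `nodetool` is on the PATH of the node where this runs. The keyspace and tag names are hypothetical. A snapshot is local to the node, so this would need to run on every node, and uploading the snapshot files to S3 is left to a tool such as Medusa.

```python
import subprocess
from typing import List


def snapshot_command(keyspace: str, tag: str) -> List[str]:
    """Build the nodetool invocation that snapshots one keyspace under a tag."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def take_snapshot(keyspace: str, tag: str) -> None:
    """Run the snapshot on the local node; raises if nodetool exits non-zero."""
    subprocess.run(snapshot_command(keyspace, tag), check=True)
```

Calling `take_snapshot("metrics", "archive-2024-01")` would create a snapshot tagged `archive-2024-01` for the hypothetical `metrics` keyspace on that node.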
