Hi,
Our database is growing fast and big, we are looking for archiving strategies in Cassandra.
Any inputs will help us.
Thanks,
-Varun
The simplest way to achieve this is by setting a time-to-live (TTL) on your data. This can be done in two ways:

1. the table-level default_time_to_live option, or
2. a per-write USING TTL clause.

Note that setting a default TTL on a table will not set an expiry on existing data (recall that SSTables are immutable). You will need to iterate through all the partitions in the table and delete them.
It's very important that you only delete whole partitions, not individual rows within a partition. Otherwise, reads will have to scan over row tombstones, which can cause performance problems when you have wide partitions.
I also recommend that you delete partitions in small batches. In this context, "small" is subjective: it might mean a few hundred partitions or a couple of thousand per day. There isn't a magic number that fits all scenarios since it depends on your use case, access patterns, data model, cluster capacity, etc. You can only really figure it out through testing. Cheers!
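A minimal sketch of the batched, whole-partition delete approach described above. The table name (events), partition key (event_id), and batch size are hypothetical; in practice you would feed the statements to a real driver session and pause between batches:

```python
def batch_partition_keys(keys, batch_size):
    """Yield partition keys in small batches so deletes are throttled."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

def build_delete_statements(keys):
    """Build whole-partition DELETE statements (never row-level deletes,
    which would leave row tombstones inside surviving partitions)."""
    return [f"DELETE FROM events WHERE event_id = {k};" for k in keys]

# Example: 1,000 expired partition keys, deleted 200 at a time.
expired = list(range(1, 1001))
for batch in batch_partition_keys(expired, 200):
    statements = build_delete_statements(batch)
    # A real implementation would execute each statement here
    # (e.g. via a Cassandra driver session) and sleep between batches.
```

The batch size (200 here) is exactly the kind of number you would tune through testing, as noted above.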
@Erick Ramirez We don't want to delete data entirely from the database. We want to archive it to somewhere like S3/Glacier so it can be retrieved later if needed.
In that case, what you need to do is create snapshots and archive them to an object store like S3. An open-source backup tool such as Cassandra Medusa will help you manage this, including archiving backups to S3.
But you will still need to delete the data to reduce the size of your database. This is unavoidable. You can't just arbitrarily pick SSTables to remove from nodes. Cheers!
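As a rough sketch of the snapshot-and-archive workflow, assuming a hypothetical keyspace my_ks, an S3 bucket my-bucket, and a default data directory layout (all of these names are illustrative):

```
# Take a snapshot on each node, tagged so it can be found later
nodetool snapshot -t archive_2022_06 my_ks

# Copy the snapshot files to S3 (data paths vary by installation)
aws s3 sync /var/lib/cassandra/data/my_ks s3://my-bucket/backups/ \
    --exclude "*" --include "*/snapshots/archive_2022_06/*"

# Or, with Cassandra Medusa already configured for an S3 bucket:
medusa backup --backup-name archive_2022_06
```

Medusa handles the per-node snapshot and upload steps for you, which is why it is the recommended route for managing this at scale.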
@vkayanala_42513, if you want to extract data out of your Apache Cassandra cluster and archive it elsewhere, depending on your needs you could explore options such as the following:
1. the DataStax Bulk Loader, or
2. Apache Spark in combination with the DataStax Spark Cassandra Connector.
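For the first option, a minimal sketch of exporting a table with the DataStax Bulk Loader (dsbulk) and pushing the result to S3. The host, keyspace, table, paths, and bucket names here are hypothetical:

```
# Export a table to CSV files with dsbulk
dsbulk unload -h 127.0.0.1 -k my_ks -t events -url /tmp/events_export

# Push the exported files to S3 for long-term archival
aws s3 cp /tmp/events_export s3://my-bucket/archive/events/ --recursive
```

Unlike snapshots, this produces a portable, queryable export (CSV/JSON), which can be preferable when the archived data may need to be read outside Cassandra.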
For purging existing data from the cluster, Erick has already chalked out a detailed plan in his response earlier which will help you to reclaim space.