Do two incremental snapshots contain duplicate data?
Snapshots are point-in-time. In practice, that means that when a node creates an incremental snapshot, the snapshot holds hard links to the sstable files. Nothing is ever stored twice; each entry is just a link to the existing sstable file. The file itself is not moved or copied.
Let me provide an example:
1) Node takes a full snapshot and has 1 sstable file:
sstable-1.db
2) Some data is added and now we have an additional sstable file:
sstable-2.db
3) An incremental is taken, which captures only the changes made since the full snapshot up to that point in time, so we now have a link to sstable-2.db.
These are just links, not copies of the files.
4) Some data is added and now we have an additional sstable file:
sstable-3.db
5) Another incremental is taken, which captures only the changes made since the last full snapshot up to that point in time. Thus, you now have links to sstable-2.db and sstable-3.db. Remember, these are point-in-time, so this snapshot only cares about the changes since the last full snapshot and not about the previous incremental snapshots.
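The five steps above can be modelled with plain sets. This is only a sketch of the bookkeeping described in this answer, not Cassandra code; the function name `incremental_links` is made up for illustration.

```python
def incremental_links(live_sstables, full_snapshot):
    """Return the SSTables an incremental would link to:
    everything on disk that is not part of the last full snapshot."""
    return sorted(set(live_sstables) - set(full_snapshot))


# Step 1: the full snapshot links to the only SSTable on disk.
full = ["sstable-1.db"]

# Steps 2-3: one new SSTable, then the first incremental.
first = incremental_links(["sstable-1.db", "sstable-2.db"], full)

# Steps 4-5: another new SSTable, then the second incremental.
second = incremental_links(
    ["sstable-1.db", "sstable-2.db", "sstable-3.db"], full
)

print(first)   # ['sstable-2.db']
print(second)  # ['sstable-2.db', 'sstable-3.db']
```

Both incrementals name sstable-2.db, but each name is only a link, so the data itself exists once.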
If a compaction occurs and one of the sstable files is deleted from the live FS, the file's data is not actually removed, because hard links still point to it. Thus, only one copy of each sstable file takes up space; everything else is a link with minimal overhead.
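The hard-link behaviour can be checked on any filesystem that supports hard links. The sketch below uses a temporary directory and made-up directory names (`incremental-1`, `incremental-2`) rather than a real Cassandra data directory, and `os.remove` stands in for compaction deleting the live file.

```python
import os
import tempfile


def unique_bytes(paths):
    """Sum file sizes, counting each inode once so that
    hard links to the same file are not double counted."""
    seen = {}
    for path in paths:
        st = os.stat(path)
        seen[(st.st_dev, st.st_ino)] = st.st_size
    return sum(seen.values())


with tempfile.TemporaryDirectory() as root:
    # A 1 KiB stand-in for a live SSTable file.
    live = os.path.join(root, "sstable-2.db")
    with open(live, "wb") as f:
        f.write(b"x" * 1024)

    # Two "snapshots" that each hard-link the same file.
    links = []
    for name in ("incremental-1", "incremental-2"):
        os.mkdir(os.path.join(root, name))
        link = os.path.join(root, name, "sstable-2.db")
        os.link(live, link)  # a new name for the same inode, no data copied
        links.append(link)

    link_count = os.stat(live).st_nlink  # live file + two links

    os.remove(live)  # stand-in for compaction removing the live file
    survives = os.path.exists(links[0])
    remaining_links = os.stat(links[0]).st_nlink

    apparent = sum(os.path.getsize(p) for p in links)  # naive per-name total
    actual = unique_bytes(links)                       # space really used

print(link_count, survives, remaining_links, apparent, actual)
```

The naive total counts the file once per name, while the per-inode total shows that only one copy occupies disk space.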
No two incremental backups (on the same Cassandra node) will ever contain the same SSTables, so data is never duplicated. This is because incremental backups are taken when memtables are flushed to disk as SSTables.
Incremental snapshots use a completely different mechanism from generic backups, which take snapshots of all SSTables already on disk.
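The difference between the two mechanisms can be sketched as a toy simulation. It assumes the usual layout of a `backups/` directory and a `snapshots/<tag>/` directory under the table directory; the `flush` and `snapshot` functions are hypothetical stand-ins written for this example, not Cassandra APIs.

```python
import os
import tempfile


def flush(table_dir, name, payload=b"data"):
    """Toy memtable flush: write a new SSTable and hard-link it into backups/."""
    path = os.path.join(table_dir, name)
    with open(path, "wb") as f:
        f.write(payload)
    backups_dir = os.path.join(table_dir, "backups")
    os.makedirs(backups_dir, exist_ok=True)
    os.link(path, os.path.join(backups_dir, name))


def snapshot(table_dir, tag):
    """Toy snapshot: hard-link every SSTable currently in the table directory."""
    target = os.path.join(table_dir, "snapshots", tag)
    os.makedirs(target)
    for name in sorted(os.listdir(table_dir)):
        if name.endswith(".db"):  # skips the backups/ and snapshots/ directories
            os.link(os.path.join(table_dir, name), os.path.join(target, name))
    return sorted(os.listdir(target))


with tempfile.TemporaryDirectory() as table_dir:
    flush(table_dir, "sstable-1.db")
    full = snapshot(table_dir, "full")
    flush(table_dir, "sstable-2.db")
    flush(table_dir, "sstable-3.db")
    backups = sorted(os.listdir(os.path.join(table_dir, "backups")))
    later = snapshot(table_dir, "later")

print(full)     # ['sstable-1.db']
print(backups)  # ['sstable-1.db', 'sstable-2.db', 'sstable-3.db']
print(later)    # ['sstable-1.db', 'sstable-2.db', 'sstable-3.db']
```

In this model each flush adds exactly one new link to `backups/`, so no SSTable is backed up twice, whereas every snapshot links whatever is on disk at that moment.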
Cheers!