Do two incremental snapshots contain duplicate data?
Snapshots are point-in-time. What does that mean? When a node creates an incremental snapshot, the snapshot hard-links to the sstable files. Nothing is ever stored twice; each entry is just a link to the sstable file. The file itself is not moved or copied.
Let me provide an example:
1) Node takes a full snapshot and has 1 sstable file:
2) Some data is added and now we have an additional sstable file:
3) An incremental snapshot is taken, which captures only the changes between the full snapshot and that point in time, so we now have a link to sstable-2.db.
These are just links, not additional copies of the files.
4) Some data is added and now we have an additional sstable file:
5) Another incremental snapshot is taken, which captures only the changes between the last full snapshot and that point in time. You now have links to sstable-2.db and sstable-3.db. Remember, these are point-in-time: this snapshot only cares about the changes since the last full snapshot, not about the previous incremental snapshots.
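To make steps 1 to 5 concrete, here is a minimal Python sketch that replays them with hard links in a throwaway directory. It is a toy model, not how Cassandra lays out its data: the directory names (`live`, `full`, `incremental_1`, `incremental_2`), the `sstable-1.db` file name, and the `walk_through` helper are all made up for illustration. It only shows that several directory listings can point at the same few files.

```python
import os
import tempfile


def walk_through():
    """Replay steps 1-5 using hard links (hypothetical layout, not Cassandra's)."""
    with tempfile.TemporaryDirectory() as root:
        dirs = {name: os.path.join(root, name)
                for name in ("live", "full", "incremental_1", "incremental_2")}
        for path in dirs.values():
            os.mkdir(path)

        def flush(name):
            # Stand-in for new data landing on disk as an sstable file.
            with open(os.path.join(dirs["live"], name), "w") as f:
                f.write("rows in " + name)

        def link(target, names):
            # A hard link adds a directory entry; no data is copied.
            for name in names:
                os.link(os.path.join(dirs["live"], name),
                        os.path.join(dirs[target], name))

        flush("sstable-1.db")                              # step 1
        link("full", ["sstable-1.db"])
        flush("sstable-2.db")                              # step 2
        link("incremental_1", ["sstable-2.db"])            # step 3
        flush("sstable-3.db")                              # step 4
        link("incremental_2", ["sstable-2.db", "sstable-3.db"])  # step 5

        listing = {name: sorted(os.listdir(path)) for name, path in dirs.items()}
        entries = sum(len(files) for files in listing.values())
        # Entries that share an inode are the same file on disk.
        inodes = {os.stat(os.path.join(path, f)).st_ino
                  for path in dirs.values() for f in os.listdir(path)}
        return listing, entries, len(inodes)


if __name__ == "__main__":
    listing, entries, files_on_disk = walk_through()
    print(listing)
    print(entries, "directory entries,", files_on_disk, "files on disk")
```

On a filesystem that supports hard links, this reports seven directory entries backed by only three files.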
If a compaction occurs and one of the sstable files is deleted from the live filesystem, the file is not actually removed, because other hard links still point to it. Thus, only one copy of each sstable file takes up space; everything else is a link with minimal overhead.
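The hard-link behavior described here can be checked outside Cassandra. The sketch below is an illustration under stated assumptions: the file names and the `survive_compaction` helper are hypothetical, and `os.remove` merely stands in for compaction deleting the live file.

```python
import os
import tempfile


def survive_compaction():
    """Delete the 'live' name of a file and show the linked name still works."""
    with tempfile.TemporaryDirectory() as root:
        live = os.path.join(root, "live-sstable-2.db")      # hypothetical name
        backup = os.path.join(root, "backup-sstable-2.db")  # hypothetical name

        with open(live, "w") as f:
            f.write("payload")

        os.link(live, backup)                      # second name, same data
        links_before = os.stat(backup).st_nlink

        os.remove(live)                            # stand-in for compaction
        links_after = os.stat(backup).st_nlink

        # The data is still reachable through the remaining link.
        with open(backup) as f:
            content = f.read()

        return links_before, links_after, content


if __name__ == "__main__":
    print(survive_compaction())
```

The link count drops from 2 to 1 when the live name is removed, and the contents remain readable. The filesystem frees the space only once the last link is gone.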
No two incremental backups (on the same Cassandra node) will ever contain the same SSTables, so data is never duplicated: each incremental backup is taken when a memtable is flushed to disk as a new SSTable.
Incremental snapshots use a completely different mechanism from generic backups, which take snapshots of all SSTables already on disk. To put it differently: