Hardlinks are not a Cassandra concept, so this isn't really a Cassandra question, but I'll try to explain.
Hardlinks are implemented by the underlying server filesystem and are pointers to the original filesystem inodes of the SSTables. This means they are controlled and managed by the filesystem -- not Cassandra.
In a Linux filesystem, creating a hardlink to another file simply creates a new entry in a directory that points to the same inode as the original file. The original directory entry is itself a hardlink, so if you create two additional hardlinks, the inode has three names pointing to it (a link count of 3).
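To make the link count concrete, here is a small illustrative shell session. The directory `/tmp/hardlink-demo` and the filenames are invented for this sketch; `stat -c '%h'` prints a file's hardlink count on Linux:

```shell
# Sketch only: watch the inode's link count change as hardlinks come and go.
# /tmp/hardlink-demo and the filenames are invented for this example.
mkdir -p /tmp/hardlink-demo
cd /tmp/hardlink-demo
echo "hello" > original.txt

stat -c '%h' original.txt   # prints 1: the original name is the only link
ln original.txt extra.txt
stat -c '%h' original.txt   # prints 2: two names, one inode
stat -c '%i' original.txt   # same inode number...
stat -c '%i' extra.txt      # ...as this one
rm extra.txt
stat -c '%h' original.txt   # prints 1 again
```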
Let me illustrate with an example. Consider a text file in a directory:
$ ls -i *
datadir:
17337322 users.txt
The text file's inode number is 17337322.
If I create a hardlink in another directory called snapshot and give it a different filename:
$ ln datadir/users.txt snapshot/somefile.txt
The new file somefile.txt has a different filename but has the same inode as users.txt:
$ ls -i *
datadir:
17337322 users.txt

snapshot:
17337322 somefile.txt
It may have a different filename but it's the exact same file. I can also create another hardlink in another directory and give it the same filename:
$ ln datadir/users.txt yetanotherdir/users.txt
$ ls -i *
datadir:
17337322 users.txt

snapshot:
17337322 somefile.txt

yetanotherdir:
17337322 users.txt
To be clear, there aren't 3 copies of the file -- they're all just the one file in 3 different directories with pointers to the same file's inode.
If I delete the original datadir/users.txt, the other files remain because they are hardlinks:
$ rm datadir/users.txt
$ ls -i *
datadir:

snapshot:
17337322 somefile.txt

yetanotherdir:
17337322 users.txt
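Along the same lines, here is an illustrative sketch showing that deleting the original name leaves the data fully intact, because the kernel only frees an inode when its link count drops to zero. The paths under `/tmp/hl-demo2` are made up for this example:

```shell
# Sketch: the inode and its data survive until the LAST hardlink is removed.
# /tmp/hl-demo2 and its subdirectories/filenames are invented for this example.
mkdir -p /tmp/hl-demo2/datadir /tmp/hl-demo2/snapshot
cd /tmp/hl-demo2
echo "alice" > datadir/users.txt
ln datadir/users.txt snapshot/somefile.txt

rm datadir/users.txt                 # delete the "original" name
cat snapshot/somefile.txt            # prints "alice": the data is still there
stat -c '%h' snapshot/somefile.txt   # prints 1: one remaining link
```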
To answer your question, no. The snapshot does not get corrupted when SSTables get compacted out. The inode does not get wiped from the filesystem while a pointer to it exists in the snapshot subdirectories. For this reason, you need to manually clean up old snapshots that you no longer require, because they take up disk space as stated in the Taking a snapshot document:
A single snapshot requires little disk space. However, snapshots can cause your disk usage to grow more quickly over time because a snapshot prevents old obsolete data files from being deleted.
Follow the instructions in Deleting snapshot files for details. Cheers!
Thank you very much for taking the time to give a detailed answer. The concept of Linux keeping track of the reference count to the file's inode is really cool. Now it all makes sense.
Maybe DataStax should consider integrating your answers into the documentation.
Thanks for the feedback. We're a little torn about adding it to the documentation since it's a Linux implementation detail, but we'll take it into consideration. Cheers!