Hardlinks are not a Cassandra concept so this isn't really a Cassandra question but I'll try to explain.
Hardlinks are implemented by the underlying server filesystem and are pointers to the original filesystem inodes of the SSTables. This means that they are controlled and managed by the filesystem -- not Cassandra.
In a Linux filesystem, creating a hardlink to another file simply creates a new directory entry that points to the same inode as the original file. The original filename is itself a hardlink, meaning that if you create two more hardlinks, the inode has 3 pointers to it.
Let me illustrate with an example. Consider a text file in a directory:
$ ls -i *
datadir:
17337322 users.txt
The text file's inode number is 17337322.
If I create a hardlink in another directory called snapshot and give it a different filename:
$ ln datadir/users.txt snapshot/somefile.txt
The new file somefile.txt has a different filename but has the same inode as users.txt:
$ ls -i *
datadir:
17337322 users.txt

snapshot:
17337322 somefile.txt
It may have a different filename but it's the exact same file. I can also create another hardlink in another directory and give it the same filename:
$ ln datadir/users.txt yetanotherdir/users.txt
$ ls -i *
datadir:
17337322 users.txt

snapshot:
17337322 somefile.txt

yetanotherdir:
17337322 users.txt
To be clear, there aren't 3 copies of the file -- they're all just the one file in 3 different directories with pointers to the same file's inode.
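The pointer count can be checked directly. The following is a minimal sketch, assuming a POSIX shell on a filesystem that supports hardlinks; it recreates the three directories in a throwaway temp directory, and the `nlinks` helper is my own illustration (it reads the link count from the second column of `ls -l`), not anything Cassandra provides.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
mkdir "$workdir/datadir" "$workdir/snapshot" "$workdir/yetanotherdir"
echo "alice" > "$workdir/datadir/users.txt"

# Hypothetical helper: the second column of `ls -l` is the inode's link count.
nlinks() { ls -ld -- "$1" | awk '{print $2}'; }

before=$(nlinks "$workdir/datadir/users.txt")

# Two additional hardlinks, mirroring the example above.
ln "$workdir/datadir/users.txt" "$workdir/snapshot/somefile.txt"
ln "$workdir/datadir/users.txt" "$workdir/yetanotherdir/users.txt"

after=$(nlinks "$workdir/datadir/users.txt")
echo "link count before: $before, after: $after"

rm -rf "$workdir"
```

The count goes from 1 (the original name) to 3 (the original name plus the two new ones), while the file's data exists on disk only once.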
If I delete the original datadir/users.txt, the other files remain because they are hardlinks:
$ rm datadir/users.txt
$ ls -i *
datadir:

snapshot:
17337322 somefile.txt

yetanotherdir:
17337322 users.txt
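To confirm that the data itself survives and not just the directory entry, here is a small sketch under the same assumptions (POSIX shell, temp directory, made-up file contents). It removes the original name and then reads the file back through the remaining hardlink.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
mkdir "$workdir/datadir" "$workdir/snapshot"
echo "alice,bob" > "$workdir/datadir/users.txt"

# Hardlink into the snapshot directory, then delete the original name.
ln "$workdir/datadir/users.txt" "$workdir/snapshot/somefile.txt"
rm "$workdir/datadir/users.txt"

# The contents are still readable through the surviving link.
contents=$(cat "$workdir/snapshot/somefile.txt")
# Link count of the inode, taken from the second column of `ls -l`.
remaining=$(ls -ld -- "$workdir/snapshot/somefile.txt" | awk '{print $2}')
echo "contents: $contents (links remaining: $remaining)"

rm -rf "$workdir"
```

The filesystem only releases the inode and its blocks once the last remaining name is removed.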
To answer your question, no. The snapshot does not get corrupted when SSTables get compacted out. The inode does not get wiped from the filesystem as long as a pointer to it exists in the snapshots subdirectories. For this reason, you need to manually clean up old snapshots that you no longer require, because they take up disk space, as stated in the Taking a snapshot document:
A single snapshot requires little disk space. However, snapshots can cause your disk usage to grow more quickly over time because a snapshot prevents old obsolete data files from being deleted.
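One way to see which snapshot files are the ones holding on to disk space is to look for files whose link count has dropped to 1, meaning the snapshot is the only name left for that inode. This is only a simulation in a temp directory: the `sstable-*-Data.db` filenames and the `demo` snapshot name are made up, and the `rm` stands in for compaction removing an obsolete SSTable.

```shell
#!/bin/sh
set -eu

# Simulated table directory with a snapshots/<name> subdirectory.
table_dir=$(mktemp -d)
snap_dir="$table_dir/snapshots/demo"
mkdir -p "$snap_dir"

# Two pretend SSTable data files (hypothetical names).
echo "old data" > "$table_dir/sstable-1-Data.db"
echo "new data" > "$table_dir/sstable-2-Data.db"

# A snapshot is a set of hardlinks to the live files.
ln "$table_dir/sstable-1-Data.db" "$snap_dir/sstable-1-Data.db"
ln "$table_dir/sstable-2-Data.db" "$snap_dir/sstable-2-Data.db"

# Stand-in for compaction deleting the obsolete SSTable.
rm "$table_dir/sstable-1-Data.db"

# Files with exactly one link exist only because the snapshot still points at them.
pinned=$(find "$snap_dir" -type f -links 1)
echo "kept alive only by the snapshot: $pinned"

rm -rf "$table_dir"
```

Files that still have a link count of 2 cost nothing extra, since the live table shares the same inode. Only the ones at 1 are disk space you would get back by deleting the snapshot.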
Follow the instructions in Deleting snapshot files for details. Cheers!
Thank you very much for taking the time to give a detailed answer. Really cool concept, Linux keeping track of the count of references to the file's inode. Now it all makes sense.
Maybe DataStax should consider integrating your answers into the documentation.
Thanks for the feedback. We're a little torn about adding it to the documentation since it's a Linux implementation but we'll take it into consideration. Cheers!