Tri asked · Erick Ramirez edited
Do snapshot hardlinks become corrupted when compaction has deleted SSTables?

Let's say I run `nodetool snapshot` today, which creates hardlinks to all the SSTable files at that point in time. Tomorrow there will be some cycles of compaction, which will delete some SSTables and create new ones. The snapshot made hardlinks to SSTable files, and some of those files are later deleted by compaction. Does that mean the hardlinks become invalid? More precisely, would the snapshot become corrupted?

1 Answer

Erick Ramirez answered · edited

Hardlinks are not a Cassandra concept, so this isn't really a Cassandra question, but I'll try to explain.

Hardlinks are implemented by the underlying server filesystem and are pointers to the original filesystem inodes of the SSTables. This means that they are controlled and managed by the filesystem -- not Cassandra.

In a Linux filesystem, creating a hardlink to a file simply creates a new directory entry that points to the same inode as the original file. The original filename is itself a hardlink, meaning that if you create two additional hardlinks, the inode has 3 pointers to it.

Let me illustrate with an example. Consider a text file in a directory:

```shell
$ ls -i *
datadir:
17337322 users.txt
```

The text file's inode number is `17337322`.

If I create a hardlink in another directory called `snapshot` and give it a different filename:

```shell
$ ln datadir/users.txt snapshot/somefile.txt
```

The new file `somefile.txt` has a different filename but the same inode as `users.txt`:

```shell
$ ls -i *
datadir:
17337322 users.txt

snapshot:
17337322 somefile.txt
```

It may have a different filename, but it's the exact same file. I can also create another hardlink in yet another directory and give it the same filename:

```shell
$ ln datadir/users.txt yetanotherdir/users.txt
$ ls -i *
datadir:
17337322 users.txt

snapshot:
17337322 somefile.txt

yetanotherdir:
17337322 users.txt
```

To be clear, there aren't 3 copies of the file -- they're all just the one file in 3 different directories with pointers to the same file's inode.
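You can read the link count that the filesystem keeps for the inode with `stat`. Here is a quick sketch that recreates the example above in a temporary directory (it assumes GNU coreutils `stat`; the file names are the illustrative ones from the example, not real SSTables):

```shell
# Recreate the example in a throwaway directory (illustrative names only).
set -e
tmp=$(mktemp -d)
mkdir "$tmp/datadir" "$tmp/snapshot" "$tmp/yetanotherdir"
echo "alice" > "$tmp/datadir/users.txt"

# Two additional hardlinks, as in the example above.
ln "$tmp/datadir/users.txt" "$tmp/snapshot/somefile.txt"
ln "$tmp/datadir/users.txt" "$tmp/yetanotherdir/users.txt"

# %h prints the inode's hardlink count: 3 names, one inode.
stat -c '%h' "$tmp/datadir/users.txt"   # prints 3

rm -r "$tmp"
```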

If I delete the original `datadir/users.txt`, the other files remain because they are hardlinks:

```shell
$ rm datadir/users.txt
$ ls -i *
datadir:

snapshot:
17337322 somefile.txt

yetanotherdir:
17337322 users.txt
```
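The same behaviour can be verified end to end as a script (a sketch with made-up names, assuming GNU coreutils `stat`): after the original name is removed, the data is still readable through the remaining link, and the inode's link count drops to 1.

```shell
set -e
tmp=$(mktemp -d)
mkdir "$tmp/datadir" "$tmp/snapshot"
echo "alice,bob" > "$tmp/datadir/users.txt"
ln "$tmp/datadir/users.txt" "$tmp/snapshot/somefile.txt"

# Delete the original name; the inode survives via the snapshot link.
rm "$tmp/datadir/users.txt"

cat "$tmp/snapshot/somefile.txt"            # prints alice,bob
stat -c '%h' "$tmp/snapshot/somefile.txt"   # prints 1

rm -r "$tmp"
```

The inode's data blocks are only released once the last remaining link is removed, which is exactly why snapshots keep obsolete SSTables on disk.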

To answer your question: no, the snapshot does not get corrupted when SSTables get compacted away. The inode does not get reclaimed by the filesystem while a pointer to it still exists in the snapshot subdirectories. For this reason, you need to manually clean up old snapshots that you no longer require, because they take up disk space, as stated in the Taking a snapshot document:

> A single snapshot requires little disk space. However, snapshots can cause your disk usage to grow more quickly over time because a snapshot prevents old obsolete data files from being deleted.

Follow the instructions in Deleting snapshot files for details. Cheers!
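For reference, snapshot housekeeping is typically done with the `nodetool` CLI; the exact options vary by Cassandra version, so check `nodetool help` on your own node:

```shell
$ nodetool listsnapshots           # list existing snapshots and their sizes
$ nodetool clearsnapshot -t mytag  # delete the snapshot tagged "mytag"
```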

2 comments

Tri commented:

Thank you very much for taking the time to give a detailed answer. It's really cool that Linux keeps track of the reference count to a file's inode. Now it all makes sense.

Maybe DataStax should consider integrating your answers into the documentation.

Erick Ramirez commented:

Thanks for the feedback. We're a little torn about adding it to the documentation since it's a Linux implementation detail, but we'll take it into consideration. Cheers!
