pi_165798 asked Erick Ramirez edited

Why is data not spread evenly across 2 disks in data_file_directories?

My node has two data_file_directories, but one directory is using significantly more disk space than the other:

- /var/lib/cassandra/data (Size: 225 GB - 75% full)

- /mnt/ssd-volume/data (Size: 200 GB - 30% full)

This becomes a problem when I attempt to upgradesstables, and get the following error:

Not enough space to write 13.122GiB to /var/lib/cassandra/data (4.491GiB available)

I have tried running nodetool compact and nodetool cleanup, but the error still occurs.

Is there any way of rebalancing the two disks, so the data is spread more evenly?


1 Answer

Erick Ramirez answered

Something doesn't quite add up with what you posted. If this mount point is 75% full, then 25% must be free, which on a 225GB volume is roughly 56GB:

 /var/lib/cassandra/data (225 GB - 75% full)

I was surprised to see that the compaction thread reported only 4.491GiB available. In any case, sorry that I digressed. That's not really the issue here. :)

In older versions of Cassandra, the compaction thread tries to even out the disk utilisation for JBOD configurations by writing to disks with the most free space. However, this approach is problematic.

Consider the scenario where (a) a partition is fragmented across multiple SSTables and (b) those SSTables are scattered across 2 disks. Say the partition data is in an SSTable on disk X and a tombstone for the same partition is in an SSTable on disk Y. At some point, disk Y has a hardware failure and gets replaced. Since the SSTable with the tombstone on disk Y is no longer available, the data on disk X can get inadvertently resurrected because the tombstone that marked it as deleted is gone.
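To make the scenario concrete, here is a toy model in Python. It is only a sketch: the dictionaries standing in for SSTables, the read() helper and the key "user42" are all made up for illustration and are not how Cassandra's read path is actually implemented. It just shows that when the newest version of a partition is a tombstone and that tombstone disappears, the older data becomes visible again.

```python
# Toy model: each "SSTable" maps a partition key to (write_timestamp, value).
# A value of None stands in for a tombstone (delete marker).
TOMBSTONE = None

def read(key, sstables):
    """Return the newest value for key across the given SSTables.

    Returns None if the key is missing or its newest version is a tombstone.
    """
    versions = [table[key] for table in sstables if key in table]
    if not versions:
        return None
    _, value = max(versions, key=lambda version: version[0])
    return value

disk_x = [{"user42": (100, "alice")}]    # original write lives on disk X
disk_y = [{"user42": (200, TOMBSTONE)}]  # later delete lives on disk Y

# Both disks healthy: the tombstone is newer, so the row reads as deleted.
before_failure = read("user42", disk_x + disk_y)

# Disk Y is replaced and its SSTables are gone: the old data shows up again.
after_failure = read("user42", disk_x)

print(before_failure, after_failure)
```

Keeping all SSTables for a given token range on one disk avoids this, because the data and its tombstone are then lost (or kept) together.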

To prevent this kind of scenario from happening, there is a new algorithm in C* 3.x which prevents the tokens in a range from being distributed across different directories (CASSANDRA-6696). This means that nodes with a JBOD configuration will never achieve even distribution, because the SSTables are split across data disks based on the token ranges they contain. Depending on the range of partition sizes in your data model, there will be inherent skew in data disk utilisation. For example, say disk 1 owns token range 0-10 and the partitions in this range vary from 90MB to 200MB, while disk 2 owns token range 11-20 and its partitions vary from 1MB to 50MB. Disk 2 will always have significantly less data than disk 1. Cassandra uses an algorithm that tries to allocate an even share of token ranges to each data disk, but that does not guarantee the data sizes will be equal.
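The example above can be sketched in a few lines of Python. This is an illustration only: the DISK_RANGES layout, the one-partition-per-token simplification and the random sizes are assumptions made for the sketch, not Cassandra's actual allocation code. It shows that splitting the token range almost evenly still leaves the disks with very different amounts of data.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical layout from the example: each data disk owns one token range.
DISK_RANGES = {
    "disk1": range(0, 11),   # tokens 0-10
    "disk2": range(11, 21),  # tokens 11-20
}

def disk_for_token(token):
    """Return the disk that owns the given token."""
    for disk, tokens in DISK_RANGES.items():
        if token in tokens:
            return disk
    raise ValueError(f"token {token} is not owned by any disk")

def partition_size_mb(token):
    """Partition sizes from the example: large for tokens 0-10, small for 11-20."""
    if token <= 10:
        return random.randint(90, 200)
    return random.randint(1, 50)

# Simplification: exactly one partition per token.
usage_mb = {disk: 0 for disk in DISK_RANGES}
for token in range(0, 21):
    usage_mb[disk_for_token(token)] += partition_size_mb(token)

for disk, used in usage_mb.items():
    print(f"{disk}: {len(DISK_RANGES[disk])} tokens, {used} MB")
```

Each disk owns about half of the tokens, yet disk1 ends up holding far more data simply because the partitions in its range are bigger.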

If you think that the algorithm isn't working correctly, you can force Cassandra to rewrite the SSTables so they are placed on the correct disk by running nodetool relocatesstables. For details, see the documentation here. Cheers!
