pooh asked:

When I query node status with `nodetool status`, why is the Load value negative on one of my nodes?



The node showing a Load of 6.74T keeps re-running the same compaction of a 1.2T SSTable: each time it reaches 90+% complete, the compaction starts over, and pending tasks keep growing.
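For reference, a minimal sketch of how a negative Load could be spotted from the command line; the `awk` column positions assume the default `nodetool status` output layout and may need adjusting:

```shell
# Hypothetical helper: print the address of any node whose Load column
# is negative in `nodetool status` output. Assumes the default format:
# Status/State ($1), Address ($2), Load value ($3), Load unit ($4), ...
flag_negative_load() {
  awk '$1 ~ /^(UN|DN|UJ|UL|UM)$/ && $3 ~ /^-/ { print $2 }'
}

# Usage against a live cluster:
#   nodetool status | flag_negative_load
```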







[5 screenshots attached]

1 Answer

Erick Ramirez answered:

I suspect that the compaction threads are hitting a race condition when incrementing/decrementing the load value on the node, leading to negative values. The log also shows that the LCS compaction has fallen behind, indicating an underlying issue with the node.

This is the first I've heard of this problem in Cassandra 3.x. My suggestion is to restart Cassandra as a workaround to at least bring the node back to a "clean" state, then monitor its behaviour for a while. Cheers!

7 comments

pooh commented:

I've tried restarting the node before.

The node is in a "clean" state when the restart completes, but after a while the load becomes negative again.

wdeng replied to pooh:

After restarting the node with the negative "load" value, does it initially show a regular load value? If yes, what does it show?

Do you use LCS on all/most tables in this cluster? Compaction can be aborted for a number of reasons; one quick thing you can check is whether you have enough free space. When LCS falls behind too much, it switches to STCS to merge the SSTables in L0 as quickly as possible, but STCS naturally needs a lot of free space and could be constantly aborted if your free space is low.
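As a quick way to check the free-space point above, something like the following works; the data-directory path is an assumption, so substitute whatever your `data_file_directories` setting points at:

```shell
# Print the Use% figure (without the % sign) for a given mount point,
# using POSIX `df -P` output where column 5 is the capacity percentage.
used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example against the (hypothetical) Cassandra data directory:
#   used_pct /var/lib/cassandra/data
used_pct /
```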

If you use LCS, you should normally keep the per-node density below 1TB. When your per-node data grows larger than that, consider either adding more nodes or switching to STCS instead.
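For completeness, a hedged sketch of what switching to STCS would look like; the keyspace and table names (`ks.tbl`) are placeholders, not names from this thread, and this is a configuration-change sketch rather than a recommendation to run it blindly:

```shell
# Hypothetical: change a table's compaction strategy from LCS to STCS.
# "ks" and "tbl" are placeholder identifiers.
cqlsh -e "ALTER TABLE ks.tbl WITH compaction = {
  'class': 'SizeTieredCompactionStrategy'
};"
```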

pooh replied to wdeng:

When I restart the node with the negative load value, its load initially shows as 2.35T.

This is the only table in the C* cluster, and its compaction strategy is LCS.

When the compaction reaches 90%+, I see that disk Use% is about 75%, so there should still be enough space for STCS compaction, right?

Right now the per-node density is greater than 2T.

I feel that if I have to change the compaction strategy, I might as well just restore the data instead.

Erick Ramirez replied to pooh:

The only thing I can conclude from this is that your cluster is hitting an unknown bug that leads to errors in the load calculation when compactions run.

I suggest you open a Cassandra ticket, attach the debug logs you already have, and include as much background information as you can, such as the schema of the affected table and the outputs of nodetool tablestats and compactionstats. Cheers!
