Bringing together the Apache Cassandra experts from the community and DataStax.


pooh asked ·

When I check node status with nodetool status, why is the load value of one of my nodes negative?

One of my clusters has 8 nodes, all of them in UN state, but two of the nodes report abnormal load values:

[screenshot: 1610590836493.png]


The node with a load of 6.74T keeps repeating the same compaction of a 1.2T SSTable; every time it gets past 90% complete, the compaction starts over, and the pending tasks count keeps growing.

[screenshot: 1610591136201.png]


debug.log

[screenshot: 1610592022705.png]


system.log

[screenshot: 1610592081042.png]


How can I fix this problem?

cassandra

1 Answer

Erick Ramirez answered ·

I suspect the compaction threads are hitting a race condition when incrementing/decrementing the load value on the node, leading to negative values. The log also shows that LCS compaction has fallen behind, indicating an underlying issue with the node.

This is the first I've heard of this problem in Cassandra 3.x. My suggestion is to restart Cassandra as a workaround to at least bring the node back to a "clean" state, then monitor its behaviour for a while. Cheers!
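A minimal sketch of a clean restart, written in dry-run form so each command is echoed rather than executed (assumes a systemd-managed install; adjust the service name for your packaging):

```shell
#!/bin/sh
# Dry-run sketch of a clean Cassandra restart. 'run' only echoes each
# command; drop the echo to execute for real on the affected node.
run() { echo "+ $*"; }

run nodetool drain                     # flush memtables, stop accepting writes
run sudo systemctl restart cassandra   # restart the service (name assumed)
run nodetool status                    # confirm the node comes back as UN
```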


I've tried restarting the node before.

The node is in a "clean" state when the restart completes, but after a while the load becomes negative again.


After restarting the node with the negative "load" value, does it initially show a regular load value? If so, what does it show?


Do you use LCS on all/most tables in this cluster? Compaction can be aborted for a number of reasons; one quick thing you can check is whether you have enough free space. When LCS falls too far behind, it switches to STCS to merge the SSTables in L0 as quickly as possible, but STCS naturally needs a lot of free space and could be repeatedly aborted if your free space is low.
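To make the free-space check concrete, here is a minimal headroom sketch. The assumption (roughly 1x the size of the rewritten SSTables needed as temporary space) and the disk capacity are illustrative, not taken from this cluster:

```shell
#!/bin/sh
# Hedged sketch: estimate whether a node has enough free space for an
# STCS-style merge. Assumption: the merge temporarily needs ~1x the size
# of the SSTables it rewrites. Capacity below is an example value.
sstable_size_gb=1229   # ~1.2 TB SSTable from the question
disk_total_gb=10240    # example disk capacity (assumed)
disk_used_pct=75       # Use% as reported by df

free_gb=$(( disk_total_gb * (100 - disk_used_pct) / 100 ))
if [ "$free_gb" -ge "$sstable_size_gb" ]; then
  echo "ok: ${free_gb}G free >= ${sstable_size_gb}G needed"
else
  echo "low: ${free_gb}G free < ${sstable_size_gb}G needed"
fi
```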


If you use LCS, you should normally keep the per-node density below 1TB. When your per-node data grows beyond that, you need to consider either adding more nodes or switching to STCS instead.
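For reference, the compaction strategy is a table-level setting; a sketch of the CQL involved, with `ks.tbl` as a placeholder for the affected keyspace and table (note that the switch itself triggers a round of recompaction):

```sql
-- Placeholder names; run only after confirming capacity.
ALTER TABLE ks.tbl
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
```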


When I restart the node with the negative load value, its load shows as 2.35T.

This is the only table in the C* cluster, and its compaction strategy is LCS.

When the compaction gets to 90%+, I see that the disk space Use% is about 75%, so there is still enough free space for STCS compaction, right?

Right now the density of each node is greater than 2T.

I feel like if I have to change the compaction strategy, I might as well just recover the data instead.


The only thing I could conclude from this is that your cluster is hitting an unknown bug that leads to errors in load calculation when compactions run.

I suggest you open a Cassandra ticket, provide the debug logs you already have, and include as much background information as you can, such as the schema of the affected table and the nodetool tablestats and compactionstats outputs. Cheers!
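The diagnostics above could be gathered with something like the following sketch, again in dry-run form (`ks.tbl` is a placeholder for the affected keyspace and table):

```shell
#!/bin/sh
# Sketch of diagnostics to attach to a Cassandra ticket. Dry-run: 'run'
# only echoes each command; drop the echo to execute on a live node.
# 'ks.tbl' is a placeholder keyspace/table name.
run() { echo "+ $*"; }

run nodetool status                  # ring state and per-node load
run nodetool tablestats ks.tbl       # per-table stats for the schema
run nodetool compactionstats -H      # pending/active compactions
```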
