pooh asked:

When I query node status with `nodetool status`, why is the Load value negative on one of my nodes?



The node showing a Load of 6.74T keeps re-running the same compaction of a 1.2T SSTable: each time it reaches 90+% complete, the compaction starts over, and pending tasks keep growing.
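For reference, a minimal sketch of how a negative Load could be spotted from the command line; the `awk` column positions assume the default `nodetool status` output layout and may need adjusting:

```shell
# Hypothetical helper: print the address of any node whose Load column
# is negative in `nodetool status` output. Assumes the default format:
# Status/State ($1), Address ($2), Load value ($3), Load unit ($4), ...
flag_negative_load() {
  awk '$1 ~ /^(UN|DN|UJ|UL|UM)$/ && $3 ~ /^-/ { print $2 }'
}

# Usage against a live cluster:
#   nodetool status | flag_negative_load
```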







[5 screenshots attached]

1 Answer

Erick Ramirez answered:

I suspect that the compaction threads are hitting a race condition when incrementing/decrementing the load value on the node, leading to negative values. The log also shows that the LCS compaction has fallen behind, indicating an underlying issue with the node.

This is the first I've heard of this problem in Cassandra 3.x. My suggestion is to restart Cassandra as a workaround to at least bring the node back to a "clean" state, then monitor its behaviour for a while. Cheers!

7 comments

pooh commented:

I've tried restarting the node before.

The node is in a "clean" state when the restart completes, but after a while the load becomes negative again.

wdeng replied to pooh:

After restarting the node with the negative "load" value, does it initially show a regular load value? If yes, what does it show?

Do you use LCS on all/most tables in this cluster? Compaction can be aborted for a number of reasons; one quick thing you can check is whether you have enough free space. When LCS falls behind too much, it switches to STCS to merge the SSTables in L0 as quickly as possible, but STCS naturally needs a lot of free space and could be constantly aborted if your free space is low.
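As a quick way to check the free-space point above, something like the following works; the data-directory path is an assumption, so substitute whatever your `data_file_directories` setting points at:

```shell
# Print the Use% figure (without the % sign) for a given mount point,
# using POSIX `df -P` output where column 5 is the capacity percentage.
used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example against the (hypothetical) Cassandra data directory:
#   used_pct /var/lib/cassandra/data
used_pct /
```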

If you use LCS, you should normally keep the per-node density below 1TB. When your per-node data grows larger than that, consider either adding more nodes or switching to STCS instead.
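For completeness, a hedged sketch of what switching to STCS would look like; the keyspace and table names (`ks.tbl`) are placeholders, not names from this thread, and this is a configuration-change sketch rather than a recommendation to run it blindly:

```shell
# Hypothetical: change a table's compaction strategy from LCS to STCS.
# "ks" and "tbl" are placeholder identifiers.
cqlsh -e "ALTER TABLE ks.tbl WITH compaction = {
  'class': 'SizeTieredCompactionStrategy'
};"
```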

pooh replied to wdeng:

When I restart the node with the negative load value, its load initially shows as 2.35T.

This is the only table in the C* cluster, and its compaction strategy is LCS.

When the compaction reaches 90%+, I see that disk Use% is about 75%, so there should still be enough space for STCS compaction, right?

Right now the per-node density is greater than 2T.

I feel that if I have to change the compaction strategy, I might as well just restore the data instead.

Erick Ramirez replied to pooh:

The only thing I can conclude from this is that your cluster is hitting an unknown bug that leads to errors in the load calculation when compactions run.

I suggest you open a Cassandra ticket, attach the debug logs you already have, and include as much background information as you can, such as the schema of the affected table and the outputs of nodetool tablestats and compactionstats. Cheers!
