One node of my Cassandra cluster went down a few days ago.
(DSE version: 6.7.3)
My configuration is the default (cassandra.yaml):
```yaml
# Back-pressure settings #
# If enabled, the coordinator will apply the back-pressure strategy specified below to each mutation
# sent to replicas, with the aim of reducing pressure on overloaded replicas.
back_pressure_enabled: false
# The back-pressure strategy applied.
# The default implementation, RateBasedBackPressure, takes three arguments:
# high ratio, factor, and flow type, and uses the ratio between incoming mutation responses and outgoing mutation requests.
# If below high ratio, outgoing mutations are rate limited according to the incoming rate decreased by the given factor;
# if above high ratio, the rate limiting is increased by the given factor;
# the recommended factor is a whole number between 1 and 10, use larger values for a faster recovery
# at the expense of potentially more dropped mutations;
# the rate limiting is applied according to the flow type: if FAST, it's rate limited at the speed of the fastest replica,
# if SLOW at the speed of the slowest one.
# New strategies can be added. Implementors need to implement org.apache.cassandra.net.BackpressureStrategy and
# provide a public constructor that accepts Map<String, Object>.
back_pressure_strategy:
    - class_name: org.apache.cassandra.net.RateBasedBackPressure
      parameters:
        - high_ratio: 0.90
          factor: 5
          flow: FAST
```
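To make sure I read the comments above correctly, here is a minimal Python sketch of the RateBasedBackPressure rule as they describe it. It assumes that "factor" is applied as a percentage of the current rate and simplifies the real logic; next_rate and effective_rate are hypothetical helper names, not Cassandra's API. Note also that back_pressure_enabled is false in this config, so according to the comment this strategy is not applied at all.

```python
def next_rate(limiter_rate: float, incoming_rate: float, outgoing_rate: float,
              high_ratio: float = 0.90, factor: int = 5) -> float:
    """Hypothetical sketch: new outgoing rate limit for one replica.

    incoming_rate: mutation responses received per second from the replica.
    outgoing_rate: mutation requests sent per second to the replica.
    Assumes `factor` is a percentage (e.g. 5 means 5%).
    """
    if outgoing_rate <= 0:
        return limiter_rate  # nothing sent in this window: keep the current limit
    ratio = incoming_rate / outgoing_rate
    if ratio >= high_ratio:
        # Replica keeps up: relax the limit by `factor` percent.
        return limiter_rate + limiter_rate * factor / 100
    # Replica falls behind: limit to the incoming rate decreased by `factor` percent,
    # but only if that actually lowers the current limit.
    new_rate = incoming_rate - incoming_rate * factor / 100
    return new_rate if 0 < new_rate < limiter_rate else limiter_rate


def effective_rate(replica_rates: list[float], flow: str = "FAST") -> float:
    """FAST follows the fastest replica's limit, SLOW follows the slowest one."""
    if flow == "FAST":
        return max(replica_rates)
    if flow == "SLOW":
        return min(replica_rates)
    raise ValueError(f"unknown flow type: {flow}")
```

For example, next_rate(100.0, 50.0, 100.0) returns 47.5: the replica answered only half of the mutations (ratio 0.5 is below high_ratio 0.90), so the limit drops to 95% of the incoming rate.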
Here is what I found:
1. The total number of threads kept increasing until it reached the Linux per-user process limit (ulimit -u: 65535). I counted the current total with: ps -eLf | wc -l
2. I found these messages in "debug.log":
   2-1. Remote TPC backpressure is active with count 1280.
   2-2. Backpressure rejection while receiving ...
   2-3. unable to create new native thread.
   2-4. java.lang.OutOfMemoryError while receiving ...
What could cause the node to get stuck with the "Remote TPC backpressure" count increasing?
The attached file debug_0628.txt is an excerpt from the full "debug.log".
I saw this entry in the DataStax Enterprise 6.7 release notes:
Reject requests from the TPC backpressure queue when they have been on the queue for too long. (DSP-15875)
Could this fix be related to my issue?
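To picture what the DSP-15875 entry describes, here is a minimal, hypothetical Python sketch of age-based rejection: each request records when it was enqueued, and when it is taken off the queue it is rejected instead of processed if it has waited longer than a limit. AgeLimitedQueue and max_wait are illustrative names I made up; this is not DSE's implementation.

```python
import time
from collections import deque


class AgeLimitedQueue:
    """Hypothetical FIFO that rejects items which waited longer than max_wait seconds."""

    def __init__(self, max_wait: float, clock=time.monotonic):
        self.max_wait = max_wait
        self.clock = clock  # injectable clock, useful for testing
        self._items = deque()
        self.rejected = 0

    def put(self, item) -> None:
        # Remember when the item entered the queue.
        self._items.append((self.clock(), item))

    def get(self):
        """Return the oldest item still within max_wait, or None if none remain."""
        while self._items:
            enqueued_at, item = self._items.popleft()
            if self.clock() - enqueued_at <= self.max_wait:
                return item
            self.rejected += 1  # waited too long: reject instead of processing
        return None
```

With max_wait=1.0, an item enqueued at t=0.0 and dequeued at t=1.2 is counted as rejected, while one enqueued at t=0.5 is still returned.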