Beck asked Erick Ramirez edited

What could cause a node to get stuck with "Remote TPC backpressure" increasing?

Hi all,

One node of my Cassandra cluster went down a few days ago.

(DSE version: 6.7.3)

My configuration is the default (cassandra.yaml):

# Back-pressure settings #
# If enabled, the coordinator will apply the back-pressure strategy specified below to each mutation
# sent to replicas, with the aim of reducing pressure on overloaded replicas.
back_pressure_enabled: false
# The back-pressure strategy applied.
# The default implementation, RateBasedBackPressure, takes three arguments:
# high ratio, factor, and flow type, and uses the ratio between incoming mutation responses and outgoing mutation requests.
# If below high ratio, outgoing mutations are rate limited according to the incoming rate decreased by the given factor;
# if above high ratio, the rate limiting is increased by the given factor;
# the recommended factor is a whole number between 1 and 10, use larger values for a faster recovery
# at the expense of potentially more dropped mutations;
# the rate limiting is applied according to the flow type: if FAST, it's rate limited at the speed of the fastest replica,
# if SLOW at the speed of the slowest one.
# New strategies can be added. Implementors need to implement org.apache.cassandra.net.BackpressureStrategy and
# provide a public constructor that accepts Map<String, Object>.
back_pressure_strategy:
    - class_name: org.apache.cassandra.net.RateBasedBackPressure
      parameters:
        - high_ratio: 0.90
          factor: 5
          flow: FAST

Here is what I found:

1. The total thread count kept increasing until it hit the Linux max user processes limit (ulimit -u: 65535). I checked the current total with: ps -eLf | wc -l

2. I found these messages in "debug.log":

2-1. Remote TPC backpressure is active with count 1280.
2-2. Backpressure rejection while receiving ...
2-3. unable to create new native thread.
2-4. java.lang.OutOfMemoryError while receiving .
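
To see how often each of the four messages above occurs, the log can be tallied with a short script. This is only a minimal sketch, not a DSE tool: the function name `count_symptoms` and the short labels are made up for illustration, and the substrings are copied from the messages quoted above.

```python
# Substrings copied from the messages quoted above; the keys are
# hypothetical labels chosen for this sketch.
SYMPTOMS = {
    "tpc_backpressure_active": "Remote TPC backpressure is active",
    "backpressure_rejection": "Backpressure rejection while receiving",
    "native_thread_failure": "unable to create new native thread",
    "out_of_memory": "java.lang.OutOfMemoryError",
}


def count_symptoms(lines):
    """Count how many lines contain each symptom substring."""
    counts = {label: 0 for label in SYMPTOMS}
    for line in lines:
        for label, needle in SYMPTOMS.items():
            if needle in line:
                counts[label] += 1
    return counts
```

It accepts any iterable of lines, so an open file works directly, for example `with open("debug.log", errors="replace") as log: print(count_symptoms(log))`.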

What could cause the node to get stuck with "Remote TPC backpressure" increasing?

The attachment is an excerpt of the full "debug.log": debug_0628.txt

---

Update:

I saw this in the DataStax Enterprise 6.7 release notes:

Reject requests from the TPC backpressure queue when they have been on the queue for too long. (DSP-15875)

Could that be related to this issue?

thread-per-core backpressure
debug-0628.txt (32.8 KiB)

1 Answer

Erick Ramirez answered Erick Ramirez commented

@Beck the symptoms you describe, as well as a quick review of the debug.log you attached, indicate that the node is overloaded. This isn't really a tuning exercise; it is more a matter of having the right amount of resources and enough nodes in your cluster to cope with the traffic. Out of curiosity, how much RAM is available on the server and how much memory is allocated to the heap? Cheers!


Beck commented ·

Hi @Erick Ramirez,

It is 128 GB per node, and my cluster uses the default settings.

I don't think it's overloaded, based on what I see on my Grafana dashboard.

Let's go back to one point I observed:

1. The total thread count kept increasing until it hit the Linux max user processes limit (ulimit -u: 65535). I checked the current total with: ps -eLf | wc -l


I want to know what causes it.
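
One way to narrow this down is to check which process actually owns the threads. The following is a minimal sketch, assuming a Linux host where each `/proc/<pid>/status` file carries `Name:` and `Threads:` fields; the helper names `parse_status` and `top_thread_owners` are hypothetical. It shows where the threads are, not why they were created.

```python
import os


def parse_status(status_text):
    """Turn the 'Key:\\tvalue' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in status_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def top_thread_owners(proc_root="/proc", limit=5):
    """Return up to `limit` (threads, pid, name) tuples, highest thread count first."""
    owners = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        try:
            with open(os.path.join(proc_root, entry, "status")) as status:
                fields = parse_status(status.read())
        except OSError:
            continue  # the process exited or is not readable
        if "Threads" in fields:
            owners.append((int(fields["Threads"]), int(entry), fields.get("Name", "?")))
    owners.sort(reverse=True)
    return owners[:limit]
```

Calling `top_thread_owners()` on the affected node while the count is climbing would show whether the DSE JVM or some other process holds most of the threads.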


Thanks for your reply!

Erick Ramirez ♦♦ Beck commented ·

The backpressure gets applied because there are too many incoming requests and the nodes cannot cope with them. Are you seeing dropped mutations on the replicas? That would indicate that the commitlog disk cannot keep up with the writes. Cheers!

Erick Ramirez ♦♦ Erick Ramirez ♦♦ commented ·

You can check the logs for entries which look like "MUTATION messages dropped". Cheers!
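
The log check suggested above can be scripted. This is a rough sketch rather than an official tool: the exact wording of the log entry is an assumption, so the pattern deliberately matches any line that mentions MUTATION followed by "dropped", and the name `find_dropped_mutations` is hypothetical.

```python
import re

# Loose pattern: the exact wording of the entry is an assumption, so this
# matches any line that mentions MUTATION followed later by "dropped".
DROPPED_MUTATION = re.compile(r"MUTATION\b.*\bdropped", re.IGNORECASE)


def find_dropped_mutations(lines):
    """Return (line_number, text) pairs for lines that look like dropped-mutation entries."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if DROPPED_MUTATION.search(line)
    ]
```

An empty result for a replica's log means no line matched the pattern; because the pattern is loose, any hits are worth reading by eye before drawing conclusions.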
