Asked by pratyush04_102629 (edited by Erick Ramirez)

Why are nodes reporting "RejectedExecutionException - Too many pending remote requests"?

We have a DSE v6.0.4 cluster having 3 DCs, total of 12 nodes.

  • DC1 - Search DC - 3 nodes
  • DC2- SearchAnalytics DC - 3 nodes
  • DC3 - Analytics DC - 6 nodes

In system.log of DC3, constantly getting below error-

ERROR [MessagingService-Incoming-/] 2020-05-11 16:06:03,075 - java.util.concurrent.RejectedExecutionException while receiving WRITES.WRITE from /, caused by: Too many pending remote requests!
INFO  [ScheduledTasks:1] 2020-05-11 16:06:03,924 - MUTATION messages were dropped in last 5000 ms: 2 internal and 689 cross node. Mean internal dropped latency: 2046 ms and Mean cross-node dropped latency: 2084 ms

nodetool tpstats in one of the nodes of DC3 has pending tasks for HintsDispatcher -

Pool Name                                     Active      Pending (w/Backpressure)   Delayed      Completed   Blocked  All time blocked
HintsDispatcher                                    2                      10 (N/A)       N/A              0         0                 0

Also, the /var/lib/cassandra/hints directory has grown to 204 GB.

I have tried nodetool repair -full, one node at a time, on all nodes of the cluster, but the hints directory size is still around 200 GB, and the HintsDispatcher active and pending task counts are not decreasing.
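For reference, here is a sketch of the checks we have been running on each DC3 node (paths assume the default DSE data layout):

```shell
# Size of accumulated hints on this node
du -sh /var/lib/cassandra/hints

# HintsDispatcher backlog (Active / Pending columns)
nodetool tpstats | grep -i HintsDispatcher

# Full repair, run one node at a time
nodetool repair -full
```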

What should be the next step we should try?

Tags: dse, dropped mutations

1 Answer

Your cluster is overloaded, and the nodes are dropping messages to avoid crashing under the load. DSE 6 introduced so-called "DSE Traffic Control", which rejects incoming remote requests once too many are pending — that is the RejectedExecutionException you are seeing.

The first thing you need to do is upgrade your installation to the latest DSE 6.0 release; 6.0.4 is quite old and has many known problems.

If you see this message only on some nodes, then you have data skew: some nodes hold more data than others. The problem could also come from clients putting too much load on the servers. If you see the message while a DSE Analytics job is writing data, you can decrease the load by tuning job properties, for example lowering spark.cassandra.output.concurrent.writes from its default of 5 to 3 or 2.
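As an illustration, the connector's write concurrency can be lowered when submitting the Analytics job. This is a sketch: the job class and JAR names are placeholders, and only the --conf flags matter here.

```shell
# Hypothetical job class/JAR; throttle Spark Cassandra Connector writes
dse spark-submit \
  --class com.example.MyWriteJob \
  --conf spark.cassandra.output.concurrent.writes=2 \
  my-write-job.jar
```

Lowering spark.cassandra.output.throughput_mb_per_sec (another connector write setting) may also help if throttling concurrency alone is not enough.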

But the problem most probably stems from the fact that your DCs have different sizes: if the load is acceptable for the 6-node DC, the DCs with only 3 nodes each receive roughly twice as much traffic per node as the nodes in the 6-node DC.
