We currently have 5 nodes in a DSE Search cluster. The following thread-pool statistics were observed on one particular node:
Pool Name                    Active  Pending (w/Backpressure)  Delayed  Completed  Blocked  All time blocked
TPC/all/WRITE_MEMTABLE_FULL    1928  N/A (N/A)                 N/A          1928   N/A      N/A
TPC/all/WRITE_REMOTE              0  12218 (N/A)               N/A      13289068   N/A      0
Whenever this pattern appears (WRITE_MEMTABLE_FULL active threads spike and WRITE_REMOTE pending requests pile up), the node often enters a hung state:
a) The OpsCenter agent stops reporting the node as UP.
b) The agent fails to connect to Cassandra on port 9042 and reports this in agent.log.
c) cqlsh login fails with a timeout.
d) The DSE process keeps running, and nodetool commands still work during this phase.
e) nodetool status does not report the node as down.
f) With WRITE_MEMTABLE_FULL active threads spiking, we tried a manual nodetool flush; the active thread count drops but eventually spikes again.
g) Restarting the OpsCenter agent does not help.
h) Restarting the node turns out to be the only remedy.
Below is the typical error message seen in the logs when this scenario occurs:
ERROR [MessagingService-Incoming-/172.20.27.50] 2019-12-09 14:26:30,442 MessagingService.java:825 - java.util.concurrent.RejectedExecutionException while receiving WRITES.WRITE from /172.20.27.50, caused by: Too many pending remote requests!
ERROR [opsagent.storage-timeouter-0] 2019-12-17 17:06:19,829 No active cassandra connections to write rollups
Environment:
The node is a 16-core VM.
DSE version: 6.0.9
Mode: Search
Heap: 31 GB
No memtable setting is defined in cassandra.yaml; everything is commented out except the allocation type:

# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
# memtable_cleanup_threshold: 0.2
memtable_allocation_type: heap_buffers
# commitlog_total_space_in_mb: 8192
# memtable_flush_writers: 4
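With these lines commented out, the node is running on defaults (memtable space is sized as a fraction of the heap, and flushing competes with the write load). One experiment we are considering is setting the memtable knobs explicitly, roughly as in the sketch below; the specific values are illustrative assumptions for this 16-core / 31 GB heap node, not verified recommendations:

```yaml
# cassandra.yaml sketch -- values are assumptions to experiment with,
# not tuned recommendations for this cluster.
memtable_heap_space_in_mb: 2048            # cap on-heap memtable space
memtable_offheap_space_in_mb: 2048         # allow memtable data off-heap
memtable_allocation_type: offheap_buffers  # move cell buffers off the heap
memtable_cleanup_threshold: 0.2            # trigger flushes earlier, before memtables fill
memtable_flush_writers: 4                  # more concurrent flush threads to drain memtables
```

The intent is to flush earlier and faster so writes do not back up into WRITE_MEMTABLE_FULL, but we would appreciate confirmation before applying this.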
Could you comment on the likely root cause (RCA) of this behavior and how to go about fixing it?