Hello Team,
There are 5 nodes currently in a DSE Search cluster.
The following thread pool statistics are observed on one particular node:
Pool Name                     Active   Pending (w/Backpressure)   Delayed   Completed   Blocked   All time blocked
TPC/all/WRITE_MEMTABLE_FULL   1928     N/A (N/A)                  N/A       1928        N/A       N/A
TPC/all/WRITE_REMOTE          0        12218 (N/A)                N/A       13289068    N/A       0
Whenever the above pattern is seen (WRITE_MEMTABLE_FULL active threads spike and WRITE_REMOTE pending threads spike at the same time), the node often enters a hung state.
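To make the pattern concrete, here is a minimal Python sketch of how it could be detected from thread pool output like the one above. The function names and both thresholds are hypothetical and chosen only for illustration; the parser assumes whitespace-separated columns in the order shown above and treats N/A as 0.

```python
# Sample lines copied from the thread pool statistics in this post.
SAMPLE = """\
TPC/all/WRITE_MEMTABLE_FULL 1928 N/A (N/A) N/A 1928 N/A N/A
TPC/all/WRITE_REMOTE 0 12218 (N/A) N/A 13289068 N/A 0
"""

# Hypothetical thresholds, not DSE defaults; tune them for your cluster.
MEMTABLE_FULL_ACTIVE_LIMIT = 1000
WRITE_REMOTE_PENDING_LIMIT = 10000


def to_int(value):
    """Convert a column to int, treating non-numeric values such as N/A as 0."""
    return int(value) if value.isdigit() else 0


def parse_tpstats(text):
    """Return {pool_name: (active, pending)} for lines that start with TPC/."""
    pools = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("TPC/"):
            continue
        pools[fields[0]] = (to_int(fields[1]), to_int(fields[2]))
    return pools


def matches_hang_pattern(pools):
    """True when both spikes described in this post are present together."""
    active, _ = pools.get("TPC/all/WRITE_MEMTABLE_FULL", (0, 0))
    _, pending = pools.get("TPC/all/WRITE_REMOTE", (0, 0))
    return (active >= MEMTABLE_FULL_ACTIVE_LIMIT
            and pending >= WRITE_REMOTE_PENDING_LIMIT)


if __name__ == "__main__":
    print(matches_hang_pattern(parse_tpstats(SAMPLE)))
```

A check like this could be run periodically against captured output to raise an alert before the node stops accepting connections on port 9042.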
Symptoms:
a) The OpsCenter agent stops reporting the node as UP
b) Agent fails to access Cassandra over port 9042 and reports the same in agent.log
c) cqlsh login fails with a timeout
d) The DSE process keeps running. Nodetool commands do work during this phase
e) nodetool status doesn't report the node as down
f) With WRITE_MEMTABLE_FULL active threads spiking, we tried running a manual nodetool flush. The thread count goes down but eventually spikes again.
g) Restarting the OpsCenter agent doesn't help
h) A node restart turns out to be the only solution.
Below are the typical error messages encountered in the logs when this scenario occurs:
In system.log
ERROR [MessagingService-Incoming-/172.20.27.50] 2019-12-09 14:26:30,442 MessagingService.java:825 - java.util.concurrent.RejectedExecutionException while receiving WRITES.WRITE from /172.20.27.50, caused by: Too many pending remote requests!
In agent.log
ERROR [opsagent.storage-timeouter-0] 2019-12-17 17:06:19,829 No active cassandra connections to write rollups
The node is a 16 core VM
DSE version: 6.0.9
Mode : Search
Memory: 110GB
HEAP : 31 GB
No memtable sizing settings are explicitly defined in cassandra.yaml (they are commented out); only memtable_allocation_type is set:
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
# memtable_cleanup_threshold: 0.2
memtable_allocation_type: heap_buffers
# commitlog_total_space_in_mb: 8192
# memtable_flush_writers: 4
Could you please comment on what the root cause of this might be, and how I should go about fixing it?