In our application logs we are seeing ReadTimeoutException, and at the same time Cassandra reports a mean cross-node dropped latency of 7986 ms. The problem does not persist for long; each episode lasts about 20-30 seconds.
During the same window, the Cassandra logs show long GC pauses of about 7-8 seconds, StatusLogger.java messages mentioning system.batches, and pending and active Native-Transport-Requests (NTR).
We have a 15-node cluster running Cassandra 3.11.2.
Heap size: 31 GB, out of 64 GB of total available memory.
Swappiness is disabled and other production recommended settings are in place.
1. Is there any way to find out which statements are being executed in the batches?
2. What would be the optimal heap size in such a scenario?
3. In the logs we can see that LOCAL_QUORUM is not satisfied for write requests: of the 2 replica responses required, only 1 or 0 were received. We have RF=3. However, no specific node IP is recorded in the logs. Is that because of the batches?
4. How can we gather more information and dig deeper into the problem?
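
As a starting point for question 4, here is a minimal sketch of how the GC pause timestamps could be pulled out of system.log so they can be lined up against the ReadTimeoutException timestamps on the application side. It assumes the GCInspector lines follow the default logback layout of Cassandra 3.11 (timestamp, then `GCInspector.java:<line> - <collector> GC in <N>ms.`); the function name `find_long_gc_pauses` and the 1000 ms threshold are only illustrative, and the pattern would need adjusting if the logging configuration has been customized.

```python
import re

# Assumed line shape (default logback pattern, not copied from our real logs):
#   WARN  [Service Thread] 2018-08-10 10:15:32,123 GCInspector.java:282 - G1 Old Generation GC in 7986ms.  ...
GC_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"GCInspector\.java:\d+ - (?P<collector>.+?) GC in (?P<ms>\d+)ms"
)


def find_long_gc_pauses(lines, threshold_ms=1000):
    """Return (timestamp, collector, pause_ms) for every GC pause at or above threshold_ms."""
    pauses = []
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue  # not a GCInspector line (or a different log layout)
        pause_ms = int(match.group("ms"))
        if pause_ms >= threshold_ms:
            pauses.append((match.group("ts"), match.group("collector"), pause_ms))
    return pauses
```

It could be run per node with something like `find_long_gc_pauses(open("system.log"))`, where the path is a placeholder for wherever the node writes its log.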
Thank you.