We have a 12-node Cassandra cluster in production, along with 12 nodes for DR.
When we run a full repair on any one of the nodes, that node goes down with status DN and drops out of the cluster.
We also observe high GC pauses while the repair is running. We have tried tuning the GC parameters, but with no luck.
What else might be causing the node to go down?
As additional information, when we run a full repair within the local DC for each individual table, it completes without failure.
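To make the per-table approach concrete, here is a minimal sketch of how such a run could be generated. It assumes the standard `nodetool repair` options `--full` and `--in-local-dc`; the helper name `build_repair_commands` and the keyspace and table names are hypothetical placeholders, not our actual schema or tooling.

```python
from typing import Iterable, List


def build_repair_commands(keyspace: str, tables: Iterable[str]) -> List[List[str]]:
    """Return one nodetool argv per table: a full repair limited to the local DC."""
    return [
        ["nodetool", "repair", "--full", "--in-local-dc", keyspace, table]
        for table in tables
    ]


if __name__ == "__main__":
    # Hypothetical keyspace and table names, for illustration only.
    for argv in build_repair_commands("my_keyspace", ["users", "events"]):
        # Print the command instead of executing it, so the sketch is safe to run anywhere.
        print(" ".join(argv))
```

Each generated command repairs a single table, so only that table's data is validated and streamed at a time, rather than the whole node in one session.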
Please find the details below for reference:
- Memory allocated to the heap: 32 GB
- Total RAM on each server: 128 GB
- Total number of cores on each server: 16 (threads per core: 1)
- Garbage collector (G1 GC or CMS): CMS
- Cassandra version: 3.11.2
- Node density (average size of data on each node): 400 GB