I have a Cassandra cluster (version 3.11.2) spanning 4 datacenters.
Sometimes the nodes in each DC suddenly become very busy: there are a lot of read-repair operations and a lot of succeeded hints.
The metrics that spike (a sketch of how they can be polled over JMX follows the list):
- org.apache.cassandra.metrics.HintsService.HintsSucceeded.count
- org.apache.cassandra.metrics.ReadRepair.RepairedBlocking.count
- org.apache.cassandra.metrics.ThreadPools.TotalBlockedTasks.transport.Native-Transport-Requests.count
- org.apache.cassandra.metrics.ThreadPools.PendingTasks.request.MutationStage.value
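For reference, these metric names correspond to Cassandra's JMX MBeans. Below is a minimal sketch of polling them directly over JMX; it assumes the default unauthenticated JMX port 7199 and the standard MBean object names, which may differ in another setup, and the `MetricsProbe` class name is just for illustration:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MetricsProbe {
    public static void main(String[] args) throws Exception {
        // Assumes JMX is reachable on the default Cassandra JMX port 7199 without authentication.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();

            // Meters and counters expose a "Count" attribute, gauges expose "Value".
            System.out.println("HintsSucceeded = " + mbs.getAttribute(new ObjectName(
                    "org.apache.cassandra.metrics:type=HintsService,name=HintsSucceeded"), "Count"));
            System.out.println("RepairedBlocking = " + mbs.getAttribute(new ObjectName(
                    "org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBlocking"), "Count"));
            System.out.println("NTR TotalBlockedTasks = " + mbs.getAttribute(new ObjectName(
                    "org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=TotalBlockedTasks"), "Count"));
            System.out.println("MutationStage PendingTasks = " + mbs.getAttribute(new ObjectName(
                    "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=MutationStage,name=PendingTasks"), "Value"));
        }
    }
}
```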
This lasts for about an hour. During that time, the nodes lose connections with each other, and the driver even drops its connection to the local node itself:
[cluster3-timeouter-0] com.datastax.driver.core.Host.STATES - [] Defuncting Connection[/{{local_ip}}:9042-1, inFlight=0, closed=false]
com.datastax.driver.core.exceptions.ConnectionException: [/{{local_ip}}:9042] Heartbeat query timed out
    at com.datastax.driver.core.Connection$11.onTimeout(Connection.java:1191)
    at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1380)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:625)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:700)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:428)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
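As far as I understand, the heartbeat in that log is the Java driver's idle-connection keepalive, and it is bounded by the driver's per-request read timeout. For context, this is roughly where those knobs live in the DataStax Java driver 3.x; this is only a sketch with the documented default values, not my actual client code:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.SocketOptions;

public class ClusterFactory {
    public static Cluster build(String contactPoint) {
        return Cluster.builder()
                .addContactPoint(contactPoint)
                // A heartbeat query is sent on idle connections; 30 s is the documented default.
                .withPoolingOptions(new PoolingOptions()
                        .setHeartbeatIntervalSeconds(30))
                // The heartbeat is subject to the read timeout (12 s by default); when it
                // expires, the connection is defuncted as in the log above.
                .withSocketOptions(new SocketOptions()
                        .setReadTimeoutMillis(12000))
                .build();
    }
}
```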
I don't understand why this happens. It doesn't look like compaction, because there are no anomalies in the compaction metrics. Does anyone have any ideas about what could cause this?