We have a 6-node cluster. One node became unresponsive due to a disk I/O issue, and suddenly I could see a CPU spike on all the other nodes: the Cassandra java process was using around 2983% of a single CPU.
top - 14:16:32 up 470 days, 17:47, 17 users,  load average: 0.76, 0.91, 0.97
Tasks: 622 total,   1 running, 621 sleeping,   0 stopped,   0 zombie
%Cpu(s): 68.6 us,  0.8 sy,  0.0 ni, 30.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 19778848+total, 16743604 free, 47336320 used, 13370856+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13870702+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
110807 cassand+  20   0  231.0g  83.6g  44.5g S  2934 44.3  17338:05 java
system.log-----
ERROR [ReadRepairStage:30637] 2020-02-27 14:16:22,153 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:30637,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:202) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:175) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:92) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:79) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_221]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_221]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_221]
INFO  [HintsDispatcher:7] 2020-02-27 14:16:36,187 HintsStore.java:126 - Deleted hint file 27921b23-6bc0-48b4-bf6f-898aaabb6bc0-1582793184685-1.hints
INFO  [HintsDispatcher:7] 2020-02-27 14:16:36,187 HintsDispatchExecutor.java:282 - Finished hinted handoff of file 27921b23-6bc0-48b4-bf6f-898aaabb6bc0-1582793184685-1.hints to endpoint /BADNODE-IP: 27921b23-6bc0-48b4-bf6f-898aaabb6bc0
INFO  [HintsDispatcher:7] 2020-02-27 14:16:36,550 HintsStore.java:126 - Deleted hint file 27921b23-6bc0-48b4-bf6f-898aaabb6bc0-1582793194687-1.hints
INFO  [HintsDispatcher:7] 2020-02-27 14:16:36,550 HintsDispatchExecutor.java:282 - Finished hinted handoff of file 27921b23-6bc0-48b4-bf6f-898aaabb6bc0-1582793194687-1.hints to endpoint /BADNODE-IP: 27921b23-6bc0-48b4-bf6f-898aaabb6bc0
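If it helps, I can try to map the hot threads to names like ReadRepairStage or HintsDispatcher. This is only a rough sketch of what I would run: PID 110807 is the Cassandra java process from the top output above, and the TID/hex values are placeholders for whatever top reports.

# per-thread CPU usage for the Cassandra JVM (PID taken from top above)
top -H -p 110807

# convert the hottest thread id (TID column) to hex, then find it in a thread dump (nid= is hex)
printf '%x\n' <TID-from-top>
jstack 110807 | grep -A 10 -i "nid=0x<hex-tid>"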
Could the CPU on the other nodes have been so high because of hinted handoff work?
We could not figure out why the other nodes were misbehaving.
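If it would help narrow down whether it is read repair or hint dispatch, I can also collect the following from the busy nodes (standard nodetool commands, nothing specific to our setup):

# per-pool activity: look for Pending/Blocked on ReadRepairStage and HintsDispatcher
nodetool tpstats

# read repair attempt counters and network/hint traffic
nodetool netstats

# whether hinted handoff is currently running
nodetool statushandoff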