Hi all, I need your help.
Sometimes one of our Cassandra nodes (version 3.11.2) starts behaving strangely. In particular:
System CPU usage:

IO wait:
During this period, the node's outgoing traffic dropped:
And there were many records like 'Timeout while read-repairing after receiving all 6 data and digest responses' in the logs.
If we look at the usage of mapped memory, we see the following:
The spikes before the anomaly correlate with compaction activity:
In addition, heap usage reached almost 100% (of 32 GB).
As a result, GC was very busy:
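For what it's worth, outside of the dashboards I sometimes sanity-check mapped memory directly on the node with something like the following (a rough sketch; the pgrep pattern and the file filter are assumptions about our setup):

CASSANDRA_PID=$(pgrep -f CassandraDaemon)
# sum the sizes of memory-mapped SSTable files (column 2 of "pmap -x" is Kbytes)
pmap -x "$CASSANDRA_PID" | grep -E 'Data\.db|Index\.db' | awk '{ total += $2 } END { print total, "KB mapped" }'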
In my opinion, the GC activity is the cause of the node's unavailability, but why were heap and mapped memory used so heavily (unfortunately, I wasn't able to get a heap dump)?
From 15:44:40 to 15:44:50 (when the CPU usage dropped), there are only these records in the logs:
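Next time this happens I'll try to capture the heap and GC state before restarting, roughly like this with the standard JDK tools (the PID lookup and the dump path are just examples):

CASSANDRA_PID=$(pgrep -f CassandraDaemon)
# sample GC / heap occupancy every 5 seconds
jstat -gcutil "$CASSANDRA_PID" 5000
# heap dump; note that -dump:live forces a full GC first
jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof "$CASSANDRA_PID"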
15:44:42.499Z host DEBUG [COMMIT-LOG-ALLOCATOR] o.a.c.d.c.AbstractCommitLogSegmentManager - [] No segments in reserve; creating a fresh one
15:44:43.357Z host DEBUG [cluster2-nio-worker-0] com.datastax.driver.core.Connection - [] Connection[/<local_ip>:9042-1, inFlight=0, closed=false] was inactive for 30 seconds, sending heartbeat
15:44:43.358Z host DEBUG [cluster2-nio-worker-0] com.datastax.driver.core.Connection - [] Connection[/<local_ip>:9042-1, inFlight=0, closed=false] heartbeat query succeeded
OCATOR] o.a.c.d.c.AbstractCommitLogSegmentManager - [] No segments in reserve; creating a fresh one
15:44:45.242Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacted (11239242-88c9-11eb-9287-ed8c5d87e5c3) 2 sstables to [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3647-big,] to level=0. 6.954GiB to 5.983GiB (~86% of original) in 413,012ms. Read Throughput = 17.241MiB/s, Write Throughput = 14.833MiB/s, Row Throughput = ~121,513/s. 14,009,430 total partitions merged to 11,084,640. Partition merge counts were {1:8205824, 2:2901803, }
15:44:45.250Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacting (07519311-88ca-11eb-9287-ed8c5d87e5c3) [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3650-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3651-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3649-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3652-big-Data.db:level=0, ]
15:44:46.183Z host INFO [Service Thread] o.a.cassandra.service.GCInspector - [] G1 Young Generation GC in 222ms. G1 Eden Space: 8606711808 -> 0; G1 Old Gen: 15180859000 -> 15167382432; G1 Survivor Space: 218103808 -> 536870912;
15:44:47.841Z host DEBUG [COMMIT-LOG-ALLOCATOR] o.a.c.d.c.AbstractCommitLogSegmentManager - [] No segments in reserve; creating a fresh one
The compaction task 07519311-88ca-11eb-9287-ed8c5d87e5c3 was completed at 15:46:36. After that, the next compaction task was started, and there is no record of it finishing (it wasn't completed by the time of the manual node restart at 16:10):
...
15:46:35.957Z host DEBUG [Native-Transport-Requests-288] o.a.c.service.ResponseResolver - [] Timeout while read-repairing after receiving all 6 data and digest responses
15:46:35.974Z host DEBUG [Native-Transport-Requests-432] o.a.c.service.ResponseResolver - [] Timeout while read-repairing after receiving all 6 data and digest responses
15:46:36.249Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacted (07519311-88ca-11eb-9287-ed8c5d87e5c3) 4 sstables to [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3653-big,] to level=0. 1.204GiB to 1.135GiB (~94% of original) in 76,400ms. Read Throughput = 16.132MiB/s, Write Throughput = 15.208MiB/s, Row Throughput = ~108,350/s. 3,995,772 total partitions merged to 3,379,549. Partition merge counts were {1:2948311, 2:300597, 3:76297, 4:54344, }
15:46:36.298Z host DEBUG [ReadRepairStage:741] o.a.c.service.ResponseResolver - [] Timeout while read-repairing after receiving all 1 data and digest responses
15:46:36.300Z host DEBUG [CompactionExecutor:9] o.a.c.db.compaction.CompactionTask - [] Compacting (497de1d0-88ca-11eb-9287-ed8c5d87e5c3) [/data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3653-big-Data.db:level=0, /data/data/my_table-b4108710c1cc11e6bbea6739e6451df9/mc-3647-big-Data.db:level=0, ]
...
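While the node was in this state I could also have checked the in-flight and pending compactions with the standard nodetool command (shown here for completeness):

nodetool compactionstats -H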
My compaction strategy is LeveledCompactionStrategy.
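For reference, the table's compaction settings can be pulled with cqlsh like this (my_keyspace.my_table is a placeholder, not the real table name):

cqlsh -e "DESCRIBE TABLE my_keyspace.my_table;" | grep compaction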
I am not sure the root cause of the issue is compaction, because in general it finishes fast enough without affecting the node's operation. But then what could be causing this behavior?
Thank you in advance for your help.