we have a 21 node Cassandra ring over 3 datacenters. Version 3.11.5. CentOs. on premise. And every once in a while Repair threads start hanging forever on node and blocking other repairs on different nodes in a later timeframe of the same keyspace. We do full and PR repairs on all 21 nodes. And have a schedule so they should not interfere with each other, but if one starts hanging then they will.
What could be causing this? That is the first node with a hanging thread. What should we investigate?
some Grafana/Prometheus output of the first node hanging