- Cassandra 3.11.6 on Win10.
- 5-node cluster, all in one DC with RF=5, even though the nodes are spread across the world.
- Writes at either LocalOne or LocalQuorum.
- Known intermittent network issues.
- Database load is about 82 GiB.
- All our tables use gc_grace_seconds = 864000 (10 days).
- max_hint_window_in_ms = 10800000 (3 hours)
- max_hints_file_size_in_mb = 128
- max_hints_delivery_threads = 2
- write_request_timeout_in_ms = 20000
- Repair runs successfully once a week.
- We use TWCS on some tables. To purge old data, we inspect each SSTable's maximum timestamp with sstablemetadata, then stop Cassandra and delete the old SSTables from disk, which requires a rolling restart of Cassandra on each node. (Expiring the data is not an option for us.)
Startup time seems to increase every time Cassandra is restarted on one of the nodes. The pause always occurs just before "Initializing index summary". We suspect the number of hint files, or perhaps corrupt hint files. We understand that the underlying cause is the network problems and/or overloaded nodes, which are currently beyond our control. My questions:
1) Can an accumulation of hints files be the root cause of the pause at the point in startup shown in the logs below? If not, what could be?
    INFO [main] 2021-09-16 19:13:42,708 QueryProcessor.java:163 - Preloaded 447 prepared statements
    INFO [main] 2021-09-16 19:13:42,708 StorageService.java:657 - Cassandra version: 3.11.6
    INFO [main] 2021-09-16 19:13:42,708 StorageService.java:658 - Thrift API version: 20.1.0
    INFO [main] 2021-09-16 19:13:42,709 StorageService.java:659 - CQL supported versions: 3.4.4 (default: 3.4.4)
    INFO [main] 2021-09-16 19:13:42,709 StorageService.java:661 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
    INFO [main] 2021-09-16 19:51:32,688 IndexSummaryManager.java:87 - Initializing index summary
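To make the size of the pause explicit, here is a minimal Python sketch that finds the largest gap between consecutive timestamped log lines. It assumes the timestamp layout seen in the excerpt above (`YYYY-MM-DD HH:MM:SS,mmm`); the function name `largest_gap` and the `SAMPLE` list are only illustrative and are not part of any Cassandra tooling.

```python
import re
from datetime import datetime

# Timestamp as it appears in the log excerpt, e.g. "2021-09-16 19:13:42,708".
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the longest pause,
    or None if fewer than two timestamped lines are found."""
    best = None
    prev_ts = prev_line = None
    for line in lines:
        match = TS_RE.search(line)
        if not match:
            continue  # skip stack traces and other lines without a timestamp
        ts = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f")
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best


# Three lines copied from the excerpt in this post.
SAMPLE = [
    "INFO [main] 2021-09-16 19:13:42,708 QueryProcessor.java:163 - Preloaded 447 prepared statements",
    "INFO [main] 2021-09-16 19:13:42,709 StorageService.java:661 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)",
    "INFO [main] 2021-09-16 19:51:32,688 IndexSummaryManager.java:87 - Initializing index summary",
]

if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE)
    print(f"{gap:.3f} s between:\n  {before}\n  {after}")
```

For the excerpt above, the gap is roughly 37 minutes 50 seconds, all of it between the last StorageService line and the IndexSummaryManager line.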
2) The Cassandra logs show that many hint files are sent successfully (HintsDispatchExecutor logs for finished files and HintsStore logs for deleted ones). In addition, thousands of messages are dropped with a high "mean cross-node dropped latency", which ranges from seconds to 2 minutes. Would increasing max_hints_delivery_threads help reduce these drops? Are the drops caused by long queue times?
3) One hint file we have identified is dated April 16, 2021 (going by its name, uuid-timestamp-version) and has been sent 91,061 times, even though other log entries show some hint files to this node finishing successfully. Could this file be corrupt?
4) When a hint file is delivered partially, does that mean each piece is max_hints_file_size_in_mb in size? For some files we see "Finished hinted handoff of file ..., partially" logged 20 times for the same file name before the handoff completes.
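Regarding the hint file age in question 3, this is a rough Python sketch of how the hint files on a node could be grouped by target host to spot a backlog or an unusually old file. It assumes the file name layout mentioned above (`uuid-timestamp-version` with a `.hints` extension) and that the timestamp is epoch milliseconds. The function names and the example file name are hypothetical, and the hints directory path has to be supplied by the caller.

```python
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def parse_hint_name(filename):
    """Split '<host-uuid>-<epoch-millis>-<version>.hints' into its parts.

    The UUID itself contains hyphens, so split from the right."""
    stem = filename[: -len(".hints")] if filename.endswith(".hints") else filename
    host_id, millis, version = stem.rsplit("-", 2)
    created = datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)
    return host_id, created, int(version)


def summarize_hints(hints_dir):
    """Per target host: number of hint files, total bytes, oldest creation time."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "oldest": None})
    for path in Path(hints_dir).glob("*.hints"):
        host_id, created, _version = parse_hint_name(path.name)
        entry = summary[host_id]
        entry["files"] += 1
        entry["bytes"] += path.stat().st_size
        if entry["oldest"] is None or created < entry["oldest"]:
            entry["oldest"] = created
    return dict(summary)


if __name__ == "__main__":
    # Hypothetical file name, made up for illustration only.
    example = "0b0c5a1e-3f4d-4c6b-9a7e-2d1f8e9c0a11-1618531200000-1.hints"
    host, created, version = parse_hint_name(example)
    print(host, created.isoformat(), version)
```

Running `summarize_hints` against each node's hints directory before a restart would show whether the startup delay grows together with the number or total size of hint files, and which target host the oldest file belongs to.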
Thanks for your time!