tengel asked

Slow startup and hints pile up

  • Cassandra 3.11.6 on Windows 10.
  • 5-node cluster, all in one DC with RF=5, though the nodes are spread across the world.
  • Writes at either LOCAL_ONE or LOCAL_QUORUM.
  • Known intermittent network issues.
  • Database load is about 82 GiB.
  • All our tables have gc_grace_seconds = 864000 (10 days).
  • max_hint_window_in_ms = 3 hours
  • max_hint_file_size_in_mb = 128
  • max_hints_delivery_threads = 2
  • write_request_timeout_in_ms = 20000
  • Repair runs once a week, successfully.
  • We use TWCS on some tables and purge old data ourselves: we inspect max timestamps with our own sstablemetadata tooling, then stop Cassandra and delete the expired SSTables from disk, which requires a rolling restart of Cassandra on each node. (TTL-based expiry is not an option for us.)
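
As a sketch of the purge workflow in the last bullet (the sample `Maximum timestamp:` line and its microsecond epoch are assumptions about `sstablemetadata` output in 3.11, and the path is made up; verify against your own tooling):

```python
from datetime import datetime, timezone

# Hypothetical sstablemetadata output fragment; the "Maximum timestamp" line
# (microseconds since epoch) is an assumption about the 3.11 tool's format.
sample_output = """\
SSTable: /data/ks/tbl-abc123/md-42-big
Minimum timestamp: 1618444800000000
Maximum timestamp: 1618531200000000
"""

def max_timestamp(metadata_text):
    """Extract the max write timestamp (microseconds) from sstablemetadata output."""
    for line in metadata_text.splitlines():
        if line.startswith("Maximum timestamp:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Maximum timestamp' line found")

# SSTables whose newest write is older than the cutoff are deletion candidates.
cutoff = datetime(2021, 9, 1, tzinfo=timezone.utc)
ts = datetime.fromtimestamp(max_timestamp(sample_output) / 1e6, tz=timezone.utc)
print(ts.date(), "delete candidate" if ts < cutoff else "keep")
```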

Startup time seems to increase every time Cassandra is restarted on one of the nodes. The pause always occurs just before the "Initializing index summary" log line. We suspect the number of hint files, or perhaps corrupt hint files. We understand the root cause is the network problems and/or overloaded nodes, which are currently beyond our control. My questions:

1) Can an accumulation of hints files be the root cause of the pause at the point in startup shown in the logs below? If not, what could be?

INFO [main] 2021-09-16 19:13:42,708 - Preloaded 447 prepared statements
INFO [main] 2021-09-16 19:13:42,708 - Cassandra version: 3.11.6
INFO [main] 2021-09-16 19:13:42,708 - Thrift API version: 20.1.0
INFO [main] 2021-09-16 19:13:42,709 - CQL supported versions: 3.4.4 (default: 3.4.4)
INFO [main] 2021-09-16 19:13:42,709 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
INFO [main] 2021-09-16 19:51:32,688 - Initializing index summary

2) Cassandra logs show many hint files being successfully sent (HintsDispatcherExecutor logs for finished and HintsStore for deleted). In addition, thousands of messages are dropped due to a high "mean cross-node dropped latency", which ranges from seconds to 2 minutes. Would increasing max_hints_delivery_threads possibly help with the drop latency? Is the drop due to a long queue time?

3) One hint file we have identified is dated April 16, 2021 (from the timestamp in its uuid-timestamp-version filename) and has been sent 91,061 times, despite other logs showing some hints finishing successfully to this node. Could this file be corrupt?
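
For reference, the date can be recovered from the filename directly; a sketch assuming the 3.x naming scheme `<host-uuid>-<ms-epoch>-<version>.hints` (the UUID and timestamp below are made up for illustration):

```python
import uuid
from datetime import datetime, timezone

def parse_hint_filename(name):
    """Split a hint filename of the (assumed) form <host-uuid>-<ms-epoch>-<version>.hints."""
    stem = name.rsplit(".", 1)[0]
    # The host UUID itself contains hyphens, so split off the last two fields only.
    host_part, ts_part, version = stem.rsplit("-", 2)
    created = datetime.fromtimestamp(int(ts_part) / 1000, tz=timezone.utc)
    return uuid.UUID(host_part), created, int(version)

host, created, version = parse_hint_filename(
    "5762b140-9e94-11eb-a8b3-0242ac130003-1618560000000-2.hints")
print(host, created.date(), version)  # created 2021-04-16
```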

4) When a hint file is delivered "partially", does that mean that each piece is max_hint_file_size_in_mb? We have some files logged as "Finished hinted handoff of file ..., partially" and broken into 20 such messages for the same filename before completing.
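
A toy model of what "partially" suggests (this illustrates offset-based resume in general, not Cassandra's actual dispatcher code): the dispatcher stops when the target stops responding and resumes from where it left off, so one file can produce many "partially" log lines before it completes.

```python
def dispatch(hints, send, start=0):
    """Try to deliver hints[start:]; return (offset reached, fully finished?)."""
    offset = start
    for hint in hints[start:]:
        if not send(hint):        # target unresponsive: stop and remember position
            return offset, False  # logged as "Finished ... partially"
        offset += 1
    return offset, True           # file fully delivered; safe to delete

# Simulate a target that accepts 3 hints, times out once, then recovers.
hints = list(range(5))
flaky = iter([True, True, True, False])
offset1, done1 = dispatch(hints, lambda h: next(flaky, True))
print(offset1, done1)             # 3 False -- partial; next attempt resumes at 3
offset2, done2 = dispatch(hints, lambda h: True, start=offset1)
print(offset2, done2)             # 5 True -- file complete
```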

Thanks for your time!

hinted handoff
1 comment

tengel commented

Here is some additional information. Notice how other message types are successful.
nodetool tpstats:



1 Answer

Erick Ramirez answered
  1. It isn't a "pause" -- the components are just taking their time to start up, but they don't log every single step of the process, so you see a bit of a gap.
    Based on the log output you posted, the startup spent a lot of time in the initialisation of StorageProxy (see code here). StorageProxy hands off hints when it gets initialised, so if there are lots of hints on the node, that would explain why it took a while (although 38 seconds is really not that big of a deal).
  2. The problem is not the delivery threads. Messages get dropped because the receiving node(s) are unresponsive or overloaded. No amount of tuning will fix that. The only solution to overloaded nodes is to increase the capacity of the cluster by adding more nodes.
  3. Only you would be able to figure that out by reviewing the logs on the nodes.
  4. It means that hints were partially handed off to the respective replicas. In other words, some hints could not be sent to the target replicas because they were unresponsive or overloaded.

The conclusion from all this is that you need to address the root cause -- that the cluster is overloaded. If the cluster were sized correctly, (a) nodes would not be dropping messages, (b) nodes would be responsive and not drop mutations, and (c) there wouldn't be hints stored on coordinators. Cheers!

5 comments

tengel commented

Thanks for your response!

19:13:42 -> 19:51:32 is 38 minutes, not seconds. After researching the Cassandra code and logs, we have learned that these 4/16 hints are likely blocking other hints to these nodes, so we are working on getting a count.
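
The gap is easy to verify from the log timestamps themselves:

```python
from datetime import datetime

# Timestamp format used by the Cassandra system.log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S,%f"

before = datetime.strptime("2021-09-16 19:13:42,709", FMT)  # last line before the gap
after = datetime.strptime("2021-09-16 19:51:32,688", FMT)   # "Initializing index summary"

gap = after - before
print(gap)  # roughly 38 minutes, not 38 seconds
```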

The weekly repair is successful, and even though the three 4/16 hint files destined for three different nodes are not making it to the remote nodes, the 4/18 repair did successfully stream data to those nodes (stream_throughput_outbound_megabits_per_sec = 200). We noticed hints delivery succeeds when only one or two nodes need hints. We are considering increasing hinted_handoff_throttle_in_kb from the default 1024 KB to 100 MB and max_hints_delivery_threads to 5, knowing Cassandra will use at most 1 thread per node. We are also considering stopping Cassandra and deleting hint files older than 2 weeks, knowing Cassandra will move past the FileNotFoundException and the data is there from repair.
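
A back-of-envelope on why the throttle matters (the 10 GiB backlog figure is hypothetical, and the proportional-to-cluster-size reduction is taken from the cassandra.yaml comment for hinted_handoff_throttle_in_kb; verify before relying on it):

```python
def drain_hours(backlog_mb, throttle_kb=1024, nodes=5):
    """Rough hours to drain a hints backlog at the configured throttle.

    Assumes the yaml comment's semantics: throttle is KB/s per delivery
    thread, reduced proportionally to the number of nodes in the cluster.
    """
    rate_kb_s = throttle_kb / nodes          # assumed proportional reduction
    return backlog_mb * 1024 / rate_kb_s / 3600

for throttle in (1024, 100 * 1024):          # default vs. the proposed 100 MB
    hours = drain_hours(10 * 1024, throttle)  # hypothetical 10 GiB backlog
    print(f"{throttle} KB/s -> {hours:.1f} h per 10 GiB backlog")
```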

Erick Ramirez commented

You're right, my bad on the duration.

You really can't tune your cluster out of an overload. The point is if the nodes are overloaded and are not responsive, coordinators will store hints for them. Tuning hints delivery doesn't solve the problem around nodes being overloaded. Cheers!

tengel commented
Sorry, I ran out of characters. The fact that the nodes can receive and process repair streams but not hint deliveries begs the question: do these old hints take more resources to process than is being allocated? Our tables are TWCS, partitioned by day. When a node receives a hint file partially, does it process it to completion?
Erick Ramirez commented
Hints are mutations getting replayed to replicas that missed them, and are completely different from streams. Repairs stream parts of SSTables which contain the data that is missing from replicas.

When you need to run repairs on TWCS, it's horribly bad because the repaired data is no longer in the correct time window. You really need to address the fact that your cluster is overloaded, or you'll forever be applying band-aids which do nothing. Cheers!

tengel commented

Our tables all have the default gc_grace_seconds of 10 days. Looking at the HintVerbHandler and the Hint applyFuture, it seems isLive should return false and complete fairly quickly, as there is no need to apply the mutations from April 16, 2021. My understanding is that the HintsMessage is droppable but is subject to write_request_timeout_in_ms, which we have set to 20 seconds. Again, since our nodes are able to handle reads, writes, and repairs, could the inability to process these old hint files be due to thread priority? I see that the HintsDispatcherExecutor runs at min_priority, but I can't tell what the receiving end's priority is. We do see some dropped messages, but not as many as we see partially finished. Is this the justification for your statement that no amount of tuning will fix the failed hints delivery due to an overloaded cluster?
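
A simplified sketch of the liveness rule being discussed (a hint older than its mutation's smallest gc_grace_seconds is considered dead and skipped; the real 3.11 check lives in the Hint class and is more involved, e.g. it also considers the hint's own TTL):

```python
import time

def hint_is_live(creation_time_ms, gc_grace_seconds, now_ms=None):
    """True while the hint is still worth applying: creation + gc grace > now."""
    now_ms = time.time() * 1000 if now_ms is None else now_ms
    return creation_time_ms + gc_grace_seconds * 1000 > now_ms

april_16_ms = 1618531200000                  # 2021-04-16 UTC
sept_16_ms = 1631750400000                   # 2021-09-16 UTC
# With the default 10-day gc_grace_seconds, an April hint replayed in
# September is long dead and should be skipped rather than applied.
print(hint_is_live(april_16_ms, 864_000, sept_16_ms))  # False
```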
