ddkpham asked ddkpham commented

What sort of issues may occur when hinted handoff is disabled?

I've been trying to understand the impact of disabling hinted handoff in a Cassandra cluster. I'm researching this because I am operating a Cassandra cluster (perhaps terribly) where all writes use consistency level QUORUM. We are running into a bunch of issues around corrupted hint files, and I'm wondering if disabling hinted handoff is an acceptable approach (considering all our writes are QUORUM). I can only find a single post on the issue.

https://stackoverflow.com/questions/43445671/whats-the-point-of-using-hinted-handoff-in-cassandra-especially-for-consistenc


This seems to suggest that I should be okay to disable hinted handoff given how my cluster is used, provided I am okay with sacrificing a little write throughput. If this is a terrible idea, I may have to write a program that automates the detection and subsequent deletion of corrupted hint files.

hinted handoff

1 Answer

steve.lacerda answered ddkpham commented

Hi! If you disable hinted handoff and a node goes offline, you will have no recourse but to repair the node when you bring it back online. The reason is simple: hinted handoff stores the writes destined for a down node for as long as the hint window (max_hint_window_in_ms, 3 hours by default) allows, so if you bring the node back online within that window, the writes get replayed to it. With hinted handoff disabled, or with a hint window of 0, there are no hints to replay, so you will have to repair the node when it comes back online, which can be time-consuming and puts more load on your cluster.
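As a rough illustration of that rule, here is a minimal Python sketch that decides whether a returning node can rely on hint replay or needs a repair. It assumes the hint window is the max_hint_window_in_ms setting from cassandra.yaml (renamed max_hint_window in newer releases) with its 3-hour default. The function name and arguments are made up for this example and are not part of any Cassandra tooling.

```python
# Assumed default hint window: 3 hours, expressed in milliseconds.
DEFAULT_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


def needs_repair(downtime_ms, hints_enabled=True,
                 hint_window_ms=DEFAULT_HINT_WINDOW_MS):
    """Return True if a node that was down for downtime_ms should be repaired.

    Hypothetical helper: it only encodes the rule of thumb from the answer.
    """
    if not hints_enabled or hint_window_ms <= 0:
        # No hints were stored, so any downtime means missed writes.
        return downtime_ms > 0
    # Hints only cover the window; longer outages leave a gap.
    return downtime_ms > hint_window_ms


if __name__ == "__main__":
    ten_minutes = 10 * 60 * 1000
    print(needs_repair(ten_minutes))                       # hints cover it
    print(needs_repair(ten_minutes, hints_enabled=False))  # repair needed
```

Note that this ignores cases where hints themselves are lost or corrupted, which is exactly the situation described in the question; in that case a repair is the safe choice regardless of downtime.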

With that said, the issue is probably not your hints. In these types of situations, the problem is almost always an overloaded or misconfigured cluster. I would create a ticket with DataStax support if you have an issue. Also, if the hints are getting corrupted, it might be a defect, in which case I would look at upgrading, or it could be an issue with the node shutdown procedure (always run nodetool drain before shutting a node down).
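To make the shutdown order concrete, here is a small Python sketch that runs nodetool drain before stopping the service. The systemd unit name "cassandra" and the use of systemctl are assumptions for this example (your install may use a different service name or init system), and the helper names are hypothetical.

```python
import subprocess


def shutdown_steps(service="cassandra"):
    """Commands to stop a node cleanly, in order: drain first, then stop.

    The service name is an assumption; adjust it for your installation.
    """
    return [
        ["nodetool", "drain"],          # flush memtables, stop accepting writes
        ["systemctl", "stop", service],  # only then stop the process
    ]


def stop_node(service="cassandra", dry_run=True):
    """Run (or, with dry_run=True, just print) the shutdown steps."""
    steps = shutdown_steps(service)
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts before the stop if the drain fails.
            subprocess.run(cmd, check=True)
    return steps


if __name__ == "__main__":
    stop_node(dry_run=True)
```

The dry-run default is deliberate, so the script can be inspected safely before it is pointed at a live node.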

3 comments

Hi Steve! Thanks for providing some insight. I have some follow-up questions. Since our cluster uses only QUORUM requests (i.e., the majority of replicas have to respond to the request), would the inconsistency effects of turning hinted handoff off be mostly mitigated? Our reads also require QUORUM, so QUORUM plus read repair should save us from any consistency issues. In addition, we run repairs on the cluster every couple of days. In terms of data being out of sync, it seems like we should be okay? To me, in our situation at least, turning hinted handoff off seems like it would mostly impact write throughput.
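The reasoning behind "QUORUM writes plus QUORUM reads" can be checked with a little arithmetic: a read is guaranteed to touch at least one replica that acknowledged the write whenever the number of replicas written plus the number of replicas read exceeds the replication factor. The Python sketch below is only that arithmetic; the function names are invented for this example.

```python
def quorum(rf):
    """Replicas needed for QUORUM at replication factor rf."""
    return rf // 2 + 1


def read_overlaps_write(rf, write_replicas, read_replicas):
    """True if every read set must share a replica with every write set."""
    return write_replicas + read_replicas > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                               # 2
    print(read_overlaps_write(rf, q, q))   # True: QUORUM write + QUORUM read
    print(read_overlaps_write(rf, 1, q))   # False: ONE write + QUORUM read
```

This guarantee applies to writes that were successfully acknowledged at QUORUM. A write that timed out or failed may still have landed on some replicas, so it can show up in later reads even though the client was told it did not succeed.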


Interesting! How would an overloaded or misconfigured cluster cause corrupted hint files?


With quorum you'd be relying on read repair. If a partition is never read, it is never repaired, so you could still have inconsistent replicas and would need a repair to fix that data. However, if you are running regular repairs, that should be fine so long as your gc_grace_seconds is longer than your repair window, i.e. every replica gets repaired at least once within gc_grace_seconds.
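The gc_grace_seconds condition matters because tombstones become eligible for removal after gc_grace_seconds; if a replica missed a delete and is not repaired before then, the deleted data can reappear. A quick sanity check of a repair schedule might look like the Python sketch below. It assumes the default gc_grace_seconds of 864000 (10 days); the function name and the example durations are hypothetical.

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # assumed table default: 10 days


def repair_schedule_is_safe(repair_interval_s, repair_duration_s,
                            gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True if a full repair always completes within gc_grace_seconds.

    Worst case: a write is missed just after a repair starts, so it waits
    one full interval plus the duration of the next repair to be fixed.
    """
    return repair_interval_s + repair_duration_s < gc_grace_seconds


if __name__ == "__main__":
    day = 24 * 60 * 60
    # Repairs every 2 days, each taking about 6 hours (example numbers).
    print(repair_schedule_is_safe(2 * day, 6 * 60 * 60))
    # Repairs every 10 days with the default grace period leave no margin.
    print(repair_schedule_is_safe(10 * day, 6 * 60 * 60))
```

With repairs every couple of days, as described in the comment above, there is a comfortable margin under the default setting, unless gc_grace_seconds has been lowered on some tables.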

The reason I say that about the misconfiguration or overload is that hints are a byproduct of nodes flapping or nodes being down. So, in a normal well-tuned system you shouldn't see hints at all, or at least rarely, unless a node has gone down or there's something wrong with a node.


I see. Would my clients ever see any inconsistency issues? Sure, the replicas may be out of sync with each other, but on a read (when data consistency really matters for me) we will resolve any out-of-sync issues. Or is there some situation or edge case that I'm missing where the data my client reads may actually be stale?


Interesting. What kinds of misconfiguration have you seen cause node flapping in your experience? Any network-related settings? Or is there a set of common parameters you've seen to be the culprit?
