Bringing together the Apache Cassandra experts from the community and DataStax.

gpss1979 asked · gpss1979 commented

Java Driver Retry Policy

Hi All,

I've been experiencing some issues with the Java driver and I'm pretty sure someone here can help out. :-)

We're running Apache Cassandra 3.11.4. Whenever a node crashes on a disk failure (SSD), the other Cassandra nodes detect that it is down, so from the C* perspective all is fine. However, the Java driver still "thinks" it is connected to the failed node and keeps trying to read from it, until a queue builds up and causes lag at the application level.

The driver version is 3.1, which I believe doesn't have a default retry policy that mitigates this issue (please correct me if I'm wrong). I've suggested that the application teams upgrade the driver to the latest version to prevent such issues from recurring, and they'd like some assurance that the new driver will indeed not be affected by this.

Is my suggestion correct?

Is there a document that better explains this scenario and how the driver behaves in this case?
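For reference, my understanding is that in the 4.x Java drivers this behaviour is tuned through the driver's `application.conf` rather than in code. A minimal sketch of the relevant settings (the values shown are illustrative assumptions, not recommendations for our workload):

```hocon
datastax-java-driver {
  # Retry behaviour for failed requests
  advanced.retry-policy {
    class = DefaultRetryPolicy
  }
  # Pre-emptively retry slow requests on another node
  advanced.speculative-execution-policy {
    class = ConstantSpeculativeExecutionPolicy
    max-executions = 2
    delay = 500 milliseconds
  }
  # Heartbeats help the driver notice dead connections sooner
  advanced.heartbeat {
    interval = 30 seconds
    timeout = 500 milliseconds
  }
}
```

If someone can confirm whether these defaults would have caught our scenario, that would be great.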

Thanks a lot!

java driver

1 Answer

Erick Ramirez answered · gpss1979 commented

This sounds like the underlying server didn't completely shut down and the Cassandra JVM is still active.

I've seen this happen in other clusters where the native transport server is still listening on the CQL port (9042), so the driver thinks the node is still operational.

What disk_failure_policy do you have configured in cassandra.yaml? If this happens regularly, I recommend setting it to die so the JVM gets killed completely. Cheers!
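To be concrete, that's a one-line change in cassandra.yaml (sketch only; pick the policy that suits your deployment):

```yaml
# cassandra.yaml
# Valid values: die | stop_paranoid | stop | best_effort | ignore
disk_failure_policy: die
```

With die, any filesystem error shuts down gossip and client transports and kills the JVM, so drivers can't keep talking to a half-dead node.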

3 comments

Hi Erick!

The disk_failure_policy I use is stop. But I don't understand why this occurs, since that policy should stop both gossip and client transports. From cassandra.yaml:
# die
# shut down gossip and client transports and kill the JVM for any fs errors or
# single-sstable errors, so the node can be replaced.

# stop
# shut down gossip and client transports, leaving the node effectively dead, but
# can still be inspected via JMX, kill the JVM for errors during startup.

Shouldn't the node stop accepting requests when the disk fails under both policies? (The service did actually stop and the node was in DN status.)

Nevertheless, the client should know how to handle such failures, shouldn't it?



It should, but without actual log info and other evidence, I don't know what happened.

As I stated in my original answer, I've seen this before where the JVM process doesn't get killed for whatever reason. Usually something has gone wrong at the OS level, e.g. no new processes could be forked, so C* is still listening on the port and the driver thinks the node is up.

For the record, this isn't unique to C*. This sort of behaviour happens to other applications when there's a hardware failure. Cheers!


Thanks a lot Erick!

That's really helpful :-)
