question

pranali.khanna101994_189965 avatar image
pranali.khanna101994_189965 asked Erick Ramirez answered

How does Cassandra handle a whole data center failing?

Failover and replication in Cassandra

can the whole DataCenter be failed if yes , then how they are handled ?

cassandra
2 comments
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

alex.ott avatar image alex.ott commented ·

can you be more specific - what answer are you looking for? what do you want to understand?

0 Likes 0 ·
pranali.khanna101994_189965 avatar image pranali.khanna101994_189965 commented ·

I meant ... when all nodes in DataCentre is down then how request can be handled?

0 Likes 0 ·
smadhavan avatar image
smadhavan answered

@pranali.khanna101994_189965, yes Cassandra is built to support that level of failure in mind in its architecture. You might understand this concept well by reading through the following resources,

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered

Data center failover is not handled in the database layer so Cassandra does not perform any action to recover a DC. If a node is down or unavailable during a write request, Cassandra handles this with the Hinted Handoff -- a mechanism where the coordinator node responsible for managing a write request will store hints (write mutations) and replay it to the replica when it comes back online. But if a whole DC is down, this mechanism isn't relevant since there would be no nodes in the DC to coordinate requests.

In older versions of the DataStax drivers, a DC outage was handled by the DC-aware load balancing policy. To use the Java driver version 3.9 as an example, the DCAwareRoundRobinPolicy will build a query plan with contact points from the local DC first and add nodes from remote DCs to the end of the query plan. This means that if nodes in the local DC are not available, it will connect to nodes in remote DCs effectively "failing over".

We no longer think that is the ideal way of handling outages to the DC. Think of the situation where the app is querying with LOCAL_QUORUM consistency level but with the local DC down, suddenly the query gets run in a remote DC. Instead of the driver failing over at the application layer, the failover should instead be handled at the infrastructure layer.

In newer versions of the drivers (Java driver 4.x for example), the default load-balancing policy will only ever connect to a single DC -- the local DC. If the local C* DC (local to the app instances) is down or unavailable, chances are it's a full site outage and the app instances are unavailable as well. In this instance, the infrastructure load-balancer should failover to another site/region. This approach means that consistency guarantees are not compromised and that local CLs will always be local. Cheers!

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.