Failover and replication in Cassandra
Can a whole data center fail? If so, how is that handled?
@pranali.khanna101994_189965, yes, Cassandra's architecture was designed with that level of failure in mind. You might understand this concept better by reading through the following resources:
Data center failover is not handled in the database layer, so Cassandra does not perform any action to recover a DC. If a node is down or unavailable during a write request, Cassandra handles this with hinted handoff, a mechanism where the coordinator node responsible for managing the write request stores hints (write mutations) and replays them to the replica when it comes back online. But if a whole DC is down, this mechanism isn't relevant since there would be no nodes in the DC to coordinate requests.
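To make the hinted handoff idea concrete, here is a minimal sketch in plain Java. It is a toy simulation, not Cassandra's implementation: the Replica and HintingCoordinator classes and all their members are hypothetical names made up for illustration, and real hints also have storage limits and expiry windows that are left out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a replica node; not one of Cassandra's classes.
final class Replica {
    final String name;
    boolean up = true;
    final Map<String, String> data = new HashMap<>();

    Replica(String name) {
        this.name = name;
    }
}

// Hypothetical coordinator that keeps hints for replicas that are down.
final class HintingCoordinator {
    // A hint records which replica missed a write and what the write was.
    private static final class Hint {
        final Replica target;
        final String key;
        final String value;

        Hint(Replica target, String key, String value) {
            this.target = target;
            this.key = key;
            this.value = value;
        }
    }

    private final List<Hint> hints = new ArrayList<>();

    // Apply the write to live replicas; store a hint for each one that is down.
    void write(List<Replica> replicas, String key, String value) {
        for (Replica r : replicas) {
            if (r.up) {
                r.data.put(key, value);
            } else {
                hints.add(new Hint(r, key, value));
            }
        }
    }

    // Replay stored hints, in order, to replicas that are back online.
    // Hints for replicas that are still down are kept for a later attempt.
    void replayHints() {
        List<Hint> remaining = new ArrayList<>();
        for (Hint h : hints) {
            if (h.target.up) {
                h.target.data.put(h.key, h.value);
            } else {
                remaining.add(h);
            }
        }
        hints.clear();
        hints.addAll(remaining);
    }

    int pendingHints() {
        return hints.size();
    }
}
```

In this sketch, a write issued while one replica is marked down leaves one pending hint, and calling replayHints() after the replica is marked up again delivers the missed write.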
In older versions of the DataStax drivers, a DC outage was handled by the DC-aware load balancing policy. To use the Java driver version 3.9 as an example, the DCAwareRoundRobinPolicy (when configured to allow hosts in remote DCs) will build a query plan with nodes from the local DC first and add nodes from remote DCs to the end of the query plan. This means that if nodes in the local DC are not available, it will send requests to nodes in remote DCs, effectively "failing over".
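The ordering described above can be sketched in plain Java. This is only an illustration of the "local first, remote last" idea, not the driver's code: the Node and DcAwarePlanSketch names are hypothetical, and the round-robin rotation and per-remote-DC host limits of the real policy are deliberately omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node description; not a driver class.
final class Node {
    final String address;
    final String dc;
    final boolean up;

    Node(String address, String dc, boolean up) {
        this.address = address;
        this.dc = dc;
        this.up = up;
    }
}

final class DcAwarePlanSketch {
    // Build a plan with live local-DC nodes first,
    // followed by live remote-DC nodes as a fallback.
    static List<String> queryPlan(List<Node> nodes, String localDc) {
        List<String> plan = new ArrayList<>();
        for (Node n : nodes) {
            if (n.up && n.dc.equals(localDc)) {
                plan.add(n.address);
            }
        }
        for (Node n : nodes) {
            if (n.up && !n.dc.equals(localDc)) {
                plan.add(n.address);
            }
        }
        return plan;
    }
}
```

With every local node marked down, the plan in this sketch contains only remote nodes, which is the implicit "failover" behaviour being described.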
We no longer think that is the ideal way of handling DC outages. Think of the situation where the app is querying with LOCAL_QUORUM consistency level but, with the local DC down, the query suddenly gets run in a remote DC. Instead of the driver failing over at the application layer, the failover should be handled at the infrastructure layer.
In newer versions of the drivers (Java driver 4.x for example), the default load-balancing policy will only ever connect to a single DC, the local DC. If the local C* DC (local to the app instances) is down or unavailable, chances are it's a full site outage and the app instances are unavailable as well. In this instance, the infrastructure load balancer should fail over to another site/region. This approach means that consistency guarantees are not compromised and that local CLs will always be local. Cheers!
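For contrast with the older behaviour, here is a minimal plain-Java sketch of a local-only plan. It is an assumption-laden toy, not the 4.x driver's policy: Host and LocalOnlyPlanSketch are hypothetical names, and the real policy does considerably more (replica awareness, node health tracking).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical host description; not a driver class.
final class Host {
    final String address;
    final String dc;
    final boolean up;

    Host(String address, String dc, boolean up) {
        this.address = address;
        this.dc = dc;
        this.up = up;
    }
}

final class LocalOnlyPlanSketch {
    // Only live nodes in the local DC are ever candidates.
    // Remote-DC nodes are ignored even when every local node is down.
    static List<String> queryPlan(List<Host> hosts, String localDc) {
        List<String> plan = new ArrayList<>();
        for (Host h : hosts) {
            if (h.up && h.dc.equals(localDc)) {
                plan.add(h.address);
            }
        }
        return plan;
    }
}
```

When the local DC is down, the plan in this sketch is empty, so the request fails instead of quietly running in another DC. Moving traffic to app instances in another site is then the job of the infrastructure load balancer.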