phofegger_148429 asked ·

How can we relocate Cassandra nodes to a new location without downtime?

I need a recommended way to move physical Cassandra nodes (with their existing data) from the current data center to a new data center without data loss or downtime. The IP address and hostname of each machine will change.

Here are some key details of our Cassandra environment:

  • we have 20 nodes divided into two datacenters (10 nodes per DC)
  • we are using (endpoint_snitch: GossipingPropertyFileSnitch)
  • we replicate data between the DCs: {'class' : 'NetworkTopologyStrategy','ProdDC2' : 3,'ProdDC3' : 3}

our approach per cluster would be:

  • shut down Cassandra
  • dismantle the physical server and move it to the new datacenter
  • change the IP address according to this document
  • start Cassandra

Will this approach work?

Many Thanks


@phofegger_148429 Could you please clarify -- do you mean moving the physical servers to a physical data centre (not logical C* DCs)? Cheers!

phofegger_148429, replying to Erick Ramirez:

Hi Erick,

I mean a physical datacenter. The logical data center should remain the same.

thanks

Patrick


1 Answer

Erick Ramirez answered ·

@phofegger_148429 The approach you outlined will work. However, doing it "without downtime" depends on a few factors, including:

  • the number of nodes you plan to move at any one point
  • how long each node is down in total (shutdown + unrack + transport + rack + reconfigure + startup)
  • how busy the cluster is during the relocation
  • whether your app has the capability to failover between DCs

It's important to note that when a node is down, other nodes will store hints for it, so you need to allow time for hints to be handed off to the node when it comes back online. If a node is going to be down for longer than max_hint_window_in_ms, you will need to run a repair on that node before taking another one down.
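As a rough illustration of the hint-window check, here is a minimal Python sketch. It assumes the default max_hint_window_in_ms of 10800000 (3 hours); check the value in your own cassandra.yaml. The step durations and the function names are made up for the example and are not measurements from any real relocation.

```python
# Assumption: default max_hint_window_in_ms from cassandra.yaml (3 hours).
DEFAULT_MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


def total_downtime_ms(step_minutes):
    """Sum the per-step durations (in minutes) and convert to milliseconds."""
    return sum(step_minutes.values()) * 60 * 1000


def needs_repair(downtime_ms, max_hint_window_ms=DEFAULT_MAX_HINT_WINDOW_MS):
    """Return True if the node was down longer than the hint window.

    Past that window the other nodes stop storing hints for the node,
    so it needs a repair before the next node is taken down.
    """
    return downtime_ms > max_hint_window_ms


# Hypothetical durations for one node, in minutes.
example_steps = {
    "shutdown": 5,
    "unrack": 20,
    "transport": 90,
    "rack": 30,
    "reconfigure": 15,
    "startup": 10,
}

if __name__ == "__main__":
    downtime = total_downtime_ms(example_steps)
    print("repair needed:", needs_repair(downtime))
```

With these made-up numbers the node is down for 170 minutes, which is inside the 3 hour window, so hints alone should cover it. If transport took an hour longer, the check would flip and you would run a repair (for example with nodetool repair) on the node before moving on.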

If your cluster is busy and already near capacity, you might not have a choice other than to relocate only 1 node at a time. I would also recommend relocating all the nodes of one C* DC first. Once you have a full C* DC operational at the new location, do a failover at the application layer so it's connecting to the DC in the new location. Once you've verified that your application is operational, you can start relocating the nodes from the other C* DC. Cheers!
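To make the ordering concrete, here is a small Python sketch that only builds a checklist. It does not talk to Cassandra at all. The node names, the relocation_plan function and the step labels are hypothetical; the DC names come from the keyspace definition in the question.

```python
def relocation_plan(dcs, first_dc, batch_size=1):
    """Build an ordered list of (action, dc, nodes) steps.

    All nodes of first_dc are moved first, batch_size nodes at a time,
    then the application fails over to first_dc at the new location,
    and only then are the remaining DCs moved.
    """
    order = [first_dc] + [dc for dc in dcs if dc != first_dc]
    steps = []
    for index, dc in enumerate(order):
        nodes = dcs[dc]
        for start in range(0, len(nodes), batch_size):
            steps.append(("move", dc, nodes[start:start + batch_size]))
        if index == 0:
            # Fail over and verify the application before touching the other DC.
            steps.append(("failover_app_to", dc, []))
    return steps


# Hypothetical node names for the two 10-node DCs.
example_dcs = {
    "ProdDC2": [f"dc2-node{i}" for i in range(1, 11)],
    "ProdDC3": [f"dc3-node{i}" for i in range(1, 11)],
}

if __name__ == "__main__":
    for action, dc, nodes in relocation_plan(example_dcs, "ProdDC2"):
        print(action, dc, ", ".join(nodes))
```

With batch_size=1 this gives 10 single-node moves for ProdDC2, one failover step, then 10 single-node moves for ProdDC3. A larger batch_size is only worth considering if the cluster has spare capacity to absorb the extra hints.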


Hi Erick,

Many thanks for your answer. I have two further questions

We have the capability to failover our application between the DCs.

- Is it possible that nodes in one DC can be located in different subnets?

- What happens if we relocate a complete DC (ProdDC2) at once (all nodes in ProdDC2: shutdown + unrack + transport + rack + reconfigure + startup), it takes no longer than max_hint_window_in_ms, and we fail over all applications to the DC ProdDC3?

Do we get problems?


Kind Regards

Patrick


@phofegger_148429 The problem with relocating lots of nodes at the same time is that the nodes that are still running will be storing hints for all of them. This puts additional pressure on the running nodes and is not recommended, particularly if your cluster is near or over capacity.

And when you start seeing application issues during the relocation, trust me when I say that the admins performing the relocation are under even more pressure to get the nodes operational. In this stressful situation, we find that admins/operators start making more errors and the outage is prolonged further. We recommend that you test your approach so you are prepared for all possible scenarios. Cheers!


P.S. I converted your post into a comment since it's not an "answer". Cheers!
