wadimd asked ·

Why does the CPU load go up after switching off 2 nodes out of 5?

Hi,

I'm experimenting with a Cassandra (3.11.4) cluster with two datacenters, each of which has 5 Cassandra nodes. The replication factor is 3 in each DC and NetworkTopologyStrategy is used. Each node has its own rack defined. I'm using the DCAwareRoundRobinPolicy in the application.

When I shut down one or more nodes, I see the CPU load grow on all other nodes (by about 8-10% per missing node). When I provoke a loss of quorum in one DC, the CPU load on the remaining nodes reaches 50%. I do not see any particular error or hint as to what is going on. The debug log shows only that there are connection attempts to the missing servers (connection refused). The "application" load on the DB is non-existent or very light.

Is this behaviour normal?

What should I do when I expect to lose 2-3 instances in one DC for a few days?

Should I plan to reconfigure the topology after losing nodes, so that the rest do not try to connect to the missing nodes?

thanks and regards

Wadim

performance

1 Answer

Erick Ramirez answered ·

Your experiment isn't really valid. With only 5 nodes and a replication factor of 3, once one replica of a partition is down, the chance that a second down node is another replica of that partition is 50% (2 of the 4 remaining nodes).

This means that with a small ring of 5 nodes in the DC, your application can effectively tolerate an outage of only 1 node for queries with a consistency of LOCAL_QUORUM.
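As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes a simplified ring that is not described in the thread: one token per node, with each token range replicated on the owning node and the next two nodes clockwise. A real cluster using vnodes distributes replicas differently, so treat the counts as indicative only. The function names are made up for this example.

```python
from fractions import Fraction
from itertools import combinations

NODES = 5  # nodes in one DC (from the question)
RF = 3     # replication factor per DC (from the question)


def replica_sets(nodes=NODES, rf=RF):
    """Replica set of each token range.

    Assumption: one token per node, replicas placed on the owning node
    and the next rf - 1 nodes clockwise (no vnodes).
    """
    return [frozenset((start + k) % nodes for k in range(rf))
            for start in range(nodes)]


def ranges_without_quorum(down, nodes=NODES, rf=RF):
    """Count token ranges that cannot reach LOCAL_QUORUM with `down` nodes offline."""
    quorum = rf // 2 + 1  # 2 when rf is 3
    down = set(down)
    return sum(1 for replicas in replica_sets(nodes, rf)
               if len(replicas - down) < quorum)


def min_max_ranges_lost(num_down, nodes=NODES, rf=RF):
    """Best and worst case over every combination of num_down offline nodes."""
    counts = [ranges_without_quorum(combo, nodes, rf)
              for combo in combinations(range(nodes), num_down)]
    return min(counts), max(counts)


def second_replica_probability(nodes=NODES, rf=RF):
    """Chance that a second random down node holds the same data as the first.

    One replica is already down, so rf - 1 replicas remain among nodes - 1 nodes.
    """
    return Fraction(rf - 1, nodes - 1)


if __name__ == "__main__":
    print("P(second down node is another replica):", second_replica_probability())
    print("ranges without quorum, 1 node down (min, max):", min_max_ranges_lost(1))
    print("ranges without quorum, 2 nodes down (min, max):", min_max_ranges_lost(2))
```

Under these assumptions, a single node outage leaves every range with a quorum, while any two down nodes leave at least one range unable to satisfy LOCAL_QUORUM.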

In a situation where you have a node outage, the remaining two replicas end up taking 50% of the traffic each (compared to 33% per replica when all three replicas are up). The level of elevated CPU load depends on many factors, including the number of CPU cores, memory, disk throughput, app traffic, etc. Cheers!


Hi,

Thanks a lot for your response.

The replication considerations are clear to me and we will discuss them, but it is the "idle" load on the Cassandra instances that seems strange to me. With unreachable nodes and no or very light application load, we observe the load mentioned above. Is Cassandra trying to reach the missing nodes, and does this cost some CPU?

The DataStax driver has some back-off strategies for reconnects, but it looks like Cassandra itself is continuously trying to reach the missing nodes. I've searched the config docs but found no setting that could control this...


Once more, many thanks and regards,

Wadim



There isn't a way to stop nodes from trying to gossip with a node which is down -- that's by design.

Nodes will attempt to gossip with other nodes to determine their status/availability. This isn't a problem and is part of the normal operation of a Cassandra cluster.

This isn't an issue that you need to "solve" so there is no configuration that will let you prevent this from happening. Cheers!
