pi_165798 avatar image
pi_165798 asked Erick Ramirez edited

Repair kills the performance of the cluster

I have a cluster of 5 nodes. Each node has about 200 GB data. The replication factor is set to 3. Whenever I run a partitioner range repair on a single node (nodetool repair -pr), the performance of the entire cluster is significantly reduced. So much that the microservices connected to the cluster receive timeouts on the majority of requests.

Each server has the following specifications:

  • 8 VCPUs, 32 GB RAM, 500 GB SSD (CX51 at Hetzner).
  • Cassandra version: 3.11.3
  • Parallel GC with InitialHeapSize: 515899392, MaxHeapSize: 8237613056

How do I prevent a repair from overloading the cluster? I have tried lowering the compaction throughput (nodetool setcompactionthroughput) with no apparent difference.

10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image Erick Ramirez ♦♦ commented ·

@pi_165798 What version of Cassandra are you using? Feel free to edit your original post with the details. Cheers!

0 Likes 0 ·
pi_165798 avatar image pi_165798 Erick Ramirez ♦♦ commented ·

Original post edited with the cassandra version :)

1 Like 1 ·
Erick Ramirez avatar image Erick Ramirez ♦♦ pi_165798 commented ·

Thanks for doing that. Unfortunately, I've got more questions to come after reading your comment to smadhavan's answer because some things don't add up.

  • How much RAM is allocated to the max/min heap on the nodes?
  • Which GC -- CMS or G1?
  • What is the full repair command you are running on the node?
  • Please confirm that you've only started the repair on one node (not multiple nodes in parallel).
  • How many streams are running based on the nodetool netstats output on the node where you kicked off the repair?

Sorry for all the questions but it's necessary to troubleshoot your problem. Cheers!

0 Likes 0 ·
Show more comments
smadhavan avatar image
smadhavan answered smadhavan edited

@pi_165798, repair is a streaming process and there is a possibility to tune (throttle/un-throttle) the performance of this activity.

When you say the cluster is overloaded, how are you measuring that? What is your current compaction throughput and stream throughput values? What version of C* are you running your cluster with? There are couple ways to monitor and measure the same,

Having said that, the repair activity/process performs validation compaction and streams data from other nodes in the cluster. It is possible that the values of the below are set too high on your environment at `cassandra.yaml` file,

compaction_throughput_mb_per_sec (Default 16)
stream_throughput_outbound_megabits_per_sec (Default 200)

Adjusting the above parameters could be done dynamically without DSE/C* process restart by looking at the output of the below commands,

nodetool getcompactionthroughput
nodetool getstreamthroughput

and then setting it to a value that is lower than the current default by using the below commands,

nodetool setcompactionthroughput
nodetool setstreamthroughput

These will need to be set as the same value on all nodes in the cluster. To determine the right values for these parameters, you may want to make small adjustments and monitor the effect on the your nodes before making further changes. Once the repair process is complete you may want to revert these parameters to their default values. Hope that helps!

Other Resources

2 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

pi_165798 avatar image pi_165798 commented ·

The cluster being overloaded is visible from the many timeouts the microservices receive while trying to read/write to the database. The node running the repair has 2 cores running at 100% cpu usage, while the remaining nodes all sit at around 100% usage on all 8 cores.

While repairing I have tried setting compactionthroughput and streamthroughput to 1/4 of their default values on all the nodes. However, it had no impact on the performance despite being set with the nodetool commands you listed. Do these settings need to be set before starting the repair?

0 Likes 0 ·
smadhavan avatar image smadhavan ♦ pi_165798 commented ·

Yes @pi_165798, these settings change will have to be done prior to kicking off the repair process. And, yes we can start it on just a single (or couple) table(s) to begin with, for e.g.:

nodetool repair -pr -- keyspace1 table1 [tableN]

Note: Execute this command on the nodes in a rolling fashion i.e. one node at time

1 Like 1 ·
Erick Ramirez avatar image
Erick Ramirez answered Erick Ramirez edited

@pi_165798 There's something strange going on with your cluster. The most common cause of high CPU utilisation on C* nodes (in my experience) is constant GCs. A good way to confirm this is to do a thread dump or run ttop while the issue is taking place. This was my motivation for asking how much memory is allocated to the heap.

I have to confess that I've only ever come across one cluster where nodes were using Parallel GC so I don't have any experience with it. To be clear -- I'm NOT recommending you switch collectors. :)

When nodes have less than 40GB of RAM, I've only ever used CMS with at least 16GB allocated to heap, 20-24GB preferably if the server has 32GB of RAM. For larger nodes, clusters I've worked on use G1 GC with at least 20GB allocated to heap but preferably 24-31GB since G1 operates better with larger heaps.

Again, NOT recommending you switch collectors but would you consider bumping up the heap on the nodes? I'm not sure if you should try with 12GB or 16GB since I don't have experience with Parallel GC. You might want to test it out on 1 or 2 nodes and monitor it for a couple of days before rolling it out to the rest of the cluster. I think it might reduce what I think are constant GCs leading to high CPU utilisation.

Also, I generally recommend setting both the max and min/initial heap to be te same size so the JVM doesn't pause when the heap is getting resized (see Why do you recommend both -Xms and -Xmx flags be the same size?).

Finally, it sounds like you don't regularly run repairs and so you're only experiencing issues now. I'd suggest try repairing 1 table at a time, perhaps starting with one of the smaller tables still with the partitioner range repair (with -pr flag). This should lessen the impact of the initial repair at the start and then proceed to repairing the larger tables. See how you go. Cheers!

UPDATE - I forgot to mention that there is a known issue with repairs where it can cause OutOfMemoryError due to an issue with Merkle trees are calculated (CASSANDRA-14096). This could potentially be blowing up the heap on your nodes. This was fixed in C* 3.11.5 so you might want to consider upgrading to the latest version. Cheers!

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

pi_165798 avatar image pi_165798 commented ·

Thank you very much for the detailed answer.

[Follow up question posted in #4311]

0 Likes 0 ·