
brentc asked:

Cassandra kills itself by doing compaction during repairs

I have a 6-node Cassandra cluster (3 nodes per datacenter) where each node holds a full replica of the data.

I'm using cassandra-reaper to perform repairs, and I have to babysit Cassandra, monitoring CPU usage, RAM, and the number of pending tasks shown in:

 nodetool compactionstats
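
For example, something as simple as this can be used to watch the pending count from another terminal (a rough sketch; the interval and grep pattern are only illustrative):

    # Poll the number of pending compactions every 30 seconds
    watch -n 30 "nodetool compactionstats | grep -i 'pending tasks'"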

During the repair, Cassandra suddenly starts scheduling compactions for completely unrelated keyspaces (ones that aren't being repaired at the moment), driving CPU usage to 100% on all cores (a long-term load average above 11.0, measured with htop) and causing queries to fail. It does not stop there: the node slowly works through the tasks, but new ones arrive faster than it can process them, and eventually, while watching:

nodetool status

I notice that several nodes keep flapping between up and down, and some queries keep failing until I restart those nodes manually.


Each node has the following specs:

RAM: 64 GB of VM memory, of which 32 GB is allocated to the heap (-Xmx)

CPU: 8 vCPUs (2.0 GHz)

Cassandra version: 3.11.4


My data is not particularly large (before repairing, my nodes reported a load of 20 GB; now they report 50 GB), but there are some tables in which a single cell can contain a lot of textual data (1000+ lines).

I have read elsewhere that repairing such rows consumes a lot of processing resources, but I'm confused as to why Cassandra starts scheduling 300+ pending compactions on those big tables while a repair is running, even though they are not the ones being repaired.

I've considered disabling auto compaction during the repair with:

nodetool disableautocompaction

and then re-enabling it after the repair is done, but I fear the nodes will still end up in this corrupted/confused state.
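
Concretely, the sequence I have in mind looks roughly like this (just a sketch, I haven't actually tried it yet; without arguments the commands apply to all keyspaces):

    # Pause background compactions before starting the repair
    nodetool disableautocompaction

    # ... run the repair here (Reaper or nodetool repair) ...

    # Resume background compactions once the repair has finished
    nodetool enableautocompaction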

The logs keep showing output from StatusLogger.java on the nodes that are very busy with these tasks, but nothing that clearly indicates a problem.

Note that this seems to start happening when nodes are sending/receiving streams. Validation tasks don't seem to have much impact (some increased CPU usage, but not 100% on all cores).

It's recommended to run weekly repairs, but I can't schedule them if the nodes are this fragile despite having those specs.

Something I haven't done yet is switching the garbage collector to G1, which I think is recommended for heap sizes larger than 16 GB. With the default memory ratio formula, the Cassandra nodes run out of memory during a repair.
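
As far as I understand, in 3.11 the collector is chosen in conf/jvm.options, so the switch would mean commenting out the CMS flags and uncommenting the G1 ones; a sketch of what I mean (the exact flags and defaults in the shipped file may differ):

    # conf/jvm.options (sketch): comment out the CMS settings...
    #-XX:+UseParNewGC
    #-XX:+UseConcMarkSweepGC

    # ...and uncomment the G1 settings instead
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500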


Is this behavior familiar? Is there something I'm missing in terms of configuration? Is it a bug? Is it supposed to start all these compaction tasks (of seemingly all tables) while a repair is running for one keyspace?

Having had a lot of repairs that simply failed, I already changed some parameters that were recommended for those cases. I also went through the configuration and raised things like the throughput settings somewhat, since the disks and network can handle more, but it seems to have had no effect.
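
For reference, these caps can also be checked and raised at runtime with nodetool (the values below are only examples, and runtime changes don't survive a restart; the matching cassandra.yaml settings do):

    # Compaction throughput cap, in MB/s
    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 64

    # Streaming throughput cap, in Mbit/s
    nodetool getstreamthroughput
    nodetool setstreamthroughput 400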

Thanks in advance!

Tags: compaction, repair

1 Answer

Erick Ramirez answered:

@brentc the symptoms you described indicate that the nodes are IO-bound when compactions are running. Check that you are not running repairs in parallel on multiple nodes, because that would explain why they are getting overloaded. Try manually repairing one node at a time with the -pr flag and monitor the nodes. If a single repair doesn't reproduce the high CPU issue you described, it would indicate that you have too many repairs running, which is overloading the cluster. Cheers!
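
Something along these lines, run strictly one node at a time (a sketch only; the host names are placeholders and it assumes you can ssh to each node and run nodetool locally there):

    # Repair each node's primary token ranges sequentially (placeholder hosts)
    for host in node1 node2 node3 node4 node5 node6; do
        ssh "$host" nodetool repair -pr
    done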

8 comments

brentc commented:

Thanks for the reply. Well, Cassandra's data is stored on a network mount, I suppose (it's a virtual machine in a datacenter), but the fastest one available. Would I be seeing a lot of I/O waits if that were the case?

Erick Ramirez commented:

Does that mean it's an NFS mount that's shared across multiple nodes? And yes, you'd see lots of IO waits if the nodes are IO-bound. The symptoms you describe indicate that there are too many things to repair and that repairs weren't getting done regularly. Cheers!

brentc commented:

No, as far as the VM is concerned it's a regular physical disk; underneath it may be backed by a SAN, and nothing is shared with other nodes. Using sysstat I can see that the average I/O wait is 0.3%. Indeed, this is the first time I'm repairing this data (after a few years of running), and it's not incremental.

Erick Ramirez commented:

@brentc is that sysstat while a repair is running or just under normal load? It would also be good to know the breakdown of the high CPU load when repairs are running, i.e. how much is in the wait column from the top output. Cheers!
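
For example, while a repair is running you could capture something like this (assuming the sysstat package is installed for iostat):

    # Extended device stats every 5 seconds, 3 samples; watch the %iowait column
    iostat -x 5 3

    # One batch-mode snapshot of top; "wa" on the %Cpu(s) line is IO wait
    top -b -n 1 | head -n 5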

brentc commented:

Do you think we could have a chat on Discord/Slack/whatever else? That might be easier because these comments are getting nested :) I will post the results of our conversation here later. As for your question, yes, it's during a repair. I will have to start the repair again to see the requested statistics :)

brentc commented:

... I have tried repairing manually with the -pr flag you described, but it results in the same thing. I'm using reaper now for more control, but the same behavior occurs regardless.

Erick Ramirez commented:

Try repairing small ranges of specific tables to see if you could reduce the load on the nodes. If the data is on an NFS mount that's shared across nodes, it will be difficult to get around the fact that the replicas are competing for the shared resource to do repairs. Cheers!
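
With plain nodetool, a subrange repair of a single table looks roughly like this (keyspace, table and tokens are placeholders; -st and -et take the start and end token of the subrange):

    # Repair one token subrange of one table (placeholder names and tokens)
    nodetool repair -st 4611686018427387904 -et 9223372036854775800 my_keyspace my_table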

brentc commented:

Hmm, I had already split the repair into 700 segments instead of the standard 100; I guess that's still not small enough? I can see in Reaper which segment keeps failing for the repair. How can I split that segment up into smaller segments? I have a start and end token for that segment right now.
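
In the meantime I suppose I could split that segment by hand at its midpoint and repair each half as a subrange, something like this (a sketch with placeholder tokens; it assumes the end token is larger than the start token, i.e. the segment doesn't wrap around the ring):

    # Split the failing segment [START, END) in two and repair each half
    START=1000000000000000000   # placeholder: segment start token from Reaper
    END=2000000000000000000     # placeholder: segment end token from Reaper
    MID=$(( START + (END - START) / 2 ))
    nodetool repair -st "$START" -et "$MID" my_keyspace
    nodetool repair -st "$MID" -et "$END" my_keyspace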
