rajib76 asked Erick Ramirez answered

Is it good practice to disable NodeSync while bulk-loading data?

I have a table with NodeSync enabled. When I run DSBulk against that table, I see compactions triggering continuously, which I think is delaying the load. Is it good practice to disable NodeSync while loading tables with DSBulk?

dsbulk nodesync

steve.lacerda answered rajib76 commented

Hi! NodeSync works on the write path, as opposed to the repair service, which uses the streaming path. With that said, if you are inserting a lot of data, it might help to temporarily disable NodeSync on the table while you load the data, to prevent write amplification.
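
A minimal sketch of what that toggle could look like, assuming the DSE 6 `nodesync` table option and a hypothetical table `my_ks.my_table`. The helper name is illustrative, and it only builds the CQL text that you would then run through cqlsh or a driver:

```python
def nodesync_statement(keyspace: str, table: str, enabled: bool) -> str:
    """Build the CQL that switches NodeSync on or off for one table."""
    flag = "true" if enabled else "false"
    return f"ALTER TABLE {keyspace}.{table} WITH nodesync = {{'enabled': '{flag}'}};"


# Hypothetical keyspace and table names, for illustration only.
before_load = nodesync_statement("my_ks", "my_table", enabled=False)
after_load = nodesync_statement("my_ks", "my_table", enabled=True)
print(before_load)  # run this before starting the bulk load
print(after_load)   # run this once the load has finished
```

Remember to re-enable NodeSync after the load, otherwise the table is left without background synchronization.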


rajib76 commented
Thanks, Steve. What is the difference between working on the write path and working on the streaming path?
rajib76 commented
Even after disabling NodeSync, I still see compactions triggered as soon as I start DSBulk.
Erick Ramirez answered

DSBulk doesn't really have anything to do with NodeSync or compactions. I think you're just conflating them unnecessarily.

I've explained in your other question (#13285) that compactions are part of the normal operation of Cassandra. They get triggered whether you are bulk-loading data or not.

Similarly, NodeSync (and repairs, for that matter) is part of keeping data consistent across the nodes in your cluster and is necessary given the distributed architecture of Cassandra. It is normal for repairs and NodeSync to be running in the background.

The fact that you think there's a delay while you are loading data just indicates that you haven't sized your cluster correctly. Remember that Cassandra scales linearly:

  • IF your 3-node cluster can sustain 200K ops/sec
  • AND you need a throughput of 400K ops/sec
  • THEN double your cluster to 6 nodes to achieve double the throughput.
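
The arithmetic behind that rule of thumb can be sketched as follows, assuming throughput scales perfectly linearly with node count (real clusters will vary with data model, hardware, and replication settings). The function name is made up for illustration:

```python
import math


def nodes_needed(current_nodes: int, current_ops: int, target_ops: int) -> int:
    """Estimate the node count for a target throughput under linear scaling."""
    # Multiply before dividing so whole-number ratios stay exact.
    return math.ceil(target_ops * current_nodes / current_ops)


# Figures from the example above: 3 nodes sustain 200K ops/sec, 400K is required.
print(nodes_needed(3, 200_000, 400_000))
```

Rounding up means a target that falls between two cluster sizes is always covered by the larger one.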

Based on your previous questions, I think we've established that your cluster does not have sufficient capacity to sustain the throughput you require so you should resize your cluster accordingly, even if it's just a temporary measure while you are migrating data. There's no cost associated with adding nodes temporarily since DSE subscriptions allow for temporary "bursting" during peak events like Black Friday sales. Cheers!
