rajib76 asked Erick Ramirez edited

Why is DSBulk reporting 2 max batches when maxBatchStatements is 32 by default?

My DSBulk load is suddenly running slow. When I turned on debug logging, I see that the max batch size is 2, when ideally it should be 32. I have not changed anything, so I am not sure why the max is dropping to 2.

dsbulk

1 Answer

alexandre.dutra answered rajib76 commented

Were you observing better batch performance before on the same data? The efficiency of DSBulk's batching mechanism depends very much on the data being loaded. In general, it works much better when the data to load is sorted by partition key and the row sizes are small.
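To illustrate why sorting matters: DSBulk groups consecutive rows that share a partition key into a batch, so pre-sorting the input file by its partition key column makes larger batches possible. A minimal sketch of such a pre-sort, assuming a hypothetical CSV whose first column is the partition key:

```python
import csv
import io

# Hypothetical sample CSV; the first column ("pk") is the partition key.
raw = "pk,value\nB,1\nA,2\nB,3\nA,4\n"

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

# Sort rows by partition key so DSBulk sees consecutive rows for the
# same partition and can group them into larger batches.
data.sort(key=lambda row: row[0])

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(data)
print(out.getvalue())
```

For files too large to sort in memory, an external sort (e.g. the Unix `sort` utility keyed on the partition key column) achieves the same effect.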

Here is what you can try:

  • Increase dsbulk.batch.bufferSize to e.g. 256 or 512. Take care, this could take up all the available heap.

  • If that doesn't help, you could try setting dsbulk.batch.mode to REPLICA_SET. Beware that this will make your latencies much worse, so you should also set dsbulk.batch.maxBatchStatements to a low value (e.g. 5) to avoid timeouts or errors; then, if this works, you can slowly increase maxBatchStatements to get the best throughput while keeping latencies acceptable.
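The settings above can be collected in a configuration file rather than passed on the command line. A sketch of such a file (HOCON syntax; the exact values here are illustrative starting points, not recommendations):

```
# Hypothetical dsbulk.conf fragment, passed with: dsbulk load -f dsbulk.conf ...
dsbulk.batch {
  mode = REPLICA_SET        # group statements by replica set instead of partition key
  bufferSize = 512          # rows buffered while looking for batchable statements
  maxBatchStatements = 5    # start low to avoid timeouts, then raise gradually
}
```

Each setting is also available as a command-line option (e.g. --batch.mode), which is convenient for quick experiments before committing values to the file.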


Hi Alexandre, you are right about the partition key sorting. I discovered today that the latest files we tried to load were not ordered by partition key, and that's when we saw DSBulk using fewer batches, between 1 and 2. Earlier, the data was sorted by partition key, and we saw the number of batches going up to 14.