rajib76 asked Erick Ramirez edited

Why should the DSBulk batch size be small when the datasets are huge?

The documentation says that the batch size should be small when datasets are large. Why is that? Also, how is this parameter related to maxConcurrentQueries?

--dsbulk.batch.maxBatchStatements <number>

The maximum number of statements that a batch can contain. The ideal value depends on two factors:

  • The data being loaded: the larger the data, the smaller the batches should be.
  • The batch mode: when PARTITION_KEY is used, larger batches are acceptable, whereas when REPLICA_SET is used, smaller batches usually perform better. Also, when using REPLICA_SET, it is preferable to keep this number below the threshold configured server-side for the setting unlogged_batch_across_partitions_warn_threshold (the default is 10); failing to do so is likely to trigger query warnings (see log.maxQueryWarnings for more information).

When set to a value less than or equal to zero, the maximum number of statements is considered unlimited. At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.

Default: 32.


1 Answer

alexandre.dutra answered rajib76 commented

Because you could reach the maximum allowed frame size, as determined by Cassandra's native_transport_max_frame_size_in_mb configuration option. On the client side, this results in a FrameTooLongException being thrown and the operation being aborted.
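To make the frame-size argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of DSBulk or any driver: the helper names are hypothetical, the 256 MB frame limit and the 10 MiB average row size are assumed values (check your own cassandra.yaml and data), and the estimate ignores protocol overhead, so real batches are somewhat larger than computed here.

```python
MIB = 1024 * 1024


def estimated_batch_bytes(statements: int, avg_row_bytes: int) -> int:
    """Rough payload estimate: number of statements times average row size.

    Ignores protocol and metadata overhead, so this is a lower bound.
    """
    return statements * avg_row_bytes


def fits_in_frame(statements: int, avg_row_bytes: int, frame_limit_mb: int) -> bool:
    """True if the estimated batch payload stays within the frame limit."""
    return estimated_batch_bytes(statements, avg_row_bytes) <= frame_limit_mb * MIB


def max_statements_for_frame(frame_limit_mb: int, avg_row_bytes: int) -> int:
    """Largest statement count whose estimated payload fits in one frame."""
    return (frame_limit_mb * MIB) // avg_row_bytes


# Assumed values, for illustration only.
frame_limit_mb = 256        # assumed native_transport_max_frame_size_in_mb
avg_row_bytes = 10 * MIB    # assumed average row size (e.g. rows with large blobs)

for statements in (32, 10):
    ok = fits_in_frame(statements, avg_row_bytes, frame_limit_mb)
    size_mib = estimated_batch_bytes(statements, avg_row_bytes) // MIB
    print(f"{statements} statements -> ~{size_mib} MiB, fits: {ok}")

print("max statements:", max_statements_for_frame(frame_limit_mb, avg_row_bytes))
```

Under these assumptions, a batch of 32 statements would exceed the frame limit while a batch of 10 would fit, which is why larger rows call for smaller batches.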

This has no connection with the number of concurrent queries determined by maxConcurrentQueries.


rajib76 commented
Thanks, Alexandre. I have maxBatchStatements set to 10 now; I reduced it from 32 because it was giving me connection errors. I have maxConcurrentQueries set to 120 (I have a 16-core machine). Does this mean that if the 10 batch statements contain 120 queries, all 120 queries will run in parallel as part of those 10 batch statements? I would like to know what the unit of parallelism is here.