Why should the DSBulk batch size be small when the datasets are huge?

The documentation says that the batch size should be small when datasets are large. Why is that? Also, how is this parameter related to maxConcurrentQueries?

--dsbulk.batch.maxBatchStatements <number>

The maximum number of statements that a batch can contain. The ideal value depends on two factors:

  • The data being loaded: the larger the data, the smaller the batches should be.
  • The batch mode: when PARTITION_KEY is used, larger batches are acceptable, whereas when REPLICA_SET is used, smaller batches usually perform better. Also, when using REPLICA_SET, it is preferable to keep this number below the threshold configured server-side for the setting unlogged_batch_across_partitions_warn_threshold (the default is 10); failing to do so is likely to trigger query warnings (see log.maxQueryWarnings for more information).

When set to a value less than or equal to zero, the maximum number of statements is considered unlimited. At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.

Default: 32.
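
To make the two limits in the quoted documentation concrete, here is a minimal Python sketch of the rule as it is worded above: a non-positive value means "unlimited", and at least one of the two limits must be positive when batching is enabled. This is not DSBulk's actual implementation; the function names and the `batch_is_full` helper are hypothetical and only illustrate the stated behaviour.

```python
import math


def effective_limit(value: int) -> float:
    """Map a configured limit to its effective value.

    Per the quoted docs, a value <= 0 is treated as unlimited.
    """
    return math.inf if value <= 0 else value


def validate_batch_limits(batching_enabled: bool,
                          max_batch_statements: int,
                          max_size_in_bytes: int) -> None:
    """Raise if batching is enabled but both limits are unlimited."""
    if batching_enabled and max_batch_statements <= 0 and max_size_in_bytes <= 0:
        raise ValueError(
            "At least one of maxBatchStatements or maxSizeInBytes "
            "must be positive when batching is enabled"
        )


def batch_is_full(statement_count: int, size_in_bytes: int,
                  max_batch_statements: int, max_size_in_bytes: int) -> bool:
    """Hypothetical helper: a batch is full once either limit is reached."""
    return (statement_count >= effective_limit(max_batch_statements)
            or size_in_bytes >= effective_limit(max_size_in_bytes))


# The default of 32 statements with no byte limit is a valid combination.
validate_batch_limits(True, 32, -1)
print(batch_is_full(32, 10_000, 32, -1))  # statement limit reached
print(batch_is_full(5, 10_000, 32, -1))   # neither limit reached
```

With only a statement limit set, the byte size of a batch is unbounded, which is why the size of each row matters when choosing maxBatchStatements.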

1 Answer

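Because you could reach the maximum allowed frame size, as determined by Cassandra's native_transport_max_frame_size_in_mb configuration option. A batch is sent to the server as a single request, so its size grows with both the number of statements and the size of each row. When the limit is exceeded, a FrameTooLongException is thrown on the client side and the operation is aborted.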
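
As a rough illustration of the arithmetic behind this, the sketch below estimates how many statements fit in one frame for a given average row size. The 16 MiB frame limit and the 1 MiB row size are assumed values chosen for the example, not defaults taken from this thread; check the native_transport_max_frame_size_in_mb value configured on your own cluster. The estimate also ignores protocol overhead and compression, so treat it as an upper bound at best.

```python
MIB = 1024 * 1024


def batch_payload_bytes(statements: int, avg_row_bytes: int) -> int:
    """Approximate payload of one batch: statements times average row size."""
    return statements * avg_row_bytes


def max_statements_within_frame(avg_row_bytes: int, frame_limit_bytes: int) -> int:
    """Largest statement count whose approximate payload fits in one frame."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return frame_limit_bytes // avg_row_bytes


# Assumed values for illustration only.
frame_limit = 16 * MIB   # hypothetical server-side frame limit
avg_row = 1 * MIB        # hypothetical average row size (large rows)

# With the default of 32 statements, the batch would not fit in the frame.
print(batch_payload_bytes(32, avg_row) > frame_limit)
print(max_statements_within_frame(avg_row, frame_limit))
```

Under these assumptions, rows of about 1 MiB leave room for only 16 statements per batch, so the larger the rows, the lower maxBatchStatements has to be.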

This has no connection with the number of concurrent queries determined by maxConcurrentQueries.

rajib76 commented
Thanks Alexandre. I have maxBatchStatements set to 10 now; I reduced it from 32 because it was giving me connection errors. I have maxConcurrentQueries set to 120 (I have a 16-core machine). Does this mean that if the 10 batch statements have 120 queries in them, all 120 queries will run in parallel as part of the 10 batch statements? I wanted to know what the unit of parallelism is here.
