Question

ortizfabio_185816 asked · Erick Ramirez commented

What is the ideal batch size for the connector to avoid the error "Request is too big: length 293478012 exceeds maximum allowed length 268435456"?

I got this error running a Spark process with the spark-cassandra-connector. I have the following configuration:

spark.cassandra.output.consistency.level = "LOCAL_ONE"
spark.cassandra.output.concurrent.writes = "1"
spark.cassandra.output.batch.grouping.buffer.size = "10"
spark.cassandra.output.batch.size.rows = "10000"
spark.cassandra.output.batch.grouping.key = "replica_set"
spark.cassandra.output.throughput_mb_per_sec = "10"

I believe this might have been caused because the batch built from output.batch.size.rows = "10000" exceeded the maximum length of 268435456 bytes, which is defined by the parameter "native_transport_max_frame_size_in_mb: 256". Would this be avoided by setting output.batch.size.bytes to a certain number of bytes instead of a row count? So my question is: what is the ideal value to get high throughput?
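Concretely, I am thinking of replacing the row-based setting with the byte-based one, roughly like this (the 1 MB figure is only a placeholder I picked for illustration, not a value I have tested):

import org.apache.spark.SparkConf

// Sketch: drop output.batch.size.rows so the connector batches by bytes instead,
// keeping each batch well below native_transport_max_frame_size_in_mb (256 MB).
val conf = new SparkConf()
  .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
  .set("spark.cassandra.output.concurrent.writes", "1")
  .set("spark.cassandra.output.batch.grouping.buffer.size", "10")
  .set("spark.cassandra.output.batch.grouping.key", "replica_set")
  .set("spark.cassandra.output.batch.size.bytes", "1048576")   // 1 MB placeholder, replaces batch.size.rows
  .set("spark.cassandra.output.throughput_mb_per_sec", "10")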


At the time of the error, here is my write request rate per minute (attached as capture.png):


Thanks

Tags: cassandra · spark-connector · configuration
capture.png (40.7 KiB)
1 comment

Erick Ramirez ♦♦ commented:

@ortizfabio_185816 Just acknowledging receipt of your question and we'll get back to you with an answer real soon. Cheers!


1 Answer

Erick Ramirez answered · Erick Ramirez commented

@ortizfabio_185816 The default output batch size is 1 KB, so it probably won't make much difference here. The problem, as you already noted, is that the batch size is too large. I understand that your goal is to achieve very high throughput, but what we generally see happen is that writes end up failing or timing out because Spark is pushing far more than the cluster can handle. There comes a point where the Cassandra nodes reach their maximum throughput, at which point you need to scale out to increase the capacity of the cluster.
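
As a rough sketch of what throttling looks like in practice (the host, keyspace, table and numbers below are placeholders, not tuned recommendations), you would leave the batch size at its byte-based default and wind the throughput cap back until the cluster keeps up:

import org.apache.spark.sql.SparkSession

// Illustrative only: a throttled write job. All names and values are placeholders.
val spark = SparkSession.builder()
  .appName("throttled-cassandra-writer")
  .config("spark.cassandra.connection.host", "10.0.0.1")
  .config("spark.cassandra.output.batch.size.bytes", "1024")         // the 1 KB default
  .config("spark.cassandra.output.concurrent.writes", "1")
  .config("spark.cassandra.output.throughput_mb_per_sec", "5")       // wind this down if writes still time out
  .getOrCreate()

val df = spark.read.parquet("/path/to/staged/data")                   // hypothetical source
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))   // placeholder keyspace/table
  .mode("append")
  .save()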

I'll socialise your question with the Analytics team internally and get them to post their feedback. Cheers!

2 comments

ortizfabio_185816 commented:

Erick,

Thanks for your answer. I think the writer needs a retry parameter just like the "query.retry.count" at the Cassandra Connection Parameters level. My job ends in failure, whereas I believe a retry might have helped.
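
For what it's worth, this is the existing connection-level setting I mean (a minimal sketch; the value is just an example). What I think is missing is an equivalent retry knob that the writer itself would honour when a batch write fails:

import org.apache.spark.SparkConf

// Sketch only: the connection-level retry count referred to above.
val conf = new SparkConf()
  .set("spark.cassandra.query.retry.count", "10")   // example value, not a recommendation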

Erick Ramirez ♦♦ replied to ortizfabio_185816:

Even if it retries, it will still fail if the batch size is too large. Spark is loading more data than the cluster can handle, so you need to throttle down the load. Cheers!
