
pramod.dba31_92912 asked:

What does the input.split.size_in_mb parameter in the Spark connector do?

Can someone elaborate on what this parameter does and how to estimate a value for it?

Reference: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md

When I set input.split.size_in_mb to 32, it was working fine.

spark-cassandra-connector

bettina.swynnerton answered:

Hi @pramod.dba31_92912,

The input.split.size_in_mb value is used to cut up the token range.

This blog post has very good information on how the Spark connector handles partitions (and how this works with large partitions):

The Spark Cassandra connector makes Spark Partitions by taking the Token Range and dividing it into pieces based on the Ranges already owned by the nodes in the cluster. The number and size of the Spark Partitions are based on metadata which is read from the Cassandra cluster. The estimated data size is used to build Spark Partitions containing the amount of data specified by a user parameter split_size_in_mb.

Also from the Spark Connector documentation:

The Connector evaluates the number of Spark partitions by dividing the table size estimate by the input.split.size_in_mb value.

Note that you cannot use this setting to cut up a large Cassandra partition.

Since the connector does not know the distribution of C* partitions before fully reading them, the size estimate for a vnode range (as reported by C*) is assumed to be evenly distributed. If a single token within a range dominates the size usage of that range, there will be imbalances.

For example, if input.split.size_in_mb is set to 64 MB:

If the range is 1-10000 and has an estimate of 128 MB, it is divided into two pieces, 1-5000 and 5001-10000. If partition 5001 were a 128 MB partition all by itself, you would end up with unbalanced Spark partitions: one is 0 MB and the other is 128 MB.
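To make the arithmetic concrete, here is a minimal Python sketch (not the connector's actual code) of splitting a token range by a size estimate, assuming data is evenly distributed across the range:

```python
def split_token_range(start, end, size_estimate_mb, split_size_mb=64):
    """Divide the token range [start, end] into pieces of roughly
    split_size_mb each, assuming an even data distribution."""
    num_splits = max(1, round(size_estimate_mb / split_size_mb))
    step = (end - start + 1) // num_splits
    splits = []
    lo = start
    for i in range(num_splits):
        # Last piece absorbs any remainder from integer division
        hi = end if i == num_splits - 1 else lo + step - 1
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Range 1-10000 with a 128 MB estimate and 64 MB split size -> two pieces
print(split_token_range(1, 10000, 128))  # [(1, 5000), (5001, 10000)]
```

This also shows why a single dominant C* partition causes skew: the split points are chosen from the size estimate alone, with no knowledge of where the data actually sits inside the range.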

I hope this helps to answer your question.


Erick Ramirez answered:

The input split size parameter (spark.cassandra.input.split.size_in_mb or input.split.sizeInMB) is the amount of data that gets read from the database in a Spark partition (not C* partition).

A lower input split size (less C* data) increases the number of Spark partitions. This breaks up the Cassandra token range so that a higher number of tasks can be parallelised in the Spark cluster and thus speeds up the processing.

There is a bit of trial-and-error involved in determining the optimal split size based on your data and Spark app. In some use cases, increased parallelisation may not benefit the Spark app so a small split size won't yield any gain. Cheers!
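For experimenting with different values, the split size can be set when building the Spark session. A hedged sketch in PySpark (the connection host, keyspace, and table names here are placeholders, and this assumes the spark-cassandra-connector package is on the classpath):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("split-size-tuning")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    # A smaller value means more Spark partitions and more parallel tasks
    .config("spark.cassandra.input.split.sizeInMB", "32")
    .getOrCreate()
)

df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .load()
)
```

Re-running the same job with different split sizes (e.g. 32, 64, 128) and comparing task counts and runtimes in the Spark UI is one practical way to do the trial-and-error mentioned above.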
