Can someone elaborate on what this parameter does and how we should estimate it?
From this link: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
When I set input.split.size_in_mb to 32, it was working fine.
The input split size parameter (spark.cassandra.input.split.size_in_mb, or input.split.sizeInMB in newer connector versions) is the amount of data that gets read from the database into a single Spark partition (not a C* partition).
A lower input split size (less C* data per partition) increases the number of Spark partitions. This breaks up the Cassandra token range so that more tasks can run in parallel across the Spark cluster, which speeds up the processing.
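To experiment with this, here is a minimal sketch in Scala showing how you might set the split size and check how many Spark partitions you actually get (the contact point, keyspace, and table names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: set the split size, then inspect the resulting
// partition count. "test_ks" and "test_table" are placeholder names.
val spark = SparkSession.builder()
  .appName("split-size-demo")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  // connector 2.x+ name; older versions use spark.cassandra.input.split.size_in_mb
  .config("spark.cassandra.input.split.sizeInMB", "32")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "test_table"))
  .load()

// A smaller split size should produce more (smaller) Spark partitions.
println(s"Spark partitions: ${df.rdd.getNumPartitions}")
```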
There is a bit of trial-and-error involved in determining the optimal split size based on your data and Spark app. In some use cases, increased parallelisation may not benefit the Spark app, so a small split size won't yield any gain. Cheers!
The input.split.size_in_mb value is used to cut up the token range.
This blog has very good information on how the Spark connector handles partitions (and how this works with large partitions):
The Spark Cassandra connector makes Spark Partitions by taking the Token Range and dividing it into pieces based on the Ranges already owned by the nodes in the cluster. The number and size of the Spark Partitions are based on metadata which is read from the Cassandra cluster. The estimated data size is used to build Spark Partitions containing the amount of data specified by a user parameter split_size_in_mb.
Also from the Spark Connector documentation:
The Connector evaluates the number of Spark partitions by dividing the table size estimate by the input.split.size_in_mb value.
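As a rough worked example of that division, with hypothetical numbers (this is not connector code):

```scala
// Back-of-the-envelope estimate: a table whose size estimate is 2048 MB,
// read with the default 64 MB split size, yields roughly
// 2048 / 64 = 32 Spark partitions.
val tableSizeEstimateMB = 2048.0 // as reported by the C* system.size_estimates table
val splitSizeMB = 64.0           // input.split.size_in_mb
val approxPartitions = math.ceil(tableSizeEstimateMB / splitSizeMB).toInt
// approxPartitions == 32 (the real count also depends on token range boundaries)
```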
Note that you cannot use this setting to cut up a large Cassandra partition.
Since the connector does not know the distribution of C* partitions before fully reading them, the size estimate for a vnode range (as reported by C*) is assumed to be evenly distributed. If a single token within a range dominates the size usage of that range, then there will be imbalances.
For example, if the input.split.size_in_mb is set to 64 MB: if the range is 1 - 10000 and has an estimate of 128 MB, it is divided into two ranges, 1-5000 and 5001-10000. If partition 5001 were a 128 MB partition all by itself, you would end up with unbalanced Spark partitions: one is 0 MB and the other is 128 MB.
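Here is a minimal sketch of that even-distribution assumption (an illustration only, not the connector's actual internals):

```scala
// A range is cut by token count under an even-distribution assumption,
// so a single dominant token can leave the pieces badly unbalanced.
case class TokenRange(start: Long, end: Long, estimateMB: Long)

def split(range: TokenRange, splitSizeMB: Long): Seq[TokenRange] = {
  val pieces = math.max(1L, math.ceil(range.estimateMB.toDouble / splitSizeMB).toLong)
  val width = (range.end - range.start + 1) / pieces
  (0L until pieces).map { i =>
    val s = range.start + i * width
    val e = if (i == pieces - 1) range.end else s + width - 1
    // the estimate is spread evenly, even if the real bytes are not
    TokenRange(s, e, range.estimateMB / pieces)
  }
}

// split(TokenRange(1, 10000, 128), 64)
//   => Vector(TokenRange(1, 5000, 64), TokenRange(5001, 10000, 64))
// If partition 5001 alone holds all 128 MB, the first piece is really
// 0 MB and the second 128 MB, despite the equal estimates.
```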
I hope this helps to answer your question.