mishra.anurag643_153409 asked ·

How many partitions does Spark create reading a Cassandra table?

I am using:

cassandra: 3.11.3
spark: 2.3.2
connector: spark-cassandra-connector-2.3.2

I have run nodetool cfstats keyspace.table -H and it shows:

Space used (live): 33.8 GiB
Space used (total): 33.8 GiB

Since the replication factor is set to 3 and this is a three-node cluster, I assume the actual table size is roughly 33.8 / 3 ≈ 11.3 GiB.

I am reading this table from Spark, and Spark is creating 1443 partitions. I am wondering why Spark created so many partitions when the spark.cassandra.input.split.sizeInMB default is 64 MB.

I tried to set the value in Spark but I am getting the errors below:

spark.cassandra.input.split.size_in_mb
spark.cassandra.input.split_size_in_mb is not a valid Spark Cassandra Connector variable. No likely matches found.
spark.cassandra.input.split.size is not a valid Spark Cassandra Connector variable.
spark-cassandra-connector

Erick Ramirez answered ·

As Jarek already pointed out in his answer, the Spark connector estimates the size of the table using the values stored in the system.size_estimates table. As the table's name suggests, it contains estimates of the table sizes so they're not accurate.

For example, I have a table where, for all the token ranges in the size_estimates table:

  • the mean partition size is 1 MB
  • the number of partitions is 200,000

The estimated table size is:

estimated_table_size = mean_partition_size x number_of_partitions
                     = 1 MB x 200,000
                     = 200,000 MB

(See DataSizeEstimates.scala for details.)
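
If you want to sanity-check the raw numbers the connector works from, here is a rough spark-shell sketch. The keyspace and table names are placeholders, and since size_estimates holds per-node estimates this only gives a ballpark figure, not exactly what the connector computes:

// Run in spark-shell with the connector on the classpath
import com.datastax.spark.connector.cql.CassandraConnector
import scala.collection.JavaConverters._

val estimatedBytes = CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    "SELECT mean_partition_size, partitions_count FROM system.size_estimates " +
    "WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table'")  // placeholder names
    .all().asScala
    .map(row => row.getLong("mean_partition_size") * row.getLong("partitions_count"))
    .sum
}

println(s"Estimated table size: ${estimatedBytes / (1024L * 1024L)} MB")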

The connector then calculates the Spark partitions as:

spark_partitions = estimated_table_size / input.split.size_in_mb
                 = 200,000 MB / 64 MB
                 = 3,125

But again, this number is based on estimates so it is not completely accurate.
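
The same arithmetic as a few lines of Scala, using the example figures above (1 MB mean partition size, 200,000 partitions, and the 64 MB default split size):

val estimatedTableSizeMB = 1L * 200000L        // mean_partition_size (MB) x partitions_count
val splitSizeMB          = 64L                 // spark.cassandra.input.split.size_in_mb default
val sparkPartitions      = estimatedTableSizeMB / splitSizeMB
println(sparkPartitions)                       // 3125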

Using the equation above, the Spark connector estimated your table size as:

estimated_table_size = spark_partitions x input.split.size_in_mb
                     = 1443 x 64 MB
                     = 92,352 MB

or roughly 90 GB. But this size is based on whatever is in the system.size_estimates table of your cluster so it won't be accurate. Cheers!

jaroslaw.grabowski_50515 answered ·

Hi! SCC looks up the size estimates in the `system.size_estimates` table. For this SCC version, the parameter is called input.split.size_in_mb. Review the docs for details: https://github.com/datastax/spark-cassandra-connector/blob/b2.3/doc/reference.md#read-tuning-parameters.
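
For example, with connector 2.3.x you can set it on the SparkConf (or pass it with --conf to spark-submit) before the session is created. This is just a sketch; the host, keyspace, table and the 128 MB value are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder settings; a larger split size means fewer, larger Spark partitions
val conf = new SparkConf()
  .setAppName("read-cassandra-table")
  .set("spark.cassandra.connection.host", "10.0.0.1")
  .set("spark.cassandra.input.split.size_in_mb", "128")

val spark = SparkSession.builder().config(conf).getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

println(df.rdd.getNumPartitions)  // fewer partitions than with the 64 MB default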


1 comment

As its default value is 64 MB and the source data (the Cassandra table) is about 20 GB, I expected around 20 x 1024 / 64 ≈ 320 partitions, yet Spark is creating 1443 tasks (partitions).

I am just wondering why Spark created 1443 partitions and not about 320.
