mishra.anurag643_153409 asked Erick Ramirez answered

How many partitions does Spark create reading a Cassandra table?

I am using:

cassandra: 3.11.3
spark: 2.3.2
connector : spark-cassandra-connector-2.3.2

I ran `nodetool cfstats keyspace.table -H` and it shows:

Space used (live): 33.8 GiB
Space used (total): 33.8 GiB

Since the replication factor is set to 3 and this is a three-node cluster, I assume the actual table size is 33.8 / 3 ≈ 11.3 GiB.

I am reading this table from Spark and it is creating 1443 partitions. I am wondering why Spark created so many partitions, given that the spark.cassandra.input.split.sizeInMB default is 64 MB.

I tried to set the value in Spark but I am getting the errors below:

spark.cassandra.input.split_size_in_mb is not a valid Spark Cassandra Connector variable. No likely matches found.
spark.cassandra.input.split.size is not a valid Spark Cassandra Connector variable.
spark-cassandra-connector

Erick Ramirez answered

As Jarek already pointed out in his answer, the Spark connector estimates the size of the table using the values stored in the system.size_estimates table. As the table's name suggests, it contains estimates of the table sizes so they're not accurate.

For example, I have a table where, for all the token ranges in the size_estimates table:

  • the mean partition size is 1 MB
  • the number of partitions is 200,000

The estimated table size is:

estimated_table_size = mean_partition_size x number_of_partitions
                     = 1 MB x 200,000
                     = 200,000 MB

(See DataSizeEstimates.scala for details.)

The connector then calculates the Spark partitions as:

spark_partitions = estimated_table_size / input.split.size_in_mb
                 = 200,000 MB / 64 MB
                 = 3,125
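
To make the two formulas concrete, here is a minimal Scala sketch of the same arithmetic using the example numbers above. It only mirrors the calculation; the connector's actual implementation is in DataSizeEstimates.scala:

```scala
// Minimal sketch of the connector's split-count arithmetic,
// using the example values above. Not the actual connector code.
val meanPartitionSizeMB = 1L       // from system.size_estimates (example)
val numberOfPartitions  = 200000L  // from system.size_estimates (example)
val splitSizeInMB       = 64L      // input.split.size_in_mb default

val estimatedTableSizeMB = meanPartitionSizeMB * numberOfPartitions  // 200,000 MB
val sparkPartitions      = estimatedTableSizeMB / splitSizeInMB      // 3,125

println(s"estimated size: $estimatedTableSizeMB MB, partitions: $sparkPartitions")
```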

But again, this number is based on estimates so it is not completely accurate.
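
If you want to sanity-check those estimates, you can read the raw rows yourself. Below is a minimal sketch, assuming SCC 2.3.x is on the classpath; the host, keyspace, and table names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector.cql.CassandraConnector
import scala.collection.JavaConverters._

// Placeholder host/keyspace/table names; adjust for your cluster.
val conf = new SparkConf()
  .setAppName("size-estimates-check")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

CassandraConnector(sc.getConf).withSessionDo { session =>
  val rows = session.execute(
    "SELECT partitions_count, mean_partition_size FROM system.size_estimates " +
    "WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table'").all().asScala

  // Sum the per-token-range estimates, roughly as the connector does.
  val totalPartitions = rows.map(_.getLong("partitions_count")).sum
  val totalBytes = rows.map(r => r.getLong("partitions_count") * r.getLong("mean_partition_size")).sum
  println(s"estimated partitions: $totalPartitions, estimated size: ${totalBytes / (1024 * 1024)} MB")
}
```

If the numbers look stale, running `nodetool refreshsizeestimates` on each node forces Cassandra to recompute them.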

Using the equation above, the Spark connector estimated your table size as:

estimated_table_size = spark_partitions x input.split.size_in_mb
                     = 1443 x 64 MB
                     = 92,352 MB

or roughly 90 GB. But this size is based on whatever is in the system.size_estimates table of your cluster so it won't be accurate. Cheers!


jaroslaw.grabowski_50515 answered mishra.anurag643_153409 commented

Hi! The SCC looks up the size estimates in the `system.size_estimates` table. For this SCC version, the parameter is called `input.split.size_in_mb`. Review the docs for details: https://github.com/datastax/spark-cassandra-connector/blob/b2.3/doc/reference.md#read-tuning-parameters.
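
For reference, here is a minimal sketch of setting that parameter when building the session. The host, keyspace, and table names are placeholders, and 128 MB is just an arbitrary value to show the effect on the partition count:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder connection details; the parameter name is the
// SCC 2.3.x spelling from the reference doc linked above.
val spark = SparkSession.builder()
  .appName("split-size-example")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .config("spark.cassandra.input.split.size_in_mb", "128")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// A larger split size yields fewer (bigger) Spark partitions, still estimate-based.
println(df.rdd.getNumPartitions)
```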


mishra.anurag643_153409 commented:

By default its value is 64 MB and the source data (the Cassandra table) is about 20 GB, yet Spark is creating 1443 tasks (partitions). As per the calculation, 20 × 1024 / 64 = 320.

I am just wondering why Spark created 1443 partitions and not ~320.
