michael4338 asked:

How can I set the Spark partitions for reading with the connector?

I'm using a PySpark DataFrame to load a Cassandra table like this:

df = spark.read.table(CASSANDRA_TABLE_NAME).filter(xxx).select(c1, c2, ..., cn)

But I found it's pretty slow when a column holds large binary data: reading 5 GB from C* takes around 100 s, so the throughput is about 50 MB/s, far below the SSD I/O speed (500 MB/s). My C* cluster runs on SSDs, with 3 nodes, each with 20 GB of memory and 2 CPUs, and every other setting is basically at its default. I have 5 executors for Spark.

By debugging I see that the number of partitions after reading from C* is 1, which does not look reasonable since the number of partitions in C* is 9, and I've set a very small value (1 MB) for "spark.sql.files.maxPartitionBytes".

df.rdd.getNumPartitions()   # this value is 1

So I am wondering whether that throughput number looks normal, and how to set the number of Spark partitions for reading. Thank you very much!

spark-cassandra-connector

1 Answer

jaroslaw.grabowski_50515 answered:

Hi!

spark.sql.files.maxPartitionBytes applies only to file-based data sources, so it has no effect on Cassandra reads.

You can tune spark.cassandra.input.split.sizeInMB to adjust the number of Spark partitions created by the Spark Cassandra Connector (SCC).
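
For example, a minimal PySpark sketch of setting this at session creation and reading through the connector's data source (this assumes the connector is on the classpath; my_keyspace and my_table are placeholder names, not from the question):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    # A smaller split size makes SCC create more, smaller Spark partitions.
    # 64 is only an illustrative value; tune it for your data volume.
    .config("spark.cassandra.input.split.sizeInMB", "64")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")  # placeholder names
    .load()
)

print(df.rdd.getNumPartitions())  # should increase as split.sizeInMB decreases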

1 comment

Thank you so much!

Tried that and set it to 1 MB, but it's still not making a difference. The number of Spark partitions is still one, and there is still only one task for the loading stage in the Spark UI. Attaching some screenshots here to see if they help...
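
For reference, the setting can also be passed directly on the reader; a minimal sketch with placeholder keyspace/table names:

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    # Per-read option; overrides the session-level value for this read only.
    .option("spark.cassandra.input.split.sizeInMB", "1")
    .options(keyspace="my_keyspace", table="my_table")  # placeholder names
    .load()
)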


[Screenshots attached: screen-shot-2020-11-03-at-94824-am.png, screen-shot-2020-11-03-at-94941-am.png — Spark UI showing the single load task.]
