
sarit.hetc_169063 asked ·

Issue in Cassandra read with Spark Cassandra Connector 2.5.0

We are using Spark Cassandra Connector 2.5.0 to read from a Cassandra table with approximately 400 columns and 10 million records. The table has 64 partitions, and we can increase or decrease that number.

Our goal is to fetch more than one partition in a single read within the Spark job. We are using an IN clause in Spark SQL on the partition key, e.g.:

spark.sql("select * from <tableName> where <PartitionKey> in ('2021-03-12_0','2021-03-12_1','2021-03-12_2','2021-03-12_3','2021-03-11_4','2021-03-11_5','2021-03-11_6','2021-03-11_7')").

Cassandra: 8 nodes with 3 CPUs each.

Spark: 4 executors, with 2 cores and 6 GB of memory each.

Spark Cassandra Connector configuration used: spark.cassandra.input.split.sizeInMB = 64

We have also tried larger values for this parameter, such as 512 and 1024.
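
For reference, a minimal sketch of how we pass this setting when building the session (the builder wiring below is illustrative, not our exact code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-read")
  // Target size in MB of each Cassandra input split; also tried 512 and 1024.
  .config("spark.cassandra.input.split.sizeInMB", "64")
  .getOrCreate()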

However, we have observed in the Spark UI that only one Spark task reads all the partitions for the above query.

We would appreciate a prompt response as we are stuck on this. Thanks in advance.

spark-cassandra-connector

jaroslaw.grabowski_50515 answered ·

You may want to look into spark.cassandra.sql.inClauseToJoinConversionThreshold. Lowering it makes the connector automatically convert IN clauses like yours into JoinWithCassandraTable operations, which Spark can parallelise across tasks.
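
For illustration, a minimal sketch of lowering the threshold (the value 1 is an assumption chosen so that even a short IN list triggers the conversion; apply the setting before the query is planned):

// Any IN clause whose key-combination count exceeds this threshold
// is converted to a JoinWithCassandraTable under the hood.
spark.conf.set("spark.cassandra.sql.inClauseToJoinConversionThreshold", "1")

val df = spark.sql("select * from <tableName> where <PartitionKey> in ('2021-03-12_0','2021-03-12_1','2021-03-12_2','2021-03-12_3')")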

sarit.hetc_169063 commented ·

We are performing a comparative study for our Spark job: loading 10M records (75 GB) into Hive tables takes less than 5 minutes, whereas writing the same data to Cassandra takes approximately 40 minutes.

Similarly for reads: setting "spark.cassandra.sql.inClauseToJoinConversionThreshold" to a low value significantly improved read performance, bringing the Cassandra read down to approximately 5 minutes (75 GB), but reading from the Hive table completes in less than 1 minute.

Can you please suggest any Spark Cassandra Connector configuration that might increase write performance?

Could you also point us to performance benchmarks for the Spark Cassandra Connector, if available?

Erick Ramirez answered ·

When you use the CQL IN operator, the Cassandra coordinator is responsible for firing multiple requests to the replicas to process the single CQL query.

There is nothing for Spark to parallelise since the query cannot be partitioned. For this reason, you will only see 1 task for Spark to execute. Cheers!
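
For contrast, a minimal sketch (with hypothetical keyspace, table, and column names) of the JoinWithCassandraTable path that the conversion above targets, where Spark distributes the key lookups across tasks:

import com.datastax.spark.connector._

// Hypothetical key wrapper; the field name must match the table's
// partition key column.
case class Pk(partition_key: String)

val sc = spark.sparkContext
val keys = Seq("2021-03-12_0", "2021-03-12_1", "2021-03-12_2")

// Each task issues direct single-partition reads for its share of the
// keys, instead of routing one IN query through a single coordinator.
val rows = sc.parallelize(keys.map(Pk))
  .joinWithCassandraTable("my_keyspace", "my_table")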
