sarit.hetc_169063 asked:

Issue in Cassandra read with Spark Cassandra Connector 2.5.0

We are using Spark Cassandra Connector 2.5.0 to read from a Cassandra table with approximately 400 columns and 10 million records. The table has 64 partitions, and we can increase or decrease the number of partitions on the table.

Our goal is to fetch more than one partition in a single read within the Spark job, so we are using an IN clause in Spark SQL on the partition key, e.g.:

spark.sql("select * from <tableName> where <PartitionKey> in ('2021-03-12_0','2021-03-12_1','2021-03-12_2','2021-03-12_3','2021-03-11_4','2021-03-11_5','2021-03-11_6','2021-03-11_7')")

Cassandra: 8 nodes with 3 CPUs each.

Spark: 4 executors with 2 cores and 6 GB of memory each.

Spark Cassandra Connector configuration used: spark.cassandra.input.split.sizeInMB = 64

We have also tried several other values for this parameter, such as 512 and 1024.
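
For reference, a minimal sketch of how we supply these settings when building the session and registering the table for Spark SQL; the host, keyspace, and table names below are placeholders, not our actual values:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-read-test")
  .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
  .config("spark.cassandra.input.split.sizeInMB", "64")        // also tried 512, 1024
  .getOrCreate()

// Register the Cassandra table so it can be queried with spark.sql(...)
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "tbl"))          // placeholder names
  .load()
  .createOrReplaceTempView("tbl")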

However, we have observed in the Spark UI that only 1 Spark task reads all of the partitions for the above query.

We would appreciate a prompt response, as we are stuck on this. Thanks in advance.

spark-cassandra-connector

jaroslaw.grabowski_50515 answered:

You may want to look into spark.cassandra.sql.inClauseToJoinConversionThreshold. Adjusting it to a low value will automatically convert your IN clauses into JoinWithCassandraTable operations, which Spark can split across tasks.
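
For example, a minimal sketch, assuming the session is rebuilt with the property lowered below the number of IN values; table and column names are placeholders:

import org.apache.spark.sql.SparkSession

// With the threshold below the cross product of the IN value sets (8 keys in
// your query), the connector plans the query as a JoinWithCassandraTable
// instead of pushing a single IN predicate to one coordinator.
val spark = SparkSession.builder()
  .config("spark.cassandra.sql.inClauseToJoinConversionThreshold", "1")
  .getOrCreate()

spark.sql("select * from tbl where pk in ('2021-03-12_0','2021-03-12_1')") // placeholder names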


sarit.hetc_169063 commented:

We are performing a comparative study for our Spark job: loading 10M records (75 GB) into Hive tables takes less than 5 minutes, whereas loading the same data into Cassandra takes approximately 40 minutes.

Similarly, for reads: setting "spark.cassandra.sql.inClauseToJoinConversionThreshold" to a low value significantly improved read performance, and reading 75 GB from the Cassandra table now takes approximately 5 minutes, but reading the same data from the Hive table completes in less than 1 minute.

Can you please suggest any Spark Cassandra Connector configuration that might improve write performance?

We would also appreciate any pointers to a performance benchmark for the Spark Cassandra Connector, if one is available.

Erick Ramirez answered:

When you use the CQL IN operator, the Cassandra coordinator is responsible for firing multiple requests to the replicas to process the single CQL query.

There is nothing for Spark to parallelise since the query cannot be partitioned. For this reason, you will only see 1 task for Spark to execute. Cheers!
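
If you need those lookups spread across tasks, one option is the RDD-level join that the connector's IN-to-join conversion maps to. A minimal sketch, assuming a keyspace ks, table tbl, and a partition key column named pk (all placeholders):

import com.datastax.spark.connector._

// One Cassandra partition key per element; the field name must match the
// partition key column. Each key is fetched as its own request, so Spark can
// distribute the lookups across executors instead of funnelling a single IN
// query through one coordinator.
case class Key(pk: String)

val keys = spark.sparkContext.parallelize(Seq(
  Key("2021-03-12_0"), Key("2021-03-12_1"), Key("2021-03-12_2")))

val rows = keys.joinWithCassandraTable("ks", "tbl")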
