Bringing together the Apache Cassandra experts from the community and DataStax.

purijatin_149351 asked · jaroslaw.grabowski_50515 commented

Inserts slower after upgrading to 3.0.0-beta Spark connector

I think I am definitely doing something wrong.

We have upgraded our Spark project from 2.5 to 3.x (minimal Spark changes for the upgrade) and are trying Spark Cassandra Connector `3.0.0-beta`. With minimal changes the code works (a few deprecation warnings).

Suddenly our total test-run duration has increased by 50%. When I profiled the execution, the Spark job duration hasn't changed much, but the Cassandra insertions have become significantly slower (taking 2x as long in multiple places).

Simple code like the one below takes 3-4 seconds, where it previously took <1 sec (consistent behavior):

def createKeySpace(keyspace: String, dropBeforeCreating: Boolean = true)
                  (implicit connector: CassandraConnector): Unit = {
  if (dropBeforeCreating) {
    connector.withSessionDo { s =>
      s.execute(s"DROP KEYSPACE IF EXISTS $keyspace;")
    }
  }
  connector.withSessionDo { s =>
    s.execute(s"CREATE KEYSPACE IF NOT EXISTS $keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };")
  }
}
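For measuring single calls like this one, a minimal nanoTime-based timer is enough (a sketch; `timed` is a hypothetical helper, not part of the connector):

```scala
// Minimal single-shot timer (a sketch, not a rigorous benchmark):
// returns the body's result together with the elapsed wall-clock seconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

// Usage against the method above (hypothetical):
// val (_, secs) = timed(createKeySpace("test_ks"))
// println(f"createKeySpace took $secs%.2f s")
```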

And the code below again takes 40% more time:

// Deprecation warning below. Not sure how to create tables along with indexes using the Catalog API.
df.createCassandraTable(keyspace, tableName,
  partitionKeyColumns = Some(List(term)),
  clusteringKeyColumns = Some(List(batchId)))
df.write.cassandraFormat(tableName, keyspace).mode(SaveMode.Append).save()
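For the table itself (indexes aside), the Catalog-API route is a Spark SQL DDL statement. A sketch of the equivalent DDL, assuming a catalog registered as `cass` and illustrative column names and types (all assumptions):

```scala
// Builds the Spark SQL DDL that replaces the deprecated createCassandraTable
// call (sketch: catalog name, column names and types are assumptions).
def catalogCreateTableDdl(catalog: String, keyspace: String, table: String): String =
  s"""CREATE TABLE $catalog.$keyspace.$table (term STRING, batchId INT, value STRING)
     |USING cassandra
     |PARTITIONED BY (term)
     |TBLPROPERTIES (clustering_key='batchId.asc')""".stripMargin

// Usage (hypothetical): spark.sql(catalogCreateTableDdl("cass", keyspace, tableName))
```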

Cassandra version (same): 3.11.3


  1. We have 30 tests (all interact with Cassandra) which previously took 180 seconds in total. They now take 410 seconds.
  2. In every test, the keyspace is first dropped and a new one is created.
    1. Previously this was sub-second; it now takes 3-4 seconds. So roughly (30 * 2) ≈ 70 seconds of the increase is accounted for.
  3. The profiling suggests that procuring a session is taking longer than usual.
  4. At several other places, the insertions were slower by 10-20%.

Here is the comparison when we tried inserting three different DataFrames/RDDs:

              2.4x (secs)   3.0.0-alpha (secs)
Dataframe-1   8.21          10.25
              537           567
RDD-1         1563          1740


1 Answer

jaroslaw.grabowski_50515 answered · jaroslaw.grabowski_50515 commented

Hi purijatin,

Indeed, there is no way to create an index through the Catalog API. You have to work with `Session` to achieve this.

We will certainly look into the throughput problems before we release 3.0.0. Here is a JIRA where we are going to track the progress.
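A sketch of that raw-session route (the helper and the index name are assumptions, not connector API):

```scala
// Builds the CQL for a secondary index; the Catalog API cannot create
// indexes, so this statement must go through a raw session
// (the derived index name is an assumption).
def createIndexCql(keyspace: String, table: String, column: String): String =
  s"CREATE INDEX IF NOT EXISTS ${table}_${column}_idx ON $keyspace.$table ($column)"

// Usage with the connector from the question (hypothetical):
// connector.withSessionDo(_.execute(createIndexCql(keyspace, tableName, "term")))
```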


Thanks Jaroslaw for the update. Here is more information, if it helps:

[Update posted in the original question above]




Hi! I looked at the withSessionDo performance and I don't see any difference between 2.5.x and 3.0.x. Here is my code:

Run with:

sbt/sbt 'performance/jmh:run -i 5 -wi 5 -f1 -t1'

It takes around 1 second to create a new session and create a keyspace on my machine.
