
purijatin_149351 asked ·

Inserts slower after upgrading to 3.0.0-beta Spark connector

I think I am definitely doing something wrong.

We have upgraded our Spark project from 2.5 to 3.x (minimal Spark changes for the upgrade) and are trying the spark-cassandra-connector `3.0.0-beta`. With minimal changes (a few deprecation warnings), the code works.

Suddenly our total test-run duration has increased by 50%. When I profiled the execution, the Spark job duration hasn't changed much, but the Cassandra insertions have become significantly worse (taking 2x as long in multiple places).

Simple code like the snippet below takes 3-4 seconds, where it previously took <1 second (consistent behavior):

import com.datastax.spark.connector.cql.CassandraConnector

// Drops (optionally) and re-creates a keyspace with SimpleStrategy, RF=1.
def createKeySpace(keyspace: String, dropBeforeCreating: Boolean = true)
                  (implicit connector: CassandraConnector): Unit = {
  if (dropBeforeCreating) {
    connector.withSessionDo { s =>
      s.execute(s"DROP KEYSPACE IF EXISTS $keyspace;")
    }
  }
  connector.withSessionDo { s =>
    s.execute(s"CREATE KEYSPACE IF NOT EXISTS $keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };")
  }
}

And the code below again takes ~40% more time:

// Deprecation warning below. Not sure how to create tables along with indexes using the Catalog API.
import com.datastax.spark.connector._     // adds createCassandraTable on DataFrame
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.cassandra._   // adds cassandraFormat on DataFrameWriter

df.createCassandraTable(keyspace, tableName,
  partitionKeyColumns = Some(List(term)),      // `term` and `batchId` hold column names
  clusteringKeyColumns = Some(List(batchId))
)
df.write.cassandraFormat(tableName, keyspace).mode(SaveMode.Append).save()
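
(As an aside, with connector 3.0 the table creation can also go through the new Catalog API via Spark SQL instead of the deprecated `createCassandraTable`. A rough sketch follows; the catalog name `cass`, the column list, and the `clustering_key` table property are assumptions based on the connector's catalog support, and since schema inference from `df` is lost, all columns have to be listed explicitly. This does not cover the index part.)

// Rough sketch, assumptions noted above: register a Cassandra catalog named "cass"...
spark.conf.set("spark.sql.catalog.cass",
  "com.datastax.spark.connector.datasource.CassandraCatalog")

// ...then create the table with Spark SQL; the column list here is illustrative only.
spark.sql(
  s"""CREATE TABLE cass.$keyspace.$tableName ($term STRING, $batchId INT, value STRING)
     |USING cassandra
     |PARTITIONED BY ($term)
     |TBLPROPERTIES (clustering_key='$batchId.asc')""".stripMargin)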

Cassandra version (unchanged): 3.11.3

[UPDATE]

  1. We have 30 tests (all of which interact with Cassandra) that previously took 180 seconds in total. They now take 410 seconds.
  2. In every test, the keyspace is first dropped and a new one is created.
    1. Previously this was sub-second; it now takes 3-4 seconds, so roughly 70 seconds of the total increase (~2-2.5 extra seconds for each of the 30 tests' drop/create pair) is now accounted for.
  3. Profiling suggests that procuring a session is taking longer than it used to.
  4. At several other places, the insertions were 10-20% slower.


Here is the comparison when we tried inserting three different dataframes/RDDs:


              2.4x (secs)   3.0.0-alpha (secs)
Dataframe-1   8.21          10.25
Dataframe-2   537           567
RDD-1         1563          1740



1 Answer

jaroslaw.grabowski_50515 answered ·

Hi purijatin,

Indeed, there is no way to create an index through the Catalog API. You have to work with `Session` to achieve this.
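
For example, a minimal sketch using the connector's `CassandraConnector` (the index and column names are placeholders):

connector.withSessionDo { s =>
  // A secondary index still has to be created with plain CQL through the driver session.
  s.execute(s"CREATE INDEX IF NOT EXISTS ${tableName}_term_idx ON $keyspace.$tableName ($term);")
}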

We will certainly look into the throughput problems before we release 3.0.0. Here is the Jira where we are going to track the progress: https://datastax-oss.atlassian.net/browse/SPARKC-614

1 comment

Thanks Jaroslaw for the update. Here is some more information, in case it helps:

[Update posted in the original question above]

Regards,

Jatin
