I think I am definitely doing something wrong.
We have upgraded our Spark project from 2.5 to 3.x (with minimal Spark changes for the upgrade) and are trying the spark-cassandra connector `3.0.0-beta`. The connector switch also needed only minimal changes (a few deprecation warnings, but the code works).
Suddenly our total test run duration has increased by 50%. When I profiled the execution, the Spark job duration hasn't changed much, but the Cassandra insertions have become significantly slower (taking 2x as long in multiple places).
Simple code like the following consistently takes 3-4 seconds, where it previously took under 1 second:

    import com.datastax.spark.connector.cql.CassandraConnector

    def createKeySpace(keyspace: String, dropBeforeCreating: Boolean = true)
                      (implicit connector: CassandraConnector): Unit = {
      // Optionally drop the keyspace first so every test starts from a clean state.
      if (dropBeforeCreating) {
        connector.withSessionDo { s =>
          s.execute(s"DROP KEYSPACE IF EXISTS $keyspace;")
        }
      }
      connector.withSessionDo { s =>
        s.execute(s"CREATE KEYSPACE IF NOT EXISTS $keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };")
      }
    }

And the code below takes 40% more time than before:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.cassandra._

    // Deprecation warning below. Not sure how to create tables along with indexes using the catalog API.
    df.createCassandraTable(
      keyspace,
      tableName,
      partitionKeyColumns = Some(List(term)),
      clusteringKeyColumns = Some(List(batchId))
    )
    df.write.cassandraFormat(tableName, keyspace).mode(SaveMode.Append).save()

Cassandra version (same): 3.11.3
[UPDATE]
- We have 30 tests (all of which interact with Cassandra) that previously took 180 seconds in total. They now take 410 seconds.
- In every test, the keyspace is first dropped and a new one is created.
- Previously this was sub-second; it now takes 3-4 seconds. That accounts for roughly 70 seconds of the total increase (30 tests x 2 to 3 extra seconds each).
- Going by the profiling, acquiring a session seems to be taking longer than it used to.
- At several other places, the insertions were slower by 10-20%
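
To separate the cost of acquiring a session from the cost of the statements themselves, a small wall-clock timer along these lines could be wrapped around each step. This is only a sketch: `Timing` and `timed` are hypothetical names introduced here, not part of Spark or the connector.

```scala
object Timing {
  /** Runs `block` once and returns its result together with the elapsed wall-clock time in milliseconds. */
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}

// Possible usage (assumes an implicit `connector: CassandraConnector` is in scope):
//   val (_, acquireMs) = Timing.timed { connector.withSessionDo(_ => ()) }  // session acquisition only
//   val (_, totalMs)   = Timing.timed { createKeySpace("test_ks") }         // acquisition plus DDL
```

Comparing the two numbers under both connector versions should show whether the extra seconds come from getting the session or from executing the DDL.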
Here is the comparison when we tried inserting three different DataFrames/RDDs:

| | 2.4x (secs) | 3.0.0-alpha (secs) |
|---|---|---|
| Dataframe-1 | 8.21 | 10.25 |
| Dataframe-2 | 537 | 567 |
| RDD-1 | 1563 | 1740 |
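
For reference, the relative slowdown implied by these timings can be worked out as below. This is a minimal sketch; `Slowdown` and `percentIncrease` are hypothetical names, and the numbers are copied from the measurements above.

```scala
object Slowdown {
  // (label, 2.4x seconds, 3.0.0-alpha seconds), taken from the measurements above.
  val timings: Seq[(String, Double, Double)] = Seq(
    ("Dataframe-1", 8.21, 10.25),
    ("Dataframe-2", 537.0, 567.0),
    ("RDD-1", 1563.0, 1740.0)
  )

  /** Relative increase of `after` over `before`, in percent. */
  def percentIncrease(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  def main(args: Array[String]): Unit =
    timings.foreach { case (label, before, after) =>
      println(f"$label%-12s ${percentIncrease(before, after)}%.1f%%")
    }
}
```

That works out to roughly 25%, 6% and 11% slower respectively, so the small insert is hit hardest in relative terms.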