aashish.aiesec_188335 asked:
How can I minimise the impact of a full table scan when running Spark against Cassandra?

I am replicating an Oracle table to Cassandra and I want to reconcile the Oracle table with the Cassandra table. I have two options, but I'm not sure which one would have the least impact on the C* cluster:

1. select * from cassandra_table

2. Do a join with Cassandra using an Oracle RDD obtained by running a SQL query against the Oracle database.

In the end both options are going to pull the full data set from the C* table, since every row will match in the join. I want to use the option with the least impact on the C* cluster, because this reconciliation runs on a production cluster that is used by other processes.

Any suggestions would be really helpful. Thank you.

Tags: cassandra, spark-cassandra-connector, join

1 Answer

David Jones-Gilardi answered:

Any time you execute "select * from cassandra_table" you are forcing a scan of all of your nodes to retrieve that data. If your data set is large enough, not only will the query time out (because C* expects quick response times for OLTP workloads), it will also potentially put a lot of strain on your nodes. Again, this has a lot to do with how much data you are talking about, but I recommend against it.

Since you mentioned an Oracle RDD I assume you are referring to a Spark implementation. If you have a Spark option with your C* cluster and need to do something like a "select *", this is the preferred option IMO. Spark is designed for longer-running workloads. However, if you only have one datacenter this will still impact your OLTP workloads, although it should be a lot more manageable than simply running "select *" directly.

If you have more than one datacenter in your cluster, or preferably a datacenter dedicated to OLAP operations that replicates with your production datacenter, then you can issue Spark jobs against that datacenter to copy your data without worrying that it will affect your production OLTP workloads. Without knowledge of your cluster topology, though, I can't really say one way or the other.
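
As a rough illustration of the dedicated-datacenter setup (not part of the original answer): the spark-cassandra-connector can pin its connections to a single datacenter. A minimal sketch, assuming a hypothetical analytics DC named "DC2_ANALYTICS" and connector 2.x property names, which may differ in your version:

import org.apache.spark.sql.SparkSession

// Point the connector at the analytics DC so the full-table scan never
// touches the OLTP datacenter. The host and DC name are placeholders.
val spark = SparkSession.builder()
  .appName("oracle-cassandra-reconciliation")
  .config("spark.cassandra.connection.host", "10.0.0.1")         // contact point in the analytics DC
  .config("spark.cassandra.connection.localDC", "DC2_ANALYTICS") // ignore nodes outside this DC
  .getOrCreate()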

In general, if your only two choices are "select *" and "do it with Spark", I would choose the Spark option. If you can provide more information on your cluster topology and the size of your data, I can give you a better idea.
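
If the job has to run against the production datacenter anyway, the connector also exposes throttling and split-size settings that can soften the scan. A minimal sketch; the property names below follow the 2.x reference and the values are illustrative only, so verify both against your connector version:

import org.apache.spark.SparkConf

// Throttle reads so the scan competes less with live OLTP traffic.
val conf = new SparkConf()
  .set("spark.cassandra.input.reads_per_sec", "500")   // cap read requests per core per second
  .set("spark.cassandra.input.split.size_in_mb", "16") // smaller splits spread work across more tasks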


aashish.aiesec_188335 commented:

Thanks. I am using Spark in both cases, shown below. Here emp_id is the partition key, but I wanted to know if there is any difference between these two approaches in terms of load on Cassandra (considering that even with the join it is going to fetch all data from the C* table, since every row will match in the join).

The table has around 100k records.

aashish.aiesec_188335 commented:

Option 1

// Register the Cassandra table as a Spark SQL temporary view.
val createDDL = s"""CREATE TEMPORARY VIEW myTable
     USING org.apache.spark.sql.cassandra
     OPTIONS (
       table "my_table",
       keyspace "my_keyspace",
       pushdown "true")"""

spark.sql(createDDL)
// Full table scan: executors read every token range of my_table.
val CassDF = spark.sql("select emp_id, col2, colN from myTable")


Option 2

import com.datastax.spark.connector._

val OracleDF = spark.read
  .format("jdbc")
  .option("url", oracleUrl)
  .option("dbtable", sqlAliased)
  .option("user", oracleUser)
  .option("password", oraclePassword)
  .load()

// joinWithCassandraTable expects an RDD of tuples (or case classes), so map
// each Row to a Tuple1 of the partition key (adjust getLong to emp_id's type).
val oracleRDD = OracleDF.select("emp_id").rdd.map(row => Tuple1(row.getLong(0)))

// Direct partition-key lookups rather than a full token-range scan.
val CassDF = oracleRDD.joinWithCassandraTable(
  "my_keyspace", "my_table",
  selectedColumns = SomeColumns("emp_id", "col2", "colN"),
  joinColumns = SomeColumns("emp_id"))
David Jones-Gilardi commented:

Even in the case of 100k records I would recommend the Spark approach. There is a whole set of optimizations you get by using Spark for an operation like this compared to a simple CQL "select * from some_table". The main thing: even though each approach scans the nodes to get all of the table data, Spark does it in a more parallel and efficient manner, spreading the load of the operation more evenly around your Spark nodes, whereas a CQL "select *" essentially forces the whole process into a single operation and puts a lot more load on the coordinator. Not to mention there is a good chance you will simply time out going the CQL route anyway.
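
To finish the reconciliation once both sides are loaded, a set difference in each direction works. A minimal sketch, assuming oracleDF and cassDF are DataFrames with identical schemas (e.g. the frames built in the snippets above, with the Oracle side selecting the same emp_id, col2, ..., colN columns):

// Rows present on one side but not the other; except() compares whole rows.
val missingInCassandra = oracleDF.except(cassDF)
val unexpectedInCassandra = cassDF.except(oracleDF)

if (missingInCassandra.count() == 0 && unexpectedInCassandra.count() == 0) {
  println("Tables reconcile: no differences found")
} else {
  missingInCassandra.show(20, truncate = false)
  unexpectedInCassandra.show(20, truncate = false)
}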
