belgacea asked:

Spark read & write on the same table

Hello,

I have a Scala Spark batch job that reads data from a C* table and writes results to another. But I want to split this job into 2 parts, because some operations require a time overlap and others don't.

The first part does the minute aggregation (without overlap, to reduce the volume of data read & processed) and the second part computes other controls stored in additional columns (with a time overlap). But this means the second job is reading and writing data from/to the same table, and for some reason it seems to be much less efficient than the previous job, which did everything at once across different tables.
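For illustration, the second job boils down to something like this (a minimal sketch with the spark-cassandra-connector DataFrame API; the keyspace, table and column names are placeholders, not my real schema):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("controls-job").getOrCreate()

// Read back the rows the first job just wrote (with the time overlap).
val aggregated = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metrics", "table" -> "minute_agg"))
  .load()

// Compute the extra control columns...
val controls = aggregated.withColumn("control_flag", col("value") > lit(100))

// ...and write them back to the *same* table.
controls.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metrics", "table" -> "minute_agg"))
  .mode("append")
  .save()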

Should I avoid reading & writing from/to the same C* table in a single job? Or shouldn't it be an issue, and in that case, why can it be so slow? I tried using a temporary table in between the 2 jobs, and it's much faster now (about 6 times faster).

I was told that mixing in reads involves contention for some of the same resources. What kind of contention are we talking about?
Where does the contention come from? Is it a partitioning problem or something related to the data distribution in the cluster?
spark-connector

1 Answer

Russell Spitzer answered:

Reading and writing both require access to the same structures: memtables, disk I/O, CPU. All of these are shared, which means you can end up with multiple operations all interacting with the same objects.


In my experience, a pure workload will always be faster than a mixed one.

There are other things which could also have an effect, but it's hard to know without your specific data model and pattern of inserts.


In a more general sense, a Spark workload which uses a temporary disk-based store will be much faster if there are multiple subsequent interactions with the temporary table. Reading from disk is always much faster than reading through Cassandra. So a job that does:

Read from Cassandra, write to a temp table, do multiple operations on the temp table

would be much, much faster than:

Read from Cassandra, do operation
Read from Cassandra, do operation
Read from Cassandra, do operation
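To sketch the faster pattern in code (again the DataFrame API; the keyspace/table names and the Parquet path are placeholders, not anything specific to your job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("staged-job").getOrCreate()

// Read from Cassandra once and stage the result on disk.
val source = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "source_table"))
  .load()

source.write.mode("overwrite").parquet("/tmp/staging")

// Every subsequent operation hits local Parquet, not Cassandra.
val staged   = spark.read.parquet("/tmp/staging")
val perKey   = staged.groupBy("id").count()
val filtered = staged.filter(col("value") > 100)
val trimmed  = staged.select("id", "ts")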
