Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

anshita333saxena_187432 avatar image
anshita333saxena_187432 asked ·

Can we write with ALL or QUORUM with the Spark connector?

As per the document of spark-cassandra-connector, the default configuration of reading is local_one and writing is local_quorum.
Can we use consistency_level ALL/Quorum in writing when we have 2 Datacenters in the cluster?
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md

sparkconnector
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered ·

We don't recommend using a consistency of ALL with writes because it can't tolerate even just one node going down or being unresponsive. Think of a situation where a node is under GC pressure making it unresponsive, or a node going down as a result of a hardware failure. In these kind of situations, all writes from your Spark app/jobs will fail.

You can set spark.cassandra.output.consistency.level to your choice of consistency level but be aware of the trade offs. We recommend LOCAL_QUORUM since the analytics traffic gets isolated to one DC and minimises the chance of OLTP traffic getting affected. Cheers!

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Russell Spitzer avatar image
Russell Spitzer answered ·

Yes. Although i wouldn't recommend it.

One thing to note though is that by default the SCC uses a load balancing policy that still only picks coordinators in a single DC, so although you will get the consistency you desire, if the local dc is completely unavailable it will fail regardless of if Quorum can be reached without the local dc.

3 comments Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

@Russell Spitzer @Erick Ramirez Thanks for your quick and informative reply.

Actually, we have two tables in our cluster and we are going to use spark-cassandra-connector for reading from a table and writing to another table. Since the table size is too small (6GB) and we have a privilege of getting down-time, we are thinking to use ALL / Quorum consistency level so that the data should be written to all the nodes.
When we ran the SCC for this kind of job, it took 5.7 min for 2GB data with 1GB memory per executor and 16 cores.
Therefore, for this kind of use-case what is your recommendation?

0 Likes 0 · ·
Russell Spitzer avatar image Russell Spitzer anshita333saxena_187432 ·

All/Quorum don't actually change the number of nodes that the data is written too, it only changes when the write is acknowledged back to the client. All write CL's even "one" will write to all replicas, the difference is a ONE Cl will ack the write after a single replica says the write succeeded while ALL waits for all nodes before sending the ack.

1 Like 1 · ·

@Russell Spitzer if that is the case then we are thinking to get the acknowledgement from all the nodes so that if data is writing to the nodes then there will be no replication inconsistency that can take place such as, let's say, we are writing with local_quorum and data got written to some nodes and not all the nodes because some of the nodes are busy in doing some other work, if this scenario will occur, then I think if we collect the acknowledgement from all the nodes then good because that way we are ensured that data what we are going to write in one node is present in all the nodes and data which got failed is failed to write to all the nodes.
Second thought I can think of is, let's run the jobs with the default tuned parameter and then run the replication to ensure the data consistency across all the nodes in the cluster. (replication on 6GB will be much faster)
Among these two, what should be the best approach do you think?

0 Likes 0 · ·