peter.kovgan_176371 asked ·

Spark connector read failure. Why does Cassandra fail?

Hi,

I am reading from Cassandra and my jobs fail with an error (see at the bottom).
I "feel" that Cassandra is overloaded.
Should I limit the connector throughput? (I only saw a setting for max output throughput, and that is not available for input.)
Or should I make the Cassandra partitions smaller?
I have a cluster of 3 Cassandra nodes and RF=1.

Interestingly, Cassandra survives, but the Spark job fails.



Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
Tags: spark, performance, spark-connector, read

Erick Ramirez answered ·

@peter.kovgan_176371 without knowing what your Spark job is doing, you are right that the most common reason for the failure is overloaded nodes. You should really have a replication factor of 3 even when querying with LOCAL_ONE so as to spread the load across the nodes. Cheers!
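For reference, a minimal sketch of raising the replication factor from the Spark side (the keyspace name, datacenter name, and contact point are placeholders, and the same CQL can just as easily be run from cqlsh):

import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

object RaiseReplicationFactor {
  def main(args: Array[String]): Unit = {
    // Placeholder contact point; use your own cluster address.
    val conf = new SparkConf()
      .setAppName("raise-rf")
      .set("spark.cassandra.connection.host", "10.0.0.1")

    CassandraConnector(conf).withSessionDo { session =>
      // With RF=3 a single slow or paused node no longer holds the only copy of a row.
      session.execute(
        """ALTER KEYSPACE my_keyspace
          |WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}""".stripMargin)
    }
    // After raising the RF, run `nodetool repair my_keyspace` on each node
    // so the new replicas are populated with the existing data.
  }
}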

5 comments

Thanks Erick,
One additional symptom: the failures started when I began fetching 50 fields from the table instead of 5.
I will try changing the RF.

alex.ott replied to peter.kovgan_176371 ·

sometimes even with RF=3, Spark may overload the cluster at LOCAL_ONE...


Thanks Alex, I will try TWO or LOCAL_QUORUM (which amounts to the same thing here with RF=3).
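For reference, a minimal sketch of raising the connector's read consistency (assuming the 2.x open-source connector property name):

import org.apache.spark.SparkConf

// Reads will then require a quorum of local replicas instead of a single one.
val conf = new SparkConf()
  .set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")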


With RF=3 and LOCAL_QUORUM on reads I completed the test. However, one Cassandra node still died. Thanks.


@peter.kovgan_176371 even with RF=3, you can still overload your nodes. :)

Russell Spitzer answered ·

I agree with @Erick Ramirez: the main cause of a failure like this is a temporary GC pause on the node that has the data. Cassandra will not go down, but once the connector has exhausted its retries and timeouts (both configurable), the job will fail.

Increasing the replication factor improves the chance that a single node stuck in a pause will not bring down the request.

Other options are to (see the configuration sketch after this list for options 2–4):

  1. Use the read throttle in the DSE Spark Connector (not available in the open-source Spark Connector)
  2. Limit the number of partitions that the Cassandra read uses (see splitCount)
  3. Change the timeouts and increase the retry limit
  4. Limit the number of concurrent Spark executors
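A rough configuration sketch for options 2–4 (property names are from the 2.x open-source connector and may differ in other versions; the contact point and keyspace/table names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

object ThrottledCassandraRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("throttled-read")
      .set("spark.cassandra.connection.host", "10.0.0.1") // placeholder contact point
      .set("spark.cassandra.read.timeout_ms", "240000")   // option 3: longer per-read timeout
      .set("spark.cassandra.query.retry.count", "60")     // option 3: more retries before failing
      .set("spark.cores.max", "8")                        // option 4: cap total executor cores (standalone/Mesos)

    val sc = new SparkContext(conf)

    // Option 2: fewer Spark partitions per Cassandra scan via splitCount.
    val readConf = ReadConf.fromSparkConf(sc.getConf).copy(splitCount = Some(24))

    val rows = sc.cassandraTable("my_keyspace", "my_table") // placeholder keyspace/table
      .withReadConf(readConf)

    println(rows.count())
    sc.stop()
  }
}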
3 comments

Russell, thanks!
Where do we limit the number of concurrent Spark executors?


Is there some research license for the commercial connector? I am doing a POC...


With RF=3 and LOCAL_QUORUM on reads I completed the test. However, one Cassandra node still died. Thanks.
