question

peter.kovgan_176371 asked · Erick Ramirez edited

Why do Spark connector reads fail?

Hi,

I read from Cassandra and my jobs fail with an error (see the bottom of this post). My feeling is that Cassandra is overloaded.

Should I limit the connector's throughput? (I only found a setting for maximum output throughput; nothing equivalent seems to exist for input.) Or should I make the Cassandra partitions smaller? I have a cluster of 3 Cassandra nodes with RF=1.

Interestingly, Cassandra survives, but the Spark job fails.



Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
spark-cassandra-connector


Erick Ramirez answered:

@peter.kovgan_176371 without knowing what your Spark job is doing, you are right that the most common reason for the failure is overloaded nodes. You should really have a replication factor of 3 even when querying with LOCAL_ONE so as to spread the load across the nodes. Cheers!
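For reference, raising the replication factor is a schema change followed by a repair. A minimal sketch, assuming a keyspace named `my_keyspace` in a datacenter named `datacenter1` (both hypothetical; substitute your own names):

```sql
-- Hypothetical keyspace and datacenter names; adjust to your cluster
ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};
```

After the ALTER, run `nodetool repair my_keyspace` on each node so the new replicas actually receive the data.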


peter.kovgan_176371 commented:

Thanks Erick.
One additional symptom: the failures started when I began fetching 50 fields from the table instead of 5.
I will try changing the RF.

alex.ott replied to peter.kovgan_176371:

Sometimes even with RF=3, Spark may overload the cluster at LOCAL_ONE...

peter.kovgan_176371 replied to alex.ott:

Thanks Alex, I will try LOCAL_TWO or LOCAL_QUORUM (which amounts to the same thing here, since a quorum of 3 replicas is 2).
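For anyone following along, the read consistency level can be set from the Spark side; a sketch, assuming a spark-cassandra-connector version that supports this property (the exact name may vary across releases):

```shell
# Hedged: verify the property name against your connector version's docs
spark-submit \
  --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM \
  my-job.jar
```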

peter.kovgan_176371 commented:

With RF=3 and LOCAL_QUORUM on reads I completed the test. However, one Cassandra node still died anyway. Thanks.

Erick Ramirez ♦♦ replied to peter.kovgan_176371:

@peter.kovgan_176371 even with RF=3, you can still overload your nodes. :)

Russell Spitzer answered:

I agree with @Erick Ramirez: the main cause of a failure like this is a temporary GC pause on the node which has the data. Cassandra itself will not go down, but once the connector has exhausted its retries and timeouts (both configurable), the job will fail.

Increasing the replication factor improves the chance that a single paused node will not bring down the request.

Other options are to:

  1. Use the read throttle in the DSE Spark Connector (not available in the open-source Spark Connector)
  2. Limit the number of partitions that the Cassandra read uses (see splitCount)
  3. Change the timeouts and increase the retry limit
  4. Limit the number of concurrent Spark executors
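As a rough sketch of options 2–4 expressed as Spark configuration (property names are from the open-source spark-cassandra-connector and may differ between versions, so treat these as illustrative rather than exact):

```shell
# Hedged sketch; verify each property against your connector version's docs.
# Option 2: larger input splits mean fewer Spark partitions per scan
#           (see also the splitCount argument to cassandraTable for a hard cap).
# Option 3: longer timeouts and more retries before a read fails the job.
# Option 4: capping total cores limits how many reads run concurrently
#           (spark.cores.max applies in standalone mode).
spark-submit \
  --conf spark.cassandra.input.split.sizeInMB=256 \
  --conf spark.cassandra.read.timeoutMS=240000 \
  --conf spark.cassandra.query.retry.count=120 \
  --conf spark.cores.max=8 \
  my-job.jar
```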

peter.kovgan_176371 commented:

Russell, thanks!
Where do we limit the number of concurrent Spark executors?

peter.kovgan_176371 commented:

Is there some kind of research license for the commercial connector? I am doing a POC...
