Bringing together the Apache Cassandra experts from the community and DataStax.

question

tomas.bartalos_162914 asked:

Spark streaming gets stuck (cassandra-connector 3.0.0-beta)

Hello,

I have 3 Spark streams writing data to Cassandra using spark-cassandra-connector 3.0.0-beta. After 108324 completed tasks, the streams got stuck forever.

This is pretty dangerous behavior: after a huge number of successful tasks, the streams simply stopped writing data without any errors. The Cassandra DB is in a healthy state, with all nodes UP.


My Cassandra version is 3.11.5.

Driver thread dump: thread_dump_driver.txt

Executor thread dump: thread_dump_cassandra.pdf

All worker threads are stuck in:

CassandraConnector.closeResourceAfterUsed -> Session.refreshSchema

Is it possible to specify a timeout for operations like refreshSchema?

I recently reported an issue with streams getting stuck on stream start, but this one happened after a lot of completed tasks.

spark-cassandra-connector

Erick Ramirez answered:

Thanks for sticking with us. I suspect this is related to multiple issues being addressed by SPARKC-614.

Jarek Grabowski (@jaroslaw.grabowski_50515) has been working on this problem for a couple of weeks now. Spark Cassandra Connector v3.0.0 was released a few days ago, and I think it will address the issues you reported.

Please try it out and let us know how it goes. Thanks again. Cheers!


Thank you for your reply; I will switch to the stable v3.0.0. However, I don't think SPARKC-614 is related to this issue. It addresses a performance problem with creating many objects (changing def -> val), but this issue is about threads not being notified and getting stuck forever.

jaroslaw.grabowski_50515 answered:

Hi @tomas.bartalos_162914,

Could you please provide repro steps for this problem?

Do the executor threads wait on a schema refresh indefinitely? Is it possible that these particular threads moved on after the stack trace was generated?

If the executor threads are stuck forever, I'd expect an exception in the database/executor logs. Could you verify that?


It is possible to set a timeout for schema queries with this setting, but if there is a deadlock somewhere, the setting won't do much. The default timeout is 5 seconds.
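For context, a sketch of where such a timeout could be configured. This is an assumption on my part, since the link behind "this setting" is not preserved here: spark-cassandra-connector 3.x sits on the DataStax Java driver 4.x, whose HOCON configuration exposes both a general request timeout (`basic.request.timeout`) and a timeout specific to the queries issued during schema metadata refresh (`advanced.metadata.schema.request-timeout`):

```
# Hypothetical application.conf fragment for the DataStax Java driver 4.x
# used by spark-cassandra-connector 3.x; the values are illustrative,
# not recommendations.
datastax-java-driver {
  # upper bound for regular requests
  basic.request.timeout = 5 seconds
  # upper bound for queries issued during a schema refresh
  advanced.metadata.schema.request-timeout = 5 seconds
}
```

As the answer notes, a timeout only bounds individual queries; it will not unblock a thread that is parked in a deadlock.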


Hello, thank you for your reply. The executor threads were definitely waiting indefinitely (the average job time is 1-10 seconds, and they had been waiting for 47 hours). The threads were not moving; I made several dumps with the same stack traces.

We're running Cassandra on Kubernetes (the Bitnami Helm chart), and I found out that at the moment the stream got stuck, all of the Cassandra containers were killed by k8s and then immediately recreated.

There is no exception in the Spark driver logs.


Could you please create a minimal reproduction for this problem?

tomas.bartalos_162914 replied to jaroslaw.grabowski_50515:

I don't think I can provide minimal code to reproduce this. The behavior is non-deterministic and depends on timing. Based on my investigation, the steps to reproduce are:

  • start a Spark stream writing to Cassandra:
df.write.format("org.apache.spark.sql.cassandra").save()
  • kill the Cassandra nodes at the right time
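A rough sketch of what those two steps could look like on Kubernetes, purely as an illustration: the submit script name, connector coordinates, pod label selector, and namespace are all assumptions, not taken from this thread.

```shell
# Hypothetical repro sketch; script name, package coordinates, pod label,
# and namespace are assumptions for a Bitnami-style Cassandra deployment.

# 1. Start the streaming job that writes to Cassandra.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta \
  stream_to_cassandra.py &

# 2. Once batches are completing steadily, delete the Cassandra pods so
#    Kubernetes recreates them mid-write, mimicking the container restarts
#    observed at the moment the stream got stuck.
kubectl delete pod --selector app.kubernetes.io/name=cassandra --namespace cassandra
```

The hard part, as noted above, is timing: the deletion has to land while writes are in flight, so in practice one would repeat the kill step until the hang reproduces.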