anshita333saxena_187432 asked Russell Spitzer commented

Why would my Spark app intermittently extract no data from Scylla DB using spark-cassandra-connector?

When we fetch data using the spark-cassandra-connector, the output files are sometimes created empty.

Code Snippet:

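        import org.apache.spark.{SparkConf, SparkContext}
        import com.datastax.spark.connector._ // provides sc.cassandraTable

        // NOTE: the first key was blank in the post as captured; "spark.cassandra.connection.host"
        // (the connector's contact-point setting) is assumed here.
        val conf = new SparkConf(true).set("spark.cassandra.connection.host", settings.serverIP)
                                      .set("spark.cassandra.auth.username", settings.username)
                                      .set("spark.cassandra.auth.password", settings.password)
                                      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                                      .set("spark.executor.memory", settings.serverMemory)
                                      .set("spark.cassandra.input.split.size_in_mb", settings.inputSplitSizeInMb)
                                      .set("spark.eventLog.enabled", "true")
                                      .set("spark.cassandra.input.consistency.level", settings.consistencyLevel)

        val sc = new SparkContext(settings.masterURL, "structured_app", conf)

        // Select only the "aui" column for rows whose updateddate falls inside the window.
        val rdd_aui_state = sc.cassandraTable(settings.keyspace, settings.table)
                              .select("aui")
                              .where("updateddate > ?", settings.starttime)
                              .where("updateddate < ?", settings.endtime)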

1. Tried listing all the nodes of one datacenter as contact points to get the full extraction.

2. Tried clearing memory before starting the extraction.

Sometimes we capture the data, but sometimes the output files contain nothing: all part-0* files are empty.
I am just wondering why, even with all the nodes given as contact points, we are not able to capture the data. If we extract the same data again after some time, we do capture it.

Spark Cluster: 2 Nodes (1 Master, 1 Worker)
DB Cluster: 10 Nodes

Can you please suggest what might be causing this?


@Russell Spitzer can you please suggest why the data might not be extracted?
(Behaviour: sometimes the data is extracted completely, but sometimes the extraction is completely empty.)


We have seen that sometimes the data is recorded and sometimes it isn't.
For example:
For 10th May, the job returns no data, but if you run the same Spark code again 5 minutes later, you get the data.

Also, it is not specific to one application. It can happen with any application at any time. It is not frequent either: it happens about once in two days.

What is astonishing is that sometimes we get the data and sometimes we don't.


We have actually been struggling with this problem for the last several weeks. Because of it, our next steps are sometimes delayed, as we need to run the extraction job twice to capture the data.

Any early help is highly appreciated.


1 Answer

Russell Spitzer answered Russell Spitzer commented

While I'm not an expert on Scylla, and there may be an issue there (they also have a custom fork of the Spark Cassandra Connector), I can try to answer your other questions.

It is not important to list all of the nodes as contact points. The DataStax Java Driver will use the initial contact point to get information about all of the nodes in the cluster. Adding more contact points only increases your ability to connect even if one of the contact nodes is down. I'm assuming this functions correctly on Scylla as well.

What does matter is the consistency level. It is possible that replication has not completed and, with the default consistency level, not enough replicas are checked to give an accurate picture of the current state. If that is the case, you can try increasing the read consistency level.
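As a rough sketch of the consistency suggestion, the read level can be raised through the same `spark.cassandra.input.consistency.level` setting that already appears in the question. The value `"ALL"` below is only an illustrative choice, and the Scylla fork of the connector may behave differently.

```scala
import org.apache.spark.SparkConf

// Sketch only: raise the connector's read consistency level.
// "ALL" is an illustrative choice; "LOCAL_QUORUM" or "QUORUM" are weaker alternatives.
val conf = new SparkConf(true)
  .set("spark.cassandra.input.consistency.level", "ALL")
```

Keep in mind that a stricter level trades availability for accuracy: reads at `ALL` fail if any replica for a token range is unreachable.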

My last guess is that it has something to do with how you are specifying the start and end times. Perhaps some windows are simply empty, or they are being specified in the future? For example, if you ask for data between tomorrow and the day after tomorrow, that query will return nothing today and tomorrow, but in two days it will be fine.
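One way to rule out the time-window guess is to sanity-check the window before running the extraction. The sketch below is a minimal example; `WindowCheck` and `windowProblems` are hypothetical names (not part of the connector), and the timestamps are made up for illustration.

```scala
import java.time.Instant

object WindowCheck {
  // Hypothetical helper, not part of the connector: lists reasons a
  // (start, end) query window could legitimately return no rows at time `now`.
  def windowProblems(start: Instant, end: Instant, now: Instant): List[String] =
    List(
      (!start.isBefore(end)) -> "start is not before end, so the range is empty",
      start.isAfter(now) -> "start is in the future, so nothing can match yet",
      end.isAfter(now) -> "end is in the future, so a later run may return more rows"
    ).collect { case (true, message) => message }

  def main(args: Array[String]): Unit = {
    // Illustrative timestamps only; they are not taken from the thread.
    val now = Instant.parse("2020-05-10T12:00:00Z")
    val start = Instant.parse("2020-05-11T00:00:00Z") // "tomorrow"
    val end = Instant.parse("2020-05-12T00:00:00Z")   // "the day after tomorrow"
    windowProblems(start, end, now).foreach(println)
  }
}
```

Logging the result of such a check next to each job run would show whether the empty runs line up with windows that reach past the current time.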

Again, I'm not familiar with what bugs may be present in Scylla, so there may be something going wrong there, but hopefully the above points give you a starting point for debugging.


Thanks a lot @Russell Spitzer for giving directions here.
We actually specify local_quorum as the read consistency level. Our reasoning was that with local_one, the request might go to a replica that owns the token but has not yet received the data because replication has not completed, so it would respond that it has no data. With local_quorum, we make sure we get a response from two of the replicas.
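The reasoning above can be checked with a little arithmetic. The sketch below assumes a replication factor of 3 in the local datacenter (implied by "two of the replicas", but not stated in the thread); `QuorumCheck` and its methods are hypothetical names. It also shows that the guarantee depends on the write consistency level, which the thread does not mention.

```scala
object QuorumCheck {
  // Quorum size for a given replication factor: floor(rf / 2) + 1.
  def quorum(rf: Int): Int = rf / 2 + 1

  // A read is guaranteed to reach at least one replica that acknowledged
  // the write only when readReplicas + writeReplicas > rf.
  def overlaps(readReplicas: Int, writeReplicas: Int, rf: Int): Boolean =
    readReplicas + writeReplicas > rf

  def main(args: Array[String]): Unit = {
    val rf = 3 // assumption: replication factor in the local datacenter
    val q = quorum(rf)
    println(s"quorum for rf=$rf: $q")
    println(s"quorum read + quorum write overlap: ${overlaps(q, q, rf)}")
    println(s"quorum read + single-replica write overlap: ${overlaps(q, 1, rf)}")
  }
}
```

If the data is written at a level that waits for only one replica, a quorum read can still miss it until replication catches up, so the write level used by the ingesting application is worth checking as well.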


We run our queries in UTC, as we store the data in UTC. However, when we run the extraction job for, say, four hours of yesterday's data, we sometimes get no data (even though we have enough resources to serve the request), and after a few minutes (sometimes just 5 minutes) the same job returns the data.
Please let me know if you have other thoughts or experience that could help solve this.
I appreciate your suggestions, help and recommendations.

Russell Spitzer replied to anshita333saxena_187432:

Honestly, that sounds like a bug in either the Scylla fork of the connector or Scylla proper. I can't imagine why the same query would return different results at different times if you are writing and reading at quorum. The connector should be issuing the same requests (and partitioning in general) for both runs. If you had said data was disappearing, I would suspect clock skew, but data just suddenly appearing is weird.

After the job works once, do all subsequent runs also work?
