mrcyze_148473 asked ·

Spark-Connector: Read of empty table takes ~10 minutes

I have a Spark application that uses `sparkContext.cassandraTable[DomainObjectType](keyspace, table)`. The first time I run this app, the table is empty, yet the read takes ~10 minutes to complete. I'm struggling to understand why this is occurring. I see this across all environments, large and small in terms of resources.

1 Answer

Russell Spitzer answered ·

If there is no data to be read, the simplest explanation is that the time is coming from the overhead of setting up tasks and executing them in Spark. The only reason this would take ~10 minutes is if the table were being read as thousands of Spark tasks.

To check this, I would look at the Spark UI (port 4040 by default, on the node running the Spark application's driver) and see how many tasks are being generated.
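As a complement to the Spark UI, the partition count can also be printed from code, since each RDD partition becomes one task in the scan stage. This is only a minimal sketch: the keyspace and table names, the contact point, and the object name are placeholders, and it assumes the job is launched with `spark-submit` (so the master is supplied externally) with the Spark Cassandra Connector on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("partition-count-check")
      // Assumption: a Cassandra node is reachable at this address.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    try {
      // Placeholder keyspace/table; substitute your own.
      val rdd = sc.cassandraTable[CassandraRow]("my_keyspace", "my_table")
      // Computing the partitions does not read any rows; it only plans the scan.
      println(s"Planned Spark partitions (tasks): ${rdd.getNumPartitions}")
    } finally {
      sc.stop()
    }
  }
}
```

If the printed number is in the thousands for an empty table, the split estimate is the likely culprit rather than the read itself.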

If the number of tasks is very large (in the hundreds or thousands), a few things can cause it. The connector determines the number of tasks from the size of the Cassandra table as reported in the `system.size_estimates` table, but this can lead to extreme overestimates in a few edge cases.
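To make the relationship concrete, here is a deliberately simplified model of how an estimated table size turns into a task count: roughly one task per "split" of data. The function name, the 64 MB split size, and the rounding are illustrative assumptions, not the connector's actual code, which applies further adjustments.

```scala
object SplitEstimate {
  // Simplified model (assumption): one Spark task per `splitSizeMb` of
  // estimated table data, rounded up, with at least one task.
  def estimatedTasks(estimatedBytes: Long, splitSizeMb: Int): Long = {
    val splitBytes = splitSizeMb.toLong * 1024L * 1024L
    math.max(1L, (estimatedBytes + splitBytes - 1) / splitBytes)
  }

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024L * 1024L
    // An accurate estimate for an empty table yields a single task.
    println(estimatedTasks(0L, 64))
    // A 10 GiB estimate yields 160 tasks at 64 MB per split.
    println(estimatedTasks(10L * gib, 64))
    // A wildly inflated 1 TiB estimate yields 16384 tasks, data or not.
    println(estimatedTasks(1024L * gib, 64))
  }
}
```

The point of the sketch is that the task count follows the estimate, not the real data: an inflated estimate for an empty table still produces thousands of nearly empty tasks, each paying the scheduling overhead.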

Specifically, if the estimates are being made from an alternate DC and the DCs are not using vnodes, the distribution of token data can cause some big issues. In this case you can manually specify the number of tasks to create in the `ReadConf` for the RDD being read.
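A sketch of what setting the task count by hand might look like, using the connector's `ReadConf` and its `splitCount` field. The value 16, the keyspace and table names, and the contact point are placeholders chosen for illustration; check the `ReadConf` definition for your connector version, since its fields have changed between releases.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

object FixedSplitCountRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fixed-split-count-read")
      // Assumption: a Cassandra node is reachable at this address.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    try {
      // Request a fixed number of splits instead of relying on size estimates.
      // 16 is an arbitrary illustrative value; tune it to your cluster.
      val rdd = sc.cassandraTable[CassandraRow]("my_keyspace", "my_table")
        .withReadConf(ReadConf(splitCount = Some(16)))
      println(s"Rows read: ${rdd.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Note that `ReadConf(splitCount = Some(16))` leaves every other read setting at its default; to keep settings already present in the Spark configuration, copy an existing `ReadConf` and change only `splitCount`.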

Previously there was also a bug in which certain settings caused an overflow, producing a huge number of tasks even when no data was present, so be sure you are using the latest connector.
