Spark-Connector: Read of empty table takes ~10 minutes

I have a Spark application that uses `sparkContext.cassandraTable[DomainObjectType](keyspace, table)`. The first time I run this app, the table is empty, yet the read takes ~10 minutes to complete. I'm struggling to understand why this is happening. I see this across all environments, large and small in terms of resources.

If there is no data to be read, the simplest explanation is that the time is coming from the overhead of setting up tasks and executing them in Spark. The only reason this would take ~10 minutes is if the table were being read as thousands of Spark tasks.

To check this, I would look at the Spark UI (port 4040 by default, on the node running the Spark application's driver) and see how many tasks are being generated for the read stage.
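The same number can also be read from the driver without opening the UI, since Spark schedules one task per RDD partition. The following is a minimal sketch, assuming the connector is on the classpath and the job is launched with `spark-submit`; the connection host, keyspace, and table names are placeholders, not values from the question.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CountReadTasks {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection settings; replace with your own.
    val conf = new SparkConf()
      .setAppName("count-read-tasks")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Hypothetical keyspace and table names.
    val rdd = sc.cassandraTable("my_keyspace", "my_table")

    // Asking for the partitions only plans the read; no rows are fetched.
    // Each partition becomes one task in the read stage.
    println(s"Read is planned as ${rdd.partitions.length} tasks")

    sc.stop()
  }
}
```

If this prints a number in the thousands for an empty table, the slowness is almost certainly task overhead rather than data transfer.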

If the number of tasks is very large (in the hundreds or thousands), this can be caused by a few things. The number of tasks is derived from the size of the Cassandra table as reported in the `system.size_estimates` table, but this can lead to extreme overestimates in a few edge cases.

Specifically, if the estimates are being made from an alternate DC and the DCs are not using vnodes, the distribution of token data can cause some big issues. In this case you can manually specify the number of tasks to create in the `ReadConf` for the RDD being read.
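As a rough sketch of that override, assuming a connector version whose `ReadConf` exposes the `splitCount` option, the size-estimate-based planning can be bypassed per RDD. The host, keyspace, table, and the value `16` are placeholders chosen for illustration.

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf
import org.apache.spark.{SparkConf, SparkContext}

object FixedSplitRead {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection settings; replace with your own.
    val conf = new SparkConf()
      .setAppName("fixed-split-read")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Request a fixed number of splits instead of relying on size estimates.
    // 16 is an arbitrary example value; tune it to your cluster.
    val readConf = ReadConf(splitCount = Some(16))

    // Hypothetical keyspace and table names.
    val rdd = sc.cassandraTable("my_keyspace", "my_table")
      .withReadConf(readConf)

    // The resulting partition count should be close to the requested value,
    // though the connector may adjust it when grouping token ranges.
    println(s"Read is planned as ${rdd.partitions.length} tasks")

    sc.stop()
  }
}
```

Because the override is applied per RDD, other tables read in the same application keep the default estimate-based behaviour.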

Previously there was also a bug where an overflow with certain settings caused a huge number of tasks to be created even when no data was present, so be sure you are using the latest connector.

