rjain asked · jaroslaw.grabowski_50515 commented

Why is repartitionByCassandraReplica failing when Spark and Cassandra run in different containers?

I am running Spark 3.0.1 on Kubernetes and fetching data from Cassandra using the Spark Cassandra Connector.

My use case: a Spark JavaRDD<Key> is repartitioned by calling repartitionByCassandraReplica on table X, and the result is then joined with the same table using joinWithCassandraTable. This works on Spark Standalone, where Spark and Cassandra run on the same server: the Spark partitions are localized by repartitionByCassandraReplica before joinWithCassandraTable is called. When I try the same thing on Kubernetes, where Spark and Cassandra run in separate pods, repartitionByCassandraReplica seems to fail, because no data locality is obtained in the Spark container.

What am I missing to make this work? How can network shuffling between Spark and Cassandra be minimized?

spark-cassandra-connector, data locality

1 Answer

Erick Ramirez answered · jaroslaw.grabowski_50515 commented

The behaviour you're seeing is expected.

By definition, "data locality" means that the data is local to the Spark executor. To achieve data locality, both the Spark worker/executor JVM and the Cassandra JVM must run in the same server/VM/container/pod instance. Otherwise, the data can no longer be considered "local".
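To make the point concrete, here is a toy sketch in plain Java. It is not connector or Spark source code; it only models the assumption that a partition can be scheduled locally when one of its preferred hosts (the Cassandra replica addresses) is also a host running a Spark executor. The class name `LocalityCheck`, its methods, and all IP addresses are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: NOT Spark or connector code. All names are hypothetical.
class LocalityCheck {

    // A partition counts as node-local only if at least one of its preferred
    // hosts (here: Cassandra replica addresses) also runs a Spark executor.
    static boolean isNodeLocal(List<String> replicaHosts, Set<String> executorHosts) {
        for (String host : replicaHosts) {
            if (executorHosts.contains(host)) {
                return true;
            }
        }
        return false;
    }

    // Counts how many partitions could be scheduled node-local.
    static int countNodeLocal(List<List<String>> partitions, Set<String> executorHosts) {
        int local = 0;
        for (List<String> replicaHosts : partitions) {
            if (isNodeLocal(replicaHosts, executorHosts)) {
                local++;
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Each inner list holds the replica addresses of one partition
        // (addresses invented for this example).
        List<List<String>> partitions = new ArrayList<>();
        partitions.add(Arrays.asList("10.0.0.1", "10.0.0.2"));
        partitions.add(Arrays.asList("10.0.0.2", "10.0.0.3"));
        partitions.add(Arrays.asList("10.0.0.3", "10.0.0.1"));

        // Standalone setup: executors run on the same hosts as Cassandra.
        Set<String> colocated =
                new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        // Kubernetes setup: executor pods have their own, different IPs.
        Set<String> separatePods =
                new HashSet<>(Arrays.asList("10.244.1.7", "10.244.2.9"));

        System.out.println("co-located: "
                + countNodeLocal(partitions, colocated) + " of " + partitions.size());
        System.out.println("separate pods: "
                + countNodeLocal(partitions, separatePods) + " of " + partitions.size());
    }
}
```

With the made-up addresses above, every partition finds a matching executor host in the co-located case and none does when the executors run in separate pods, so every read in the second case has to cross the network.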

If you were running tests in a single VM or on your laptop, it worked because both the executor JVM and the Cassandra JVM were running on the same server. That is by design. Cheers!

2 comments

rjain commented ·
Thanks Erick for the reply.

Is there any optimization available in the spark-cassandra-connector for use in a Kubernetes environment? In other words, if I cannot use repartitionByCassandraReplica, what is the best approach for efficient reads when Cassandra and Spark are running in different pods?
jaroslaw.grabowski_50515 commented ·

No, there is no such optimization at the moment. Please share your ideas here:
