
jahar.tyagi_92934 asked

Why do RDDs become blank when calling repartitionByCassandraReplica() in Kubernetes?

Hi,

In a standard Spark + Cassandra deployment, where Spark runs in standalone mode on the same physical nodes as Cassandra, I call repartitionByCassandraReplica from the spark-cassandra-connector API before joining two RDDs, and that works fine.

Now I have deployed the same code on Kubernetes, where Cassandra and Spark run in different pods. In this deployment, the RDD becomes blank (empty) when repartitionByCassandraReplica is called on it. I understand that repartitionByCassandraReplica is used before joinWithCassandraTable to obtain data locality, so that each Spark partition only needs to query its local node. But is it correct that repartitionByCassandraReplica will always return a blank RDD when used in a Kubernetes deployment of Spark and Cassandra?

spark-cassandra-connector

1 Answer

Erick Ramirez answered

It doesn't have anything to do with Kubernetes. There is no data locality when the Spark workers/executors are not co-located on the same server/machine/VM as the Cassandra nodes.

Data locality only works when both the Spark worker/executor JVM and the Cassandra JVM are running on the same server. Cheers!
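For reference, here is a minimal Scala sketch of the two call patterns being discussed. It is only an illustration under stated assumptions: the keyspace `shop`, the table `customers`, its partition key column `customer_id`, and the contact host are all hypothetical placeholders, and it needs a reachable Cassandra cluster plus the spark-cassandra-connector on the classpath. The first variant repartitions by replica before the join, which only pays off when executors share hosts with Cassandra nodes. The second calls joinWithCassandraTable directly, which does not depend on co-location. Whether dropping the repartition step avoids the blank RDD in your setup is not something this thread confirms, so treat it as something to try rather than a verified fix.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key class: the field name matches the (assumed) partition key column.
case class CustomerId(customer_id: Int)

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("join-sketch")
      // Placeholder contact point; replace with your own Cassandra host or service name.
      .set("spark.cassandra.connection.host", "cassandra.example.internal")
    val sc = new SparkContext(conf)

    // Keys to look up in the (hypothetical) shop.customers table.
    val keys = sc.parallelize(1 to 1000).map(CustomerId(_))

    // Variant 1: co-located deployment. Group keys by the replica that owns them,
    // then join, so each Spark partition can query a Cassandra node on its own host.
    val localJoin = keys
      .repartitionByCassandraReplica("shop", "customers", partitionsPerHost = 10)
      .joinWithCassandraTable("shop", "customers")

    // Variant 2: Spark and Cassandra on different hosts or pods. Join directly,
    // without the locality-oriented repartition step.
    val plainJoin = keys.joinWithCassandraTable("shop", "customers")

    // Comparing the counts shows whether either variant loses rows.
    println(s"with repartition: ${localJoin.count()}, without: ${plainJoin.count()}")

    sc.stop()
  }
}
```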


jahar.tyagi_92934 commented

Thanks, Erick. That helps.

Erick Ramirez commented

Not a problem. Cheers!
