jahar.tyagi_92934 asked ·

Why do RDDs become blank when calling repartitionByCassandraReplica() in Kubernetes?

Hi,

In a standard Spark + Cassandra deployment, where Spark runs in standalone mode on the same physical nodes as Cassandra, I call repartitionByCassandraReplica from the spark-cassandra-connector API before joining two RDDs, and that works fine.

I have now deployed the same code on Kubernetes, where Cassandra and Spark run in different pods. In this deployment, the RDD becomes blank (empty) when repartitionByCassandraReplica is called on it. I understand that repartitionByCassandraReplica is used before joinWithCassandraTable to obtain data locality, so that each Spark partition only needs to query its local node. But is it correct that repartitionByCassandraReplica will always return a blank RDD when Spark and Cassandra are deployed on Kubernetes?
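For reference, the call pattern described above looks roughly like the sketch below. It is only an illustration, not the actual job: the keyspace `shop`, the table `customers`, the key class `CustomerKey`, and the contact host are all hypothetical names chosen for the example. The counts printed at the end are one simple way to see at which step the RDD stops containing rows.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key class: the field name is assumed to match the
// partition key column of the (hypothetical) table shop.customers.
case class CustomerKey(customer_id: Int)

object ReplicaJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("replica-join-sketch")
      // Assumed contact point; replace with a real Cassandra address.
      .set("spark.cassandra.connection.host", "cassandra.example.internal")

    val sc = new SparkContext(conf)

    // Keys to look up in Cassandra.
    val keys = sc.parallelize(1 to 1000).map(CustomerKey(_))

    // Regroup the keys by the Cassandra replica that owns each partition key.
    val localized =
      keys.repartitionByCassandraReplica("shop", "customers", partitionsPerHost = 10)

    // Join each key against the table; one query per key.
    val joined = localized.joinWithCassandraTable("shop", "customers")

    // Compare sizes before and after each step to locate where rows disappear.
    println(s"keys=${keys.count()} localized=${localized.count()} joined=${joined.count()}")

    sc.stop()
  }
}
```

The master URL is left out on purpose, on the assumption that the job is launched with spark-submit.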

spark-cassandra-connector

1 Answer

Erick Ramirez answered ·

It doesn't have anything to do with Kubernetes. There is no data locality when the Spark workers/executors are not co-located on the same server/machine/VM as the Cassandra nodes.

Data locality only works when both the Spark worker/executor JVM and the Cassandra JVM are running on the same server. Cheers!

2 comments

Thanks Erick. That helps.


Not a problem. Cheers!
