Bringing together the Apache Cassandra experts from the community and DataStax.


JJorczik asked:

Is it possible to achieve data locality at level rack_local when using the Spark Cassandra Connector?

Data locality in the Spark Cassandra Connector is achieved by collocating Spark workers with Cassandra nodes, which results in locality level node_local for Spark tasks if resources are available on the same node. Is it possible to achieve locality level rack_local when Spark workers and Cassandra nodes are located on the same rack, and if so, how can this setup be configured? I could not find any documentation on this topic.

Tags: spark-cassandra-connector, data-locality

jaroslaw.grabowski_50515 answered:

We don't use racks when routing requests. By default, SCC tries to reach a local replica first, then other replicas, then other nodes: https://github.com/datastax/spark-cassandra-connector/blob/v2.5.1/driver/src/main/scala/com/datastax/spark/connector/cql/LocalNodeFirstLoadBalancingPolicy.scala#L81.

If SCC can't find a live replica, it tries to reach nodes in the following order:

https://github.com/datastax/spark-cassandra-connector/blob/v2.5.1/driver/src/main/scala/com/datastax/spark/connector/cql/LocalNodeFirstLoadBalancingPolicy.scala#L165-L168

You can change this behaviour by supplying a jar with a custom CassandraConnectionFactory that sets a custom LoadBalancingPolicy. Use Node.getRack() to obtain the rack for a node.

It seems that in both cases mentioned above we could simply prioritize same-rack replicas/nodes over other nodes.
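The rack-first prioritization could be sketched as follows. This is only an illustration of the ordering idea, not the connector's actual code: a real implementation would extend the Java driver's LoadBalancingPolicy inside a custom CassandraConnectionFactory and read racks via Node.getRack(); the NodeInfo case class and its fields here are hypothetical stand-ins.

```scala
// Sketch only: NodeInfo is a made-up stand-in for the driver's Node type,
// used here to show how candidates could be ordered rack-first.
case class NodeInfo(address: String, rack: String, isReplica: Boolean)

object RackFirstOrdering {
  // Order candidate nodes for a request: same-rack replicas first,
  // then remote-rack replicas, then same-rack non-replicas, then the rest.
  def order(nodes: Seq[NodeInfo], localRack: String): Seq[NodeInfo] =
    nodes.sortBy { n =>
      (if (n.isReplica) 0 else 1, if (n.rack == localRack) 0 else 1)
    }
}
```

With this ordering, a request would still prefer replicas (so consistency-level routing is unaffected), but within each group the local rack wins.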


Erick Ramirez answered:

As you already stated, data locality can only be achieved by the Spark connector if the Spark workers (JVMs) are running on the same servers as the Cassandra JVMs.

The Spark connector is aware of the token range(s) owned by the "local" Cassandra node (running on the same server as the Spark worker) so it will try to route requests to nodes which have the data where possible. Cheers!
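The token-range-aware routing described above can be illustrated with a minimal sketch. This is not the connector's actual partitioner: the range bounds, host names, and the TokenRange class are made up, and real Cassandra token rings have wrapping ranges and multiple replicas per range.

```scala
// Illustrative sketch: map a partition key's token to the node owning
// the token range that contains it, which is the node Spark would
// prefer to schedule the corresponding task on.
case class TokenRange(start: Long, end: Long, host: String) {
  // A token belongs to (start, end]; ranges here are simplified and
  // assumed non-wrapping.
  def contains(token: Long): Boolean = token > start && token <= end
}

object TokenRouting {
  def hostFor(token: Long, ranges: Seq[TokenRange]): Option[String] =
    ranges.find(_.contains(token)).map(_.host)
}
```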

2 comments

Could you describe technically how the connector/Spark worker communicates with Cassandra nodes? Does it use localhost IPs or the Cassandra node's IP? What if Spark workers and Cassandra nodes are collocated on the same server but in different Docker containers? Is there any documentation?


The connector uses the Java driver under the hood, so it is just like any other app/client connecting to the cluster.

It connects to the IP address where the CQL port 9042 is bound, so if both the worker/executor and the C* node share the same IP address then the connector knows the C* node is local to it. Cheers!
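The "same IP means local" check can be sketched in a few lines. This is a simplification of what LocalNodeFirstLoadBalancingPolicy does (the real policy gathers all of the host's addresses and compares them with each node's broadcast address); the addresses below are examples only.

```scala
import java.net.InetAddress

object LocalityCheck {
  // A Cassandra node is considered local when its address matches the
  // worker's address; in Docker, separate containers usually get
  // separate IPs, which is why locality detection fails there.
  def isLocal(nodeAddress: InetAddress, workerAddress: InetAddress): Boolean =
    nodeAddress == workerAddress
}
```

This also answers the Docker question: if the worker container and the Cassandra container have different IPs, the comparison fails and the node is not treated as local.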
