question

azim_91_184236 asked Erick Ramirez commented

Does the Spark Cassandra connector need to be able to connect to all nodes in the cluster?

I have a scenario where I am using the Spark Cassandra connector to move data from an on-premises Cassandra cluster to a PaaS Cassandra offering in the cloud, similar to the scenario described in https://www.datastax.com/blog/migrate-cassandra-apps-cloud-20-lines-code

My setup is:

1. On-premises Cassandra cluster

2. Spark cluster in the cloud

3. PaaS Cassandra cluster in the cloud

I need to configure the firewall to allow connectivity from the Spark cluster in the cloud (using the Spark Cassandra connector) to the on-premises Cassandra cluster. I have seen the 'Initial contact' section of the connection documentation at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md, but I need some clarification. My questions (all relating to the on-premises Cassandra cluster) are:

1. I understand we can provide any node (maybe a seed node) in spark.cassandra.connection.host, but does the connector eventually connect (or need the ability to connect) to all nodes, or multiple nodes, of the Cassandra cluster in a specific DC? Or is the ability to connect to a single node good enough?

2. If my firewall allows connectivity to only a single node, will the functionality still work?

Thanks for your guidance!

spark-cassandra-connector

Erick Ramirez answered Erick Ramirez commented

The Spark connector uses the Java driver under the hood and it needs to be able to connect to all the nodes in the cluster, not just one or two.

It uses the initial contact points to establish a control connection, which the driver uses to perform administrative tasks such as querying the system tables to learn about the cluster topology. Through the control connection it discovers all the nodes in the cluster, plus the token range(s) each node owns, so it can connect to each node as required.

To respond to your questions directly:

  1. The Spark connector needs to be able to connect to all nodes in the cluster. This is not negotiable.
  2. No, it will not work. It needs to be able to connect to all nodes.

Cheers!
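To illustrate the point above, here is a minimal sketch of the connector configuration. The host address, keyspace, and table names are placeholders, not values from the original question. Note that spark.cassandra.connection.host is only the *initial* contact point; after the control connection is established, the driver learns about and connects to every node in the cluster, which is why the firewall must allow the Spark executors to reach all of them.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values throughout: substitute your own host, keyspace and table.
val spark = SparkSession.builder()
  .appName("cassandra-migration")
  // Initial contact point only. The driver uses it to open the control
  // connection, then discovers the remaining nodes from the system tables
  // and connects to them directly, so one reachable node is NOT enough.
  .config("spark.cassandra.connection.host", "10.0.0.1")
  .config("spark.cassandra.connection.port", "9042")
  .getOrCreate()

// Reads are split by token range and executed against the node(s)
// owning each range, which again requires connectivity to every node.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()
```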


azim_91_184236 commented:

Thanks @Erick Ramirez, appreciate the reply!

Erick Ramirez replied:

Glad to help. Cheers!

smadhavan answered

@azim_91_184236, among the Cassandra connection parameters you could set spark.cassandra.connection.localDC (default: none) to the name of your desired datacenter, which directs the jobs to connect only to nodes in that datacenter and ignore the others. See the configuration reference for additional details and parameters.
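A minimal sketch of the localDC setting described above. The datacenter name "DC1" and the host are placeholders; use the datacenter name as it appears in your cluster's topology (e.g. in nodetool status). With this set, the connector still discovers the whole cluster but only opens connections to nodes in the named DC.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder host and DC name: adjust for your own cluster.
val spark = SparkSession.builder()
  .appName("cassandra-migration")
  .config("spark.cassandra.connection.host", "10.0.0.1")
  // Pin the connector to one datacenter; nodes in other DCs are ignored,
  // so the firewall only needs to allow connectivity to every node in DC1.
  .config("spark.cassandra.connection.localDC", "DC1")
  .getOrCreate()
```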
