narayana.jayanthi_191238 asked Erick Ramirez commented

Databricks connection to Astra DB returns IOException, "Failed to open native connection to Cassandra at Cloud File Based Config"

I am unable to read Astra DB data into a PySpark DataFrame on Databricks. I get an error while establishing the connection.

My PySpark code is as follows:

import os
from pyspark import SparkContext, SparkFiles
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import col


spark = SparkSession.builder.appName('SparkCassandraApp')\
.config('spark.cassandra.connection.config.cloud.path','dbfs:/FileStore/tables/secure_connect_kafka.zip')\
.config('spark.cassandra.auth.username', '')\
.config('spark.cassandra.auth.password','')\
.config('spark.dse.continuousPagingEnabled',False)\
.getOrCreate()


df = spark.read.format("org.apache.spark.sql.cassandra")\
.options(table="emp", keyspace="kafka").load()
display(df)

I get an error when reading the DataFrame with spark.read.format(...). Here is the error:

java.io.IOException: Failed to open native connection to Cassandra at Cloud File Based Config at dbfs:/FileStore/tables/secure_connect_kafka.zip :: Could not initialize class com.datastax.oss.driver.internal.core.config.typesafe.TypesafeDriverConfig

The secure_connect.zip file has been uploaded to DBFS.

Any help would be appreciated. Thanks, Narayana

astra dbdatabricks

smadhavan answered smadhavan commented

@narayana.jayanthi_191238 ,

You need to use `spark-cassandra-connector-assembly` (available on Maven Central) instead of `spark-cassandra-connector`. The reason is that the Spark Cassandra Connector uses a newer version of the Typesafe Config library than the Databricks runtime does, and the assembly artifact bundles all the necessary libraries as shaded versions. You also don't need to install java-driver-core separately, because it is pulled in automatically as a dependency. You can find more explanation in the following blog post.
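When installing the library on a Databricks cluster from Maven, the artifact is identified by a coordinate of the form group:artifact:version. As a rough sketch, the coordinate for the assembly could be put together like this. The helper name `assembly_coordinate` is hypothetical, and the version is deliberately left as a placeholder, since the right release depends on the Spark and Scala versions of your runtime:

```python
def assembly_coordinate(scala_version: str, connector_version: str) -> str:
    """Build the Maven coordinate for the shaded connector assembly.

    Hypothetical helper for illustration only; it just formats a string.
    """
    group_id = "com.datastax.spark"
    # The artifact name carries the Scala binary version as a suffix.
    artifact_id = f"spark-cassandra-connector-assembly_{scala_version}"
    return f"{group_id}:{artifact_id}:{connector_version}"


# Placeholder version: choose the release that matches your cluster.
print(assembly_coordinate("2.12", "<connector-version>"))
```

The resulting string is what you would paste into the Maven coordinates field when adding a library to the cluster, in place of the non-assembly `spark-cassandra-connector` artifact.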


narayana.jayanthi_191238 commented ·
Awesome! Thanks, Madhavan and Erick, for your quick responses. Much appreciated.
smadhavan ♦ narayana.jayanthi_191238 commented ·

Glad it worked!

Erick Ramirez answered Erick Ramirez commented

The Databricks Spark cluster cannot access your secure connect bundle (SCB). You need to distribute the SCB to all executors, as this example exception states:

IOException: Failed to open native connection to Cassandra at Cloud File Based Config at secure_connect_community.zip :: The provided path secure_connect_community.zip is not a valid URL nor an existing locally path. Provide an URL accessible to all executors or a path existing on all executors (you may use `spark.files` to distribute a file to each executor).
Caused by: IOException: The provided path secure_connect_community.zip is not a valid URL nor an existing locally path. Provide an URL accessible to all executors or a path existing on all executors (you may use `spark.files` to distribute a file to each executor).
Caused by: MalformedURLException: no protocol: secure_connect_community.zip

You can distribute the SCB to all executors using the Spark --files option (the spark.files property). In my Databricks cluster, I have the following Spark configuration:

spark.databricks.delta.preview.enabled true
spark.cassandra.auth.username token
spark.cassandra.auth.password AstraCS:AbC...:789...xyz0
spark.cassandra.connection.config.cloud.path secure_connect_community.zip
spark.files dbfs:/FileStore/tables/secure_connect_community.zip

Note the following:

  • spark.files is set to the full path to the SCB on the Databricks filesystem
  • cloud.path is set to the filename of the SCB only
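To make the relationship between the two settings concrete, here is a minimal sketch that derives both values from a single DBFS path. The helper name `scb_spark_conf` is hypothetical and only builds a dictionary; it does not talk to Spark. Note that `spark.files` is a top-level Spark property and does not belong under the `spark.cassandra.` prefix.

```python
import posixpath


def scb_spark_conf(scb_dbfs_path: str) -> dict:
    """Return the two SCB-related settings for a bundle stored on DBFS.

    Hypothetical helper for illustration only.
    """
    return {
        # Full DBFS path: tells Spark to ship the file to every executor.
        "spark.files": scb_dbfs_path,
        # Filename only: the connector resolves it from the distributed files.
        "spark.cassandra.connection.config.cloud.path": posixpath.basename(scb_dbfs_path),
    }


conf = scb_spark_conf("dbfs:/FileStore/tables/secure_connect_community.zip")
for key, value in conf.items():
    print(key, value)
```

The printed pairs correspond to the last two lines of the cluster configuration shown above.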

If you fix your configuration, you should be able to connect to your Astra DB from Databricks. Cheers!

[UPDATE] I wrote an article with a full working example for How to connect to Astra DB from a Databricks cluster. Hopefully you should be able to work out where you have misconfigured your cluster. Cheers!


narayana.jayanthi_191238 commented ·

Erick,

Thanks for your response. It is very much appreciated.

I made the changes as suggested and moved all the values into the Spark config. Now I am getting a different error:

com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables

The error is raised on this line:

df = spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="kafka").load()

display(df)

Py4JJavaError: An error occurred while calling o403.load.
: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.config.spark.files is not a valid Spark Cassandra Connector variable.
No likely matches found.
    at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:62)
    at com.datastax.spark.connector.cql.CassandraConnectorConf$.fromSparkConf(CassandraConnectorConf.scala:413)

Erick Ramirez ♦♦ narayana.jayanthi_191238 commented ·

I've updated my answer with a full working example for connecting from a Databricks cluster. Cheers!
