zubochenko_186666 asked ·

How do I import the org.apache.spark.sql.cassandra class in a Jupyter notebook?

Spark cannot find org.apache.spark.sql.cassandra when I call sqlContext.read in a Jupyter notebook.

What am I doing wrong?

os.environ['SPARK_HOME']='/home/.../spark-2.4.5-bin-hadoop2.7/'
findspark.init()
sc=SparkContext(appName="myAppName")

os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.cassandra.connection.host=anyip --conf spark.executor.cores=2 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=pass   --properties spark:spark.jars.packages=datastax:spark-cassandra-connector:2.3.0-s_2.11 --master spark://anyip:7077'

auth_provider = PlainTextAuthProvider(
username='cassandra', password='pass')
cluster = Cluster(['anyip'], port=9042, connect_timeout=3600, auth_provider=auth_provider)
session = cluster.connect('dbsvc')
session.default_timeout = 600
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df
uc=load_and_get_table_df('dbsvc', 'usercounters')
Py4JJavaError: An error occurred while calling o27.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html

Russell Spitzer answered ·

I have never tried to load PySpark environment variables the way you currently are, and while that may work, there are at least a few small things to fix.

--properties spark:spark.jars.packages=datastax:spark-cassandra-connector:2.3.0-s_2.11


That does not look like the right option to me. I believe it should be:

--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2


Or, of course, whatever version you happen to be using. I think removing the (spark:) prefix may also fix it, but it is best to switch to the Maven coordinate (com.datastax.spark) rather than the spark-packages one (datastax).


If this is working correctly, you should see information in your log about downloading the resource and its dependencies. If you do not see these log lines, then the issue is still that the dependencies are not present.


Example of logs showing the connector was added

com.datastax.spark#spark-cassandra-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-160541e5-a3f4-4ad1-b3be-dd36dc67d092;1.0
    confs: [default]
    found com.datastax.spark#spark-cassandra-connector_2.11;2.4.3 in central
    found joda-time#joda-time;2.3 in central
    found commons-beanutils#commons-beanutils;1.9.3 in local-m2-cache
    found commons-collections#commons-collections;3.2.2 in spark-list
    found org.joda#joda-convert;1.2 in central
    found com.twitter#jsr166e;1.1.0 in central
    found io.netty#netty-all;4.0.33.Final in central
    found org.scala-lang#scala-reflect;2.11.7 in local-m2-cache
downloading https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.11/2.4.3/spark-cassandra-connector_2.11-2.4.3.jar ...
    [SUCCESSFUL ] com.datastax.spark#spark-cassandra-connector_2.11;2.4.3!spark-cassandra-connector_2.11.jar (800ms)
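
One way to check a captured driver log for the lines shown above is a small text scan. This is only a sketch: `resolved_connector_version` is a hypothetical helper (not part of Spark or the connector), and it assumes the log uses the same `found group#artifact;version in repo` layout as the example.

```python
import re

# Matches lines such as:
#   found com.datastax.spark#spark-cassandra-connector_2.11;2.4.3 in central
# The layout is taken from the example log above; other Spark/Ivy versions
# may print it differently.
RESOLVED_LINE = re.compile(
    r"found com\.datastax\.spark#spark-cassandra-connector_[\d.]+;([\w.\-]+) in "
)


def resolved_connector_version(log_text):
    """Return the connector version reported as found, or None if absent."""
    match = RESOLVED_LINE.search(log_text)
    return match.group(1) if match else None


sample_log = (
    "com.datastax.spark#spark-cassandra-connector_2.11 added as a dependency\n"
    "    found com.datastax.spark#spark-cassandra-connector_2.11;2.4.3 in central\n"
    "    found joda-time#joda-time;2.3 in central\n"
)
version = resolved_connector_version(sample_log)
```

If the helper returns `None` for your driver output, the package was most likely never requested, which points back at the submit arguments.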



That did not help. I tried 2.4.2 and 2.4.0 too. I also tried downloading the connector jar, putting the file in Hadoop, and importing it with --jars. Nothing helps.

What logs do you mean?

Any other thoughts?


The Spark driver should log as it starts up. Did you make sure you corrected the submit arguments? It is also VERY important that you set these properties before starting your Spark context. Once the context has started, these properties have no effect.


Hi Russell,

Finally it works.

The first thing I did was set some paths:

os.environ['SPARK_HOME']
os.environ['HADOOP_HOME']
os.environ['LD_LIBRARY_PATH']

Second, I added pyspark-shell to PYSPARK_SUBMIT_ARGS.

Thanks for your help.
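
For readers landing here later, the fix described in this thread can be sketched roughly as follows. It assumes the Maven coordinate suggested in the answer and the placeholder host `anyip` from the question; `build_submit_args` is a hypothetical helper, not a PySpark function, and it does not quote values that contain spaces.

```python
import os


def build_submit_args(packages, conf, master=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string (hypothetical helper).

    The string ends with 'pyspark-shell', as reported in the comment above.
    """
    parts = []
    if master:
        parts += ["--master", master]
    if packages:
        # Several Maven coordinates are passed as one comma-separated value.
        parts += ["--packages", ",".join(packages)]
    for key, value in conf.items():
        parts += ["--conf", f"{key}={value}"]
    parts.append("pyspark-shell")
    return " ".join(parts)


submit_args = build_submit_args(
    packages=["com.datastax.spark:spark-cassandra-connector_2.11:2.4.2"],
    conf={"spark.cassandra.connection.host": "anyip"},  # placeholder host
    master="spark://anyip:7077",  # placeholder master URL
)

# Set the variable BEFORE any SparkContext exists; once the context has
# started, changing it has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args

# Only now create the context, for example:
# from pyspark import SparkContext
# sc = SparkContext(appName="myAppName")
```

The key difference from the code in the question is the ordering: there the context was created first and the environment variable was set afterwards.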

wouter.devries_186897 answered ·
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
# Path to the spark-cassandra-connector jar
conf.set("spark.jars", "/path/to/spark-cassandra-connector-2.4.0-s_2.11.jar")
conf.set("spark.cassandra.connection.host", "cassandra-hostname")

sc = SparkContext('spark://spark-hostname:spark-port', conf=conf)
ss = SparkSession(sc)

I've had success with the above. Could you try that? I don't set any special environment variables.

Aleks Volochnev answered ·

Do you have the connector installed by itself?

For jupyter: `pip install cassandra-driver`


Yes, it's installed already.

liviraheja_160281 answered ·

Hi,


You are not using the right Cassandra connector. Spark Cassandra Connector 2.3.x is compatible with Spark 2.3. You are using Spark 2.4, so the connector version should be 2.4 (if the Scala version is 2.11) or 2.4.2 (if the Scala version is 2.12).

Hope this helps!
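
As a rough illustration of that rule of thumb, the check below compares a Maven-style coordinate against a Spark and Scala version. Both functions are hypothetical helpers, and the "same major.minor as Spark" rule is only the simplification stated in this answer; the connector's own compatibility table is the authority.

```python
def parse_coordinate(coordinate):
    """Split 'group:artifact_scalaVersion:version' into its four parts."""
    group, artifact, version = coordinate.split(":")
    name, _, scala = artifact.rpartition("_")
    return group, name, scala, version


def matches_spark(coordinate, spark_version, scala_version):
    """True if the connector's major.minor and Scala suffix match Spark's.

    Simplified rule taken from the answer above, not an official check.
    """
    _, _, scala, version = parse_coordinate(coordinate)
    same_minor = version.split(".")[:2] == spark_version.split(".")[:2]
    return same_minor and scala == scala_version


# Scala 2.11 is an assumption here; check what your Spark build reports.
ok = matches_spark(
    "com.datastax.spark:spark-cassandra-connector_2.11:2.4.2", "2.4.5", "2.11"
)
too_old = matches_spark(
    "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0", "2.4.5", "2.11"
)
```

Note that the helper expects Maven coordinates; the spark-packages form (`datastax:spark-cassandra-connector:2.3.0-s_2.11`) encodes the Scala version differently and is not handled.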


Thanks, but it does not help.
