
twinsevil asked · Erick Ramirez edited

Is there joinWithCassandraTable() in open-source pyspark?

Hello!

Is there a joinWithCassandraTable() method in the free, open-source version of the connector for PySpark?

Or is it implemented only in the DSE version?

spark-cassandra-connector


1 Answer

jaroslaw.grabowski_50515 answered · twinsevil commented

Hi! SCC 2.5.x and newer have Direct Join support (joinWithCassandraTable, but for dataframes). I don't have Python samples, but here is a good Scala article about Direct Join: http://www.russellspitzer.com/2018/05/23/DSEDirectJoin/.

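For illustration, here is a rough PySpark sketch of what that can look like with dataframes. The keyspace ks, table large_table and key column id are made-up names; the session config uses the directJoinSetting and CassandraSparkExtensions settings that SCC 2.5+ understands:

from pyspark.sql import SparkSession

spark = SparkSession.builder.\
    config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions").\
    config("directJoinSetting", "on").\
    appName("directJoinExample").getOrCreate()

# Small, locally built dataframe of keys to look up (hypothetical values)
keys_df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# Cassandra table read through the connector (hypothetical keyspace/table)
cassandra_df = spark.read.\
    format("org.apache.spark.sql.cassandra").\
    options(keyspace="ks", table="large_table").\
    load()

# An equi-join on the partition key; with the extensions registered this can be
# planned as a Direct Join instead of a full scan of large_table
joined = keys_df.join(cassandra_df, on="id")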
2 comments

Are you sure this feature is implemented in the open-source version of the connector for PySpark?

I am using spark-cassandra-connector_2.12 and Spark 3.0.1 but I can't find the directJoin method.

A simple join of two dataframes (one very small, the other very large) results in a full table scan.

What am I doing wrong?


Thanks for the help, I figured out what the problem was.

For the direct join to work, I had to set the following configuration when creating the Spark session:


# Register the connector's Catalyst extensions and enable the direct join rewrite
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
    config("directJoinSetting", "on").\
    config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions").\
    appName("directJoin").getOrCreate()
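To double-check that the rewrite actually kicks in, the query plan can be inspected; this is a sketch that assumes a small keys_df joined with a Cassandra-backed dataframe on the partition key, as in the example above:

joined = keys_df.join(cassandra_df, on="id")  # hypothetical dataframes
joined.explain()
# With directJoinSetting=on and CassandraSparkExtensions registered, the physical
# plan should mention a Cassandra direct join rather than a full scan of the table.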