question

twinsevil avatar image
twinsevil asked Erick Ramirez edited

Is there joinWithCassandraTable() in open-source pyspark?

Hello!

Is there a joinWithCassandraTable() method in the free open source version for pyspark?

Or is it implemented only in the DSE version?

spark-cassandra-connector
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

jaroslaw.grabowski_50515 avatar image
jaroslaw.grabowski_50515 answered twinsevil commented

Hi! SCC 2.5.x and newer has Direct Join support (joinWithCassandraTable but for dataframes). I don't have python samples but here is a good scala article about Direct Join http://www.russellspitzer.com/2018/05/23/DSEDirectJoin/.

2 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

twinsevil avatar image twinsevil commented ·

Are you sure this feature is implemented in the open source version of the connector for pyspark?

I am using spark-cassandra-connector_2.12 and Spark 3.0.1 but I can't find the directJoin method.

A simple join of two dataframes (one very small, the other very large) results to full table scan.

What am I doing wrong?

0 Likes 0 ·
twinsevil avatar image twinsevil commented ·

Thanks for the help, I figured out what the problem was.

In order for the direct join to work, it was necessary to set the following settings when creating the Spark Context:


spark = SparkSession.builder.\
     config('directJoinSetting', 'on').\
     config("spark.sql.extensions",  "com.datastax.spark.connector.CassandraSparkExtensions").\
     appName('directJoin').getOrCreate()  
0 Likes 0 ·