

arditmeti1337_187894 asked:

How can I join a table with itself in Spark?

Hi guys,

I have 2 tables which share a compound primary key, as shown in the attached picture.
My specific case is that when I have a triple of IDs (11, 15, 21), I need to join a table with itself and with another table to get a result like in the picture (joining on ID_1).
Is there a way to achieve this with Cassandra so I can avoid zipping RDDs in Spark?

I used this code to do the self-join, but is there anything better?

    // Read a filtered subset of table1 to use as the left side of the join
    val table1 = sc.cassandraTable("test", "table1")
      .select("ID_1", "ID_2")
      .where("ID_2 = ?", 11)
      .where("ID_1 IN ?", someValues)

    // Look up matching rows in the same table by ID_1
    val joined = table1.joinWithCassandraTable("test", "table1")
      .on(SomeColumns("ID_1"))
      .where("ID_2 = ?", 15)

spark
1589660359434.png (26.2 KiB)

1 Answer

Russell Spitzer answered:

If you want to join all of the table with itself, you definitely want to let Spark do a shuffle. It will be far more efficient than iterating through all the keys one at a time and re-looking them up, except for a narrow set of key distributions.

But that doesn't quite look like what you are actually doing; it looks more like you are just trying to use the output from some keys to query more keys. As long as the subset of keys you are using for the left-hand side of the join is small, what you are doing is probably the most efficient approach, but I'm not sure why you have "ID_2" in there.

So my TL;DR is:

- Use joinWithCassandraTable for looking up a subset of keys
- Use a Spark shuffle for looking up all keys which map to other keys in the table (like in a graph algorithm)
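The shuffle-based case above can be sketched with the DataFrame API. This is a minimal sketch, not the asker's exact job: the keyspace/table names and the ID_2 filter values come from the question, but the SparkSession setup is an assumption, and it requires the spark-cassandra-connector on the classpath and a reachable cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed setup: a SparkSession configured with spark.cassandra.connection.host
val spark = SparkSession.builder().getOrCreate()

// Read the whole table once; Spark shuffles it to perform the join,
// instead of looking keys up in Cassandra one at a time.
val table1 = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "table1"))
  .load()

// Self-join on ID_1, using aliases to keep the two sides distinguishable
val joined = table1.as("l")
  .join(table1.as("r"), col("l.ID_1") === col("r.ID_1"))
  .where(col("l.ID_2") === 11 && col("r.ID_2") === 15)
```

Because both sides come from a single full-table scan, this avoids the per-key round trips that make joinWithCassandraTable slow when the left side is large.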

1 comment

Thank you for the quick answer! Just wanted to ask, is 1000-2000 keys considered a small subset?
