Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started



arditmeti1337_187894 avatar image
arditmeti1337_187894 asked arditmeti1337_187894 commented

How can I join a table with itself in Spark?

Hi guys,

I have 2 tables which have a compound primary key like in picture..
My specific case is that when I have something like triple ids (11,15,21), I need to join a table with itself and with another one to get a result like in picture (joining on ID_1)..
Is there a way how to achieve this from cassandra so I can avoid zipping rdds in spark?

I used this code to make self table join but is there anything better?

    val table1 = sc.cassandraTable("test", "table1")
      .select("ID_1", "ID_2")
      .where("ID_2 = ?", 11)
      .where("ID_1 IN ?", someValues)
    val joined = table1.joinWithCassandraTable("test", "table1")
            .where("ID_2 = ?", 15)

1589660359434.png (26.2 KiB)
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Russell Spitzer avatar image
Russell Spitzer answered arditmeti1337_187894 commented

If you want to join all of the table with itself, you definitely want to let Spark do a shuffle, it will be way more efficient that iterating through all they keys one at a time and relooking them up except for a narrow set of key distributions.

But that doesn't quite look like what you are actually doing, it looks more like you are just trying to use the output from some keys to query more keys. As long the subset of keys you are using for the left hand of the join is small, what you are doing is probably the most efficient but i'm not sure why you have "id2" in there.

So my TLDR is:

Use join with cassandra table for looking up a subset of keys
Use a spark shuffle for looking up all keys which map to other keys in the table (like in a graph algorithm)

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thank you for the quick answer! Just wanted to ask, is 1000-2000 keys considered a small subset?

0 Likes 0 ·