Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

sridhar.addanki_188870 avatar image
sridhar.addanki_188870 asked ·

How can we optimize a Spark JOIN query which takes a long time to complete?

Hai Team,

Spark Join queries on Cassandra tables are very very slow.

We are using DSE 6.8. and having 1.5 TB of data in 2 tables. Join queries are taking lot of time get the result. Same queries worked well when data is less. So what optimization will help us to improve Spark join queries performance on large data volume?

dsesparkjoin
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Cedrick Lunven avatar image
Cedrick Lunven answered ·

Thank you for your message. A few questions to investigate:

  • It might be obvious but when you speak about "very very slow" how many minutes is it ? We could compare to general abacus. (aka Spark would never be real time).
  • How many nodes to do you have, (cpu/ram?) to get some performance boost spark needs to distribute the query on multiple nodes.
  • Did you try the LEFT JOIN vs RIGHT JOIN to optimize cardinality of your request.


With that checked an optimization could be CACHE TABLE] as explained here : https://docs.datastax.com/en/dse/6.8/dse-dev/datastax_enterprise/spark/alwaysOnSql.html



1 comment Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thanks for your reply Cerdrick.

We have 6 Node cluster and 32 Cores and 512 Gb of RAM.

We are doing POC and looking for optimization.

One of my query is

select d.id, d.scope from tab1 h INNER JOIN tab2 d on h.mapid=d.`map` where h.mapid = 'XXXXXXX';

This ran around 25 Min. Fetched 379620 row(s)

2nd SQL is

select b.attr d.productname tab1 a, tab2 b , tab3 c where b.map=a.mapid and c.productname=a.productname and a.creationtime<'2020-03-10 15:37:28.000000+000' and a.rname='xxxxx';


This also ran for 20 min for no records found.





0 Likes 0 · ·