sridhar.addanki_188870 asked

How can we optimize a Spark JOIN query which takes a long time to complete?

Hi Team,

Spark join queries on Cassandra tables are very slow.

We are using DSE 6.8 and have 1.5 TB of data across 2 tables. Join queries take a long time to return results. The same queries performed well when the data volume was smaller. What optimizations would help us improve Spark join query performance on a large data volume?


1 Answer

Cedrick Lunven answered

Thank you for your message. A few questions to help investigate:

  • It might be obvious, but when you say "very very slow", how many minutes does the query take? That lets us compare against a rough baseline (Spark will never be real-time).
  • How many nodes do you have (CPU/RAM)? To get a performance boost, Spark needs to distribute the query across multiple nodes.
  • Did you try a LEFT JOIN vs a RIGHT JOIN to optimize the cardinality of your request?

With that checked, one optimization could be CACHE TABLE, as explained here:
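For reference, a minimal sketch of CACHE TABLE in Spark SQL; the table name and filter below are placeholders, not from the thread:

```sql
-- Materialize the table in memory so repeated joins against it
-- avoid re-reading from Cassandra on every query:
CACHE TABLE tab1;

-- Or cache only the slice you actually join against
-- (hypothetical derived table name and filter):
CACHE TABLE tab1_recent AS
SELECT * FROM tab1 WHERE creationtime > '2020-03-01';

-- Release the memory when done:
UNCACHE TABLE tab1;
```

Caching only pays off when the same table is read more than once; for a single one-off join it adds an extra pass over the data.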


sridhar.addanki_188870 commented:

Thanks for your reply, Cedrick.

We have a 6-node cluster with 32 cores and 512 GB of RAM.

We are doing a POC and looking for optimizations.

One of my queries is:

select d.scope from tab1 h INNER JOIN tab2 d on h.mapid = d.`map` where h.mapid = 'XXXXXXX';

This ran for around 25 minutes and fetched 379,620 rows.
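One quick way to see where that 25 minutes goes is to inspect the physical plan. The sketch below reuses the table and column names from the query above; the broadcast hint is only an assumption that the filtered side of the join is small enough to fit in executor memory:

```sql
-- Look for SortMergeJoin (shuffle-heavy) vs BroadcastHashJoin
-- (no shuffle of the large side) in the output:
EXPLAIN
SELECT d.scope
FROM tab1 h
INNER JOIN tab2 d ON h.mapid = d.`map`
WHERE h.mapid = 'XXXXXXX';

-- If one side is small after filtering, a broadcast hint can
-- avoid shuffling the large table across the cluster:
SELECT /*+ BROADCAST(d) */ d.scope
FROM tab1 h
INNER JOIN tab2 d ON h.mapid = d.`map`
WHERE h.mapid = 'XXXXXXX';
```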

The 2nd SQL is:

select b.attr, d.productname from tab1 a, tab2 b, tab3 c where c.productname = a.productname and a.creationtime < '2020-03-10 15:37:28.000000+000' and a.rname = 'xxxxx';

This also ran for 20 minutes and returned no records.
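As transcribed, the WHERE clause in the second query never constrains tab2 (alias b), so Spark has no join predicate for it and may plan a cartesian product, which alone could explain a 20-minute run even with zero matching rows. A sketch of how to confirm this, reusing the query above:

```sql
-- A CartesianProduct or BroadcastNestedLoopJoin node in the plan
-- means a join predicate is missing for one of the tables:
EXPLAIN
SELECT b.attr, a.productname
FROM tab1 a, tab2 b, tab3 c
WHERE c.productname = a.productname
  AND a.creationtime < '2020-03-10 15:37:28.000000+000'
  AND a.rname = 'xxxxx';
```

If the plan shows a cartesian product, adding the missing equality condition linking b to a or c would likely help more than any caching or tuning.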
