

Question

sridhar.addanki_188870 asked:

How can we optimize a Spark JOIN query which takes a long time to complete?

Hi Team,

Spark join queries on our Cassandra tables are very slow.

We are using DSE 6.8 and have 1.5 TB of data across 2 tables. Join queries take a long time to return results. The same queries performed well when the data volume was smaller. What optimizations would help improve Spark join query performance on large data volumes?

Tags: dse, spark, join

1 Answer

Cedrick Lunven answered:

Thank you for your message. A few questions to investigate:

  • It might be obvious, but when you say "very very slow", how many minutes does it take? That gives us a baseline to compare against (keep in mind that Spark will never be real time).
  • How many nodes do you have, and with what CPU/RAM? To get a performance boost, Spark needs to distribute the query across multiple nodes.
  • Did you try LEFT JOIN vs RIGHT JOIN to optimize the cardinality of your request?


With that checked, an optimization could be CACHE TABLE, as explained here: https://docs.datastax.com/en/dse/6.8/dse-dev/datastax_enterprise/spark/alwaysOnSql.html
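As a minimal sketch of that caching suggestion (table name tab1 taken from the question; CACHE TABLE / UNCACHE TABLE are standard Spark SQL statements):

```sql
-- Pull the table into Spark's in-memory cache so repeated joins
-- don't re-scan it from Cassandra each time.
CACHE TABLE tab1;

-- ... run the join queries against the cached table ...

-- Release the memory when finished.
UNCACHE TABLE tab1;
```

Note that the first query after CACHE TABLE still pays the full scan cost; only subsequent queries benefit.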




Thanks for your reply, Cedrick.

We have a 6-node cluster with 32 cores and 512 GB of RAM.

We are doing a POC and looking for optimizations.

One of my queries is:

select d.id, d.scope from tab1 h INNER JOIN tab2 d on h.mapid=d.`map` where h.mapid = 'XXXXXXX';

This ran for around 25 minutes and fetched 379,620 rows.
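One rewrite worth trying on this query: since h.mapid = d.`map` and h.mapid is fixed to a literal, the same equality filter can be applied to both sides, which gives the Cassandra connector a chance to push the predicate down on each table before the join rather than shuffling a full table. A sketch using the names from the query above (whether it changes the plan depends on your Spark version):

```sql
-- Apply the literal filter to both join sides so each Cassandra scan
-- can be restricted before the shuffle, not just tab1.
SELECT d.id, d.scope
FROM tab1 h
INNER JOIN tab2 d ON h.mapid = d.`map`
WHERE h.mapid = 'XXXXXXX'
  AND d.`map` = 'XXXXXXX';
```

Comparing the physical plans of both versions with EXPLAIN will show whether the extra predicate is actually pushed down to Cassandra.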

The 2nd SQL is:

select b.attr, c.productname from tab1 a, tab2 b, tab3 c where b.map=a.mapid and c.productname=a.productname and a.creationtime<'2020-03-10 15:37:28.000000+000' and a.rname='xxxxx';


This also ran for 20 minutes and returned no records.
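If one of the three tables (for example tab3) is small enough to fit in executor memory, a broadcast join hint can avoid shuffling the two large tables entirely. A sketch based on the query above (assuming the intended projection is b.attr and c.productname; the BROADCAST hint is supported in the Spark version shipped with DSE 6.8):

```sql
-- Broadcast the small table (c) to every executor so the join with the
-- large tables happens locally, without a shuffle of a or b.
SELECT /*+ BROADCAST(c) */ b.attr, c.productname
FROM tab1 a, tab2 b, tab3 c
WHERE b.map = a.mapid
  AND c.productname = a.productname
  AND a.creationtime < '2020-03-10 15:37:28.000000+000'
  AND a.rname = 'xxxxx';
```

This only helps if the broadcast table genuinely fits in memory; broadcasting a large table will make things worse.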




