rahul asked ·

What are the best practices for running Graph OLAP queries on a large amount of data?

Hi, we need to run some drop commands on our graph, so we decided to run them as OLAP queries. We have 2 nodes in the cluster and both are configured for Spark (they run a mixed workload). But the queries were slow, taking almost 7-8 min for a data set of around 10 lakh (1 million) entries.

So I set some Spark properties on my graph from the Gremlin console. They are:

:remote config alias g example_graph.a;
g.graph.configuration.setProperty("spark.cores.max", 10);
g.graph.configuration.setProperty("spark.executor.memory", "4g");
g.graph.configuration.setProperty("spark.executor.cores", "1");
g.graph.configuration.setProperty("spark.sql.shuffle.partitions", 500);
g.graph.configuration.setProperty("spark.dynamicAllocation.enabled", "true");
g.graph.configuration.setProperty("spark.shuffle.service.enabled", "true");
g.graph.configuration.setProperty("spark.shuffle.service.port", "7437");

1. After applying these settings, our queries run in 2-3 min on average. My concern is: if I run these commands on only one graph, will they also affect all our other graphs? (I want this configuration on one graph only.)

2. How can I apply these settings using the Java API?

3. If there are any best practices for speeding up OLAP queries, please share those as well.

4. I also read about snapshots. It would be good if you could elaborate on them and on whether they can help.
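
For question 2, one driver-agnostic starting point is to keep the settings in a Java map and render them into the same `setProperty` statements used in the console. This is only a minimal sketch: the class `SparkConfigScript` and its `render` method are hypothetical names, not part of any DataStax or TinkerPop API, and it assumes the values contain no double quotes. How the resulting script is submitted depends on the driver version and is not shown, and the sketch does not answer whether the settings stay scoped to one graph (question 1).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: renders Spark settings as Gremlin setProperty calls. */
final class SparkConfigScript {

    /** Builds one setProperty statement per entry, in insertion order. */
    static String render(Map<String, Object> settings) {
        StringBuilder script = new StringBuilder();
        for (Map.Entry<String, Object> e : settings.entrySet()) {
            Object v = e.getValue();
            // Numbers are emitted bare, everything else as a quoted string,
            // mirroring the console commands in the question.
            // Assumption: values contain no double quotes (no escaping is done).
            String literal = (v instanceof Number) ? v.toString() : "\"" + v + "\"";
            script.append("g.graph.configuration.setProperty(\"")
                  .append(e.getKey()).append("\", ").append(literal).append(");\n");
        }
        return script.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the settings in the order they were added.
        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("spark.cores.max", 10);
        settings.put("spark.executor.memory", "4g");
        settings.put("spark.executor.cores", "1");
        settings.put("spark.sql.shuffle.partitions", 500);
        System.out.print(render(settings));
    }
}
```

The `:remote config alias` line is a console command rather than Gremlin, so it is deliberately left out of the rendered script; the graph or traversal source would have to be selected through whatever mechanism the driver provides.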

Sharing one query profile for a better understanding of our data set:

gremlin> g.V().hasLabel('Entitlement').out('Is').in('Of_Entity').profile()
==>Traversal Metrics 
Step                                  Count  Traversers   Time (ms)   % Dur
===========================================================================
GraphStep(vertex,[])                1010654     1010654   39234.918   76.15
HasStep([~label.eq(Entitlement)])       703         703    6218.585   12.07
VertexStep(OUT,[Is],vertex)             588         588    6049.714   11.74
VertexStep(IN,[Of_Entity],vertex)       640         640      17.395    0.03
                             >TOTAL       -           -   51520.614

As the profile shows, just visiting the specific Entity vertices in this large data set took around 52 sec. (This is only the visit; the drop may take 1-2 min on top of that.) So we can assume an OLAP drop query on this amount of data will run in around 3 min.

I want to reduce this as much as possible.

dsegraph

jeromatron answered ·

Hi Rahul,

It sounds like you're looking to drop something. Do you have more specific information about the exact query or steps that you're trying to perform? Are you using the Gremlin console to run these jobs?

We recommend using GraphFrames when interacting with the graph in an OLAP fashion. Here is a blog post that goes over the graph frames integration and there is a more recent blog post that goes into details about best practices for loading the graph - things to watch out for, how to make it more scalable, etc. There are repos that are linked from those articles. Separately we have documentation for how to best utilize Spark resources in the Spark Cassandra Connector and some older but good blog posts on Spark tuning, like this one.

Ultimately, it would be good to know what you're doing specifically and how you're doing it to help further.

Hope that helps as some starting points and documentation for your options.

Jeremy

1 comment

Hi, I have shared more details about my approach as an answer in this thread. Since you suggest GraphFrames for big analytical queries, could you please share the Java API documentation for GraphFrames with me?

rahul answered ·

Hi Jeromatron,

First, thanks for your comment.
Actually, we run these Gremlin queries from a command-line tool built in Java, and to execute the Gremlin OLAP queries I am using the fluent Java API.

Let's say I have some Gremlin queries like:

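// Assumes createTime is stored as a timestamp (java.time.Instant).
g.V().hasLabel('Entitlement').out('Is').in('Of_Entity').has('createTime', lte(java.time.Instant.parse('2020-11-05T06:22:52.136Z'))).drop();
g.V().hasLabel('Entity').has('status', 'Deleted').has('createTime', lte(java.time.Instant.parse('2020-11-05T06:22:52.136Z'))).drop();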

And yes, as I already said in my question, you can assume we have a large amount of data, let's say around 2 million.
