vijayakumar.sithamohan1_178594 asked ·

Graph versioning/time series strategies

Hi,

I am exploring DSE-Graph for a use case where we need to keep historical linkages for a certain number of years. I am looking for strategies or best practices for achieving this with DSE-Graph.

I am expecting close to 0.5 billion vertices and 4-5 billion edges to start with, growing by 25 to 35% every year.


Any guidance would be really appreciated.


Thanks

graph
3 comments

@vijayakumar.sithamohan1_178594 I'm not aware of documentation or recommendations specific to your particular use case. I'll reach out internally at DataStax to get other architects and engineers to comment. Cheers!


Thank you.

Erick Ramirez ♦♦ ·

@vijayakumar.sithamohan1_178594 It would be great if you could edit your question and elaborate on your use case so we're in a better position to respond. Cheers!

jeromatron answered ·

What I've seen with versioned/time series graphs is that you put version metadata in the graph and then either index on the version/time or keep a "latest" field to mark the current path. Which one fits depends on what you're trying to achieve. Until 6.8 comes out, a blog post that may help is the one about Gremlin's Time Machine, which shows how to navigate the graph as though Gremlin were traversing it at a certain time or version, using a subgraph strategy. It may give you some ideas.
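To make the "as of a given time" idea concrete, here is a minimal sketch in plain Python rather than Gremlin. It assumes each edge carries two hypothetical properties, valid_from and valid_to, and filters on them before traversing, which is roughly the effect a subgraph strategy would give you. The Edge class, the property names and the sample data are all made up for illustration; none of this is DSE Graph API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    label: str
    valid_from: int           # hypothetical property: year the link became valid
    valid_to: Optional[int]   # hypothetical property: None means still current


def as_of(edges: List[Edge], year: int) -> List[Edge]:
    """Keep only the edges that were valid in the given year."""
    return [
        e for e in edges
        if e.valid_from <= year and (e.valid_to is None or year < e.valid_to)
    ]


def neighbours(edges: List[Edge], vertex: str, year: int) -> List[str]:
    """One-hop traversal restricted to the 'as of' view of the graph."""
    return sorted(e.dst for e in as_of(edges, year) if e.src == vertex)


# Made-up sample data
EDGES = [
    Edge("alice", "acme", "works_at", 2015, 2018),
    Edge("alice", "globex", "works_at", 2018, None),
    Edge("alice", "bob", "knows", 2016, None),
]

print(neighbours(EDGES, "alice", 2017))
print(neighbours(EDGES, "alice", 2019))
```

The same traversal returns different neighbours depending on the year you pass in, so the calling code never has to know how the history is stored.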

I've seen this use case a number of times though, so depending on what you're trying to achieve, I'm sure you'll find it possible. The more information you can provide, the better we can help.

2 comments

Thank you for sharing the excellent article. We were thinking along the same lines for the schema, but the Gremlin traversal capabilities and strategies (SubgraphStrategy and DecorationStrategy) were new to us, and we would like to use them where they fit.

Thanks



As this graph grows with historical linkages, how will it affect query performance?

1. Does DSE-Graph have a feature to passivate (take offline) the historical linkages to cold storage, or to remove them and bring them back into the graph when needed?

2. Can graph be split based on time window? If that is possible, can traversal be made on multiple graphs or merge results from both graphs?

Cedrick Lunven answered ·

Here are my guesses.


As this graph grows with historical linkages, how will it affect query performance?

Yes, it will, but not as badly as you might think:

AFAIK an edge connects two vertices identified by their primary keys (obvious), which also means data locality is driven by the vertex. As such, even if the number of edges grows (a hot vertex, the Justin Bieber effect™ in a social network graph), you will still traverse data on the same node.
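For a rough sense of scale, the growth figures from the question can be compounded over a few years. This is only back-of-the-envelope arithmetic; the project helper and the five-year horizon are my own assumptions, not something stated in the question.

```python
def project(start: float, yearly_growth: float, years: int) -> float:
    """Compound a starting size by a fixed yearly growth rate."""
    return start * (1.0 + yearly_growth) ** years


YEARS = 5  # assumed horizon, pick whatever retention period applies

# Starting sizes in billions, taken from the question
for name, low_start, high_start in [("vertices", 0.5, 0.5), ("edges", 4.0, 5.0)]:
    low = project(low_start, 0.25, YEARS)    # slowest growth, smallest start
    high = project(high_start, 0.35, YEARS)  # fastest growth, largest start
    print(f"{name}: {low:.2f}B to {high:.2f}B after {YEARS} years")
```

Under those assumptions the graph ends up roughly three to four and a half times its starting size, which is worth keeping in mind when deciding how much history stays online.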


1. Does DSE-Graph have a feature to passivate (take offline) the historical linkages to cold storage, or to remove them and bring them back into the graph when needed?

This is not a built-in feature; you would have to handle it at the application level.

As the edges are not indexed, you still need to scan them to find the latest one if they all have the same type. The solution proposed in the article is to have a dedicated edge type for the latest edge only.

For an update, you create the new edge as "edge_x_latest"; for the edge it replaces, you create an "edge_x_audit" copy and remove the old "edge_x_latest".

You could then easily remove all "edge_x_audit" edges when they are no longer needed, or promote an "edge_x_audit" edge back to "edge_x_latest" if you need to.

=> Event Sourcing way of thinking
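Here is a small sketch of that latest/audit bookkeeping, written as plain Python over an in-memory dictionary rather than against DSE Graph. The EdgeStore class and its method names are hypothetical; only the "_latest" and "_audit" label suffixes come from the description above.

```python
from collections import defaultdict


class EdgeStore:
    """Toy in-memory stand-in for a graph, keyed by (source vertex, edge label)."""

    def __init__(self):
        # (src, label) -> list of (target vertex, timestamp)
        self._edges = defaultdict(list)

    def update(self, src, name, dst, ts):
        """Add a new latest edge and demote the previous one to an audit edge."""
        latest, audit = f"{name}_latest", f"{name}_audit"
        previous = self._edges.pop((src, latest), [])
        self._edges[(src, audit)].extend(previous)
        self._edges[(src, latest)].append((dst, ts))

    def latest(self, src, name):
        """Return the current edge without scanning any history."""
        edges = self._edges.get((src, f"{name}_latest"), [])
        return edges[0] if edges else None

    def history(self, src, name):
        """Return the audit edges, oldest first."""
        return sorted(self._edges.get((src, f"{name}_audit"), []), key=lambda e: e[1])

    def purge_history(self, src, name):
        """Drop every audit edge once the history is no longer needed."""
        self._edges.pop((src, f"{name}_audit"), None)


store = EdgeStore()
store.update("v1", "edge_x", "a", 1)
store.update("v1", "edge_x", "b", 2)
store.update("v1", "edge_x", "c", 3)
print(store.latest("v1", "edge_x"))
print(store.history("v1", "edge_x"))
```

Reading the current state only touches the single "_latest" edge, and purging the history does not affect it.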


2. Can graph be split based on time window? If that is possible, can traversal be made on multiple graphs or merge results from both graphs?

You can certainly create graphs dynamically and, at the application level, insert each value into the right graph based on its time. You would need to be careful to keep the schema consistent across those graphs.

AFAIK you cannot join values from different, disjoint graphs out of the box. When you call g.V() you are already bound to a single graph. The solution is again to join at the application level. As those are probably OLAP queries, you could look at GraphFrames.
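As an illustration of joining at the application level, the sketch below routes a query to one graph per time window and merges the results in the client. Everything here is assumed for the example: the five-year window, the "links_<start>_<end>" naming scheme, and the dictionaries standing in for real graphs. In practice each lookup would be a traversal against the corresponding graph.

```python
def graph_name_for(year, window=5, prefix="links"):
    """Map a year to the (hypothetical) name of the graph holding that window."""
    start = year - year % window
    return f"{prefix}_{start}_{start + window - 1}"


def merged_neighbours(graphs, vertex, from_year, to_year, window=5):
    """Query every graph that overlaps the period and union the results."""
    names = sorted({graph_name_for(y, window) for y in range(from_year, to_year + 1)})
    result = set()
    for name in names:
        # Stand-in for running the same traversal against each graph
        result |= graphs.get(name, {}).get(vertex, set())
    return sorted(result)


# Made-up data: graph name -> vertex -> set of neighbours
GRAPHS = {
    "links_2015_2019": {"alice": {"acme", "bob"}},
    "links_2020_2024": {"alice": {"bob", "globex"}},
}

print(merged_neighbours(GRAPHS, "alice", 2018, 2021))
```

Using a set for the merge removes duplicates when the same link exists in more than one window, which is the part you would otherwise have to handle by hand.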


My 2c
