Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

josef.schauer_169242 avatar image
josef.schauer_169242 asked ·

Is our approach the right way to do more traversals on one vertex?

Hey,


We are trying to execute the following gremlin query:

g.V().hasLabel('Person').has('names',regex('.*')).as('Person___1')

.select('Person___1').outE('EmployeeOf').as('Person__out_employeeOf___1').otherV().as('Person__out_employeeOf__department___1').range(0, 2)

.select('Person___1').outE('LivesAt').as('Person__out_livesAt___1').otherV().as('Person__out_livesAt__address___1').range(0, 2)

.union(select('Person__out_livesAt__address___1'),select('Person__out_employeeOf__department___1'))


Please consider the range() steps, after each traversal step.


Data Setup:

Person: 1

Departments: 3

Addresses: 3

all addresses and departments are connected with the Person


Current Result:

2 different addresses

2 departments, but the same instance (same id)


Expected Result:

2 different addresses

2 different departments


Questions:

  • Is it the right way to do more traversals on one vertex? Is this approach we are doing allowed/ok?
  • Why do we get two different addresses but only one unique department, but this is added twice to the result? How to solve this?


Thanks in advance.






dsegraphgremlintraversal6.7.3
3 comments
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

@josef.schauer_169242 Apologies but it isn't clear to me what your question is. Could you please update your post? Cheers!

1 Like 1 · ·

updated!

0 Likes 0 · ·
Erick Ramirez avatar image Erick Ramirez ♦♦ josef.schauer_169242 ·

Thanks! I'll get one of our engineers to respond to your post. Cheers!

1 Like 1 · ·
stephen.mallette_131353 avatar image
stephen.mallette_131353 answered ·

When asking questions about Gremlin it is helpful to include some sample data in the form of a Gremlin script (and in the case of DataStax Graph perhaps a schema script as well).

g.addV('Person').as('p').
  addV('Department').property('name','dept1').as('d1').
  addV('Department').property('name','dept2').as('d2').
  addV('Department').property('name','dept3').as('d3').
  addV('Address').property('name','addy1').as('a1').
  addV('Address').property('name','addy2').as('a2').
  addV('Address').property('name','addy3').as('a3').
  addE('EmployeeOf').from('p').to('d1').
  addE('EmployeeOf').from('p').to('d2').
  addE('EmployeeOf').from('p').to('d3').
  addE('LivesAt').from('p').to('a1').
  addE('LivesAt').from('p').to('a2').
  addE('LivesAt').from('p').to('a3').iterate()

Note that in this case, I simplified a bit since the regex and person "name" didn't have much bearing on your traversal question.

Is it the right way to do more traversals on one vertex? Is this approach we are doing allowed/ok?

I think you have a fair bit of unnecessary complexity in your traversal. That complexity hurts performance and makes the query difficult to read. In particular, there is a heavy use of step labelling (i.e. use of "as()") which enables path tracking in your traversal and adds to computation cost. If you can avoid path tracking you will increase your performance. The use of "otherV()" has similar effect and in this case it is not necessary as you've traversed with "outE()" so the alternate side is known and must be "inV()". Typically, "otherV()" is reserved for use in conjunction with `bothE()` where the traversal direction is not known.

Removing all that complexity and considering what you expect as a result should reduce your traversal to something as simple as:

g.V().has('Person','names',regex('.*') ).
  union(out('EmployeeOf').limit(2),
        out('LivesAt').limit(2))

Using my example data you can see that I get the result you're looking for:

gremlin> g.V().hasLabel('Person').
......1>   union(out('EmployeeOf').limit(2),
......2>         out('LivesAt').limit(2)).
......3>   project('name','label').by('name').by(label)
==>[name:dept1,label:Department]
==>[name:dept2,label:Department]
==>[name:addy1,label:Address]
==>[name:addy2,label:Address]
Why do we get two different addresses but only one unique department, but this is added twice to the result? How to solve this?

While you can see the solution above, it's worth understanding why you were getting the result you were getting. I think Bettina's answer explains it but, said another way, keep in mind that "range()" is a filter not a barrier. In other words, it does not exhaust the stream of traversers up to it. So, your first "range()" yields one "department" traverser which flows through to "outE('LivesAt')" and triggers a "LivesAt" edge to flow toward the second "range()". Then, the next available "LivesAt" edge flows through the second "range" but that is the limit. Those two traversers flow to "union()" and they each get to do a "select()" and thus produce four results. That means two unique addresses (one for each "LivesAt") but then the same "department" because that was traverser to originall trigger the rest of this stream. As we've reached the limit of the second "range()", this traversal is considered done. So the second "range()` interferes with the first.

If my solution above is not good for your particular environment you might consider some variation of it. The pattern for avoiding the "range()" interference I just described is to explicitly use "range()" inside of the "union()" or more generally within sub-traversal which can be individually iterated to completion.

2 comments Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thanks for your efforts. I understand now the problem.

Some questions I have now:

Is it possible to stop the streaming of the data with another step, like barrier(), after the traversals?

To make sub-traversals I can use local(...), right?


It's an option for me to use union() around the traversals, but in our implementation, this means a lot of effort to change all, that's why I'm asking for alternatives.


0 Likes 0 · ·

I don't think that "barrier()" would work. It leaves the same problem because it will pull two traversers through the first "range()" but only the first traverser through will trigger the rest of the traverser and then my description above still follows.

A sub-traversal (or child traversal) is any anonymous traversal (i.e. spawned from "__") that is used within a traversal spawned from "g". So, yes the traversal given to a step like "local()" would be considered a sub-traversal.

It also sounds like you are willing to give up performance in exchange for this specific traversal pattern that you're using so it's hard to recommend what you might try. Perhaps you could "limit(2).fold()" and then label that step so that you can "unfold()" that list of 2 later?

0 Likes 0 · ·
bettina.swynnerton avatar image
bettina.swynnerton answered ·

Hi, this is a really interesting observation. I recreated your graph, and then to understand what is going on, I took the union apart and investigated each branch separately.


The problem occurs with the second branch:

gremlin> g.V().hasLabel('Person').has('name',regex('.*')).as('Person___1').select('Person___1').outE('EmployeeOf').as('Person__out_employeeOf___1').otherV().as('Person__out_employeeOf__department___1').range(0, 2).select('Person___1').outE('LivesAt').as('Person__out_livesAt___1').otherV().as('Person__out_livesAt__address___1').range(0, 2).union(select('Person__out_employeeOf__department___1'))
==>v[{~label=Department, department_id="department1"}]
==>v[{~label=Department, department_id="department1"}]


The selection returns two results, but they are identical.

This is because the selection returns the top 2 results from this result set:

gremlin> g.V().hasLabel('Person').has('name',regex('.*')).as('Person___1').select('Person___1').outE('EmployeeOf').as('Person__out_employeeOf___1').otherV().as('Person__out_employeeOf__department___1').range(0, 2).select('Person___1').outE('LivesAt').as('Person__out_livesAt___1').otherV().as('Person__out_livesAt__address___1').union(select('Person__out_employeeOf__department___1'))
==>v[{~label=Department, department_id="department1"}]
==>v[{~label=Department, department_id="department1"}]
==>v[{~label=Department, department_id="department2"}]
==>v[{~label=Department, department_id="department2"}]


Now, someone else might have more insight here, but I think this result set has duplicated entries, because at this stage in the traversal, we have two traversers, and they both pull the results into an ordered list, and applying the range step at this point reduces this to the top 2 results (which are identical).

I am not entirely sure how to fix the query for your needs. I removed the second range step to eliminate the curtailing of the intermediate results, and added a dedup() step at the end.

gremlin> g.V().hasLabel('Person').has('name',regex('.*')).as('Person___1').select('Person___1').outE('EmployeeOf').as('Person__out_employeeOf___1').otherV().as('Person__out_employeeOf__department___1').range(0, 2).select('Person___1').outE('LivesAt').as('Person__out_livesAt___1').otherV().as('Person__out_livesAt__address___1').union(select('Person__out_livesAt__address___1'),select('Person__out_employeeOf__department___1')).dedup()
==>v[{~label=Address, address_id="address1"}]
==>v[{~label=Department, department_id="department1"}]
==>v[{~label=Address, address_id="address2"}]
==>v[{~label=Department, department_id="department2"}]


I hope this helps to understand why you are seeing the result.

Looking forward to seeing more on this discussion.


1 comment Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thanks for your efforts. But the second range should be applied and is mandatory. So hopefully some dse engineer can help here.

0 Likes 0 · ·