madcow_188634 asked ·

Why is groupCount() returning inconsistent results for vertices?

I am currently running a 7-node DSE cluster (with analytics, graph, and search), and I am getting inconsistent results when doing a simple vertex count. I have tried running repair and cleanup on all the nodes, but that didn't seem to make a difference. For fun, I stood up another cluster with 5 nodes and loaded it with the same data, and I get the exact same results. At least the numbers are consistent!

When looking at the Spark jobs, I see that very little data is read when I get the wrong count. That makes me think the tables are just sending a cached or precalculated count, but I haven't proven that yet. Below are the different ways I am trying to get the count.

The following give the correct results:

scala> spark.dseGraph("my_graph").cache().V().hasLabel("labelA").count.show()
scala> spark.sql("SELECT count(DISTINCT(labelA_id)) FROM my_graph.labelA").show()
gremlin> g.V().label().groupCount()

These queries give me a smaller, incorrect number (but it is consistent):

scala> spark.dseGraph("my_graph").V().hasLabel("labelA").count.show()
scala> spark.sql("SELECT count(1) FROM my_graph.labelA").show()
gremlin> g.V().groupCount().by(label)

Has anyone seen anything like this? Not sure what could be in the data to cause this.


Would you be able to share your schema for "labelA"? And would you update your question with the actual counts, so it is clear where the inconsistencies occur?

madcow_188634 (replying to bettina.swynnerton ♦♦):

I oversimplified my searches by using "labelA". Here is my full schema. When we count "user" the numbers match, but when we count "business" the numbers are only correct when we cache the graph before the search.

schema.type('job').ifNotExists().
property('company', Varchar).
property('title', Varchar).
create()

schema.type('place').ifNotExists().
property('name', Varchar).
property('cords', Point).
create()

schema.vertexLabel('business').ifNotExists().
partitionBy('business_id', Varchar).
property('business_name', Varchar).
property('cords', Point).
create()



schema.vertexLabel('user').ifNotExists().
partitionBy('user_id', Varchar).
property('gender', Int).
property('user_name', Varchar).
property('jobs', setOf(frozen(typeOf('job')))).
property('names', setOf(Varchar)).
property('places', setOf(frozen(typeOf('place')))).
property('schools', setOf(Varchar)).
create()

schema.edgeLabel('follows').ifNotExists().
from('user').to('user').
partitionBy(OUT, 'user_id', 'out_user_id').
clusterBy(IN, 'user_id', 'in_user_id', Asc).
create()

schema.edgeLabel('reviewed').ifNotExists().
from('user').to('business').
partitionBy(OUT, 'user_id', 'user_user_id').
clusterBy(IN, 'business_id', 'business_business_id', Asc).
property('rating', Int).
property('categories', setOf(Varchar)).
create()



@madcow_188634, Thanks for the schema. I really hope that you don't mind that I am asking yet more questions to understand the issue. Which one of these five queries gives you an incorrect count?

spark.dseGraph("my_graph").cache().V().hasLabel("business").count.show()
spark.dseGraph("my_graph").V().hasLabel("business").count.show()
spark.sql("SELECT count(DISTINCT(business_id)) FROM my_graph.business").show()
spark.sql("SELECT count(1) FROM my_graph.business").show()
gremlin> g.V().groupCount().by(label)


madcow_188634 (replying to bettina.swynnerton ♦♦):

I promise I don't mind; I really appreciate the help. Of the ones you listed, these are the ones that give incorrect counts:

spark.dseGraph("my_graph").V().hasLabel("business").count.show()

spark.sql("SELECT count(1) FROM my_graph.business").show()

gremlin> g.V().groupCount().by(label)



@madcow_188634, I tried these counts on one of my test graphs and have now seen a discrepancy in vertex counts between the cached and non-cached GraphFrame counts. I believe this is a problem with the Spark caching (a bug in the Spark serialisation and cache protocol?), so I am looking into it further. With large graphs it's hard to verify which count is correct, but I believe the count on the cache is the incorrect one. I will set up another test to be sure, and will also test the Spark SQL queries.
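
For anyone wanting to reproduce the comparison, here is a minimal sketch of a side-by-side check, assuming it is pasted into the DSE Spark shell, where `spark` and the `dseGraph` extension used earlier in this thread are already available. The `graphName` and `label` values are simply local names for the strings used in the thread. Reading the SQL results back with `first().getLong(0)` is standard Spark, not something anyone in the thread ran. The GraphFrame calls are copied as posted, so they only print their result.

```scala
// Assumption: run inside the DSE Spark shell, so `spark` is predefined and
// spark.dseGraph(...) resolves the same way it did in the queries above.
val graphName = "my_graph"   // graph name as used in this thread
val label     = "business"   // the vertex label whose counts disagree

// Spark SQL counts, read back as Longs so they can be compared directly.
val countAll = spark
  .sql(s"SELECT count(1) FROM $graphName.$label")
  .first().getLong(0)
val countDistinct = spark
  .sql(s"SELECT count(DISTINCT(${label}_id)) FROM $graphName.$label")
  .first().getLong(0)

println(s"count(1)        = $countAll")
println(s"count(DISTINCT) = $countDistinct")
println(s"difference      = ${countDistinct - countAll}")

// GraphFrame counts, written exactly as posted earlier; each prints a one-row table.
spark.dseGraph(graphName).V().hasLabel(label).count.show()          // non-cached
spark.dseGraph(graphName).cache().V().hasLabel(label).count.show()  // cached
```

One thing this makes easy to spot: over a single set of rows, count(1) can never be smaller than count(DISTINCT business_id), because every row carries its partition key. If the difference printed above is positive, the two SQL queries most likely did not read the same set of rows, which would point at the read path rather than at the data itself.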


0 Answers