vijayakumar.sithamohan1_178594 asked:

What is the recommended approach for deleting obsolete Graph edges?

Hi,

I am looking for best practices for deleting historical edges. I am following the "end timestamp" approach to end a relationship. Per our use case, linkages that ended more than 7 years ago need to be purged.

Our initial thought process was:

1. Whenever we end the linkage by adding the end timestamp, we thought of also adding an expiry (TTL); see the sketch after this list.

(Note: In the documentation, however, I found that this cannot be done: "DSE Graph sets TTL per vertex label or edge label, and all vertices or edges will be affected by the TTL setting. DSE Graph cannot set TTL for an individual vertex or edge.")

2. Have a cron job which triggers every day to check the expiry and delete the edges, using a Gremlin traversal that invokes drop(). If this is the suggested approach, is it synchronous? Is it possible to do it asynchronously?

3. Any other approach?
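
For reference, a rough sketch of what the label-level TTL from that documentation note would look like. This is untested, the 'knows' edge label is hypothetical, and the figure is just 7 years expressed in seconds:

// Sketch only: DSE Graph sets TTL per label, so this would expire EVERY
// 'knows' edge 220752000 seconds (~7 years) after it is written,
// not just the edges whose relationship has ended.
schema.edgeLabel('knows').ttl(220752000).create()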

Thanks

Vijay

dsegraph

@vijayakumar.sithamohan1_178594 Just acknowledging your question. Let me get one of our Graph engineers to respond to you. Cheers!

bettina.swynnerton answered:

Hi,

If you have a lot of edges to delete, and you like working with Spark, you might also want to consider using DSE GraphFrames to carry out the deletes.

Please be aware that deletes via GraphFrames also create tombstones, as polandll pointed out in the other answer.


A small example of how this works with GraphFrames:

Here is a sad little graph of people ending their relationships, created in the gremlin console:

// Schema for the example: property keys, vertex label, and edge label
schema.propertyKey("person_id").Text().single().create()
schema.propertyKey("relationship_ended").Date().single().create()
schema.propertyKey("name").Text().single().create()
schema.edgeLabel("knows").single().properties("relationship_ended").create()
schema.vertexLabel("person").partitionKey("person_id").properties("name").create()
schema.edgeLabel("knows").connection("person", "person").add()

Vertex person1 = graph.addVertex(label, 'person', 'person_id', 'person1', 'name', 'person1'); 
Vertex person2 = graph.addVertex(label, 'person', 'person_id', 'person2', 'name', 'person2');
Vertex person3 = graph.addVertex(label, 'person', 'person_id', 'person3', 'name', 'person3');
Vertex person4 = graph.addVertex(label, 'person', 'person_id', 'person4', 'name', 'person4');

person1.addEdge('knows', person3, 'relationship_ended', '2020-03-20'); 
person1.addEdge('knows', person4, 'relationship_ended', '2020-02-01'); 
person1.addEdge('knows', person2, 'relationship_ended', '2020-01-01');


And here is an example in the Spark shell, using GraphFrames to identify the edges where the `relationship_ended` property is earlier than a given date, and then delete them with the `deleteEdges()` DseGraphFrame method.


$ dse spark
The log file is at /home/automaton/.spark-shell.log
Creating a new Spark Session
Spark context Web UI available at http://10.101.35.116:4040
Spark Context available as 'sc' (master = dse://?, app id = app-20200320061556-0003).
Spark Session available as 'spark'.
Spark SqlContext (Deprecated use Spark Session instead) available as 'sqlContext'
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.3.9
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val g = spark.dseGraph("relations2")
g: com.datastax.bdp.graph.spark.graphframe.DseGraphFrame = com.datastax.bdp.graph.spark.graphframe.DseGraphFrame@4c752425

scala> g.E().show(false)
+-----------------------+-----------------------+------+------------------------------------+------------------+
|src                    |dst                    |~label|id                                  |relationship_ended|
+-----------------------+-----------------------+------+------------------------------------+------------------+
|person:AAAAB3BlcnNvbjE=|person:AAAAB3BlcnNvbjI=|knows |00000000-0000-0000-0000-000000000000|2020-01-01        |
|person:AAAAB3BlcnNvbjE=|person:AAAAB3BlcnNvbjM=|knows |00000000-0000-0000-0000-000000000000|2020-03-20        |
|person:AAAAB3BlcnNvbjE=|person:AAAAB3BlcnNvbjQ=|knows |00000000-0000-0000-0000-000000000000|2020-02-01        |
+-----------------------+-----------------------+------+------------------------------------+------------------+


scala> val df = g.E().toDF().filter($"relationship_ended"<"2020-03-12")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [src: string, dst: string ... 3 more fields]

scala> df.show
+--------------------+--------------------+------+--------------------+------------------+
|                 src|                 dst|~label|                  id|relationship_ended|
+--------------------+--------------------+------+--------------------+------------------+
|person:AAAAB3Blcn...|person:AAAAB3Blcn...| knows|00000000-0000-000...|        2020-01-01|
|person:AAAAB3Blcn...|person:AAAAB3Blcn...| knows|00000000-0000-000...|        2020-02-01|
+--------------------+--------------------+------+--------------------+------------------+


scala> g.deleteEdges(df)
WARN  2020-03-20 06:17:52,383 org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
                                                                                
scala> g.E().show(false)
+-----------------------+-----------------------+------+------------------------------------+------------------+
|src                    |dst                    |~label|id                                  |relationship_ended|
+-----------------------+-----------------------+------+------------------------------------+------------------+
|person:AAAAB3BlcnNvbjE=|person:AAAAB3BlcnNvbjM=|knows |00000000-0000-0000-0000-000000000000|2020-03-20        |
+-----------------------+-----------------------+------+------------------------------------+------------------+


scala> 



polandll answered:

The information in #1 is indeed correct.

You could try the approach you outline in #2. However, you could also use a timestamp on the edges to filter your query traversals so that they ignore the edges you want to delete, rather than deleting them. Here is a similar query that could serve as a model:

g.E().hasLabel('rated').has('timestamp', gte(Instant.parse('2015-01-01T00:00:00.00Z'))).valueMap()

Notice the gte() predicate used for the timestamp. I also need to say that g.E() is NOT good for production in DSE Graph 6.7 or earlier. You should start your traversal query with g.V() and identify a vertex to which the edge is adjacent, as in the sketch below.
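
To delete along those lines, an anchored traversal could look something like the sketch below. The 'person', 'person_id', and 'knows' names are illustrative, and depending on the property type you may need to pass an Instant (as with Instant.parse() above) rather than a string:

// Sketch: anchor on a known vertex, then drop only its expired edges
g.V().has('person', 'person_id', 'person1').
  outE('knows').
  has('relationship_ended', lt('2013-03-20')).
  drop()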

Also, be aware that making a lot of deletions will create tombstones, which can cause their own problems. I don't know how familiar you are with Cassandra, so I'll put in a couple of links here:

https://docs.datastax.com/en/dse/6.7/dse-arch/datastax_enterprise/dbInternals/dbIntAboutDeletes.html


https://docs.datastax.com/en/dse/6.7/dse-arch/datastax_enterprise/dbInternals/archTombstones.html
