question

Ryan Quey avatar image
Ryan Quey asked Ryan Quey commented

Is there documentation on CQL-to-Gremlin mapping?

I understand that there are examples on how to use CQL to add data that can later be accessed through Gremlin, e.g., here. (This link looks like a duplicate more or less?). However, I am having trouble finding in depth documentation. Is there somewhere to learn more about how CQL commands map to their Gremlin equivalents? Thanks

UPDATE:

I'm adding more detail to clarify my question, hopefully this helps.

Specifically I'm wondering:

1. Is there documentation for the CQL commands that interact with Graph? E.g.,

WITH VERTEX LABEL ...
WITH EDGE LABEL ... FROM ... TO
graph_engine =  ...


Is there somewhere that I can see all the options and parameters for these CQL commands? Maybe there's java docs, or other documentation somewhere?

2. And related to that, where can I find details on the requirements for the Cassandra tables that are generated by Gremlin queries?

E.g., I recently found out that the table that has an EDGE LABEL needs to have its primary key be a combination of the primary keys of the LABELs. Of course this makes sense intuitively, and the examples given in the docs demonstrate what it looks like, but they don't always describe what is necessary vs best practice vs what they just happened to do for this example. Is there a reference I can look at that gives more details on this?

I'm mostly looking for reference material/documentation, thanks.

graphcql
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

bettina.swynnerton avatar image
bettina.swynnerton answered Ryan Quey commented

Hi @Ryan Quey

thank you for clarifying your question, we really misunderstood what you needed to know.

So the question is: Is there any documentation how to map Cassandra tables to Graph tables?

I see that this is not obvious in our documentation, and will investigate this further with the docs team and will come back to you.

In the meantime, here is more explanation.

Vertex tables:

Any CQL table can be a vertex table. And any DSE Graph vertex table can be used just like any other CQL table.

The structure is no different from other CQL tables: We have a primary key, composed of partition key and clustering keys, and one row per primary key. One row corresponds to one vertex.

(Note that this was different in the previous graph implementations (6.7 and earlier), where we could have had several rows per vertex, and thus access via CQL was not so straight forward.)

Graph tables are annotated with the vertex label ( AND VERTEX LABEL ... ) to ensure they are seen as part of the graph:

Here is an example:

CREATE TABLE community_8319.person (
    person_id text PRIMARY KEY,
    age int,
    name text
) WITH additional_write_policy = '99PERCENTILE'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND nodesync = {'enabled': 'true', 'incremental': 'true'}
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99PERCENTILE' AND VERTEX LABEL person;

For comparison, this is the equivalent table set up purely as CQL. It is created with exactly the same default options.

The creation statement:

CREATE TABLE community_8319_2.person (
    person_id text PRIMARY KEY,
    age int,
    name text
);

is implemented as here:

CREATE TABLE community_8319_2.person (
    person_id text PRIMARY KEY,
    age int,
    name text
) WITH additional_write_policy = '99PERCENTILE'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND nodesync = {'enabled': 'true', 'incremental': 'true'}
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99PERCENTILE';

The only difference is that the table is not annotated with the vertex label.


Edge tables:

For a valid edge table, we need to observe a couple of things:

  • We need to be able to map an edge precisely to the two vertices it connects. This requires the knowledge of the complete primary key of both vertices.
  • And the edge in DSE Graph is directed: We have the concept of the "out-Vertex" and the "in-Vertex".

In an edge of the type "created" that connects "person" and "software", ie. the relation person -> created -> software, person is the out-Vertex label and software is the in-Vertex label.

So an edge needs to have both full primary keys of both vertices in its table structure, and DSE Graph 6.8 implements the table structure in such a way that

  • the partition key of the out-Vertex is added to the partition key of the edge,
  • and the clustering key of the out-Vertex and the full primary key of the in-Vertex are added as clustering keys of the edge.
  • All vertex key columns are necessary to uniquely identify the relevant vertices.

You could add your own edge-specific partitioning and additional clustering keys to the edge table to allow multiple edges between the two vertices.

Here is a minimum edge table connecting two vertices that have single column primary keys.

CREATE TABLE community_8319.person__created__software (
    person_person_id text,
    software_software_id text,
    PRIMARY KEY (person_person_id, software_software_id)
)

Here is an example where we also have a weight column in the edge, which records a non-keyed edge property. Note the annotation in the end:

CREATE TABLE community_8319.person__created__software (
    person_person_id text,
    software_software_id text,
    weight float,
    PRIMARY KEY (person_person_id, software_software_id)
) WITH CLUSTERING ORDER BY (software_software_id ASC)
    AND additional_write_policy = '99PERCENTILE'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND nodesync = {'enabled': 'true', 'incremental': 'true'}
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99PERCENTILE' AND EDGE LABEL created FROM person(person_person_id) TO software(software_software_id);

You see that the primary key column names do not correspond exactly to the primary keys of the vertex tables. When generated through the Gremlin schema API they are prefixed with the name of the respective vertex label, because the vertex tables could have columns of the same name.


More general rules for mapping a CQL table to an edge table:

When you create your own edge table, note that while the column names are not required to match, the data type and the number of mapping properties must match the data type and the number of primary key columns on the vertex label in order.

In addition, we might have edge label specific partition keys and clustering keys.

When creating an edge label table, its primary key needs to be composed and internally ordered by the following categories:

  1. Partitioning columns specific to the edge label, in definition order
  2. Partition key columns on the from(..) vertex label's table, in they order they appear there
  3. Clustering columns defined on the from(..) vertex label's table, in the order they appear there
  4. Clustering columns specific to the edge label, in definition order
  5. Primary key columns (both partition key and clustering) defined in the to(..) vertex label's table, in the order they appear there

The first two categories become partition key columns in the created edge table. The last three categories become clustering columns in the created edge table.


I hope this helps for now, I will follow up with more documentation when available.

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Ryan Quey avatar image Ryan Quey commented ·

This is incredible, thank you!! Very helpful

0 Likes 0 ·
Erick Ramirez avatar image
Erick Ramirez answered Erick Ramirez commented

My understanding of your question is that you want to know how to do Gremlin traversals via CQL. Feel free to let me know if I misunderstood you.

The CQL-as-Graph is not intended to perform complex Gremlin traversals via CQL. The intention is to provide a mechanism for converting Cassandra data to Graph so it can be traversed with queries that are not normally possible via CQL.

And with the new DataStax Graph (DSG), the data can be traversed with Gremlin as graph data or queried via CQL as Cassandra data (not as Graph data). The DSG data model provides a bit more flexibility so the data can be used for different use cases. Cheers!

4 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Ryan Quey avatar image Ryan Quey commented ·

Hi Erick, thanks for your answer. You are understanding the gist of my question in general, but in particular I was wondering if there is a place where there is further documentation for using CQL to do graph traversals and DDL. The links I provided in the question give demos and examples, but is there more in depth documentation somewhere?


0 Likes 0 ·
Erick Ramirez avatar image Erick Ramirez ♦♦ Ryan Quey commented ·

Again, you can not use CQL to do graph traversals. You can only use CQL to query Cassandra data, not Graph data.

The new DataStax Graph allows you to convert Cassandra data into Graph data via CQL so you can run Graph traversals via Gremlin because CQL does not allow you to do complex queries the way that Graph traversals do. Cheers!

0 Likes 0 ·
Ryan Quey avatar image Ryan Quey Erick Ramirez ♦♦ commented ·

Ok thanks that is a helpful clarification. In that case, forget traversals. Is there documentation though for the commands that convert Cassandra data into CQL somewhere?

0 Likes 0 ·
Show more comments
bettina.swynnerton avatar image
bettina.swynnerton answered Ryan Quey edited

Hi @Ryan Quey,

in addition to Erick's answer, let me add some more clarification.

DSE Graph 6.8 is implemented using the underlying Cassandra database in such a way that the data in the graph tables can be queried both with CQL and with Gremlin.

CQL looks similar to SQL, but it is only a subset, and only one table is queried per query. There are no JOINs with Cassandra.

Gremlin queries graph data with a lot more capability. For example, a traversal like this hits more than one Cassandra table ( in this example, you'll see cql queries for the "person" table, the edge table for the "person_created_software" and the "software" table)

g.V().hasLabel("person").has("person_id", "person_1").out("created").hasLabel("software").has("language", "java")

A simple Gremlin query like this could already spawn quite a few traversers, and each of them launch further CQL queries in order to collect the requested results, so this one Gremlin query is translated to many CQL queries.

As Erick said, these query languages are quite different, and equivalent queries are only possible for very simple queries:

g.V().hasLabel("person").has("person_id", "person_1")

would be equivalent to this CQL query:

select * from my_graph.person where person_id='person_1'

With DSE Graph 6.8 we introcuded the ability to expose existing Casssandra tables to the Gremlin API, by converting them to vertex and edge tables.

This is essentially what is covered by the examples in the documentation that you linked with your question. The examples take some steps to build up some CQL data first, to be able to demonstrate the conversion.

The key conversion happens here for the keyspace:

ALTER KEYSPACE food_cql 
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1} 
  AND graph_engine = 'Core';

And then in addition for the tables:

ALTER TABLE food_cql_conversion.person WITH VERTEX LABEL "person_label";
ALTER TABLE food_cql_conversion.book WITH VERTEX LABEL "book_label";
ALTER TABLE food_cql_conversion.person_authored_book 
  WITH EDGE LABEL "authored"
  FROM person(person_name, person_id)
  TO book(book_name, book_id);

These conversion steps are necessary so that the tables are recognised by the Gremlin API as graph tables. We are assigning additional graph schema such as vertices and edges.

Being able to access the same tables via the two different methods has advantages. With the help of Gremlin, we can explore relations between tables, which are quite easy to express in the Gremlin traversal language. We cannot express these relations in CQL, and in SQL we could end up needing complex JOINs.

(have a look here to see some examples for Gremlin vs SQL: http://sql2gremlin.com)

On the other hand, being able to access the tables directly as Cassandra tables allows for easier loading of the graph tables, for example with the bulk loader. And we can also use a combination of the two APIs: Get some results with Gremlin, and then continue working with the tables and their primary keys directly via CQL.

I hope this helps, let us know if you have more questions.

1 comment Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Ryan Quey avatar image Ryan Quey commented ·

Ok sorry, I didn't make my question very clear, partially because I wasn't clear on the concepts, so your answers are helpful for clarifying some of that. I'll edit my question though so what I'm wondering is more clear. I'm mostly wondering about where to find documentation/reference material.

Sorry for the lack of clarity on my part, and thanks again for your answer.

0 Likes 0 ·