question

samuel_191829 avatar image
samuel_191829 asked Erick Ramirez commented

Query pattern for hierarchical model

Hello.

I need an advice in term of reading performance. The question is more about how to design hierarchical data

I’m building an application which create set of data with relationships as hierarchy and it seems than my partitions might become big and reach out the limits of Cassandra, so I was thinking to bucket and split partitions.

I’m thinking two approach:

  1. One way, is to insert into two tables (1st as single unit of data and 2nd related time-series of the data - but may include a lot of duplication) and later on range scan a large partition (even by buckets)
  2. Second way, is to insert into two tables (1st as single unit of data and 2nd as index lookup) and performs at least two queries: 1st lookup into the index table and 2nd range of the partition keys provided

The main difference remains on the query load from the client. The first will query any bucket sizing even if the data is not here but through a range scan. The second will perform - 1 + number of items to lookup - queries.

Let's imagine a schema for the 1st approach

CREATE TABLE product (
    id varchar PRIMARY KEY,
    created_at timestamp,
    col1 varchar,
    col2 varchar,
    col3 varchar
)
CREATE TABLE product_chain (
    chain_id varchar,
    bucket int,
    id_product varchar,
    created_at timestamp,
    col1 varchar,
    col2 varchar,
    col3 varchar,
    PRIMARY KEY((id, bucket), created_at)
)

If we want to get the full chain of a product we can query by bucket, so if every bucket is day of year to get all the product from a chain need to make 365 queries

(i.e SELECT * FROM product_chain WHERE chain_id ="01512" and bucket=160)

Of course hash of timestamp like mod can be used to set bucker instead of day but need to the overall capacity to not go further than 2B cells.

But with a second design including a lookup table, it will more like:

CREATE TABLE product (
    id varchar PRIMARY KEY,
    created_at timestamp,
    col1 varchar,
    col2 varchar,
    col3 varchar
)
CREATE TABLE product_chain (
    chain_id varchar;
    created_at timestamp,
    product_id varchar,
    PRIMARY KEY(chain_id, created_at)
)

So to query the a chain of a product it will 1 + number of product in a list

(ie. SELECT product_id FROM product_chain WHERE chain_Id="01512")

(ie SELECT * FROM product WHERE product_id=?)

And I'm wondering which is the better case in term of reading performance;

Thanks

cassandradata modeling
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered Erick Ramirez commented

On face value, neither of the approaches you outlined would perform well since they require the application to do multiple reads in order to get the required result.

When designing your data model, you need to identify the questions that your app wants to ask the database. Those questions (queries) dictate the design of your tables. The primary philosophy of data modeling in Cassandra is that each app query maps to one table such that the data is partitioned and clustered to answer each question.

I'm happy to be corrected but your post seems to indicate that you have started with the table designs then trying to map multiple queries. This approach is contrary to modeling in Cassandra.

When the data is modeled correctly for the application query, it follows that reads will perform well since the data can be easily retrieved with just one query.

If you haven't already seen them, here are some Data Modeling resources:

Cheers!

6 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

samuel_191829 avatar image samuel_191829 commented ·

In fact those tables are result of a design by query. My queries are : get_product_by_id and get_chain_of_product_by_id. The latter is a hierarchical query model. I ended up with those approach due to the design and storage limitations of Cassandra partition. (2 B cells and 100MB). Because datawill be here for years, the size of a hierarchical partition may grow. The previous approach was to use a single partition for the hierarchical one, but the result was a huge partition problem.


Currently I'm using the bucket approach with mod 10 on the timestamp to reduce the client threads queries. But it may embrace a lot of data duplication. So wonder if the second like lookup table is good approach.

I was talking with Scylla guys and they said me the first approach with bucket is the best.but nice to have your feedback as well

0 Likes 0 ·
Erick Ramirez avatar image Erick Ramirez ♦♦ samuel_191829 commented ·

Thanks for the clarification. Bucketing is definitely the way to split the rows across partitions to make each partition smaller.

I apologise that I don't understand what a "product chain" is which makes it difficult for me to understand what you were doing. But if it's a question of bucketing then that is the correct approach. Cheers!

1 Like 1 ·
samuel_191829 avatar image samuel_191829 Erick Ramirez ♦♦ commented ·

"Product chain" is like a product history which has lifecycle, but every change if immutable, so, for every new state a new entry is created, linked to the previous one.

Is data duplication a problem ? with a lot of cells (i.e around 400 per row, around:

28 KB).

Every time a change is done:

- The product table will contains new row with 400 cells = 28KB
- And the new product chain will contains N (previous entries) * 400 cells = N * 28KB

0 Likes 0 ·
Show more comments