Hello.
I need an advice in term of reading performance. The question is more about how to design hierarchical data
I’m building an application which create set of data with relationships as hierarchy and it seems than my partitions might become big and reach out the limits of Cassandra, so I was thinking to bucket and split partitions.
I’m thinking two approach:
- One way, is to insert into two tables (1st as single unit of data and 2nd related time-series of the data - but may include a lot of duplication) and later on range scan a large partition (even by buckets)
- Second way, is to insert into two tables (1st as single unit of data and 2nd as index lookup) and performs at least two queries: 1st lookup into the index table and 2nd range of the partition keys provided
The main difference remains on the query load from the client. The first will query any bucket sizing even if the data is not here but through a range scan. The second will perform - 1 + number of items to lookup - queries.
Let's imagine a schema for the 1st approach
CREATE TABLE product ( id varchar PRIMARY KEY, created_at timestamp, col1 varchar, col2 varchar, col3 varchar )
CREATE TABLE product_chain ( chain_id varchar, bucket int, id_product varchar, created_at timestamp, col1 varchar, col2 varchar, col3 varchar, PRIMARY KEY((id, bucket), created_at) )
If we want to get the full chain of a product we can query by bucket, so if every bucket is day of year to get all the product from a chain need to make 365 queries
(i.e SELECT * FROM product_chain WHERE chain_id ="01512" and bucket=160)
Of course hash of timestamp like mod can be used to set bucker instead of day but need to the overall capacity to not go further than 2B cells.
But with a second design including a lookup table, it will more like:
CREATE TABLE product ( id varchar PRIMARY KEY, created_at timestamp, col1 varchar, col2 varchar, col3 varchar )
CREATE TABLE product_chain ( chain_id varchar; created_at timestamp, product_id varchar, PRIMARY KEY(chain_id, created_at) )
So to query the a chain of a product it will 1 + number of product in a list
(ie. SELECT product_id FROM product_chain WHERE chain_Id="01512")
(ie SELECT * FROM product WHERE product_id=?)
And I'm wondering which is the better case in term of reading performance;
Thanks