
FAQ - How do I keep data in denormalized tables in sync?

This question comes up regularly, so we wanted to provide a quick answer along with some background.

Background

If you are new to Cassandra and have only just started to delve into data modelling, it’s understandable that you would be unfamiliar with the idea of having denormalized tables. In relational databases, we are used to normalizing entities into separate tables and using foreign keys to tie together related records.

In Cassandra, we do the exact opposite: we duplicate data across multiple tables to optimise reads. You will have situations where the same piece of data is stored in two or more tables. An example is users commenting on videos uploaded to a video-sharing site such as http://www.killrvideo.com/. The application can display all the comments associated with a video, and a user can list all the comments they have posted on the site. To serve both queries efficiently, we store the comments in two separate tables: one called comments_by_video and another called comments_by_user. Here is what the tables look like:

CREATE TABLE comments_by_video (
   videoid uuid,
   commentid timeuuid,
   userid uuid,
   comment text,
   PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);

CREATE TABLE comments_by_user (
   userid uuid,
   commentid timeuuid,
   videoid uuid,
   comment text,
   PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
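
Each table is designed to serve exactly one read pattern. For example, displaying the comments on a particular video, or the comments posted by a particular user, looks like this (the UUID literals are hypothetical sample values):

SELECT userid, commentid, comment
  FROM comments_by_video
 WHERE videoid = 40e7c8f2-95ae-44d4-93e6-73766f23cb93;

SELECT videoid, commentid, comment
  FROM comments_by_user
 WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89;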

Solution

So back to the question: how do we keep the tables in sync? The Cassandra Query Language (CQL) has a BATCH feature that groups multiple INSERT, UPDATE and DELETE statements into a single atomic unit. When a user posts a new comment on the KillrVideo site, we need two insert statements: an INSERT into the comments_by_video table, and another into the comments_by_user table:

INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);

In CQL, we enclose these two inserts in a BATCH statement so that they are applied atomically:

BEGIN BATCH
  INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
  INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);
APPLY BATCH;

The atomicity guarantee for batches (logged batches, which is the default) is that either all of the statements eventually succeed or none of them does. This ensures that when a comment is written to one table, we know it will be written to the other table as well.
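
To make this concrete, here is the same batch with hypothetical sample values filled in. One detail to be careful of: the commentid must be identical in both rows, so generate the timeuuid once in the application rather than calling now() in each statement, because each call to now() produces a different value:

BEGIN BATCH
  INSERT INTO comments_by_video (videoid, commentid, userid, comment)
  VALUES (
    40e7c8f2-95ae-44d4-93e6-73766f23cb93,   -- videoid
    1ed20f6e-e68e-11ec-90d6-0242ac120003,   -- commentid (generated once by the app)
    522b1fe2-2e36-4cef-a667-cd4237d08b89,   -- userid
    'Best cat video ever!'
  );
  INSERT INTO comments_by_user (userid, commentid, videoid, comment)
  VALUES (
    522b1fe2-2e36-4cef-a667-cd4237d08b89,   -- same userid
    1ed20f6e-e68e-11ec-90d6-0242ac120003,   -- same commentid
    40e7c8f2-95ae-44d4-93e6-73766f23cb93,   -- same videoid
    'Best cat video ever!'
  );
APPLY BATCH;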

Caution

It is important to note that batches in Cassandra are not a performance optimisation in the way that batches are in relational databases. Only use batches when necessary, such as for keeping denormalized tables in sync; otherwise you risk overloading the coordinator node, which can affect the overall performance of your cluster.
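
By contrast, here is a hypothetical sketch of the kind of batch to avoid: a "bulk load" of writes to many unrelated partitions, which forces the coordinator to track replicas for every partition in the batch. Individual (ideally asynchronous) writes are the better tool for that job:

BEGIN BATCH  -- anti-pattern: every insert below targets a different partition
  INSERT INTO comments_by_video (videoid, commentid, userid, comment)
  VALUES (11111111-1111-4111-8111-111111111111,
          2ed20f6e-e68e-11ec-90d6-0242ac120003,
          522b1fe2-2e36-4cef-a667-cd4237d08b89, 'first comment');
  INSERT INTO comments_by_video (videoid, commentid, userid, comment)
  VALUES (22222222-2222-4222-8222-222222222222,
          3ed20f6e-e68e-11ec-90d6-0242ac120003,
          522b1fe2-2e36-4cef-a667-cd4237d08b89, 'second comment');
  -- ...and so on for many more unrelated partitions
APPLY BATCH;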
