This question comes up regularly, so we wanted to provide a quick answer along with references for more details.
Background
If you are new to Cassandra and have only just started to delve into data modelling, it's understandable that the idea of denormalised tables would be unfamiliar. In relational databases, we are used to normalising entities into separate tables and using foreign keys to tie together related records.
In Cassandra, we do the exact opposite: we duplicate data across multiple tables to optimise reads, so the same piece of data often lives in two or more tables. An example is users commenting on videos uploaded to a video-sharing site such as http://www.killrvideo.com/. The application can display all the comments associated with a video, and a user can list all the comments they have posted on the site. This means we have to store the comments in two separate tables -- one called comments_by_video and another called comments_by_user. Here is what the tables look like:
CREATE TABLE comments_by_video (
    videoid uuid,
    commentid timeuuid,
    userid uuid,
    comment text,
    PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);

CREATE TABLE comments_by_user (
    userid uuid,
    commentid timeuuid,
    videoid uuid,
    comment text,
    PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
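The payoff for this duplication is that each table answers a single application query with one partition read, returned newest-first thanks to the DESC clustering order. As a quick sketch (the UUID literals here are made-up example values):

-- All comments on a given video, newest first
SELECT commentid, userid, comment
FROM comments_by_video
WHERE videoid = 0693cf23-16a7-4bde-9b0a-0c2f1e2b3c4d;

-- All comments posted by a given user, newest first
SELECT commentid, videoid, comment
FROM comments_by_user
WHERE userid = 5f2e3c1a-8d4b-4c6e-9f0a-1b2c3d4e5f6a;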
Solution
So back to the question: how do we keep the tables in sync? The Cassandra Query Language (CQL) has a BATCH statement that groups multiple INSERT, UPDATE, and DELETE statements so they are applied as a single atomic unit. When a user posts a new comment on the KillrVideo site, we need to run two INSERT statements -- one for the comments_by_video table and another for the comments_by_user table:
INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);
In CQL, we enclose these two inserts in a BATCH statement so they are executed as one atomic unit:
BEGIN BATCH
    INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (...);
    INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (...);
APPLY BATCH;
The atomic guarantee for a (logged) batch is that either all of its statements eventually succeed or none of them do, so we never end up with the comment in one table but not the other. Keep in mind that batches provide atomicity but not isolation: a read that races with the batch may briefly see only some of the writes.
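To make that concrete, here is a sketch of the batch with the placeholders filled in. All the values are made up for illustration; in practice the application generates the IDs. Note that the same commentid (a timeuuid) and the same comment text are written to both tables:

BEGIN BATCH
    INSERT INTO comments_by_video (videoid, commentid, userid, comment)
    VALUES (0693cf23-16a7-4bde-9b0a-0c2f1e2b3c4d,  -- videoid
            e7ae5cf3-d358-11e2-b9e6-0242ac120002,  -- commentid (timeuuid)
            5f2e3c1a-8d4b-4c6e-9f0a-1b2c3d4e5f6a,  -- userid
            'Great video!');
    INSERT INTO comments_by_user (userid, commentid, videoid, comment)
    VALUES (5f2e3c1a-8d4b-4c6e-9f0a-1b2c3d4e5f6a,  -- userid
            e7ae5cf3-d358-11e2-b9e6-0242ac120002,  -- same commentid
            0693cf23-16a7-4bde-9b0a-0c2f1e2b3c4d,  -- videoid
            'Great video!');
APPLY BATCH;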
Caution
It is important to note that batches in Cassandra are not a performance optimisation the way batches in relational databases are; grouping statements does not make them run faster. Only use batches when you genuinely need related writes to succeed or fail together, such as keeping denormalised tables in sync (see the anti-pattern sketch below). Misusing them risks overloading the coordinator node, which can affect the overall performance of your cluster.
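For instance, a hypothetical anti-pattern (example values again) is using a batch as a bulk loader for unrelated rows. The two comments below belong to different videos and have no need to succeed or fail together, so the batch gains nothing and only adds coordinator and batchlog overhead; separate asynchronous inserts would serve better:

-- Anti-pattern sketch: unrelated writes bundled "to save round trips"
BEGIN BATCH
    INSERT INTO comments_by_video (videoid, commentid, userid, comment)
    VALUES (11111111-1111-1111-1111-111111111111,
            811a27f0-d358-11e2-a001-0242ac120002,
            22222222-2222-2222-2222-222222222222, 'comment A');
    INSERT INTO comments_by_video (videoid, commentid, userid, comment)
    VALUES (33333333-3333-3333-3333-333333333333,
            822b38f0-d358-11e2-a002-0242ac120002,
            44444444-4444-4444-4444-444444444444, 'comment B');
APPLY BATCH;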
References
- Tutorial - Application Development | BATCHes @ DataStax Academy
- Docs - Batching data insertion and updates @ DataStax Docs