question

teriksson avatar image
teriksson asked Artem Chebotko commented

How do I ensure atomicity in reference tables that avoids secondary indexes?

Using open source Apache Cassandra v3.11.2, I have a table something like

master-table

Attribute Key
internal-id Partition Key
external-reference String
other-reference String
data-1 String
...
data-n String

Now I could have created an index on external-reference and other-reference

But since we should not use Secondary Indexes I would need to create to lookup tables

I know I could denormalize it all and have all that data also in these tables but I chose not to, I am OK with doing two reads.

Then I would have 2 tables more

ext-ref-table

Attribute Key
internal-id String
external-reference Partition Key

other-ref-table

Attribute Key
internal-id String
other-reference Partition Key

But how do I ensure atomicity and consistency

I could do

1. Use BATCH, the documentation says this is perfect for single partition inserts, and also small batches where there are multiple partitions.

2. Use a Kafka-Cassandra pattern, where I first write to Cassandra master table and then to Kafka, and then consume from Kafka to populate / insert into the lookup tables
But this is kind of heavy machinery ...

3. Use a job that runs regularly (think cron-job but not cron), that would go through and correct possible errors. But this smells... the job would traverse all the rows, and as the data grows from big to really big, this would take longer and longer...

So what would the best approach be ?

data modeling
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Artem Chebotko avatar image
Artem Chebotko answered Artem Chebotko commented

Hi @teriksson, thanks for the question.

You are right that secondary indexes would be a bad choice.

The two tables you proposed would be the most popular solution in this case. I suspect that you want to designate column internal-id to be a clustering key in each of the two tables. E.g., PRIMARY KEY((external-reference), internal-id). Unless, of course, external-reference and other-reference are alternative keys in the base table.

Another approach I would also consider is using materialized views. But there are tradeoffs. See if the pros out-weight the cons in your use case.

Now, how to keep duplicated data consistent across the three tables.

"Multi-partition batches are substantially more expensive as they require maintaining a batchlog in a separate Cassandra table. Therefore, even with respect to the main use case of updating the same data duplicated across multiple partitions due to denormalization, use multi-partition batches only when atomicity is truly important for your application." [source]

Instead of using batches, you can simply do multiple inserts that can be retried (in case of a failure) within your application. In practice, the chance of having inconsistent copies is very, very low.

A job that corrects possible inconsistencies can still be a good idea. Something like an (outer) join in Spark SQL can do the trick. Or, if data in the base table is known to be correct, one can drop and re-create the other two tables from the base table. This would be so easy to do with materialized views, BTW. I do not think you need to run such a job daily under normal circumstances and, in some cases, you can run it on a subset of data.

5 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

teriksson avatar image teriksson commented ·

I made a mistake @Artem Chebotko , ofcourse the internal_id would be a Clustering Key.

0 Likes 0 ·
teriksson avatar image teriksson commented ·

Well, I would say that Materialized Views are worse than Secondary Indexes, they are more or less banned by Datastax and the Cassandra community. So that is a no-go.

I find it a bit surprising that this question has not received more attention, as this should be an important aspect for most people depending on Cassandra.

What we see in our environment is that we have write timeouts from time to time, we have components that could die (although that is not common). And we have seen inconsistencies in our tables so this is a real problem.


0 Likes 0 ·
Artem Chebotko avatar image Artem Chebotko teriksson commented ·

@teriksson Why would you say that Materialized Views are worse than Secondary Indexes?

I have not heard them being "banned". They are experimental in Cassandra. They are used in DataStax Graph to index edges.

0 Likes 0 ·
teriksson avatar image teriksson Artem Chebotko commented ·

DSE I do not know I use OpenSource Cassandra, but just google it, and you will see, MWs are not good.

0 Likes 0 ·
Show more comments