Using open source Apache Cassandra v3.11.2, I have a table something like
master-table
Attribute | Key |
internal-id | Partition Key |
external-reference | String |
other-reference | String |
data-1 | String |
... | |
data-n | String |
Now I could have created an index on external-reference and other-reference
But since we should not use Secondary Indexes I would need to create to lookup tables
I know I could denormalize it all and have all that data also in these tables but I chose not to, I am OK with doing two reads.
Then I would have 2 tables more
ext-ref-table
Attribute | Key |
internal-id | String |
external-reference | Partition Key |
other-ref-table
Attribute | Key |
internal-id | String |
other-reference | Partition Key |
But how do I ensure atomicity and consistency
I could do
1. Use BATCH, the documentation says this is perfect for single partition inserts, and also small batches where there are multiple partitions.
2. Use a Kafka-Cassandra pattern, where I first write to Cassandra master table and then to Kafka, and then consume from Kafka to populate / insert into the lookup tables
But this is kind of heavy machinery ...
3. Use a job that runs regularly (think cron-job but not cron), that would go through and correct possible errors. But this smells... the job would traverse all the rows, and as the data grows from big to really big, this would take longer and longer...
So what would the best approach be ?