sergiy_177806 asked Erick Ramirez edited

What is the recommended way to model the data where 90% of records are deleted daily?

Hello all, we have quite a challenging problem: we have a table with records written throughout the day, and most of them (90%) will get deleted by the end of the day, but the remaining 10% will be long-lived for up to several months.

We are entertaining the idea of having 2 or 3 tables and having a side app process move the long-lived rows and then truncate/rotate the tables - but I'm not a fan of granting the app ID TRUNCATE privileges, or CREATE/DROP privileges so it can create a new daily table programmatically. Just want to run this by the community to see if someone had a similar problem and what solution was chosen.

data modeling, tombstones

1 Answer

Erick Ramirez answered sergiy_177806 commented

@sergiy_177806 the high-delete pattern you described indicates that you have a queue-like use case and it's an anti-pattern for Cassandra (see this DataStax blog post). We don't recommend implementing queue datasets on Cassandra.

A workaround, which you already alluded to, is to model your data so that records are grouped into tables which you can then TRUNCATE. Modelling the data this way does not generate tombstones, which are the main issue for high-delete workloads. See Ryan Svihla's blog post for a detailed explanation and an example. Cheers!
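To make the table-grouping idea concrete, here is a minimal sketch of the bucket-rotation bookkeeping in Python. The table names (`events_bucket_N`), the bucket count, and the rotation scheme are all assumptions for illustration, not anything prescribed by the thread; the side process would still issue the actual CQL `TRUNCATE` against the bucket this logic selects.

```python
from datetime import date, timedelta

# Hypothetical scheme: writes for each day land in one of NUM_BUCKETS
# rotating tables (events_bucket_0 .. events_bucket_2). Before a bucket
# is reused, a side process copies the surviving ~10% of rows into a
# long-lived table and TRUNCATEs the bucket - no tombstones generated.
NUM_BUCKETS = 3

def bucket_table_for(day: date, num_buckets: int = NUM_BUCKETS) -> str:
    """Return the bucket table that receives writes for the given day."""
    return f"events_bucket_{day.toordinal() % num_buckets}"

def bucket_to_truncate(today: date, num_buckets: int = NUM_BUCKETS) -> str:
    """Return the oldest bucket, which is drained and truncated today."""
    oldest_day = today - timedelta(days=num_buckets - 1)
    return bucket_table_for(oldest_day, num_buckets)
```

With three buckets, yesterday's data stays queryable while the oldest bucket is drained and truncated; TRUNCATE discards whole SSTables, so none of the 90% short-lived rows ever become tombstones.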

2 comments

sergiy_177806 commented ·

Thanks a lot Erick, I'm well aware of that anti-pattern - I guess I just wanted to reconfirm my assumptions. Our other thought was to add some type column as a clustering key to be able to separate those fill entries and then delete in bulk by clustering key range, as that should minimize the number of tombstones.

Erick Ramirez ♦♦ sergiy_177806 commented ·

@sergiy_177806 the method you described is exactly how you end up with range tombstones, because you are deleting a range of rows based on a clustering key. Depending on how you read the data and how it is ordered, the tombstone problem happens because queries have to scan (iterate) over the deleted rows to get to the data you want. Cheers!
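The scan cost described above can be illustrated with a toy model (this is a deliberately simplified sketch, not real Cassandra read-path internals): rows sit in clustering order, a range delete leaves a tombstone covering a contiguous span, and a read must still step over every deleted row on the way to the live ones.

```python
# Toy model of reading one partition: rows are stored in clustering
# order, and range tombstones are (lo, hi) spans of deleted keys.
def read_live_rows(rows, tombstone_ranges):
    """Scan rows in order, skipping tombstoned ones.
    Returns (live_rows, scanned_count) to show the wasted work."""
    live, scanned = [], 0
    for key in rows:
        scanned += 1  # every row costs a scan step, deleted or not
        if any(lo <= key <= hi for lo, hi in tombstone_ranges):
            continue  # deleted row: skipped, but still examined
        live.append(key)
    return live, scanned

rows = list(range(100))      # 100 rows in one partition
tombstones = [(0, 89)]       # 90% deleted by a single range delete
live, scanned = read_live_rows(rows, tombstones)
# Only 10 rows are live, yet all 100 were examined to find them.
```

In this 90%-deleted scenario the query does ten times more scanning than the result size warrants, which is the behaviour that makes high-delete, queue-like workloads an anti-pattern.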
