mailmeonvinay_191189 asked:

How can I purge data based on column value?

CREATE TABLE test.txn_msgs (
    id uuid PRIMARY KEY,
    consumed int,
    consumed_at timestamp,
    created_at timestamp,
    msg text,
    partition int,
    updated_at timestamp
);

I have a 3-node cluster running in production. The table above holds millions of records, and I want to continuously purge the rows that have the value 1 in the `consumed` column.

How can I achieve that without affecting the cluster's performance?

cassandra

Erick Ramirez answered:

Your question seems very counter-intuitive to me. Let me explain why.

At some point, your application sets the value of the consumed column to 1. Instead of doing that, your app should just issue a DELETE. That way, you won't have to worry about getting rid of the partition later.
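
For example, once a message has been processed, the app could delete the row by its primary key instead of updating it. A minimal sketch (the UUID value here is just a placeholder):

-- Remove the processed message by its primary key.
-- The UUID below is a hypothetical placeholder value.
DELETE FROM test.txn_msgs
WHERE id = 123e4567-e89b-12d3-a456-426614174000;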

If you need the data to be available for some period of time, consider setting a TTL on the whole partition so it automatically expires. Again, you won't have to worry about purging it.
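
A minimal sketch of both approaches, assuming a retention period of 7 days (604800 seconds); pick whatever period suits your use case:

-- Expire a single message 7 days after it is written.
-- 604800 seconds is an assumed retention period.
INSERT INTO test.txn_msgs (id, msg, created_at, consumed)
VALUES (uuid(), 'example message', toTimestamp(now()), 0)
USING TTL 604800;

-- Or set a default TTL for every row written to the table.
ALTER TABLE test.txn_msgs
WITH default_time_to_live = 604800;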

As a side note, it looks like you have a queue use case: you are queueing messages in a table, processing them, then finally getting rid of the consumed messages. At least the table doesn't have a clustering column, so you're not deleting rows within a partition.

Be aware that queues are an anti-pattern in Cassandra. This is discussed in a bit of detail in the blog post Cassandra anti-patterns: Queues and queue-like datasets.

Also, a message queue implies that your application does a full table scan to retrieve the messages that need to be consumed. Full table scans don't scale in a distributed environment and affect the performance of the cluster.
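
To illustrate, since consumed is not part of the primary key (and presumably not indexed), retrieving messages forces a query along these lines; ALLOW FILTERING makes every node scan all of its data:

-- Hypothetical sketch: find messages not yet consumed,
-- assuming consumed = 0 marks an unconsumed message.
-- Cassandra rejects this query unless ALLOW FILTERING is added,
-- because answering it requires scanning the entire table.
SELECT id, msg
FROM test.txn_msgs
WHERE consumed = 0
ALLOW FILTERING;

Cheers!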


bettina.swynnerton answered:

Hi @mailmeonvinay_191189,

To find all the records that have a 1 in the "consumed" column, a full table scan is required, so you would be better advised to offload this kind of continuous purging to Spark and run regular Spark jobs that analyze and transform the data as you need.

The Spark Cassandra Connector lets you connect your Spark cluster to your Cassandra cluster. Here is some documentation to get you started:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md

I hope this helps!
