question

fwy_187020 asked:

Deleting and re-adding rows with the same partition key

I am developing a background process that deletes rows from a Cassandra 3.11.2 table with a given partition key, then immediately adds rows to the same table using the same partition key -- in effect, a refresh process for portions of the table. My current environment has only a single node, so the complexities of propagating deletes and tombstones across a multi-node cluster are not yet present. The gc_grace_seconds for the table is the default value 864000, and no time-to-live limits have been set.
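A minimal sketch of the delete-then-insert pattern described above, using Java driver 4.x (the ks.readings table and its pk, ck, value columns are hypothetical, stand-ins for the real schema):

    import java.util.Map;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;

    public class PartitionRefresher {

        // Hypothetical schema: ks.readings(pk text, ck int, value text, PRIMARY KEY (pk, ck))
        private final CqlSession session;
        private final PreparedStatement deletePartition;
        private final PreparedStatement insertRow;

        public PartitionRefresher(CqlSession session) {
            this.session = session;
            this.deletePartition = session.prepare(
                    "DELETE FROM ks.readings WHERE pk = ?");
            this.insertRow = session.prepare(
                    "INSERT INTO ks.readings (pk, ck, value) VALUES (?, ?, ?)");
        }

        // Remove every row in the partition, then write the refreshed rows back.
        public void refresh(String pk, Map<Integer, String> freshRows) {
            session.execute(deletePartition.bind(pk));
            freshRows.forEach((ck, value) ->
                    session.execute(insertRow.bind(pk, ck, value)));
        }
    }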

Immediately after running the process, it works as expected, with the correct data refreshed. However, after restarting the Cassandra node a few minutes later, queries against the table return no data at all. It seems that the rows added by the refresh have vanished. I thought at first there was simply a problem in my delete logic, but further testing ruled this out. I am using Java driver version 4.12 to issue the deletes.

In particular, I tried adding the USING TIMESTAMP clause to the prepared DELETE statement to ensure that only rows with write timestamps earlier than the current time were deleted, but this change had no effect.
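The change presumably looks something like this (same hypothetical ks.readings table and CqlSession as the sketch above; CQL write timestamps are expressed in microseconds since the epoch):

    // Assumed form of the DELETE with an explicit write timestamp.
    static void deletePartitionWithTimestamp(CqlSession session, String pk) {
        PreparedStatement deleteWithTs = session.prepare(
                "DELETE FROM ks.readings USING TIMESTAMP ? WHERE pk = ?");
        long nowMicros = System.currentTimeMillis() * 1000L;  // microseconds since the epoch
        session.execute(deleteWithTs.bind(nowMicros, pk));
    }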

I now think this behavior is a result of Cassandra tombstone and/or compaction processing, but still don't understand why it should happen. (I have a general understanding of tombstone processing, but little hands-on experience with it in live environments.) When I look at the contents of Cassandra's data directory on the file system, I can see the previous SSTables still on disk, but a newer, empty subdirectory has been added that seems to be superseding them. I have been searching the web and the docs, but haven't found any discussion of this scenario.

Can anyone explain why this is happening and how to prevent it?

tombstones

1 Answer

Erick Ramirez answered:

To answer your question, the most likely scenario here is that the DELETEs have a newer write timestamp than the INSERTs, so the data isn't returned when you read it. It is easy to confirm that this is the case if you enable tracing in cqlsh and then run a query.
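If cqlsh isn't handy, roughly the same check can be done from the Java driver by enabling tracing on the read and selecting writetime(); a rough sketch, reusing the hypothetical ks.readings table and CqlSession from the question:

    // Needs com.datastax.oss.driver.api.core.cql.{SimpleStatement, ResultSet, QueryTrace}.
    // Read the partition back with tracing enabled and print each row's write
    // timestamp; the cell with the highest timestamp is the one a read returns.
    static void inspectPartition(CqlSession session, String pk) {
        SimpleStatement read = SimpleStatement.newInstance(
                "SELECT ck, value, writetime(value) AS value_ts FROM ks.readings WHERE pk = ?", pk)
                .setTracing(true);

        ResultSet rs = session.execute(read);
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();  // blocking fetch of the trace
        rs.forEach(row -> System.out.println(
                row.getInt("ck") + " writetime=" + row.getLong("value_ts")));
        trace.getEvents().forEach(event -> System.out.println(event.getActivity()));
    }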

However, this isn't the problem you should focus on.

The answer to the question you didn't ask is -- doing a DELETE-before-INSERT is unnecessary and expensive. You should just update the data with an INSERT.
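For illustration (again using the hypothetical ks.readings table from the question), the second INSERT below simply replaces the first on read, with no DELETE in between:

    // Upsert: re-running the INSERT overwrites the previous version of the row.
    static void upsertExample(CqlSession session) {
        PreparedStatement insert = session.prepare(
                "INSERT INTO ks.readings (pk, ck, value) VALUES (?, ?, ?)");
        session.execute(insert.bind("p1", 1, "old value"));
        session.execute(insert.bind("p1", 1, "new value"));  // a read of (p1, 1) now returns "new value"
    }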

In Cassandra, only the latest version of the data is returned when you read it. By inserting a new version of a column or row, you're effectively overwriting the older version (an UPSERT). By definition, it is unnecessary to DELETE if you're updating the data anyway. Cheers!

3 comments

fwy_187020 commented:

Thanks for the quick reply. However, I had already enabled tracing for several refresh tests, and based on the 'started_at' timestamps in the sessions trace I found that the DELETEs are being processed before the INSERTs, as expected. I will run more tests, but how could the INSERTs end up with an earlier timestamp if the DELETEs are processed first in the code? Could Cassandra ever process them out of sequence when the same client process is submitting all of them?

fwy_187020 commented:

Just to close this post... I finally tracked down the problem, and it was actually caused by non-Cassandra logic in some legacy code. Cassandra is working fine.

Again, thanks for the quick reply, sorry for the distraction!

fwy_187020 commented:

Regarding your other point: my use case is not a selective UPSERT of particular rows in the table, but a wholesale replacement of a range of rows (as defined by a partition key) with new rows from the sources. The highly general nature of the data would make matching individual existing rows to new rows very difficult, if not impossible, so the code replaces a range of rows that can be matched easily. I don't see a way to change this.

I will post again if I find something interesting, and in the meantime please let me know if you have any more suggestions.
