noelle.heerink-wijnja_161326 asked Erick Ramirez commented

Why are there old SSTables on STCS when a keyspace is actively updated/written to?

We have a Cassandra ring of 21 nodes across 3 datacenters. We have 10 keyspaces, all with RF=3, running STCS.

The cluster has been running for more than 8 months. We are getting complaints that Cassandra times out when a certain application issues read requests. The read volume is not high, but the reads touch data that is not in cache.

Normally there is no load problem. We do see warnings about tombstones during queries.

We have been looking for the cause. We have implemented Reaper and it is now running without issues.

What we see is that our Cassandra data directories contain a number of SSTables that seem very old considering how actively the application writes. Some SSTables are 8 months old, for example, while there are also many newer SSTables (*big-Data.db).

We see that minor compactions are running.

Is it correct to assume we should not have such old SSTables? If so, what can we do to correct it, and what can we do to detect issues?


Erick Ramirez answered Erick Ramirez commented

@noelle.heerink-wijnja_161326 SSTables get compacted by SizeTieredCompactionStrategy (STCS) when similar-sized files are present. By default, there need to be 4 similar-sized SSTable candidates (min_threshold: 4) for a compaction to kick off. If you only have 1 or 2 large SSTables from 8 months ago, there's a good chance they won't get compacted for a while.
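
To make the size-based selection concrete, here is a minimal Python sketch of how similar-sized SSTables could be grouped into buckets. It is a simplification for illustration, not Cassandra's actual code: it assumes the default STCS options bucket_low=0.5, bucket_high=1.5 and min_threshold=4, ignores min_sstable_size, and the SSTable sizes are made up.

```python
def stcs_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size (simplified STCS)."""
    buckets = []  # each bucket is a list of sizes in MB
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # A file joins a bucket if it is within 50%..150% of the bucket average.
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mb, min_threshold=4):
    """Only buckets with at least min_threshold files are eligible."""
    return [b for b in stcs_buckets(sizes_mb) if len(b) >= min_threshold]


# Hypothetical node: four small recent SSTables and two large old ones.
sizes = [100, 110, 95, 105, 50_000, 48_000]
print(stcs_buckets(sizes))           # [[95, 100, 105, 110], [48000, 50000]]
print(compaction_candidates(sizes))  # [[95, 100, 105, 110]]
```

The two large files end up in a bucket of their own with only 2 members, so in this sketch they stay untouched until more files of a similar size show up, no matter how old they are.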

It is important to note that this is NOT an issue. STCS compaction is based on size, not age. It is an important distinction to understand so I'm pointing it out explicitly. :)

I recommend having a look at How data is maintained in Cassandra. The section on STCS in particular has a diagram that perfectly illustrates how it works.

On the issue of timeouts, in my experience it is often the result of a bad data model where tables are being used in a queue-like fashion and have a high-delete workload. What happens is that queries have to iterate over thousands of tombstones to satisfy the request.
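
As a rough illustration of why a queue-like table hurts reads, the following Python sketch models a single partition where consumed messages have been deleted and a read has to step over their tombstones before it finds the first live row. The partition layout and the function name are hypothetical; the threshold of 1000 is the default tombstone_warn_threshold from cassandra.yaml.

```python
TOMBSTONE_WARN_THRESHOLD = 1000  # default tombstone_warn_threshold in cassandra.yaml


def read_first_live(partition, limit=1):
    """Scan rows in clustering order; return (live_rows, tombstones_scanned)."""
    live, tombstones = [], 0
    for row_id, deleted in partition:
        if deleted:
            tombstones += 1  # the tombstone still has to be read and skipped
            continue
        live.append(row_id)
        if len(live) == limit:
            break
    return live, tombstones


# Hypothetical queue partition: 5000 consumed (deleted) messages, then 10 pending.
partition = [(i, True) for i in range(5000)] + [(i, False) for i in range(5000, 5010)]

live, scanned = read_first_live(partition)
print(live, scanned)                       # [5000] 5000
print(scanned > TOMBSTONE_WARN_THRESHOLD)  # True
```

Even though the query asks for a single row, it scans 5000 tombstones first, which is the pattern behind the tombstone warnings in the logs.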

For more information on tombstones and queues, see the blogpost Cassandra anti-patterns: Queues and queue-like datasets. Cheers!


noelle.heerink-wijnja_161326 commented

OK, from your answer I understand that compaction is working as it should according to the STCS specs. And since we also have SSTables for which there are no 4 similar-sized candidates, those SSTables won't be compacted, hence their tombstones do not get cleaned up, right?

And if the application using Cassandra had a good data model, or did not run queries that have to go through a lot of tombstoned data (or maybe that is the same thing), there would not be an issue. During normal load we indeed do not have issues. We only see the big issues when a node of that application goes down for maintenance and has to distribute its workload to other nodes, and those nodes then have to pick up the (historic) data of the workloads from Cassandra.

noelle.heerink-wijnja_161326 commented

I am still wondering how to proceed. This application sometimes has to go through "old" data. Maybe STCS is doing what it is supposed to do, but I am afraid we just have too many old tombstones. Maybe you can explain that as well?

What we see when we run sstabledump are records like this, marked_deleted with times much older than 10 days (the grace time),

and the sstablemetadata output of one db file

with Unix timestamps that are also old.

What is the best thing for us to do now?
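
For readers wondering why tombstones older than the grace time can still be on disk, here is a simplified Python sketch of the purge rule as it is generally described: a tombstone can only be dropped during a compaction if it is past gc_grace_seconds and no SSTable outside that compaction may still hold older data for the same partition. The function names and the single time unit (epoch seconds for everything) are assumptions made for illustration; this is not Cassandra's actual code.

```python
GC_GRACE_SECONDS = 864_000  # 10 days, the default gc_grace_seconds


def past_gc_grace(local_deletion_time, now, gc_grace_seconds=GC_GRACE_SECONDS):
    """True once the tombstone is older than the grace period."""
    return local_deletion_time + gc_grace_seconds < now


def purgeable(tombstone_ts, local_deletion_time, now, overlapping_min_ts):
    """overlapping_min_ts: minimum write timestamps of SSTables that are NOT
    part of this compaction but may contain the same partition."""
    if not past_gc_grace(local_deletion_time, now):
        return False
    # If an outside SSTable could hold data older than the tombstone,
    # the tombstone must be kept so that data stays shadowed.
    return all(tombstone_ts < min_ts for min_ts in overlapping_min_ts)


now = 1_583_775_000                      # fixed "now" so the example is repeatable
deleted_at = now - 30 * 86_400           # tombstone written 30 days ago
old_sstable_min_ts = now - 240 * 86_400  # an 8-month-old SSTable left out of the compaction

print(past_gc_grace(deleted_at, now))                                # True
print(purgeable(deleted_at, deleted_at, now, [old_sstable_min_ts]))  # False
print(purgeable(deleted_at, deleted_at, now, []))                    # True
```

Under these assumptions, a tombstone that is well past the grace time is still kept as long as an old, uncompacted SSTable overlaps with it.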

1583775067564.png (151.2 KiB)
1583775159491.png (260.2 KiB)
noelle.heerink-wijnja_161326 commented

Having read the article about compaction strategies, I see that we indeed have a read-intensive system. If getting rid of tombstones would compensate for the data model and queries of the application, would it be an idea for us to switch from STCS to LeveledCompactionStrategy (LCS), assuming that it compacts more aggressively and thus purges more tombstones?

Or should we look into tools like "nodetool garbagecollect" or "nodetool compact -s"?

Erick Ramirez commented (in reply to noelle.heerink-wijnja_161326)

@noelle.heerink-wijnja_161326 The underlying problem is the high-delete workload that generates a lot of tombstones. As a last resort, you can force a major compaction with the "split output" flag so you don't end up with 1 giant SSTable.
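
As a sketch of what that last-resort command could look like when scripted, the Python snippet below only builds the nodetool invocation. The keyspace and table names are placeholders, and `--split-output` is the long form of the `-s` flag of `nodetool compact`; check `nodetool help compact` on your version before relying on it.

```python
import shlex


def compact_command(keyspace, table, split_output=True):
    """Build the argument list for a major compaction of one table."""
    cmd = ["nodetool", "compact"]
    if split_output:
        cmd.append("--split-output")  # same as -s: avoid one giant SSTable
    cmd += [keyspace, table]
    return cmd


# "my_keyspace" and "my_table" are placeholder names.
cmd = compact_command("my_keyspace", "my_table")
print(shlex.join(cmd))  # nodetool compact --split-output my_keyspace my_table
```

On a node, the list could be passed to `subprocess.run(cmd, check=True)`. A major compaction is I/O heavy, so it is probably best run on one node at a time outside peak hours.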

For more info, see Why forcing a major compaction is not ideal. Cheers!

Cedrick Lunven answered

"We do see warnings about tombstones during queries." Great answer by Erick.

For some more information about tombstones, you can have a look here; I like the paragraph "Tombstone Drop".

Also, at the application level:

  1. Try to delete as much data as possible in one go (partition/row tombstones rather than column by column).
  2. Do not insert NULL into columns to empty them when they are not used; use UNSET instead.
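
The effect of both tips can be sketched with a toy model in Python. This is not driver code: `UNSET` here is a local stand-in for the sentinel that drivers provide (in the DataStax Python driver that is `cassandra.query.UNSET_VALUE`), and the row and column names are made up.

```python
UNSET = object()  # stand-in for the driver's "unset" sentinel


def tombstones_for_insert(bound_values):
    """A bound NULL writes a cell tombstone; an unset value writes nothing."""
    return sum(1 for value in bound_values.values() if value is None)


def tombstones_for_delete(columns, whole_row):
    """One row tombstone versus one cell tombstone per deleted column."""
    return 1 if whole_row else len(columns)


# Hypothetical row where two optional columns are not used.
with_nulls = {"id": 1, "name": "a", "email": None, "phone": None}
with_unset = {"id": 1, "name": "a", "email": UNSET, "phone": UNSET}

print(tombstones_for_insert(with_nulls))  # 2
print(tombstones_for_insert(with_unset))  # 0

columns = ["name", "email", "phone"]
print(tombstones_for_delete(columns, whole_row=False))  # 3
print(tombstones_for_delete(columns, whole_row=True))   # 1
```

In this model, unset values avoid the tombstones entirely, and deleting the whole row leaves a single marker instead of one per column.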