Radhika asked

How do we deal with high disk utilisation?

@Erick Ramirez We have a 12-node Cassandra cluster in production. Recently, almost all of the nodes have been using more than 85% of their disk space. We tried setting default_time_to_live and gc_grace_seconds on a few tables, but there seems to be no effect on the record count or the disk space. There are suggestions to run nodetool compact and nodetool cleanup, but the same sources say that running them on a production environment is not recommended.

Some specific questions:

  1. We tried setting the TTL to 100 days and gc_grace_seconds to 3 hours. Our expectation was that records older than 100 days would be deleted after 3 hours, but they are still intact. Is there anything else we need to do to delete records older than 100 days using TTL settings? We also expected disk space to be freed up. What else should be done to free up disk space after records are deleted?
       ALTER TABLE my_keyspace.my_item WITH default_time_to_live = 8640000;
       ALTER TABLE my_keyspace.my_item WITH gc_grace_seconds = 10800;
  2. Is it OK to run nodetool compact followed by nodetool cleanup on a prod environment where every instance is at over 85% disk utilisation?
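As a quick sanity check on the two values in the statements above, the conversion from days and hours to seconds can be sketched in Python. The helper names are illustrative only; the inputs are the 100 days and 3 hours mentioned in the question.

```python
from datetime import timedelta

def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to a default_time_to_live value."""
    return int(timedelta(days=days).total_seconds())

def grace_seconds(hours: int) -> int:
    """Convert a grace period in hours to a gc_grace_seconds value."""
    return int(timedelta(hours=hours).total_seconds())

print(ttl_seconds(100))   # 8640000
print(grace_seconds(3))   # 10800
```

Both results match the values used in the ALTER TABLE statements, so the settings themselves are as intended.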

Please share other suggestions as well to free up disc space utilized by Cassandra.

|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                                Rack
UN  10.1.x.x   997.26 GiB  256     24.7%             erff8abf-16a1-4a72-b63e-5c4rg2c8d003   rack1
UN  10.2.x.x   1.22 TiB    256     26.1%             a8auuj76-f635-450f-a2fd-7sdfg0ss713e   rack1
UN  10.3.x.x   1.21 TiB    256     25.4%             8ebas25c-4c0b-4be9-81e3-013fasdas255   rack1
UN  10.4.x.x   1.27 TiB    256     25.1%             wwwdba15-16f3-41a8-b3d1-2d2b6e35715d   rack1
UN  10.5.x.x   975.67 GiB  256     24.7%             72ed4df7-fb65-4332-b8ac-e7461699f633   rack1
UN  10.6.x.x   1.01 TiB    256     24.8%             39803f58-127f-453b-b102-ed7bdfb8afb2   rack1
UN  10.7.x.x   1.18 TiB    256     25.9%             b6e692a6-249f-433d-8b54-1d20d4bc4962   rack1
UN  10.8.x.x   1.12 TiB    256     24.5%             8ed8c306-9ac9-4130-bff1-97f7d5d9a02f   rack1
UN  10.9.x.x   973.26 GiB  256     24.4%             f7489923-3cc3-43ec-83ca-42bbdeb0cbb7   rack1
UN  10.10.x.x  1.13 TiB    256     26.0%             ea694224-ds0b-42f5-9acf-ff4ddfb450e0   rack1
UN  10.11.x.x  1.22 TiB    256     24.0%             ddde4bce-553e-4246-9920-47sdfdf324ed   rack1
UN  10.12.x.x  1.28 TiB    256     24.4%             0222d40f-edb8-4710-9bae-39dsfd87e18db  rack1

We have a 3-node cluster in our test environment, and running nodetool compact (on one keyspace only) reduced the disk usage and data load as shown below. However, we held back from running the same in production because one node's disk usage spiked from 73% to 99% during the compaction.

Used Disk Space   Before compact   After compact
Cassandra 01      73%              60%             (spiked up to 99% during compaction)
Cassandra 02      58%              46%
Cassandra 03      61%              43%

Data Load         Before compact   After compact
Cassandra 01      114.8 GiB        98.08 GiB
Cassandra 02      152.77 GiB       88.57 GiB
Cassandra 03      132.93 GiB       89.33 GiB

@Erick Ramirez I have reposted the question here for your direct advice and suggestions.

1) Please advise whether simply adding an extra node would help, or whether increasing the disk space of the existing instances would work as well. We already have 12 nodes and do not want to end up with even more nodes to manage!
2) We are also looking for a long-term solution to keep the space used by our Cassandra instances under control. I understand that adding extra nodes is suggested because the state of the cluster is already beyond control, but how do I make sure I don't end up in the same situation again?
3) I also want to ensure that records older than 100 days are cleaned up automatically. If a TTL set today only starts taking effect after 100 days, how can I clear out the records that are already older than that? Will deleting the records manually and then setting a lower gc_grace_seconds (say a minimum of 3 hours) ensure that both the records and the tombstones are removed?

Please suggest if I need to create any support ticket to get immediate attention on the issue.

Radhika commented

@Erick Ramirez Please help with your response.

Can someone accept this question for answering?

Erick Ramirez answered

This is a follow-up to an earlier question, and I'm re-posting my answer here for context.

  1. Setting a default TTL on a table will only apply to newly inserted data. If you recall, SSTables are immutable in Cassandra -- they don't get updated/modified once they've been written to disk. This means that any existing data in the SSTables will not have the new TTL applied to them so it won't free up any disk space.
  2. Forcing a major compaction will not make a difference because of (1) -- the existing data in the SSTables will not expire. The default TTL will only apply to new mutations/writes (inserts/updates). For the same reason, running nodetool cleanup won't make a difference either since there's nothing to clean up. In any case, major compactions are a bad idea in C* as I've explained in question #6396.

So how do you deal with low disk space on existing nodes? You need to increase the capacity of your cluster by adding more nodes. As you add nodes one by one, you can run nodetool cleanup on the existing nodes to immediately free up space.

I've done some rough calculations based on the average node density of 1153GB across all 12 nodes. If you add 1 node, it will free up ~89GB per node on average. If you add 2 nodes, it should free up ~165GB per node on average. 3 nodes is about a 231GB drop and 4 nodes about 288GB. Cheers!
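The rough calculation above can be reproduced with a few lines of Python. This is only a sketch: it assumes data is spread evenly across nodes and that nodetool cleanup has been run on the existing nodes after each addition, and the function name is made up for illustration.

```python
def freed_per_node(avg_density_gb: float, current_nodes: int, added_nodes: int) -> float:
    """Average space freed on each existing node after the cluster grows.

    Assumes an even data distribution before and after the new nodes join.
    """
    total_gb = avg_density_gb * current_nodes
    new_density_gb = total_gb / (current_nodes + added_nodes)
    return avg_density_gb - new_density_gb

# 12 nodes at an average density of 1153GB, adding 1 to 4 nodes.
for added in range(1, 5):
    print(added, round(freed_per_node(1153, 12, added)))
# 1 89
# 2 165
# 3 231
# 4 288
```

Each extra node frees less space than the one before it, because the same total is being divided among more and more nodes.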

Now let me respond to your follow-up questions.

You mentioned: "However, we held back from running the same in production because one node's disk usage spiked from 73% to 99% during the compaction."

This is expected. Forcing a major compaction requires that all SSTables are read, loaded into memory and serialised on heap so they can all get compacted together. The merged SSTable is also written out in full before the old ones are deleted, so disk usage climbs temporarily while both copies exist.

That's why it's called a major compaction. It requires a lot of IO and has the potential to slow your app down completely, which is why it isn't recommended.
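One rough way to reason about the spike is to compare the free space on a node with the temporary space a compaction may need. The sketch below assumes, as a worst case, that the old and new SSTables coexist until the merge finishes. The function name is illustrative, and the 26-point figure is simply the 73% to 99% rise reported in the question, not a value read from Cassandra.

```python
def has_compaction_headroom(used_pct: float, temp_pct: float) -> bool:
    """Return True if the free disk space can hold the temporary copy.

    used_pct: share of the disk currently in use (0-100).
    temp_pct: extra share of the disk the compaction needs while it runs.
    """
    free_pct = 100.0 - used_pct
    return free_pct >= temp_pct

# Test node from the question: 73% used, peaked at 99%, so about 26 points extra.
print(has_compaction_headroom(73, 26))  # True, with only 1 point to spare
# A production node at 85% could not absorb the same temporary footprint.
print(has_compaction_headroom(85, 26))  # False
```

The real temporary footprint depends on how much data the compaction rewrites, so treat the percentages as placeholders rather than a prediction.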

The only long-term solution is to add nodes. As soon as the disk utilisation on the nodes goes above 500GB, you need to start provisioning new servers so they are ready to be deployed and added to the cluster. As soon as you get close to 1TB, you need to add nodes.
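The two thresholds above could be turned into a simple check that runs against the load reported by nodetool status. This is a minimal sketch: the function names and return strings are made up, and treating "close to 1TB" as exactly 1000GB is an assumption.

```python
GIB_TO_GB = 1.073741824  # 2**30 bytes expressed in units of 10**9 bytes

def gib_to_gb(gib: float) -> float:
    """Convert the GiB figures printed by nodetool status to GB."""
    return gib * GIB_TO_GB

def capacity_action(load_gb: float,
                    provision_at_gb: float = 500.0,
                    expand_at_gb: float = 1000.0) -> str:
    """Map a node's data load to the action suggested in the answer."""
    if load_gb >= expand_at_gb:       # assumed cut-off for "close to 1TB"
        return "add nodes now"
    if load_gb >= provision_at_gb:    # 500GB threshold from the answer
        return "provision new servers"
    return "ok"

print(capacity_action(400))                # ok
print(capacity_action(650))                # provision new servers
print(capacity_action(gib_to_gb(997.26)))  # add nodes now
```

By this measure, even the least loaded nodes in the status output from the question are already past the point where new nodes should have been added.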

Running Cassandra is completely different from running a traditional RDBMS like Oracle. Trust me -- I used to be an Oracle architect for years. :) As soon as you hit capacity issues, Oracle tells you to scale your servers vertically by adding more RAM/CPU/disks. The opposite applies to Cassandra -- you scale horizontally by adding more nodes.

As for the old data, the only way to get rid of it is to issue a DELETE. You will need to write an ETL job, preferably with Spark, to scan through the tables efficiently and then delete whole partitions (not rows within a partition).
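The selection logic of such a job can be sketched in plain Python. This is a stand-in for illustration, not a Spark job and not driver code: it assumes you already have the newest write time for each partition, and item_id is a hypothetical partition key column, since the schema of my_keyspace.my_item is not shown in this thread.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

# Hypothetical partition key column; bind the key as a parameter when executing.
DELETE_CQL = "DELETE FROM my_keyspace.my_item WHERE item_id = ?"

def expired_partitions(newest_write: Dict[str, datetime],
                       now: datetime,
                       max_age_days: int = 100) -> List[str]:
    """Return keys of partitions whose newest row is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(key for key, written in newest_write.items() if written < cutoff)

def delete_requests(keys: List[str]) -> List[Tuple[str, Tuple[str]]]:
    """Pair the whole-partition DELETE with each key as a bound parameter."""
    return [(DELETE_CQL, (key,)) for key in keys]

now = datetime(2021, 1, 1, tzinfo=timezone.utc)
sample = {
    "item-001": now - timedelta(days=150),  # entirely older than 100 days
    "item-002": now - timedelta(days=10),   # still has recent data, keep it
    "item-003": now - timedelta(days=101),  # just past the cut-off
}
for cql, params in delete_requests(expired_partitions(sample, now)):
    print(cql, params)
```

Selecting on the newest write in each partition is what makes a whole-partition delete safe: a partition is only removed when nothing in it is still within the retention period.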

Finally if you have a valid Support subscription then by all means, please log a ticket with DataStax Support so one of our engineers can assist you directly. Cheers!

Radhika commented

Thank you for your response.

We started by adding a new node with CASSANDRA_SEEDS=cassandra01.marathon.internal,cassandra02.marathon.internal,cassandra03.marathon.internal (the first 3 nodes).

However, the node is failing to start up with the below error. Please suggest ways to fix the corrupt SSTables without data loss.

smadhavan commented

@Radhika, it is hard to triage this in a Q&A format. Alternatively, you could use DataStax Luna (its website has details about the program), which offers a limited-time free-trial consultation and 24/7 support for open-source Apache Cassandra® with up to 15 tickets per year. You can register for it on the website/portal.

Radhika commented

@Erick Ramirez I added a new node and set the first 3 nodes as CASSANDRA_SEEDS (as owns was ~25%). Bootstrapping is still in progress on the new node, and its load appears to be higher than that of the other nodes. The host ran out of disk space even though it was given twice the disk space of the other nodes. Please confirm whether I am proceeding in the right direction, or whether there is anything wrong with the selection of Cassandra seeds.

--  Address  Load         Tokens  Owns (effective)
UN  10.x     890.4 GiB    256     26.1%
UN  10.x     913.3 GiB    256     25.4%
UN  10.x     897.89 GiB   256     25.1%
UN  10.x     889.85 GiB   256     24.8%
UJ  10.x     1.47 TiB     256     ?
UN  10.x     912.55 GiB   256     25.9%
UN  10.x     860.53 GiB   256     24.5%
UN  10.x     868.21 GiB   256     24.4%
UN  10.x     885.3 GiB    256     24.7%
UN  10.x     870.68 GiB   256     24.7%
UN  10.x     855.98 GiB   256     24.4%
UN  10.x     1020.32 GiB  256     26.0%
UN  10.x     905.14 GiB   256     24.0%
smadhavan answered

@Radhika, which version of Apache Cassandra® and/or DataStax Enterprise (DSE) are you running?

As explained in the Stack Overflow thread, the best first approach here is to expand (scale out) the cluster horizontally by adding nodes, so that the disk space used per node drops as the new nodes take over a share of the token ranges and the data. If you've properly sized the cluster, accounting for parameters such as (but not limited to) throughput, latency, data growth and data time-to-live, you can adjust the table properties (or set a TTL on the ingestion side) so that newly inserted data expires. To clear out existing data, you could write a one-time ad hoc program (for example with Spark), driven by your business logic, to delete it and reduce the disk usage per node.

If you have further questions or need hands-on help with this situation, please log a ticket with DataStax Support so one of our engineers can work with you directly.
