mitya asked

Is it ok to have a node size of around 50TB?


Given 5-10 Gbit/s internal bandwidth between servers in the cluster, around 500 MB/s sequential HDD write speed thanks to RAID 10, and a replication factor (RF) of 3 to 5, is it OK to have a node size of around 50TB? (We are primarily trying to avoid having too many servers.)

We have a low incoming write load (around 10,000 1 KB writes/sec), almost no reads, and no deletes (basically populating archive storage all the time).
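For scale, the stated write load can be turned into a daily data volume. This is only a rough sketch: it assumes "1k" means 1000 bytes per write and ignores compression, per-row storage overhead and replication. The function name is made up for illustration.

```python
def daily_ingest_gb(writes_per_sec, bytes_per_write):
    """Raw (unreplicated, uncompressed) data written per day, in decimal GB."""
    seconds_per_day = 86_400
    return writes_per_sec * bytes_per_write * seconds_per_day / 1e9

# 10,000 writes/sec of about 1 KB each
print(daily_ingest_gb(10_000, 1_000))  # 864.0
```

So the cluster takes in a bit under 1TB of unique data per day before replication.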

I did some calculations and figured out that bootstrapping a new node (if primarily RAID I/O bound) will take about 24 hours to complete, which is fine I guess.
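The back-of-the-envelope estimate can be sketched roughly as follows. It is a best-case figure only: it assumes decimal units, that streaming runs at the full speed of the slower of disk and network, and that nothing else (compaction, streaming overhead, other traffic) competes for I/O. The function name is made up for illustration.

```python
def bootstrap_hours(node_size_tb, disk_mb_per_s, network_gbit_per_s):
    """Best-case hours to stream node_size_tb of data onto a new node."""
    # Convert link speed from Gbit/s to MB/s (decimal units).
    network_mb_per_s = network_gbit_per_s * 1000 / 8
    # The slower of disk and network limits the transfer.
    bottleneck_mb_per_s = min(disk_mb_per_s, network_mb_per_s)
    seconds = node_size_tb * 1_000_000 / bottleneck_mb_per_s
    return seconds / 3600

# 50TB node, 500 MB/s RAID 10 writes, 5 Gbit/s link (625 MB/s, so disk-bound)
print(round(bootstrap_hours(50, 500, 5), 1))  # 27.8
```

With these inputs the disk is the bottleneck at both 5 and 10 Gbit/s, giving roughly 28 hours, in the same ballpark as the 24 hours mentioned above.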

Thank you.


Erick Ramirez answered

Our general recommendation is a node density of 500GB to 1TB per node for optimal performance. This takes into account acceptable mean time to recovery (MTTR) and operational tasks such as backups, repairs, compactions and streaming.

In my experience, node densities beyond 1TB, and particularly above 1.5TB, pose issues which compromise the cluster's performance.

The biggest nodes I've worked on had about 3TB of data, and it took 5+ days to bootstrap new nodes or replace servers which had a hardware failure. You will find that your calculated 24 hours for 50TB isn't achievable in real life, particularly with HDDs. Even with SSDs, it will still take days to bootstrap a node that large.
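To put the 3TB figure in perspective, here is a rough sketch of the effective throughput it implies. It treats "5+ days" as exactly 5 days (so the rate is an upper bound) and uses decimal units. The linear extrapolation to 50TB is purely an illustrative assumption, not a measurement, and the function names are made up.

```python
def effective_mb_per_s(data_tb, days):
    """Average end-to-end throughput implied by an observed bootstrap."""
    return data_tb * 1_000_000 / (days * 86_400)

def extrapolated_days(target_tb, observed_tb, observed_days):
    """Naive linear scaling of an observed bootstrap time to a larger node."""
    return target_tb * observed_days / observed_tb

print(round(effective_mb_per_s(3, 5), 1))     # 6.9
print(round(extrapolated_days(50, 3, 5), 1))  # 83.3
```

An effective rate of about 7 MB/s is far below the 500 MB/s raw disk speed, which is why an estimate based on raw I/O alone is optimistic.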

It would be near impossible to run repairs regularly at that density. In a lot of cases, even 1.5TB to 2TB is problematic for repairs just because of the volume of data that needs to be compared between replicas and then synchronised. Granted, you have a special case where it's write-only since it's for archive purposes, but it will still be horrific to deal with in the event of failure.

My suggestion is that you build a cluster and load up the nodes with 2-3TB of data then run exhaustive tests which include repairs, bootstrapping and node replacements. Only then will you be able to make an informed decision. Cheers!


mitya commented

Erick, thank you for the reply. As I see it, the main problem with small nodes is that this approach maps badly onto a cluster of cheap dedicated servers. It's OK with cloud storage, where you can provision a precise number of TBs per node, but with actual hardware one needs a ton of servers to compensate for the practical node size limit, which is painful from both economic and management perspectives. This could be an even bigger problem in the future as HDD (and SSD) prices per TB continue to fall...
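The server-count concern can be made concrete with a small sketch. The inputs are illustrative assumptions: roughly 864 GB/day of raw ingest (10,000 writes/sec at about 1 KB each), RF 3, and a hypothetical one-year retention. Compression and free-space headroom are ignored, and the function name is made up.

```python
import math

def nodes_needed(daily_ingest_gb, retention_days, rf, node_density_tb):
    """Nodes required to hold all replicas at a given per-node density."""
    total_tb = daily_ingest_gb / 1000 * retention_days * rf
    return math.ceil(total_tb / node_density_tb)

# About 946TB of replicated data after one year at RF 3
print(nodes_needed(864, 365, 3, 1))   # 947 nodes at 1TB per node
print(nodes_needed(864, 365, 3, 50))  # 19 nodes at 50TB per node
```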

I wonder if there is a master plan to work towards more efficient new-node insertion, such as incremental background bootstrapping that lets a partially filled node serve user requests before it is fully synced with the cluster?

Erick Ramirez commented

I understand your point, but you choose Cassandra not because you want to save on costs but because you have a scale problem. If you could have, you would have stayed with a traditional DB.

There is no technical barrier that prevents you from having super-dense nodes but know that you will run into operational issues sooner rather than later. Cheers!

Another user answered

Perhaps have a look at Astra. If you're very concerned about price per GB, it's quite inexpensive.