question

payaln147_54505 asked · Erick Ramirez edited

Will changing data partitioning help with the load?

I have a table with 4 million unique partition keys

select count(*) from "KS".table;
count
---------
4355748
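
As a side note, counting every row like that scans all partitions; if you only need a rough idea of how many partitions a table has and how big they are, the node-local system.size_estimates table gives an approximation (the values are per token range and per node, and the keyspace/table names below simply match the ones used in this post):

-- Approximate, node-local estimates only
SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'KS' AND table_name = 'table';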

Based on DataStax guidance, there are two goals to consider when you create a model: make the data spread evenly around the cluster and minimize the number of partitions read. To put it simply, consider the cardinality of your partition key. It should be neither too high nor too low, which means you shouldn't make the partition key too unique. When choosing the partition key you can use these rules of thumb (see the sketch after them):

1. A partition should hold fewer than 100k rows. But remember you shouldn't make the key unique, so a total of around 10k-100k rows per partition is good in my opinion.

2. A partition should be smaller than 100 MB. To estimate this, you can add up the sizes of the data types used in your model.
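
As a rough sketch only (the table and column names below are hypothetical, not from the original post), one common way to keep partitions inside those limits is to add a bucket column such as a day to the partition key, so no single partition grows without bound and no key is completely unique per row:

-- Hypothetical bucketed model: each (sensor_id, day) partition stays bounded,
-- instead of one row per fully-unique key or one unbounded partition per sensor
CREATE TABLE ks.sensor_readings (
    sensor_id  text,
    day        date,        -- bucket column added to the partition key
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, day), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- A read then targets exactly one bounded partition:
SELECT * FROM ks.sensor_readings WHERE sensor_id = 'abc' AND day = '2020-06-01';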

So, this is what happens when Cassandra reads data (based on the DataStax documentation):

1. Checks the memtable

2. Checks the row cache, if enabled

3. Checks the Bloom filter

4. Checks the partition key cache, if enabled

5. Goes directly to the compression offset map if the partition key is found in the partition key cache, or checks the partition summary if not

6. If the partition summary is checked, then the partition index is accessed

7. Locates the data on disk using the compression offset map

8. Fetches the data from the SSTable on disk
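
If you want to see which of these steps an individual query actually goes through, request tracing in cqlsh is an easy way to check (the WHERE clause below is a placeholder, since the real partition key column isn't shown in this post):

TRACING ON;
SELECT * FROM "KS".table WHERE pk = 'some-key';   -- pk is a placeholder column name
TRACING OFF;

The trace output typically includes events for the Bloom filter check, key cache hits or misses, and which SSTables were read for that request.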

What happens if we have too many partition keys (i.e. unique partition keys), like in this model:

With this model, Cassandra will hit a bottleneck when accessing the partition index. As we know, work that happens on disk is much slower than work in memory. I think you can guess what happens if we have too many partition keys: Cassandra will iterate over the partition index, and that creates high server load. I've had the experience of having more than 50 million partition keys in total. Can you imagine that for each read request (select statement) to this table, Cassandra needs to iterate 50 million times to find the index?

That's why the cardinality of the partition key has to be balanced, not too high and not too low. If it is too high, the bottleneck is at step 6; if it is too low, the bottleneck is at step 8.

Will changing data partitioning help with the load?

data modeling

1 Answer

Erick Ramirez answered
I've had the experience of having more than 50 million partition keys in total. Can you imagine that for each read request Cassandra needs to iterate 50 million times to find the index?

@payaln147_54505 it doesn't iterate over the 50M keys. Cassandra "knows" which SSTable stores partition data using data structures/algorithms such as bloom filters. The partition index for that specific SSTable (not all SSTables) is used to get the position of the partition in that SSTable.

As we know, work that happens on disk is much slower than work in memory.

The partition index lookup I described above gets cached for future reads and is the cache you mentioned in your post.
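
For reference, the partition key cache is on by default, while the row cache is opt-in per table (and also needs row_cache_size_in_mb set in cassandra.yaml). A rough tuning sketch, using this post's table name as a placeholder:

ALTER TABLE "KS".table
WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};

'keys': 'ALL' is already the default; 'rows_per_partition' controls how many rows per partition the row cache keeps.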

Will changing data partitioning help with the load?

"Load" in Cassandra mostly relates to the density of data on each node. If you want to increase the throughput of your cluster, the general recommendation is to add more capacity (add more nodes) to spread the load. Cheers!
