I have a table with 4 million unique partition keys
```
cqlsh> select count(*) from "KS".table;

 count
---------
 4355748
```
Based on DataStax's guidance, there are two goals to consider when you create a model: make the data spread evenly around the cluster, and minimize the number of partitions read. To keep it simple, consider the cardinality of your partition key. It should be neither too high nor too low, which means: don't make the partition key too unique. When choosing the partition key, you can use these rules of thumb:
- One partition key should hold < 100k rows. But remember, you shouldn't make it unique; so, in my opinion, a total of around 10k-100k rows per partition is good.
- One partition key should hold < 100 MB of data. To estimate this, you can add up the sizes of the data types used in your model.
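The size estimate from the second rule can be sketched in a few lines of Python. The per-type byte counts below are illustrative fixed costs per column value (my assumptions; real on-disk size also depends on clustering overhead, per-cell metadata, and compression):

```python
# Rough partition-size estimate from CQL data-type sizes.
# TYPE_SIZES values are illustrative assumptions, not exact on-disk costs.
TYPE_SIZES = {
    "bigint": 8,
    "int": 4,
    "uuid": 16,
    "timestamp": 8,
    "text": 50,  # assumed average string length; adjust for your data
}

def estimate_partition_bytes(column_types, rows_per_partition):
    """Bytes per partition ~= (sum of column-value sizes) * rows in it."""
    row_size = sum(TYPE_SIZES[t] for t in column_types)
    return row_size * rows_per_partition

# e.g. a row of (uuid, timestamp, text, bigint) with 100k rows per partition:
size = estimate_partition_bytes(["uuid", "timestamp", "text", "bigint"], 100_000)
print(size / 1024 / 1024)  # ~7.8 MB, well under the 100 MB guideline
```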
So, this is what happens when Cassandra reads data (based on DataStax):
1. Checks the memtable
2. Checks the row cache, if enabled
3. Checks Bloom filter
4. Checks partition key cache, if enabled
5. Goes directly to the compression offset map if a partition key is found in the partition key cache, or checks the partition summary if not
6. If the partition summary is checked, then the partition index is accessed
7. Locates the data on disk using the compression offset map
8. Fetches the data from the SSTable on disk
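The steps above can be sketched as a chain of lookups that either short-circuit or fall through. This is an illustrative Python model, not Cassandra's actual implementation; every structure here (dicts, lambdas, names) is a stand-in I made up:

```python
# Illustrative model of the read path above -- not Cassandra's real code.

def read_partition(key, memtable, row_cache, bloom, key_cache,
                   partition_summary, partition_index, offset_map, sstable):
    if key in memtable:                              # 1. memtable (in memory)
        return memtable[key]
    if row_cache is not None and key in row_cache:   # 2. row cache, if enabled
        return row_cache[key]
    if not bloom(key):                               # 3. Bloom filter: "definitely absent"
        return None
    if key_cache is not None and key in key_cache:   # 4. partition key cache
        index_pos = key_cache[key]                   # 5. straight to the offset map
    else:
        start = partition_summary(key)               # 5. partition summary (in memory)
        index_pos = partition_index(key, start)      # 6. partition index (on disk!)
    disk_offset = offset_map[index_pos]              # 7. compression offset map
    return sstable[disk_offset]                      # 8. fetch from the SSTable

# Tiny usage example: a cold read that has to walk summary -> index -> disk.
data = read_partition(
    key="k1",
    memtable={}, row_cache=None,
    bloom=lambda k: True,
    key_cache=None,
    partition_summary=lambda k: 0,
    partition_index=lambda k, start: 0,
    offset_map=[0],
    sstable=["row-for-k1"],
)
print(data)  # row-for-k1
```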
What happens if we have too many partition keys (unique partition keys), like in this model:
With this model, Cassandra hits a bottleneck when accessing the partition index. As we know, anything that happens on disk is much slower than in memory. Well, you can guess what happens if we have too many partition keys: Cassandra iterates over the partition index, which causes high server load. I've experienced having > 50 million partition keys in total. Can you imagine that for each read request (select statement) to this table, Cassandra needs to iterate 50 million times to find the index?
That's why the cardinality of the partition key has to be balanced, neither too high nor too low. If it's too high, the bottleneck is at step 6; if it's too low, the bottleneck is at step 8.
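One common way to rebalance cardinality is a composite partition key with a synthetic bucket, so many unique ids share a bounded number of partitions. Here is a sketch; the `events` table, `user_id`, `event_id`, and `NUM_BUCKETS` are all my hypothetical names, not from the model above:

```python
# Hypothetical bucketing sketch: hash a unique row id into a fixed number
# of buckets so the partition key is no longer fully unique.
import hashlib

NUM_BUCKETS = 16  # tune so each partition stays around the 10k-100k row range

def bucket_for(event_id: str) -> int:
    """Deterministic bucket derived from the row id, so both writers and
    readers can recompute it without any extra lookup."""
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

# The partition key then becomes a composite ((user_id, bucket)) rather than
# a fully unique key, e.g. in CQL:
#   CREATE TABLE events (
#       user_id text, bucket int, event_id text, payload text,
#       PRIMARY KEY ((user_id, bucket), event_id)
#   );
print(bucket_for("event-42"))  # a value in 0..15
```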
Will changing data partitioning help with the load?