question

Tri avatar image
Tri asked Tri commented

How can I get a row count for all partitions so I can review the distribution of my data?

Let's say my application is working happily for a year. I wonder if the partitions of a particular table is well balanced. Basically I want to plot the histogram of the distribution of the rowcounts of ALL the partitions in the table.

Granted that this is an anti-pattern. But for the sake to technicallity, I would like to know how this could be done, and of course, let's assume I accept to pay the penalty for that one time table-scan query.

In the relational world, I would do:

SELECT myPartitionKey, COUNT(*) AS rowcount
FROM myTable 
GROUP BY myPartitionKey
ORDER BY COUNT(*)  DESC;

How can we reproduce the equivalent with Cassandra? Any solution would do, CQL or writing a program.

cassandracount
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

smadhavan avatar image
smadhavan answered Tri commented

@Tri, the way to look for information regarding partitions would be as follows,

  • Simplest way to find this information would be to leverage DataStax Bulk Loader (DSBulk) count option. See example below.

  • Use the nodetool tablehistograms (or the cfhistograms on older versions of C*) to find the partition sizes for 75, 95, 99, and 100 percentiles. If you see a big difference between these values, it's likely you have non-uniform spread of partition key values. Similar information can be obtained from the `sstablepartitions` command

dsbulk count -k test -t widerows --log.verbosity 0 --stats.mode partitions
'29' 106 51.71
'96' 99 48.29
  1. Left column: partition key value. 29 is the partition key value

  2. Middle column: number of rows using that partition key value. 106 is the number of rows

  3. Right column: the partition's percentage of rows compared to the total number of rows in the table.

The default will be 10 records and you want to adjust --stats.numPartitions value.

3 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Tri avatar image Tri commented ·

@smadhavan awesome, thanks.

Glad to know such a tool exists. In the nodetool tablehistograms you mentioned. What does the percentile related to? Stransgely enough, the documentation doesn't elaborate on how to interpret the results of `nodetool tablehistograms`

1 Like 1 ·
smadhavan avatar image smadhavan ♦ Tri commented ·

@Tri, glad to hear it helped!

Percentile values will be used in any service level measurements and here the following metrics are measured,

Table performance statistics include the following metrics:

  • read/write latency

  • partition size

  • cell count

  • number of SSTables

0 Likes 0 ·
Tri avatar image Tri smadhavan ♦ commented ·

Cool thanks. I know what percentile means :-) I didn't scroll to the right in the example mentioned in`nodetool tablehistograms`. I get it now.

In that case I prefer better the Datastax DSBulk tool because it shows the count per partition key value. So I can investigate further the partition key candidates where skew occured.

0 Likes 0 ·