laxmikant.hcl_32751 asked ·

What do the column counts in the sstablemetadata represent?

I was trying to find the wide partitions in a table using the SSTable metadata. The sstablemetadata output shows the following. I can see that one partition is 442 MB, but the most confusing part is the column count. Why are the column values so high, and what do they actually represent? Please find the attached file with the sstablemetadata output.

sstablemetadataOutput.txt


1 Answer

Erick Ramirez answered ·

The column counts section in the sstablemetadata output is a histogram that shows the distribution of column counts.

Definition

Quoting from Wikipedia:

A histogram is an approximate representation of the distribution of numerical data. ... To construct a histogram, the first step is to "bin" (or "bucket") the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but not required to be) of equal size.[2]

... A histogram may also be normalized to display "relative" frequencies. It then shows the proportion of cases that fall into each of several categories, with the sum of the heights equaling 1.

Understanding the distribution of data

The values in the Columns column are the buckets. Each value is the upper bound of the column counts that fall within that bucket.

The Count column is the number of rows which belong to each bucket.

To use your output as an example:

Column Count:
  Columns  | Count (%)  Histogram
  1        |    4 ( 0)
  17       | 2041 ( 97) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
  35       |    2 ( 0)
  545791   |    1 ( 0)
  654949   |    2 ( 0)
  ...
  2816159  |    1 ( 0)
  3379391  |   28 ( 1)
  4055269  |    1 ( 0)
  4866323  |   10 ( 0)
  12108970 |    1 ( 0)
  36157190 |    1 ( 0)
  • the first bucket has an implied range of 0 to 1
  • there are 4 rows in the SSTable which have up to 1 column
  • the second bucket indicates there are 2041 rows which have between 2 and 17 columns
  • 97% of the rows have between 2 and 17 columns
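
As a rough sketch of how the buckets can be read, the Python snippet below parses the first three histogram lines from the output and turns each upper bound into a range. The names `parse_histogram` and `bucket_ranges` are made up for this illustration and are not part of any Cassandra tool. It assumes each range starts one above the previous bucket shown, which is the same reading used in the bullets.

```python
import re

# First three histogram lines copied from the sstablemetadata output above.
SAMPLE = """\
  1        |    4 ( 0)
  17       | 2041 ( 97) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
  35       |    2 ( 0)
"""

# Matches "<upper bound> | <count> (<percent>)" and ignores the trailing bar.
LINE_RE = re.compile(r"^\s*(\d+)\s*\|\s*(\d+)\s*\(\s*(\d+)\)")


def parse_histogram(text):
    """Return (upper_bound, count, percent) tuples, one per histogram line."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(tuple(int(group) for group in match.groups()))
    return rows


def bucket_ranges(rows):
    """Pair each bucket with an implied (low, high, count) range.

    Assumption: a range starts one above the previous bucket shown,
    and the first bucket starts at 0.
    """
    ranges = []
    low = 0
    for upper, count, _percent in rows:
        ranges.append((low, upper, count))
        low = upper + 1
    return ranges


for low, high, count in bucket_ranges(parse_histogram(SAMPLE)):
    print(f"{count} rows have {low} to {high} columns")
```

Only the upper bounds come from the tool itself; the lower bounds are inferred from the lines that happen to be printed.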

The last bullet point above is significant. It means that most of the rows in the SSTable have between 2 and 17 columns. This is reinforced by the percentiles:

  Percentiles
  50th 17
  75th 17
  95th 17
  98th 2816159
  99th 3379391
  Min 0
  Max 36157190

This tells us:

  • 95th percentile - 95% of rows have up to 17 columns
  • 98th percentile - 98% of rows have up to 2,816,159 columns
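
To illustrate how a percentile can be read off a histogram, here is a minimal Python sketch that walks the buckets and returns the first one whose running count reaches the requested percentile. The function name `percentile_bucket` and the `TOY` numbers are hypothetical. The toy data is deliberately not the poster's data, because part of the real output is elided, and this is not necessarily the exact calculation sstablemetadata performs.

```python
def percentile_bucket(buckets, percentile):
    """Return the upper bound of the bucket that contains the percentile.

    buckets: list of (upper_bound, count) pairs sorted by upper_bound.
    percentile: a number between 0 and 100.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    threshold = total * percentile / 100
    running = 0
    for upper, count in buckets:
        running += count
        if running >= threshold:
            return upper
    return buckets[-1][0]


# Toy histogram with made-up numbers: 96 narrow rows and 4 wide ones.
TOY = [(1, 1), (17, 95), (1000, 3), (50000, 1)]

for p in (50, 95, 98, 100):
    print(f"{p}th percentile bucket: {percentile_bucket(TOY, p)}")
```

In the toy data the 50th and 95th percentiles both land in the 17 bucket, while the 98th jumps to a much wider bucket, which is the same pattern as in the percentiles above.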

Conclusion

The SSTable mostly contains rows which have 17 columns or fewer (97%).

However, the remaining 3% of the rows are extremely wide and can be problematic. For example:

  • 1 row has up to 2,816,159 columns
  • 28 rows have up to 3,379,391 (3M) columns
  • 11 rows have between 4M and 4.9M columns
  • 1 row has up to 12M columns
  • the largest row has up to 36M columns
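
If you want to pick out the wide buckets programmatically, a minimal Python sketch could look like the one below. The helper `wide_buckets` and the cut-off of 100,000 are assumptions chosen for illustration only. The list contains just the buckets visible in the output, so anything hidden behind the "..." is not counted.

```python
# Buckets visible in the output above, as (upper_bound, count) pairs.
# The lines elided with "..." in the original output are not included.
SHOWN = [
    (1, 4), (17, 2041), (35, 2), (545791, 1), (654949, 2),
    (2816159, 1), (3379391, 28), (4055269, 1), (4866323, 10),
    (12108970, 1), (36157190, 1),
]

# Hypothetical cut-off for "wide"; pick a value that suits your data model.
THRESHOLD = 100_000


def wide_buckets(buckets, threshold):
    """Return the buckets whose upper bound exceeds the threshold."""
    return [(upper, count) for upper, count in buckets if upper > threshold]


wide = wide_buckets(SHOWN, THRESHOLD)
for upper, count in wide:
    print(f"{count} row(s) with up to {upper:,} columns")
print("wide rows in the visible buckets:", sum(count for _, count in wide))
```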

Cheers!

2 comments

I am getting confused with the terminology used here.

Below is my understanding of Cassandra terminology assuming there is no static column in a table:

1. Number of columns (N_c): total number of columns = primary key columns (N_{pk}) + non-primary-key columns.

In my opinion, the number of columns almost always remains fixed until the table is altered to add or drop columns.

2. Number of rows (N_r): a row is identified by the primary key (row key) of the table, so a row should always have a fixed number of columns until the table is altered.

3. Number of cells (N_v): a cell is the key-value pair of a regular column in a row, so the maximum number of cells in a row is N_c - N_{pk}.

4. Partition: a partition is identified by the partition key. Within a partition there can be many rows and hence many cells, so the total number of cells in a partition is N_r * N_v.


When you say the largest row has 36M columns, did you mean that the largest partition has 36M cells? Are you referring to a partition as a row, and a cell as a column?



Correct, the largest partition has 36M cells. The terminology used in the sstablemetadata output has its roots in the way the data is laid out on disk, so I understand the confusion.
