376752150_179413 asked Erick Ramirez edited

What is the real problem behind "Problems using a high-cardinality column index"?

The doc says "In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient."

Why?

secondary index

maedhroz answered

If you have a table with a billion songs and every writer is unique, searching by writer will return one row from the entire cluster. To find that row, the range read apparatus that coordinates the index query may have to contact ceil(N/RF) nodes (where N = the number of nodes and RF = the replication factor). This may not ultimately be a problem, and it's what most distributed information retrieval systems do (ElasticSearch, SolrCloud, etc.), but in a large cluster it may not be as efficient as simply using a materialized view keyed on the writer.

(See this blog for a deeper explanation of the range read algorithm.)
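
To make the ceil(N/RF) figure concrete, here is a minimal Python sketch of that arithmetic. The function name `nodes_to_contact` is made up for illustration, and the model is deliberately simplified: it assumes evenly distributed replicas and ignores consistency level, vnodes, and any early termination of the range read.

```python
import math


def nodes_to_contact(num_nodes: int, replication_factor: int) -> int:
    """Rough count of nodes a cluster-wide index query may have to contact.

    Hypothetical helper: applies ceil(N / RF) from the answer above and
    nothing more. It is not part of any Cassandra API.
    """
    if num_nodes < 1 or replication_factor < 1:
        raise ValueError("num_nodes and replication_factor must be >= 1")
    # RF cannot usefully exceed the number of nodes in this simple model.
    effective_rf = min(replication_factor, num_nodes)
    return math.ceil(num_nodes / effective_rf)


small_cluster = nodes_to_contact(6, 3)
large_cluster = nodes_to_contact(100, 3)
```

Under these assumptions, the cost of finding a single row grows with the cluster size, whereas a lookup by partition key only involves the replicas that own that key.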



Lewisr650 answered

The core technical problem Cassandra solves is storing data efficiently on disk so that a single read can retrieve an entire partition, which can hold up to 2 billion cells. As data is written in memory, it is sorted according to the primary key definition. When it is then written to disk, it is laid out in contiguous blocks to support that efficient retrieval pattern. With high-cardinality data, as in the use case you describe, there is little opportunity to store collections of records together in one partition. Each IO therefore retrieves only a small number of records, in this scenario likely just one. If you retrieve many records from data modeled like this, the storage subsystem has to perform IO for each record. That is why the doc calls it "inefficient" in comparison to how the data might otherwise be stored, and it is why data modeling is so critically important in Cassandra.
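
As a rough illustration of the point above, the toy Python model below counts how many distinct groups (standing in for partitions) must be read to fetch every song, depending on which column the data is grouped by. The data, the column names, and the helper `count_partition_reads` are all invented for this sketch; it says nothing about Cassandra's actual storage engine beyond the grouping idea.

```python
def count_partition_reads(rows, partition_key):
    """Number of distinct groups that must be read to fetch every row.

    Toy model: one group per distinct value of `partition_key`.
    """
    return len({row[partition_key] for row in rows})


# Hypothetical data: 1000 songs, 10 recording artists, a unique writer per song.
songs = [
    {"title": f"song-{i}", "artist": f"artist-{i % 10}", "writer": f"writer-{i}"}
    for i in range(1000)
]

reads_by_artist = count_partition_reads(songs, "artist")  # low cardinality
reads_by_writer = count_partition_reads(songs, "writer")  # high cardinality
```

In this model, grouping by artist needs 10 reads of 100 songs each, while grouping by writer needs 1000 reads of one song each.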
