Bringing together the Apache Cassandra experts from the community and DataStax.


question

376752150_179413 asked ·

What is the real problem described in "Problems using a high-cardinality column index"?

The doc says "In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient."

Why?

cassandra index

maedhroz answered ·

If you have a table with a billion songs and every writer is unique, searching by writer will return one row from the entire cluster. That means the range read apparatus that coordinates the index query may have to contact ceil(N/RF) nodes (where N = the number of nodes and RF = replication factor) to find it. This may not ultimately be a problem, and it's what most distributed information retrieval systems do (Elasticsearch, SolrCloud, etc.), but in a large cluster it may not be as efficient as simply using a materialized view keyed on the writer.

(See this blog for a deeper explanation of the range read algorithm.)
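
To put rough numbers on the ceil(N/RF) figure above, here is a minimal Python sketch. The function name `nodes_to_contact` and the cluster sizes are made up for illustration, and the result is only the bound quoted in this answer, not a model of how the coordinator actually schedules range reads.

```python
import math


def nodes_to_contact(num_nodes: int, replication_factor: int) -> int:
    """Hypothetical helper: the ceil(N / RF) bound mentioned in the answer."""
    if num_nodes < 1 or replication_factor < 1:
        raise ValueError("num_nodes and replication_factor must be >= 1")
    # Assumption: an RF larger than the cluster behaves like RF == N here.
    effective_rf = min(replication_factor, num_nodes)
    return math.ceil(num_nodes / effective_rf)


# Illustrative cluster sizes, all with RF = 3.
for n in (6, 30, 100):
    print(f"N={n:>3}, RF=3 -> up to {nodes_to_contact(n, 3)} nodes for one row")
```

With RF fixed, the bound grows linearly with the cluster size, while a read by partition key only needs the replicas that own that key.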



Lewisr650 answered ·

The core technical problem Cassandra solves is storing data efficiently on disk so that a single I/O operation can retrieve up to 2 billion records. As data is written in memory, it is sorted and ordered based on the primary key definition. When the data is then written to disk, it is written in contiguous blocks to support that efficient retrieval pattern. The use case you describe, with high-cardinality data, means there is little opportunity to store collections of records together in a partition. As a result, each I/O operation retrieves only a small number of records, in this scenario likely a single record. If you retrieve a batch of records from data modeled like this, the storage subsystem has to perform I/O for each record. That is why the doc calls it an "inefficient" storage pattern in comparison to how the data might otherwise be stored. This is why data modeling is so critically important to Cassandra.
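
As a toy illustration of the point above, the Python sketch below counts how many distinct partitions hold a batch of songs under two choices of partition key. The data set, the field names, and the helper `partitions_touched` are all hypothetical, and the count is only a stand-in for I/O: it ignores caching, compaction, and how many SSTables a partition spans.

```python
def partitions_touched(rows, partition_key, wanted_ids):
    """Count the distinct partitions that hold the wanted rows."""
    return len({row[partition_key] for row in rows if row["id"] in wanted_ids})


# Hypothetical data: 1000 songs, 10 artists, and a unique writer per song.
songs = [
    {"id": i, "artist": f"artist-{i % 10}", "writer": f"writer-{i}"}
    for i in range(1000)
]

# The batch we want to read: every song by one artist (100 songs).
wanted = {s["id"] for s in songs if s["artist"] == "artist-0"}

by_artist = partitions_touched(songs, "artist", wanted)
by_writer = partitions_touched(songs, "writer", wanted)

print(f"partitioned by artist: {by_artist} partition(s) for {len(wanted)} songs")
print(f"partitioned by writer: {by_writer} partition(s) for {len(wanted)} songs")
```

In this model the same 100 songs sit in one partition when grouped by artist and in 100 separate partitions when grouped by writer.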
