The doc says "In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient."
Why?
If you have a table with a billion songs and every writer is unique, searching by writer returns one row from the entire cluster, which means the range read apparatus that coordinates the index query may have to contact ceil(N/RF) nodes (where N is the number of nodes and RF is the replication factor). This may not ultimately be a problem, and it's what most distributed information retrieval systems do (Elasticsearch, SolrCloud, etc.), but in a large cluster it may not be as efficient as simply using a materialized view keyed on the writer.
(See this blog for a deeper explanation of the range read algorithm.)
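To get a feel for the ceil(N/RF) figure above, here is a small Python sketch of the arithmetic. The function name and the cluster sizes are made up for illustration, and the result is only the estimate from this answer, not a number measured on a real cluster.

```python
def nodes_to_contact(num_nodes: int, replication_factor: int) -> int:
    """Return ceil(N / RF), the estimate quoted in the answer above.

    Hypothetical helper for illustration only; it is not part of any
    Cassandra driver or tool.
    """
    if num_nodes <= 0 or replication_factor <= 0:
        raise ValueError("num_nodes and replication_factor must be positive")
    # Integer ceiling division, which avoids floating-point rounding.
    return -(-num_nodes // replication_factor)


# Example cluster sizes (assumed values, all with RF = 3).
for n in (6, 100, 1000):
    print(f"N={n}, RF=3 -> up to {nodes_to_contact(n, 3)} nodes contacted")
```

The point is that the fan-out grows with the size of the cluster, while the query still returns a single row.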
The core technical problem Cassandra solves is storing data efficiently on disk so that a single IO operation can retrieve up to 2 billion records. As data is written in memory, it is sorted and ordered based on the primary key definition. When the data is then written to disk, it is written in contiguous blocks to support that efficient retrieval pattern. The use case you speak of, with high-cardinality data, means there is little opportunity to store collections of records together in a partition. As a result, each IO operation retrieves only a small number of records, in this scenario likely a single record. If you retrieve a bunch of records from data modeled like this, the storage subsystem has to perform a separate IO for each record. That is why the doc calls it an "inefficient" storage pattern in comparison to how the data might otherwise be stored. This is why data modeling is so critically important to Cassandra.
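The effect of cardinality on partition size can be sketched with a toy model in Python. This is not how Cassandra's storage engine works internally; it only groups rows by a chosen key and counts the groups, treating each group as one partition that would need its own read. The data and the function name are invented for illustration.

```python
from collections import defaultdict


def group_by_partition_key(rows, key):
    """Group rows by the value of `key`, mimicking rows that share a partition.

    Hypothetical helper: a toy stand-in for partitioning, not Cassandra code.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions


# 1,000 made-up songs: 10 artists (low cardinality), a unique writer per song.
songs = [
    {"title": f"song-{i}", "artist": f"artist-{i % 10}", "writer": f"writer-{i}"}
    for i in range(1000)
]

by_artist = group_by_partition_key(songs, "artist")
by_writer = group_by_partition_key(songs, "writer")

# Fewer, larger partitions mean more rows returned per read.
print(len(by_artist), "partitions when keyed by artist")
print(len(by_writer), "partitions when keyed by writer")
```

Keyed by artist, reading all 1,000 songs touches 10 partitions of 100 rows each; keyed by writer, it touches 1,000 partitions of one row each.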