Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

satvantsingh_190085 avatar image
satvantsingh_190085 asked ·

How do bloom filters affect read performance?

[FOLLOW UP QUESTION TO #5219]

How bloom filter affect the query performance according to probability ? If it gives low probability for any partition in SStable (what will be the read path ) and when it gives high probability(then what will be the read path).

bloom filter
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered ·

C* read path

When Cassandra gets a read request, it goes through several stages in order to determine where the data is stored. At a high level, the path it takes to satisfy a request is:

  1. check the memtable
  2. check the row cache if it is enabled
  3. check the bloom filter
  4. check the partition key cache if it is enabled
  5. check the partition summary if the partition key is not found in the partition key cache
  6. locate data using (a) the SSTable offset if the partition key was found in the cache, or (b) the SSTable offset from the partition index if the partition summary was checked
  7. fetch the data from the SSTable
  8. merge the memtable and SSTable results and send it back to the client

Background

Bloom filters are probabilistic data structures used to check whether an item likely exists in a set or definitely does not. Its algorithm is extremely fast at arriving at an answer but at the cost of returning false positives. A false positive is an incorrect result where it returned a "partition key exists in SSTable X" when it doesn't.

Cassandra checks the bloom filters to determine whether:

  • the data definitely does not exist in a given SSTable, or
  • the data probably exists in a given SSTable.

This improves the performance of reads since C* avoids having to check every single SSTable for the existence of the requested partition -- disk access is expensive so it must be avoided where possible. If a bloom filter check indicates that a partition is not in an SSTable, C* can skip reading that SSTable narrowing down the number of SSTables to read.

Accuracy

Bloom filters are stored in memory (off-heap) and as mentioned previously, it can return false-positives. Reducing the chance of false positives requires more memory so it's a trade-off. Less memory means a higher chance of false positives.

The bloom_filter_fp_chance option for each table has a possible range of 0.0 to 1.0 (where 1.0 is disabled). Usual values are between 0.01 (higher accuracy so uses more memory) and 0.1 (higher chance of false positives). Values higher than 10% (0.1) have diminishing returns -- occupies more memory for not much more accuracy.

Further info

For more details, see the Bloom Filters page on the Apache Cassandra website.

There is also additional info in the DataStax Docs site on How data is read in C* and Tuning Bloom filters. Cheers!

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.