question

milind.jivtode_158531 avatar image
milind.jivtode_158531 asked Erick Ramirez commented

What is the rule for the use of ALLOW FILTERING?

Is there a rule of thumb that I can use before deciding to user "allow filtering".... e.g. if record_set is < 100 records

cassandra
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

vikram.singh.chouhan_187371 avatar image
vikram.singh.chouhan_187371 answered Erick Ramirez commented

@milind.jivtode_158531: It is never a good idea to use ALLOW FILTERING in your application queries. It was designed to be used for development purposes only such that developers can interrogate the data in a non-production environment.

ALLOW FILTERING is only required when your search is either on a non-primary column or primary key columns without specifying its previous primary columns (if any) in the order defined during table creation (let me know if you need an example to understand this). And using ALLOW FILTERING doesn't always mean it will be inefficient, it actually depends on how much disk your query is going to scan in the end during runtime.

So if you have a query using allow filtering and you are sure that it will end up in scanning not more than a specific disk size or in other words record set less than a specific number (as per your performance principles) than it is ok to use it. This can be achieved by designing the appropriate partition key or primary key.
Although it has been discouraged to use allow filtering just to make sure it will never end up in bad performance scenarios if your data model design is poor or it may become ineffective future if your data insertion frequency makes your design look poor.

5 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

milind.jivtode_158531 avatar image milind.jivtode_158531 commented ·

Thanks. Yes, that's the scenario.

Query:

  • a. Yellow portion will yield around 100 records max in all the scenarios. It uses the Secondary Key and results come back in 20 ms.
  • b. Blue section will be processed sequentially using allow filtering.

select * from eligiblemedinstance
where facilitynumber = '03074' -- Partition Key
and masterpatientidentifier = '388345' -- first clustering column
--There are couple of more clustering columns not used here
and eligibledates contains '2020-03-21' --- Secondary Index
and fulfillmenteligibilitystatus = 'CANCELLED'
allow filtering;

0 Likes 0 ·
vikram.singh.chouhan_187371 avatar image vikram.singh.chouhan_187371 milind.jivtode_158531 commented ·

First of all, in Cassandra it is all about doing point search, using a secondary index is a kind of waste in Cassandra it is not recommended to use by the community itself. not even recommended by the Cassandra community itself. Actually the search engine has to a linear scan across all nodes in all DCs in the worst case. Have a look at this video in the questionnaire section around @1:18:00 time frame you will get to know.

The second thing is the allow filtering doesn't work the way two are trying to use. In my understanding, the disk scanning will happen depending on the result set of this query in your case:

"select * from eligiblemedinstance

where facilitynumber = '03074' -- Partition Key

and masterpatientidentifier = '388345' -- first clustering column"

0 Likes 0 ·
vikram.singh.chouhan_187371 avatar image vikram.singh.chouhan_187371 milind.jivtode_158531 commented ·

You can solve this problem in following possible ways: (All these are in a context of huge data set scenario having an order of >100k records in a single partition)

1. Rearrange your columns in the table if possible

2. If you can't skip the usage secondary index, then better to generate unique ids corresponding to each row in your main table and store it in a separate wide row. While querying, first query the wide row to get the ids to look for and then do multiple point search in your main table this would be much faster.

3. If your record fetches size in less than <100k I would suggest you avoid secondary indexing and use allow filtering straight away.

Note: If your search query will always look for a particular fulfillmenteligibilitystatus then I would suggest you move this up in primary key this would really help your query.

The best solution is always to test performance locally by dumping data with the available options with your use case.

0 Likes 0 ·
milind.jivtode_158531 avatar image milind.jivtode_158531 commented ·

Thanks @vikram.singh.chouhan_187371: I will look into #2 option u have provided.

I can't add fulfillmenteligibilitystatus to the PK because that value is updated by the application. I will have to DROP/INSERT the record leading to TombStones.


0 Likes 0 ·
Erick Ramirez avatar image Erick Ramirez ♦♦ milind.jivtode_158531 commented ·

@milind.jivtode_158531 A friendly note to let you know I've converted your post to a comment since it's not an "answer". Cheers!

0 Likes 0 ·
Erick Ramirez avatar image
Erick Ramirez answered

@milind.jivtode_158531 It is never a good idea to use ALLOW FILTERING in your application queries. It was designed to be used for development purposes only such that developers can interrogate the data in a non-production environment.

It does not scale and its performance is unpredictable and should never be used in production clusters. You need to practise good data modelling habits without having to resort to using this clause. Cheers!

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.