Tri avatar image
Tri asked Tri edited

PK + multi-columns CK, WHERE clause: why not allowing to scan ONE single partition key?

Taken from Physical Data Model

CREATE TABLE actors_by_video (
   video_id TIMEUUID,
   actor_name TEXT,
   character_name TEXT,
   PRIMARY KEY ((video_id), actor_name, character_name)

Attempt to answer the question: Who was the actor that played the character Desi Collings?

SELECT * FROM actors_by_video WHERE video_id = 8a657435-0ef2-11e5-91b1-8438355b7e3a AND character_name = 'Desi Collings'; 

InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "character_name" cannot be restricted as preceding column "actor_name" is not restricted 

What is the reason such a request be considered invalid?

I get it that because we filter on the 2nd sort key skipping the 1st sort key. But overall, the worst thing that could happen is to scan the block of rows of ONE single partition key. Which is still far better that the table scan allowed by ALLOW FILTERING.

A client side solution would do exactly that, bring back the whole block of rows and filter one by one until there is a match on the 2nd clustering key.

Can you please suggest an elegant CQL way to solve this problem?

data modelingclustering key
1 comment
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Tri avatar image Tri commented ·

Oh, the likely answer is Secondary Index. Seen in Read Techniques:

CREATE INDEX idx_video_by_character ON actors_by_video (character_name);

SELECT * FROM actors_by_video
WHERE character_name = 'Desi Collings'
AND video_id = 8a657435-0ef2-11e5-91b1-8438355b7e3a

In real world , not sure if the `character_name` column has low cardinality enough to justify the usage of a secondary index. Most likely not. But at least for the goal of my specific question which is to answer the the question in a single SELECT. Secondary Index or Materialized View was the technical solution I was looking for.

Nevertheless creatting and maintaining a secondary index (or materialized view) for just query a "out-of-order" cluster key within a partition is a giant waste of resource. This probably warrant a feature request to Cassandra team

0 Likes 0 ·

1 Answer

bettina.swynnerton avatar image
bettina.swynnerton answered bettina.swynnerton edited


The reason for this restriction goes back to how Cassandra stores its data on disk.

The clustering keys determine the on-disk sort order, and Cassandra works best when it can read a stream of data from disk. If you leave out clustering keys, the result is a non-continuous read. That's why you can filter on the clustering keys in order, but can't skip one. By leaving out a clustering key between others, Cassandra would have to do a scan to find the requested data, which is inefficient and may have unpredictable performance.

There isn't an elegant CQL query to solve this, as Cassandra does not offer a lot of flexibility in this respect.

If it is a frequent query and you want to avoid the client side work, it would be better to denormalise and add a table that satisfy this query with a single partition read, such as a table character_by_video, where character_name would be the first clustering key.

This blog is still a good read regarding this topic.

I hope this helps.

2 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Tri avatar image Tri commented ·

@bettina.swynnerton I don't think this is a performance problem. Because the scope is limited in a single partition key. I can of course recreate an almost identical table with the Clusterered key expression swapped ((PK), CK1, CK2) becomes ((PK), CK2, CK1) .

But this solution is overkilled if we rarerly to the WHERE myPK=X AND CK2=blabla. In such a case we can accept to scan the entire partition key (NOTE: scanning ONLY ONE single parition key, not the entire partition).

A possible solution would be to add a kind of "ALLOW PK_FILTERING" ? What is most puzzling is: if CQL has "ALLOW FILTERING" to do table scan, why would it suddenly becomes extremely strict to dis-allow scanning one single partition key? Especially when the user accepts to pay the performance penalty.

0 Likes 0 ·
bettina.swynnerton avatar image bettina.swynnerton ♦♦ Tri commented ·

Hi @Tri,

adding such capability would definitely ease these particular cases, but the devil is in the detail and I am sure there is complexity in an implementation. I'll do some research if this has been considered before.

In practice, where the more flexible search outside of the key limitations is needed, users often compliment Cassandra with additional technologies such as Solr or Spark. It really depends on the context and the application requirement whether the answer is schema design, additional technologies or application work.

0 Likes 0 ·