Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

pranali.khanna101994_189965 avatar image
pranali.khanna101994_189965 asked ·

Questions on reading partitions with filters on clustering columns

Hi ,

I referred https://www.datastax.com/blog/2015/06/deep-look-cql-where-clause I understood that The role of clustering columns is to cluster data within a partition.

and in order to retrieve data in an efficient way without a secondary index, you need to know all the clustering key columns for your selection . But in the further article I read ,

Single column slice restrictions are allowed only on the last clustering column being restricted for <=,>=,<,> operators that means query with not all clustering key columns is valid . Does that give you efficient result?

CREATE TABLE numberOfRequests (
    cluster text,
    date text,
    datacenter text,
    hour int,
    minute int,
    numberOfRequests int,
    PRIMARY KEY ((cluster, date), datacenter, hour, minute)
)
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour = 14 AND minute = 00;

Cassandra will find the data efficiently but if you execute:

SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND hour = 14 AND minute = 0;

Cassandra will reject the query as it has to scan the entire partition to find the requested data, which is inefficient.

and on further I saw this ,

>, >=, <= and < restrictions

Single column slice restrictions are allowed only on the last clustering column being restricted. Therefore, the following queries are valid:

SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour= 12 AND minute >= 0 AND minute <= 30; 
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour >= 12; 
SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter > 'US';

but how are they valid with respect to the topmost explaination . if valid , they will be inefficient right ?

cassandracql
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered ·

@pranali.khanna101994_189965 I'll respond to your questions below in turn:

Single column slice restrictions are allowed only on the last clustering column being restricted for <=,>=,<,> operators that means query with not all clustering key columns is valid .

Yes, it is fine to not filter with all clustering columns. Using your primary key as an example:

    PRIMARY KEY ((cluster, date), datacenter, hour, minute)

these filters are valid:

SELECT ... FROM ... WHERE cluster = ? AND date = ? AND datacenter = ?
SELECT ... FROM ... WHERE cluster = ? AND date = ? AND datacenter = ? AND hour = ?
Does that give you efficient result?

Yes, it does.

Cassandra will reject the query as it has to scan the entire partition to find the requested data, which is inefficient.

No, your query got rejected because it's invalid. You cannot skip datacenter. You must specify the preceding columns in order. You cannot filter on hour without filtering on datacenter (see my example above).

but how are they valid with respect to the topmost explaination . if valid , they will be inefficient right ?

They are valid because you're only doing a slice restriction "on the last clustering column being restricted", exactly as you quoted above.

In this query, you're only slicing the "last restricted column" which is minute:

SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour= 12 AND minute >= 0 AND minute <= 30; 

In this query, your only slicing the "last restricted column" which is hour:

SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter = 'US_WEST_COAST' AND hour >= 12; 

In this query, your only slicing the "last restricted column" which is datacenter:

SELECT * FROM numberOfRequests WHERE cluster = ‘cluster1’ AND date = ‘2015-06-05’ AND datacenter > 'US';

Since they're valid, they're not inefficient. Cheers!

Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.