question

study.aakansha2019_193516 avatar image
study.aakansha2019_193516 asked Erick Ramirez answered

Why are my SELECT queries on the videos_by_tag_year table for the DS220 course failing?

I'm going through "DS220: DataStax Enterprise 6 Practical Application Data Modeling with Apache Cassandra™" --> "Clustering Column" --> Exercise

Executing below command and can anyone help me understand why I'm getting below error:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

Below are query which I'm executing:

cqlsh:killrvideo> drop table killrvideo.videos_by_tag_year;
cqlsh:killrvideo> CREATE TABLE videos_by_tag_year (
... tag text,
... added_year int,
... video_id timeuuid,
... added_date timestamp,
... description text,
... title text,
... user_id uuid,
 ... PRIMARY KEY ((tag, video_id), added_year)
 ... ) WITH CLUSTERING ORDER BY (added_year desc);
cqlsh:killrvideo> COPY videos_by_tag_year (tag, added_year, video_id, added_date, description, title, user_id) FROM 'videos_by_tag_year.csv' WITH HEADER=true;
Using 1 child processes

Starting copy of killrvideo.videos_by_tag_year with columns [tag, added_year, video_id, added_date, description, title, user_id].
Processed: 797 rows; Rate: 723 rows/s; Avg. rate: 1219 rows/s
797 rows imported from 1 files in 0.654 seconds (0 skipped).
cqlsh:killrvideo> select count(*) from killrvideo.videos_by_tag_year ;
count
-------
797
(1 rows)
Warnings :
Aggregation query used without partition key
cqlsh:killrvideo> select * from killrvideo.videos_by_tag_year where tag = 'cql';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
cqlsh:killrvideo> select * from killrvideo.videos_by_tag_year where tag = 'cql' and added_year = 2015;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

We defined partition key on tag and video_id and tag can be used in filter condition also defined clustering key on added_year

academyds220
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

bettina.swynnerton avatar image
bettina.swynnerton answered bettina.swynnerton edited

Hi @study.aakansha2019_193516,

the primary key in your table definition looks like this:

PRIMARY KEY ((tag, video_id), added_year)

Here the partition key is a composite of two columns, tag and video_id. Note the extra brackets around the pair, this indicates it is a composite partition key: (tag, video_id)

Both column values have to be provided in your query in order to fully specify a partition key.

The following query fails, because you only specify tag, video_id is missing.

cqlsh:killrvideo> select * from killrvideo.videos_by_tag_year where tag = 'cql';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

And this query fails, because you again only specify tag, without video_id:

cqlsh:killrvideo> select * from killrvideo.videos_by_tag_year where tag = 'cql' and added_year = 2015;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

Does this answer your question?

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered

To add to Bettina's answer, the partition key of a table determines where the data is stored. I'll remind you that in the course video, Patrick McFadin talked about it in this slide about 4 minutes 30 seconds into the video on partitioning:

When you do not filter your queries with the partition key in the WHERE clause, Cassandra has to scan the whole table to retrieve the data you requested. This is a very inefficient way of reading data from the database.

As Bettina pointed out, the partition key for this table is composed of 2 columns (tag, video_id) (aka composite partition key):

CREATE TABLE videos_by_tag_year (
    ...
    PRIMARY KEY ((tag, video_id), added_year)
)

To retrieve all the rows of a partition, you must filter on both columns explicitly:

SELECT ... FROM videos_by_tag_year WHERE tag = ? AND video_id = ?

When you don't include video_id in the filter like these 2 queries, Cassandra does not which node stores the data so it has to query all nodes in the cluster:

SELECT ... FROM videos_by_tag_year WHERE tag = 'cql'
SELECT ... FROM videos_by_tag_year WHERE tag = 'cql' AND added_year = 2015

I recommend that you watch the Partitioning and Storage Structure video again of the DS220 Data Modeling course at DataStax Academy so you can revise the concepts taught in that unit. Cheers!


Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.