Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

scherian_188962 avatar image
scherian_188962 asked ·

What is an efficient way of generating daily reports on a table where the query is not on the partition key?

CREATE TABLE test1 (
    username text,
    activity_timestamp timestamp,
    activity_id text,
    amount int,
    PRIMARY KEY (username,activity_timestamp,activity_id)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC, activity_id ASC);

The requirement is to generate a report to pull out all those records for the day which have amount less than 500. How can this be achieved from the existing model? (without having to create a new table)

A possible solution:

By executing shell script that hits all partitions of the table-->

user_list=`cqlsh -e "select distinct username from test1" `

current_date="$(date +'%Y-%m-%d')"

for user_id in ${user_list[@]}
do
dsbulk unload -delim ';' -query "select * from test1 where username='$user_id' and activity_timestamp>'$current_date';" | awk -F ';' '$2 <=500 && $2>=100'

Is there a better efficient way ?

data modeling
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered ·

Your proposed approach will be very inefficient and will not scale. More importantly, it is really a full table scan where you're retrieving rows from every partition (username) in the table.

Your solution just won't scale. This query alone will most likely timeout because it needs to read all data on all nodes to get all the user names:

SELECT DISTINCT username FROM test1

It might work for table with a few hundred or a few thousand users but anything more than that and the query just won't succeed.

I agree with Bettina and you need to use Apache Spark to run the OLAP (analytics) queries since it will be able to do it more efficiently than a shell script. Cheers!

2 comments Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Noted

Thank you for the response

0 Likes 0 · ·

Not a problem. Cheers!

0 Likes 0 · ·
bettina.swynnerton avatar image
bettina.swynnerton answered ·

Hi @scherian_188962,

I would say, whether this is an efficient approach depends on the volume of data that you need to process to get your results.

With this script, you are basically performing basic client side aggregation of the data for your report.

If you have a lot of data, you might want to consider a more efficient tool like Spark for the data analysis. Spark is built for large scale data processing on distributed systems. It depends on your data and how you want to use your report.

Without schema changes, to extract the data from this table, you will need to read all partitions, as you pointed out.

I hope this helps.

1 comment Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

hi @bettina.swynnerton, thanks for the response

the per day data won't be more than 6lakh records, while the distinct usernames for the entire table will be between 40k-50k

if the script is run, it'll be executed 50K times; also not every user will perform an activity every day, so many of the operations (performed against username partition) won't even generate output.

Hence I am apprehensive about the script efficiency


Apart from spark, are there any other ways we can get the required output from the existing model?

0 Likes 0 · ·