Tri avatar image
Tri asked Erick Ramirez commented

How do I do an INSERT from a SELECT in Cassandra?

Let's say we have a users table. Then after a while, a new use case emerges, as we notice we often need to lookup users by their birthday. We then create a new table to fit that query:

CREATE TABLE users_by_birthday (
  birthday DATE,
  userid UUID,
  name TEXT,
  PRIMARY KEY ((birthday))

The content of this new users_by_birthday table could be derived from:

SELECT birthday, userid, name FROM users;

It would be convenient to do an INSERT SELECT ala old-fashion-SQL:

INSERT INTO users_by_birthday (birthday, userid, name)
FROM SELECT birthday, userid, name FROM users;

Granted this SELECT is an anti-pattern because of the full table scan. Doubled by a massive flood of INSERTs let's say of 1 million rows. However, we are willing to take the penalty for this one-time sin.

How to achieve this kind of INSERT ... SELECT the Cassandra way?

1 comment
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Just letting you know that we've had a high volume of questions in the last 48 hours and we've been caught up with the FREE Cassandra Workshop Series so our responses are delayed but we will get to them in the next few hours. Cheers!

0 Likes 0 ·

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered Erick Ramirez commented

As you are already aware, that isn't an OLTP query and if unbounded will result in a full table scan so it isn't a supported feature in Cassandra.

This works in RDBMS because there isn't a lot of data and the database is not distributed. It might work with a small cluster of 1 to 3 nodes but otherwise this query doesn't scale. Think of clusters with billions and billions of partitions. By the time you've extracted a few thousand partitions, some of that data has already mutated (updated, deleted) so it's already obsolete even before the query completed.

This kind of operation just wouldn't scale. You'll need to do it using an ETL tool or do it with Spark.

I hope this makes sense. Cheers!

2 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Make sense, and the ETL solution would not do better either. ie. the data could mutate before the query is complete.

I think of this scenario as kind of "quick internal ETL". This is a one-time thing where the user knows that it is an anti-pattern and is aware of the consequences.

I guess Cassandra designers had not implemented to avoid the undue support as users may attempt to use this as a real feature.

1 Like 1 ·

Exactly! I'm glad you can see why it would be a problem. :)

0 Likes 0 ·