Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

eddyeddy avatar image
eddyeddy asked Erick Ramirez edited

How should I model dynamic "filters" for attributes of a JSON document?

Morning all,

Hopefully this is the right place to ask such question but I've been fighting with this issue in MongoDB for quite long and while thinking on moving to Cassandra I was wondering if there's a proper way to do this in here as opposite to what I was trying in MongoDB.

I’ve got a collection named “bookings”, which contain different bookings performed by users:

{
    _id: ObjectId("..."),
    timestamp: 123456789,
    hotel_id: "123456789",
    city: "paris",
    country: "france",
    device: "web",
    filters: [
       "city=paris",
       "country=france",
       "device=web",
       "city=paris&country=france",
       "city=paris&device=web",
       "country=france&device=web",
       ...
    ],
    booking_ref: "ABCD"
}

The part within `filters` is dynamic, some fixed fields are present (such as city) but others may appear or no, such as device which is present in some documents but not in others.

What was done is a list of all possible combinations of filters so it's easily searchable from the API which consumes it and when adding/removing certain filters it's transparent.

Having those filters as attributes of the Document in MongoDB is an issue since those being dynamic deleting/creating indexes everytime a new filterable field was added is no option.

Which would be the way of modelling this in Cassandra so filters are dynamic, easily queryable from the consumer and without a performance impact?

Ps. In case it's not clear enough I can extend with more information!

data modeling
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

steve.lacerda avatar image
steve.lacerda answered

It sounds like what you'd want is the filters column to use something like a map:

https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useInsertMap.html

Everything in Cassandra can be sparse, meaning you can have the field or not and it doesn't matter. You will need a schema set up, but it'll just look something like:

CREATE TABLE cycling.cyclist_teams 
  id UUID PRIMARY KEY
  lastname text
  firstname text
  filters map<text,text>
);

I hope that helps.

Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Erick Ramirez avatar image
Erick Ramirez answered Erick Ramirez edited

The first principle of data modelling in Cassandra is to list all the application queries then design one table for each of the app queries. It is the complete opposite of traditional relational databases because tables are denormalised. This is done so the layout of the data on disk is optimised for reads making them really fast and highly scalable.

Example data models

Let me illustrate by showing you some example app queries and their corresponding tables.

APP QUERY 1 - "Get details of a booking"

This is a very generic application query. Given some booking ID or reference, the app would display details about the booking. The table would look something like:

CREATE TABLE bookings (
    booking_ref text,
    customerid text,
    hotel_id text,
    city text,
    country text,
    device text,
    PRIMARY KEY (booking_ref)
)

To query this table, you need to specify the booking reference:

SELECT ... FROM bookings
    WHERE booking_ref = ?

APP QUERY 2 - "Get bookings by a customer"

In this example, the app can retrieve the bookings made by a specific customer. Here is the example table schema:

CREATE TABLE bookings_by_customerid (
    customerid text,
    booking_ref text,
    hotel_id text,
    city text,
    country text,
    device text,
    PRIMARY KEY (customerid, booking_ref)
)

In this example, the table is partitioned by the customer ID and has rows of booking references meaning that each customer can have one or more bookings.

To query this table, specify the customer's ID:

SELECT ... FROM bookings_by_customerid
    WHERE customerid = ?

To query a specific booking by a customer, specify both the customer's ID and booking reference:

SELECT ... FROM bookings_by_customerid
    WHERE customerid = ?
    AND booking_ref = ?

APP QUERY 3 - "List a customer's bookings by city"

This application query is intended to list one customer's bookings by city:

CREATE TABLE bookings_by_customerid_city (
    customerid text,
    city text,
    booking_ref text,
    hotel_id text,
    country text,
    device text,
    PRIMARY KEY (customerid, city, booking_ref)
)

Similar to the bookings_by_customerid table, the data in this table is still partitioned by the customer ID but it now contains (one or more) rows of cities, and for each city there are (one or more) rows of bookings.

To list a customer's bookings by city:

SELECT ... FROM bookings_by_customerid_city
    WHERE customerid = ?

Similarly to only list bookings for a specific city:

SELECT ... FROM bookings_by_customerid_city
    WHERE customerid = ?
    AND city = ?

APP QUERY 4 - "List bookings for Paris"

If you wanted to store the bookings organised by city:

CREATE TABLE bookings_by_city (
    city text,
    booking_ref text,
    customerid text,
    hotel_id text,
    country text,
    device text,
    PRIMARY KEY (city, booking_ref)
)

Each city has one or more rows of bookings. To get the bookings in Paris:

SELECT ... FROM bookings_by_city (
    WHERE city = 'paris'

Highlights

If you're new to Cassandra and possibly not familiar with some of the terminologies, I've explained partition keys and clustering keys in question #6171.

Notice that all the tables contain the same data but they are stored slightly differently. As I stated above, the table schema determines how the data is laid out on disk. Since the columns and rows relevant to an application query are adjacent to each other on disk, there is no disk scanning required to retrieve all the records.

The disadvantage is that data is duplicated across multiple tables (denormalised) and they all need to be updated every time a booking is made. I've discussed how to keep denormalised tables in sync in FAQ #2744.

If you're new to Cassandra, datastax.com/dev has lots of free resources and tutorials where you can quickly learn the basics. I highly recommend having a looking at the free Data Modelling tutorial. Cheers!

Update

If performance matters, "dynamic filtering" isn't the answer. It looks like your app is trying to do a faceted search which you can get if you index your Cassandra data with Apache Solr or Elasticsearch.

Otherwise, I would recommend using the Stargate.io Document API. the Stargate Doc API allows you to save and search schemaless JSON documents in Cassandra. Stargate is fully open-source so it's free to use. If you're interested, there's a free interactive tutorial which shows you how to do CRUD operations. Astra DB comes with Stargate bundled in so it's pre-configured and ready to use if you want to try it for free. Details on the Doc API on Astra is available here. Cheers!

2 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thanks a lot guys for your responses @steve.lacerda , @Erick Ramirez !

I was wondering @Erick Ramirez , in case of having dynamic filters (let's say some client adds a 'newFilter' into their inputs, that would trigger a new collection right, even if it's only for that edge case?

Also, regarding the answer Steve gave, would that still be an optimal solution for dynamic cases so data doesn't have to be denormalized?

I will give a shot to the FAQ and courses also! Did those time ago but I forgot too many things it seems :(

0 Likes 0 ·

I've updated my answer above to respond to your follow up question. Cheers!

0 Likes 0 ·