Bringing together the Apache Cassandra experts from the community and DataStax.

Want to learn? Have a question? Want to share your expertise? You are in the right place!

Not sure where to begin? Getting Started

 

question

iotemi avatar image
iotemi asked ·

What is the best practice to store timeseries from schemaless source?

Hello,

I want to collect runtime metrics generated from IoT devices, and i want the database table to be not aware of the data schema.

Every device has between 40 and 100 metrics, and the sending period could be in the range of seconds, and I want to be able to query the database to retrieve the value of specific metric from a device in a certain time slot.

The purpose of the solution is to have a general-purpose timeseries database where i can collect metrics and plot graphs for monitoring and debug.

Every set of metrics is related to a device_id, so partitioning could be done by use of composite device_id and date.

Desired query-like behaviour

SELECT '$metric_id', ts FROM table WHERE id='$device_id' and date='2020-09-2020'

I have tried different approaches:

SOLUTION 1: use map with primitive type

PRO: one write per device

CONS: don't understand if possible to select single KVP or whole collection is returned?

CREATE TABLE test_keyspace.metrics_by_device_id_and_date (
    id text,
    date text,
    ts timestamp,
    metrics_bool map<text, boolean>,
    metrics_decimal map<text, decimal>,
    metrics_text map<text, text>,
    PRIMARY KEY ((id, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC)

Example query. Is it OK? Looks like column return is not object, but a string

SELECT 'metric_decimal['$(metric_id'], ts FROM table WHERE id='$device_id' and date='2020-09-2020'

SOLUTION 2: use table as a KVP store

PRO: it works... but

CONS: if device pushes 100 metrics/s, it means 100 writes/s to store data for single device. Probably ok from partitioning point of view (86.4k rows per metric per day). But i means billion writes on db per day (if 1000 devices)

CREATE TABLE test_keyspace.metrics_by_device_id_and_date (
     id text,
     date text,
     ts timestamp,
     metric_id text,
     bool_value boolean,
     decimal_value decimal,
     text_value text,
     PRIMARY KEY ((id, date, metric_id), ts) 
) WITH CLUSTERING ORDER BY (ts DESC)

SOLUTION 3: same as SOLUTION 1, but with UDT

PRO: cleaner compared to SOLUTION 1

CONS: same cons of SOLUTION 1, plus difficult to manage on with drivers (not ORM happy)

CREATE TYPE metric (
  id text,
  text_value text,
  decimal_value decimal,
  bool_value boolean
);
CREATE TABLE test_keyspace.metrics_by_device_id_and_date (
    id text,
    date text,
    ts timestamp,
    metrics frozen<map<text, frozen<metric>>>,
    PRIMARY KEY ((id, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC)

Do you have any advise on how i could address this problem? Is Cassandra still a good solution for this problem?

Thank you very much

data modeling
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered ·

I think a map collection is the best option for "schema-less" data source. It does come with the caveat that each collection is stored in one cell so you cannot just retrieve individual items.

With only a maximum of 100 key-value pairs, it will be fast to read off the disk so shouldn't be an issue and you can simply parse the value you need client-side.

I personally would not use UDTs if there are other options because of the complexity it adds. It is a personal preference as opposed to a hard rule. :)

Cassandra is definitely a good use case and is even better when you configure the table with the time-window compaction strategy (TWCS) which is designed for time-series data. Cheers!

2 comments Share
10 |1000 characters needed characters left characters exceeded

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

Thank you for the quick reply, i have done a quick test with the map implementation.


With a device sending 1kB of data (50 metrics) every second, the time to query one day of data so (87k rows with 50 metrics) is more than 2 seconds on a 4-vCore machine, with NodeJS and Datastax driver.


In your experience, is this read time normal? If i have to extract the metric at client side, why not store the data directy as JSON?

Best regards

0 Likes 0 · ·

There is no "normal", only what is "appropriate" for your cluster because there are lots of moving parts that are unique to the cluster -- use case, access patterns, hardware configuration, cluster load, etc.

And yes, you can store JSON data in a CQL text type but you will need to implement logic in your app to parse the JSON versus retrieving the value based on the metric ID. Cheers!

0 Likes 0 · ·