Bringing together the Apache Cassandra experts from the community and DataStax.

question

XXSD asked ·

How do I partition a table of sensors where data is collected every 10 seconds?

I am very new to Cassandra. I have worked with Oracle SQL and MongoDB, and I am trying to learn Apache Cassandra to use it in a project I am working on.

I have a certain number of sensors (let's say 20) that might increase in the future. Each sensor sends data to store every 10 seconds. I am aware of bucketing to deal with this type of situation, but I am wondering which of these options is better:

  • PRIMARY KEY ((sensor_id, day_month_year), reported_at);
  • PRIMARY KEY ((sensor_id, month_year), reported_at);

I don't know if using *month_year* puts too much data in a single partition; on the other hand, I think that if I use *day_month_year* it creates too many partitions and slows reading down too much, since queries have to access multiple partitions.

Which one should I use? If you have other good suggestions, or just some explanations for me, I'd like to hear them.

data modeling


Erick Ramirez answered ·

Sensor data collected every 10 seconds is equivalent to:

  • 6 entries per minute
  • 360 entries per hour
  • 8,640 entries per day
  • 260K entries per month

Depending on what other data you store in each row, it will be difficult to keep each partition within the recommended 100 MB. This isn't a hard limit, so your partitions can go beyond 100 MB, but you are trading off performance the larger your partition gets.
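Those numbers can be reproduced with a quick back-of-the-envelope sketch. The ~100-byte average row size below is purely an assumption for illustration; measure your real table (e.g. with `nodetool tablestats`) before trusting any estimate:

```python
# Back-of-the-envelope partition sizing for readings every 10 seconds.
ENTRIES_PER_DAY = 24 * 60 * 6        # 6 per minute -> 8,640 per day
ASSUMED_ROW_BYTES = 100              # hypothetical average row size

def partition_stats(days_in_bucket: int) -> tuple[int, float]:
    """Return (rows, megabytes) for one partition covering the given days."""
    rows = ENTRIES_PER_DAY * days_in_bucket
    mb = rows * ASSUMED_ROW_BYTES / 1_000_000
    return rows, mb

for label, days in [("day_month_year", 1), ("month_year", 30)]:
    rows, mb = partition_stats(days)
    print(f"{label:15s} bucket: {rows:7,d} rows, ~{mb:.1f} MB")
```

With that (assumed) row size, even a monthly bucket stays comfortably small; wider rows or more sensors sharing a partition change the picture quickly.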

This part isn't really relevant, because you'd need to use something like Spark for analytics anyway:

if I use *day_month_year* it creates too many partitions and it slows reading too much when trying to get data since it has to access multiple partitions.

On its own, Cassandra isn't ideal for performing analytics queries because it is optimised for OLTP workloads where you are reading one partition for each app request. Cheers!


Thank you for the answer @Erick Ramirez. I am just now starting to work with Cassandra, so I am watching the 8-week course you guys launched. I am still on week 3, but I have already noticed how important data modelling is.

I need to do analytics queries where, in an app request, I export a file based on two parameters:

  • a certain period of time (normally months or even a full year);
  • the interval of the data: minutes, hours or days.

Example: if I choose a full year as the time period, I will receive a file with either 8,760 entries (if I choose hours) or 525,600 entries (if I choose minutes).
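Those counts can be double-checked quickly, assuming a non-leap year:

```python
# Sanity-check the export sizes for a one-year period at each interval.
HOURS_PER_YEAR = 365 * 24              # non-leap year
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60

print(HOURS_PER_YEAR)    # hourly entries per year
print(MINUTES_PER_YEAR)  # per-minute entries per year
```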



[Post converted to a comment since it's not an "answer"]


Since I am very new to this type of database, I ask you: how should I do partitioning based on this information? If I store data

Is there a way to generate this sort of analytics table automatically? Example: every minute, store a timestamp truncated to the minute, a dateUnit (being minute), fieldName, sum and average in a table called sensorID_month_year_analytics, partitioned using PRIMARY KEY ((sensor_id, month_year), dateUnit).

I was thinking of developing this project in Node.js, but I feel like there is more support for Java. Which one do you recommend?

What is Spark, and do you have any type of course on YouTube on how it works?

Sorry for asking so many questions; also, English is not my first language, so I struggle a little trying to explain things.

David Jones-Gilardi answered ·
Since I am very new to this type of database I ask you how should I do partitioning based on this information.

I think you are already on the right track. When using Cassandra you want to partition based on your SLAs for OLTP speeds, meaning: partition in a way that keeps partitions within the recommended 100 MB and 100K rows per partition. Again, as @Erick Ramirez said, these are not hard limits, but they are there to keep you on the right track to maintaining performance at scale. It is OK to bucket and create more partitions if you have to.
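As an illustration of what bucketing looks like on the application side, here is a sketch that derives candidate time-bucket strings for the partition key. The function name and the string formats are assumptions for illustration, not a fixed convention:

```python
from datetime import datetime, timezone

def bucket_keys(ts: datetime) -> dict[str, str]:
    """Derive candidate time-bucket strings for the partition key."""
    return {
        "day_month_year": ts.strftime("%Y-%m-%d"),  # one partition per day
        "month_year": ts.strftime("%Y-%m"),          # one partition per month
    }

reading_time = datetime(2021, 3, 15, 10, 30, 0, tzinfo=timezone.utc)
print(bucket_keys(reading_time))
# {'day_month_year': '2021-03-15', 'month_year': '2021-03'}
```

The app computes the bucket from the reading's timestamp on both write and read, so a query for a time range knows exactly which partitions to hit.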


Is there a way to generate this sort of analytics tables automatically? Example: every minute it stores timestamp only to the minute, dateUnit (being minute), fieldName, sum and average on a table partitioned using PRIMARY KEY ((sensor_id, month_year), dateUnit) called sensorID_month_year_analytics.

You can totally aggregate or roll data up into other "analytics" tables to get some basic analytics out of your data, per your suggestion. However, this is not meant for deep, ad hoc analytics or cases where you are looking to explore your data. As @Erick Ramirez mentioned above, Apache Spark (https://spark.apache.org/) is really a great tool for this. Spark is built for analytics workloads and can work together with Cassandra to provide deep analytics using things like Spark DataFrames and Spark SQL.
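As a sketch of the kind of roll-up described above, here is a minimal per-minute aggregation in plain Python (no driver involved; in a real app you would write each roll-up row to the analytics table, or let Spark produce it):

```python
from collections import defaultdict
from datetime import datetime

def rollup_by_minute(readings):
    """Aggregate (timestamp, value) readings into per-minute sum/avg/count."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate the timestamp to the minute to form the roll-up key.
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {
        minute: {"sum": sum(vals), "avg": sum(vals) / len(vals), "count": len(vals)}
        for minute, vals in buckets.items()
    }

readings = [
    (datetime(2021, 3, 15, 10, 30, 5), 21.0),
    (datetime(2021, 3, 15, 10, 30, 15), 23.0),
    (datetime(2021, 3, 15, 10, 31, 45), 20.0),
]
print(rollup_by_minute(readings))
```

This works fine for fixed, pre-decided aggregates; the moment you want arbitrary groupings or exploratory queries, that's where Spark earns its keep.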


What is Spark and do you have any type of course on Youtube on how it works?

Here is the YouTube playlist from our DataStax Academy course on Analytics -> https://www.youtube.com/watch?v=D6PMEQAfjeU&list=PL2g2h-wyI4Sp0bqsw_IRYu1M4aPkt045x. To be clear, you DON'T have to use DataStax Enterprise to use Spark with Cassandra. You can do it open source. My goal in sharing these videos is to help introduce you to Spark, what you can do with it, and how it works with Cassandra.


I was thinking of developing this project on Nodejs but I feel like there is more support on Java. Which one do you recommend?

I feel like either one is completely valid. There is solid driver support for both. However, depending on what you are trying to do, you don't actually have to use the drivers: you could use REST, GraphQL, or Document APIs with Cassandra as well, forgoing the drivers entirely. This is done using https://stargate.io/. Stargate works with all flavors of Cassandra, including open source, and is open source itself. You can also test this out for free using https://astra.datastax.com/. If you think any of those options are worth looking into, just ask and I'm happy to provide more details.


@Erick Ramirez @David Jones-Gilardi

Can I partition the tables like this?

  • PRIMARY KEY ((sensor_id, month_year, last_day), reported_at);

Where last_day should be 10, 20 or 28/29/30/31 depending on the month.

This way I could keep within the recommended values, with each partition having at most 95,040 rows and at least 69,120 rows. Is there a more viable way to store data efficiently for this sort of system (IoT)?

Keep in mind that I only want to perform analytics queries or get the last saved value for a sensor.


I am also thinking of aggregating data analytics in a separate table, like:

  • PRIMARY KEY ((sensor_id, month_year, minutes), reported_at);
  • PRIMARY KEY ((sensor_id, month_year, hours), reported_at);

I would always meet the recommended values, with at most 44,640 rows in the partition storing per-minute analytics data.
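Those row counts check out, assuming one reading every 10 seconds:

```python
ROWS_PER_DAY = 24 * 60 * 6   # one reading every 10 seconds -> 8,640/day

# Buckets ending on day 10, day 20, and the last day of the month.
print(10 * ROWS_PER_DAY)     # days 1-10  -> largest "regular" bucket
print(11 * ROWS_PER_DAY)     # days 21-31 -> largest bucket overall
print(8 * ROWS_PER_DAY)      # days 21-28 -> smallest bucket (February)

# Per-minute roll-up table: one row per minute for a 31-day month.
print(31 * 24 * 60)
```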


You can partition whichever way you like as long as it matches the application queries. But as I said in my response, the size of each partition matters because large partitions affect the performance of your cluster. Cheers!
