Hi all, we are just getting started with Cassandra and are modeling time series data.
Here is the situation: let's say we have sensor data that we store in Cassandra.
We partition by (id, year, month) to keep partitions at a reasonable size.
Now we need to generate reports from that data, covering either a single month or many months.
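To make the access pattern concrete, here is a minimal sketch of how a report range maps onto (id, year, month) partition keys. The function name and the sensor id are made up for illustration, and months are assumed to be numbered 1 to 12.

```python
from typing import Iterator, Tuple


def partitions_for_report(sensor_id: str, start: Tuple[int, int],
                          end: Tuple[int, int]) -> Iterator[Tuple[str, int, int]]:
    """Yield one (id, year, month) partition key per month from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (sensor_id, year, month)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


# A yearly report for one sensor touches one partition per month.
keys = list(partitions_for_report("sensor-1", (2023, 1), (2023, 12)))
print(len(keys))
```

So a monthly report reads a single partition, while a yearly report for one id reads twelve of them.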
Loading the data with the spark-cassandra connector for one id, one year, and all months takes around 10 minutes, and the data for one id, one year, and one month takes around 2 minutes (~10GB of data).
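For illustration, a read of this kind could look roughly like the sketch below. This is not our exact code: the keyspace `sensors`, the table `readings`, the helper names, and the assumption that `id` is a text column are all placeholders. The `spark` argument is an existing SparkSession with the connector on its classpath.

```python
from typing import Iterable


def partition_filter(sensor_id: str, year: int, months: Iterable[int]) -> str:
    """Build a Spark SQL predicate that restricts every partition key column."""
    month_list = ", ".join(str(m) for m in sorted(set(months)))
    # Assumes `id` is a text column; drop the quotes for a numeric id.
    return f"id = '{sensor_id}' AND year = {year} AND month IN ({month_list})"


def load_sensor_data(spark, sensor_id: str, year: int, months: Iterable[int]):
    """Read the selected partitions through the spark-cassandra connector."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace="sensors", table="readings")  # hypothetical names
        .load()
        .filter(partition_filter(sensor_id, year, months))
    )


print(partition_filter("sensor-1", 2023, range(1, 13)))
```

Calling `explain()` on the resulting DataFrame should show whether the filter on the partition key columns is pushed down to Cassandra or applied only after a full table scan.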
Loading the same amount of data from the data lake with Spark, where the data is stored as Parquet and also partitioned by (id, year, month), takes a few seconds.
Could it be that Cassandra is mainly meant for more real-time analytics use cases, so this is normal for data like ours that can be considered historical? Or could something be wrong on our side?
Would it make sense to use Cassandra for reports within a one-month interval and the data lake for the rest of the analytics?
The Spark cluster is relatively small (`3 workers`, `14.0 GB Memory, 4 Cores`), but I am mainly wondering about the use case for Cassandra.
I'm also using TWCS with a 1-day window.
Thanks