question

ipolyzos.se_178493 asked · Erick Ramirez edited

Cassandra performance issues for analytics workloads

Hi all, we are just getting started on our journey with Cassandra and we are modeling time series data.

Let's say we have sensor data which we store in Cassandra.

We partition using (id, year, month) in order to keep partitions to a reasonable size.

Now we need to generate reports on that data, covering either a single month or spanning many months.

Loading the data with the spark-cassandra-connector takes around 10 minutes for one id, one year and all months, and around 2 minutes for one id, one year and one month (~10 GB of data).
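For reference, this is roughly how we read from Cassandra (a minimal sketch; the host, keyspace, table and column names are placeholders, not our real ones):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("sensor-report")
  .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
  .getOrCreate()

val oneMonth = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "sensors", "table" -> "readings")) // placeholder names
  .load()
  // Restricting on all partition key columns lets the connector push the
  // predicate down to Cassandra instead of scanning the whole table.
  .filter(col("id") === "sensor-1" && col("year") === 2020 && col("month") === 6)
```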

Loading the same amount of data from the data lake using Spark, where the data is stored as Parquet partitioned by (id, year, month) as well, takes only a few seconds.
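And the equivalent read from the lake (the path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("lake-report").getOrCreate()

// The directory layout is partitioned by the same columns, so this filter
// prunes down to a single month's files instead of scanning everything.
val oneMonthFromLake = spark.read
  .parquet("s3://datalake/sensors/") // placeholder path
  .filter(col("id") === "sensor-1" && col("year") === 2020 && col("month") === 6)
```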

Could it be that Cassandra is mainly intended for more real-time analytics use cases, and since our data can be considered historical this is normal, or could something be wrong on our side?

Would it maybe make sense to use Cassandra for reports within a one-month interval and the data lake for the rest of the analytics?

The Spark cluster is relatively small (`3 workers`, `14.0 GB Memory, 4 Cores`), but I'm mainly wondering about Cassandra's fit for this use case.

I'm also using TWCS with a 1-day window.
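To make the setup concrete, here is a minimal sketch of the table shape (all names are placeholders, not our actual DDL), created through the connector's session:

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cassandra.connection.host", "cassandra-host") // placeholder

// Table shape: (id, year, month) as the partition key, a timestamp
// clustering column, and TWCS with 1-day compaction windows.
CassandraConnector(conf).withSessionDo { session =>
  session.execute(
    """CREATE TABLE IF NOT EXISTS sensors.readings (
      |  id    text,
      |  year  int,
      |  month int,
      |  ts    timestamp,
      |  value double,
      |  PRIMARY KEY ((id, year, month), ts)
      |) WITH compaction = {
      |  'class': 'TimeWindowCompactionStrategy',
      |  'compaction_window_unit': 'DAYS',
      |  'compaction_window_size': '1'
      |}""".stripMargin)
}
```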

Thanks



1 Answer

Cedrick Lunven answered

Hi ipolyzos.se_178493, thank you for this message and apologies for the delay in getting to it.

It sounds like you defined your tables as expected: a correct partition key and a good compaction strategy (TWCS). However, it seems you have HUGE partitions ("one id, one year and one month takes around 2 minutes (~10 GB of data)"), whereas the recommended maximum is about 100 MB per partition. That might be the reason why you see such a performance hit.
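As an illustration only (I am assuming column names here, this is not your actual schema), a common way to cap partition size is to add a finer-grained bucket, such as the day, to the partition key:

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cassandra.connection.host", "cassandra-host") // placeholder

// Hypothetical revised key: adding `day` to the partition key splits one
// oversized month partition into at most 31 much smaller daily partitions.
CassandraConnector(conf).withSessionDo { session =>
  session.execute(
    """CREATE TABLE IF NOT EXISTS sensors.readings_by_day (
      |  id    text,
      |  year  int,
      |  month int,
      |  day   int,
      |  ts    timestamp,
      |  value double,
      |  PRIMARY KEY ((id, year, month, day), ts)
      |)""".stripMargin)
}
```

A month-level report then touches up to 31 small partitions instead of one ~10 GB partition, and Spark can fetch those in parallel.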

In the DataStax Enterprise product, Spark and Cassandra ship in the same binary, which helps with data shuffling; more importantly, we also provide an HDFS implementation (DSEFS) to store Parquet files.

So YES, you can use Cassandra for analytics purposes, but it may take some time, as you experienced, and Parquet will have better performance. We see some deployments with OLTP in one DC (machines optimized for I/O) and a second DC for OLAP queries (hardware optimized for CPU). You would have a raw table with timestamps and a TTL of 30 days, for instance, and other tables with pre-aggregated data at different granularities like 1 min, 5 min, 1 hour, etc.
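A minimal sketch of that pattern, with made-up table and column names:

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cassandra.connection.host", "cassandra-host") // placeholder

CassandraConnector(conf).withSessionDo { session =>
  // Raw table: every write expires 30 days after insertion.
  session.execute(
    """CREATE TABLE IF NOT EXISTS sensors.readings_raw (
      |  id text, ts timestamp, value double,
      |  PRIMARY KEY (id, ts)
      |) WITH default_time_to_live = 2592000""".stripMargin) // 30 days in seconds

  // Pre-aggregated table: one row per (id, granularity, window start),
  // where granularity would be e.g. '1min', '5min' or '1hour'.
  session.execute(
    """CREATE TABLE IF NOT EXISTS sensors.readings_rollup (
      |  id text, granularity text, window_start timestamp,
      |  avg_value double, max_value double,
      |  PRIMARY KEY ((id, granularity), window_start)
      |)""".stripMargin)
}
```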

Happy to help here.
