Bringing together the Apache Cassandra experts from the community and DataStax.



peter.kovgan_176371 asked

spark-connector performance recommendations for machine learning

Hi,

This is probably a very general question, but I am asking anyway.

What are the general performance-oriented recommendations when working with the spark-connector?

Specifically, I am asking in the context of machine learning, assuming that Cassandra is the data store that holds the features.

And is Cassandra appropriate for storing such data?

Are there data model best practices?

Should I use a separate column for each feature?

Or should I keep features as a map in some column?

Are there best practices around Cassandra and machine learning?


Thanks!

Tags: spark, spark-connector, machine-learning

1 Answer

Russell Spitzer answered

In general, I would not use Cassandra as a datastore for ML data. Since training data is generally static, a distributed file system makes more sense for performance (it should be 10-20x+ faster). Cassandra use cases center on OLTP workloads, which rapidly request small pieces of data, so it may make sense to store the output of your ML workload in C* so that an application can access the results quickly.
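
As a rough illustration of storing ML output in C*, here is a minimal PySpark sketch that appends a DataFrame of results to a Cassandra table through the Spark Cassandra Connector's DataFrame source. The keyspace and table names in the usage comment are hypothetical, and the sketch assumes the connector is on the Spark classpath and that the target table already exists with columns matching the DataFrame.

```python
def cassandra_options(keyspace, table):
    """Build the option map expected by the connector's DataFrame source."""
    return {"keyspace": keyspace, "table": table}


def write_predictions(predictions_df, keyspace, table):
    """Append a Spark DataFrame of model output to an existing Cassandra table.

    Assumption: the table already exists and its column names match the
    DataFrame's column names.
    """
    (predictions_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(**cassandra_options(keyspace, table))
        .mode("append")
        .save())


# Hypothetical usage, given a DataFrame `scored` produced by a model:
#   write_predictions(scored, keyspace="ml", table="predictions")
```

An application can then look up individual results by primary key, which is the kind of small, fast request Cassandra is designed for.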


If your data is already in Cassandra and you want to do ML on it, it would be very beneficial to use the SCC (Spark Cassandra Connector) to move the data out into a distributed file store before using it for training.
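
A minimal PySpark sketch of that export step might look like the following. It reads a table through the connector's DataFrame source and writes it out as Parquet. The keyspace, table, host, and output path in the usage comment are hypothetical placeholders, Parquet is just one reasonable file format choice, and the sketch assumes the connector package is available to Spark.

```python
def cassandra_options(keyspace, table):
    """Build the option map expected by the connector's DataFrame source."""
    return {"keyspace": keyspace, "table": table}


def export_table_to_parquet(spark, keyspace, table, output_path):
    """Read a Cassandra table with the SCC and write it to Parquet files.

    `spark` is an existing SparkSession; `output_path` can point at any
    file system Spark can write to (for example an HDFS or S3 URI).
    """
    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(**cassandra_options(keyspace, table))
          .load())
    # Overwrite so the export can be re-run when the source data changes.
    df.write.mode("overwrite").parquet(output_path)


# Hypothetical usage:
#   from pyspark.sql import SparkSession
#   spark = (SparkSession.builder
#            .appName("cassandra-export")
#            .config("spark.cassandra.connection.host", "127.0.0.1")
#            .getOrCreate())
#   export_table_to_parquet(spark, "ml", "features", "hdfs:///ml/features")
```

Training jobs would then read the Parquet files directly instead of scanning Cassandra on every run.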

1 comment

Thanks Russell.

That is very interesting.

The problem is that our data is not static: it is updated sometimes, for example when labels are added.

We store the data in a relational DB, and I thought that storing it in C* would improve performance.

Now, with your input, I am starting to believe that we need to move the labeled data into a distributed FS.
