question

peter.kovgan_176371 asked ·

spark-connector performance recommendations for machine learning

Hi,

This is probably a very general question, but I'll ask anyway.

What are the general performance-oriented recommendations when working with the spark-connector?

In particular, I mean in the context of machine learning, assuming that Cassandra is the data store holding the features.

Also, is Cassandra appropriate for storing such data?

Are there data model best practices?

Should I use a separate column for each feature?

Or should I keep features as a map in some column?

Are there best practices around Cassandra and machine learning?


Thanks!

spark, spark-connector, machine-learning

1 Answer

Russell Spitzer answered ·

In general, I would not use Cassandra as a datastore for ML data. Since training data is generally static, a distributed file system makes more sense for performance (it should be 10-20x or more faster). Cassandra use cases center on OLTP workloads that rapidly request small pieces of data, so it may make sense to store the output of your ML workload in C* so that an application can rapidly access the results.
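As a rough illustration of storing ML output in C*, here is a minimal PySpark sketch that appends a DataFrame of results to a Cassandra table through the connector's DataFrame source. The function name, keyspace and table names are hypothetical, and it assumes the target table already exists and the Spark session is configured with the connector and a reachable cluster.

```python
# Sketch only: write ML results to Cassandra so an application can look them up.
# Assumes the Spark Cassandra Connector is on the classpath and that
# spark.cassandra.connection.host is set on the session.

CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def save_predictions(predictions_df, keyspace, table):
    """Append the rows of a Spark DataFrame to an existing Cassandra table.

    `keyspace` and `table` are placeholders; the DataFrame's column names
    are expected to match the columns of the target table.
    """
    (
        predictions_df.write
        .format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )


# Hypothetical usage, given a DataFrame `scored` produced by a model:
# save_predictions(scored, "ml", "predictions")
```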


If your data is already in Cassandra and you want to do ML on it, it would be very beneficial to use the Spark Cassandra Connector (SCC) to move the data out into a distributed file store before using it for training.
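A minimal PySpark sketch of that export step might look like the following. The choice of Parquet as the file format, the function name, and the keyspace, table and path values are all assumptions for illustration; any format and file system your training job can read would work the same way.

```python
# Sketch only: copy a Cassandra table out to a distributed file store once,
# then train from the files instead of reading from Cassandra repeatedly.
# Assumes the Spark Cassandra Connector is on the classpath and that
# spark.cassandra.connection.host is set on the session.

CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def export_table_to_parquet(spark, keyspace, table, output_path):
    """Read a Cassandra table through the connector and write it as Parquet.

    `keyspace`, `table` and `output_path` are placeholders. Parquet is an
    assumption here, not something the connector requires.
    """
    features_df = (
        spark.read
        .format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # Overwrite any previous export at the same location.
    features_df.write.mode("overwrite").parquet(output_path)


# Hypothetical usage with an existing SparkSession `spark`:
# export_table_to_parquet(spark, "ml", "features", "hdfs:///data/ml/features")
# training_df = spark.read.parquet("hdfs:///data/ml/features")
```

If the source table changes over time, the export can be rerun to refresh the files before the next training run.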


Thanks Russell.

It is very interesting.

The problem is that our data is not static; it is updated from time to time, for example when labels are added.

We store the data in a relational DB, and I thought that storing it in C* would improve performance.

Now, with your input, I am starting to believe that we need to move the labeled data into a distributed FS.
