peter.kovgan_176371 asked Erick Ramirez edited

spark-connector performance recommendations for machine learning


This is probably a very general question, but I'll ask anyway.

What are the general performance recommendations when working with the spark-connector?

Specifically, I am asking in the context of machine learning, assuming Cassandra is the data store that holds the features.

Is Cassandra appropriate for storing such data?

Are there data model best practices?

Should I use a separate column for each feature?

Or should I keep the features as a map in a single column?

Are there best practices around Cassandra and machine learning?



1 Answer

Russell Spitzer answered Erick Ramirez edited

In general, I would not use Cassandra as a datastore for ML data. Since training data is generally static, a distributed file system makes more sense for performance (reads should be 10-20x or more faster). Cassandra use cases center around OLTP workloads, which rapidly request small pieces of data, so it may make sense to store the output of your ML workload in C* so that an application can rapidly access the results.
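As a rough illustration of storing ML output in C*, here is a minimal PySpark sketch that appends a DataFrame of results to an existing Cassandra table through the connector's DataFrame source. The helper name `save_predictions` and the keyspace and table names are hypothetical, and it assumes the connector is on the Spark classpath and the target table already exists with matching columns.

```python
from typing import Any

# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def save_predictions(predictions_df: Any, keyspace: str, table: str) -> None:
    """Append the rows of a Spark DataFrame to an existing Cassandra table.

    `predictions_df` is expected to be a pyspark.sql.DataFrame whose column
    names match the columns of the target table (an assumption of this sketch).
    """
    (
        predictions_df.write
        .format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .mode("append")  # add rows without truncating the table
        .save()
    )
```

With a live session this would be called as something like `save_predictions(scored_df, "ml", "predictions")`, after which the application can read individual results by primary key.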

If your data is already in Cassandra and you want to do ML on it, it would be very beneficial to use the Spark Cassandra Connector (SCC) to move the data out into a distributed file store before using it for training.
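A minimal sketch of that export step in PySpark might look like the following. It reads a table through the connector's DataFrame source and writes it out as Parquet, which is only one possible file format. The function name, keyspace, table, and output path are hypothetical, and it assumes the connector is on the classpath and `spark.cassandra.connection.host` is already configured on the session.

```python
from typing import Any

# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def export_table_to_parquet(
    spark: Any, keyspace: str, table: str, output_path: str
) -> None:
    """Copy one Cassandra table to Parquet files on a distributed file system.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # Full table scan through the connector.
    df = (
        spark.read
        .format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # Overwrite any previous snapshot at the same location.
    df.write.mode("overwrite").parquet(output_path)
```

Training jobs would then read the Parquet snapshot with `spark.read.parquet(...)` instead of scanning Cassandra on every run. A call could look like `export_table_to_parquet(spark, "ml", "features", "hdfs:///data/features")`, where all three names are placeholders.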

1 comment

peter.kovgan_176371 commented

Thanks, Russell.

This is very interesting.

The problem is that our data is not static: it is updated from time to time, for example when labels are added.

We currently store the data in a relational DB, and I thought that storing it in C* would improve performance.

Now, with your input, I am starting to think that we need to move the labeled data into a distributed file system.
