Bringing together the Apache Cassandra experts from the community and DataStax.


deepank.dhillon_170663 asked:

How can I make Cassandra work fast with Spark?

Hey, I am working with Cassandra and Spark for analysis, but it is really slow when I work with large data sets. First, I want to ask about the data model: how should I design it? Second, I want to know if there is a way to load only a small subset of the data, rather than the whole table, into a DataFrame.


Tags: cassandra, spark

1 Answer

Russell Spitzer answered:

In general, the same rules apply as with normal Cassandra data modeling, although for analytics you can worry less about the partition key and more about the clustering key.

The Spark Cassandra Connector will automatically push filters and column pruning down to the underlying Cassandra table where applicable, but a filter *must* be one that C* can actually apply. This means you can restrict on clustering columns, but only in key order. There are also optimizations for IN clauses and partition-key joins in SCC 2.5 and greater.
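As a rough sketch of what pushdown looks like in practice, here is a PySpark read where a partition-key filter and a column selection are both pushed down to Cassandra. The connector package coordinates, connection host, and the keyspace/table/column names are all assumptions for illustration; adjust them to your cluster and Spark/SCC versions.

```python
# Sketch: reading a Cassandra table through the Spark Cassandra Connector
# with pushed-down column pruning and a partition-key filter.
# (Package version, host, and keyspace/table names are hypothetical.)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-pushdown-sketch")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.5.0")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")  # hypothetical names
    .load()
)

# The select() prunes columns and the equality filter on the partition key
# is a valid Cassandra predicate, so only the matching partition is scanned.
subset = df.select("date", "created", "data").filter("date = '2020-06-01'")

# Inspect the physical plan; pushed predicates appear under "PushedFilters".
subset.explain()
```

This is also how you answer the second part of the question: a pushed-down filter means Spark builds the DataFrame from only the matching slice of the table, not the whole data set.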

In general, though, if you have a very large dataset that you are going to process multiple times for various queries, I copy the data to Parquet first and then do the analysis on that.
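A minimal sketch of that one-time copy, assuming the same hypothetical keyspace, table, and output path as above:

```python
# Sketch: copy a large Cassandra table to Parquet once, then run all
# subsequent analyses against the Parquet copy instead of Cassandra.
# (Keyspace, table, and filesystem path are hypothetical.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-to-parquet").getOrCreate()

events = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")  # hypothetical names
    .load()
)

# Partitioning the Parquet output by date keeps later date-filtered queries
# cheap: Spark prunes whole directories instead of scanning every file.
(events.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("/data/events_parquet"))

# Repeated queries now hit the local/HDFS Parquet copy, not the database.
cached = spark.read.parquet("/data/events_parquet")
cached.filter("date = '2020-06-01'").groupBy("ip").count().show()
```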

1 comment

Thanks for the reply.

Yes, I do the same if I have a large dataset, but how should I use the partition key and clustering key?

Let me give you an example. I have a table with columns date date, created timestamp, ip text, data text, and PRIMARY KEY (date, created).

Here I used date as the partition key, so a partition is created per day and I can filter on a specific date. The clustering key has only one job: to sort the data by time within a particular partition (date).
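One likely cause of the slowness and timeouts described below: with date alone as the partition key, every row written on a given day lands in a single partition, so high-volume days produce very large, hot partitions. A common remedy is to add an artificial bucket to the partition key. A sketch of that variant, with a hypothetical table name and bucket count:

```sql
-- Hypothetical bucketed version of the table: the partition key becomes
-- (date, bucket), splitting each day into several smaller partitions.
CREATE TABLE events_by_day (
    date    date,
    bucket  int,          -- chosen at write time, e.g. a hash mod 16
    created timestamp,
    ip      text,
    data    text,
    PRIMARY KEY ((date, bucket), created)
);

-- Reading a whole day then means covering all buckets, for example:
-- SELECT * FROM events_by_day
--  WHERE date = '2020-06-01' AND bucket IN (0, 1, 2, /* ... */ 15);
```

The trade-off is that a full-day read now fans out over all buckets, but each partition stays bounded in size, which is usually what avoids read timeouts.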

But it is still very slow at reading and sometimes gives a timeout error.

I want to know what I should do to build a good data model.

Its read speed is around 27,000 rows/sec on a single machine.

There is so much to learn here.

Thank you.
