Bringing together the Apache Cassandra experts from the community and DataStax.


question

mishra.anurag643_153409 asked ·

What is the impact of reading large Cassandra partitions in Spark?

I am reading a Cassandra table in Spark and running a count on the resulting DataFrame. My Spark job creates 292 tasks. 290 of them finish quickly, but the remaining 2 run much longer. I was under the impression that Spark partitions end up roughly the same size, since the number of partitions depends on data size / input split size in MB, so there should be no data skew. But if Cassandra has a very large partition, could that cause data skew across the Spark partitions?

spark-cassandra-connector

1 Answer

Erick Ramirez answered ·

If a Cassandra partition is larger than the executor memory, then the task that reads it will result in an out-of-memory error.

Your assumption that all the Spark partitions will be the same size is incorrect. I've explained it in my answer to your other question, #11565. Cheers!
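To illustrate why equal-sized Spark partitions are not guaranteed, here is a rough sketch in plain Python (not the connector's actual code). It assumes a simplified model in which Cassandra partitions are grouped greedily up to a target split size and a single Cassandra partition is never split across Spark partitions. The function name `pack_into_spark_partitions`, the table sizes, and the 64 MB split size are all made up for the example.

```python
def pack_into_spark_partitions(cassandra_partition_mb, split_size_mb):
    """Greedily group consecutive Cassandra partitions into Spark partitions.

    Simplified model (an assumption, not the connector's real algorithm):
    a Cassandra partition is never split, so one oversized partition
    produces one oversized Spark partition.
    """
    spark_partitions = []
    current = []
    current_mb = 0
    for size_mb in cassandra_partition_mb:
        # Close the current group once adding more would exceed the target.
        if current and current_mb + size_mb > split_size_mb:
            spark_partitions.append(current)
            current, current_mb = [], 0
        current.append(size_mb)
        current_mb += size_mb
    if current:
        spark_partitions.append(current)
    return spark_partitions


# Hypothetical table: 96 partitions of 8 MB each, plus two 2,000 MB partitions.
sizes = [8] * 48 + [2000] + [8] * 48 + [2000]
groups = pack_into_spark_partitions(sizes, split_size_mb=64)
totals = [sum(g) for g in groups]

# Number of Spark partitions, then the smallest and largest in MB.
print(len(groups), min(totals), max(totals))  # prints: 14 64 2000
```

In this toy model, 12 of the 14 Spark partitions hold 64 MB each, while the other two hold 2,000 MB each, which looks a lot like a job where most tasks finish quickly and a couple run far longer.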
