
mishra.anurag643_153409 asked:

What could be a reason my Pyspark job runs slow when reading data from Cassandra?

I am reading a table from Cassandra using a PySpark program and then writing the data to an AWS S3 bucket. The Spark job seems to spend most of its time reading the data from Cassandra.

What could be the possible reasons for this, and which configuration settings should I check to get faster reads?
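Roughly, the job does the following (the host, keyspace, table, and bucket names below are placeholders, not my real values):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cassandra-to-s3")
        .config("spark.cassandra.connection.host", "cassandra-host")
        .getOrCreate()
    )

    # Read the whole table through the spark-cassandra-connector ...
    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="my_keyspace", table="my_table")
        .load()
    )

    # ... and write it out to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3a://my-bucket/my_table/")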

spark-cassandra-connector

1 Answer

Erick Ramirez answered:

In your previous question (#10306), you stated that nodes are overloaded as a result of your PySpark app.

This is a good place to start. As Jaroslaw Grabowski suggested, you need to look at tuning the read parameters so as not to overload the nodes.
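For example, these are the connector's main read-tuning knobs. The values shown are illustrative starting points, not recommendations; check the spark-cassandra-connector configuration reference for the defaults in your connector version:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cassandra-to-s3")
        # Size of the Spark partition pulled from each Cassandra token range;
        # smaller splits mean more, lighter read tasks.
        .config("spark.cassandra.input.split.sizeInMB", "64")
        # Rows fetched per page within each read request.
        .config("spark.cassandra.input.fetch.sizeInRows", "1000")
        # Throttle: maximum read requests/pages per second per executor core.
        .config("spark.cassandra.input.readsPerSec", "500")
        .getOrCreate()
    )

Lowering the reads-per-second throttle (or raising the split size so fewer tasks run concurrently) trades a longer job runtime for less read pressure on the Cassandra nodes.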

Ultimately, you will need to review the performance of your cluster, including disk throughput and the number of nodes. There's a good chance you're hitting the upper limit of what the cluster can handle and will need to add nodes to increase its capacity. Cheers!
