
shehzadjahagirdar_185613 asked

Spark job takes 28 minutes to read 90M records

When reading data through a Spark job with spark-cassandra-connector 2.4.2, it takes 28 minutes to read 90,000,000 records. We need to reduce this to 5-10 minutes. While reading, only one of the 5 Spark executors has tasks running on it; the other 4 are idle. Our Cassandra version is apache-cassandra-3.11.3. Below is the configuration used in the Spark job:

.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
.set("google.cloud.auth.service.account.enable", "true")
.set("spark.yarn.maxAppAttempts", "1")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "16g")
.set("spark.sql.broadcastTimeout", "36000")
.set("spark.network.timeout", "600s")
.set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
.set("spark.cassandra.output.consistency.level", "ANY")
.set("spark.sql.shuffle.partitions", "150")
.set("spark.shuffle.blockTransferService", "nio")
.set("spark.maxRemoteBlockSizeFetchToMem", "2000m")
.set("spark.sql.hive.filesourcePartitionFileCacheSize", "0")
.set("spark.cassandra.input.split.size_in_mb", "512")
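For completeness, here is a minimal, self-contained sketch of how these options are applied and how the table is read through the connector; the contact point, keyspace, and table names below are placeholders, not our real ones:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object CassandraReadJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Placeholder contact point; the real cluster address differs.
      .set("spark.cassandra.connection.host", "10.0.0.1")
      .set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
      // 512 MB splits mean few, large Spark partitions for the scan.
      .set("spark.cassandra.input.split.size_in_mb", "512")

    val spark = SparkSession.builder()
      .appName("cassandra-read-job")
      .config(conf)
      .getOrCreate()

    // "my_keyspace" and "my_table" are placeholders.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    println(s"rows = ${df.count()}") // forces the full table scan
    spark.stop()
  }
}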

@Erick Ramirez Please suggest a solution.

spark-cassandra-connector
1 comment

@Erick Ramirez Eagerly waiting for your reply.

1 Answer

steve.lacerda answered

Some things to look at are the number of executors, memory per executor, cores per executor, driver memory, and the split size. Increasing the number of executors may help, or it may hurt due to data locality issues. Also, if the driver is doing any kind of computation or aggregation, you'll need more driver memory, but too much driver memory can take memory away from other processes. The end result is that you'll need to test against your own data to find the right balance for performance.

I would test with the same type of data but a smaller dataset, and vary the above parameters to see where you gain performance.
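To make that concrete, those knobs map onto spark-submit flags and connector settings; the numbers below are illustrative starting points rather than recommendations, and the jar name is a placeholder:

spark-submit \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.cassandra.input.split.sizeInMB=64 \
  my-job.jar

Lowering the split size from 512 creates more, smaller Spark partitions for the scan, which is often what's needed when only one executor is busy. Note too that in connector 2.x the documented property name is spark.cassandra.input.split.sizeInMB, so it's worth verifying that the snake_case spark.cassandra.input.split.size_in_mb in your config is actually being picked up by your version.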
