shehzadjahagirdar_185613 asked

Spark job takes 28 minutes to read 90M records

When reading data through a Spark job with spark-cassandra-connector-2.4.2, it takes 28 minutes to read 90,000,000 records. We need to reduce this to 5-10 minutes. Also, out of the 5 Spark executors, only one has tasks running on it; the other 4 are idle. Our Cassandra version is apache-cassandra-3.11.3. Below is the configuration used in the Spark job:
spark.yarn.maxAppAttempts = 1
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 16g
spark.sql.broadcastTimeout = 36000
spark.cassandra.input.consistency.level = LOCAL_QUORUM
spark.cassandra.output.consistency.level = ANY
spark.sql.shuffle.partitions = 150
spark.shuffle.blockTransferService = nio
spark.maxRemoteBlockSizeFetchToMem = 2000m
spark.sql.hive.filesourcePartitionFileCacheSize = 0
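A single busy executor with the rest idle usually means the read produced too few input partitions for the scheduler to spread around. With the Cassandra connector, the partition count is governed by `spark.cassandra.input.split.sizeInMB`; lowering it creates more, smaller input partitions. A minimal sketch of the relevant settings, assuming a spark-submit launch (the split size, executor counts, class name `com.example.CassandraRead`, and jar name are illustrative assumptions, not tested values):

```shell
# Illustrative spark-submit invocation; tune the values for your cluster.
# A smaller spark.cassandra.input.split.sizeInMB produces more input
# partitions, which lets the scheduler spread tasks across all executors.
spark-submit \
  --conf spark.executor.instances=5 \
  --conf spark.executor.cores=4 \
  --conf spark.cassandra.input.split.sizeInMB=64 \
  --class com.example.CassandraRead \
  my-job.jar
```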

@Erick Ramirez Please suggest a solution.

1 comment

@Erick Ramirez Eagerly waiting for your reply.

1 Answer

steve.lacerda answered

Some things to look at would be the number of executors, memory per executor, cores per executor, driver memory, and the split size. Increasing the number of executors may help, or it may hurt by causing data locality issues. Also, if the driver performs any computation or aggregation you'll need more driver memory, but too much driver memory can take memory away from other processes. The end result is that you will need to test performance against your own data.

I would test with the same type of data but a smaller dataset, and modify the above parameters to see where you gain performance.
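As a back-of-envelope check for the split-size knob mentioned above: the connector creates roughly (table size ÷ split size) input partitions, and you need at least (executors × cores per executor) partitions to keep every core busy. This is a rough approximation I am assuming here, not the connector's exact algorithm, and the function names and numbers are hypothetical:

```python
import math

def estimated_partitions(table_size_mb: float, split_size_mb: float) -> int:
    """Rough number of input partitions: table size divided by split size,
    rounded up. Approximation only; the connector's real count also depends
    on token range distribution."""
    return max(1, math.ceil(table_size_mb / split_size_mb))

def min_partitions_to_saturate(executors: int, cores_per_executor: int) -> int:
    """Minimum partition count so every core can have at least one task."""
    return executors * cores_per_executor

# Hypothetical 2 GB table: with 512 MB splits you get only 4 partitions,
# far fewer than the 20 cores of a 5-executor x 4-core cluster, so most
# executors sit idle. Dropping the split size to 64 MB yields 32 partitions.
print(estimated_partitions(2048, 512))      # 4
print(estimated_partitions(2048, 64))       # 32
print(min_partitions_to_saturate(5, 4))     # 20
```

If the estimated partition count is below the core count, no amount of extra executor memory will help, because there is simply no work to schedule on the idle executors.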
