When reading data through a Spark job with spark-cassandra-connector-2.4.2, it takes 28 minutes to read 90,000,000 records. We need to reduce this time to 5-10 minutes. Also, while reading, only one of the 5 Spark executors has tasks running on it; the other 4 are idle. Our Cassandra version is apache-cassandra-3.11.3. The configs used in the Spark job are listed below, and a sketch of the read itself follows the list.
- fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- fs.AbstractFileSystem.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
- google.cloud.auth.service.account.enable = true
- spark.yarn.maxAppAttempts = 1
- spark.memory.offHeap.enabled = true
- spark.memory.offHeap.size = 16g
- spark.sql.broadcastTimeout = 36000
- spark.network.timeout = 600s
- spark.cassandra.input.consistency.level = LOCAL_QUORUM
- spark.cassandra.output.consistency.level = ANY
- spark.sql.shuffle.partitions = 150
- spark.shuffle.blockTransferService = nio
- spark.maxRemoteBlockSizeFetchToMem = 2000m
- spark.sql.hive.filesourcePartitionFileCacheSize = 0
- spark.cassandra.input.split.size_in_mb = 512
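For reference, the read is set up roughly as in the sketch below. The connection host, keyspace, and table names (`my_keyspace`, `my_table`) are placeholders, not the real ones:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the read path with a subset of the configs above.
// Host, keyspace, and table names are placeholders.
val spark = SparkSession.builder()
  .appName("cassandra-read-job")
  .config("spark.cassandra.connection.host", "10.0.0.1")          // placeholder
  .config("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
  .config("spark.cassandra.input.split.size_in_mb", "512")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// Number of input partitions determines how many read tasks Spark can
// run in parallel across the executors.
println(df.rdd.getNumPartitions)
```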