I'm using a PySpark DataFrame to load a Cassandra table like this:
df = spark.read.table(CASSANDRA_TABLE_NAME).filter(xxx).select(c1, c2, ..., cn)
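For context, the table is exposed to spark.read.table via the Spark Cassandra Connector catalog; my session setup looks roughly like this (the catalog name, keyspace, and host below are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
    .config("spark.sql.catalog.cass",
            "com.datastax.spark.connector.datasource.CassandraCatalog")
    .getOrCreate()
)

CASSANDRA_TABLE_NAME = "cass.my_keyspace.my_table"  # placeholder keyspace/table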
But I found it's pretty slow when a column contains large binary data: reading 5 GB of data from C* takes around 100 s, so the throughput is roughly 50 MB/s, far below the SSD I/O speed (~500 MB/s). My C* cluster runs on SSDs, with 3 nodes, each with 20 GB of memory and 2 CPUs, and all other settings are basically defaults. I have 5 executors for Spark.
While debugging I see that the number of Spark partitions after reading from C* is 1, which does not look reasonable, since the table has 9 partitions in C* and I've set a very small value (1 MB) for "spark.sql.files.maxPartitionBytes":
df.rdd.getNumPartitions() # this value is 1
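For reference, this is roughly how the option is set before the read (a sketch; the 1 MB value is the one mentioned above):

spark.conf.set("spark.sql.files.maxPartitionBytes", str(1 * 1024 * 1024))  # 1 MB

df = spark.read.table(CASSANDRA_TABLE_NAME)
print(df.rdd.getNumPartitions())  # still prints 1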
So I am wondering whether that throughput number is normal, and how to control the number of partitions Spark uses for the read. Thank you very much.