Question

Spark connector reads put stress on one node

shaiwolf asked · jaroslaw.grabowski_50515 commented

Hi,

The attached image shows the range reads when reading a table from Cassandra. One Cassandra node receives many more read requests than the others, and it is always the same node (VPC-CASSANDRA-005). Once that node finishes, it receives zero reads until the task ends.

Read requests to all the other nodes are balanced. Is this a connector bug, or something external, e.g. configuration?

Environment:

  • Amazon Linux AMI 2018.03
  • java version "1.8.0_131"
  • spark-core_2.12-3.1.1.jar
  • spark-cassandra-connector_2.12-3.1.0.jar
  • Cassandra 3.1.1 [cqlsh 5.0.1 | DSE 5.1.22 | CQL spec 3.4.4 | DSE protocol v1]
  • 36 Cassandra servers
  • Spark runs on a remote server - 12 instances - 5 cores each

Thanks,

Shai

[Attachment: image.png (75.3 KiB)]
1 comment

That's interesting. Could you enable TRACE logging for CassandraTableScanRDD in your driver process? It would be interesting to see this log line: https://github.com/datastax/spark-cassandra-connector/blob/d760a745f43e3e33716b73df90111754a8092929/connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraTableScanRDD.scala#L273-L274
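
For reference, a minimal sketch of one way to raise that logger to TRACE from the driver, assuming the log4j 1.x API bundled with Spark 3.1 (an equivalent entry in log4j.properties works too):

```scala
import org.apache.log4j.{Level, Logger}

// Raise only the connector's table-scan logger to TRACE on the driver,
// so the partition/token-range assignment lines referenced above get emitted.
Logger.getLogger("com.datastax.spark.connector.rdd.CassandraTableScanRDD")
  .setLevel(Level.TRACE)
```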


1 Answer

Erick Ramirez answered

By default, the Spark connector uses a consistency level of LOCAL_ONE when reading from Cassandra. If your app doesn't partition/parallelize the reads correctly, that would explain why all the work is being sent to just one node.
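
For illustration, a minimal sketch of how those read settings are passed to the connector. The contact point and keyspace/table names here are hypothetical; `spark.cassandra.input.consistency.level` and `spark.cassandra.input.split.sizeInMB` are documented connector read options:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-scan")
  .config("spark.cassandra.connection.host", "10.0.0.1")          // hypothetical contact point
  .config("spark.cassandra.input.consistency.level", "LOCAL_ONE") // the connector's default read CL
  .config("spark.cassandra.input.split.sizeInMB", "64")           // governs how the scan is split into partitions
  .getOrCreate()

// Each resulting Spark partition covers a slice of the token ring, so reads
// should spread across replicas rather than land on a single node.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))     // hypothetical names
  .load()
```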

We need a lot more background information, including a minimal code sample that replicates the problem you're seeing, so please log a ticket with DataStax Support so one of our engineers can assist you. Cheers!
