Asked by vgarud; answered by Erick Ramirez

APIs slow down without any apparent bottlenecks on Cassandra nodes

During peak load, the performance of our Cassandra cluster deteriorates, although we don't see any spike in CPU utilization or in read or write I/O. Read latencies exceed 1 second.

API Latency Graph

We are also unable to scale the load by adding nodes to the Cassandra cluster. When we increased the number of machines by 30% (reaching 36 nodes), the load we could handle did not grow by anywhere near the same percentage; the gain was much less than expected. Any hints on what else to debug would be much appreciated!

Cluster size: 27 nodes

EC2 configuration:
Machine type: m4.2xlarge
CPU: 8
Memory: 32 GB
Storage: EBS (1 TB)

Cassandra version: 3.11.4
Replication factor: 3
Total data: around 500 GB without replication (it was 180 GB when we had only Customer Data)

Cassandra stats

screen-shot-2021-10-06-at-111611-pm.png

screen-shot-2021-10-06-at-111558-pm.png




Tags: performance, read latency
1 comment

My current line of thinking is that there may be other configured limits, such as the number of connections or threads, that are artificially bottlenecking the system. Any suggestions on which system metrics to monitor would also be appreciated.


1 Answer

Erick Ramirez answered

If your analysis is correct, particularly that scaling up by 30% did not produce a roughly linear increase in throughput, then there's a good chance that the client/driver is the bottleneck.

Consider adding client instances at the application tier. For example, if you have 4 app instances, try adding 2 more and see whether the overall throughput increases.

Which driver are you using? If you're using the Java driver, check whether the driver's I/O threads are saturating a CPU core. You can run a profiler or a Linux command like pidstat -tu. The driver's I/O threads have names matching *-io-*. If the CPU core for a driver I/O thread is maxed out, try doubling pool.local.size to 2. Cheers!
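To make the pidstat check a bit more concrete, here is a minimal Python sketch that scans captured `pidstat -tu` output for threads whose names match *-io-* and whose %CPU is close to a full core. Several things here are assumptions rather than anything from the thread above: the function name `find_hot_threads`, the 90% threshold, and the sample text (made-up numbers in the column layout that recent sysstat versions print, with thread names shown behind a `|__` prefix). Treat it as a rough aid for reading the output, not a definitive diagnostic.

```python
import fnmatch

# Illustrative, made-up sample in the layout printed by `pidstat -tu`
# (assumed layout; check it against the output on your own hosts).
SAMPLE = """\
Linux 5.4.0 (app-host)  10/06/2021  _x86_64_  (8 CPU)

11:16:11 PM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:16:12 PM  1000      4321         -  180.00   20.00    0.00    0.00  200.00     3  java
11:16:12 PM  1000         -      4350   92.00    6.00    0.00    0.00   98.00     3  |__s0-io-0
11:16:12 PM  1000         -      4351   30.00    5.00    0.00    0.00   35.00     5  |__s0-io-1
11:16:12 PM  1000         -      4360   90.00    5.00    0.00    0.00   95.00     1  |__C2 CompilerThre
"""


def find_hot_threads(pidstat_output, name_pattern="*-io-*", cpu_threshold=90.0):
    """Return {thread_name: peak %CPU} for matching threads at or above the threshold."""
    hot = {}
    cpu_idx = cmd_idx = None
    for line in pidstat_output.splitlines():
        fields = line.split()
        # A header row tells us which column holds %CPU and where Command starts.
        if "%CPU" in fields and "Command" in fields:
            cpu_idx = fields.index("%CPU")
            cmd_idx = fields.index("Command")
            continue
        # Skip everything before the first header, plus blank or short lines.
        if cpu_idx is None or len(fields) <= cmd_idx:
            continue
        # The command is the last column and may itself contain spaces.
        name = " ".join(fields[cmd_idx:])
        if name.startswith("|__"):
            name = name[3:]  # drop the marker pidstat puts in front of thread names
        try:
            cpu = float(fields[cpu_idx])
        except ValueError:
            continue  # not a data row
        if fnmatch.fnmatch(name, name_pattern) and cpu >= cpu_threshold:
            hot[name] = max(cpu, hot.get(name, 0.0))
    return hot


if __name__ == "__main__":
    print(find_hot_threads(SAMPLE))
```

In practice you would feed it real output, for example text saved from running pidstat -tu against the application's process ID for a few samples during peak load. A driver I/O thread that repeatedly shows up near 100% is the signal that the pool size change above may be worth trying.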
