Hello - I am looking for common approaches to limiting concurrency using async operations in the Java CQL driver. I have some large write operations that occur on a lot of different partitions in what is effectively a batch.
In total, a typical job ends up doing between 45,000 and 1,000,000 CQL operations. Some of these are batched, but the majority are not suitable for batching due to the nature of the workload. Most writes are between 1 KB and 10 KB, though some are smaller by necessity.
I've split out each write operation to function somewhat as a tree, since some writes require reads, and some writes are dependent on other writes. This works fine when the number of branches is small, but when the number of branches (n.b. >100) and therefore number of dependent async operations increases, I am getting NoNodeAvailableException.
Having done some googling, this appears to be expected when the number of active operations is higher than the number of available slots configured for the connection. However, what isn't apparent is how to address the situation. In my case, it's fine for every operation to simply queue up and "wait its turn." We're using DataStax driver version 4.6.x and Cassandra 3.11, running in Spring Boot 2.3. According to the rate limiting document in the DataStax driver docs, one can use the rate limiter. I've tried a few different permutations with it, but it appears to have no effect and the same exception is still raised.
Here is the configuration I've tried to allow things to queue up (not that I planned on doing this, but just to see what happened--nothing did).
    spring:
      data:
        cassandra:
          keyspace-name: test
          schema-action: none
          contact-points: cassandra,localhost
          request:
            throttler:
              type: RATE_LIMITING
              max-queue-size: 999999999
              max-requests-per-second: 32768
              drain-interval: 1ms
I am curious what other teams are doing to combat this limit, which as far as I can tell is the driver's default of 1024 concurrent requests per connection. The docs indicate that one doesn't necessarily want to increase this limit. In fact, spring-data-cassandra doesn't even expose this configuration, so it's clearly meant to be left alone.
So far I've been able to implement a retry for the async operations. This appears to work fine, though it is a suboptimal solution, since I am putting a lot of additional load on the driver as it turns away writes with this NoNodeAvailableException.
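
For reference, here is a minimal sketch of the kind of client-side "wait its turn" queueing I have in mind as an alternative to retrying. It is not anything provided by the driver: `AsyncLimiter`, `submit`, and `maxInFlight` are hypothetical names I made up for illustration, and the code only assumes that each operation returns a `CompletionStage`, as `CqlSession.executeAsync` does in driver 4.x. It never blocks a thread; operations over the cap are parked in a queue and started as earlier ones complete.

```java
import java.util.ArrayDeque;
import java.util.Objects;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

/** Hypothetical helper: caps the number of in-flight async operations and queues the rest. */
final class AsyncLimiter {
    private final int maxInFlight;
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private int inFlight = 0; // guarded by "this"

    AsyncLimiter(int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        this.maxInFlight = maxInFlight;
    }

    /** Starts the operation now if a slot is free, otherwise queues it until one is. */
    <T> CompletableFuture<T> submit(Supplier<? extends CompletionStage<T>> operation) {
        CompletableFuture<T> result = new CompletableFuture<>();
        Runnable start = () -> run(operation, result);
        synchronized (this) {
            if (inFlight >= maxInFlight) {
                pending.add(start); // no free slot: wait in line
                return result;
            }
            inFlight++;
        }
        start.run(); // run outside the lock so callbacks never execute while holding it
        return result;
    }

    private <T> void run(Supplier<? extends CompletionStage<T>> operation,
                         CompletableFuture<T> result) {
        CompletionStage<T> stage;
        try {
            stage = Objects.requireNonNull(operation.get(), "operation returned null");
        } catch (RuntimeException e) {
            // The operation failed before producing a stage: free the slot and report it.
            release();
            result.completeExceptionally(e);
            return;
        }
        stage.whenComplete((value, error) -> {
            release(); // hand the slot to the next queued operation, if any
            if (error != null) {
                result.completeExceptionally(error);
            } else {
                result.complete(value);
            }
        });
    }

    /** Passes the finished operation's slot to the next queued one, or frees it. */
    private void release() {
        Runnable next;
        synchronized (this) {
            next = pending.poll();
            if (next == null) {
                inFlight--;
            }
        }
        if (next != null) {
            next.run();
        }
    }
}
```

With something like this, each write in the tree would go through `limiter.submit(() -> session.executeAsync(statement))` with `maxInFlight` kept below the per-connection limit, and dependent writes would chain off the returned future as before. Two caveats I can see: the queue is unbounded, so a job with 1,000,000 operations holds all of its queued lambdas in memory, and if operations complete synchronously the hand-off in `release` recurses, which should not happen when completions arrive on the driver's I/O threads.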