Cluster: 3 DCs, 6 nodes per DC, RF: 3
Client: DataStax Java driver 3.11
Situation: after patching, one host launches a monitoring process (completely unrelated to C*) that has a memory leak. Over a one-week interval the leak consumes more and more of the available system memory (and the growth is not obvious when running top).
The host with the memory leak remains visible on the network: it responds to ping, accepts client connections, and shows as Up/Normal in nodetool status.
What we saw was that, because the node could not allocate memory for reads, requests for which the node itself was both the coordinator and a replica became disproportionately slower than on its peer coordinators. By the point where almost no memory was available to C*, the node would still accept client connections but would respond with the "not enough replicas" error (two of three replicas responded).
We're using the DataStax Java client libraries, and they work well.
I'm trying to understand how this situation could be remediated with zero knowledge of the underlying system infrastructure (sketches of the driver-side pieces we're looking at follow the questions):
- Under what circumstances would the LoadBalancingPolicy used by the driver return HostDistance.IGNORED? (See the policy sketch below for how we currently understand distance().)
- Is there anything internal to C* (i.e. something that appears in the logs as a warning or error) that would identify hosts that gradually degrade relative to their peers?
- Aside from the obvious system monitoring that should flag rogue processes with memory leaks, what monitoring (via JMX) would start to flag the gradual performance decline of a single node? (A polling sketch follows below.)
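
On the first question, our current understanding of the 3.x driver is that the built-in DCAwareRoundRobinPolicy reports local-DC hosts as LOCAL, up to usedHostsPerRemoteDc hosts per remote DC as REMOTE, and everything else as IGNORED, and that the driver keeps no connection pool to IGNORED hosts. To make the question concrete, here is a minimal sketch of a delegating policy that forces IGNORED for specific hosts; the class name and deny-list mechanism are ours, purely for illustration (we believe the driver's built-in HostFilterPolicy does something similar):

```java
import java.net.InetAddress;
import java.util.Collection;
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.LoadBalancingPolicy;

// Illustrative delegating policy: everything is passed through to a child policy,
// except that hosts on a manual deny-list are reported as IGNORED, which tells the
// driver not to keep a connection pool to them.
public class DenyListPolicy implements LoadBalancingPolicy {
    private final LoadBalancingPolicy child;
    private final Set<InetAddress> denied = new CopyOnWriteArraySet<>();

    public DenyListPolicy(LoadBalancingPolicy child) { this.child = child; }

    public void deny(InetAddress addr) { denied.add(addr); }

    @Override
    public void init(Cluster cluster, Collection<Host> hosts) { child.init(cluster, hosts); }

    @Override
    public HostDistance distance(Host host) {
        return denied.contains(host.getAddress()) ? HostDistance.IGNORED : child.distance(host);
    }

    @Override
    public Iterator<Host> newQueryPlan(String loggedKeyspace, Statement statement) {
        // For brevity the child's query plan is not filtered here; as we understand it,
        // a denied host has no connection pool, so it is skipped even if it appears in the plan.
        return child.newQueryPlan(loggedKeyspace, statement);
    }

    @Override public void onAdd(Host host) { child.onAdd(host); }
    @Override public void onUp(Host host) { child.onUp(host); }
    @Override public void onDown(Host host) { child.onDown(host); }
    @Override public void onRemove(Host host) { child.onRemove(host); }
    @Override public void close() { child.close(); }
}
```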
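On the third question, this is the kind of per-node JMX polling we have in mind. The host list and port 7199 are placeholders, and the MBean/attribute names (the ClientRequest read-latency percentiles and the coordinator's DynamicEndpointSnitch scores) are the ones we believe Cassandra exposes, so they may differ by version:

```java
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative poller: compares per-node read-latency p99 and each coordinator's
// dynamic snitch scores, so a single node drifting away from its peers should show
// up long before it stops answering.
public class NodeLatencyPoller {
    public static void main(String[] args) throws Exception {
        String[] nodes = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};  // placeholder hosts
        for (String node : nodes) {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Node-local read latency histogram (values are in microseconds).
                ObjectName readLatency = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
                Double p99 = (Double) mbs.getAttribute(readLatency, "99thPercentile");

                // This coordinator's view of how slow its peers are.
                ObjectName snitch = new ObjectName(
                    "org.apache.cassandra.db:type=DynamicEndpointSnitch");
                @SuppressWarnings("unchecked")
                Map<Object, Double> scores =
                    (Map<Object, Double>) mbs.getAttribute(snitch, "Scores");

                System.out.printf("%s read p99=%.0f us, snitch scores=%s%n", node, p99, scores);
            } finally {
                connector.close();
            }
        }
    }
}
```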
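And for context on the remediation question, the driver-side knobs we're currently considering as a mitigation (not a fix) are a latency-aware load balancing policy plus speculative executions, roughly as below. The contact point, DC name, and thresholds are placeholders, and we'd still like to understand the IGNORED semantics above before relying on this:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LatencyAwarePolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ClusterFactory {
    public static Cluster build() {
        return Cluster.builder()
            .addContactPoint("10.0.0.1")                       // placeholder contact point
            .withLoadBalancingPolicy(
                new TokenAwarePolicy(
                    // Demote hosts whose recent latency exceeds 2x the fastest host's,
                    // so a slowly degrading node falls to the back of query plans
                    // without being marked down.
                    LatencyAwarePolicy.builder(
                            DCAwareRoundRobinPolicy.builder()
                                .withLocalDc("DC1")            // placeholder local DC name
                                .build())
                        .withExclusionThreshold(2.0)
                        .build()))
            // If the first coordinator has not answered within 200 ms, hedge the
            // request against a second coordinator (idempotent statements only).
            .withSpeculativeExecutionPolicy(
                new ConstantSpeculativeExecutionPolicy(200, 2))
            .build();
    }
}
```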