question

debasis.tcs_69445 avatar image
debasis.tcs_69445 asked Erick Ramirez commented

Driver returned a NoHostAvailableException after getting OperationTimedOutException from nodes

We are getting below error from application

2020-05-27 00:17:45,915 | ERROR | | Consolidator-root | SCSDTTN01 | qtp25558514-131 | h.e.m.c.r.p.RuntimeExceptionMapper | {} | Internal server error com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: 192.168.11.24/192.168.11.24:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.24/192.168.11.24:9042] Timed out waiting for server response), 192.168.11.16/192.168.11.16:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.16/192.168.11.16:9042] Timed out waiting for server response), 192.168.11.19/192.168.11.19:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.19/192.168.11.19:9042] Timed out waiting for server response), 192.168.11.21/192.168.11.21:9042 [only showing errors of first 3 hosts, use getErrors() for more details])
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:83)
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at hr.ericsson.m2mse.jobs.dao.cassandraJob.CassandraJobDao.getJobsByStatusAndCreationTime(CassandraJobDao.java:644)
        at hr.ericsson.m2mse.jobs.business.impl.JobsServiceImpl.getAllJobs(JobsServiceImpl.java:171)
        at hr.ericsson.m2mse.jobs.rest.resource.impl.PurgeImpl.getPurgeJobs(PurgeImpl.java:94)
        at sun.reflect.GeneratedMethodAccessor221.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:180)
        at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:189)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:254)
        at org.apache.cxf.transport.servlet.ServletController.invokeDestination(ServletController.java:234)
        at org.apache.cxf.transport.servlet.ServletController.invoke(ServletController.java:208)
        at org.apache.cxf.transport.servlet.ServletController.invoke(ServletController.java:160)
        at org.apache.cxf.transport.servlet.CXFNonSpringServlet.invoke(CXFNonSpringServlet.java:180)
        at org.apache.cxf.transport.servlet.AbstractHTTPServlet.handleRequest(AbstractHTTPServlet.java:299)
        at hr.ericsson.m2mse.jobs.rest.resource.impl.JobsServlet.handleRequest(JobsServlet.java:87)
        at org.apache.cxf.transport.servlet.AbstractHTTPServlet.doGet(AbstractHTTPServlet.java:223)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
        at org.apache.cxf.transport.servlet.AbstractHTTPServlet.service(AbstractHTTPServlet.java:274)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
        at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:201)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:72)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:287)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:80)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:748)Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: 192.168.11.24/192.168.11.24:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.24/192.168.11.24:9042] Timed out waiting for server response), 192.168.11.16/192.168.11.16:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.16/192.168.11.16:9042] Timed out waiting for server response), 192.168.11.19/192.168.11.19:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.19/192.168.11.19:9042] Timed out waiting for server response), 192.168.11.21/192.168.11.21:9042 [only showing errors of first 3 hosts, use getErrors() for more details])
        at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:265)
        at com.datastax.driver.core.RequestHandler.access$1200(RequestHandler.java:62)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:357)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.retry(RequestHandler.java:539)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.processRetryDecision(RequestHandler.java:521)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:962)
        at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1575)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 common frames omitted

During that time I did not find any problem in system.log

driverjava
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

1 Answer

Erick Ramirez avatar image
Erick Ramirez answered Erick Ramirez commented

The OperationTimedOutException gets thrown when the driver doesn't get a response from the nodes for a read request within SocketOptions.getReadTimeoutMillis(). The error message itself explicitly states this:

Timed out waiting for server response

This is a client-side timeout meaning that the driver was waiting for a response but didn't get anything back from the node. It is different from a ReadTimeoutException which is a server-side timeout for queries which took longer than read_request_timeout_in_ms.

The OperationTimedOutException can happen when nodes are completely unresponsive usually because they are overloaded and have long GC pauses. But the driver can also get this timeout if the network latency is high or if there is some other network issue that prevents it from connecting to the cluster.

Based on the log entry you posted, several nodes were unresponsive/uncontactable:

... \
192.168.11.24/192.168.11.24:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.24/192.168.11.24:9042] Timed out waiting for server response), \
192.168.11.16/192.168.11.16:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.16/192.168.11.16:9042] Timed out waiting for server response), \
192.168.11.19/192.168.11.19:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [192.168.11.19/192.168.11.19:9042] Timed out waiting for server response), \
...

The log entries also indicate that the driver tried to connect to all hosts but they all timed out. When a node is unresponsive or unreachable, the driver temporarily blacklists the node. But the problem is that when all the nodes get blacklisted because they're all unreachable, the driver gives up trying because there aren't any nodes left to try:

        at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:265)

and that's what leads to the NoHostAvailableException:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed ( ... )

Check the nodes for signs of overload. Otherwise, you need to escalate the issue to your network administrators to determine why the driver/app cannot contact the cluster.

For more information, see the Java driver manual. Cheers!

4 comments Share
10 |1000

Up to 8 attachments (including images) can be used with a maximum of 1.0 MiB each and 10.0 MiB total.

debasis.tcs_69445 avatar image debasis.tcs_69445 commented ·

Thanks for the detailed answer. I checked sar report and found some NW drop in that time. Can you please share your thought

12:00:01 AM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s txcarr/s rxfram/s rxfifo/s txfifo/s
12:30:01 AM ens161 0.00 0.00 0.00 0.17 0.00 0.00 0.00 0.00 0.00
12:30:01 AM ens192 0.00 0.00 0.00  0.17 0.00 0.00 0.00 0.00 0.00
12:30:01 AM ens224 0.00 0.00 0.00 0.17 0.00 0.00 0.00 0.00 0.00
12:30:01 AM ens193 0.00 0.00 0.00  0.17  0.00 0.00 0.00 0.00 0.00
12:30:01 AM ens225 0.00 0.00 0.00  0.17 0.00 0.00 0.00 0.00
0 Likes 0 ·
Erick Ramirez avatar image Erick Ramirez ♦♦ debasis.tcs_69445 commented ·

I'm not sure what you're expecting from me but if you're saying that you've got evidence that there was a network connectivity issue then it would explain why the driver couldn't connect to the cluster. Cheers!

0 Likes 0 ·
debasis.tcs_69445 avatar image debasis.tcs_69445 Erick Ramirez ♦♦ commented ·

Also I see in that time snapshot backup as well tar command to zip them was running in all nodes. Could it be a problem due to that? because application was running heavy insert to cassandra nodes in that time.

0 Likes 0 ·
Show more comments