debasis.tcs_69445 asked Erick Ramirez commented

What could be the reason cqlsh returns OperationTimedOut errors?

Hi Experts

I am getting the error below in both data centers at the same time. I checked that all nodes were UP in both data centers, and there were no errors in the logs. I received the error from all nodes.

Can you please let me know what could have suddenly caused this problem?

cqlsh -u cassandra -p ******
Connection error: ('Unable to connect to any servers', {'': OperationTimedOut('errors=Timed out creating connection (5 seconds), last_host=None',)})

Thanks in advance.

smadhavan answered Erick Ramirez commented

@debasis.tcs_69445, could you update your original post with C*/DSE version, please?

Also, do you know whether the system keyspaces (particularly the security ones) were properly repaired? If not, I would recommend running a repair on them with nodetool repair -pr on all nodes, then retrying the connection with cqlsh -u cassandra -p MASKED --debug. Let us know what the output is after completing these steps.

Thanks for the reply

Cassandra version:-

cqlsh 5.0.1 | Cassandra 3.11.0-E000 | CQL spec 3.4.4 | Native protocol v4
cassandra@cqlsh> desc system_auth;
CREATE KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'dc100': '3', 'dc200': '3'} AND durable_writes = true;

After rebooting all 9 nodes in one data center, the problem was resolved in the other data center too (without restarting any nodes in the other data center).

Do we know why a restart would fix the issue on nodes in a separate DC?

A friendly note to let you know that I've converted your post to a comment since it's not an "answer". Cheers!

Erick Ramirez answered Erick Ramirez commented

This appears to be related to your other question (#4702) where your application is getting OperationTimedOutException connecting to the nodes.

As I explained in that other post, OperationTimedOutException gets thrown when the driver doesn't get a response from the nodes. The same thing applies here -- cqlsh uses an embedded Python driver to connect to the nodes.

There's some issue with your cluster that's environmental and I don't think it's a Cassandra issue. It seems to me at some point, clients (your application, cqlsh) lose connectivity to the nodes and I think you need to involve your sysadmin and network admin teams to assist you with the investigation.

I think the reboot of the nodes is coincidental. More likely, the reboot caused stale network connections to be released. When the problem manifests itself, you should take a snapshot of the connections on the nodes using Linux utilities like netstat and lsof for your sysadmins/network admins to review and analyse. Cheers!
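As a rough illustration of the kind of snapshot meant here, the sketch below summarises `netstat -tan` output by TCP state, which can make a pile-up of stale connections (for example many CLOSE_WAIT or TIME_WAIT entries) easier to spot. The function names `count_tcp_states` and `snapshot` are made up for this example, and it assumes the usual Linux netstat column layout where the state is the sixth column. Treat it as a starting point, not a diagnostic tool.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections by state from `netstat -tan` style text.

    Assumes the common Linux layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with "tcp" or "tcp6"; header lines are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


def snapshot() -> Counter:
    """Run `netstat -tan` on this node and summarise it (hypothetical helper)."""
    result = subprocess.run(
        ["netstat", "-tan"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(result.stdout)
```

Calling `snapshot().most_common()` on a node while the timeouts are happening gives a per-state count to hand to the sysadmins alongside the raw netstat and lsof output.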

I ran the cqlsh command on the same server where Cassandra is installed. Do you still think it could be a network issue?

Yes, because it can still happen when the TCP connections are maxed out on the server. Again, your issues appear to be environmental so you need to investigate why a local connection between cqlsh and C* isn't working.
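One way to separate a plain TCP problem from a Cassandra-level one is to try opening a socket to the native transport port directly from the node. The sketch below is a minimal example: `can_connect` is a made-up helper name, the 5-second default just mirrors the timeout in the cqlsh error, and 127.0.0.1 with port 9042 (the default native transport port) are assumptions to adjust to whatever address and port your nodes actually listen on. It only checks that the TCP handshake completes, not that the CQL handshake or authentication works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("127.0.0.1", 9042)` returning False while the node shows as UP would suggest the problem sits at the OS or network layer before the request ever reaches Cassandra.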

