tyuen_144153 asked Erick Ramirez edited

Why can't I connect to a node via cqlsh when it's up and other nodes see it as UP?

Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})

The server ran out of disk space earlier. After I cleared out disk space, I re-enabled the gossip protocol on the server.

However, I'm still not able to connect to it with cqlsh, even though the other nodes see the server as up and are streaming to it.

[UPDATE] I replaced the actual server IP with "1.2.3.[1-5]"

On server 1.2.3.5, the output of sudo lsof -nPi -sTCP:LISTEN:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
splunkd 11864 root 4u IPv4 1423524 0t0 TCP *:8089 (LISTEN)
systemd-r 13598 systemd-resolve 13u IPv4 12970708 0t0 TCP 127.0.0.53:53 (LISTEN)
sshd 18544 root 3u IPv4 97855 0t0 TCP *:22 (LISTEN)
sshd 18544 root 4u IPv6 97857 0t0 TCP *:22 (LISTEN)
java 22819 cassandra 72u IPv4 39357126 0t0 TCP *:8080 (LISTEN)
java 22819 cassandra 89u IPv4 39357144 0t0 TCP 127.0.0.1:7199 (LISTEN)
java 22819 cassandra 90u IPv4 39357145 0t0 TCP 127.0.0.1:39805 (LISTEN)
java 22819 cassandra 875u IPv4 39357153 0t0 TCP 1.2.3.5:7000 (LISTEN)

On server 1.2.3.4, the output of nodetool status:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 1.2.3.1 111.55 GiB 256 ? 75577137-ee00-451b-83a8-f0384b9028fb 2a
UN 1.2.3.2 248.55 GiB 256 ? a2d86c73-6088-41d6-a0bf-11c447860369 2a
UN 1.2.3.3 229.69 GiB 256 ? cc2dd1aa-84cd-46e3-8526-0e92c266cbd2 2a
UN 1.2.3.4 124.1 GiB 256 ? 5e36176c-cad9-4653-a7b8-42da6aff2ffb 2a
UN 1.2.3.5 168.48 GiB 256 ? e2b08b7e-b004-44ad-97e3-92a74c095991 2a

1 Answer

Erick Ramirez answered Erick Ramirez edited

You haven't provided enough information for us to be able to provide a meaningful assessment of what the problem is.

Also, in the title of your original post, you stated "other nodes see it as down" but in the description box you said "other nodes are seeing the server is up" so which is it? [EDIT] Question was updated to "UP".

You're also connecting via cqlsh without specifying the publicly-accessible IP address. It wouldn't make sense that you would've configured localhost as the rpc_address since your apps won't be able to connect to the node.
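To test against the address the node actually serves clients on, one option is to read it from cassandra.yaml and pass it to cqlsh explicitly. The sketch below is illustrative only: the helper name yaml_value and the path /etc/cassandra/cassandra.yaml are assumptions (the path varies by install method), and it only handles simple top-level "key: value" lines.

```shell
# yaml_value KEY FILE: print the value of a simple top-level "key: value" line.
# Hypothetical helper, not part of Cassandra. Commented-out lines are ignored
# because their first field is "#" or "#key:" rather than "key:".
yaml_value() {
  awk -v key="$1" '
    $1 == (key ":") { print $2; exit }
  ' "$2"
}

# Example usage (the path is an assumption; adjust for your install):
#   conf=/etc/cassandra/cassandra.yaml
#   addr=$(yaml_value rpc_address "$conf")
#   port=$(yaml_value native_transport_port "$conf")
#   cqlsh "$addr" "${port:-9042}"
```

If rpc_address turns out to be unset or a wildcard address, this shortcut does not apply and the address has to be worked out from the rest of the configuration.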

You will need to post the output of the following command to give me an idea of what's running on the node:

$ sudo lsof -nPi -sTCP:LISTEN

as well as the output of nodetool status on the problematic node PLUS one other node in the same DC.

With limited information, my hypothesis is that you're testing connectivity incorrectly. Also, disabling gossip then re-enabling it isn't the correct way of dealing with a "disk full" issue. You should've shut Cassandra down before you carried out any remediation. Cheers!

[UPDATE] The lsof output you posted doesn't show the node listening on the CQL client port 9042, which means it is not accepting client connections. Try enabling the native transport with the following command:

$ nodetool enablebinary

If it works, lsof should show the node listening on port 9042 at the rpc_address.
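A rough way to confirm the result is to filter the lsof output for a listener on the port in question. This is a minimal sketch: is_listening is a hypothetical helper name, and it assumes the lsof column layout shown in the question, with the address:port in the second-to-last field and (LISTEN) in the last.

```shell
# is_listening PORT: read `lsof -nPi -sTCP:LISTEN` output on stdin and
# exit 0 if some socket is listening on PORT, otherwise exit 1.
# Hypothetical helper for illustration only.
is_listening() {
  awk -v port="$1" '
    NF >= 2 && $NF == "(LISTEN)" && $(NF-1) ~ (":" port "$") { found = 1 }
    END { exit !found }
  '
}

# Example usage on the node itself:
#   nodetool enablebinary
#   nodetool statusbinary
#   sudo lsof -nPi -sTCP:LISTEN | is_listening 9042 && echo "CQL port is open"
```

Run against the lsof output in the question, the check would succeed for port 7000 and fail for port 9042, which matches the "Connection refused" error from cqlsh.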


tyuen_144153 commented:

Erick, thanks for catching the title. I updated it to "UP". I did not disable the gossip protocol. I checked the gossip status, noticed that gossip was disabled, and re-enabled it after seeing in the debug log that other nodes were streaming to the node that had gossip disabled.

I was using sstableloader to load data in order to restore a snapshot, and I have two other jobs running sstableloader on two other nodes in the cluster.

Another question: if I shut down and restart that server instead of just re-enabling the gossip protocol, will it cause the sstableloader jobs running on the other servers to fail?

Erick Ramirez replied to tyuen_144153:

Gossip isn't a problem. The lsof output you posted shows that the node is listening on port 7000. For reference, see the list of ports used by C* in Configuring firewall port access.
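As a quick reference for reading lsof output, the default Cassandra ports can be mapped to their roles. This is only a sketch: cassandra_port_role is a hypothetical helper name, and it assumes the default port numbers have not been changed in the node's configuration.

```shell
# cassandra_port_role PORT: print the usual role of a default Cassandra port.
# Hypothetical helper; assumes the default port numbers are in use.
cassandra_port_role() {
  case "$1" in
    7000) echo "inter-node (gossip, streaming)" ;;
    7001) echo "inter-node over TLS" ;;
    7199) echo "JMX (nodetool)" ;;
    9042) echo "CQL native transport (cqlsh, drivers)" ;;
    9160) echo "Thrift (legacy clients)" ;;
    *)    echo "not a default Cassandra port" ;;
  esac
}

# Example usage:
#   cassandra_port_role 7000
#   cassandra_port_role 9042
```

Seen this way, a node that listens on 7000 but not on 9042 can appear as UN to its peers while still refusing cqlsh connections.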

FWIW sstableloader is just another client just like your app is a client to the cluster so it can tolerate some node outages. Cheers!
