virajut asked

Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any peers

I am trying to restart a cluster on the same machines used for one deployment.

The process I am following is:

1. I load data into a cluster of x nodes (for example, 4 nodes).

2. I let replication and streaming complete between the nodes, as I am using RF=3.

3. Once the nodes are equally balanced, I run `nodetool drain` and kill the Cassandra process on each node, starting from the 4th node and leaving the seed node for last.

4. I then try to restart the cluster, starting with the seed node. The seed node starts fine, but the other nodes do not.

The errors on every other node are as follows:

io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /x.x.x.x:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any peers
java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1805)
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648)
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
ERROR [main] 2021-11-14 13:50:33,900 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1805)
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648)
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)

The end of the log shows:

INFO  [MemtableFlushWriter:4] 2021-11-14 13:50:38,374 LogTransaction.java:240 - Unfinished transaction log, deleting /home/ubuntu/cassandra_dc1_rack1_node2_data/system_schema/functions-96489b7980be3e14a70166a0b9159450/nb_txn_flush_d9600960-4551-11ec-ba00-a776ea0a5340.log 
INFO  [StorageServiceShutdownHook] 2021-11-14 13:50:38,379 HintsService.java:220 - Paused hints dispatch

I am using `GossipingPropertyFileSnitch` and have removed the `cassandra-topology.properties` file.

I have tried several times, starting fresh from scratch, but the restart fails every time. My main goal is to restart the cluster in a normal, balanced, and query-ready state.

Any help would be greatly appreciated. Thank you.

1 comment

I've checked communication between these nodes, and they are able to communicate without any problem.


1 Answer

Erick Ramirez answered

The node can't start because it can't gossip with any seeds. The first thing I'd recommend checking for is firewalls blocking traffic between nodes on port 7000.

If you have rebooted the server, verify that you did not inadvertently enable firewall software like iptables or firewalld. Confirm that there's network connectivity between nodes on port 7000 using Linux utilities like telnet or nc.
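
If telnet or nc are not installed, the same reachability check can be sketched in Python. This is only a minimal illustration: the function name `can_connect` and the 3-second timeout are assumptions made for this example, and a successful TCP connection only shows that something is listening on the port, not that gossip itself is healthy.

```python
import socket


def can_connect(host: str, port: int = 7000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Port 7000 is the inter-node port mentioned in this thread; pass a
    different value if your storage port is configured differently.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or
        # unreachable host; the "with" block closes the socket again.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from each node against every other node, for example `can_connect("<peer-ip>", 7000)`, since a firewall rule can block traffic in one direction only.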

One other thing which could prevent nodes from gossiping is when the commit logs have incorrect permissions. Check the file permissions on the contents of the commitlog/ directory. For example, make sure that files are owned by cassandra instead of root. Cheers!
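
As a rough sketch of that ownership check, the snippet below lists every entry under a directory whose owner differs from the expected user. The helper name `files_not_owned_by` is hypothetical, and the commit log location depends on your configuration, so the directory has to be supplied by you.

```python
import os
from typing import List


def files_not_owned_by(directory: str, expected_uid: int) -> List[str]:
    """List paths under `directory` whose owner UID is not `expected_uid`."""
    mismatched = []
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            path = os.path.join(root, name)
            # lstat inspects the entry itself rather than following symlinks.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)
```

On Linux, the expected UID can be looked up with `pwd.getpwnam("cassandra").pw_uid`, assuming the service runs as a user named `cassandra`. An empty result means every entry under the directory has the expected owner.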

3 comments

It looks like the nodes are able to communicate with each other. It turned out that the following sequence on each node seems to work, but there is still one interesting issue.

The sequence I followed is:

1. Disable gossip: `nodetool disablegossip`

2. Disable the binary protocol: `nodetool disablebinary`

3. Drain the node: `nodetool drain`

4. Stop the daemon: `nodetool stopdaemon`
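
The four steps above could be wrapped in a small script so that they always run in the same order on every node. This is a sketch only: the names `SHUTDOWN_STEPS`, `shutdown_commands`, and `shutdown_node` are made up for this example, and it assumes `nodetool` is on the `PATH` of the user running it.

```python
import subprocess
from typing import Callable, List

# The nodetool subcommands from the sequence above, in order.
SHUTDOWN_STEPS = ["disablegossip", "disablebinary", "drain", "stopdaemon"]


def shutdown_commands(nodetool: str = "nodetool") -> List[List[str]]:
    """Build the argument list for each step without running anything."""
    return [[nodetool, step] for step in SHUTDOWN_STEPS]


def shutdown_node(run: Callable = subprocess.run, nodetool: str = "nodetool") -> None:
    """Run each step in order, stopping at the first one that fails."""
    for cmd in shutdown_commands(nodetool):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps are skipped if an earlier one did not succeed.
        run(cmd, check=True)
```

Calling `shutdown_node()` on a node runs the sequence locally. The `run` parameter exists only so the sequence can be dry-run or logged by passing a different callable.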

Following this sequence on all nodes resulted in almost all the nodes reconnecting successfully, but here is the issue:

I have 4 nodes running: x.x.x.88, x.x.x.89, x.x.x.90, and x.x.x.91. The seed node is .88.

When I check the status from .88 and .89, all the nodes are reported as up and in the `UN` state. But when I check the status from .90, it reports .91 as down, and when I do the same on .91, it reports .90 as down.
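
To make this kind of mismatch easier to spot, the `nodetool status` output captured on each node can be compared programmatically. The sketch below assumes the usual layout, where each node line starts with a two-letter code such as `UN` or `DN` followed by the address. The function names `parse_status` and `disagreements` are hypothetical.

```python
import re
from typing import Dict

# First letter: Up/Down. Second letter: Normal/Leaving/Joining/Moving.
_STATUS_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_status(output: str) -> Dict[str, str]:
    """Map each node address to its two-letter state in one status output."""
    states = {}
    for line in output.splitlines():
        match = _STATUS_LINE.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def disagreements(views: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return the addresses whose state differs between observers.

    `views` maps an observer label to the parse_status result captured on
    that observer. "??" marks an address the observer did not list at all.
    """
    addresses = set()
    for view in views.values():
        addresses.update(view)
    result = {}
    for addr in sorted(addresses):
        seen = {obs: view.get(addr, "??") for obs, view in views.items()}
        if len(set(seen.values())) > 1:
            result[addr] = seen
    return result
```

Feeding it the output collected from all four nodes shows, per address, which observers report it as up and which report it as down.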

I checked with telnet, and the nodes can connect to one another on port 7000.

What could be the root cause of this issue? Could building Cassandra 4.0.1 from source be the problem? (Limitation: I can't use a prebuilt version.)


From the system.log, I can see that the attempt to connect from .90 to .91 was successful and that gossip also happened between them, but `nodetool status` still shows the same results.


In addition to the above, I ran multiple queries, some loading data and some reading data, and they worked well. The balancing also works fine and is updated correctly on all the nodes.

I have checked `nodetool gossipinfo` as well, and it seems to be reporting heartbeats correctly. Based on this behaviour, is this a bug, or could it be something else? The version I am using is Cassandra 4.0.1.
