Erick Ramirez asked azamat.hackimov_187983 answered

Why is a new node "Unable to gossip with other peers"?

In this post I'll explain why a new node added to the cluster is unable to communicate with other nodes. In some instances, the node was previously part of the cluster and is still unable to gossip when added back in.

cassandra gossip

Erick Ramirez answered

Symptoms

One of the tell-tale signs of this issue is that the node reports in the system.log that it is unable to gossip with other nodes in the cluster, for example:

ERROR [main] 2019-08-15 18:46:32,241 CassandraDaemon.java:749 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1435) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:566) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:823) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) ~[apache-cassandra-3.11.4.jar:3.11.4]

In some cases, other nodes are able to see the affected node as operational but the affected node itself is unable to gossip with other nodes. Here is a sample output of nodetool gossipinfo:

/10.1.2.4
  generation:0
  heartbeat:0
/10.1.2.3
  generation:0
  heartbeat:0
/10.1.2.6
  generation:1444263348
  heartbeat:6232
  ...
  DC:DC1
  STATUS:NORMAL,-1041938454866204344
  ...
/10.1.2.5
  generation:0
  heartbeat:0
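
On a larger cluster it can be tedious to spot these entries by eye. As a rough sketch (not part of the original KB article), the Python below scans saved nodetool gossipinfo output and lists the endpoints whose generation and heartbeat are both 0. The function name find_silent_endpoints is hypothetical, and the parser assumes endpoint lines start with "/" as in the sample above; some versions print a hostname before the slash, which this sketch does not handle.

```python
def find_silent_endpoints(gossipinfo_text):
    """Return endpoints whose generation and heartbeat are both 0."""
    states = {}
    current = None
    for raw in gossipinfo_text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            # Assumption: a new endpoint section starts with "/<ip>".
            current = line.lstrip("/")
            states[current] = {}
        elif current and ":" in line:
            key, _, value = line.partition(":")
            states[current][key] = value
    return sorted(
        endpoint
        for endpoint, state in states.items()
        if state.get("generation") == "0" and state.get("heartbeat") == "0"
    )


# Sample taken from the output above, with the elided lines left out.
SAMPLE_GOSSIPINFO = """\
/10.1.2.4
  generation:0
  heartbeat:0
/10.1.2.3
  generation:0
  heartbeat:0
/10.1.2.6
  generation:1444263348
  heartbeat:6232
  DC:DC1
  STATUS:NORMAL,-1041938454866204344
/10.1.2.5
  generation:0
  heartbeat:0
"""

if __name__ == "__main__":
    for endpoint in find_silent_endpoints(SAMPLE_GOSSIPINFO):
        print("no gossip state received from", endpoint)
```

In the sample, only 10.1.2.6 (the affected node itself) has a real generation and heartbeat; the other three endpoints are reported.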

Another symptom is that the affected node sees all the other nodes in the cluster as belonging to a different DC, as shown in this sample nodetool status output:

Datacenter: r1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns   Host ID                               Rack
DN  10.1.2.5  ?        256     9.0%   5279619a-550c-42b3-8150-61ad24f828f3  r1
DN  10.1.2.3  ?        256     9.1%   5d1fa459-cdac-4658-b68d-c6e0933afcee  r1
DN  10.1.2.4  ?        256     10.5%  a8f35c63-6a76-4e95-99f1-bef65d785366  r1
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns   Host ID                               Rack
UN  10.1.2.6  18.9 GB  256     9.5%   36fdcf57-0274-43b8-a501-c0e475e3e30b  RAC1
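
If you want to check for this pattern in a script, a minimal sketch along these lines groups the down (D*) nodes by the datacenter that nodetool status reports them under. The function name down_nodes_by_datacenter is hypothetical, and the parser assumes the layout shown above (a "Datacenter:" header followed by rows that begin with a two-letter status/state code).

```python
def down_nodes_by_datacenter(status_text):
    """Map each datacenter name to the addresses of nodes reported as down."""
    result = {}
    datacenter = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            result.setdefault(datacenter, [])
            continue
        fields = line.split()
        if datacenter is None or len(fields) < 2:
            continue
        code = fields[0]
        # Node rows start with Up/Down followed by Normal/Leaving/Joining/Moving.
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            if code[0] == "D":
                result[datacenter].append(fields[1])
    return result


# Sample taken from the output above.
SAMPLE_STATUS = """\
Datacenter: r1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 10.1.2.5 ? 256 9.0% 5279619a-550c-42b3-8150-61ad24f828f3 r1
DN 10.1.2.3 ? 256 9.1% 5d1fa459-cdac-4658-b68d-c6e0933afcee r1
DN 10.1.2.4 ? 256 10.5% a8f35c63-6a76-4e95-99f1-bef65d785366 r1
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.1.2.6 18.9 GB 256 9.5% 36fdcf57-0274-43b8-a501-c0e475e3e30b RAC1
"""

if __name__ == "__main__":
    for dc, addresses in down_nodes_by_datacenter(SAMPLE_STATUS).items():
        print(dc, "down:", addresses)
```

A datacenter in which every node is down, sitting next to one that contains only the local node, matches the symptom described here.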

Cause

The gossip protocol is used by the nodes to communicate information within the cluster. Gossip issues are usually related to problems with either snitch/topology configuration or the network layer.

In this case, the most common cause of the symptoms above is a misconfigured firewall or VLAN.

Solution

Use the following checklist to identify the cause of the issue:

  • check software firewall such as iptables or firewalld for misconfiguration
  • check for missed steps in your organisation's server provisioning process - did security settings get inadvertently applied to the node?
  • check ports on network devices for misconfiguration
  • check network policies such as quality-of-service (QoS) or bandwidth throttling rules for misconfiguration - do they apply to this environment?

NOTE - The standard gossip TCP port is 7000, or 7001 for SSL-secured clusters.
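
Before working through the checklist, it can help to confirm whether the gossip ports are reachable at all from the affected node. The following is a minimal sketch using only the Python standard library; the helper names can_connect and check_gossip_ports are hypothetical, and the peer addresses are the ones from the sample output above, so substitute your own. A successful TCP connect only shows that the port is open along the path. It does not prove that gossip itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_gossip_ports(peers, ports=(7000, 7001), timeout=3.0):
    """Try every peer on every port and return {(host, port): reachable}."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host in peers
        for port in ports
    }


if __name__ == "__main__":
    # Assumption: these are the other nodes in the cluster.
    peers = ["10.1.2.3", "10.1.2.4", "10.1.2.5"]
    for (host, port), reachable in check_gossip_ports(peers).items():
        print(f"{host}:{port} {'open' if reachable else 'unreachable'}")
```

Run it from the affected node towards its peers, and again from a peer towards the affected node, because firewall rules are often asymmetric. Expect 7001 to be unreachable on clusters that do not use SSL.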

Credits

Republished from DataStax Support Knowledge Base article, "New node in cluster unable to gossip, cannot determine workload of other nodes".


azamat.hackimov_187983 answered

Please note that the behavior described above can also be caused by the system clocks on the DN (down) nodes being out of sync:

ERROR [main] 2021-12-17 20:30:55,495 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.IllegalStateException: Unable to contact any seeds: [node-dc1/10.0.0.1:7000, node-dc2/10.1.0.1:7000]
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1751)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1056)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1017)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:799)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)

This can also be detected from log messages like this:

INFO  [ScheduledTasks:1] 2021-12-17 20:30:30,130 MessagingMetrics.java:206 - ECHO_REQ messages were dropped in last 5000 ms: 0 internal and 21 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 91830

Cassandra 4 is less tolerant of clock skew, so check that the time is correct on all hosts and that ntpd or chrony is properly configured.
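
To find such messages in a large system.log, something like the rough Python sketch below could be used. It parses a dropped-message line in the format shown above and flags it when the mean cross-node dropped latency is larger than the reporting window. That comparison is only my own heuristic for "suspiciously large", not a documented Cassandra threshold, and the names parse_dropped_line and looks_like_clock_skew are hypothetical.

```python
import re

# Matches the dropped-message summary format shown in the log line above.
DROPPED_RE = re.compile(
    r"(?P<verb>\w+) messages were dropped in last (?P<window_ms>\d+) ms: "
    r"(?P<internal>\d+) internal and (?P<cross_node>\d+) cross node\. "
    r"Mean internal dropped latency: (?P<internal_latency_ms>\d+) ms "
    r"and Mean cross-node dropped latency: (?P<cross_latency_ms>\d+)"
)


def parse_dropped_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = DROPPED_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    verb = fields.pop("verb")
    parsed = {name: int(value) for name, value in fields.items()}
    parsed["verb"] = verb
    return parsed


def looks_like_clock_skew(line):
    """Heuristic (assumption): cross-node drops with latency above the window."""
    parsed = parse_dropped_line(line)
    if parsed is None:
        return False
    return (
        parsed["cross_node"] > 0
        and parsed["cross_latency_ms"] > parsed["window_ms"]
    )


# The log line quoted above.
SAMPLE_LINE = (
    "INFO  [ScheduledTasks:1] 2021-12-17 20:30:30,130 MessagingMetrics.java:206 - "
    "ECHO_REQ messages were dropped in last 5000 ms: 0 internal and 21 cross node. "
    "Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 91830"
)

if __name__ == "__main__":
    print(parse_dropped_line(SAMPLE_LINE))
    print("possible clock skew:", looks_like_clock_skew(SAMPLE_LINE))
```

A hit is only a hint that the clocks are worth checking. Confirm the actual offset with your NTP tooling on each host.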

