question

igor.rmarinho_185445 asked · Erick Ramirez edited

Why can't nodes in one DC communicate with nodes in other DCs in my cluster?

Hi, I recently added a new datacenter to my cluster, but I'm having a few issues, as shown below:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address     Load       Tokens       Owns    Host ID                               Rack
DS  10.20.80.4  ?          8            ?       7d3de0c-ec59-401b-aa5e-c6319f5d7666  rack1
DS  10.20.80.2  ?          8            ?       43a5ad3-dbd2-4313-a44a-f1a80992e75f  rack1
DS  10.20.80.3  ?          8            ?       9d2b9e32-a42a-4db9-b1a7-c7f5fb965e01  rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.20.80.6  308.74 MiB  8            ?       dc860a61-b6ff-438d-a536-4d5a837d4  rack1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  10.20.80.4  16.36 MiB  8            ?       7d38e0c-ec59-401b-aa5e-c63f5d7666  rack
UN  10.20.80.2  16.58 MiB  8            ?       43aad3-dbd2-4313-a44a-f1992e75f  rack1
UN  10.20.80.3  16.55 MiB  8            ?       9d2e32-a42a-4db9-b1a7-c7f965e01  rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address     Load       Tokens       Owns    Host ID                               Rack
DS  10.20.80.6  277.27 MiB  8            ?       dc860a61-b6ff-438d-a536-45a837d4  rack1
[2020-07-14 16:16:36,826] Repair command #2 finished with error
error: Repair job has failed with the error message: [2020-07-14 16:16:36,826] Endpoint not alive: /10.20.50.4
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2020-07-14 16:16:36,826] Endpoint not alive: /10.20.80.4
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:136)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)

It seems that my dc1 can't see the status of my dc2 and vice-versa. Is there something I missed?

installation gossip
3 comments


bettina.swynnerton ♦♦ commented ·

Hi @igor.rmarinho_185445,

How did you add the new datacenter? What snitch have you defined in cassandra.yaml?

igor.rmarinho_185445 commented in reply to bettina.swynnerton ♦♦ ·

Hi Bettina,

I just added a new node, added the dc1 seed to its cassandra.yaml, and added my dc2 node as a seed in dc1.
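For reference, a minimal multi-DC seed list along these lines, assuming one seed per DC (e.g. 10.20.80.2 in dc1 and 10.20.80.6 in dc2) and the same list on every node in both DCs, would look roughly like:

# cassandra.yaml (same seed list on every node in both DCs)
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.20.80.2,10.20.80.6"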

OpsCenter is also showing that the node is Unresponsive.


I'm using:

endpoint_snitch: GossipingPropertyFileSnitch
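With GossipingPropertyFileSnitch, each node advertises its own DC and rack from cassandra-rackdc.properties, so the new node would carry something like the sketch below (dc/rack names assumed from the nodetool output above):

# cassandra-rackdc.properties on the dc2 node
dc=dc2
rack=rack1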


I ended up removing the node with nodetool removenode and I'm trying to bootstrap it again, but I'm getting this error:


ERROR [DSE main thread] 2020-07-14 17:47:30,197  CassandraDaemon.java:901 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1768)
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:653)
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:1030)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:791)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:730)


igor.rmarinho_185445 commented ·

Just found out that port 7000 was blocked. I'll test again after it's fixed.
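Once the firewall is changed, a quick way to confirm the storage port is reachable between DCs is something along these lines (assuming netcat is installed):

# from a dc1 node, probe the dc2 node's gossip/storage port
nc -zv 10.20.80.6 7000
# and the reverse from the dc2 node
nc -zv 10.20.80.4 7000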


1 Answer

Erick Ramirez answered · Erick Ramirez edited

Symptoms

From the nodetool status outputs you posted, the nodes are in a Stopped state. Note that the DS at the start of this line indicates the node is Down and Stopped:

DS  10.20.80.6  277.27 MiB  8        ?   dc860a61-b6ff-438d-a536-45a837d4  rack1

Cause

It appears there's an issue with your nodes and it's highly likely that there's a problem with the underlying disk volumes.

A node goes into a stopped state when the disk(s) for data/ and/or commitlog/ have failed or are unavailable. Cassandra stops the node to prevent data loss, so it will no longer accept read or write requests. This behaviour is configured in cassandra.yaml with this setting:

disk_failure_policy: stop
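For reference, stop is the default policy. The other values this setting accepts are summarised below; the comments in your own cassandra.yaml are the authoritative description:

# disk_failure_policy values (summary):
#   die           - shut down gossip and client transports and kill the JVM
#   stop_paranoid - like stop, but also triggered by single-sstable errors
#   stop          - shut down gossip and client transports; node only reachable via JMX (default)
#   best_effort   - stop using the failed disk and serve requests from the remaining sstables
#   ignore        - ignore fatal errors and let requests fail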

In a stopped state, the node will show as up and normal (UN) to itself because it is still available to inspect via the JMX port 7199 (the nodetool utility is a JMX client). But it will show as down and stopped (DS) to other nodes and to clients/apps/drivers because the gossip port 7000 and the CQL client port 9042 are both shut down.
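You can confirm this from the affected node itself, since nodetool still works over JMX; something along these lines, where both commands would be expected to report "not running" in this state:

nodetool statusgossip
nodetool statusbinary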

For these reasons, repair reported the node as down, and the attempt to bootstrap returned this exception since gossip is not operational:

java.lang.RuntimeException: Unable to gossip with any peers

Solution

You will need to resolve the underlying issue with the disk volumes. Some of the things to check are listed below, followed by a rough sketch of the corresponding commands:

  • ensure there is sufficient free disk space on the commit log disk
  • ensure the volumes are mounted with the correct permissions
  • ensure that the DSE process user (e.g. cassandra) has full read and write permissions to the data and commit log directories and their parent directories
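A rough sketch of those checks, assuming a default package install with data and commit logs under /var/lib/cassandra and the service registered as dse (adjust paths and the service name for your environment):

# free space on the commit log and data volumes
df -h /var/lib/cassandra/commitlog /var/lib/cassandra/data

# ownership and permissions for the DSE process user
ls -ld /var/lib/cassandra /var/lib/cassandra/commitlog /var/lib/cassandra/data
sudo chown -R cassandra:cassandra /var/lib/cassandra

# once the disks are healthy, restart DSE (seed nodes first) and re-check the ring
sudo service dse restart
nodetool status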

Once resolved, restart DSE on the nodes starting with one of the seed nodes. Cheers!

