Bringing together the Apache Cassandra experts from the community and DataStax.

kascatchme_38391 asked:

Issues replacing an unreachable node, cannot decommission or add another node

I have one unreachable node in my cluster. I tried replacing it, but the replacement wasn't successful. So I left the node as it was and ignored the data loss, because the replication factor is 3.

Now, when I try to decommission or add a server, it's not working as expected.

I'm getting these INFO messages on all the nodes. I have tried to assassinate and remove the node as well. The node doesn't show up in nodetool status, but my guess is that it is persisted somewhere and gossip is causing the issues.

INFO  [GossipStage:1] 2021-05-29 07:25:37,404 Gossiper.java:1029 - InetAddress /10.43.5.118 is now DOWN
INFO  [GossipStage:1] 2021-05-29 07:25:37,405 StorageService.java:2324 - Removing tokens [] for /10.43.5.118

Also, while restarting the node, I get an ERROR from gossip, which is a NullPointerException. It's not able to get the host ID. I also tried removing the node via JMX, using the older method described on Stack Overflow.

ERROR [GossipStage:1] 2021-05-29 08:48:35,229 CassandraDaemon.java:226 - Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:866) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2096) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1822) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2536) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1070) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1181) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-3.9.jar:3.9]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]

Can someone let me know how to remove this node completely?

I need a solution ASAP. TIA.

1 Answer

Erick Ramirez answered:

It is very difficult to assist you with the limited information you've provided, particularly since you've attempted multiple operations. It would be hard for us to "unscramble the egg" in a Q&A forum.

> I have one unreachable node in my cluster. I tried replacing it, but the replacement wasn't successful. So I left the node as it was and ignored the data loss, because the replication factor is 3.

Sadly, this is the wrong approach. You should have investigated the failed replacement and addressed it, because there could be other underlying issues with your cluster, and those are possibly the cause of the failures in the other operations you attempted.
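
One possible starting point for that kind of investigation is to check which endpoints the logs keep reporting. The following is a minimal sketch, assuming log lines in the same format as the two INFO messages quoted in the question; other Cassandra versions may word these messages differently, and the name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the two INFO lines quoted in the question (assumption:
# the wording is the same throughout the log being scanned).
DOWN_RE = re.compile(r"InetAddress /(\S+) is now DOWN")
EMPTY_TOKENS_RE = re.compile(r"Removing tokens \[\] for /(\S+)")


def summarize(lines):
    """Count, per endpoint, DOWN notices and removals of an empty token list."""
    down, empty_tokens = Counter(), Counter()
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            down[match.group(1)] += 1
        match = EMPTY_TOKENS_RE.search(line)
        if match:
            empty_tokens[match.group(1)] += 1
    return down, empty_tokens


# The two log lines from the question, used here as sample input.
sample = [
    "INFO  [GossipStage:1] 2021-05-29 07:25:37,404 Gossiper.java:1029 - InetAddress /10.43.5.118 is now DOWN",
    "INFO  [GossipStage:1] 2021-05-29 07:25:37,405 StorageService.java:2324 - Removing tokens [] for /10.43.5.118",
]
down, empty_tokens = summarize(sample)
print(down)
print(empty_tokens)
```

An endpoint that appears in both counters but not in nodetool status is a candidate for the stale gossip entry described in the question.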

> I have tried to assassinate and remove the node as well. The node doesn't show up in nodetool status, but my guess is that it is persisted somewhere and gossip is causing the issues.

A decommissioned node will persist in gossip for 72 hours by design. This is to prevent a node from re-joining the cluster if an operator accidentally restarts Cassandra.

Since all other nodes in the cluster are aware that the node has been decommissioned (its state stays in gossip for 3 days), the node can't join accidentally -- you have to take remedial action to add it back into the cluster.
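
To make the 72-hour window concrete, here is a small sketch of the arithmetic. It assumes the retention period is exactly the 72 hours stated above and uses the timestamp of the first log line in the question as a stand-in for the removal time, which the question does not actually give. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the 72-hour retention period stated in the answer.
GOSSIP_RETENTION = timedelta(hours=72)


def gossip_expiry(left_at: datetime) -> datetime:
    """Time at which the departed node's gossip state should lapse."""
    return left_at + GOSSIP_RETENTION


def still_in_gossip(left_at: datetime, now: datetime) -> bool:
    """True while the departed node is still expected to appear in gossip."""
    return now < gossip_expiry(left_at)


# Hypothetical removal time, borrowed from the first log line in the question.
left_at = datetime(2021, 5, 29, 7, 25, 37)
print(gossip_expiry(left_at))  # 2021-06-01 07:25:37
```

Under these assumptions, messages about the removed endpoint would be expected to continue until 1 June and stop after that.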

> Also, while restarting the node, I get an ERROR from gossip, which is a NullPointerException.

Which node did you restart, the decommissioned node? As above, if you've decommissioned or removed the node from the cluster, then the NPE is a result of the node's host ID being erased from the cluster, again as a protection so that it can't accidentally re-join. Cheers!