We recently had an outage on one of our nodes and it was replaced with a new one.
We run Cassandra 3.11.10 and followed the guide here, starting the replacement node with
-Dcassandra.replace_address=address_of_dead_node
As far as I can tell, from the cluster's point of view all is fine: nodetool status reports the new node taking over, and querying system.peers correctly shows the new IP and not the old one.
However, our Spring Boot apps (Spring Boot 2.5.3, Spring Data 2021.0.3, DataStax Cassandra driver 4.11.2) don't behave as I would have hoped.
So I've recreated the problem locally with four Cassandra nodes running in Docker. After three of them have been started and the app has connected to them, I kill one and start the fourth with replace_address pointing at the one I killed.
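(For reference, this is roughly how I double-check the peers from the application side with the same driver; the contact point and datacenter name below are placeholders for our local setup, so treat it as a sketch rather than our exact code:)

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;
    import java.net.InetSocketAddress;

    public class PeersCheck {
        public static void main(String[] args) {
            // Contact point and datacenter name are placeholders.
            try (CqlSession session = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress("192.168.14.11", 9042))
                    .withLocalDatacenter("datacenter1")
                    .build()) {
                // system.peers lists every other node the contact point knows about.
                for (Row row : session.execute("SELECT peer, host_id FROM system.peers")) {
                    System.out.println(row.getInetAddress("peer") + " " + row.getUuid("host_id"));
                }
            }
        }
    }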
As it did for us in the real cluster, everything on the cluster looks fine.
When the node was lost, the app started trying to reconnect periodically, as expected, but when the replacement node was added these log lines (among others) could be found:

    [s1] Updating token map but some nodes/tokens have changed, full rebuild
    [s1] Unexpected error while refreshing token map, keeping previous version

and in the exception on that line (extracted for readability):

    Multiple entries with same key: Murmur3Token(-1075386851749511590)=Node(endPoint=/192.168.14.14:9042, hostId=401f6950-a1bb-468a-9494-1bf71353bf0d, hashCode=603a11aa) and Murmur3Token(-1075386851749511590)=Node(endPoint=/192.168.14.12:9042, hostId=2ce63970-e286-412a-b2b4-9ed24fa1469f, hashCode=6213b695)

    [s1] Node(endPoint=/192.168.14.14:9042, hostId=401f6950-a1bb-468a-9494-1bf71353bf0d, hashCode=603a11aa) was IGNORED, changing to LOCAL
It then goes on to re-prepare all the prepared statements, as expected, but it never queries .14 (the new node), and obviously not .12 (the old node). And it keeps trying to reconnect to .12 even though it has been replaced by now.
Do we need to do something different? Set up our apps differently? Is there an option to periodically forget all the tokens and fetch them fresh?
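For context, a minimal sketch of how a Spring Data configuration for these apps might look, assuming the standard AbstractCassandraConfiguration approach (class name, contact point and keyspace here are placeholders, not our actual values):

    import org.springframework.context.annotation.Configuration;
    import org.springframework.data.cassandra.config.AbstractCassandraConfiguration;

    @Configuration
    public class CassandraConfig extends AbstractCassandraConfiguration {

        // Only one of the original nodes is listed here; the rest of the
        // topology (peers, token map) is discovered by the driver itself.
        @Override
        protected String getContactPoints() {
            return "192.168.14.11";
        }

        @Override
        protected String getLocalDataCenter() {
            return "datacenter1";
        }

        @Override
        protected String getKeyspaceName() {
            return "my_keyspace";
        }
    }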
Thanks.
Edit 2022-02-03: both when the app starts and when the node is replaced, similar logs appear (and they look correct):

    [s1|default] Evaluator did not assign a distance to node Node(endPoint=/192.168.14.13:9042, hostId=c3b8c4f9-7bbe-498f-8f15-b05f8a8dad07, hashCode=3e4efa)
    [s1] com.datastax.oss.driver.internal.core.loadbalancing.DefaultLoadBalancingPolicy@1fbd0955 suggested Node(endPoint=/192.168.14.13:9042, hostId=c3b8c4f9-7bbe-498f-8f15-b05f8a8dad07, hashCode=3e4efa) to LOCAL, checking what other policies said
    [s1] Shortest distance across all policies is LOCAL
    [s1] Node(endPoint=/192.168.14.13:9042, hostId=c3b8c4f9-7bbe-498f-8f15-b05f8a8dad07, hashCode=3e4efa) was IGNORED, changing to LOCAL

    [s1|default] Evaluator did not assign a distance to node Node(endPoint=/192.168.14.14:9042, hostId=d1889735-5587-4cfb-ab96-59415e8895f6, hashCode=20217314)
    [s1] com.datastax.oss.driver.internal.core.loadbalancing.DefaultLoadBalancingPolicy@1fbd0955 suggested Node(endPoint=/192.168.14.14:9042, hostId=d1889735-5587-4cfb-ab96-59415e8895f6, hashCode=20217314) to LOCAL, checking what other policies said
    [s1] Shortest distance across all policies is LOCAL
    [s1] Node(endPoint=/192.168.14.14:9042, hostId=d1889735-5587-4cfb-ab96-59415e8895f6, hashCode=20217314) was IGNORED, changing to LOCAL