In most cases, nodes in a cluster fail to communicate with each other because a firewall closes the socket when it detects that the connection between two nodes is idle. By default, most firewalls are configured with an idle timeout of 5 minutes.
We recommend setting the TCP keepalive idle time to 60 seconds, followed by up to 3 probes sent 10 seconds apart, on every node in the cluster:
$ sudo sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_probes=3 net.ipv4.tcp_keepalive_intvl=10
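These settings will detect dead TCP connections after 90 seconds (wait 60 seconds, then send 3 probes 10 seconds apart: 60 + 3 x 10 = 90). Because the first probe goes out after 60 seconds of inactivity, well inside the 5-minute firewall timeout, healthy connections no longer look idle to the firewall. The probes don't contain data, so the additional traffic on the network is insignificant.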
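If you want to check the arithmetic on a given node, here is a small sketch. It assumes a Linux host that exposes the keepalive settings under /proc/sys/net/ipv4; the helper name keepalive_detect_seconds is made up for this example and is not a standard tool.

```shell
#!/bin/sh
# Worst-case time to declare an idle peer dead:
#   idle wait + (number of probes x interval between probes)
# keepalive_detect_seconds is a hypothetical helper name, not a standard tool.
keepalive_detect_seconds() {
    # $1 = idle seconds before the first probe
    # $2 = number of probes
    # $3 = seconds between probes
    echo $(( $1 + $2 * $3 ))
}

# The values recommended above: 60 s idle, 3 probes, 10 s apart.
echo "recommended: $(keepalive_detect_seconds 60 3 10) seconds"

# The values currently active on this node, if /proc exposes them.
p=/proc/sys/net/ipv4
if [ -r "$p/tcp_keepalive_time" ]; then
    t=$(cat "$p/tcp_keepalive_time")
    n=$(cat "$p/tcp_keepalive_probes")
    i=$(cat "$p/tcp_keepalive_intvl")
    echo "this node:   $(keepalive_detect_seconds "$t" "$n" "$i") seconds"
fi
```

The first line printed is "recommended: 90 seconds". On a node still running the stock Linux defaults (7200, 9 and 75), the second line works out to 7875 seconds, over two hours, so a firewall with a 5-minute idle timeout would drop the connection long before the first probe is ever sent.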
Note that values set with sysctl -w are lost on reboot, so you will need to consult the relevant documentation for your Linux distribution on how to persist these changes. Cheers!
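As a possible starting point for persistence: many current distributions read drop-in files from /etc/sysctl.d at boot. The sketch below only prints such a file; the helper name keepalive_conf, the target path and the file name 99-tcp-keepalive.conf are assumptions for illustration, so confirm the right location in your distribution's documentation before installing it.

```shell
#!/bin/sh
# Print a sysctl drop-in containing the keepalive values from this answer.
# keepalive_conf is a hypothetical helper name.
keepalive_conf() {
    cat <<'EOF'
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10
EOF
}

# Show what would be written.
keepalive_conf

# To install it (assumed path and file name; check your distribution's docs):
#   keepalive_conf | sudo tee /etc/sysctl.d/99-tcp-keepalive.conf
#   sudo sysctl --system    # reload all sysctl configuration files
```

The install steps are left as comments so the snippet can be run safely without root; apply them on every node in the cluster once you have confirmed the path.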