Details
Bug
Resolution: Won't Fix
Major
1.8.0
Security Level: Public
Untriaged
Description
Logs attached. The reported problem was that after a power failure, one node of a 2-node Couchbase cluster returned with a reset configuration.
In the Couchbase logs, right after the restart, we fail to listen on the IP address we are configured to use:
ERROR REPORT <0.57.0> 2012-06-18 09:12:01
===============================================================================
Got error:eaddrnotavail. Cannot listen on configured address:192.168.1.8
/var/log/messages shows that the DHCP client only obtained that address about 2 seconds after we tried to listen on it:
Jun 18 09:12:03 cheetah dhclient: bound to 192.168.1.8 -- renewal in 807476524 seconds.
And then, because we come up as ns_1@127.0.0.1 and that name is not in nodes_wanted, we leave the cluster:
INFO REPORT <6044.171.0> 2012-06-18 09:12:02
===============================================================================
ns_1@127.0.0.1:<6044.171.0>:ns_node_disco:189: We've been shunned (nodes_wanted = ['ns_1@192.168.1.71',
'ns_1@192.168.1.8']). Leaving cluster.
INFO REPORT <6044.66.0> 2012-06-18 09:12:02
===============================================================================
ns_log: logging ns_cluster:1:Node 'ns_1@127.0.0.1' is leaving cluster.
Then we loop for a while, apparently resetting the configuration several times (not sure why; it spams the logs for a few minutes). We settle into a single-node cluster and then, for no apparent reason, restart:
INFO REPORT <0.54.0> 2012-06-18 09:19:27
===============================================================================
nonode@nohost:<0.54.0>:log_os_info:25: OS type: {unix,linux} Version: {2,6,32}
Runtime info: [{otp_release,"R14B03"},
After the restart, we try to listen on the correct address again:
INFO REPORT <0.57.0> 2012-06-18 09:19:27
===============================================================================
nonode@nohost:<0.57.0>:dist_manager:105: Attempting to bring up net_kernel with name 'ns_1@192.168.1.8'
But it's too late: we've already been kicked out of the cluster and the config has been reset.
--------------------------------------------------------------------------------------------------------------------------------
Adding the ns_server component, since I think it could handle this case much better: it should probably retry listening on the configured IP address several times (more than once) before wiping the config.
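To illustrate the retry idea, here is a minimal sketch in Python rather than ns_server's Erlang. It is not how ns_server works; the function names (can_bind, wait_for_address), the attempt count and the delay are all hypothetical. It only shows the shape of "probe the configured address a few times before giving up":

```python
import errno
import socket
import time


def can_bind(address, port=0):
    """Return True if a TCP socket can bind to `address` right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))
            return True
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # Address is not (yet) assigned to any local interface,
                # e.g. the DHCP client has not finished.
                return False
            raise  # any other error is a different problem; do not mask it


def wait_for_address(address, attempts=10, delay=1.0,
                     probe=can_bind, sleep=time.sleep):
    """Probe `address` up to `attempts` times, sleeping `delay` seconds between tries.

    Returns the 1-based number of the attempt that succeeded, or 0 if none did.
    `probe` and `sleep` are injectable so the loop can be exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        if probe(address):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return 0
```

With the timings from the logs above (listen failure at 09:12:01, DHCP lease at 09:12:03), a handful of attempts one second apart would have been enough; only a return value of 0 would justify falling back to 127.0.0.1.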
Adding the linux_installer component, since it would probably be best practice to configure Couchbase as one of the very last services to start, so that the rest of the system (networking in particular) is ready when we come up.