Details
Bug
Resolution: Cannot Reproduce
Critical
3.0.2
Security Level: Public
Triaged
Unknown
Description
Running a 50-node 3.0.2 cluster in Google Cloud. The cluster was idle overnight, and when I logged into the UI, it immediately stopped responding and displayed cached information. I logged into the UI on a second node, which showed the initial server as down and then itself started displaying cached information. I logged onto a third node, which showed the first two nodes as down and then started displaying cached information itself.
Watching the Couchbase processes on a node prior to login, I see this:
UID PID PPID C STIME TTY TIME CMD
999 3088 1 0 Feb02 ? 00:00:02 /opt/couchbase/lib/erlang/erts-5.10.4/bin/epmd -daemon
999 3126 1 0 Feb02 ? 00:00:29 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp -A 16 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -kernel inet_d
999 3217 3126 17 Feb02 ? 10:46:00 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
999 3486 3126 0 Feb02 ? 00:00:00 inet_gethost 4
999 3487 3486 0 Feb02 ? 00:00:00 inet_gethost 4
999 9756 3126 7 10:26 ? 00:00:30 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp -A 16 -sbt u -P 327680 -K true -swt low -MMmcs 30 -e102400 -- -root /opt/couchbase/lib/erlang -progname erl
999 9794 9756 0 10:26 ? 00:00:00 sh -s disksup
999 9796 9756 0 10:26 ? 00:00:00 /opt/couchbase/lib/erlang/lib/os_mon-2.2.14/priv/bin/memsup
999 9797 9756 0 10:26 ? 00:00:00 /opt/couchbase/lib/erlang/lib/os_mon-2.2.14/priv/bin/cpu_sup
999 9808 9756 0 10:26 ? 00:00:00 inet_gethost 4
999 9809 9808 0 10:26 ? 00:00:00 inet_gethost 4
999 9813 9756 0 10:26 ? 00:00:00 sh -s ns_disksup
999 9815 9756 0 10:26 ? 00:00:00 /opt/couchbase/lib/ns_server/erlang/lib/ns_server/priv/i386-linux-godu
999 9823 3126 0 10:26 ? 00:00:01 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp -P 327680 -K true -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -k
999 9851 9756 0 10:26 ? 00:00:01 portsigar for ns_1@cb-server-12.c.cb-googbench-101.internal
999 9852 3126 0 10:26 ? 00:00:00 /opt/couchbase/bin/moxi -Z port_listen=11211,default_bucket_name=default,downstream_max=1024,downstream_conn_max=4,connect_max_errors=5,connect_retry_interval=
Then, after logging on through the UI, the process list shrinks to this:
999 3088 1 0 Feb02 ? 00:00:02 /opt/couchbase/lib/erlang/erts-5.10.4/bin/epmd -daemon
999 3126 1 0 Feb02 ? 00:00:29 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp -A 16 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -kernel inet_d
999 3217 3126 17 Feb02 ? 10:46:00 /opt/couchbase/bin/memcached -C /opt/couchbase/var/lib/couchbase/config/memcached.json
999 3486 3126 0 Feb02 ? 00:00:00 inet_gethost 4
999 3487 3486 0 Feb02 ? 00:00:00 inet_gethost 4
999 9756 3126 12 10:26 ? 00:00:54 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp -A 16 -sbt u -P 327680 -K true -swt low -MMmcs 30 -e102400 -- -root /opt/couchbase/lib/erlang -progname erl
999 9823 3126 0 10:26 ? 00:00:01 /opt/couchbase/lib/erlang/erts-5.10.4/bin/beam.smp -P 327680 -K true -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -k
999 9852 3126 0 10:26 ? 00:00:00 /opt/couchbase/bin/moxi -Z port_listen=11211,default_bucket_name=default,downstream_max=1024,downstream_conn_max=4,connect_max_errors=5,connect_retry_interval=
It recovers after about 30 seconds. Other nodes in the cluster seem unaffected.
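For anyone comparing the two listings, a rough Python sketch like the following can report which processes vanished between two `ps -ef` style snapshots. The helper names (`parse_ps`, `vanished`) are hypothetical, and the sample rows are shortened copies of a few rows from the listings above (commands abbreviated), not the full output. Applied to the complete listings in this ticket, the PIDs that disappear are 9794, 9796, 9797, 9808, 9809, 9813, 9815 and 9851, all of which have PPID 9756.

```python
def parse_ps(listing):
    """Map PID -> (PPID, command) for each `ps -ef` style row."""
    procs = {}
    for line in listing.strip().splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header row and anything malformed
        procs[int(fields[1])] = (int(fields[2]), fields[7])
    return procs


def vanished(before, after):
    """Return rows from `before` whose PID does not appear in `after`."""
    gone = set(before) - set(after)
    return {pid: before[pid] for pid in sorted(gone)}


# Shortened sample rows; commands are abbreviated for illustration only.
BEFORE = """
UID PID PPID C STIME TTY TIME CMD
999 3126 1 0 Feb02 ? 00:00:29 beam.smp
999 9756 3126 7 10:26 ? 00:00:30 beam.smp
999 9794 9756 0 10:26 ? 00:00:00 sh -s disksup
999 9851 9756 0 10:26 ? 00:00:00 portsigar
999 9852 3126 0 10:26 ? 00:00:00 moxi
"""

AFTER = """
999 3126 1 0 Feb02 ? 00:00:29 beam.smp
999 9756 3126 12 10:26 ? 00:00:54 beam.smp
999 9852 3126 0 10:26 ? 00:00:00 moxi
"""

if __name__ == "__main__":
    for pid, (ppid, cmd) in vanished(parse_ps(BEFORE), parse_ps(AFTER)).items():
        print(f"gone: pid={pid} ppid={ppid} cmd={cmd}")
```

Grouping the result by PPID makes it easy to see whether the missing processes all belong to one parent.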
Logs for node with failure:
https://s3.amazonaws.com/customers.couchbase.com/davidH/node12.zip
Orchestrator:
https://s3.amazonaws.com/customers.couchbase.com/davidH/node10-orchestrator.zip
ns_server_error.log on the failed node has:
[ns_server:error,2015-02-05T10:18:21.307,ns_1@cb-server-12.c.cb-googbench-101.internal:ns_log<0.277.0>:ns_log:handle_cast:210]unable to notify listeners because of badarg
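When searching the attached logs for similar entries, it may help to split each line into fields. The sketch below is an assumption based only on the single line quoted above: the regular expression and the field names (`component`, `level`, `node`, and so on) are my own labels, not an official ns_server log specification, and other log lines may not match.

```python
import re
from datetime import datetime

# Pattern inferred from the one log line quoted in this ticket.
LOG_LINE = re.compile(
    r"^\[(?P<component>[^:]+):(?P<level>[^,]+),"
    r"(?P<timestamp>[^,]+),"
    r"(?P<node>[^:]+):(?P<process>[^<]+)<(?P<erlang_pid>[^>]+)>:"
    r"(?P<module>[^:]+):(?P<function>[^:]+):(?P<line>\d+)\]"
    r"(?P<message>.*)$"
)


def parse_log_line(text):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%dT%H:%M:%S.%f"
    )
    return fields


SAMPLE = (
    "[ns_server:error,2015-02-05T10:18:21.307,"
    "ns_1@cb-server-12.c.cb-googbench-101.internal:"
    "ns_log<0.277.0>:ns_log:handle_cast:210]"
    "unable to notify listeners because of badarg"
)

if __name__ == "__main__":
    entry = parse_log_line(SAMPLE)
    print(entry["timestamp"], entry["module"], entry["function"], entry["message"])
```

With the fields separated, entries can be filtered by level or module and ordered by timestamp to line them up against the times in the process listings.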