Description
2.1 build-701
plum-003 (10.3.3.60, orchestrator)
plum-005 (10.3.3.69, node down)
While running a 10-bucket/xdcr/view stress test, one node went down when queries started. Looking at the node that went down, initial indexing appears to have been running fine until vbucket compaction, after which lots of timeouts began to occur.
I do not know what caused the node to go down. The first thing I saw on the orchestrator (10.3.3.60) was that buckets became not ready on 10.3.3.69:
[ns_server:error,2013-06-04T13:03:41.759,ns_1@10.3.3.60:ns_doctor<0.8530.0>:ns_doctor:update_status:234]The following buckets became not ready on node 'ns_1@10.3.3.69': ["saslbucket",
"saslbucket1",
"saslbucket2",
"saslbucket3",
"saslbucket4",
"saslbucket5"], those of them are active ["saslbucket",
"saslbucket1",
"saslbucket2",
"saslbucket3",
"saslbucket4",
At this time, on 10.3.3.69 there were errors about couch_stats_reader terminating:
[error_logger:error,2013-06-04T13:03:24.513,ns_1@10.3.3.69:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server 'couch_stats_reader-saslbucket6' terminating
-
- Last message in was refresh_stats
- When Server state == {state,"saslbucket6",1370376130999,[]}
- Reason for termination ==
- {timeout,{gen_server,call,[dir_size,
{dir_size,"/data/saslbucket6"}]}}
[error_logger:error,2013-06-04T13:03:24.514,ns_1@10.3.3.69:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: couch_stats_reader:init/1
pid: <0.1991.0>
registered_name: 'couch_stats_reader-saslbucket6'
exception exit: {timeout,
{gen_server,call,
[dir_size,{dir_size,"/data/saslbucket6"}]}}
in function gen_server:terminate/6
ancestors: ['single_bucket_sup-saslbucket6',<0.1840.0>]
messages: [refresh_stats,refresh_stats,refresh_stats,refresh_stats,
refresh_stats,refresh_stats,refresh_stats,refresh_stats,
refresh_stats,refresh_stats,refresh_stats,refresh_stats,
refresh_stats]
links: [<0.1841.0>,<0.294.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 13866348
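For context on that exit reason: {timeout,{gen_server,call,[dir_size,{dir_size,"/data/saslbucket6"}]}} means the stats reader's synchronous call to the dir_size server got no reply within gen_server:call's default 5-second timeout, which would fit the disk being busy with vbucket compaction at that point. A minimal sketch of the failure shape (hypothetical module, not the actual ns_server dir_size implementation):

%% slow_dir_size.erl -- hypothetical illustration, not ns_server code.
%% A gen_server whose handle_call takes longer than the caller's default
%% 5-second timeout; the *caller* then exits with
%% {timeout,{gen_server,call,[...]}}, exactly the shape seen above.
-module(slow_dir_size).
-behaviour(gen_server).
-export([start_link/0, get_size/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% gen_server:call/2 waits at most 5000 ms for a reply by default.
get_size(Dir) ->
    gen_server:call(?MODULE, {dir_size, Dir}).

init([]) ->
    {ok, []}.

handle_call({dir_size, _Dir}, _From, State) ->
    %% Simulate a directory-size scan stalled behind compaction I/O.
    timer:sleep(10000),
    {reply, 0, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

Calling slow_dir_size:get_size("/data/saslbucket6") from another process (here, the stats reader) exits that process with the same {timeout,...} reason, and the supervisor then logs a crash report like the one above.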
Eventually 10.3.3.69 went down:
Node 'ns_1@10.3.3.60' saw that node 'ns_1@10.3.3.69' went down. Details: [
{nodedown_reason, connection_closed}]
I also see the following excerpts in the babysitter logs on the node that went down (10.3.3.69):
[error_logger:error,2013-06-04T13:19:45.319,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.84.0> terminating
-
- Last message in was {#Port<0.2968>,{exit_status,137}}
- When Server state == {state,#Port<0.2968>,moxi,
{["WARNING: curl error: Received problem 2 in the chunky parser from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"WARNING: curl error: Received problem 2 in the chunky parser from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"WARNING: curl error: Received problem 2 in the chunky parser from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"WARNING: curl error: transfer closed with outstanding read data remaining from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
..
[ns_server:info,2013-06-04T13:19:47.190,babysitter_of_ns_1@127.0.0.1:<0.71.0>:ns_port_server:log:168]ns_server<0.71.0>: Erlang has closed
ns_server<0.71.0>: /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.
[ns_server:info,2013-06-04T13:19:47.435,babysitter_of_ns_1@127.0.0.1:<0.83.0>:supervisor_cushion:handle_info:58]Cushion managed supervisor for moxi failed:
{abnormal,137}
[error_logger:error,2013-06-04T13:19:48.366,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.84.0>
registered_name: []
exception exit: {abnormal,137}
in function gen_server:terminate/6
ancestors: [<0.83.0>,ns_child_ports_sup,ns_babysitter_sup,<0.54.0>]
messages: []
links: [<0.83.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 28657
stack_size: 24
reductions: 63830
neighbours:
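One detail worth noting across the moxi report above and the memcached report below: both port programs were reported with exit status 137. By Unix convention that is 128 + 9, i.e. the process was terminated with SIGKILL (on Linux, commonly the kernel OOM killer) rather than exiting on its own; I have not confirmed the killer from these logs. A small hypothetical helper (not part of ns_server) to decode such a status:

%% exit_status_decode.erl -- hypothetical helper, not ns_server code.
%% A status above 128 follows the Unix convention of 128 + signal number;
%% 137 -> signal 9 (SIGKILL), frequently the kernel OOM killer on Linux.
-module(exit_status_decode).
-export([decode/1]).

decode(Status) when is_integer(Status), Status > 128 ->
    {killed_by_signal, Status - 128};
decode(Status) when is_integer(Status), Status >= 0 ->
    {exit_code, Status}.

exit_status_decode:decode(137) returns {killed_by_signal,9}, which would suggest something on the box forcibly killed moxi and memcached rather than an internal crash.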
There were also some mccouch timeouts in the memcached log before memcached went down:
[error_logger:error,2013-06-04T13:19:56.564,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.120.0> terminating
-
- Last message in was {#Port<0.3000>,{exit_status,137}}
- When Server state == {state,#Port<0.3000>,memcached,
{["Tue Jun 4 13:17:36.728541 PDT 3: (saslbucket4) TAP (Producer) eq_tapq:replication_ns_1@10.3.121.90 - disconnected, keep alive for 300 seconds",
"Tue Jun 4 13:15:52.721171 PDT 3: (saslbucket1) TAP (Producer) eq_tapq:replication_ns_1@10.3.3.66 - disconnected, keep alive for 300 seconds",
"Tue Jun 4 13:09:07.309604 PDT 3: (saslbucket4) Connected to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:07.264418 PDT 3: (saslbucket1) Connected to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:06.955575 PDT 3: (saslbucket1) Trying to connect to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.750388 PDT 3: (saslbucket6) Connected to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.750039 PDT 3: (saslbucket6) Trying to connect to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.617920 PDT 3: (saslbucket4) Trying to connect to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.280099 PDT 3: (saslbucket1) Resetting connection to mccouch, lastReceivedCommand = notify_vbucket_update lastSentCommand = notify_vbucket_update currentCommand =unknown",
"Tue Jun 4 13:09:04.244140 PDT 3: (saslbucket1) No response for mccouch in 180000 seconds. Resetting connection.",