Description
2.1 build-701
plum-003 (10.3.3.60, orchestrator)
plum-005 (10.3.3.69, node down)
While running a 10-bucket/xdcr/view stress test, one node went down when queries started. Looking at the node that went down, initial indexing appears to have been running fine until vbucket compaction, after which lots of timeouts began to occur.
I do not know what caused the node to go down. The first thing I saw on the orchestrator (10.3.3.60) was that buckets became not ready on 10.3.3.69:
[ns_server:error,2013-06-04T13:03:41.759,ns_1@10.3.3.60:ns_doctor<0.8530.0>:ns_doctor:update_status:234]The following buckets became not ready on node 'ns_1@10.3.3.69': ["saslbucket",
"saslbucket1",
"saslbucket2",
"saslbucket3",
"saslbucket4",
"saslbucket5"], those of them are active ["saslbucket",
"saslbucket1",
"saslbucket2",
"saslbucket3",
"saslbucket4",
At this time, on 10.3.3.69 there were errors about couch_stats_reader terminating:
[error_logger:error,2013-06-04T13:03:24.513,ns_1@10.3.3.69:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server 'couch_stats_reader-saslbucket6' terminating
-
- Last message in was refresh_stats
- When Server state == {state,"saslbucket6",1370376130999,[]}
- Reason for termination ==
- {timeout,{gen_server,call,[dir_size,
{dir_size,"/data/saslbucket6"}]}}
[error_logger:error,2013-06-04T13:03:24.514,ns_1@10.3.3.69:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: couch_stats_reader:init/1
pid: <0.1991.0>
registered_name: 'couch_stats_reader-saslbucket6'
exception exit: {timeout,
{gen_server,call,
[dir_size,{dir_size,"/data/saslbucket6"}]}}
in function gen_server:terminate/6
ancestors: ['single_bucket_sup-saslbucket6',<0.1840.0>]
messages: [refresh_stats,refresh_stats,refresh_stats,refresh_stats,
refresh_stats,refresh_stats,refresh_stats,refresh_stats,
refresh_stats,refresh_stats,refresh_stats,refresh_stats,
refresh_stats]
links: [<0.1841.0>,<0.294.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 13866348
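For context on that exit reason: {timeout,{gen_server,call,[dir_size,{dir_size,"/data/saslbucket6"}]}} means the stats reader's synchronous call to the dir_size server got no reply within gen_server:call's default 5-second timeout, which would fit the disk being busy with vbucket compaction at that point. A minimal sketch of the failure shape (hypothetical module, not the actual ns_server dir_size implementation):

%% slow_dir_size.erl -- hypothetical illustration, not ns_server code.
%% A gen_server whose handle_call takes longer than the caller's default
%% 5-second timeout; the *caller* then exits with
%% {timeout,{gen_server,call,[...]}}, exactly the shape seen above.
-module(slow_dir_size).
-behaviour(gen_server).
-export([start_link/0, get_size/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% gen_server:call/2 waits at most 5000 ms for a reply by default.
get_size(Dir) ->
    gen_server:call(?MODULE, {dir_size, Dir}).

init([]) ->
    {ok, []}.

handle_call({dir_size, _Dir}, _From, State) ->
    %% Simulate a directory-size scan stalled behind compaction I/O.
    timer:sleep(10000),
    {reply, 0, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

Calling slow_dir_size:get_size("/data/saslbucket6") from another process (here, the stats reader) exits that process with the same {timeout,...} reason, and the supervisor then logs a crash report like the one above.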
Eventually 10.3.3.69 went down:
Node 'ns_1@10.3.3.60' saw that node 'ns_1@10.3.3.69' went down. Details: [
{nodedown_reason, connection_closed}]
I also see the following excerpts in the babysitter logs on the node that went down (10.3.3.69):
[error_logger:error,2013-06-04T13:19:45.319,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.84.0> terminating
-
- Last message in was {#Port<0.2968>,{exit_status,137}}
- When Server state == {state,#Port<0.2968>,moxi,
{["WARNING: curl error: Received problem 2 in the chunky parser from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"WARNING: curl error: Received problem 2 in the chunky parser from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"WARNING: curl error: Received problem 2 in the chunky parser from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"WARNING: curl error: transfer closed with outstanding read data remaining from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
"ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming",
..
[ns_server:info,2013-06-04T13:19:47.190,babysitter_of_ns_1@127.0.0.1:<0.71.0>:ns_port_server:log:168]ns_server<0.71.0>: Erlang has closed
ns_server<0.71.0>: /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.
[ns_server:info,2013-06-04T13:19:47.435,babysitter_of_ns_1@127.0.0.1:<0.83.0>:supervisor_cushion:handle_info:58]Cushion managed supervisor for moxi failed:
{abnormal,137}
[error_logger:error,2013-06-04T13:19:48.366,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.84.0>
registered_name: []
exception exit: {abnormal,137}
in function gen_server:terminate/6
ancestors: [<0.83.0>,ns_child_ports_sup,ns_babysitter_sup,<0.54.0>]
messages: []
links: [<0.83.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 28657
stack_size: 24
reductions: 63830
neighbours:
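One detail worth noting across the moxi report above and the memcached report below: both port programs were reported with exit status 137. By Unix convention that is 128 + 9, i.e. the process was terminated with SIGKILL (on Linux, commonly the kernel OOM killer) rather than exiting on its own; I have not confirmed the killer from these logs. A small hypothetical helper (not part of ns_server) to decode such a status:

%% exit_status_decode.erl -- hypothetical helper, not ns_server code.
%% A status above 128 follows the Unix convention of 128 + signal number;
%% 137 -> signal 9 (SIGKILL), frequently the kernel OOM killer on Linux.
-module(exit_status_decode).
-export([decode/1]).

decode(Status) when is_integer(Status), Status > 128 ->
    {killed_by_signal, Status - 128};
decode(Status) when is_integer(Status), Status >= 0 ->
    {exit_code, Status}.

exit_status_decode:decode(137) returns {killed_by_signal,9}, which would suggest something on the box forcibly killed moxi and memcached rather than an internal crash.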
There were also some mccouch timeouts in the memcached log before memcached went down:
[error_logger:error,2013-06-04T13:19:56.564,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.120.0> terminating
-
- Last message in was {#Port<0.3000>,{exit_status,137}}
- When Server state == {state,#Port<0.3000>,memcached,
{["Tue Jun 4 13:17:36.728541 PDT 3: (saslbucket4) TAP (Producer) eq_tapq:replication_ns_1@10.3.121.90 - disconnected, keep alive for 300 seconds",
"Tue Jun 4 13:15:52.721171 PDT 3: (saslbucket1) TAP (Producer) eq_tapq:replication_ns_1@10.3.3.66 - disconnected, keep alive for 300 seconds",
"Tue Jun 4 13:09:07.309604 PDT 3: (saslbucket4) Connected to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:07.264418 PDT 3: (saslbucket1) Connected to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:06.955575 PDT 3: (saslbucket1) Trying to connect to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.750388 PDT 3: (saslbucket6) Connected to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.750039 PDT 3: (saslbucket6) Trying to connect to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.617920 PDT 3: (saslbucket4) Trying to connect to mccouch: \"127.0.0.1:11213\"",
"Tue Jun 4 13:09:05.280099 PDT 3: (saslbucket1) Resetting connection to mccouch, lastReceivedCommand = notify_vbucket_update lastSentCommand = notify_vbucket_update currentCommand =unknown",
"Tue Jun 4 13:09:04.244140 PDT 3: (saslbucket1) No response for mccouch in 180000 seconds. Resetting connection.",