Couchbase Server / MB-8405

node down during stress test (insufficient memory)


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None

    Description

      2.1 build-701
      plum-003 (10.3.3.60, orchestrator)
      plum-005 (10.3.3.69, node down)

      While running a 10-bucket/XDCR/view stress test, one node went down when queries started. Looking at the node that went down, it appears initial indexing was running fine until vbucket compaction, after which lots of timeouts began to occur.

      I do not know what caused the node to go down. The first thing I saw on the orchestrator (10.3.3.60) was that buckets became not ready on 10.3.3.69:

      [ns_server:error,2013-06-04T13:03:41.759,ns_1@10.3.3.60:ns_doctor<0.8530.0>:ns_doctor:update_status:234]The following buckets became not ready on node 'ns_1@10.3.3.69': ["saslbucket",
      "saslbucket1",
      "saslbucket2",
      "saslbucket3",
      "saslbucket4",
      "saslbucket5"], those of them are active ["saslbucket",
      "saslbucket1",
      "saslbucket2",
      "saslbucket3",
      "saslbucket4",

      At this time, on 10.3.3.69, there were errors about couch_stats_reader terminating:

      [error_logger:error,2013-06-04T13:03:24.513,ns_1@10.3.3.69:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server 'couch_stats_reader-saslbucket6' terminating

        • Last message in was refresh_stats
        • When Server state == {state,"saslbucket6",1370376130999,[]}
        • Reason for termination ==
        • {timeout,{gen_server,call,[dir_size, {dir_size,"/data/saslbucket6"}]}}

          [error_logger:error,2013-06-04T13:03:24.514,ns_1@10.3.3.69:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: couch_stats_reader:init/1
          pid: <0.1991.0>
          registered_name: 'couch_stats_reader-saslbucket6'
          exception exit: {timeout,
          {gen_server,call,
          [dir_size,{dir_size,"/data/saslbucket6"}]}}
          in function gen_server:terminate/6
          ancestors: ['single_bucket_sup-saslbucket6',<0.1840.0>]
          messages: [refresh_stats,refresh_stats,refresh_stats,refresh_stats,
          refresh_stats,refresh_stats,refresh_stats,refresh_stats,
          refresh_stats,refresh_stats,refresh_stats,refresh_stats,
          refresh_stats]
          links: [<0.1841.0>,<0.294.0>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 1597
          stack_size: 24
          reductions: 13866348
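
      For context on the {timeout,{gen_server,call,...}} exit above: when a gen_server:call does not receive a reply within its timeout, the calling process itself exits with that reason. Here the dir_size server was apparently too slow walking the bucket's data directory (plausible while compaction keeps the disk busy), so couch_stats_reader was taken down. A minimal sketch of that failure mode follows; it is not the actual couch_stats_reader code, and the module name and path handling are assumptions:

      %% Minimal sketch, NOT the actual couch_stats_reader code; it only
      %% reproduces the failure mode shown in the crash report above.
      -module(dir_size_timeout_sketch).
      -export([refresh_stats/1]).

      refresh_stats(Bucket) ->
          %% gen_server:call/2 uses the default 5-second timeout. If the
          %% registered 'dir_size' server is too slow scanning the bucket's
          %% data directory, the *calling* process exits with
          %% {timeout,{gen_server,call,[dir_size,{dir_size,Path}]}} --
          %% the same exit reason reported above.
          Path = "/data/" ++ Bucket,
          gen_server:call(dir_size, {dir_size, Path}).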

      Eventually 10.3.3.69 went down:

      Node 'ns_1@10.3.3.60' saw that node 'ns_1@10.3.3.69' went down. Details: [

      {nodedown_reason, connection_closed}

      ]

      I also see the following excerpts in the babysitter logs on the node that went down (10.3.3.69):

      [error_logger:error,2013-06-04T13:19:45.319,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.84.0> terminating

      [ns_server:info,2013-06-04T13:19:47.190,babysitter_of_ns_1@127.0.0.1:<0.71.0>:ns_port_server:log:168]ns_server<0.71.0>: Erlang has closed
      ns_server<0.71.0>: /opt/couchbase/lib/erlang/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.

      [ns_server:info,2013-06-04T13:19:47.435,babysitter_of_ns_1@127.0.0.1:<0.83.0>:supervisor_cushion:handle_info:58]Cushion managed supervisor for moxi failed:

      {abnormal,137}
      [error_logger:error,2013-06-04T13:19:48.366,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_port_server:init/1
      pid: <0.84.0>
      registered_name: []
      exception exit: {abnormal,137}
      in function gen_server:terminate/6
      ancestors: [<0.83.0>,ns_child_ports_sup,ns_babysitter_sup,<0.54.0>]
      messages: [{'EXIT',#Port<0.2968>,normal}]
      links: [<0.83.0>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 28657
      stack_size: 24
      reductions: 63830
      neighbours:
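
      The {abnormal,137} exits above are worth decoding: when a port program's exit status is above 128, the convention is that it was killed by a signal (status minus 128), and 137 = 128 + 9 is SIGKILL, which is what the kernel OOM killer delivers and matches the "insufficient memory" summary of this ticket. A hypothetical helper (not part of ns_server) showing that decoding:

      %% Hypothetical helper, not part of ns_server: decodes the exit status
      %% reported for a killed port program such as moxi or memcached.
      -module(exit_status_sketch).
      -export([decode/1]).

      %% decode(137) -> {killed_by_signal,9}   (SIGKILL, e.g. the OOM killer)
      %% decode(1)   -> {exited,1}
      decode(Status) when is_integer(Status), Status > 128 ->
          {killed_by_signal, Status - 128};
      decode(Status) when is_integer(Status), Status >= 0 ->
          {exited, Status}.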

      There were also some mccouch timeouts:

      [error_logger:error,2013-06-04T13:19:56.564,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.120.0> terminating

        • Last message in was {#Port<0.3000>,{exit_status,137}}

        • When Server state == {state,#Port<0.3000>,memcached,
          {["Tue Jun 4 13:17:36.728541 PDT 3: (saslbucket4) TAP (Producer) eq_tapq:replication_ns_1@10.3.121.90 - disconnected, keep alive for 300 seconds",
          "Tue Jun 4 13:15:52.721171 PDT 3: (saslbucket1) TAP (Producer) eq_tapq:replication_ns_1@10.3.3.66 - disconnected, keep alive for 300 seconds",
          "Tue Jun 4 13:09:07.309604 PDT 3: (saslbucket4) Connected to mccouch: \"127.0.0.1:11213\"",
          "Tue Jun 4 13:09:07.264418 PDT 3: (saslbucket1) Connected to mccouch: \"127.0.0.1:11213\"",
          "Tue Jun 4 13:09:06.955575 PDT 3: (saslbucket1) Trying to connect to mccouch: \"127.0.0.1:11213\"",
          "Tue Jun 4 13:09:05.750388 PDT 3: (saslbucket6) Connected to mccouch: \"127.0.0.1:11213\"",
          "Tue Jun 4 13:09:05.750039 PDT 3: (saslbucket6) Trying to connect to mccouch: \"127.0.0.1:11213\"",
          "Tue Jun 4 13:09:05.617920 PDT 3: (saslbucket4) Trying to connect to mccouch: \"127.0.0.1:11213\"",
          "Tue Jun 4 13:09:05.280099 PDT 3: (saslbucket1) Resetting connection to mccouch, lastReceivedCommand = notify_vbucket_update lastSentCommand = notify_vbucket_update currentCommand =unknown",
          "Tue Jun 4 13:09:04.244140 PDT 3: (saslbucket1) No response for mccouch in 180000 seconds. Resetting connection.",

      Attachments

        1. plum-003.zip (24.27 MB)
        2. plum-005.zip (12.62 MB)

        Activity

        People

            Assignee: alkondratenko Aleksey Kondratenko (Inactive)
            Reporter: tommie Tommie McAfee (Inactive)
            Votes: 0
            Watchers: 3

        Dates

            Created:
            Updated:
            Resolved:

        Gerrit Reviews

            There are no open Gerrit changes
