  1. Couchbase Server
  2. MB-21002

Rebalance exited bulk_set_vbucket_state_failed nodedown

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • bug-backlog
    • 4.5.1, 4.6.0
    • view-engine
    • 4.5.1-2842

    Description

      While running the kv/view stress test, a rebalance failed with the errors shown below.
      The test passed on build 4.5.1-2838 but failed in a recent run with 2342, so this may not be a regression (triage would need to confirm).

      Rebalance exited with reason {unexpected_exit,
                                    {'EXIT',<0.24477.0>,
                                     {bulk_set_vbucket_state_failed,
                                      [{'ns_1@172.23.108.105',
                                        {'EXIT',
                                         {{nodedown,'ns_1@172.23.108.105'},
                                          {gen_server,call,
                                           [{'janitor_agent-default',
                                             'ns_1@172.23.108.105'},
                                            {if_rebalance,<0.19074.0>,
                                             {update_vbucket_state,646,replica,
                                              undefined,undefined}},
                                            infinity]}}}}]}}}
      

      The debug log shows dcp_replicator exiting at the time of the rebalance failure:

      [ns_server:debug,2016-09-20T08:47:48.746-07:00,ns_1@172.23.108.105:dcp_replicator-default-ns_1@172.23.108.97<0.3125.0>:dcp_replicator:spawn_and_wait:228]Received exit with reason {shutdown,nuke} from <0.3040.0>. Killing child process <0.11265.0>
       
      ...
       
      =========================CRASH REPORT=========================
        crasher:
          initial call: dcp_replicator:init/1
          pid: <0.3125.0>
          registered_name: 'dcp_replicator-default-ns_1@172.23.108.97'
          exception exit: {{child_interrupted,{'EXIT',<0.3040.0>,{shutdown,nuke}}},
                           [{dcp_replicator,spawn_and_wait,1,
                                            [{file,"src/dcp_replicator.erl"},
                                             {line,231}]},
                            {dcp_replicator,handle_call,3,
                                            [{file,"src/dcp_replicator.erl"},
                                             {line,115}]},
                            {gen_server,handle_msg,5,
                                        [{file,"gen_server.erl"},{line,585}]},
                            {proc_lib,init_p_do_apply,3,
                                      [{file,"proc_lib.erl"},{line,239}]}]}
            in function  gen_server:terminate/6 (gen_server.erl, line 744)
          ancestors: ['dcp_sup-default','single_bucket_kv_sup-default',
                        ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
                        ns_server_nodes_sup,<0.2289.0>,ns_server_cluster_sup,
                        <0.88.0>]
          messages: [{'EXIT',<0.3127.0>,killed},{'EXIT',<0.3126.0>,killed}]
          links: [<0.3028.0>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 1598
          stack_size: 27
          reductions: 3399
        neighbours:
       
      ...
      [ns_server:debug,2016-09-20T08:47:48.748-07:00,ns_1@172.23.108.105:replication_manager-default<0.3030.0>:replication_manager:terminate:105]Replication manager died {{{{child_interrupted,
                                      {'EXIT',<0.3040.0>,{shutdown,nuke}}},
                                  [{dcp_replicator,spawn_and_wait,1,
                                       [{file,"src/dcp_replicator.erl"},{line,231}]},
      
      

      There are also very slow SET operations, some taking as long as 10 s:

       
      2016-09-20T08:46:58.934116-07:00 WARNING 67: Slow SET operation on connection: 3596 ms ([ 172.23.108.94:45782 - 172.23.108.105:11210 ])
      2016-09-20T08:46:59.088348-07:00 WARNING 65: Slow SET operation on connection: 10 s ([ 172.23.108.94:46225 - 172.23.108.105:11210 ])
      2016-09-20T08:46:59.189076-07:00 WARNING 118: Slow SET operation on connection: 4723 ms ([ 172.23.108.94:46664 - 172.23.108.105:11210 ])
      2016-09-20T08:46:59.217688-07:00 WARNING 124: Slow SET operation on connection: 3049 ms ([ 172.23.108.94:46347 - 172.23.108.105:11210 ])
      2016-09-20T08:47:00.084680-07:00 WARNING 122: Slow SET operation on connection: 2210 ms ([ 172.23.108.94:46341 - 172.23.108.105:11210 ])
      2016-09-20T08:47:00.483088-07:00 WARNING 65: Slow SET operation on connection: 1394 ms ([ 172.23.108.94:46225 - 172.23.108.105:11210 ])
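
For triage, the slow-operation warnings can be summarized with a short script. The following is a minimal sketch, assuming the memcached log lines keep the format quoted above (duration given as either "N ms" or "N s"). The names `SLOW_OP` and `parse_slow_ops` are hypothetical helpers, not part of any Couchbase tooling.

```python
import re

# Assumed line format, taken from the warnings quoted in this ticket:
#   <timestamp> WARNING <conn>: Slow <OP> operation on connection: <N> ms|s ([ ... ])
SLOW_OP = re.compile(
    r"^(?P<ts>\S+) WARNING (?P<conn>\d+): Slow (?P<op>\w+) operation on connection: "
    r"(?P<value>\d+) (?P<unit>ms|s) "
)


def parse_slow_ops(lines):
    """Return (timestamp, connection id, operation, duration in ms) tuples."""
    ops = []
    for line in lines:
        m = SLOW_OP.match(line.strip())
        if m is None:
            continue  # not a slow-operation warning
        factor = 1000 if m.group("unit") == "s" else 1
        ops.append(
            (m.group("ts"), int(m.group("conn")), m.group("op"),
             int(m.group("value")) * factor)
        )
    return ops


sample = [
    "2016-09-20T08:46:58.934116-07:00 WARNING 67: Slow SET operation on connection: "
    "3596 ms ([ 172.23.108.94:45782 - 172.23.108.105:11210 ])",
    "2016-09-20T08:46:59.088348-07:00 WARNING 65: Slow SET operation on connection: "
    "10 s ([ 172.23.108.94:46225 - 172.23.108.105:11210 ])",
]
ops = parse_slow_ops(sample)
slowest = max(ops, key=lambda op: op[3])
print(slowest)
```

Normalizing every duration to milliseconds makes the entries directly comparable, so the slowest connection in a given time window is easy to pick out.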
      

      One more note: the last time we ran this test on 2842, there was a kernel panic related to beam.smp on node .105:

       Sep 19 14:46:31 centos-64-x64 kernel: beam.smp: page allocation failure. order:0, mode:0x20
      Sep 19 14:46:31 centos-64-x64 kernel: Pid: 30830, comm: beam.smp Not tainted 2.6.32-358.el6.x86_64 #1
      Sep 19 14:46:31 centos-64-x64 kernel: Call Trace:
      Sep 19 14:46:31 centos-64-x64 kernel: <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
      

      http://qa.sc.couchbase.com/job/centos-systest-launcher/235/console

          People

            asingh Abhishek Singh (Inactive)
            tommie Tommie McAfee (Inactive)