Couchbase Server / MB-7057

[system test] Rebalance failed with reason "Partition x not in active nor passive set"


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Environment: centos 6.2 64bit, build 2.0.0-1908

    Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD.
      • 24.8 GB RAM is allocated to Couchbase Server on each node.
      • The SSD is formatted as ext4 and mounted on /data.
      • Each server has its own SSD; no disk is shared with any other server.
      • Created a cluster of 6 nodes, each running Couchbase Server 2.0.0-1908 (a hedged setup sketch follows the node list below).
      • Link to the manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1908-rel.rpm.manifest.xml
      • The cluster has 2 buckets: default (12 GB, 2 replicas) and saslbucket (12 GB, 1 replica).
      • Each bucket has one design document with 2 views (d1 on default and d11 on saslbucket).

      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43
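
      For reference, cluster and bucket setup of the kind described above can be driven through the ns_server REST API on port 8091. The sketch below uses stock Erlang/OTP httpc; the admin credentials, RAM quotas, and bucket parameters are illustrative assumptions and are not taken from this ticket.

      -module(cluster_setup_sketch).
      -export([setup/0]).

      %% POST a form-encoded body to ns_server on the first node, with basic auth.
      post(Path, Body) ->
          Auth = {"Authorization",
                  "Basic " ++ base64:encode_to_string("Administrator:password")},
          httpc:request(post,
                        {"http://10.6.2.37:8091" ++ Path, [Auth],
                         "application/x-www-form-urlencoded", Body},
                        [], []).

      setup() ->
          inets:start(),
          %% Join the other five nodes to 10.6.2.37 (a rebalance is still needed
          %% before the added nodes take traffic).
          [post("/controller/addNode",
                "hostname=" ++ Ip ++ "&user=Administrator&password=password")
           || Ip <- ["10.6.2.38", "10.6.2.42", "10.6.2.43", "10.6.2.44", "10.6.2.45"]],
          %% default: 12 GB quota, 2 replicas; saslbucket: 12 GB quota, 1 replica.
          %% Bucket parameters here are illustrative, not copied from the test.
          post("/pools/default/buckets",
               "name=default&bucketType=membase&ramQuotaMB=12288"
               "&replicaNumber=2&authType=sasl&saslPassword="),
          post("/pools/default/buckets",
               "name=saslbucket&bucketType=membase&ramQuotaMB=12288"
               "&replicaNumber=1&authType=sasl&saslPassword=password").

      Running cluster_setup_sketch:setup() from an Erlang shell on a host that can reach 10.6.2.37 issues the corresponding REST calls.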

      • Loaded 16 million items into the default bucket and 20 million items into saslbucket. Each key has a size of 512 to 1024 bytes.
      • After loading was done, waited for the initial indexing. Disabled view compaction.
      • After the initial indexing finished, mutated all items to sizes of 1024 to 1512 bytes.
      • Queried all 4 views from the 2 design documents.
      • Performed a swap rebalance: removed nodes 10.6.2.39 and 10.6.2.40 and added nodes 10.6.2.44 and 10.6.2.45.
      • At the end of rebalancing saslbucket, the rebalance exited with a timeout on node 43.
      • Then saw many reset connections to mccouch; updated bug MB-7046.
      • Killed all load generators pointing to this cluster. Node 43 did not come back to a stable state.
      • Erlang on node 43 appeared to hang. Killed beam.smp and restarted Couchbase Server on node 43 (will update this bug soon).
      • Waited for node 43 to come completely up and for the disk write queue and drain rate to drop to 0.
      • Started queries and item mutations on both buckets, waited a few minutes, and started the swap rebalance (a hedged sketch of the view queries and the rebalance trigger follows this list).
      • Rebalance failed with the error below.
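
      The view queries and the swap rebalance from the steps above would look roughly as follows, again as a hedged httpc sketch rather than the actual test harness. The view names v1/v2, the stale=update_after and limit query parameters, and the credentials are assumptions; the knownNodes/ejectedNodes lists follow the nodes named in the description.

      -module(swap_rebalance_sketch).
      -export([query_views/0, swap_rebalance/0]).

      auth() ->
          {"Authorization",
           "Basic " ++ base64:encode_to_string("Administrator:password")}.

      %% Query both views of _design/d1 on the default bucket through the CAPI
      %% port 8092. saslbucket/_design/d11 is analogous but needs the bucket
      %% credentials.
      query_views() ->
          inets:start(),
          [httpc:request(get,
                         {"http://10.6.2.37:8092/default/_design/d1/_view/" ++
                              View ++ "?stale=update_after&limit=10",
                          []},
                         [], [])
           || View <- ["v1", "v2"]].

      %% Swap rebalance: 10.6.2.44 and 10.6.2.45 are assumed to have been added
      %% via /controller/addNode already; 10.6.2.39 and 10.6.2.40 are ejected.
      swap_rebalance() ->
          inets:start(),
          Known = "ns_1@10.6.2.37,ns_1@10.6.2.38,ns_1@10.6.2.39,ns_1@10.6.2.40,"
                  "ns_1@10.6.2.42,ns_1@10.6.2.43,ns_1@10.6.2.44,ns_1@10.6.2.45",
          Ejected = "ns_1@10.6.2.39,ns_1@10.6.2.40",
          httpc:request(post,
                        {"http://10.6.2.37:8091/controller/rebalance", [auth()],
                         "application/x-www-form-urlencoded",
                         "knownNodes=" ++ Known ++ "&ejectedNodes=" ++ Ejected},
                        [], []).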

      2012-10-30 17:52:37.698 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket saslbucket
      2012-10-30 17:52:38.722 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "saslbucket" rebalance appears to be swap rebalance
      2012-10-30 17:52:38.968 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket default
      2012-10-30 17:52:40.592 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "default" rebalance does not seem to be swap rebalance
      2012-10-30 17:52:48.027 ns_vbucket_mover:0:critical:message(ns_1@10.6.2.37) - <0.4774.771> exited with {noproc,
          {gen_server,call,
              [{'janitor_agent-default','ns_1@10.6.2.37'},
               {if_rebalance,<0.4746.771>,
                {update_vbucket_state,61,active,paused,undefined}},
               infinity]}}

      Looking at the diag log of node 37, we see the error "Partition 857 not in active nor passive set" (a minimal sketch of this failure shape follows the log excerpts below):

      [ns_server:error,2012-10-30T17:52:45.760,ns_1@10.6.2.37:<0.20186.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 589040 us
      [ns_server:error,2012-10-30T17:52:46.835,ns_1@10.6.2.37:<0.16711.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 963763 us
      [rebalance:info,2012-10-30T17:52:47.098,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 530 state change: {'ns_1@10.6.2.40',active,paused,undefined}
      [rebalance:info,2012-10-30T17:52:47.153,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 530 on ns_1@10.6.2.40
      [rebalance:debug,2012-10-30T17:52:47.162,ns_1@10.6.2.37:<0.6121.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 44 in vbucket 530
      [ns_server:debug,2012-10-30T17:52:47.203,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6121.771> (ok)
      [rebalance:info,2012-10-30T17:52:47.206,ns_1@10.6.2.37:<0.6125.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 530)
      [ns_server:debug,2012-10-30T17:52:47.225,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:608]References to wait: Ref<0.0.3553.75720> ("default", 530)
      [ns_server:info,2012-10-30T17:52:47.482,ns_1@10.6.2.37:ns_config_rep<0.758.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.6.2.40'

      [rebalance:info,2012-10-30T17:52:47.626,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 857 state change: {'ns_1@10.6.2.43',active,paused,undefined}
      [rebalance:info,2012-10-30T17:52:47.709,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 857 on ns_1@10.6.2.43
      [rebalance:debug,2012-10-30T17:52:47.739,ns_1@10.6.2.37:<0.6284.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 42 in vbucket 857
      [ns_server:debug,2012-10-30T17:52:47.753,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6284.771> (ok)
      [rebalance:info,2012-10-30T17:52:47.758,ns_1@10.6.2.37:<0.6286.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 857)
      [ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.16695.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_config_events,<0.16694.0>} exited with reason {{badmatch,
      {error,
      {error, <<"Partition 857 not in active nor passive set">>}}},
      [{capi_set_view_manager, handle_call, 3},
      {gen_server, handle_msg, 5},
      {gen_server, init_it, 6},
      {proc_lib, init_p_do_apply, 3}]}
      [ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:621]Parent died: {'DOWN',#Ref<0.0.3553.75791>,process,<0.16694.0>,
      {{badmatch,
      {error,
      {error, <<"Partition 857 not in active nor passive set">>}}},
      [{capi_set_view_manager,handle_call,3},
      {gen_server,handle_msg,5},
      {gen_server,init_it,6},
      {proc_lib,init_p_do_apply,3}]}}
      [error_logger:error,2012-10-30T17:52:47.775,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: erlang:apply/2
      pid: <0.6297.771>
      registered_name: []
      exception throw: {error,<<"Partition 857 not in active nor passive set">>}
      in function couch_set_view:monitor_partition_update/3
      in call from capi_set_view_manager:'do_wait_index_updated/4-fun-0'/4
      in call from lists:foldl/3
      in call from capi_set_view_manager:do_wait_index_updated/4
      ancestors: ['capi_set_view_manager-default','single_bucket_sup-default',
      <0.16686.0>]
      messages: []
      links: [<0.16694.0>]
      dictionary: []
      trap_exit: false
      status: running
      heap_size: 6765
      stack_size: 24
      reductions: 229
      neighbours:

      [ns_server:error,2012-10-30T17:52:47.778,ns_1@10.6.2.37:<0.4781.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6125.771>,
          {{{{badmatch,
              {error,
               {error, <<"Partition 857 not in active nor passive set">>}}},
             [{capi_set_view_manager,handle_call,3},
              {gen_server,handle_msg,5},
              {gen_server,init_it,6},
              {proc_lib,init_p_do_apply,3}]},
            {gen_server,call,
             ['capi_set_view_manager-default',
              {wait_index_updated,857},
              infinity]}},
           {gen_server,call,
            [{'janitor_agent-default','ns_1@10.6.2.37'},
             {if_rebalance,<0.4746.771>,
              {wait_index_updated,530}},
             infinity]}}}
      [ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16705.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {mc_couch_events,<0.16694.0>} exited with reason {{badmatch,
          {error,
           {error, <<"Partition 857 not in active nor passive set">>}}},
          [{capi_set_view_manager,handle_call,3},
           {gen_server,handle_msg,5},
           {gen_server,init_it,6},
           {proc_lib,init_p_do_apply,3}]}
      [ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16704.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.16694.0>} exited with reason {{badmatch,
          {error,
           {error, <<"Partition 857 not in active nor passive set">>}}},
          [{capi_set_view_manager,handle_call,3},
           {gen_server,handle_msg,5},
           {gen_server,init_it,6},
           {proc_lib,init_p_do_apply,3}]}
      [ns_server:error,2012-10-30T17:52:47.808,ns_1@10.6.2.37:<0.4779.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6286.771>,
          {{{{badmatch,
              {error,
               {error, <<"Partition 857 not in active nor passive set">>}}},
             [{capi_set_view_manager,handle_call,3},
              {gen_server,handle_msg,5},
              {gen_server,init_it,6},
              {proc_lib,init_p_do_apply,3}]},
            {gen_server,call,
             ['capi_set_view_manager-default',
              {wait_index_updated,857},
              infinity]}},
           {gen_server,call,
            [{'janitor_agent-default','ns_1@10.6.2.37'},
             {if_rebalance,<0.4746.771>,
              {wait_index_updated,857}},
             infinity]}}}
      [rebalance:info,2012-10-30T17:52:47.822,ns_1@10.6.2.37:<0.5479.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!

      [rebalance:info,2012-10-30T17:52:47.827,ns_1@10.6.2.37:<0.5314.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!

      [rebalance:info,2012-10-30T17:52:47.835,ns_1@10.6.2.37:<0.4774.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 61 state change: {'ns_1@10.6.2.37',active,paused,undefined}
      [error_logger:error,2012-10-30T17:52:47.811,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.16694.0> terminating
      ** Last message in was {wait_index_updated,857}
      ** When Server state == {state,"default",'capi_ddoc_replication_srv-default',
          ['ns_1@10.6.2.38','ns_1@10.6.2.42',
           'ns_1@10.6.2.43','ns_1@10.6.2.44',
           'ns_1@10.6.2.45','ns_1@10.6.2.39',
           'ns_1@10.6.2.40'],
          [{doc,<<"_design/d1">>,
            {1,<<56,131,187,118>>},
            <<"{\"views\":{\"v1\":{\"map\":...},\"v2\":{\"map\":...}}}">>,
            0,false,[]}],
          1024,true,undefined,
          [replica,missing,missing,missing,missing,
          missing,missing,missing,missing,missing,
          missing,missing,missing,missing,missing,
          missing,missing,missing,missing,missing,
          missing,missing,missing,missing,missing,
          missing,missing,missing,missing,missing,
          missing,missing,missing,missing,missing,
          missing,replica,replica,replica,replica,
          replica,replica,replica,replica,replica,
          replica,replica,replica,replica,replica,
          replica,replica,replica,replica,replica,
          replica,replica,replica,replica,replica,
          replica,active,active,active,active,active,
          active,active,active,active,active,active,
          active,active,active,active,active,active,
          active,active,active,active,active,active,
          active,active,active,active,active,active,
          active,active,active,active,active,active,

      • This bug is similar to bug MB-6955
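
      The crash reports above share one shape: couch_set_view:monitor_partition_update/3 throws {error,<<"Partition 857 not in active nor passive set">>}, the wait_index_updated path turns that throw into an {error, ...} return, and a caller that only pattern-matches a success value dies with a badmatch, which then takes the rebalance down. Below is a minimal, self-contained illustration of that failure shape, deliberately simplified to a single process; it is not the actual ns_server/couch_set_view code.

      -module(badmatch_sketch).
      -export([run/0]).

      %% Stand-in for couch_set_view:monitor_partition_update/3 when the
      %% requested partition is neither active nor passive (simplified).
      monitor_partition_update(_SetName, PartId, _Pid) ->
          throw({error,
                 iolist_to_binary(io_lib:format(
                     "Partition ~p not in active nor passive set", [PartId]))}).

      %% Stand-in for the wait_index_updated path: the throw is caught and
      %% handed back to the caller as {error, Reason}.
      wait_index_updated(PartId) ->
          try
              monitor_partition_update(<<"default">>, PartId, self()),
              ok
          catch
              throw:Error -> {error, Error}
          end.

      run() ->
          %% The left-hand side only matches ok, so the {error, {error, Bin}}
          %% return raises the nested {badmatch, {error, {error, Bin}}} term
          %% seen in the gen_server crash above.
          ok = wait_index_updated(857).

      Calling badmatch_sketch:run() in an Erlang shell reproduces the same badmatch error shape for partition 857.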


      People

      • Assignee: Chiyoung Seo (Inactive)
      • Reporter: Thuan Nguyen
