Details

    • Type: Technical task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      centos 6.2 64 bit build 2.0.0-1946

      Description

      Environment: 8 nodes with 390GB SSD drive, 32GB RAM, 1 default buckets.

      Create 6 nodes cluster
      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43

      Load 40 million items to default bucket that push resident ratio down to around 62% (as of Pavel workload)
      Create a design doc with 8 view. Let the initial index completed
      Maintain load about 6K ops with 600 queries per second

      Add 2 nodes and rebalance.
      Rebalance failed (as expected) due to network on node 43 disconnected

      Restart network on node 43 and rebalance.
      Rebalance failed with error
      "Rebalance exited with reason

      {badmatch,wrong_rebalancer_pid}

      "
      Reopen bug MB-7057

      Rebalance again. Rebalance failed with the same error of bug MB-7057
      Rebalance again. Rebalance failed with error Resetting rebalance status since it's not really running

      ns_server:info,2012-11-10T11:23:23.905,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:handle_call:304]Successfully changed vbucket filter on tap stream `replication_ns_1@10.6.2.37`.
      [ns_server:info,2012-11-10T11:23:23.905,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:process_upstream:922]Got vbucket filter change completion message. Silencing upstream sender
      [ns_server:info,2012-11-10T11:23:23.912,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:handle_info:348]Got reply from upstream silencing request. Completing state transition to a new ebucketmigrator.
      [ns_server:debug,2012-11-10T11:23:23.912,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:complete_native_vb_filter_change:166]Proceeding with reading unread binaries
      [ns_server:debug,2012-11-10T11:23:23.913,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:confirm_downstream:193]Going to confirm reception downstream messages
      [rebalance:info,2012-11-10T11:23:23.923,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:do_confirm_sent_messages:682]Got close ack!

      [ns_server:debug,2012-11-10T11:23:23.923,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:confirm_downstream:197]Confirmed upstream messages are feeded to kernel
      [ns_server:debug,2012-11-10T11:23:23.923,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:reply_and_die:206]Passed old state to caller
      [ns_server:debug,2012-11-10T11:23:23.924,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:reply_and_die:209]Sent out state. Preparing to die
      [ns_server:debug,2012-11-10T11:23:23.924,ns_1@10.6.2.37:<0.15562.1152>:ns_vbm_new_sup:perform_vbucket_filter_change:125]Got old state from previous ebucketmigrator: <0.18812.1091>
      [ns_server:debug,2012-11-10T11:23:23.924,ns_1@10.6.2.37:<0.15562.1152>:ns_vbm_new_sup:perform_vbucket_filter_change_loop:166]Sent old state to new instance
      [ns_server:info,2012-11-10T11:23:23.925,ns_1@10.6.2.37:<0.15690.1152>:ns_vbm_new_sup:mk_old_state_retriever:83]Got vbucket filter change old state. Proceeding vbucket filter change operation
      [ns_server:debug,2012-11-10T11:23:23.925,ns_1@10.6.2.37:<0.15690.1152>:ebucketmigrator_srv:init:450]Got old ebucketmigrator state from <0.18812.1091>:
      {state,#Port<0.14094>,#Port<0.14057>,#Port<0.14099>,#Port<0.14060>,
      <0.18815.1091>,<<"cut off">>,<<"cut off">>,[],1657808,false,false,0,

      {1352,575403,905720}

      ,
      completed,

      {<0.15562.1152>,#Ref<0.0.2303.247796>}

      ,
      <<"replication_ns_1@10.6.2.37">>,<0.18812.1091>,

      {had_backfill,false,undefined,[]}

      ,
      false}.
      [ns_server:debug,2012-11-10T11:23:23.925,ns_1@10.6.2.37:<0.721.0>:ns_process_registry:handle_info:98]Got exit msg: {'EXIT',<0.15562.1152>,{#Ref<0.0.2303.247780>,<0.15690.1152>}}
      [error_logger:info,2012-11-10T11:23:23.926,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================PROGRESS REPORT=========================
      supervisor:

      {local,'ns_vbm_new_sup-default'}
      started: [{pid,<0.15690.1152>},
      {name,
      {new_child_id,
      [513,514,515,516,517,518,519,520,521,522,523,
      524,525,526,527,528,529,530,534,535,536,537,
      538,539,540,541,542,543,544,545,546],
      'ns_1@10.6.2.40'}},
      {mfargs,
      {ebucketmigrator_srv,start_link,
      [{"10.6.2.40",11209},
      {"10.6.2.37",11209},
      [{old_state_retriever, #Fun<ns_vbm_new_sup.0.113560106>},
      {on_not_ready_vbuckets, #Fun<tap_replication_manager.2.88495417>},
      {username,"default"},
      {password,[]},
      {vbuckets, [513,514,515,516,517,518,519,520,521,522, 523,524,525,526,527,528,529,530,534,535, 536,537,538,539,540,541,542,543,544,545, 546]},
      {takeover,false},
      {suffix,"ns_1@10.6.2.37"}]]}},
      {restart_type,temporary},
      {shutdown,60000},
      {child_type,worker}]

      [user:info,2012-11-10T11:23:23.926,ns_1@10.6.2.37:ns_config<0.711.0>:ns_janitor:maybe_stop_rebalance_status:147]Resetting rebalance status since it's not really running
      [ns_server:debug,2012-11-10T11:23:23.927,ns_1@10.6.2.37:capi_set_view_manager-default<0.202.981>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
      [ns_server:debug,2012-11-10T11:23:23.949,ns_1@10.6.2.37:ns_config_rep<0.735.0>:ns_config_rep:do_push_keys:317]Replicating some config keys ([rebalance_status,rebalancer_pid]..)
      [ns_server:debug,2012-11-10T11:23:24.014,ns_1@10.6.2.37:capi_set_view_manager-default<0.202.981>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
      [ns_server:debug,2012-11-10T11:23:24.014,ns_1@10.6.2.37:ns_config_log<0.717.0>:ns_config_log:log_common:111]config change:
      rebalancer_pid ->
      undefined


      ns_server:info,2012-11-10T11:23:23.684,ns_1@10.6.2.37:<0.25048.1120>:ebucketmigrator_srv:handle_call:304]Successfully changed vbucket filter on tap stream `replication_ns_1@10.6.2.37`.
      [ns_server:info,2012-11-10T11:23:23.685,ns_1@10.6.2.37:<0.25048.1120>:ebucketmigrator_srv:process_upstream:922]Got vbucket filter change completion message. Silencing upstream sender
      [ns_server:info,2012-11-10T11:23:23.685,ns_1@10.6.2.37:<0.25048.1120>:ebucketmigrator_srv:handle_info:348]Got reply from upstream silencing request. Completing state transition to a new ebucketmigrator.
      [rebalance:info,2012-11-10T11:23:23.686,ns_1@10.6.2.37:<0.25048.1120>:ebucketmigrator_srv:do_confirm_sent_messages:682]Got close ack!

      [ns_server:info,2012-11-10T11:23:23.687,ns_1@10.6.2.37:<0.15546.1152>:ns_vbm_new_sup:mk_old_state_retriever:83]Got vbucket filter change old state. Proceeding vbucket filter change operation
      [ns_server:info,2012-11-10T11:23:23.688,ns_1@10.6.2.37:tap_replication_manager-default<0.20761.0>:tap_replication_manager:change_vbucket_filter:185]Going to change replication from 'ns_1@10.6.2.40' to have
      [513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,534,
      535,536,537,538,539,540,541,542,543,544,545,546] ([534], [])
      [error_logger:info,2012-11-10T11:23:23.688,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================PROGRESS REPORT=========================
      supervisor: {local,'ns_vbm_new_sup-default'}

      started: [

      {pid,<0.15546.1152>}

      ,
      {name,
      {new_child_id,
      [342,343,344,345,346,347,348,349,350,351,352,
      353,354,355,356,357,358,359,360,365,366,367,
      368,369,370,371,372,373,374,375],
      'ns_1@10.6.2.39'}},
      {mfargs,
      {ebucketmigrator_srv,start_link,
      [

      {"10.6.2.39",11209}

      ,

      {"10.6.2.37",11209},
      [{old_state_retriever, #Fun<ns_vbm_new_sup.0.113560106>},
      {on_not_ready_vbuckets, #Fun<tap_replication_manager.2.88495417>},
      {username,"default"},
      {password,[]},
      {vbuckets, [342,343,344,345,346,347,348,349,350,351, 352,353,354,355,356,357,358,359,360,365, 366,367,368,369,370,371,372,373,374, 375]},
      {takeover,false},
      {suffix,"ns_1@10.6.2.37"}]]}},
      {restart_type,temporary},
      {shutdown,60000},
      {child_type,worker}]

      [ns_server:info,2012-11-10T11:23:23.694,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:handle_call:271]Starting new-style vbucket filter change on stream `replication_ns_1@10.6.2.37`
      [ns_server:info,2012-11-10T11:23:23.893,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:handle_call:294]Changing vbucket filter on tap stream `replication_ns_1@10.6.2.37`:
      [{513,47},
      {514,47},
      {515,47},
      {516,47},
      {517,47},
      {518,47},
      {519,47},
      {520,47},
      {521,47},
      {522,47},
      {523,47},
      {524,47},
      {525,47},
      {526,47},
      {527,47},
      {528,47},
      {529,47},
      {530,47},
      {534,53},
      {535,47},
      {536,47},
      {537,47},
      {538,47},
      {539,47},
      {540,47},
      {541,47},
      {542,47},
      {543,47},
      {544,47},
      {545,47},
      {546,47}]
      [ns_server:info,2012-11-10T11:23:23.905,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:handle_call:304]Successfully changed vbucket filter on tap stream `replication_ns_1@10.6.2.37`.
      [ns_server:info,2012-11-10T11:23:23.905,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:process_upstream:922]Got vbucket filter change completion message. Silencing upstream sender
      [ns_server:info,2012-11-10T11:23:23.912,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:handle_info:348]Got reply from upstream silencing request. Completing state transition to a new ebucketmigrator.
      [rebalance:info,2012-11-10T11:23:23.923,ns_1@10.6.2.37:<0.18812.1091>:ebucketmigrator_srv:do_confirm_sent_messages:682]Got close ack!

      [ns_server:info,2012-11-10T11:23:23.925,ns_1@10.6.2.37:<0.15690.1152>:ns_vbm_new_sup:mk_old_state_retriever:83]Got vbucket filter change old state. Proceeding vbucket filter change operation
      [user:info,2012-11-10T11:23:23.926,ns_1@10.6.2.37:ns_config<0.711.0>:ns_janitor:maybe_stop_rebalance_status:147]Resetting rebalance status since it's not really running
      [error_logger:info,2012-11-10T11:23:23.926,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================PROGRESS REPORT=========================
      supervisor: {local,'ns_vbm_new_sup-default'}
      started: [{pid,<0.15690.1152>},
      {name,
      {new_child_id,
      [513,514,515,516,517,518,519,520,521,522,523,
      524,525,526,527,528,529,530,534,535,536,537,
      538,539,540,541,542,543,544,545,546],
      'ns_1@10.6.2.40'}},
      {mfargs,
      {ebucketmigrator_srv,start_link,
      [{"10.6.2.40",11209},
      {"10.6.2.37",11209}

      ,
      [

      {old_state_retriever, #Fun<ns_vbm_new_sup.0.113560106>}

      ,

      {on_not_ready_vbuckets, #Fun<tap_replication_manager.2.88495417>}

      ,

      {username,"default"}

      ,

      {password,[]}

      ,

      {vbuckets, [513,514,515,516,517,518,519,520,521,522, 523,524,525,526,527,528,529,530,534,535, 536,537,538,539,540,541,542,543,544,545, 546]}

      ,

      {takeover,false}

      ,

      {suffix,"ns_1@10.6.2.37"}

      ]]}},

      {restart_type,temporary}

      ,

      {shutdown,60000}

      ,

      {child_type,worker}

      ]

      Link to manifest of build 2.0.0-1946 http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1947-rel.rpm.manifest.xml

      Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/8nodes-ci-1946-reb-failed-Partition-not-in-active-nor-passive-set-20121112-115625.tgz

      This bug is similar with bug MB-6573 which was closed

      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        Hide
        steve Steve Yen added a comment -

        alk / bug-scrub: most likely a duplicate

        Show
        steve Steve Yen added a comment - alk / bug-scrub: most likely a duplicate
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        master node died due to timeout calling ns_node_disco:nodes_wanted

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - master node died due to timeout calling ns_node_disco:nodes_wanted
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Created MB-7182 specifically to track timeouts caused by lack of async io threads.

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Created MB-7182 specifically to track timeouts caused by lack of async io threads.

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            thuan Thuan Nguyen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes