Couchbase Server / MB-8136

incremental_rebalance_in_with_queries: Rebalance exited with reason bad_replicas(Bad replicators after rebalance)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      http://qa.hq.northscale.net/job/centos-64-2.0-basic-rebalance-tests-P0/422/consoleFull
      ./testrunner -i /tmp/rebalance-tests.ini wait_timeout=180,GROUP=P0,get-cbcollect-info=True,stop-on-failure=True -t rebalance.rebalancein.RebalanceInTests.incremental_rebalance_in_with_queries,blob_generator=False,items=1000000,max_verify=100000,GROUP=IN;P0;FROM_2_0

      [2013-04-22 02:53:54,720] - [rest_client:984] INFO - Latest logs from UI:
      [2013-04-22 02:53:54,805] - [rest_client:985] ERROR - {u'node': u'ns_1@10.5.2.13', u'code': 2, u'text': u'Rebalance exited with reason bad_replicas\n', u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1366624419114, u'type': u'info'}
      [2013-04-22 02:53:54,805] - [rest_client:985] ERROR - {u'node': u'ns_1@10.5.2.13', u'code': 2, u'text': u"Bad replicators after rebalance:\nMissing = [{'ns_1@10.3.121.64','ns_1@10.3.121.69',491}]\nExtras = []", u'shortText': u'message', u'module': u'ns_rebalancer', u'tstamp': 1366624419113, u'type': u'info'}
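      For context, "Missing" and "Extras" are the two sides of a set comparison between the replication streams the new vbucket map requires and the streams actually running after the rebalance. A minimal sketch of that kind of check, assuming {SrcNode, DstNode, VBucket} triples (illustrative only, not the actual ns_rebalancer code):

      %% replica_check.erl -- illustrative sketch, not ns_server source.
      %% Compares the required replication triples against the running ones.
      -module(replica_check).
      -export([bad_replicas/2]).

      %% Expected/Actual: lists of {SrcNode, DstNode, VBucket} triples.
      bad_replicas(Expected, Actual) ->
          ExpectedSet = ordsets:from_list(Expected),
          ActualSet   = ordsets:from_list(Actual),
          %% Required but not running -> Missing; running but not required -> Extras.
          {ordsets:subtract(ExpectedSet, ActualSet),
           ordsets:subtract(ActualSet, ExpectedSet)}.

      Feeding it the expected streams and an actual set that lacks {'ns_1@10.3.121.64','ns_1@10.3.121.69',491} reproduces exactly the Missing/Extras lists in the log above.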


        Activity

        Andrei Baranouski added a comment -

        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.3.121.63-4222013-32-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.3.121.64-4222013-30-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.3.121.66-4222013-36-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.3.121.69-4222013-34-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.5.2.13-4222013-253-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.5.2.14-4222013-258-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-8136/94450d30/10.5.2.15-4222013-256-diag.zip
        Maria McDuff (Inactive) added a comment -

        Aliaksey to take a look.
        Aliaksey Artamonau added a comment -

        Must be an environment issue. Around the time the missing replication was being established, I see this in the log file on .69:

        [rebalance:info,2013-04-22T2:49:35.835,ns_1@10.3.121.69:<0.27373.0>:ebucketmigrator_srv:init:568]Starting tap stream:
        [{vbuckets,[491]},
         {checkpoints,[{491,2}]},
         {name,<<"replication_ns_1@10.3.121.69">>},
         {takeover,false}]
        {{"10.3.121.64",11209},
         {"10.3.121.69",11209},
         [{on_not_ready_vbuckets,#Fun<tap_replication_manager.2.24797998>},
          {username,"default"},
          {password,[]},
          {vbuckets,[491]},
          {takeover,false},
          {suffix,"ns_1@10.3.121.69"}]}
        [error_logger:error,2013-04-22T2:49:46.293,ns_1@10.3.121.69:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]Detected time forward jump (or too large erlang scheduling latency). Skipping 12 samples (or 9600 milliseconds) ({{1366624175944,
          #Ref<0.0.1.211282>},
         {repeat,800,<0.25903.0>},
         {timer2,send,
          [<0.25903.0>,
           {cascade,minute,hour,4}]}})
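        The numbers are consistent: {repeat,800,<0.25903.0>} is a repeating 800 ms sample timer, and 12 skipped samples x 800 ms = 9600 ms. A rough sketch of the detection arithmetic, assuming the timer simply drops the whole periods it slept through (assumed behaviour, not the actual timer2/stats code):

        %% jump_check.erl -- illustrative sketch of "time forward jump" detection.
        -module(jump_check).
        -export([skipped_samples/3]).

        %% ScheduledAt/Now are timestamps in milliseconds; Period is the
        %% sampling interval (800 ms for the timer in the log above).
        skipped_samples(ScheduledAt, Now, Period) when Now >= ScheduledAt ->
            Late = Now - ScheduledAt,
            Skipped = Late div Period,      %% whole periods missed while starved
            {Skipped, Skipped * Period}.    %% e.g. {12, 9600} for ~9.6 s of lag

        With the values from the log, skipped_samples(T, T + 9600, 800) returns {12, 9600}.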

        So it seems that the VM didn't get any CPU time for roughly 10 seconds. Then, one second later, the replicator died because it could not resolve some host (apparently the source of the replication):

        [error_logger:error,2013-04-22T2:49:47.495,ns_1@10.3.121.69:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: ebucketmigrator_srv:init/1
        pid: <0.27373.0>
        registered_name: []
        exception error: no match of right hand side value {error,nxdomain}

        in function ebucketmigrator_srv:connect/4
        in call from ebucketmigrator_srv:init/1
        ancestors: ['ns_vbm_new_sup-default','single_bucket_sup-default',
        <0.25967.0>]
        messages: []
        links: [#Port<0.29031>,<0.25984.0>,#Port<0.29034>,#Port<0.29030>]
        dictionary: []
        trap_exit: false
        status: running
        heap_size: 2584
        stack_size: 24
        reductions: 48648
        neighbours:
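
        The "no match of right hand side value {error,nxdomain}" is the usual badmatch shape for a failed connect: the code asserts an {ok, Socket} result, and when DNS resolution of the peer fails, gen_tcp:connect returns {error,nxdomain} instead, killing the process. A minimal illustration (hypothetical module, not the ebucketmigrator_srv source):

        %% nxdomain_demo.erl -- illustrative only.
        -module(nxdomain_demo).
        -export([connect/2]).

        connect(Host, Port) ->
            %% If Host does not resolve, gen_tcp:connect/3 returns
            %% {error,nxdomain} and this match raises badmatch, crashing
            %% the calling process just like the crash report above.
            {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {active, false}]),
            Sock.

        For example, nxdomain_demo:connect("no-such-host.invalid", 11209) exits with {badmatch,{error,nxdomain}} raised in connect/2.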

        I should also note that even though the rebalance failed, in reality all the data was moved successfully. The missing replication would have been created by the janitor after the end of the rebalance anyway.

        Maria McDuff (Inactive) added a comment -

        Andrei, please take a look at this issue again, and re-route to Aliaksey if you are able to repro.
        Andrei Baranouski added a comment -

        Passes on latest runs: http://qa.hq.northscale.net/job/centos-64-2.0-basic-rebalance-tests-P0/

          People

          • Assignee: Andrei Baranouski
          • Reporter: Andrei Baranouski
          • Votes: 0
          • Watchers: 3

            Dates

            • Created:
            • Updated:
            • Resolved:

              Gerrit Reviews

              There are no open Gerrit changes