Couchbase Server / MB-6585

Rebalance fails while doing swap rebalance with bidirectional replication


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.0
    • 2.0-beta
    • ns_server, XDCR
    • Security Level: Public
    • None
    • 2.0.0-1706-rel
      CentOS
      1024 vbuckets
      Memory (on VMs used): 4 GB

    Description

      • Create 2 clusters with 2 nodes each.
      • Create one standard bucket on each cluster and load 2M items into each.
      • Set up bidirectional replication between the 2 clusters.
      • With ongoing load on both clusters:
      • Add a server and remove an existing server on cluster1, then rebalance.
      • Add a server and remove an existing server on cluster2, then rebalance.
      • After a point, rebalance fails on both clusters, for a different reason on each (as noted on the orchestrators):

      CLUSTER1: < 10.1.3.235, 10.1.3.236 [remove], 10.3.2.54 [add] >
      Rebalance exited with reason {timeout,
      {gen_server,call,
      [{'ns_memcached-bucket','ns_1@10.1.3.235'},
      {get_vbucket,114},
      60000]}}

      CLUSTER2: < 10.1.3.237, 10.1.3.238 [remove], 10.3.2.55 [add] >
      Rebalance exited with reason {exited,
      {'EXIT',<0.25928.1>,
      {missing_checkpoint_stat,'ns_1@10.1.3.237',566}}}
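
      The swap rebalance in the steps above (add one server, remove one, rebalance) can be sketched as a short plan of REST calls. This is only an illustration: the function and parameter names are made up for this sketch, the credentials are placeholders, and the `/controller/addNode` and `/controller/rebalance` endpoints with their `knownNodes` / `ejectedNodes` parameters are written from memory of the Couchbase REST API, so they should be checked against the documentation for the build under test. The sketch only builds the requests; it does not send them.

```python
def otp_node(host):
    """Erlang node name for a host, in the 'ns_1@<ip>' form seen in this report."""
    return "ns_1@" + host


def swap_rebalance_plan(current_hosts, add_host, remove_host, user, password):
    """Return the ordered REST calls for one swap rebalance.

    Each entry is (method, path, form_params). Nothing is sent over the network.
    The endpoint paths and parameter names are assumptions, see the note above.
    """
    if remove_host not in current_hosts:
        raise ValueError("node to remove is not part of the cluster")
    if add_host in current_hosts:
        raise ValueError("node to add is already part of the cluster")

    # knownNodes lists every node the cluster knows about, including the one
    # being ejected and the one just added.
    known = [otp_node(h) for h in list(current_hosts) + [add_host]]

    add_call = (
        "POST",
        "/controller/addNode",
        {"hostname": add_host, "user": user, "password": password},
    )
    rebalance_call = (
        "POST",
        "/controller/rebalance",
        {"knownNodes": ",".join(known), "ejectedNodes": otp_node(remove_host)},
    )
    return [add_call, rebalance_call]


# Cluster 1 from this report: remove 10.1.3.236, add 10.3.2.54.
plan = swap_rebalance_plan(
    ["10.1.3.235", "10.1.3.236"], "10.3.2.54", "10.1.3.236",
    "Administrator", "password",
)
for method, path, params in plan:
    print(method, path, params)
```

      The same plan with the cluster 2 addresses covers the second rebalance; to match the report, both would be issued while the load and the bidirectional replication are still running.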

      While replication is in progress (with ongoing load as well), the swap rebalance
      produces a number of crash reports in the diags, with the following reasons:

      • badmatch, corrupted data
      • db_not_found
      • checkpoint_commit_failure, failure on target commit
      • missing_checkpoint_stat (per the UI, this appears to be why the rebalance failed)

      Rebalance also fails when the swap rebalance is done on just one cluster (rather than both), with bidirectional replication between the 2 clusters
      and ongoing load on both clusters:

      ----------------------------------------------------------------------

      [couchdb:error,2012-09-10T14:55:22.506,ns_1@10.1.3.235:<0.7099.2>:couch_log:error:42]Uncaught error in HTTP request: {exit,
      {timeout,
      {gen_server,call,
      ['ns_memcached-bucket',
      {stats,<<>>},
      60000]}}}
      [ns_server:info,2012-09-10T14:55:22.506,ns_1@10.1.3.235:<0.8252.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.1.3.236': [<<"replication_building_564_'ns_1@10.1.3.235'">>,
      <<"replication_building_564_'ns_1@10.3.2.54'">>]
      [ns_server:info,2012-09-10T14:55:22.507,ns_1@10.1.3.235:<0.8224.2>:ns_single_vbucket_mover:mover_inner_old_style:199]Got exit message (parent is <0.21723.0>). Exiting...
      {'EXIT',<0.8252.2>,{missing_checkpoint_stat,'ns_1@10.1.3.235',564}}
      [error_logger:error,2012-09-10T14:55:22.508,ns_1@10.1.3.235:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: erlang:apply/2
      pid: <0.8252.2>
      registered_name: []
      exception exit: {missing_checkpoint_stat,'ns_1@10.1.3.235',564}
      in function ns_replicas_builder:'-wait_checkpoint_opened/5-lc$^0/1-0-'/2
      in call from ns_replicas_builder:wait_checkpoint_opened/5
      in call from ns_replicas_builder:'-build_replicas_main/6-fun-1-'/8
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_replicas_builder:build_replicas_main/6
      ancestors: [<0.8224.2>,<0.21723.0>,<0.21409.0>]
      messages: []
      links: [<0.8224.2>,<0.8253.2>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 121393
      stack_size: 24
      reductions: 18730
      neighbours:

      ----------------------------------------------------------------------
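
      For going through the attached diags, a small script along these lines might help tally the missing_checkpoint_stat exits per node and vbucket. It is a rough sketch: the header pattern is inferred only from the log lines quoted in this report, so other line shapes in the diags may not match, and the function names are made up for this example.

```python
import re
from collections import Counter

# Header shape inferred from the excerpt above:
# [logger:level,timestamp,node:process:module:function:line]message
LOG_HEADER = re.compile(
    r"^\[(?P<logger>[^:,\]]+):(?P<level>[^,\]]+),(?P<timestamp>[^,\]]+),"
    r"(?P<node>[^:\]]+):(?P<process>[^:\]]+):(?P<module>[^:\]]+):"
    r"(?P<function>[^:\]]+):(?P<line>\d+)\](?P<message>.*)$"
)

# Matches {missing_checkpoint_stat,'ns_1@10.1.3.235',564}, with optional spaces.
MISSING_STAT = re.compile(
    r"\{missing_checkpoint_stat,\s*'(?P<node>[^']+)',\s*(?P<vbucket>\d+)\}"
)


def parse_header(line):
    """Return the header fields of a log line as a dict, or None if it has no header."""
    match = LOG_HEADER.match(line.strip())
    return match.groupdict() if match else None


def missing_checkpoint_stats(lines):
    """Count missing_checkpoint_stat occurrences, keyed by (node, vbucket)."""
    counts = Counter()
    for line in lines:
        for match in MISSING_STAT.finditer(line):
            counts[(match.group("node"), int(match.group("vbucket")))] += 1
    return counts


sample = [
    "[ns_server:info,2012-09-10T14:55:22.507,ns_1@10.1.3.235:<0.8224.2>:"
    "ns_single_vbucket_mover:mover_inner_old_style:199]"
    "Got exit message (parent is <0.21723.0>). Exiting...",
    "{'EXIT',<0.8252.2>,{missing_checkpoint_stat,'ns_1@10.1.3.235',564}}",
    "{missing_checkpoint_stat,'ns_1@10.1.3.237', 566}",
]

print(parse_header(sample[0])["module"])
for (node, vbucket), count in sorted(missing_checkpoint_stats(sample).items()):
    print(node, vbucket, count)
```

      To run it against an attachment, the lines can be read with `gzip.open(path, "rt", errors="replace")` and passed to `missing_checkpoint_stats`. Note that the same exit is often logged more than once (exit message plus crash report), so the counts are occurrences in the log, not distinct failures.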

      Attachments

        1. 10.1.3.235-8091-diag.txt.gz (3.44 MB)
        2. 10.1.3.236-8091-diag.txt.gz (2.58 MB)
        3. 10.1.3.237-8091-diag.txt.gz (2.98 MB)
        4. 10.1.3.238-8091-diag.txt.gz (3.24 MB)
        5. 10.3.2.54-8091-diag.txt.gz (9.17 MB)
        6. 10.3.2.55-8091-diag.txt.gz (8.58 MB)
        7. Screen Shot 2012-09-10 at 11.09.25 AM.png (204 kB)
        8. Screen Shot 2012-09-11 at 5.57.45 PM.png (165 kB)
        9. Screen Shot 2012-09-11 at 6.01.42 PM.png (200 kB)

          People

            abhinav Abhi Dangeti