Couchbase Server MB-7996

[system test] Rebalance hangs when adding a node to the cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Environment:
      Windows Server 2008 R2 64-bit physical servers

      Description

      Install Couchbase Server 2.0.1-185 on 4 physical servers, each with 2 separate disks
      Create a cluster with 3 nodes:
      10.2.1.61
      10.2.1.62
      10.2.1.63
      Create 2 buckets: default (14GB) and sasl (10GB)
      No views or XDCR created
      Load 20+ million items into both buckets until the resident ratio of each bucket is around 90%
      Access the cluster for 3 hours with the spec on this page: http://hub.internal.couchbase.com/confluence/pages/viewpage.action?pageId=6785119
      Add node 10.2.1.64 to the cluster and rebalance (a REST sketch of this step follows the links below).
      Rebalance failed. Filed bug MB-7995.

      Start the rebalance again. The rebalance hangs.

      Link to the collect_info of all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201304/4phy-win-201_185-reb-not-moving_node-61-erl-frozen-20130401-122033.tgz

      Link to the manifest file of the build: http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_x86_64_2.0.1-185-rel.setup.exe.manifest.xml
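
      The add-node and rebalance step above can also be driven through the cluster REST endpoints instead of the UI. The following is a minimal sketch only, assuming Python with the requests library and placeholder admin credentials; none of these values are taken from the report itself:

      import requests

      ADMIN = ("Administrator", "password")   # assumed credentials, not from the report
      BASE = "http://10.2.1.61:8091"          # any node already in the cluster

      # Add the new node (10.2.1.64) to the existing 3-node cluster.
      r = requests.post(BASE + "/controller/addNode",
                        auth=ADMIN,
                        data={"hostname": "10.2.1.64",
                              "user": ADMIN[0],
                              "password": ADMIN[1]})
      r.raise_for_status()

      # Start the rebalance across all four nodes, using the otpNode names
      # that ns_server reports (ns_1@<ip>).
      known = ",".join("ns_1@10.2.1.%d" % i for i in (61, 62, 63, 64))
      r = requests.post(BASE + "/controller/rebalance",
                        auth=ADMIN,
                        data={"knownNodes": known, "ejectedNodes": ""})
      r.raise_for_status()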


        Activity

        Maria McDuff (Inactive) added a comment -

        Please take a look; Tony is still seeing this in build 812.

        Chiyoung Seo added a comment -

        Tony,

        If you still see this issue in the latest 2.0.2 build, please grab the diag logs and assign it to the ns-server team.

        I can't look at this issue because there are no log files attached. In addition, any rebalance issue should first be investigated by the ns-server team.

        If the ns-server team sees this issue as a duplicate of MB-8350, then we should close it as a duplicate.
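
        For reference, the diag logs requested here are typically gathered by running cbcollect_info on each node; the following is only a rough sketch, with the Windows install path and output file name assumed rather than taken from this report:

        import subprocess

        # Assumed default install path for Couchbase Server 2.x on Windows; adjust per node.
        CBCOLLECT = r"C:\Program Files\Couchbase\Server\bin\cbcollect_info"
        OUTPUT = r"C:\temp\collect_info_10.2.1.61.zip"   # hypothetical output name

        # Run this on every cluster node and attach the resulting archives to the ticket.
        subprocess.run([CBCOLLECT, OUTPUT], check=True)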

        Aliaksey Artamonau added a comment -

        I took a quick peek at the logs. The last vbucket for which a move was attempted was 341.

        [ns_server:debug,2013-03-29T23:57:15.285,ns_1@10.2.1.61:<0.30259.57>:ns_vbucket_mover:spawn_workers:331]Got actions: [{move,{682,
        ['ns_1@10.2.1.62','ns_1@10.2.1.63'],
        ['ns_1@10.2.1.64','ns_1@10.2.1.63']}},
        {move,{341,
        ['ns_1@10.2.1.61','ns_1@10.2.1.63'],
        ['ns_1@10.2.1.63','ns_1@10.2.1.64']}}]
        [rebalance:debug,2013-03-29T23:57:15.285,ns_1@10.2.1.61:<0.30259.57>:ns_single_vbucket_mover:spawn_mover:28]Spawned single vbucket mover: [<0.30259.57>,'ns_1@10.2.1.62',"sasl",682,
        ['ns_1@10.2.1.62','ns_1@10.2.1.63'],
        ['ns_1@10.2.1.64','ns_1@10.2.1.63']] (<0.30423.57>)
        [rebalance:debug,2013-03-29T23:57:15.285,ns_1@10.2.1.61:<0.30259.57>:ns_single_vbucket_mover:spawn_mover:28]Spawned single vbucket mover: [<0.30259.57>,'ns_1@10.2.1.61',"sasl",341,
        ['ns_1@10.2.1.61','ns_1@10.2.1.63'],
        ['ns_1@10.2.1.63','ns_1@10.2.1.64']] (<0.30428.57>)

        But the data is not being transferred:

        55712:Fri Mar 29 23:57:17.725105 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Schedule the backfill for vbucket 341
        55713:Fri Mar 29 23:57:17.725105 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
        55714:Fri Mar 29 23:57:17.725105 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
        55715:Fri Mar 29 23:57:17.725105 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 341
        55716:Fri Mar 29 23:57:18.037106 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        55719:Fri Mar 29 23:57:23.933916 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        55724:Fri Mar 29 23:57:30.283127 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        ....
        59091:Sat Mar 30 01:48:53.227665 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        59094:Sat Mar 30 01:48:59.514476 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        59098:Sat Mar 30 01:49:06.128888 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        59112:Sat Mar 30 01:56:41.322088 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
        59113:Sat Mar 30 02:05:45.934644 Pacific Daylight Time 3: TAP (Producer) eq_tapq:replication_building_341_'ns_1@10.2.1.63' - Suspend for 5.00 secs
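
        A rough way to confirm this stall from the collected memcached log is to count the repeated "Suspend for" lines emitted by the replication-building TAP producer for vbucket 341. The sketch below assumes the log format quoted above and a log file path passed on the command line; it is illustrative only:

        import re
        import sys

        VBUCKET = "341"
        # Matches the replication-building TAP producer's "Suspend for" lines,
        # following the memcached log excerpt quoted above.
        pattern = re.compile(
            r"TAP \(Producer\) eq_tapq:replication_building_%s_.* - Suspend for" % VBUCKET)

        count = 0
        with open(sys.argv[1], errors="replace") as log:   # e.g. memcached log from collect_info
            for line in log:
                if pattern.search(line):
                    count += 1

        print("vbucket %s backfill suspended %d times without completing" % (VBUCKET, count))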

        Thuan Nguyen added a comment -

        Looked at the logs from this bug with Aliaksey Artamonau. This bug (MB-7996) and MB-8350 are totally different issues.

        Thuan Nguyen added a comment -

        Tested in 2.0.2-812 and I don't see this issue, so I am closing this bug.


          People

          • Assignee: Anil Kumar
          • Reporter: Thuan Nguyen
          • Votes: 0
          • Watchers: 8

