Couchbase Server / MB-6706

[system test] Rebalance hangs when adding nodes to the cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      centos 6.2 64bit build 2.0.0-1746

      Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD disk.
      • SSD disk formatted ext4, mounted on /data
      • Each server has its own drive; no disk is shared with other servers.
      • Loaded 15 million items into both buckets
      • Cluster has 2 buckets, default (11 GB) and saslbucket (11 GB), with consistent views enabled. The 2 buckets use only 68% of the system's total RAM.
      • Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)
      • Created a cluster of 4 nodes running Couchbase Server 2.0.0-1746:

      10.6.2.37
      10.6.2.38
      10.6.2.39
      10.6.2.40

      • Data path /data
      • View path /data
      • Added 4 nodes to the cluster and rebalanced:
        10.6.2.42
        10.6.2.43
        10.6.2.44
        10.6.2.45
      • Rebalance hung

      Link to atop files from all nodes: https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-8nodes-1746-reb-hang-20120920.tgz


        Activity

        thuan Thuan Nguyen added a comment -

        Integrated in github-couchdb-preview #509 (See http://qa.hq.northscale.net/job/github-couchdb-preview/509/)
        MB-6706 Trigger update after defining indexable partitions (Revision 9098ff069968247556da72a2be1bbfd944b1d30e)

        Result = SUCCESS
        pwansch :
        Files :

        • src/couch_set_view/src/couch_set_view_group.erl
        thuan Thuan Nguyen added a comment -

        I killed the loads on the default bucket (which was rebalancing but hung); the rebalance resumed five minutes after all loads stopped. A few minutes later I restarted half the loads on the default bucket, and the rebalance continued running.

        thuan Thuan Nguyen added a comment -

        Promoting to blocker since we hit it often in system tests.

        thuan Thuan Nguyen added a comment - edited

        Hit this bug again in build 2.0.0-1781 in system test.

        • Added 2 nodes (39 and 40) and rebalanced. During the rebalance, rebooted nodes 42 and 43. Rebalance failed as expected.
        • After the nodes finished warmup, rebalanced again. Rebalance failed with bug MB-6490 on node 44.
        • Failed over node 44 and rebalanced.
        • The cluster rebalanced saslbucket first; that rebalance completed after 17 hours:

        Started rebalancing bucket saslbucket ns_rebalancer000 ns_1@10.6.2.37 14:44:27 - Mon Oct 1, 2012
        Started rebalancing bucket default ns_rebalancer000 ns_1@10.6.2.37 08:14:08 - Tue Oct 2, 2012

        • Rebalance of the default bucket hung around 10:00 AM Tue Oct 2, 2012, as shown in the screen capture

        Cluster information:

        • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
        • Each server has 32 GB RAM and a 400 GB SSD disk.
        • 24.8 GB RAM allotted to Couchbase Server on each node
        • SSD disk formatted ext4, mounted on /data
        • Each server has its own SSD drive; no disk is shared with other servers.
        • Created a cluster of 6 nodes running Couchbase Server 2.0.0-1781:
        • Cluster has 2 buckets, default (12 GB) and saslbucket (12 GB).
        • Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)
        • Consistent views enabled on the cluster (the default)

        10.6.2.37
        10.6.2.38
        10.6.2.44
        10.6.2.45
        10.6.2.42
        10.6.2.43

        • Loaded 14 million items into each bucket. Each key has a size of 512 to 1024 bytes.
        • Mutated 14 million items in each bucket, with each key sized 1024 to 1500 bytes
        • Load runs at about 8K to 10K ops on both buckets
        • Queried all 4 views from the 2 design docs

        10.6.2.39
        10.6.2.40

        • Data path /data
        • View path /data

        Manifest info from build 1781
        http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1781-rel.rpm.manifest.xml

        Link to collect_info from all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201210/8nodes-col-info-1781-rebalance-hang-20121002-114333.tgz

        Link to tap stats from all nodes: https://friendpaste.com/6JqjtMOwZLvmlx5h9fxt6L

        thuan Thuan Nguyen added a comment -

        Hit this bug again in build 2.0.0-1777 with a swap rebalance: added nodes 44 and 45, removed nodes 39 and 40.
        Rebalance hung after moving some items to the newly added nodes.

        Cluster information:

        • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
        • Each server has 32 GB RAM and a 400 GB SSD disk.
        • 24.8 GB RAM allotted to Couchbase Server on each node
        • SSD disk formatted ext4, mounted on /data
        • Each server has its own SSD drive; no disk is shared with other servers.
        • Created a cluster of 6 nodes running Couchbase Server 2.0.0-1777:
        • Cluster has 2 buckets, default (12 GB) and saslbucket (12 GB).
        • Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)

        10.6.2.37
        10.6.2.38
        10.6.2.39
        10.6.2.40
        10.6.2.42
        10.6.2.43

        • Loaded 18 million items into both buckets. Each key has a size of 512 to 1024 bytes.
        • Queried all 4 views from the 2 design docs

        10.6.2.44
        10.6.2.45

        • Data path /data
        • View path /data

        Link to collect_info from all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1777-swap-reb-hang-20120927-155552.tgz

        Link to atop files from all nodes: https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-8nodes-1777-swap-reb-hang-20120927-155750.tgz

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Filipe, there are no crashes, and you can see in the logs of .38 that we're waiting for an index update (there's just 1 simple index) that never happens.

        [ns_server:debug,2012-09-20T17:20:30.431,ns_1@10.6.2.38:<0.21339.32>:capi_set_view_manager:do_wait_index_updated:596]References to wait: Ref<0.0.309.194865> ("saslbucket", 531)

        I advise you to take a quick look at the source of do_wait_index_updated in capi_set_view_manager. Maybe you will spot something I'm not doing right.
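        The wait described above is a blocking receive on per-partition references: the caller asks the view engine for an update notification per (bucket, partition) pair and then blocks until each reference fires. A minimal, hypothetical Erlang sketch of that pattern follows; the module name, message shape, and timeout are illustrative assumptions, not the actual couch_set_view / ns_server code:

        ```erlang
        %% Hypothetical sketch of a wait-for-index-update loop. Each Ref is
        %% a reference the view engine is expected to answer with an
        %% {Ref, updated} message once the index catches up on that partition.
        -module(wait_index).
        -export([wait_index_updated/1]).

        wait_index_updated(Refs) ->
            lists:foreach(
              fun (Ref) ->
                      receive
                          {Ref, updated} ->
                              ok
                      after 30000 ->
                              %% A reply that never arrives surfaces as an
                              %% error here instead of hanging the caller.
                              erlang:error({index_update_timeout, Ref})
                      end
              end, Refs).
        ```

        The hang reported in this bug matches the failure mode this sketch guards against: the log line above shows the caller still waiting on Ref<0.0.309.194865> for ("saslbucket", 531), i.e. the update message for that reference never arrives.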

        thuan Thuan Nguyen added a comment -

        I think this bug is the same as MB-6707.


          People

          • Assignee:
            FilipeManana Filipe Manana (Inactive)
            Reporter:
            thuan Thuan Nguyen
          • Votes: 0
            Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes