  1. Couchbase Server
  2. MB-4595

Rebalancing in a new 1.8 node when only one checkpoint is open results in an imbalanced cluster because ep-engine on the source node does not backfill items from the open checkpoint

    Details

      Description

      Keeping this cluster alive.
      http://ec2-67-202-63-126.compute-1.amazonaws.com:8091/index.html#sec=overview

      Steps
      1) Create a 4-node cluster on 1.7.2.
      2) Upgrade all 4 nodes to 1.8.0.
      3) Rebalance in 2 new 1.8.0 nodes at the same time (10.83.47.43 and 10.112.27.9 are the brand-new 1.8.0 nodes that were added to the cluster).
      4) After rebalance, the active/replica item count on the newly added nodes (10.83.47.43 and 10.112.27.9) is 0.
      5) Post rebalance, I was able to load data into 10.83.47.43 and 10.112.27.9 using the Python loader.

      Attaching logs from all the nodes.

      1. rebalance.jpg
        128 kB
        Karan Kumar

        Activity

        karan Karan Kumar (Inactive) created issue -
        karan Karan Kumar (Inactive) added a comment -

        I was able to reproduce this on a separate cluster.

        Attaching the screenshot after rebalance.

        karan Karan Kumar (Inactive) made changes -
        Field Original Value New Value
        Attachment rebalance.jpg [ 11937 ]
        farshid Farshid Ghods (Inactive) made changes -
        Summary
          Original Value: Adding two new 180 nodes to upgraded (172-> 180) cluster, does not shuffle vbuckets to new nodes, resulting in 0 active and replica items on new nodes
          New Value: rebalancing new 1.8 nodes to an upgraded cluster ( 1.7.2->1.80 ) results in 0 active and 0 replica items on the new nodes
        karan Karan Kumar (Inactive) added a comment -

        Also, before rebalance there were 54K keys in total (on 4 nodes).
        After rebalance I see only 36K in total (in the screenshot).
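
The 54K and 36K figures are consistent with every vbucket that moved to the two new nodes arriving empty. The sketch below is only a back-of-the-envelope check, not something stated in the report: it assumes vbuckets are split evenly across all 6 nodes and that keys are spread uniformly over vbuckets. The function name is hypothetical.

```python
# Hypothetical arithmetic check. Assumes an even vbucket split across nodes
# and uniformly hashed keys; neither assumption comes from the report.
def expected_visible_keys(total_keys, old_nodes, new_nodes):
    """Keys still visible if every vbucket moved to a new node arrives empty."""
    total_nodes = old_nodes + new_nodes
    moved = total_keys * new_nodes // total_nodes  # share owned by the new nodes
    return total_keys - moved


print(expected_visible_keys(54_000, old_nodes=4, new_nodes=2))  # 36000
```

Under those assumptions the two new nodes own one third of the keys (18K), which matches the 18K that went missing.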

        karan Karan Kumar (Inactive) added a comment -

        Assigning this to Chiyoung.
        This turned out to be a bug in ep-engine with checkpoint management.

        Before rebalancing new nodes into the cluster, we want to make sure that the old nodes have a checkpoint id greater than 1.

        Verified that the issue does not occur when checkpoint id is greater than 1.
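
A pre-rebalance check along these lines could be scripted. The following is a minimal sketch that scans a text dump of per-vbucket checkpoint stats and lists the vbuckets whose open checkpoint id is still 1. The line format `vb_<N>:open_checkpoint_id: <ID>`, the function name, and the sample data are assumptions made for illustration; adjust the pattern to whatever your stats tool actually prints.

```python
import re

# Assumed line shape: "vb_<N>:open_checkpoint_id: <ID>" (illustrative only).
_LINE = re.compile(r"^\s*vb_(\d+):open_checkpoint_id:\s*(\d+)\s*$")


def vbuckets_at_risk(stats_text, min_safe_id=2):
    """Return vbucket numbers whose open checkpoint id is below min_safe_id."""
    risky = []
    for line in stats_text.splitlines():
        match = _LINE.match(line)
        if match and int(match.group(2)) < min_safe_id:
            risky.append(int(match.group(1)))
    return sorted(risky)


# Made-up sample input, not taken from the attached logs.
sample = """
 vb_0:open_checkpoint_id: 1
 vb_0:num_checkpoint_items: 0
 vb_1:open_checkpoint_id: 3
 vb_2:open_checkpoint_id: 1
"""
print(vbuckets_at_risk(sample))  # [0, 2]
```

An empty result would suggest that every vbucket has moved past checkpoint 1, which is the condition under which the issue was verified not to occur.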

        karan Karan Kumar (Inactive) made changes -
        Priority Blocker [ 1 ] Major [ 3 ]
        Assignee Aliaksey Artamonau [ aliaksey artamonau ] Chiyoung Seo [ chiyoung ]
        karan Karan Kumar (Inactive) made changes -
        Fix Version/s 1.8.1 [ 10249 ]
        Fix Version/s 1.8.0 [ 10248 ]
        farshid Farshid Ghods (Inactive) added a comment -

        This bug is not a regression and exists in current installations, but it only happens if the user restarts a node and then attempts to add and rebalance new nodes while the existing nodes have only 1 checkpoint (the cluster has been running for less than an hour, or there are fewer than 500k items in the cluster).

        farshid Farshid Ghods (Inactive) made changes -
        Summary
          Original Value: rebalancing new 1.8 nodes to an upgraded cluster ( 1.7.2->1.80 ) results in 0 active and 0 replica items on the new nodes
          New Value: rebalancing in a new 1.8 node if there are only 1 checkpoint open results in an imbalanced cluster because ep-engine in the source node does not backfill items from the open checkpoint
        chiyoung Chiyoung Seo added a comment -

        This is a bug in checkpoint synchronization, but not a blocker for 1.8 release.

        This issue happens in the following scenario:

        1) Set up the 1.7.x cluster and add a very small number of items (e.g., 100K items) into the cluster. At this time, each active vbucket has only one checkpoint, with ID 1.
        2) Shut down the 1.7.x cluster, upgrade it to 1.8, and restart the cluster.
        3) During the warmup, each vbucket is loaded from disk and its open checkpoint id is restored from the vbucket_state table on disk. In this case, each active vbucket's open checkpoint still has id 1, but the open checkpoint does not contain any items.
        4) Add a brand-new 1.8 node into the cluster and rebalance.
        5) Some active vbuckets are taken over by this new node. However, this does not trigger backfill operations on the existing nodes because they have open checkpoint id 1 for all active vbuckets and the new node starts with open checkpoint 1 as well.
        6) After rebalance, the new node has 0 items on its active vbuckets.

        The above scenario is a corner case because we tested it with a 1.7.x cluster that had been running for a very short time, less than the new-checkpoint creation interval of 10 minutes. Therefore, it is not likely to happen in our customers' clusters, which have long-running 1.7.x membase nodes and large open checkpoint ids for active vbuckets.
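
The scenario above can be illustrated with a toy model. This is a sketch only and not the ep-engine code: the class, the function names, and the exact "fixed" condition are hypothetical, chosen to mirror the description (backfill is skipped when the checkpoint ids match) and the later commit title ("Schedule backfill for a fresh client with empty data").

```python
from dataclasses import dataclass, field


@dataclass
class VBucket:
    """Toy source vbucket; all field names are hypothetical."""
    open_checkpoint_id: int
    checkpoint_items: list = field(default_factory=list)  # items held in memory
    disk_items: list = field(default_factory=list)        # items persisted on disk


def needs_backfill_buggy(source, client_ckpt_id, client_item_count):
    # Backfill only when the client is behind the source's open checkpoint.
    return client_ckpt_id < source.open_checkpoint_id


def needs_backfill_fixed(source, client_ckpt_id, client_item_count):
    # Additionally backfill a fresh client that holds no data at all.
    if client_item_count == 0 and source.disk_items:
        return True
    return needs_backfill_buggy(source, client_ckpt_id, client_item_count)


def takeover(source, client_ckpt_id, client_item_count, needs_backfill):
    """Return the items the destination node ends up with."""
    received = list(source.checkpoint_items)
    if needs_backfill(source, client_ckpt_id, client_item_count):
        received = list(source.disk_items) + received
    return received


# After warmup: open checkpoint id is 1, the checkpoint is empty, data is on disk.
warmed_up = VBucket(open_checkpoint_id=1, disk_items=["k1", "k2"])
print(takeover(warmed_up, 1, 0, needs_backfill_buggy))  # []
print(takeover(warmed_up, 1, 0, needs_backfill_fixed))  # ['k1', 'k2']
```

With matching checkpoint ids the first call skips backfill and the new node receives nothing, which corresponds to step 6. The second call shows how treating an empty client as needing backfill would avoid that outcome in this model.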

        farshid Farshid Ghods (Inactive) made changes -
        Labels 1.8.0-release-notes 2.0-dev-preview-4-release-notes
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Chiyoung Seo [ chiyoung ] Mike Wiederhold [ mikew ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I'm able to reproduce that deterministically on 2.0 with lots of items, with checkpoint ids greater than 0 and 1.

        farshid Farshid Ghods (Inactive) added a comment -

        Then that's a different bug. Can you please open a new issue and mark it as a blocker for 2.0 and 1.8.1 (since 1.8 is now based on the master branch)?

        chiyoung Chiyoung Seo added a comment -

        As Mike is booked with QA-related tasks, I will work on this issue for 1.8.1 release.

        chiyoung Chiyoung Seo made changes -
        Assignee Mike Wiederhold [ mikew ] Chiyoung Seo [ chiyoung ]
        chiyoung Chiyoung Seo made changes -
        Fix Version/s 1.8.1 [ 10295 ]
        chiyoung Chiyoung Seo added a comment - http://review.couchbase.org/#change,14297
        chiyoung Chiyoung Seo made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #230 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/230/)
        MB-4595 Schedule backfill for a fresh client with empty data (Revision 303ab54372e422b122116a85f2f084071b1491ff)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • checkpoint.cc
        • tapconnection.cc
        • ep_testsuite.cc
        • checkpoint.hh
        farshid Farshid Ghods (Inactive) made changes -
        Labels 1.8.0-release-notes 2.0-dev-preview-4-release-notes 1.8.0-release-notes 1.8.1-release-notes 2.0-dev-preview-4-release-notes
        karan Karan Kumar (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        anil Anil Kumar made changes -
        Fix Version/s 2.0-beta [ 10113 ]
        Fix Version/s 1.8.2 [ 10249 ]

          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            karan Karan Kumar (Inactive)