Couchbase Server / MB-4595

Rebalancing in a new 1.8 node when only 1 checkpoint is open results in an imbalanced cluster, because ep-engine on the source node does not backfill items from the open checkpoint

    Details

      Description

      Keeping this cluster alive.
      http://ec2-67-202-63-126.compute-1.amazonaws.com:8091/index.html#sec=overview

      Steps
      1) Create a 4-node cluster on 1.7.2.
      2) Upgrade all 4 nodes to 1.8.0.
      3) Rebalance 2 new 1.8.0 nodes into the cluster at the same time (10.83.47.43 and 10.112.27.9 are the brand-new 1.8.0 nodes that were added).
      4) After the rebalance, the active/replica item count on the newly added nodes (10.83.47.43 and 10.112.27.9) is 0.
      5) Post rebalance, I was able to load data into 10.83.47.43 and 10.112.27.9 using the Python loader.

      Attaching logs from all the nodes.

      Attachment: rebalance.jpg (128 kB, Karan Kumar)

        Activity

        karan Karan Kumar (Inactive) added a comment -

        I was able to reproduce this on a separate cluster.

        Attaching the screenshot after rebalance.

        karan Karan Kumar (Inactive) added a comment -

        Also, before the rebalance the cluster had 54K keys in total (on 4 nodes).
        After the rebalance I see only 36K in total (see the screenshot).
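These numbers are consistent with the two new nodes holding no items. The rough check below assumes this cluster also went from 4 to 6 nodes (as in the steps in the description) and that keys were spread evenly across nodes. Neither is stated for this cluster, so treat it as a plausibility check only.

```python
# Rough consistency check. The 4 -> 6 node count for this cluster and the
# even key distribution are assumptions, not facts from the report.
keys_before = 54_000
nodes_before = 4
nodes_added = 2

nodes_after = nodes_before + nodes_added

# If the new nodes took over their share of active vbuckets but received
# no items, only the share left on the old nodes is still visible.
keys_visible_after = keys_before * nodes_before // nodes_after

print(keys_visible_after)  # 36000
```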

        karan Karan Kumar (Inactive) added a comment -

        Assigning this to Chiyoung.
        This turned out to be a bug in ep-engine with checkpoint management.

        Before rebalancing new nodes into the cluster, we want to make sure that the old nodes have an open checkpoint ID greater than 1.

        Verified that the issue does not occur when the checkpoint ID is greater than 1.
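A pre-rebalance check along these lines could be scripted. The sketch below assumes the per-vbucket checkpoint stats have already been collected into a dict keyed like `vb_0:open_checkpoint_id`. That key format, the sample values, and the helper name `vbuckets_at_risk` are illustrative assumptions; fetching the stats from a node is left out.

```python
import re

# Assumed stat key format: "vb_<id>:open_checkpoint_id".
_KEY = re.compile(r"^vb_(\d+):open_checkpoint_id$")


def vbuckets_at_risk(stats):
    """Return sorted vbucket ids whose open checkpoint id is not greater than 1."""
    at_risk = []
    for key, value in stats.items():
        match = _KEY.match(key)
        if match and int(value) <= 1:
            at_risk.append(int(match.group(1)))
    return sorted(at_risk)


# Made-up sample input, for illustration only.
sample = {
    "vb_0:open_checkpoint_id": "1",
    "vb_1:open_checkpoint_id": "7",
    "vb_1:some_other_stat": "12",  # unrelated key, ignored by the check
    "vb_2:open_checkpoint_id": "1",
}

print(vbuckets_at_risk(sample))  # [0, 2]
```

An empty result would mean every vbucket in the input is past checkpoint 1, which is the condition verified above.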

        farshid Farshid Ghods (Inactive) added a comment -

        This bug is not a regression and exists in current installations, but it only happens if the user restarts a node and then attempts to add and rebalance new nodes while the existing nodes have only 1 checkpoint (the cluster has been running for less than an hour or there are fewer than 500K items in the cluster).

        chiyoung Chiyoung Seo added a comment -

        This is a bug in checkpoint synchronization, but not a blocker for 1.8 release.

        This issue happens in the following scenario:

        1) Set up a 1.7.x cluster and add a very small number of items (e.g., 100K items) to the cluster. At this point, each active vbucket has only one checkpoint, with ID 1.
        2) Shut down the 1.7.x cluster, upgrade it to 1.8, and restart the cluster.
        3) During warmup, each vbucket is loaded from disk and its open checkpoint ID is restored from the vbucket_state table on disk. In this case, the open checkpoint of each active vbucket still has ID 1, but contains no items.
        4) Add a brand-new 1.8 node to the cluster and rebalance.
        5) Some active vbuckets are taken over by the new node. However, this does not trigger backfill operations on the existing nodes, because they have open checkpoint ID 1 for all active vbuckets and the new node starts with open checkpoint 1 as well.
        6) After the rebalance, the new node has 0 items in its active vbuckets.

        The above scenario is a corner case because we tested it with a 1.7.x cluster that had been running for a very short time, less than the new-checkpoint creation interval of 10 minutes. Therefore, it is unlikely to happen in customers' clusters that have a long-running 1.7.x Membase and large open checkpoint IDs for active vbuckets.
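The scenario above can be illustrated with a toy model. This is a simplified sketch, not ep-engine code: the class, the function names, and the rule "backfill only if the client's checkpoint ID is behind the source's open checkpoint ID" are assumptions made to mirror steps 3 to 6.

```python
from dataclasses import dataclass, field

# A brand-new node starts with open checkpoint 1 (step 5).
FRESH_CLIENT_CHECKPOINT_ID = 1


@dataclass
class SourceVBucket:
    """Toy stand-in for an active vbucket on an existing node."""
    open_checkpoint_id: int
    items_on_disk: dict = field(default_factory=dict)
    items_in_open_checkpoint: dict = field(default_factory=dict)


def backfill_scheduled(source, client_checkpoint_id):
    # Assumed simplified rule: read from disk only when the client is
    # behind the source's open checkpoint.
    return client_checkpoint_id < source.open_checkpoint_id


def take_over(source, client_checkpoint_id=FRESH_CLIENT_CHECKPOINT_ID):
    """Return the items the new node ends up with for this vbucket."""
    received = {}
    if backfill_scheduled(source, client_checkpoint_id):
        received.update(source.items_on_disk)
    received.update(source.items_in_open_checkpoint)
    return received


# Step 3: after warmup the items are on disk, the open checkpoint ID is
# still 1, and the open checkpoint itself is empty.
warmed_up = SourceVBucket(open_checkpoint_id=1,
                          items_on_disk={"k1": "v1", "k2": "v2"})
print(len(take_over(warmed_up)))  # 0

# A long-running node has a larger open checkpoint ID, so backfill runs.
long_running = SourceVBucket(open_checkpoint_id=42,
                             items_on_disk={"k1": "v1", "k2": "v2"})
print(len(take_over(long_running)))  # 2
```

In this model the warmed-up source hands over nothing because neither path delivers the disk-only items: no backfill is scheduled and the open checkpoint is empty.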

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I'm able to reproduce that deterministically on 2.0 with lots of items, and with checkpoint IDs greater than 0 and 1.

        farshid Farshid Ghods (Inactive) added a comment -

        Then that's a different bug. Can you please open a new issue and mark it as a blocker for 2.0 and 1.8.1 (since 1.8 is now based on the master branch)?

        chiyoung Chiyoung Seo added a comment -

        As Mike is booked with QA-related tasks, I will work on this issue for 1.8.1 release.

        chiyoung Chiyoung Seo added a comment - http://review.couchbase.org/#change,14297
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #230 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/230/)
        MB-4595 Schedule backfill for a fresh client with empty data (Revision 303ab54372e422b122116a85f2f084071b1491ff)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • checkpoint.cc
        • tapconnection.cc
        • ep_testsuite.cc
        • checkpoint.hh
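Going only by the commit title, the change makes a fresh client with no data trigger a backfill. A minimal sketch of that idea follows. The function and parameter names are hypothetical and are not taken from checkpoint.cc or tapconnection.cc, and the actual patch may differ.

```python
def should_schedule_backfill(source_open_checkpoint_id,
                             client_checkpoint_id,
                             client_has_data):
    """Hypothetical decision rule modeled on the commit title only."""
    # Assumed pre-fix condition: the client is behind the source.
    client_is_behind = client_checkpoint_id < source_open_checkpoint_id
    # Assumed added condition: a fresh client with empty data always
    # gets a backfill, even if the checkpoint IDs match.
    client_is_fresh_and_empty = not client_has_data
    return client_is_behind or client_is_fresh_and_empty


# Both sides at checkpoint 1, but the new node is empty: backfill.
print(should_schedule_backfill(1, 1, client_has_data=False))  # True
# Both sides at checkpoint 1 and the client already has data: no backfill.
print(should_schedule_backfill(1, 1, client_has_data=True))   # False
# Client behind the source: backfill regardless.
print(should_schedule_backfill(42, 1, client_has_data=True))  # True
```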

          People

          • Assignee: Chiyoung Seo
          • Reporter: Karan Kumar (Inactive)