Couchbase Server / MB-5602

auto-failover fails over a node if some of the buckets are already rebalanced out but rebalance has been stopped or interrupted (auto-failover should fail over only if all buckets are down)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.8.1-release-candidate
    • Fix Version/s: 1.8.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      18-node cluster, CentOS
      Build 181-918
      2 buckets, 1024 vbuckets

      Description

      Setup
      1. Set up an 18-node cluster with 2 buckets: bucket1, bucket2
      2. Enable auto-failover
      3. Add a new node 126
      4. Rebalance

      Output
      1. Rebalance works fine, but these log messages appear:

      Could not automatically failover node 'ns_1@10.3.121.126' because I think rebalance is running auto_failover000 ns_1@10.3.2.104 19:32:12 - Sun Jun 17, 2012
      Bucket "bucket1" loaded on node 'ns_1@10.3.121.126' in 0 seconds. ns_memcached001 ns_1@10.3.121.126 19:32:04 - Sun Jun 17, 2012
      Started rebalancing bucket bucket2 ns_rebalancer000 ns_1@10.3.2.104 19:31:36 - Sun Jun 17, 2012
      Starting rebalance, KeepNodes = ['ns_1@10.3.2.85','ns_1@10.3.2.86','ns_1@10.3.2.87','ns_1@10.3.2.88',
      'ns_1@10.3.2.89','ns_1@10.3.2.104','ns_1@10.3.2.105','ns_1@10.3.2.106',
      'ns_1@10.3.2.108','ns_1@10.3.2.109','ns_1@10.3.2.110','ns_1@10.3.2.111',
      'ns_1@10.3.2.112','ns_1@10.3.2.113','ns_1@10.3.2.114','ns_1@10.3.2.115',
      'ns_1@10.3.121.126'], EjectNodes = []

      Attached are the web logs and the logs from the master node (.104).

      https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/ns-diag-20120618095246.txt
      https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/10.3.2.104-8091-diag.txt.gz

      Other related conversation
      I have enabled auto-failover on the large cluster, and every time I rebalance in a node I get an error message saying "Could not automatically failover node 'ns_1@10.3.121.126' because I think rebalance is running".
      Node .126 is newly added and a rebalance was issued; is this message displayed because the node is not yet ready to join the cluster?
      The rebalance works fine, but I do not understand why auto-failover is attempted here. Any idea?

      No. According to the logs, bucket1 was loaded at 19:32:04. Maybe some other buckets are still not ready on this node. May I have the logs?
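
      The reason the failover is merely logged rather than executed is the rebalance guard: auto-failover checks whether a rebalance is in flight before acting on a node that looks down. A minimal sketch of that control flow, with invented names (this is not the actual ns_server code):

          -module(failover_guard_sketch).
          -export([maybe_autofailover/3]).

          %% A node that looks down is only failed over when no rebalance is
          %% in progress; otherwise the "because I think rebalance is
          %% running" message is logged. Once rebalance stops, the same
          %% (mis)computed down state is no longer guarded, which is the bug
          %% reported here.
          maybe_autofailover(Node, RebalanceRunning, NodeLooksDown) ->
              case {NodeLooksDown, RebalanceRunning} of
                  {true, true}  -> {log, rebalance_running};
                  {true, false} -> {failover, Node};
                  {false, _}    -> ok
              end.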


        Activity

        ketaki Ketaki Gangal created issue -
        farshid Farshid Ghods (Inactive) added a comment -

        This happens if you have 2 or more buckets, auto-failover is enabled, and both buckets hold a significant amount of data.

        After the first bucket is rebalanced out, the node will incorrectly be interpreted as down by the auto-failover service, since it no longer has all the buckets this service (incorrectly) thinks it needs to have.

        Normally a running rebalance prevents auto-failover from actually doing anything, but if the rebalance is stopped, the 'partially' rebalanced-out node will be automatically failed over.

        Seen here: https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/10.3.2.104-8091-diag.txt.gz
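
        The failure mode can be stated compactly. Below is a minimal Erlang sketch (module and function names are invented for illustration; this is not the actual auto_failover.erl logic): the buggy predicate treats a node as down when any bucket in the cluster is not ready on it, while the corrected predicate first restricts to buckets whose servers list still includes the node.

            -module(failover_sketch).
            -export([node_down_buggy/3, node_down_fixed/3, demo/0]).

            %% Buckets is [{BucketName, Servers}]; ReadyFun(Node, Bucket)
            %% returns true when the bucket is loaded and ready on the node.

            %% Buggy: the node looks down if ANY bucket is not ready on it,
            %% including a bucket already rebalanced off the node.
            node_down_buggy(Node, Buckets, ReadyFun) ->
                lists:any(fun({Name, _Servers}) ->
                                  not ReadyFun(Node, Name)
                          end, Buckets).

            %% Fixed: only buckets whose servers list still includes the
            %% node count; a node serving no buckets is not down.
            node_down_fixed(Node, Buckets, ReadyFun) ->
                Relevant = [Name || {Name, Servers} <- Buckets,
                                    lists:member(Node, Servers)],
                Relevant =/= [] andalso
                    lists:any(fun(Name) -> not ReadyFun(Node, Name) end,
                              Relevant).

            %% The scenario from this bug: bucket1 already moved off the
            %% node, rebalance stopped while bucket2 still lists it.
            demo() ->
                Node = 'ns_1@10.3.121.126',
                Buckets = [{"bucket1", ['ns_1@10.3.2.104']},
                           {"bucket2", ['ns_1@10.3.2.104', Node]}],
                Ready = fun(N, B) ->
                                not (N =:= Node andalso B =:= "bucket1")
                        end,
                {node_down_buggy(Node, Buckets, Ready),   %% true: spurious
                 node_down_fixed(Node, Buckets, Ready)}.  %% false: node ok

        With the buggy predicate, a stopped rebalance leaves the cluster in exactly the demo/0 state, and the next auto-failover pass fails the node over.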

        karan Karan Kumar (Inactive) made changes -
        Field               Original Value    New Value
        Fix Version/s                         1.8.1 [ 10295 ]
        Affects Version/s                     1.8.1-release-candidate [ 10299 ]
        Affects Version/s   1.8.1 [ 10295 ]
        Priority            Minor [ 4 ]       Blocker [ 1 ]
        Sprint Status                         Current Sprint
        Sprint Priority                       0
        karan Karan Kumar (Inactive) added a comment - http://review.couchbase.org/#change,17372
        karan Karan Kumar (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        dipti Dipti Borkar made changes -
        Sprint Status Current Sprint
        Sprint Priority 0
        farshid Farshid Ghods (Inactive) made changes -
        Summary Attempting auto-failover a newly added node on issuing a rebalance-In on the cluster. auto-failover will fail over a node if some of the buckets are already rebalanced out but rebalance has been stopped or interrupted ( auto-failover should failover if all buckets are down)
        farshid Farshid Ghods (Inactive) made changes -
        Summary auto-failover will fail over a node if some of the buckets are already rebalanced out but rebalance has been stopped or interrupted ( auto-failover should failover if all buckets are down) auto-failover fails over a node if some of the buckets are already rebalanced out but rebalance has been stopped or interrupted ( auto-failover should failover if all buckets are down)
        ketaki Ketaki Gangal made changes -
        Comment [ Last tests on build 927 - Not able to reproduce this error. ]
        ketaki Ketaki Gangal added a comment -

        Tested on large-cluster build 927; not seeing the auto-failover messages anymore.
        Closing this bug for now.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #380 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/380/)
        MB-5602: consider buckets' servers list when computing down nodes (Revision 72b674c47e386dac5a28ecaadfea2f37c3d14133)

        Result = SUCCESS
        Farshid Ghods :
        Files :

        • src/auto_failover.erl
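
        The commit message suggests the fix amounts to filtering each node's expected buckets by the bucket's servers list. A hedged one-function sketch of the idea (expected_buckets/1 is an invented name, and ns_bucket:get_buckets/0 returning {Name, Config} pairs whose Config proplist holds a servers key is an assumption about the ns_server tree of that era, not a quote from the actual revision):

            %% Only buckets whose servers list includes the node should be
            %% consulted when deciding whether that node is down.
            expected_buckets(Node) ->
                [Name || {Name, Config} <- ns_bucket:get_buckets(),
                         lists:member(Node,
                                      proplists:get_value(servers, Config, []))].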
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: alkondratenko Aleksey Kondratenko (Inactive)
          • Reporter: ketaki Ketaki Gangal
          • Votes: 0
          • Watchers: 1

