Couchbase Server / MB-5602

auto-failover fails over a node if some of the buckets have already been rebalanced out but the rebalance has been stopped or interrupted (auto-failover should only fail over a node if all of its buckets are down)


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: 1.8.1
    • Fix Version/s: 1.8.1-release-candidate
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None
    • Environment: 18-node cluster, CentOS
      Build 181-918
      2 buckets, 1024 vbuckets

    Description

      Setup
      1. Set up an 18-node cluster with 2 buckets: bucket1 and bucket2.
      2. Enable auto-failover.
      3. Add a new node, 126 (10.3.121.126).
      4. Rebalance. (A scripted sketch of these steps follows below.)
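
      A scripted version of the setup steps, as a minimal sketch: the admin credentials and the 30-second auto-failover timeout are assumptions, and the calls use the standard ns_server REST endpoints (/settings/autoFailover, /controller/addNode, /controller/rebalance).

          import time
          import requests

          MASTER = "http://10.3.2.104:8091"      # master node from this report
          AUTH = ("Administrator", "password")   # hypothetical credentials

          # Step 2: enable auto-failover (the 30 s timeout is an assumption).
          requests.post(f"{MASTER}/settings/autoFailover",
                        data={"enabled": "true", "timeout": "30"}, auth=AUTH)

          # Step 3: add the new node, 126.
          requests.post(f"{MASTER}/controller/addNode",
                        data={"hostname": "10.3.121.126",
                              "user": AUTH[0], "password": AUTH[1]}, auth=AUTH)

          # Step 4: rebalance across all known nodes, ejecting none.
          nodes = requests.get(f"{MASTER}/pools/default", auth=AUTH).json()["nodes"]
          requests.post(f"{MASTER}/controller/rebalance",
                        data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                              "ejectedNodes": ""}, auth=AUTH)

          # Wait for the rebalance to finish before relying on cluster state.
          while requests.get(f"{MASTER}/pools/default/rebalanceProgress",
                             auth=AUTH).json().get("status") == "running":
              time.sleep(5)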

      Output
      1. Rebalance works fine, but these log messages appear:

      Could not automatically failover node 'ns_1@10.3.121.126' because I think rebalance is running   auto_failover000   ns_1@10.3.2.104   19:32:12 - Sun Jun 17, 2012
      Bucket "bucket1" loaded on node 'ns_1@10.3.121.126' in 0 seconds.   ns_memcached001   ns_1@10.3.121.126   19:32:04 - Sun Jun 17, 2012
      Started rebalancing bucket bucket2   ns_rebalancer000   ns_1@10.3.2.104   19:31:36 - Sun Jun 17, 2012
      Starting rebalance, KeepNodes = ['ns_1@10.3.2.85','ns_1@10.3.2.86','ns_1@10.3.2.87','ns_1@10.3.2.88',
      'ns_1@10.3.2.89','ns_1@10.3.2.104','ns_1@10.3.2.105','ns_1@10.3.2.106','ns_1@10.3.2.108','ns_1@10.3.2.109',
      'ns_1@10.3.2.110','ns_1@10.3.2.111','ns_1@10.3.2.112','ns_1@10.3.2.113','ns_1@10.3.2.114','ns_1@10.3.2.115',
      'ns_1@10.3.121.126'], EjectNodes = []
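
      For context on the summary: the failure mode comes down to which per-bucket signals auto-failover requires before declaring a node dead. Below is a minimal illustrative sketch of the two policies; the names and states are hypothetical, and this is not the actual ns_server (Erlang) code.

          # Per-bucket health for one node, as auto-failover might see it.
          # After a stopped or interrupted rebalance, a bucket can be
          # legitimately absent from a node because its vbuckets were
          # already moved off.

          def should_failover_buggy(bucket_states):
              """Pre-fix behavior per the summary: any bucket that is not
              healthy on the node, including one already rebalanced out,
              makes the whole node look failed."""
              return any(s != "healthy" for s in bucket_states.values())

          def should_failover_fixed(bucket_states):
              """Post-fix behavior per the summary: fail over only when
              all buckets on the node are down."""
              return all(s == "down" for s in bucket_states.values())

          # Node after an interrupted rebalance: bucket1 moved off, bucket2 fine.
          states = {"bucket1": "rebalanced_out", "bucket2": "healthy"}
          print(should_failover_buggy(states))   # True  -> spurious failover
          print(should_failover_fixed(states))   # False -> node is kept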

      Attached are the web logs and the logs from the master node (104).

      https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/ns-diag-20120618095246.txt
      https://s3.amazonaws.com/bugdb/jira/web-log-largeCluster/10.3.2.104-8091-diag.txt.gz

      Other related conversation
      I have enabled auto-failover on the large cluster, and every time I rebalance a node in, I get an error message: "Could not automatically failover node 'ns_1@10.3.121.126' because I think rebalance is running".
      Node 126 is newly added and a rebalance was issued; is this message displayed because the node is not yet ready to join the cluster?
      The rebalance works fine, but I do not understand why auto-failover is attempted here. Any idea?

      No. According to the logs, at 19:32:04 bucket1 was loaded. Maybe there are some other buckets that are still not ready on this node. May I have the logs?
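
      The wording of the warning suggests the order of the checks: the per-bucket readiness monitor flags the node first, and a running rebalance then vetoes the failover. A hedged sketch of that ordering, with hypothetical function and parameter names:

          def maybe_auto_failover(node, bucket_states, rebalance_running):
              """Hypothetical decision order implied by the log excerpt."""
              if all(s == "healthy" for s in bucket_states.values()):
                  return  # every bucket is up on the node; nothing to do
              if rebalance_running:
                  # This branch would produce the message seen in the report.
                  print(f"Could not automatically failover node '{node}' "
                        "because I think rebalance is running")
                  return
              print(f"failing over {node}")  # placeholder for the real action

          # Node 126 mid-rebalance: bucket1 loaded, bucket2 not ready yet.
          maybe_auto_failover("ns_1@10.3.121.126",
                              {"bucket1": "healthy", "bucket2": "not_ready"},
                              rebalance_running=True)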


          People

            Assignee: Aleksey Kondratenko (alkondratenko, inactive)
            Reporter: Ketaki Gangal (ketaki, inactive)

