Couchbase Server / MB-5590

Rebalance fails with error {wait_for_memcached_failed,"bucket3"} when a rebalance is issued after a reboot of the master node.


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version: 2.0-beta
    • Affects Version: 1.8.1
    • Component: ns_server
    • Security Level: Public
    • Labels: None
    • Environment: Large cluster, CentOS, 16 nodes
      Build 181-916rel
      3 buckets: bucket1 (3G), bucket2 (2.8G), bucket3 (200M)

    Description

      Setup:
      1. Set up a 16-node cluster. Enable auto-failover.
      2. Load data on all 3 buckets (around 3M, 2M, and 367k items).
      3. Continue loading data.
      4. Reboot the orchestrator node [84].
      5. Issue a rebalance on this cluster.

      Output
      1. Node 84 goes down and is auto-failed over (expected).
      2. The rebalance operation shows no ejected nodes in the rebalance params. [Note: no explicit remove-server action was performed on this cluster.]
      3. The rebalance operation fails with the error "wait_for_memcached_failed". The rebalance params appear in the web-log output below.
      4. Node 107 is failed over.

      Was the rebalance command likely a faulty one? Should we disallow issuing a rebalance in this manner?
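      For context on the rebalance params discussed above: when a rebalance is issued over REST, Couchbase Server takes a POST to /controller/rebalance with two form fields, knownNodes and ejectedNodes. The sketch below only builds that form body and sends nothing. It assumes the rebalance in this report was issued that way, which the report does not state; build_rebalance_form is a hypothetical helper name, and the node list is a shortened sample taken from the log below.

```python
from urllib.parse import urlencode


def build_rebalance_form(known_nodes, ejected_nodes=()):
    """Build the form body for POST /controller/rebalance (hypothetical helper).

    known_nodes   -- every node the cluster should keep track of
    ejected_nodes -- nodes to remove; empty means "eject nothing"
    """
    return urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
    })


# Shortened sample of the KeepNodes list from the web log (assumption: the
# full list would contain all 17 entries).
nodes = ["ns_1@10.3.2.84", "ns_1@10.3.2.107", "ns_1@10.3.121.126"]

# No explicit remove-server action was taken, so ejectedNodes stays empty.
body = build_rebalance_form(nodes)
print(body)
```

      With an empty ejectedNodes field, a node that was failed over but is still listed in knownNodes is treated as a node to keep, which matches the EjectNodes = [] entry in the log.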

      The live cluster is accessible at http://10.3.2.89:8091/index.html#sec=log
      Output from web-logs
      2012-06-15 11:35:46.448 ns_orchestrator:4:info:message(ns_1@10.3.121.126) - Starting rebalance, KeepNodes = ['ns_1@10.3.2.84','ns_1@10.3.2.85',
      'ns_1@10.3.2.86','ns_1@10.3.2.87',
      'ns_1@10.3.2.88','ns_1@10.3.2.89',
      'ns_1@10.3.2.104','ns_1@10.3.2.105',
      'ns_1@10.3.2.106','ns_1@10.3.2.109',
      'ns_1@10.3.2.110','ns_1@10.3.2.112',
      'ns_1@10.3.2.113','ns_1@10.3.2.114',
      'ns_1@10.3.121.126','ns_1@10.3.121.127',
      'ns_1@10.3.2.107'], EjectNodes = []

      2012-06-15 11:35:46.668 ns_rebalancer:0:info:message(ns_1@10.3.121.126) - Started rebalancing bucket bucket3
      2012-06-15 11:35:49.335 ns_memcached:1:info:message(ns_1@10.3.2.107) - Bucket "bucket3" loaded on node 'ns_1@10.3.2.107' in 0 seconds.
      2012-06-15 11:35:59.030 ns_orchestrator:2:info:message(ns_1@10.3.121.126) - Rebalance exited with reason {wait_for_memcached_failed,"bucket3", ['ns_1@10.3.2.84']}

      2012-06-15 11:36:03.901 ns_memcached:1:info:message(ns_1@10.3.2.84) - Bucket "bucket3" loaded on node 'ns_1@10.3.2.84' in 25 seconds.
      2012-06-15 11:36:19.566 auto_failover:3:info:message(ns_1@10.3.121.126) - Could not auto-failover node ('ns_1@10.3.2.107'). There was at least another node down.

      2012-06-15 11:36:19.566 auto_failover:3:info:message(ns_1@10.3.121.126) - Could not auto-failover node ('ns_1@10.3.2.84'). There was at least another node down.
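      The timestamps in the log above can be lined up to see how the rebalance relates to bucket3 loading on node 84. The sketch below is only arithmetic on the logged values; it assumes the "in 25 seconds" figure is a rounded duration ending at the logged timestamp, so the load start time is an estimate. The variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"


def ts(text):
    """Parse a web-log timestamp such as '2012-06-15 11:35:46.448'."""
    return datetime.strptime(text, FMT)


# Values copied from the web-log lines above.
rebalance_start = ts("2012-06-15 11:35:46.448")
rebalance_exit = ts("2012-06-15 11:35:59.030")
bucket3_loaded_on_84 = ts("2012-06-15 11:36:03.901")
load_duration = timedelta(seconds=25)  # "loaded ... in 25 seconds" (rounded)

# Estimated moment node 84 began loading bucket3 (assumption: the duration
# ends at the logged timestamp).
load_start_estimate = bucket3_loaded_on_84 - load_duration

# True if bucket3 on node 84 was loading for the entire rebalance attempt.
loading_throughout = (
    load_start_estimate <= rebalance_start
    and rebalance_exit <= bucket3_loaded_on_84
)

# How long after the rebalance gave up the bucket finished loading.
gap_after_exit = bucket3_loaded_on_84 - rebalance_exit

print("load started (estimate):", load_start_estimate.time())
print("loading during whole rebalance:", loading_throughout)
print("bucket ready after rebalance exit by:", gap_after_exit)
```

      On these numbers, the rebalance started and exited while bucket3 was still loading on node 84, and the bucket became ready a few seconds after the rebalance had already given up.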

      The log files are attached at: https://s3.amazonaws.com/bugdb/jira/bug-rebalance-largecluster/bug1.tar


          People

            dipti Dipti Borkar (Inactive)
            ketaki Ketaki Gangal (Inactive)
