  1. Couchbase Server
  2. MB-8094

[Doc'd in 2.0.2] 1.8.1 & 2.0.2 mixed cluster: Rebalance exited with reason {badarg, [{ns_rebalancer, '-wait_for_memcached/3-lc$^0/1-0-',2}, {ns_rebalancer,wait_for_memcached,

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-new-rebalance-mixed-cluster/62/consoleFull
      ./testrunner -i /tmp/rebalance_in.ini get-logs=True,wait_timeout=180,GROUP=P0,EXCLUDE_GROUP=FROM_2_0,get-cbcollect-info=True -t rebalance.rebalancein.RebalanceInTests.rebalance_in_with_ops,nodes_in=3,GROUP=IN;P0
      rebalance_in_with_ops (rebalance.rebalancein.RebalanceInTests) ...

      mixed suite/cluster:
      1.8.1-937-rel
      10.3.3.92
      10.3.3.94
      10.3.3.93

      2.0.2-764-rel
      10.3.3.99
      10.3.3.91
      10.3.3.82
      10.3.3.97

      Add to 10.3.3.92 (1.8.1): one 1.8.1 node (10.3.3.93) and two 2.0.2 nodes (10.3.3.82, 10.3.3.99).

      2013-04-14 09:06:38 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.3.92%2Cns_1%4010.3.3.82%2Cns_1%4010.3.3.99%2Cns_1%4010.3.3.93
      2013-04-14 09:06:38 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance operation started
      2013-04-14 09:06:38 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 0 %
      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed
      2013-04-14 09:06:48 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] Latest logs from UI:
      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.99', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. Will try to take over.", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956005954.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.82', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. Will try to take over.", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956005954.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.93', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.3.3.93\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1365956004357.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.92', u'code': 2, u'text': u"Rebalance exited with reason {badarg,\n [{ns_rebalancer,\n '-wait_for_memcached/3-lc$^0/1-0-',2},\n {ns_rebalancer,wait_for_memcached,3},\n {ns_rebalancer,'-rebalance/3-fun-0-',5},\n {lists,foreach,2},\n {ns_rebalancer,rebalance,3}]}\n", u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1365956004324.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.99', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. But I won't try to take over since rebalance seems to be running", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956003954.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.82', u'code': 0, u'text': u"Candidate got master heartbeat from node 'ns_1@10.3.3.92' which has lower priority. But I won't try to take over since rebalance seems to be running", u'shortText': u'message', u'module': u'mb_master', u'tstamp': 1365956003954.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.82', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.3.3.82\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1365956003367.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.99', u'code': 1, u'text': u'Bucket "default" loaded on node \'ns_1@10.3.3.99\' in 0 seconds.', u'shortText': u'message', u'module': u'ns_memcached', u'tstamp': 1365956003359.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.92', u'code': 0, u'text': u'Started rebalancing bucket default', u'shortText': u'message', u'module': u'ns_rebalancer', u'tstamp': 1365956003298.0, u'type': u'info'}

      2013-04-14 09:06:48 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress]

      {u'node': u'ns_1@10.3.3.92', u'code': 4, u'text': u"Starting rebalance, KeepNodes = ['ns_1@10.3.3.92','ns_1@10.3.3.82',\n 'ns_1@10.3.3.99','ns_1@10.3.3.93'], EjectNodes = []\n", u'shortText': u'message', u'module': u'ns_orchestrator', u'tstamp': 1365956003257.0, u'type': u'info'}
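
The UI log entries above are listed newest first, which makes the sequence of events hard to follow. The following is a minimal sketch for reading them in chronological order. It assumes that `tstamp` is milliseconds since the Unix epoch (the dump does not state the unit). The entries are transcribed from the dump with the `text` field shortened, and the helper names `chronological` and `format_entry` are illustrative only.

```python
from datetime import datetime, timezone

# Entries transcribed from the "Latest logs from UI" dump above.
# The text field is shortened here for readability.
ui_log = [
    {"node": "ns_1@10.3.3.99", "module": "mb_master",
     "tstamp": 1365956005954.0, "text": "Will try to take over."},
    {"node": "ns_1@10.3.3.92", "module": "ns_orchestrator",
     "tstamp": 1365956004324.0, "text": "Rebalance exited with reason {badarg, ...}"},
    {"node": "ns_1@10.3.3.99", "module": "mb_master",
     "tstamp": 1365956003954.0,
     "text": "Won't try to take over since rebalance seems to be running"},
    {"node": "ns_1@10.3.3.92", "module": "ns_rebalancer",
     "tstamp": 1365956003298.0, "text": "Started rebalancing bucket default"},
    {"node": "ns_1@10.3.3.92", "module": "ns_orchestrator",
     "tstamp": 1365956003257.0, "text": "Starting rebalance"},
]


def chronological(entries):
    """Return entries oldest first (assumes tstamp is epoch milliseconds)."""
    return sorted(entries, key=lambda e: e["tstamp"])


def format_entry(entry):
    """Render one entry as 'HH:MM:SS.mmm module node text' in UTC."""
    when = datetime.fromtimestamp(entry["tstamp"] / 1000.0, tz=timezone.utc)
    return "{} {:<16} {:<15} {}".format(
        when.strftime("%H:%M:%S.%f")[:-3],
        entry["module"], entry["node"], entry["text"])


for entry in chronological(ui_log):
    print(format_entry(entry))
```

Read oldest first, the orchestrator on ns_1@10.3.3.92 (1.8.1) starts the rebalance, the 2.0.2 node declines to take over while the rebalance is running, the rebalance exits with badarg, and only afterwards does the 2.0.2 node try to take over.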

        Activity

        andreibaranouski Andrei Baranouski added a comment -

        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.82-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.91-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.92-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.93-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.94-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.97-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-8094/f1bcf097-954c-4366-bcc3-f0d6e684af29-10.3.3.99-diag.txt.gz
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We cannot retroactively fix 1.8.1.

        The problem is that you're requesting rebalance too soon after joining nodes to the cluster, so the 1.8.1 node remains master and runs the rebalance.

        Whoever does scripted rolling upgrades needs to add a delay of, say, 10 seconds between adding the first 2.0.2 node to a 1.8.1 cluster and requesting rebalance.
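
The delay described above could be scripted roughly as follows. This is a minimal sketch, not the change that went into the test suite: `add_node` and `start_rebalance` are hypothetical caller-supplied callables standing in for whatever the upgrade script uses to join a node and to request rebalance, and the `sleep` parameter is injectable only so the wait can be stubbed out when testing.

```python
import time


def add_node_then_rebalance(add_node, start_rebalance,
                            settle_seconds=10, sleep=time.sleep):
    """Add a node, wait settle_seconds, then request rebalance.

    add_node and start_rebalance are hypothetical callables supplied by
    the upgrade script. The 10-second default follows the suggestion in
    the comment above.
    """
    add_node()
    sleep(settle_seconds)  # wait after the join before requesting rebalance
    return start_rebalance()


# Usage with stubs that record the call order instead of touching a cluster.
calls = []
result = add_node_then_rebalance(
    add_node=lambda: calls.append("add"),
    start_rebalance=lambda: calls.append("rebalance") or "started",
    sleep=lambda seconds: calls.append(("sleep", seconds)),
)
print(calls, result)
```

Keeping the wait inside one helper means every add-then-rebalance step in a rolling-upgrade script gets the same delay, rather than relying on each call site to remember it.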

        maria Maria McDuff (Inactive) added a comment -

        By design.

        maria Maria McDuff (Inactive) added a comment -

        Karen, pls doc:

        per Alk:
        Whoever does scripted rolling upgrades needs to add a delay of, say, 10 seconds between adding the first 2.0.2 node to a 1.8.1 cluster and requesting rebalance.

        thuan Thuan Nguyen added a comment -

        Integrated in win-ui-testing-P0 #46 (See http://qa.hq.northscale.net/job/win-ui-testing-P0/46/)
        MB-8094: sleep 10sec before rebalance(mixed cluster) (Revision 2541a334a1aca3065e6c7a9475373fe28147c2fe)

        Result = SUCCESS
        andrei :
        Files :

        • lib/tasks/task.py
        kzeller kzeller added a comment - - edited

        Added to 2.0.2 RN as:

        If you perform an online upgrade and rebalance with 1.8.1 and 2.0.2 nodes, the rebalance may fail and produce the "Rebalance exited with reason {badarg, ...}" error. This is caused by requesting rebalance too quickly after adding a node. To avoid this problem, script a delay of 10 seconds after you add a node and before you request rebalance.

        Added to Use Online Upgrades for Couchbase Server 1.8 to Couchbase Server 2.0:

        Be aware that if you perform a scripted online upgrade from 1.8.x to 2.0, you should allow a 10-second delay between adding a 2.0 node to the cluster and rebalancing. If you request rebalance too soon after adding a 2.0 node, the rebalance may fail.

        maria Maria McDuff (Inactive) added a comment -

        andrei,

        pls review the release note from karen.

        andreibaranouski Andrei Baranouski added a comment -

        approve


          People

          • Assignee:
            andreibaranouski Andrei Baranouski
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0
            Watchers:
            5
