Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-19754

[windows] Rebalance failed with error {bulk_set_vbucket_state_failed,\n [{'ns_1@172.23.105.91',\n {'EXIT',\n {{{badmatch,{error,closed}},\n

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Cannot Reproduce
    • 4.5.0
    • 4.5.0
    • couchbase-bucket, memcached
    • Windows 2012 R2 64-bit
    • Untriaged
    • Windows 64-bit
    • Yes

    Description

      Run test to upgrade from 3.1.0 to 4.5.0-2589
      Start with two 3.1.0 nodes and create cluster
      Then add 2 nodes 4.5.0-2589 in. Rebalance. Rebalance failed at about 92%
      This test was passed in build 4.5.0-2587

      Here is the test command to run:
      newupgradetests.MultiNodesUpgradeTests.online_upgrade_rebalance_in_with_ops,initial_version=3.1.0-1776,sasl_buckets=1,standard_buckets=1,items=10000,nodes_init=2,nodes_in=1,nodes_out=1,ddocs-num=1,upgrade_version=4.5.0-2589,get-cbcollect-info=True,stop-on-failure=false,skip_init_check_cbserver=true

       
      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
      [2016-05-24 02:44:09,679] - [rest_client:2529] INFO - Latest logs from UI on 172.23.105.91:
      [2016-05-24 02:44:09,680] - [rest_client:2530] ERROR - {u'node': u'ns_1@172.23.107.119', u'code': 2, u'text': u"Rebalance exited with reason {unexpected_exit,\n                              {'EXIT',<0.16098.4>,\n                               {bulk_set_vbucket_state_failed,\n                                [{'ns_1@172.23.105.91',\n                                  {'EXIT',\n                                   {{{badmatch,{error,closed}},\n                                     {gen_server,call,\n                                      ['ns_memcached-default',\n                                       {set_vbucket,658,replica},\n                                       180000]}},\n                                    {gen_server,call,\n                                     [{'janitor_agent-default',\n                                       'ns_1@172.23.105.91'},\n                                      {if_rebalance,<0.2352.3>,\n                                       {update_vbucket_state,658,replica,\n                                        undefined,undefined}},\n                                      infinity]}}}}]}}}\n", u'shortText': u'message', u'serverTime': u'2016-05-24T02:44:06.118Z', u'module': u'ns_orchestrator', u'tstamp': 1464083046118, u'type': u'info'}
      [2016-05-24 02:44:09,680] - [rest_client:2530] ERROR - {u'node': u'ns_1@172.23.107.119', u'code': 0, u'text': u"<0.14922.4> exited with {unexpected_exit,\n                         {'EXIT',<0.16098.4>,\n                          {bulk_set_vbucket_state_failed,\n                           [{'ns_1@172.23.105.91',\n                             {'EXIT',\n                              {{{badmatch,{error,closed}},\n                                {gen_server,call,\n                                 ['ns_memcached-default',\n                                  {set_vbucket,658,replica},\n                                  180000]}},\n                               {gen_server,call,\n                                [{'janitor_agent-default',\n                                  'ns_1@172.23.105.91'},\n                                 {if_rebalance,<0.2352.3>,\n                                  {update_vbucket_state,658,replica,\n                                   undefined,undefined}},\n                                 infinity]}}}}]}}}", u'shortText': u'message', u'serverTime': u'2016-05-24T02:44:06.112Z', u'module': u'ns_vbucket_mover', u'tstamp': 1464083046112, u'type': u'critical'}
      [2016-05-24 02:44:09,681] - [rest_client:2530] ERROR - {u'node': u'ns_1@172.23.105.91', u'code': 0, u'text': u"Control connection to memcached on 'ns_1@172.23.105.91' disconnected: {badmatch,\n                                                                       {error,\n                                                                        closed}}", u'shortText': u'message', u'serverTime': u'2016-05-24T02:44:05.811Z', u'module': u'ns_memcached', u'tstamp': 1464083045811, u'type': u'info'}
      [2016-05-24 02:44:09,681] - [rest_client:2530] ERROR - {u'node': u'ns_1@172.23.105.91', u'code': 0, u'text': u"Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 1. Restarting. Messages: Tue May 24 02:44:05.533493 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 653) Attempting to add  stream with start seqno 390, end seqno 18446744073709551615, vbucket uuid 160618223825921, snap start seqno 390, and snap end seqno 390\nTue May 24 02:44:05.600491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 654) Attempting to add  stream with start seqno 140, end seqno 18446744073709551615, vbucket uuid 106785368150398, snap start seqno 140, and snap end seqno 140\nTue May 24 02:44:05.655491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 649) Attempting to add  stream with start seqno 27, end seqno 18446744073709551615, vbucket uuid 92987305102627, snap start seqno 27, and snap end seqno 27\nTue May 24 02:44:05.721495 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 656) Attempting to add  stream with start seqno 336, end seqno 18446744073709551615, vbucket uuid 269900064377010, snap start seqno 336, and snap end seqno 336\nTue May 24 02:44:05.770491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 655) Attempting to add  stream with start seqno 112, end seqno 18446744073709551615, vbucket uuid 119739259461295, snap start seqno 112, and snap end seqno 112", u'shortText': u'message', u'serverTime': u'2016-05-24T02:44:05.811Z', u'module': u'ns_log', u'tstamp': 1464083045811, u'type': u'info'}
      [2016-05-24 02:44:09,681] - [rest_client:2530] ERROR - {u'node': u'ns_1@172.23.107.119', u'code': 0, u'text': u'Bucket "default" rebalance appears to be swap rebalance', u'shortText': u'message', u'serverTime': u'2016-05-24T02:38:28.318Z', u'module': u'ns_vbucket_mover', u'tstamp': 1464082708318, u'type': u'info'}
      [2016-05-24 02:44:09,681] - [rest_client:2530] ERROR - {u'node': u'ns_1@172.23.107.119', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@172.23.107.119\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2016-05-24T02:38:26.856Z', u'module': u'ns_memcached', u'tstamp': 1464082706856, u'type': u'info'}
      

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          Memcached on .91 terminated unexpectedly:

          [user:info,2016-05-24T2:44:05.811,ns_1@172.23.105.91:<0.736.0>:ns_log:crash_consumption_loop:70]Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 1. Restarting. Messages: Tue May 24 02:44:05.533493 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 653) Attempting to add  stream with start seqno 390, end seqno 18446744073709551615, vbucket uuid 160618223825921, snap start seqno 390, and snap end seqno 390
          Tue May 24 02:44:05.600491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 654) Attempting to add  stream with start seqno 140, end seqno 18446744073709551615, vbucket uuid 106785368150398, snap start seqno 140, and snap end seqno 140
          Tue May 24 02:44:05.655491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 649) Attempting to add  stream with start seqno 27, end seqno 18446744073709551615, vbucket uuid 92987305102627, snap start seqno 27, and snap end seqno 27
          Tue May 24 02:44:05.721495 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 656) Attempting to add  stream with start seqno 336, end seqno 18446744073709551615, vbucket uuid 269900064377010, snap start seqno 336, and snap end seqno 336
          Tue May 24 02:44:05.770491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 655) Attempting to add  stream with start seqno 112, end seqno 18446744073709551615, vbucket uuid 119739259461295, snap start seqno 112, and snap end seqno 112
          

          It's not obvious to me why this happened. And memcached log file from that incarnation of memcached is empty.

          But let me note that node .91 is running 3.1.0. So it's very likely not even a "watson" bug. Second, we know that there's a plethora of problems in pretty much all 3.* and 3.1.* releases, so I'm not sure we should even test with 3.1.0 (3.1.5 looks like a better candidate to me).

          Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment - Memcached on .91 terminated unexpectedly: [user:info,2016-05-24T2:44:05.811,ns_1@172.23.105.91:<0.736.0>:ns_log:crash_consumption_loop:70]Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 1. Restarting. Messages: Tue May 24 02:44:05.533493 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 653) Attempting to add stream with start seqno 390, end seqno 18446744073709551615, vbucket uuid 160618223825921, snap start seqno 390, and snap end seqno 390 Tue May 24 02:44:05.600491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 654) Attempting to add stream with start seqno 140, end seqno 18446744073709551615, vbucket uuid 106785368150398, snap start seqno 140, and snap end seqno 140 Tue May 24 02:44:05.655491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 649) Attempting to add stream with start seqno 27, end seqno 18446744073709551615, vbucket uuid 92987305102627, snap start seqno 27, and snap end seqno 27 Tue May 24 02:44:05.721495 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 656) Attempting to add stream with start seqno 336, end seqno 18446744073709551615, vbucket uuid 269900064377010, snap start seqno 336, and snap end seqno 336 Tue May 24 02:44:05.770491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 655) Attempting to add stream with start seqno 112, end seqno 18446744073709551615, vbucket uuid 119739259461295, snap start seqno 112, and snap end seqno 112 It's not obvious to me why this happened. And memcached log file from that incarnation of memcached is empty. But let me note that node .91 is running 3.1.0. So it's very likely not even a "watson" bug. Second, we know that there's a plethora of problems in pretty much all 3.* and 3.1.* releases, so I'm not sure we should even test with 3.1.0 (3.1.5 looks like a better candidate to me).
          thuan Thuan Nguyen added a comment -

          Re-run this test upgrade to build 4.5.0-2594, the test is passed. I am running again upgrade to build 4.5.0-2589 to see if I could reproduce this.

          thuan Thuan Nguyen added a comment - Re-run this test upgrade to build 4.5.0-2594, the test is passed. I am running again upgrade to build 4.5.0-2589 to see if I could reproduce this.
          thuan Thuan Nguyen added a comment -

          Re-run this test upgrade to build 4.5.0-2589, test passed.

          thuan Thuan Nguyen added a comment - Re-run this test upgrade to build 4.5.0-2589, test passed.

          Tony, are you saying this is an intermittent issue? Also can you try with 3.1.5 -> 4.5 please?

          raju Raju Suravarjjala added a comment - Tony, are you saying this is an intermittent issue? Also can you try with 3.1.5 -> 4.5 please?
          drigby Dave Rigby added a comment -

          As per Aliaksey's comment, there's nothing in the memcached log - it simply stops and then we see ns_server restart it:

          Tue May 24 02:43:47.466390 Pacific Daylight Time 3: (standard_bucket0) DCP (Producer) eq_dcpq:mapreduce_view: standard_bucket0 _design/upgrade-test-view0 (prod/main) - (vb 489) stream created with start seqno 317 and end seqno 328
          ^M
          c:\Program Files\Couchbase\Server\var\lib\couchbase\logs>type memcached.log.6.txt ^M
          ^M
          c:\Program Files\Couchbase\Server\var\lib\couchbase\logs>type memcached.log.7.txt ^M
          Tue May 24 02:44:05.894489 Pacific Daylight Time 3: (No Engine) Bucket default registered with low priority
          

          The babysitter log has a bit more information - it's last log line is over a second later, but it's much of the same, with no indication of why it crashed:

          Tue May 24 02:44:05.770491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 655) Attempting to add  stream with start seqno 112, end seqno 18446744073709551615, vbucket uuid 119739259461295, snap start seqno 112, and snap end seqno 112
          

          Without a system configured to create Windows dump files I don't think there's much we can do here - you'll note that we only added Breakpad in 4.x. I'd also like to highlight Aliaksey's comment that the crashing node is 3.1.0 - so I'd argue this shouldn't have a affectsVersion of 4.5.0 - the crash is a 3.1.0 issue.

          IMHO this should not be a watson bug.

          Assigning back to QE to assess if we want to treat this as a 4.5.0 bug - and if so to attempt to reproduce and get a minidump file.

          drigby Dave Rigby added a comment - As per Aliaksey's comment, there's nothing in the memcached log - it simply stops and then we see ns_server restart it: Tue May 24 02:43:47.466390 Pacific Daylight Time 3: (standard_bucket0) DCP (Producer) eq_dcpq:mapreduce_view: standard_bucket0 _design/upgrade-test-view0 (prod/main) - (vb 489) stream created with start seqno 317 and end seqno 328 ^M c:\Program Files\Couchbase\Server\var\lib\couchbase\logs>type memcached.log.6.txt ^M ^M c:\Program Files\Couchbase\Server\var\lib\couchbase\logs>type memcached.log.7.txt ^M Tue May 24 02:44:05.894489 Pacific Daylight Time 3: (No Engine) Bucket default registered with low priority The babysitter log has a bit more information - it's last log line is over a second later, but it's much of the same, with no indication of why it crashed: Tue May 24 02:44:05.770491 Pacific Daylight Time 3: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.119->ns_1@172.23.105.91:default - (vb 655) Attempting to add stream with start seqno 112, end seqno 18446744073709551615, vbucket uuid 119739259461295, snap start seqno 112, and snap end seqno 112 Without a system configured to create Windows dump files I don't think there's much we can do here - you'll note that we only added Breakpad in 4.x. I'd also like to highlight Aliaksey's comment that the crashing node is 3.1.0 - so I'd argue this shouldn't have a affectsVersion of 4.5.0 - the crash is a 3.1.0 issue. IMHO this should not be a watson bug. Assigning back to QE to assess if we want to treat this as a 4.5.0 bug - and if so to attempt to reproduce and get a minidump file.

          Tony: Can you please retry and see if you can get a minidump?

          raju Raju Suravarjjala added a comment - Tony: Can you please retry and see if you can get a minidump?
          thuan Thuan Nguyen added a comment -

          I will retry in build 3.1.5

          thuan Thuan Nguyen added a comment - I will retry in build 3.1.5

          Re-test on build 3.1.5 upgrade to watson 2594, the test passed.
          I will mark this bug as could not reproduce and close it. When I hit this bug again, I will reopen and give live cluster to dev to investigate.

          thuan Thuan Nguyen added a comment - Re-test on build 3.1.5 upgrade to watson 2594, the test passed. I will mark this bug as could not reproduce and close it. When I hit this bug again, I will reopen and give live cluster to dev to investigate.

          People

            thuan Thuan Nguyen
            thuan Thuan Nguyen
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty