  Couchbase Server / MB-37109

Rebalance fails and memcached crashes seen in Ephemeral rebalance in out tests


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: 6.5.0
    • Component/s: couchbase-bucket
    • Environment: 6.5.0-4908

    Description

      Script to Repro

      ./testrunner -i /tmp/testexec.6198.ini -p get-cbcollect-info=False,bucket_type=ephemeral,GROUP=P1_Set2,get-cbcollect-info=True -t rebalance.rebalanceinout.RebalanceInOutTests.test_incremental_rebalance_in_out_with_mutation_and_expiration,items=100000,value_size=512,max_verify=100000,zone=2,GROUP=IN_OUT;P1;P1_Set2
      

      Test to repro

      Rebalances nodes into and out of the cluster while performing mutations
      and expirations. Use the 'zone' param with zone > 1 to divide the nodes
      into server groups.
       
      This test begins by loading a given number of items into the cluster.
      It then adds one node, rebalances that node into the cluster, and then
      rebalances it back out. During the rebalance we update half of the
      items in the cluster and expire the other half. Once the node has been
      removed and added back, we recreate the expired items, wait for the
      disk queues to drain, and then verify that there has been no data loss,
      i.e. that sum(curr_items) matches curr_items_total. We then remove and
      add back two nodes at a time, and so on, until we have reached the
      point where we are adding back and removing at least half of the nodes.
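
      The test flow above can be sketched as follows. This is an
      illustrative stand-in, not the actual testrunner code: `Cluster`,
      `rebalance_schedule`, and the key/value shapes are hypothetical.

      ```python
      # Hedged sketch of the incremental rebalance-in/out test loop,
      # modeled on a trivial in-memory "cluster" instead of real nodes.

      def rebalance_schedule(total_nodes):
          """Yield the batch size per cycle: 1, 2, ... up to half the nodes."""
          batch = 1
          while batch <= total_nodes // 2:
              yield batch
              batch += 1

      class Cluster:
          def __init__(self, items):
              self.items = dict(items)  # key -> value, i.e. the live items

          def curr_items_total(self):
              return len(self.items)

      def run_test(num_items=100000, total_nodes=4):
          # Load the initial items (value_size=512 in the repro command).
          cluster = Cluster({f"key-{i}": "x" * 512 for i in range(num_items)})
          keys = sorted(cluster.items)
          half = num_items // 2
          for batch in rebalance_schedule(total_nodes):
              # Rebalance `batch` nodes in, then back out (elided here);
              # during the rebalance, mutate half and expire the other half.
              for k in keys[:half]:
                  cluster.items[k] = "updated"      # mutate first half
              for k in keys[half:]:
                  cluster.items.pop(k, None)        # expire second half
              for k in keys[half:]:
                  cluster.items[k] = "recreated"    # recreate expired items
              # Data-loss check: sum(curr_items) must match curr_items_total.
              assert cluster.curr_items_total() == num_items
          return cluster.curr_items_total()
      ```

      In the real test the rebalance failed mid-loop because memcached on
      one node aborted (exit status 134, i.e. SIGABRT), as shown below.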
      

      Rebalance failure

      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_status_and_progress] {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      2019-12-01 22:08:23 | INFO | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.104.211:
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@172.23.104.216\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:20.343Z', u'module': u'ns_memcached', u'tstamp': 1575266900343, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u"Control connection to memcached on 'ns_1@172.23.104.216' disconnected. Check logs for details.", u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:19.303Z', u'module': u'ns_memcached', u'tstamp': 1575266899303, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u"Service 'memcached' exited with status 134. Restarting. Messages:\n2019-12-01T22:08:19.277234-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f6273aac000+0x8f213]\n2019-12-01T22:08:19.277277-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x74098]\n2019-12-01T22:08:19.277296-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77434]\n2019-12-01T22:08:19.277314-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77843]\n2019-12-01T22:08:19.277334-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77924]\n2019-12-01T22:08:19.277352-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x809f9]\n2019-12-01T22:08:19.277373-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x12f964]\n2019-12-01T22:08:19.277385-08:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f6275955000+0x8ee7]\n2019-12-01T22:08:19.277401-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f6273377000+0x7dd5]\n2019-12-01T22:08:19.277475-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f6272faa000+0xfdead]", u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:19.297Z', u'module': u'ns_log', u'tstamp': 1575266899297, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.211', u'code': 0, u'text': u'auto-reprovision is disabled as maximum number of nodes (1) that can be auto-reprovisioned has been reached.', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:18.669Z', u'module': u'auto_reprovision', u'tstamp': 1575266898669, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.211', u'code': 0, u'text': u'Bucket "default" has been reprovisioned on following nodes: [\'ns_1@172.23.104.220\']. Nodes on which the data service restarted: [\'ns_1@172.23.104.220\',\n                                                                                                                                 \'ns_1@172.23.104.243\'].', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:18.668Z', u'module': u'auto_reprovision', u'tstamp': 1575266898668, u'type': u'info'}
      

      Backtrace from gdb

      (gdb) bt
      #0  0x00007f6272fe0207 in __gconv_transform_internal_ucs2reverse () from /usr/lib64/libc-2.17.so
      #1  0x0000000000000006 in ?? ()
      #2  0x00007f6273025dc3 in wprintf () from /usr/lib64/libc-2.17.so
      #3  0x0000000000000001 in ?? ()
      #4  0x0000000a3affb1f0 in ?? ()
      #5  0x000000020000000e in ?? ()
      #6  0x00007f623affd600 in ?? ()
      #7  0x00007f623affb190 in ?? ()
      #8  0x00007f6271b5f400 in ?? ()
      #9  0x0000000000000068 in ?? ()
      #10 0x000000003affd600 in ?? ()
      #11 0x00007f623affb230 in ?? ()
      #12 0x00007f623affbe20 in ?? ()
      #13 0x0000000000000068 in ?? ()
      #14 0x00007f6272a00980 in ?? ()
      #15 0x00007f6274e5fd58 in tcache_alloc_small (slow_path=false, zero=false, binind=10, size=0, tcache=0x7f62730258ce <putwc_unlocked+30>, arena=<optimized out>, tsd=<optimized out>) at include/jemalloc/internal/tcache_inlines.h:60
      #16 arena_malloc (slow_path=false, tcache=0x7f62730258ce <putwc_unlocked+30>, zero=false, ind=10, size=0, arena=0x0, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:165
      #17 iallocztm (slow_path=false, arena=0x0, is_internal=false, tcache=0x7f62730258ce <putwc_unlocked+30>, zero=false, ind=10, size=0, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
      #18 imalloc_no_sample (ind=10, usize=0, size=0, tsd=0x7f627336d3a0 <_IO_obstack_jumps+128>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:1949
      #19 imalloc_body (tsd=0x7f627336d3a0 <_IO_obstack_jumps+128>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2123
      #20 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2258
      #21 je_malloc_default (size=<optimized out>) at src/jemalloc.c:2289
      #22 0x00007f627596043c in cb_malloc (size=0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_malloc.cc:51
      #23 0x00007f6276a000b9 in operator new (count=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/global_new_replacement.cc:71
      #24 0x00007f626e4faf71 in MutationResponse (sid=..., enableExpiryOut=Yes, includeCollectionID=(unknown: 32), includeDeleteTime=(unknown: 162), includeXattrs=Yes, includeVal=Yes, opaque=2, item=..., this=0x7f6238814c10)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/response.h:429
      #25 make_unique<MutationResponse, SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > const&, unsigned int const&, IncludeValue const&, IncludeXattrs const&, IncludeDeleteTime const&, DocKeyEncodesCollectionId const&, EnableExpiryOutput const&, cb::mcbp::DcpStreamId const&> () at /usr/local/include/c++/7.3.0/bits/unique_ptr.h:825
      #26 ActiveStream::makeResponseFromItem (this=<optimized out>, item=..., sendCommitSyncWriteAs=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1029
      #27 0x00007f626e4ff434 in ActiveStream::processItems (this=0x7f623affb3b0, this@entry=0x7f6238814c10, outstandingItemsResult=..., streamMutex=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1101
      #28 0x00007f626e4ff843 in ActiveStream::nextCheckpointItemTask (this=this@entry=0x7f6238814c10, streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:868
      #29 0x00007f626e4ff924 in ActiveStream::nextCheckpointItemTask (this=0x7f6238814c10) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:858
      #30 0x00007f626e5089f9 in ActiveStreamCheckpointProcessorTask::run (this=0x7f6238819110) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream_checkpoint_processor_task.cc:56
      #31 0x00007f626e5b7964 in ExecutorThread::run (this=0x7f6271b97960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:187
      #32 0x00007f627595dee7 in run (this=0x7f6271a6e670) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
      #33 platform_thread_wrap (arg=0x7f6271a6e670) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
      #34 0x00007f627337edd5 in start_thread () from /usr/lib64/libpthread-2.17.so
      #35 0x00007f62730a7ead in tdestroy_recurse () from /usr/lib64/libc-2.17.so
      #36 0x0000000000000000 in ?? ()
      (gdb) 
      

      cbcollect_info attached.
      Last successful run was on 6.5.0-4897.

            People

              Assignee: Balakumaran Gopal
              Reporter: Balakumaran Gopal
              Votes: 0
              Watchers: 4
