Couchbase Server
MB-5179

Dp4 rebalance fails after node restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0-developer-preview-4
    • Fix Version/s: 2.0-developer-preview-4
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      dp4 build 724 CentOS

      Description

      1) Create a 5-node cluster
      2) Shut down one node; wait for the disconnected state in the UI
      3) Power the node back on; wait for the connected state in the UI
      4) Remove the node and rebalance out; the rebalance fails
      (the host appears to have been unreachable after power-on)

      UI log shows:

      Rebalance exited with reason
      {noproc,
       {gen_server,call,
        [{'ns_vbm_sup-default','ns_1@10.1.2.222'},
         {start_child,
          {{new_child_id,"NOPQRSTUVWXY",'ns_1@10.1.3.182'},
           {ebucketmigrator_srv,start_link,
            [{"10.1.3.182",11210},
             {"10.1.2.222",11210},
             [{username,"default"},
              {password,[]},
              {vbuckets,"NOPQRSTUVWXY"},
              {takeover,false},
              {suffix,"ns_1@10.1.2.222"}]]},
           permanent,60000,worker,
           [ebucketmigrator_srv]}},
         infinity]}}
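      A {noproc, ...} exit like the one above means gen_server:call targeted a
      registered name ('ns_vbm_sup-default' on 'ns_1@10.1.2.222') that had no
      live process behind it when rebalance tried to start the ebucketmigrator
      child. A minimal, hypothetical Erlang sketch of that failure mode (the
      module, registered name, and request here are illustrative, not actual
      ns_server code):

      ```erlang
      %% Hypothetical demo: calling a gen_server by a registered name that no
      %% process currently holds -- analogous to the per-bucket vbucket-mover
      %% supervisor being gone right after the node restart -- makes
      %% gen_server:call exit with {noproc, {gen_server, call, _}}.
      -module(noproc_demo).
      -export([run/0]).

      run() ->
          Name = absent_vbm_sup,  %% nothing is registered under this name
          try
              gen_server:call(Name, {start_child, dummy_spec})
          catch
              %% gen_server:call exits (rather than throws) on a dead target
              exit:{noproc, {gen_server, call, _Args}} = Reason ->
                  {rebalance_exited, Reason}
          end.
      ```

      In the ticket's scenario the caller does not catch the exit, so the
      rebalancer process terminates and the UI reports "Rebalance exited with
      reason {noproc, ...}".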

      diags attached

      1. 10.1.2.222.tar.gz
        1.49 MB
        Tommie McAfee
      2. 10.1.2.223-8091-diag.txt.gz
        3.44 MB
        Tommie McAfee
      3. 10.1.3.180-8091-diag.txt.gz
        5.21 MB
        Tommie McAfee
      4. 10.1.3.181-8091-diag.txt.gz
        5.60 MB
        Tommie McAfee
      5. 10.1.3.182-8091-diag.txt.gz
        6.28 MB
        Tommie McAfee
      6. core.1199.log
        14 kB
        Tommie McAfee
      7. core.1203.log
        13 kB
        Tommie McAfee
      8. core.4713.log
        11 kB
        Tommie McAfee

        Activity

        chiyoung Chiyoung Seo added a comment -

        Tommie,

        Can you copy the full stack trace output from gdb? I want to see other threads as well.

        Thanks,

        tommie Tommie McAfee added a comment - edited

        core.4713.log

        tommie Tommie McAfee added a comment -

        The same workflow produces another stack trace similar to the first (attached as core.1203.log).

        To reproduce: stop/start the server, and once it is re-added, try to rebalance it out.

        tommie Tommie McAfee added a comment -

        It seems this workflow produces various kinds of crashes. Reproduced again, but with a different segfault (see core.1199.log).

        Program terminated with signal 11, Segmentation fault.
        #0 unlocked_find (this=0x16ed2480, vb=..., key="222befaf36627927", bucket_num=<value optimized out>, wantDeleted=false) at stored-value.hh:168
        168 stored-value.hh: No such file or directory.
        in stored-value.hh

        ......

        Thread 1 (Thread 0x44ad5940 (LWP 1214)):
        #0 unlocked_find (this=0x16ed2480, vb=..., key="222befaf36627927", bucket_num=<value optimized out>, wantDeleted=false) at stored-value.hh:168
        #1 EventuallyPersistentStore::fetchValidValue (this=0x16ed2480, vb=..., key="222befaf36627927", bucket_num=<value optimized out>, wantDeleted=false) at ep.cc:834
        #2 0x00002aaaaad2dd58 in EventuallyPersistentStore::getInternal(const std::locale::string &, uint16_t, const void *, bool, bool, ._101) (this=0x16ed2480, key="222befaf36627927",
        vbucket=<value optimized out>, cookie=0x16da42c0, queueBG=false, honorStates=false, allowedState=vbucket_state_active) at ep.cc:1348
        #3 0x00002aaaaad87d40 in TapProducer::getNextItem(const void *, uint16_t *, ._174 &) (this=0x1c018380, c=0x16da42c0, vbucket=0x44ad47b8, ret=@0x44ad067c) at ep.hh:469
        #4 0x00002aaaaad55e49 in doWalkTapQueue (this=0x16ed6000, cookie=0x16da42c0, itm=0x44ad47a8, es=0x44ad47a0, nes=0x44ad47bc, ttl=0x44ad47bf "\377@P\255D", flags=0x44ad47ba, seqno=0x44ad47b4,
        vbucket=0x44ad47b8) at ep_engine.cc:1446
        #5 EventuallyPersistentEngine::walkTapQueue (this=0x16ed6000, cookie=0x16da42c0, itm=0x44ad47a8, es=0x44ad47a0, nes=0x44ad47bc, ttl=0x44ad47bf "\377@P\255D", flags=0x44ad47ba, seqno=0x44ad47b4,
        vbucket=0x44ad47b8) at ep_engine.cc:1504
        #6 0x00002aaaaaaaf696 in bucket_tap_iterator_shim (handle=0x2aaaaacb5480, cookie=0x16da42c0, itm=0x44ad47a8, engine_specific=0x44ad47a0, nengine_specific=0x44ad47bc, ttl=0x44ad47bf "\377@P\255D",
        flags=0x44ad47ba, seqno=0x44ad47b4, vbucket=0x44ad47b8) at bucket_engine.c:2041
        #7 0x000000000040f892 in ship_tap_log (c=0x16da42c0) at daemon/memcached.c:2596
        #8 conn_ship_log (c=0x16da42c0) at daemon/memcached.c:5430
        #9 0x0000000000407234 in event_handler (fd=<value optimized out>, which=<value optimized out>, arg=0x16da42c0) at daemon/memcached.c:5884
        #10 0x00002b3f74a70df9 in event_process_active_single_queue (base=0x16e24780, flags=0) at event.c:1308
        #11 event_process_active (base=0x16e24780, flags=0) at event.c:1375
        #12 event_base_loop (base=0x16e24780, flags=0) at event.c:1572
        #13 0x0000000000413694 in worker_libevent (arg=0x129c6900) at daemon/thread.c:304
        #14 0x00002b3f7560b73d in start_thread () from /lib64/libpthread.so.0
        #15 0x00002b3f758f44bd in clone () from /lib64/libc.so.6
        ------------------------------------------------------------------

        chiyoung Chiyoung Seo added a comment -

        http://review.couchbase.org/#change,15527

          People

          • Assignee:
            chiyoung Chiyoung Seo
          • Reporter:
            tommie Tommie McAfee
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
            • Updated:
            • Resolved:

              Gerrit Reviews

              There are no open Gerrit changes