Couchbase Server / MB-4517

Rebalance stuck on 1.8 after adding nodes (rebalance gets stuck even when ack_seqno is correct, has_queued_item is true, and total_backlog_size > 1000)

    Details

      Description

      Added 5 nodes to a 10-node cluster.

      TAP stats from the node with one stuck TAP connection:
      /opt/couchbase/bin/cbstats localhost:11210 tap |grep rebalance

      eq_tapq:rebalance_725:ack_log_size: 248
      eq_tapq:rebalance_725:ack_playback_size: 248
      eq_tapq:rebalance_725:ack_seqno: 58249
      eq_tapq:rebalance_725:ack_window_full: false
      eq_tapq:rebalance_725:backfill_completed: false
      eq_tapq:rebalance_725:bg_backlog_size: 0
      eq_tapq:rebalance_725:bg_jobs_completed: 43790
      eq_tapq:rebalance_725:bg_jobs_issued: 43790
      eq_tapq:rebalance_725:bg_queue_size: 0
      eq_tapq:rebalance_725:bg_queued: 43790
      eq_tapq:rebalance_725:bg_result_size: 0
      eq_tapq:rebalance_725:bg_results: 0
      eq_tapq:rebalance_725:bg_wait_for_results: false
      eq_tapq:rebalance_725:complete: false
      eq_tapq:rebalance_725:connected: true
      eq_tapq:rebalance_725:created: 5423
      eq_tapq:rebalance_725:empty: false
      eq_tapq:rebalance_725:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
      eq_tapq:rebalance_725:has_item: false
      eq_tapq:rebalance_725:has_queued_item: true
      eq_tapq:rebalance_725:idle: false
      eq_tapq:rebalance_725:num_tap_nack: 0
      eq_tapq:rebalance_725:num_tap_tmpfail_survivors: 0
      eq_tapq:rebalance_725:paused: 1
      eq_tapq:rebalance_725:pending_backfill: false
      eq_tapq:rebalance_725:pending_disconnect: false
      eq_tapq:rebalance_725:pending_disk_backfill: false
      eq_tapq:rebalance_725:qlen: 0
      eq_tapq:rebalance_725:qlen_high_pri: 0
      eq_tapq:rebalance_725:qlen_low_pri: 1
      eq_tapq:rebalance_725:queue_backfillremaining: 0
      eq_tapq:rebalance_725:queue_backoff: 0
      eq_tapq:rebalance_725:queue_drain: 58240
      eq_tapq:rebalance_725:queue_fill: 0
      eq_tapq:rebalance_725:queue_itemondisk: 0
      eq_tapq:rebalance_725:queue_memory: 0
      eq_tapq:rebalance_725:rec_fetched: 14710
      eq_tapq:rebalance_725:recv_ack_seqno: 58000
      eq_tapq:rebalance_725:reserved: 1
      eq_tapq:rebalance_725:seqno_ack_requested: 58000
      eq_tapq:rebalance_725:supports_ack: true
      eq_tapq:rebalance_725:suspended: false
      eq_tapq:rebalance_725:total_backlog_size: 10327
      eq_tapq:rebalance_725:total_noops: 836
      eq_tapq:rebalance_725:type: producer
      eq_tapq:rebalance_725:vb_filter: { 725 }
      eq_tapq:rebalance_725:vb_filters: 1

      /opt/couchbase/bin/cbstats localhost:11210 all |grep mem
      ep_diskqueue_memory: 0
      ep_mem_high_wat: 7864320000
      ep_mem_low_wat: 6291456000
      mem_used: 6270355533
      vb_active_ht_memory: 25611040
      vb_active_itm_memory: 4777690775
      vb_active_perc_mem_resident: 32
      vb_active_queue_memory: 0
      vb_pending_ht_memory: 0
      vb_pending_itm_memory: 0
      vb_pending_perc_mem_resident: 0
      vb_pending_queue_memory: 0
      vb_replica_ht_memory: 17336480
      vb_replica_itm_memory: 1221834713
      vb_replica_perc_mem_resident: 11
      vb_replica_queue_memory: 0
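      Aside (this helper is not part of the original report; it is a sketch based on the symptoms above): the stuck-connection pattern can be detected mechanically from `cbstats tap` output by looking for a paused producer that still has queued items and an ack_seqno running ahead of recv_ack_seqno.

```python
# Sketch: parse `cbstats tap` output and flag TAP producers that match the
# stuck pattern seen in this ticket. Function names are mine, not Couchbase's.

def parse_tap_stats(text):
    """Group `eq_tapq:<conn>:<stat>: <value>` lines by connection name."""
    conns = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("eq_tapq:"):
            continue
        try:
            _, name, rest = line.split(":", 2)
            stat, value = rest.split(":", 1)
        except ValueError:
            continue
        conns.setdefault(name, {})[stat.strip()] = value.strip()
    return conns

def looks_stuck(stats):
    """Heuristic from this ticket: paused producer with queued items and
    the peer's acks lagging behind what was sent."""
    return (stats.get("paused") == "1"
            and stats.get("has_queued_item") == "true"
            and int(stats.get("ack_seqno", 0)) > int(stats.get("recv_ack_seqno", 0)))
```

      Applied to the rebalance_725 stats above, this flags the connection: paused is 1, has_queued_item is true, and ack_seqno (58249) is ahead of recv_ack_seqno (58000).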

      1. Screen Shot 2011-12-07 at 10.58.41 AM.png
        410 kB
        Karan Kumar
      2. Screen Shot 2011-12-07 at 10.59.10 AM.png
        365 kB
        Karan Kumar
      3. Screen Shot 2011-12-07 at 11.01.17 AM.png
        370 kB
        Karan Kumar


          Activity

          Karan Kumar (Inactive) added a comment:

          Server error during processing: ["web request failed",
            {path,"//pools/default/buckets/default"},
            {type,error},
            {what,{case_clause,rebalance_running}},
            {trace,[{menelaus_web_buckets,handle_bucket_delete,3},
                    {menelaus_web,loop,3},
                    {mochiweb_http,headers,5},
                    {proc_lib,init_p_do_apply,3}]}]

          Farshid Ghods (Inactive) added a comment:

          Unrelated. This says somebody was trying to delete a bucket while rebalance was still running (probably your test script).

          Karan Kumar (Inactive) added a comment:

          Attaching the diagnostic report.

          Karan Kumar (Inactive) added a comment:

          On 1.8.0r-36 with Chiyoung's and Mark's fixes, rebalance is still stuck.

          eq_tapq:rebalance_880:ack_log_size: 6
          eq_tapq:rebalance_880:ack_playback_size: 6
          eq_tapq:rebalance_880:ack_seqno: 61327
          eq_tapq:rebalance_880:ack_window_full: false
          eq_tapq:rebalance_880:backfill_completed: false
          eq_tapq:rebalance_880:bg_backlog_size: 0
          eq_tapq:rebalance_880:bg_jobs_completed: 44109
          eq_tapq:rebalance_880:bg_jobs_issued: 44109
          eq_tapq:rebalance_880:bg_queued: 44109
          eq_tapq:rebalance_880:bg_result_size: 0
          eq_tapq:rebalance_880:bg_results: 0
          eq_tapq:rebalance_880:bg_wait_for_results: false
          eq_tapq:rebalance_880:complete: false
          eq_tapq:rebalance_880:connected: true
          eq_tapq:rebalance_880:created: 6501
          eq_tapq:rebalance_880:empty: false
          eq_tapq:rebalance_880:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
          eq_tapq:rebalance_880:has_item: false
          eq_tapq:rebalance_880:has_queued_item: true
          eq_tapq:rebalance_880:idle: false
          eq_tapq:rebalance_880:num_tap_nack: 0
          eq_tapq:rebalance_880:num_tap_tmpfail_survivors: 0
          eq_tapq:rebalance_880:paused: 1
          eq_tapq:rebalance_880:pending_backfill: false
          eq_tapq:rebalance_880:pending_disconnect: false
          eq_tapq:rebalance_880:pending_disk_backfill: false
          eq_tapq:rebalance_880:qlen: 0
          eq_tapq:rebalance_880:qlen_high_pri: 0
          eq_tapq:rebalance_880:qlen_low_pri: 1
          eq_tapq:rebalance_880:queue_backfillremaining: 0
          eq_tapq:rebalance_880:queue_backoff: 0
          eq_tapq:rebalance_880:queue_drain: 61454
          eq_tapq:rebalance_880:queue_fill: 0
          eq_tapq:rebalance_880:queue_itemondisk: 0
          eq_tapq:rebalance_880:queue_memory: 0
          eq_tapq:rebalance_880:rec_fetched: 17488
          eq_tapq:rebalance_880:recv_ack_seqno: 61320
          eq_tapq:rebalance_880:reserved: 1
          eq_tapq:rebalance_880:seqno_ack_requested: 61320
          eq_tapq:rebalance_880:supports_ack: true
          eq_tapq:rebalance_880:suspended: false
          eq_tapq:rebalance_880:total_backlog_size: 2321
          eq_tapq:rebalance_880:total_noops: 1031
          eq_tapq:rebalance_880:type: producer
          eq_tapq:rebalance_880:vb_filter: { 880 }
          eq_tapq:rebalance_880:vb_filters: 1

          Dipti Borkar added a comment:

          There is a workaround for this problem: stop the rebalance, wait for 10 minutes, and start it again.
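          For reference, the workaround could be scripted against the ns_server REST interface. This sketch is not from the ticket, and the endpoint names and parameters (`/controller/stopRebalance`, `/controller/rebalance`, `knownNodes`, `ejectedNodes`) are assumptions about the management API, not something this ticket confirms; the function only composes the requests, it does not send them.

```python
# Sketch of the stop / wait / restart workaround as composed REST calls.
# Endpoint paths and parameter names are assumptions, not taken from the
# ticket. Returns the call plan rather than executing it.
import urllib.parse

def workaround_requests(host, known_nodes, wait_secs=600):
    """Return the (action, target, body) steps the workaround would take.

    known_nodes: otpNode names, e.g. "ns_1@10.6.87.130" (hypothetical value).
    """
    base = "http://%s:8091" % host
    stop = ("POST", base + "/controller/stopRebalance", "")
    body = urllib.parse.urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": "",
    })
    start = ("POST", base + "/controller/rebalance", body)
    # Dipti's 10-minute wait between the two calls.
    return [stop, ("SLEEP", wait_secs, ""), start]
```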

          Karan Kumar (Inactive) added a comment:

          Encountered again during large-scale testing. This one has a large queue_backfillremaining and seems similar to the one we have been seeing:

          /opt/membase/bin/mbstats localhost:11210 tap |grep rebalance
          eq_tapq:rebalance_393:ack_log_size: 9730
          eq_tapq:rebalance_393:ack_playback_size: 9730
          eq_tapq:rebalance_393:ack_seqno: 18149
          eq_tapq:rebalance_393:ack_window_full: false
          eq_tapq:rebalance_393:backfill_completed: false
          eq_tapq:rebalance_393:bg_backlog_size: 0
          eq_tapq:rebalance_393:bg_jobs_completed: 48108
          eq_tapq:rebalance_393:bg_jobs_issued: 48108
          eq_tapq:rebalance_393:bg_queue_size: 0
          eq_tapq:rebalance_393:bg_queued: 48108
          eq_tapq:rebalance_393:bg_result_size: 34962
          eq_tapq:rebalance_393:bg_results: 0
          eq_tapq:rebalance_393:bg_wait_for_results: false
          eq_tapq:rebalance_393:complete: false
          eq_tapq:rebalance_393:connected: true
          eq_tapq:rebalance_393:created: 78383
          eq_tapq:rebalance_393:empty: false
          eq_tapq:rebalance_393:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
          eq_tapq:rebalance_393:has_item: true
          eq_tapq:rebalance_393:has_queued_item: true
          eq_tapq:rebalance_393:idle: false
          eq_tapq:rebalance_393:num_tap_nack: 2085
          eq_tapq:rebalance_393:num_tap_tmpfail_survivors: 2085
          eq_tapq:rebalance_393:paused: 0
          eq_tapq:rebalance_393:pending_backfill: false
          eq_tapq:rebalance_393:pending_disconnect: false
          eq_tapq:rebalance_393:pending_disk_backfill: false
          eq_tapq:rebalance_393:qlen: 2442
          eq_tapq:rebalance_393:qlen_high_pri: 0
          eq_tapq:rebalance_393:qlen_low_pri: 1
          eq_tapq:rebalance_393:queue_backfillremaining: 37404
          eq_tapq:rebalance_393:queue_backoff: 2085
          eq_tapq:rebalance_393:queue_drain: 18140
          eq_tapq:rebalance_393:queue_fill: 0
          eq_tapq:rebalance_393:queue_itemondisk: 0
          eq_tapq:rebalance_393:queue_memory: 33360
          eq_tapq:rebalance_393:rec_fetched: 4998
          eq_tapq:rebalance_393:recv_ack_seqno: 8418
          eq_tapq:rebalance_393:reserved: 1
          eq_tapq:rebalance_393:supports_ack: true
          eq_tapq:rebalance_393:suspended: false
          eq_tapq:rebalance_393:total_backlog_size: 37405
          eq_tapq:rebalance_393:total_noops: 0
          eq_tapq:rebalance_393:type: producer
          eq_tapq:rebalance_393:vb_filter: { 393 }
          eq_tapq:rebalance_393:vb_filters: 1

          netstat -ntp
          tcp 252160 257024 10.6.117.231:33624 10.6.117.231:11210 ESTABLISHED 28275/beam.smp

          (gdb) t a a bt

          Thread 14 (Thread 0x7f69109a8700 (LWP 28579)):
          #0 0x00007f691119e4dd in read () from /lib64/libc.so.6
          #1 0x00007f6911136fd8 in _IO_new_file_underflow () from /lib64/libc.so.6
          #2 0x00007f6911138ade in _IO_default_uflow_internal () from /lib64/libc.so.6
          #3 0x00007f6911133fcb in getc () from /lib64/libc.so.6
          #4 0x00007f69109a9875 in check_stdin_thread (arg=0x4034c0) at extensions/daemon/stdin_check.c:19
          #5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 13 (Thread 0x7f69101a7700 (LWP 28580)):
          #0 0x00007f6911170bed in nanosleep () from /lib64/libc.so.6
          #1 0x00007f6911170a60 in sleep () from /lib64/libc.so.6
          #2 0x0000000000414f68 in check_isasl_db_thread (arg=<value optimized out>) at daemon/isasl.c:233
          #3 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #4 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 12 (Thread 0x7f690f79d700 (LWP 28581)):
          #0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
          #1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461500, tv=<value optimized out>) at epoll.c:404
          #2 0x00007f6911f5ca4c in event_base_loop (base=0x2461500, flags=0) at event.c:1558
          #3 0x00000000004133a4 in worker_libevent (arg=0x2443500) at daemon/thread.c:304
          #4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 11 (Thread 0x7f690ef9c700 (LWP 28582)):
          #0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
          #1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461280, tv=<value optimized out>) at epoll.c:404
          #2 0x00007f6911f5ca4c in event_base_loop (base=0x2461280, flags=0) at event.c:1558
          #3 0x00000000004133a4 in worker_libevent (arg=0x2443600) at daemon/thread.c:304
          #4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 10 (Thread 0x7f690e79b700 (LWP 28583)):
          #0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
          #1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461c80, tv=<value optimized out>) at epoll.c:404
          #2 0x00007f6911f5ca4c in event_base_loop (base=0x2461c80, flags=0) at event.c:1558
          #3 0x00000000004133a4 in worker_libevent (arg=0x2443700) at daemon/thread.c:304
          #4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 9 (Thread 0x7f690df9a700 (LWP 28584)):
          #0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
          #1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461a00, tv=<value optimized out>) at epoll.c:404
          #2 0x00007f6911f5ca4c in event_base_loop (base=0x2461a00, flags=0) at event.c:1558
          #3 0x00000000004133a4 in worker_libevent (arg=0x2443800) at daemon/thread.c:304
          #4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 8 (Thread 0x7f690d799700 (LWP 28585)):
          #0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
          #1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461780, tv=<value optimized out>) at epoll.c:404
          #2 0x00007f6911f5ca4c in event_base_loop (base=0x2461780, flags=0) at event.c:1558
          #3 0x00000000004133a4 in worker_libevent (arg=0x2443900) at daemon/thread.c:304
          #4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 7 (Thread 0x7f690cbd8700 (LWP 28586)):
          #0 0x00007f69114603cc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1 0x00007f690cc2a233 in wait (this=0x3772540) at syncobject.hh:31
          #2 Dispatcher::run (this=0x3772540) at dispatcher.cc:85
          #3 0x00007f690cc2ae43 in launch_dispatcher_thread (arg=0x3772594) at dispatcher.cc:28
          #4 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #5 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 6 (Thread 0x7f690c3d7700 (LWP 28587)):
          #0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1 0x00007f690cc280c0 in wait (this=0x2475380, d=...) at syncobject.hh:42
          #2 IdleTask::run (this=0x2475380, d=...) at dispatcher.cc:244
          #3 0x00007f690cc2a40f in Dispatcher::run (this=0x27f41c0) at dispatcher.cc:119
          #4 0x00007f690cc2ae43 in launch_dispatcher_thread (arg=0x27f4214) at dispatcher.cc:28
          #5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 5 (Thread 0x7f690bbd6700 (LWP 28588)):
          #0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1 0x00007f690cc280c0 in wait (this=0x2475a00, d=...) at syncobject.hh:42
          #2 IdleTask::run (this=0x2475a00, d=...) at dispatcher.cc:244
          #3 0x00007f690cc2a40f in Dispatcher::run (this=0x3772380) at dispatcher.cc:119
          #4 0x00007f690cc2ae43 in launch_dispatcher_thread (arg=0x37723d4) at dispatcher.cc:28
          #5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 4 (Thread 0x7f690b3d5700 (LWP 28589)):
          #0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1 0x00007f690cc48c77 in wait (this=0x2538000) at syncobject.hh:42
          #2 wait (this=0x2538000) at tapconnmap.hh:207
          #3 EventuallyPersistentEngine::notifyTapIoThread (this=0x2538000) at ep_engine.cc:3578
          #4 0x00007f690cc48de1 in EvpNotifyTapIo (arg=0x2538000) at ep_engine.cc:946
          #5 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #6 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 3 (Thread 0x7f6908bd0700 (LWP 2972)):
          #0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1 0x00007f690f7a1aa4 in engine_shutdown_thread (arg=0x1833c2a80) at bucket_engine.c:1432
          #2 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #3 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 2 (Thread 0x7f6907fc7700 (LWP 3958)):
          #0 0x00007f691146074b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1 0x00007f690f7a1aa4 in engine_shutdown_thread (arg=0x1a92c93c0) at bucket_engine.c:1432
          #2 0x00007f691145c7e1 in start_thread () from /lib64/libpthread.so.0
          #3 0x00007f69111ab77d in clone () from /lib64/libc.so.6

          Thread 1 (Thread 0x7f69125a2720 (LWP 28570)):
          #0 0x00007f69111abd73 in epoll_wait () from /lib64/libc.so.6
          #1 0x00007f6911f6dc28 in epoll_dispatch (base=0x2461000, tv=<value optimized out>) at epoll.c:404
          #2 0x00007f6911f5ca4c in event_base_loop (base=0x2461000, flags=0) at event.c:1558
          #3 0x0000000000409984 in main (argc=<value optimized out>, argv=<value optimized out>) at daemon/memcached.c:7459

          Chiyoung Seo added a comment:

          This is the netstat output from the node that caused this rebalance to hang:

          [ec2-user@ip-10-6-87-130 ~]$ netstat -ntp
          (No info could be read for "-p": geteuid()=222 but you should be root.)
          Active Internet connections (w/o servers)
          Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
          tcp 0 0 10.6.87.130:60709 10.191.7.242:21100 ESTABLISHED -
          tcp 0 0 10.6.87.130:47520 10.6.117.231:21100 ESTABLISHED -
          tcp 0 0 10.6.87.130:48473 10.6.89.137:21100 ESTABLISHED -
          tcp 0 0 127.0.0.1:55531 127.0.0.1:4369 ESTABLISHED -
          tcp 252160 260992 10.6.87.130:34078 10.6.87.130:11210 ESTABLISHED -
          tcp 0 0 10.6.87.130:11210 10.6.87.130:41679 ESTABLISHED -
          tcp 0 0 10.6.87.130:11210 10.6.87.130:37700 ESTABLISHED -
          tcp 0 0 10.6.87.130:47403 10.6.87.130:11210 ESTABLISHED -
          tcp 0 0 10.6.87.130:21100 10.6.133.36:50821 ESTABLISHED -
          tcp 0 0 10.6.87.130:44795 10.6.151.112:11210 ESTABLISHED -
          tcp 0 0 10.6.87.130:22 10.68.154.26:42398 ESTABLISHED -
          tcp 0 0 127.0.0.1:8091 127.0.0.1:53757 ESTABLISHED -
          tcp 0 0 10.6.87.130:41679 10.6.87.130:11210 ESTABLISHED -
          tcp 74460 485120 10.6.87.130:11210 10.6.87.130:34078 ESTABLISHED -
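          The telltale lines above are the loopback 11210 pair with large Recv-Q/Send-Q values. As an aside (not part of the original comment), such sockets can be picked out of `netstat -ntp` output mechanically; the threshold below is an arbitrary assumption, not a Couchbase constant.

```python
# Sketch: scan `netstat -ntp` output for TCP connections whose receive or
# send queue is backed up, like the stalled 11210 pair in this comment.

def backed_up_connections(netstat_text, threshold=65536):
    """Return (local, remote, recv_q, send_q) for sockets over the threshold."""
    flagged = []
    for line in netstat_text.splitlines():
        parts = line.split()
        # Data rows look like: tcp <Recv-Q> <Send-Q> <local> <remote> <state> ...
        if len(parts) < 6 or parts[0] != "tcp":
            continue
        try:
            recv_q, send_q = int(parts[1]), int(parts[2])
        except ValueError:
            continue
        if recv_q > threshold or send_q > threshold:
            flagged.append((parts[3], parts[4], recv_q, send_q))
    return flagged
```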

          Karan Kumar (Inactive) added a comment:

          Have not been able to reproduce this with the 1.8.0r-51 build. Will keep running more rebalance tests on this build.

          Aleksey Kondratenko (Inactive) added a comment:

          We believe this was caused by ebucketmigrator unknowingly not buffering enough nacks in its upstream direction, causing a deadlock: ebucketmigrator was waiting for the upstream memcached/ep-engine to consume the nacks it had sent, while memcached/ep-engine was waiting for ebucketmigrator to consume the TAP messages sent to it.

          The fix is here: http://review.couchbase.org/11859 and has already been merged.
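          The mutual wait described above can be illustrated with a toy model (mine, not the actual ebucketmigrator or ep-engine code; capacities and the 2-nacks-per-message rate are arbitrary assumptions): two peers with bounded outgoing buffers, where each side blocks on its own write before reading from the peer. Once both buffers fill, neither can make progress; giving the nack side a larger buffer (the spirit of the fix) lets the toy run indefinitely.

```python
# Toy model of the TAP nack deadlock: the producer (memcached/ep-engine)
# writes a tap message before draining one nack; the migrator consumes a
# tap message only if its two resulting nacks fit upstream. Blocking-write-
# before-read on both sides deadlocks once both buffers are full.

def simulate(tap_cap, nack_cap, steps=50):
    taps, nacks = [], []  # in-flight tap messages and in-flight nacks
    for _ in range(steps):
        moved = False
        # Producer side: blocking write first, read second.
        if len(taps) < tap_cap:
            taps.append("tap")
            moved = True
            if nacks:
                nacks.pop(0)  # drains a nack only after its write succeeds
        # Migrator side: reads a tap only if it can emit both nacks upstream.
        if len(nacks) + 2 <= nack_cap and taps:
            taps.pop(0)
            nacks.extend(["nack", "nack"])
            moved = True
        if not moved:
            return "deadlock"
    return "ok"
```

          With small equal buffers the toy deadlocks within a few steps; with a much larger nack buffer it keeps moving, at the cost of a growing nack backlog that the producer drains one at a time.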

          Karan Kumar (Inactive) added a comment:

          Have not been able to hit this issue again.

          Karan Kumar (Inactive) added a comment:

          Closing old tickets.

          Perry Krug added a comment:

          Reopening temporarily


            People

            • Assignee: Karan Kumar (Inactive)
            • Reporter: Karan Kumar (Inactive)
            • Votes: 0
            • Watchers: 1

