Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-2461

persistence write queue size stuck at nonzero after Rebalance

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • 1.6.0 GA
    • 1.6.0 beta4
    • couchbase-bucket
    • None
    • Operating System: All
      Platform: All

    Description

      on osx, 1.6.0beta4-63, while trying to reproduce another bug...

      • While under memcachetest load...

      ./memcachetest -h 127.0.0.1:12001 -i 100000 -c 50000 -l -t 8

      • Join together 6 nodes, and hit Rebalance...
      • After rebalance is done, in the web ui, the "Persistence write queue size" graph starts slowly coming down, but gets stuck at ~138K. (still stuck after 20 minutes.) That graph is defined as sum of ep_queue_size + ep_flusher_todo.
      • The first node appears to have a ep_queue_size of ~138K, with zero-sized queues on all other nodes...

      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12000 all | grep todo
      ep_flusher_todo: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12002 all | grep todo
      ep_flusher_todo: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12004 all | grep todo
      ep_flusher_todo: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12008 all | grep todo
      ep_flusher_todo: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12010 all | grep todo
      ep_flusher_todo: 0

      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12000 all | grep queue_size
      ep_queue_size: 138128
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12002 all | grep queue_size
      ep_queue_size: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12004 all | grep queue_size
      ep_queue_size: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12006 all | grep queue_size
      ep_queue_size: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12008 all | grep queue_size
      ep_queue_size: 0
      stevenmb:~ steveyen$ ~/dev/ep-engine/management/stats 127.0.0.1:12010 all | grep queue_size
      ep_queue_size: 0

      --------------
      Also, two stack traces for memcached separated by 20 minutes or so...

      (gdb) thread apply all bt

      Thread 11 (process 91693):
      #0 0x00007fff841de316 in __semwait_signal ()
      #1 0x00007fff841e2131 in _pthread_cond_wait ()
      #2 0x0000000101021f31 in EventuallyPersistentEngine::notifyTapIoThread (this=0x10080dc00) at syncobject.hh:42
      #3 0x0000000101022099 in EvpNotifyTapIo (arg=<value temporarily unavailable, due to optimizations>) at ep_engine.cc:631
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 10 (process 91693):
      #0 0x00007fff841de316 in __semwait_signal ()
      #1 0x00007fff841e2131 in _pthread_cond_wait ()
      #2 0x0000000101002293 in SyncObject::wait () at /Users/steveyen/work/northscale/dev/wallace_master/components/tmp/repo/ep-engine/syncobject.hh:57
      #3 0x0000000101002293 in Dispatcher::run (this=0x1001137a0) at dispatcher.cc:57
      #4 0x0000000101002ce5 in launch_dispatcher_thread (arg=<value temporarily unavailable, due to optimizations>) at dispatcher.cc:26
      #5 0x00007fff841dc8b6 in _pthread_start ()
      #6 0x00007fff841dc769 in thread_start ()

      Thread 9 (process 91693):
      #0 0x00007fff841add46 in read ()
      #1 0x0000000101050edf in unixRead () at atomicity.h:51
      #2 0x0000000101063df6 in readDbPage () at atomicity.h:51
      #3 0x00000001010666de in sqlite3PagerAcquire () at atomicity.h:51
      #4 0x00000001010712e6 in moveToChild () at atomicity.h:51
      #5 0x000000010108b8a5 in sqlite3BtreeNext () at atomicity.h:51
      #6 0x00000001010dac07 in sqlite3VdbeExec () at atomicity.h:51
      #7 0x00000001010de896 in sqlite3_step () at atomicity.h:51
      #8 0x00000001010343c9 in PreparedStatement::execute (this=0x10010d990) at sqlite-pst.cc:47
      #9 0x0000000101032a76 in StrategicSqlite3::delVBucket (this=0x100109e20, vbucket=184) at sqlite-kvstore.cc:152
      #10 0x000000010100c5b6 in EventuallyPersistentStore::completeVBucketDeletion (this=0x100863000, vbid=184) at ep.cc:457
      #11 0x00000001010156d1 in VBucketDeletionCallback::callback (this=<value temporarily unavailable, due to optimizations>, d=<value temporarily unavailable, due to optimizations>, t=<value temporarily unavailable, due to optimizations>) at ep.cc:160
      #12 0x0000000101002405 in Dispatcher::run (this=0x1001143c0) at dispatcher.hh:87
      #13 0x0000000101002ce5 in launch_dispatcher_thread (arg=<value temporarily unavailable, due to optimizations>) at dispatcher.cc:26
      #14 0x00007fff841dc8b6 in _pthread_start ()
      #15 0x00007fff841dc769 in thread_start ()

      Thread 8 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100102b90, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100102b90, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10002ffc0) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 7 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100102420, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100102420, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082d148) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 6 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100101cb0, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100101cb0, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082d030) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 5 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100101540, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100101540, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082cf18) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 4 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100100dd0, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100100dd0, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082ce00) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 3 (process 91693):
      #0 0x00007fff841de316 in __semwait_signal ()
      #1 0x00007fff841de1a5 in nanosleep ()
      #2 0x00007fff8422b0a4 in sleep ()
      #3 0x00000001000140e8 in check_isasl_db_thread (arg=<value temporarily unavailable, due to optimizations>) at isasl.c:215
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 2 (process 91693):
      #0 0x00007fff841a5362 in read$NOCANCEL ()
      #1 0x00007fff841f48be in __sread ()
      #2 0x00007fff841ec604 in _sread ()
      #3 0x00007fff841ec49a in __srefill ()
      #4 0x00007fff841f765a in __srget ()
      #5 0x00007fff841f7616 in getc ()
      #6 0x0000000100062dd8 in check_stdin_thread (arg=0x100001ad0) at stdin_check.c:18
      #7 0x00007fff841dc8b6 in _pthread_start ()
      #8 0x00007fff841dc769 in thread_start ()

      Thread 1 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100100300, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100100300, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x00000001000057d9 in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at memcached.c:6948
      #4 0x0000000100001054 in start ()

      -------------------

      (gdb) thread apply all bt

      Thread 11 (process 91693):
      #0 0x00007fff841de316 in __semwait_signal ()
      #1 0x00007fff841e2131 in _pthread_cond_wait ()
      #2 0x0000000101021f31 in EventuallyPersistentEngine::notifyTapIoThread (this=0x10080dc00) at syncobject.hh:42
      #3 0x0000000101022099 in EvpNotifyTapIo (arg=<value temporarily unavailable, due to optimizations>) at ep_engine.cc:631
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 10 (process 91693):
      #0 0x00007fff841de316 in __semwait_signal ()
      #1 0x00007fff841e2131 in _pthread_cond_wait ()
      #2 0x0000000101002293 in SyncObject::wait () at /Users/steveyen/work/northscale/dev/wallace_master/components/tmp/repo/ep-engine/syncobject.hh:57
      #3 0x0000000101002293 in Dispatcher::run (this=0x1001137a0) at dispatcher.cc:57
      #4 0x0000000101002ce5 in launch_dispatcher_thread (arg=<value temporarily unavailable, due to optimizations>) at dispatcher.cc:26
      #5 0x00007fff841dc8b6 in _pthread_start ()
      #6 0x00007fff841dc769 in thread_start ()

      Thread 9 (process 91693):
      #0 0x00007fff841add46 in read ()
      #1 0x0000000101050edf in unixRead () at atomicity.h:51
      #2 0x0000000101063df6 in readDbPage () at atomicity.h:51
      #3 0x00000001010666de in sqlite3PagerAcquire () at atomicity.h:51
      #4 0x00000001010712e6 in moveToChild () at atomicity.h:51
      #5 0x000000010108b8a5 in sqlite3BtreeNext () at atomicity.h:51
      #6 0x00000001010dac07 in sqlite3VdbeExec () at atomicity.h:51
      #7 0x00000001010de896 in sqlite3_step () at atomicity.h:51
      #8 0x00000001010343c9 in PreparedStatement::execute (this=0x10010d990) at sqlite-pst.cc:47
      #9 0x0000000101032a76 in StrategicSqlite3::delVBucket (this=0x100109e20, vbucket=724) at sqlite-kvstore.cc:152
      #10 0x000000010100c5b6 in EventuallyPersistentStore::completeVBucketDeletion (this=0x100863000, vbid=724) at ep.cc:457
      #11 0x00000001010156d1 in VBucketDeletionCallback::callback (this=<value temporarily unavailable, due to optimizations>, d=<value temporarily unavailable, due to optimizations>, t=<value temporarily unavailable, due to optimizations>) at ep.cc:160
      #12 0x0000000101002405 in Dispatcher::run (this=0x1001143c0) at dispatcher.hh:87
      #13 0x0000000101002ce5 in launch_dispatcher_thread (arg=<value temporarily unavailable, due to optimizations>) at dispatcher.cc:26
      #14 0x00007fff841dc8b6 in _pthread_start ()
      #15 0x00007fff841dc769 in thread_start ()

      Thread 8 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100102b90, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100102b90, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10002ffc0) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 7 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100102420, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100102420, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082d148) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 6 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100101cb0, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100101cb0, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082d030) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 5 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100101540, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100101540, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082cf18) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 4 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100100dd0, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100100dd0, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x0000000100012631 in worker_libevent (arg=0x10082ce00) at thread.c:252
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 3 (process 91693):
      #0 0x00007fff841de316 in __semwait_signal ()
      #1 0x00007fff841de1a5 in nanosleep ()
      #2 0x00007fff8422b0a4 in sleep ()
      #3 0x00000001000140e8 in check_isasl_db_thread (arg=<value temporarily unavailable, due to optimizations>) at isasl.c:215
      #4 0x00007fff841dc8b6 in _pthread_start ()
      #5 0x00007fff841dc769 in thread_start ()

      Thread 2 (process 91693):
      #0 0x00007fff841a5362 in read$NOCANCEL ()
      #1 0x00007fff841f48be in __sread ()
      #2 0x00007fff841ec604 in _sread ()
      #3 0x00007fff841ec49a in __srefill ()
      #4 0x00007fff841f765a in __srget ()
      #5 0x00007fff841f7616 in getc ()
      #6 0x0000000100062dd8 in check_stdin_thread (arg=0x100001ad0) at stdin_check.c:18
      #7 0x00007fff841dc8b6 in _pthread_start ()
      #8 0x00007fff841dc769 in thread_start ()

      Thread 1 (process 91693):
      #0 0x00007fff841bc4ea in kevent ()
      #1 0x0000000100020df7 in kq_dispatch (base=0x100100300, tv=<value temporarily unavailable, due to optimizations>) at kqueue.c:261
      #2 0x000000010001b0b0 in event_base_loop (base=0x100100300, flags=<value temporarily unavailable, due to optimizations>) at event.c:1536
      #3 0x00000001000057d9 in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at memcached.c:6948
      #4 0x0000000100001054 in start ()

      --------------

      and, stats all from the first node...

      $ ~/dev/ep-engine/management/stats 127.0.0.1:12000 all auth_cmds: 877
      auth_errors: 0
      bucket_conns: 21
      bytes_read: 1879088598
      bytes_written: 1400424716
      cas_badval: 0
      cas_hits: 0
      cas_misses: 0
      cmd_flush: 0
      cmd_get: 640549
      cmd_set: 846433
      conn_yields: 563
      connection_structures: 57
      curr_connections: 57
      curr_items: 38371
      curr_items_tot: 76499
      daemon_connections: 10
      decr_hits: 0
      decr_misses: 0
      delete_hits: 0
      delete_misses: 0
      ep_bg_fetched: 0
      ep_commit_num: 14
      ep_commit_time: 6
      ep_commit_time_total: 47
      ep_data_age: 100
      ep_data_age_highwat: 114
      ep_dbinit: 0
      ep_dbname: /Users/steveyen/work/northscale/dev/ns_server/data/n_0/default
      ep_dbshards: 4
      ep_expired: 0
      ep_flush_duration: 94
      ep_flush_duration_highwat: 94
      ep_flush_duration_total: 202
      ep_flush_preempts: 0
      ep_flusher_state: running
      ep_flusher_todo: 0
      ep_io_num_read: 47070
      ep_io_num_write: 360985
      ep_io_read_bytes: 96944547
      ep_io_write_bytes: 741345244
      ep_item_commit_failed: 0
      ep_item_flush_expired: 0
      ep_item_flush_failed: 0
      ep_kv_size: 147564333
      ep_max_data_size: 87381333
      ep_max_txn_size: 50000
      ep_mem_high_wat: 65535999
      ep_mem_low_wat: 52428799
      ep_min_data_age: 0
      ep_num_eject_failures: 275065
      ep_num_expiry_pager_runs: 0
      ep_num_non_resident: 41497
      ep_num_not_my_vbuckets: 6658
      ep_num_pager_runs: 4
      ep_num_value_ejects: 80495
      ep_oom_errors: 0
      ep_overhead: 12652399
      ep_pending_ops: 0
      ep_pending_ops_max: 0
      ep_pending_ops_max_duration: 0
      ep_pending_ops_total: 0
      ep_queue_age_cap: 900
      ep_queue_size: 138128
      ep_storage_age: 97
      ep_storage_age_highwat: 111
      ep_storage_type: featured
      ep_tap_keepalive: 0
      ep_tmp_oom_errors: 100
      ep_too_old: 0
      ep_too_young: 0
      ep_total_cache_size: 312926981
      ep_total_del_items: 0
      ep_total_enqueued: 498216
      ep_total_new_items: 229917
      ep_total_persisted: 360088
      ep_vbucket_del: 44
      ep_vbucket_del_fail: 0
      ep_version: 1.6.0beta4_98_gaddeb09
      ep_warmed_up: 0
      ep_warmup: true
      ep_warmup_dups: 0
      ep_warmup_oom: 0
      ep_warmup_thread: complete
      ep_warmup_time: 0
      get_hits: 640549
      get_misses: 0
      incr_hits: 0
      incr_misses: 0
      libevent: 2.0.7-rc
      limit_maxbytes: 67108864
      mem_used: 160216732
      pid: 91693
      pointer_size: 64
      rejected_conns: 0
      rusage_system: 230.418806
      rusage_user: 81.542152
      tap_connect_received: 863
      tap_mutation_received: 44133
      tap_mutation_sent: 236930
      tap_vbucket_set_sent: 1711
      threads: 4
      time: 1286394570
      total_connections: 938
      uptime: 5217
      version: 1.4.4_298_g250909b

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              dustin@sallings.org Dustin Sallings (Inactive)
              steve.yen@northscale.com Steve Yen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty