Couchbase Server / MB-7601

memcached crashed in notifyIOComplete (TapConnMap::notifyPausedConnection_UNLOCKED) when rebalancing a mixed 1.8.1/2.0.1 cluster after 2.0.1 node warms up

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None

      Description

      http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-new-rebalance-mixed-cluster/21/consoleFull
      ./testrunner -i /tmp/rebalance_in.ini get-logs=True,wait_timeout=180,GROUP=P0,EXCLUDE_GROUP=FROM_2_0 -t rebalance.rebalanceout.RebalanceOutTests.rebalance_out_with_warming_up,nodes_out=3,items=500000,replicas=2,max_verify=100000,GROUP=OUT;P0

      nodes:
      1.8.1-937-rel
      [10.3.3.92]
      [10.3.3.93]
      [10.3.3.94]

      2.0.1-141-rel
      [10.3.3.99]
      [10.3.3.82]
      [10.3.3.91]
      [10.3.3.97]

      steps:
      1. cluster with 10.3.3.91, 10.3.3.92, 10.3.3.94, 10.3.3.82, 10.3.3.93, 10.3.3.99, 10.3.3.97; default bucket with 500K items
      2. restart 10.3.3.91 and, without waiting for warmup to complete, start a rebalance:
      password=password&ejectedNodes=ns_1%4010.3.3.82%2Cns_1%4010.3.3.94%2Cns_1%4010.3.3.97&user=Administrator&knownNodes=ns_1%4010.3.3.91%2Cns_1%4010.3.3.92%2Cns_1%4010.3.3.94%2Cns_1%4010.3.3.82%2Cns_1%4010.3.3.93%2Cns_1%4010.3.3.99%2Cns_1%4010.3.3.97
      3. rebalance fails as expected
      4. wait until warmup completes and restart the rebalance

      result: rebalance failed
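      The rebalance request in step 2 is a plain form-encoded POST to ns_server's /controller/rebalance endpoint. As a minimal sketch of how the logged request body is assembled (the `rebalance_params` helper is hypothetical; node names and parameters are taken from the logged URL-encoded string above):

```python
from urllib.parse import urlencode

def rebalance_params(user, password, known_nodes, ejected_nodes):
    """Build the form body for POST /controller/rebalance (hypothetical helper).

    Nodes are identified by their Erlang OTP names, e.g. "ns_1@10.3.3.91",
    and joined with commas before URL-encoding, matching the logged request
    ("@" encodes to %40 and "," to %2C).
    """
    return urlencode({
        "user": user,
        "password": password,
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
    })

body = rebalance_params(
    "Administrator", "password",
    known_nodes=["ns_1@10.3.3.91", "ns_1@10.3.3.92", "ns_1@10.3.3.94",
                 "ns_1@10.3.3.82", "ns_1@10.3.3.93", "ns_1@10.3.3.99",
                 "ns_1@10.3.3.97"],
    ejected_nodes=["ns_1@10.3.3.82", "ns_1@10.3.3.94", "ns_1@10.3.3.97"],
)
print(body)
```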

      test logs:

      [2013-01-24 08:12:18,825] - [remote_util:117] INFO - connecting to 10.3.3.91 with username : root password : password ssh_key:
      [2013-01-24 08:12:19,157] - [remote_util:149] INFO - Connected
      [2013-01-24 08:12:19,360] - [remote_util:1206] INFO - running command.raw sudo cat /proc/cpuinfo
      [2013-01-24 08:12:19,487] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:19,489] - [remote_util:1206] INFO - running command.raw sudo df -Th
      [2013-01-24 08:12:19,603] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:19,604] - [remote_util:1206] INFO - running command.raw sudo cat /proc/meminfo
      [2013-01-24 08:12:19,720] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:19,721] - [remote_util:1206] INFO - running command.raw hostname
      [2013-01-24 08:12:19,833] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:19,836] - [remote_util:1206] INFO - running command.raw /etc/init.d/couchbase-server stop
      [2013-01-24 08:12:29,231] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:29,232] - [remote_util:1183] INFO - Stopping couchbase-server
      [2013-01-24 08:12:29,235] - [remote_util:1183] INFO - =INFO REPORT==== 24-Jan-2013::08:15:40 ===
      [2013-01-24 08:12:29,238] - [remote_util:1183] INFO - Initiated server shutdown** at node ns_1@10.3.3.91 **
      [2013-01-24 08:12:29,239] - [remote_util:1183] INFO -
      [2013-01-24 08:12:29,240] - [remote_util:1183] INFO - =INFO REPORT==== 24-Jan-2013::08:15:47 ===
      [2013-01-24 08:12:29,241] - [remote_util:1183] INFO - Stopped ns_server application** at node ns_1@10.3.3.91 **
      [2013-01-24 08:12:29,242] - [remote_util:1183] INFO -
      [2013-01-24 08:12:49,457] - [remote_util:1206] INFO - running command.raw sudo cat /proc/cpuinfo
      [2013-01-24 08:12:49,567] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:49,568] - [remote_util:1206] INFO - running command.raw sudo df -Th
      [2013-01-24 08:12:49,689] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:49,691] - [remote_util:1206] INFO - running command.raw sudo cat /proc/meminfo
      [2013-01-24 08:12:49,780] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:49,781] - [remote_util:1206] INFO - running command.raw hostname
      [2013-01-24 08:12:49,878] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:49,881] - [remote_util:1206] INFO - running command.raw /etc/init.d/couchbase-server start
      [2013-01-24 08:12:51,228] - [remote_util:1234] INFO - command executed successfully
      [2013-01-24 08:12:51,229] - [remote_util:1183] INFO - Starting couchbase-server[ OK ]
      [2013-01-24 08:12:52,333] - [rest_client:795] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.82%2Cns_1%4010.3.3.94%2Cns_1%4010.3.3.97&user=Administrator&knownNodes=ns_1%4010.3.3.91%2Cns_1%4010.3.3.92%2Cns_1%4010.3.3.94%2Cns_1%4010.3.3.82%2Cns_1%4010.3.3.93%2Cns_1%4010.3.3.99%2Cns_1%4010.3.3.97
      [2013-01-24 08:12:52,353] - [rest_client:799] INFO - rebalance operation started
      [2013-01-24 08:12:52,378] - [rest_client:894] INFO - rebalance percentage : 0 %
      [2013-01-24 08:13:02,404] - [rest_client:879] ERROR -

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed
      [('/usr/lib64/python2.6/threading.py', 504, '__bootstrap', 'self.__bootstrap_inner()'), ('/usr/lib64/python2.6/threading.py', 532, '__bootstrap_inner', 'self.run()'), ('lib/tasks/taskmanager.py', 31, 'run', 'task.step(self)'), ('lib/tasks/task.py', 55, 'step', 'self.check(task_manager)'), ('lib/tasks/task.py', 269, 'check', 'self.set_exception(ex)'), ('lib/tasks/future.py', 262, 'set_exception', 'print traceback.extract_stack()')]
      [('testrunner', 321, '<module>', 'result = unittest.TextTestRunner(verbosity=2).run(suite)'), ('/usr/lib64/python2.6/unittest.py', 752, 'run', 'test(result)'), ('/usr/lib64/python2.6/unittest.py', 463, '__call__', 'return self.run(*args, **kwds)'), ('/usr/lib64/python2.6/unittest.py', 459, 'run', 'test(result)'), ('/usr/lib64/python2.6/unittest.py', 299, '__call__', 'return self.run(*args, **kwds)'), ('/usr/lib64/python2.6/unittest.py', 278, 'run', 'testMethod()'), ('pytests/rebalance/rebalanceout.py', 272, 'rebalance_out_with_warming_up', 'rebalance.result()'), ('lib/tasks/future.py', 158, 'result', 'return self.__get_result()'), ('lib/tasks/future.py', 109, '__get_result', 'print traceback.extract_stack()')]
      [2013-01-24 08:13:02,406] - [rebalanceout:274] INFO - rebalance was failed as expected
      [2013-01-24 08:13:02,539] - [data_helper:289] INFO - creating direct client 10.3.3.91:11210 default
      [2013-01-24 08:13:03,805] - [cluster_helper:114] INFO - ep_warmup_time is 2623677
      [2013-01-24 08:13:03,806] - [cluster_helper:117] INFO - Collected the stats 2623677 for server 10.3.3.91:8091
      [2013-01-24 08:13:03,862] - [cluster_helper:136] INFO - warmup completed, awesome!!! Warmed up. 0 items
      [2013-01-24 08:13:03,863] - [rebalanceout:278] INFO - second attempt to rebalance
      [2013-01-24 08:13:04,429] - [rest_client:795] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.82%2Cns_1%4010.3.3.94%2Cns_1%4010.3.3.97&user=Administrator&knownNodes=ns_1%4010.3.3.91%2Cns_1%4010.3.3.92%2Cns_1%4010.3.3.94%2Cns_1%4010.3.3.82%2Cns_1%4010.3.3.93%2Cns_1%4010.3.3.99%2Cns_1%4010.3.3.97
      [2013-01-24 08:13:04,454] - [rest_client:799] INFO - rebalance operation started
      [2013-01-24 08:13:04,468] - [rest_client:894] INFO - rebalance percentage : 0 %
      [2013-01-24 08:13:14,499] - [rest_client:894] INFO - rebalance percentage : 2.34907590507 %
      [2013-01-24 08:13:24,523] - [rest_client:894] INFO - rebalance percentage : 6.05886370377 %
      [2013-01-24 08:13:34,540] - [rest_client:894] INFO - rebalance percentage : 9.98868660464 %
      [2013-01-24 08:13:44,551] - [rest_client:894] INFO - rebalance percentage : 14.2387097173 %
      [2013-01-24 08:13:54,565] - [rest_client:879] ERROR -

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed

      memcached crash on 10.3.3.99:

      [root@caper-007 tmp]# gdb /opt/couchbase/bin/memcached core.memcached.21377
      GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
      Copyright (C) 2009 Free Software Foundation, Inc.
      License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law. Type "show copying"
      and "show warranty" for details.
      This GDB was configured as "x86_64-redhat-linux-gnu".
      For bug reporting instructions, please see:
      <http://www.gnu.org/software/gdb/bugs/>...
      Reading symbols from /opt/couchbase/bin/memcached...done.
      [New Thread 21398]
      [New Thread 21400]
      [New Thread 21399]
      [New Thread 21397]
      [New Thread 21396]
      [New Thread 21395]
      [New Thread 21392]
      [New Thread 21391]
      [New Thread 21390]
      [New Thread 21389]
      [New Thread 21388]
      [New Thread 21386]
      [New Thread 21385]
      [New Thread 21377]
      Reading symbols from /opt/couchbase/lib/memcached/libmemcached_utilities.so.0...done.
      Loaded symbols for /opt/couchbase/lib/memcached/libmemcached_utilities.so.0
      Reading symbols from /opt/couchbase/lib/libevent-2.0.so.5...done.
      Loaded symbols for /opt/couchbase/lib/libevent-2.0.so.5
      Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
      Loaded symbols for /lib64/libdl.so.2
      Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
      Loaded symbols for /lib64/libm.so.6
      Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
      Loaded symbols for /lib64/librt.so.1
      Reading symbols from /opt/couchbase/lib/libtcmalloc_minimal.so.4...done.
      Loaded symbols for /opt/couchbase/lib/libtcmalloc_minimal.so.4
      Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
      [Thread debugging using libthread_db enabled]
      Loaded symbols for /lib64/libpthread.so.0
      Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
      Loaded symbols for /lib64/libc.so.6
      Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
      Loaded symbols for /lib64/ld-linux-x86-64.so.2
      Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols found)...done.
      Loaded symbols for /usr/lib64/libstdc++.so.6
      Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
      Loaded symbols for /lib64/libgcc_s.so.1
      Reading symbols from /opt/couchbase/lib/memcached/stdin_term_handler.so...done.
      Loaded symbols for /opt/couchbase/lib/memcached/stdin_term_handler.so
      Reading symbols from /opt/couchbase/lib/memcached/file_logger.so...done.
      Loaded symbols for /opt/couchbase/lib/memcached/file_logger.so
      Reading symbols from /opt/couchbase/lib/memcached/bucket_engine.so...done.
      Loaded symbols for /opt/couchbase/lib/memcached/bucket_engine.so
      Reading symbols from /opt/couchbase/lib/memcached/ep.so...done.
      Loaded symbols for /opt/couchbase/lib/memcached/ep.so
      Reading symbols from /opt/couchbase/lib/libcouchstore.so.1...done.
      Loaded symbols for /opt/couchbase/lib/libcouchstore.so.1
      Reading symbols from /opt/couchbase/lib/libsnappy.so.1...done.
      Loaded symbols for /opt/couchbase/lib/libsnappy.so.1
      Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
      Loaded symbols for /lib64/libnss_files.so.2

      warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff18555000
      Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
      Program terminated with signal 11, Segmentation fault.
      #0 add_conn_to_pending_io_list (cookie=0x1e167600, status=ENGINE_SUCCESS) at daemon/thread.c:722
      722 daemon/thread.c: No such file or directory.
      in daemon/thread.c
      (gdb) t a a bt

      Thread 14 (Thread 0x2b560bfb9240 (LWP 21377)):
      #0 0x00002b560b825648 in epoll_wait () from /lib64/libc.so.6
      #1 0x00002b560aa36576 in epoll_dispatch (base=0x1e20a000, tv=<value optimized out>) at epoll.c:404
      #2 0x00002b560aa21e44 in event_base_loop (base=0x1e20a000, flags=<value optimized out>) at event.c:1558
      #3 0x0000000000409742 in main (argc=<value optimized out>, argv=<value optimized out>) at daemon/memcached.c:7918

      Thread 13 (Thread 21385):
      #0 0x00002b560b81745b in read () from /lib64/libc.so.6
      #1 0x00002b560b7bd677 in _IO_new_file_underflow () from /lib64/libc.so.6
      #2 0x00002b560b7be03e in _IO_default_uflow_internal () from /lib64/libc.so.6
      #3 0x00002b560b7b3124 in _IO_getline_info_internal () from /lib64/libc.so.6
      #4 0x00002b560b7b1fc9 in fgets () from /lib64/libc.so.6
      #5 0x00002b560bfba939 in check_stdin_thread (arg=<value optimized out>) at extensions/daemon/stdin_check.c:37
      #6 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #7 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 12 (Thread 21386):
      #0 0x00002b560b5421c0 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1 0x00002aaaaaaae4d6 in logger_thead_main (arg=0x199a2040) at extensions/loggers/file_logger.c:368
      #2 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #3 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 11 (Thread 21388):
      #0 0x00002b560b825648 in epoll_wait () from /lib64/libc.so.6
      #1 0x00002b560aa36576 in epoll_dispatch (base=0x1e20a500, tv=<value optimized out>) at epoll.c:404
      #2 0x00002b560aa21e44 in event_base_loop (base=0x1e20a500, flags=<value optimized out>) at event.c:1558
      #3 0x0000000000414504 in worker_libevent (arg=0x199a5900) at daemon/thread.c:301
      #4 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #5 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 10 (Thread 21389):
      #0 0x00002b560b825648 in epoll_wait () from /lib64/libc.so.6
      #1 0x00002b560aa36576 in epoll_dispatch (base=0x1e20a280, tv=<value optimized out>) at epoll.c:404
      #2 0x00002b560aa21e44 in event_base_loop (base=0x1e20a280, flags=<value optimized out>) at event.c:1558
      #3 0x0000000000414504 in worker_libevent (arg=0x199a59f8) at daemon/thread.c:301
      #4 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #5 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 9 (Thread 21390):
      #0 0x00002b560b5446d5 in __lll_unlock_wake () from /lib64/libpthread.so.0
      #1 0x00002b560b541157 in _L_unlock_766 () from /lib64/libpthread.so.0
      #2 0x00002b560b5410be in pthread_mutex_unlock () from /lib64/libpthread.so.0
      #3 0x0000000000414c8d in thread_libevent_process (fd=<value optimized out>, which=<value optimized out>, arg=0x199a5af0) at daemon/thread.c:389
      #4 0x00002b560aa21f3c in event_process_active_single_queue (base=0x1e20ac80, flags=<value optimized out>) at event.c:1308
      #5 event_process_active (base=0x1e20ac80, flags=<value optimized out>) at event.c:1375
      #6 event_base_loop (base=0x1e20ac80, flags=<value optimized out>) at event.c:1572
      #7 0x0000000000414504 in worker_libevent (arg=0x199a5af0) at daemon/thread.c:301
      #8 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #9 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 8 (Thread 21391):
      #0 0x00002b560b825648 in epoll_wait () from /lib64/libc.so.6
      #1 0x00002b560aa36576 in epoll_dispatch (base=0x1e20aa00, tv=<value optimized out>) at epoll.c:404
      #2 0x00002b560aa21e44 in event_base_loop (base=0x1e20aa00, flags=<value optimized out>) at event.c:1558
      --Type <return> to continue, or q <return> to quit--
      #3 0x0000000000414504 in worker_libevent (arg=0x199a5be8) at daemon/thread.c:301
      #4 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #5 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 7 (Thread 21392):
      #0 0x00002b560b825648 in epoll_wait () from /lib64/libc.so.6
      #1 0x00002b560aa36576 in epoll_dispatch (base=0x1e20a780, tv=<value optimized out>) at epoll.c:404
      #2 0x00002b560aa21e44 in event_base_loop (base=0x1e20a780, flags=<value optimized out>) at event.c:1558
      #3 0x0000000000414504 in worker_libevent (arg=0x199a5ce0) at daemon/thread.c:301
      #4 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #5 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 6 (Thread 21395):
      #0 0x00002b560b7eb221 in nanosleep () from /lib64/libc.so.6
      #1 0x00002b560b81eba4 in usleep () from /lib64/libc.so.6
      #2 0x00002aaaaaf317f5 in updateStatsThread (arg=0x199a24c0) at src/memory_tracker.cc:31
      #3 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #4 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 5 (Thread 21396):
      #0 0x00002b560b81e767 in fdatasync () from /lib64/libc.so.6
      #1 0x00002aaaab1d0cff in couch_sync (handle=<value optimized out>) at src/os.c:117
      #2 0x00002aaaaaf73f1f in cfs_sync (h=0x1fd20040) at src/couch-kvstore/couch-fs-stats.cc:86
      #3 0x00002aaaab1cc1ef in couchstore_commit (db=0x1e211b90) at src/couch_db.c:194
      #4 0x00002aaaaaf6eb55 in CouchKVStore::saveDocs (this=0x1e274000, vbid=631, rev=1, docs=0x2677a000, docinfos=0x2677ee00, docCount=395) at src/couch-kvstore/couch-kvstore.cc:1570
      #5 0x00002aaaaaf6f211 in CouchKVStore::commit2couchstore (this=0x1e274000) at src/couch-kvstore/couch-kvstore.cc:1492
      #6 0x00002aaaaaf6f41a in CouchKVStore::commit (this=<value optimized out>) at src/couch-kvstore/couch-kvstore.cc:876
      #7 0x00002aaaaaef7f15 in TransactionContext::commit (this=0x1e202788) at src/ep.cc:2789
      #8 0x00002aaaaaf02210 in EventuallyPersistentStore::flushOutgoingQueue (this=0x1e202480, flushQueue=0x1e202748, flushPhase=@0x1e200570, nextVbid=@0x1e200578) at src/ep.cc:1975
      #9 0x00002aaaaaf2b87c in Flusher::doFlush (this=0x1e200480) at src/flusher.cc:245
      #10 0x00002aaaaaf2c6b5 in Flusher::step (this=0x1e200480, d=..., tid=...) at src/flusher.cc:158
      #11 0x00002aaaaaef45ea in Dispatcher::run (this=0x1e246c40) at src/dispatcher.cc:173
      #12 0x00002aaaaaef4eeb in launch_dispatcher_thread (arg=0x1e246c40) at src/dispatcher.cc:28
      #13 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #14 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 4 (Thread 21397):
      #0 0x00002b560b5421c0 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1 0x00002aaaaaef1f28 in wait (this=0x1e28a090, d=...) at src/syncobject.hh:58
      #2 IdleTask::run (this=0x1e28a090, d=...) at src/dispatcher.cc:336
      #3 0x00002aaaaaef45ea in Dispatcher::run (this=0x1e246a80) at src/dispatcher.cc:173
      #4 0x00002aaaaaef4eeb in launch_dispatcher_thread (arg=0x1e246a80) at src/dispatcher.cc:28
      #5 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #6 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 3 (Thread 21399):
      #0 0x00002b560b544594 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1 0x00002b560b53fe8a in _L_lock_1034 () from /lib64/libpthread.so.0
      #2 0x00002b560b53fd4c in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3 0x00000000004155f6 in notify_io_complete (cookie=0x1e18a580, status=ENGINE_SUCCESS) at daemon/thread.c:485
      #4 0x00002aaaaaf4a3ad in notifyIOComplete (this=<value optimized out>, tc=0x1e8dd000) at src/ep_engine.h:439
      #5 TapConnMap::notifyPausedConnection_UNLOCKED (this=<value optimized out>, tc=0x1e8dd000) at src/tapconnmap.cc:347
      #6 0x00002aaaaaf4a4fc in TapConnMap::setEvents (this=0x1e200240, name="eq_tapq:replication_ns_1@10.3.3.97", q=0x199a1240) at src/tapconnmap.cc:115
      #7 0x00002aaaaaee3b03 in setEvents (this=0x26288780) at src/backfill.cc:176
      #8 BackFillVisitor::apply (this=0x26288780) at src/backfill.cc:170
      --Type <return> to continue, or q <return> to quit--
      #9 0x00002aaaaaee4339 in BackFillVisitor::visitBucket (this=0x26288780, vb=...) at src/backfill.cc:90
      #10 0x00002aaaaaef8a5d in VBCBAdaptor::callback (this=0x2663e5a0, d=..., t=...) at src/ep.cc:2849
      #11 0x00002aaaaaef45ea in Dispatcher::run (this=0x1e2476c0) at src/dispatcher.cc:173
      #12 0x00002aaaaaef4eeb in launch_dispatcher_thread (arg=0x1e2476c0) at src/dispatcher.cc:28
      #13 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #14 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 2 (Thread 21400):
      #0 RCPtr (this=<value optimized out>, vbid=<value optimized out>, wanted_state=vbucket_state_active) at src/atomic.hh:311
      #1 EventuallyPersistentStore::getVBucket (this=<value optimized out>, vbid=<value optimized out>, wanted_state=vbucket_state_active) at src/ep.cc:637
      #2 0x00002aaaaaefb24e in EventuallyPersistentStore::firePendingVBucketOps (this=0x1e202480) at src/ep.cc:644
      #3 0x00002aaaaaf10c98 in EventuallyPersistentEngine::notifyPendingConnections (this=0x1e206000) at src/ep_engine.cc:3417
      #4 0x00002aaaaaf10e43 in EvpNotifyPendingConns (arg=0x1e206000) at src/ep_engine.cc:1145
      #5 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #6 0x00002b560b82525d in clone () from /lib64/libc.so.6

      Thread 1 (Thread 0x4819c940 (LWP 21398)):
      #0 add_conn_to_pending_io_list (cookie=0x1e167600, status=ENGINE_SUCCESS) at daemon/thread.c:722
      #1 notify_io_complete (cookie=0x1e167600, status=ENGINE_SUCCESS) at daemon/thread.c:488
      #2 0x00002aaaaaf4a3ad in notifyIOComplete (this=<value optimized out>, tc=0x1e8dd000) at src/ep_engine.h:439
      #3 TapConnMap::notifyPausedConnection_UNLOCKED (this=<value optimized out>, tc=0x1e8dd000) at src/tapconnmap.cc:347
      #4 0x00002aaaaaee4901 in performTapOp<void*> (this=0x1f2c7e80, d=<value optimized out>, t=<value optimized out>) at src/tapconnmap.hh:119
      #5 BackfillDiskLoad::callback (this=0x1f2c7e80, d=<value optimized out>, t=<value optimized out>) at src/backfill.cc:78
      #6 0x00002aaaaaef45ea in Dispatcher::run (this=0x1e247880) at src/dispatcher.cc:173
      #7 0x00002aaaaaef4eeb in launch_dispatcher_thread (arg=0x1e247880) at src/dispatcher.cc:28
      #8 0x00002b560b53d77d in start_thread () from /lib64/libpthread.so.0
      #9 0x00002b560b82525d in clone () from /lib64/libc.so.6
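
      Note that Thread 1 and Thread 3 are both inside TapConnMap::notifyPausedConnection_UNLOCKED -> notify_io_complete at the same time, i.e. two dispatcher threads appear to be mutating a worker's pending-IO list concurrently. Purely as an illustration of the invariant that path must preserve (hypothetical names, not the actual daemon/thread.c code), every append to the pending-IO list has to happen under the list's mutex:

```python
import threading

class PendingIoList:
    """Illustrative stand-in for a worker thread's pending-IO list.

    Every append/drain happens under a mutex, which is the invariant
    notify_io_complete must preserve when two dispatcher threads report
    IO completion for connections at the same time.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._conns = []

    def add(self, cookie):
        # Equivalent of add_conn_to_pending_io_list: take the lock first.
        with self._lock:
            self._conns.append(cookie)

    def drain(self):
        # Worker thread takes the whole list under the same lock.
        with self._lock:
            conns, self._conns = self._conns, []
            return conns

pending = PendingIoList()

def dispatcher(first_cookie, n):
    # Two "dispatcher" threads notifying IO completion concurrently.
    for cookie in range(first_cookie, first_cookie + n):
        pending.add(cookie)

threads = [threading.Thread(target=dispatcher, args=(i * 1000, 1000))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

drained = pending.drain()
print(len(drained))  # -> 2000, no notification lost or double-linked
```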

      The server logs contain many errors such as:
      Thu Jan 24 08:17:06.836510 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 850 does not exist
      Thu Jan 24 08:17:06.837917 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 851 does not exist
      Thu Jan 24 08:17:06.840049 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 852 does not exist
      Thu Jan 24 08:17:06.843580 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 853 does not exist
      Thu Jan 24 08:17:06.844550 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.92 - Connection is re-established. Rollback unacked messages...
      Thu Jan 24 08:17:06.844897 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.92 - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
      Thu Jan 24 08:17:06.844920 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.92 - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
      Thu Jan 24 08:17:06.845341 PST 3: Schedule cleanup of "eq_tapq:anon_408"
      Thu Jan 24 08:17:06.846366 PST 3: TAP (Producer) eq_tapq:anon_408 - Clear the tap queues by force
      Thu Jan 24 08:17:06.852772 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 854 does not exist
      Thu Jan 24 08:17:06.866012 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 855 does not exist
      Thu Jan 24 08:17:06.868286 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 856 does not exist
      Thu Jan 24 08:17:06.870287 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.97 - Failed to set the TAP cursor to the open checkpoint because the TAP checkpoint state for vbucket 857 does not exist

      Aaron, could you look at this crash if it's your area?


        Activity

        andreibaranouski Andrei Baranouski added a comment -

        I have seen this problem only once, in the mixed rebalance tests. The crash was on a 2.0.1 node.
        jin Jin Lim added a comment -

        Per bug scrubs, we will be reviewing this issue over the next few days.

        Andrei - please try to reproduce it in your environment over the next few days and let us know. Thanks.
        farshid Farshid Ghods (Inactive) added a comment -

        Deferring this to 2.0.2 for the time being, as Andrei is unable to reproduce this issue but will continue more runs next week.
        mikew Mike Wiederhold added a comment -

        This is the same crash as MB-7735. I need a crash dump to investigate the issue further, so if you are able to reproduce it, please add the location of the crash dump to MB-7735 or open a new ticket.
        maria Maria McDuff (Inactive) added a comment -

        MB-7735

          People

          • Assignee:
            andreibaranouski Andrei Baranouski
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0
            Watchers:
            5
