  1. Couchbase Server
  2. MB-9588

Possible memory corruption in flusher during shutdown


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 3.0
    • 3.0
    • couchbase-bucket
    • Security Level: Public
    • None

    Description

      The recently submitted change (f864ba7a68fe09731ab6a4f856bfdf627bfde0d3) that fixed some of the flusher logic uncovered another issue, which is causing a number of unit test failures. Because we don't check whether the VBucket exists before calling isBucketCreation(), we hit an assertion when the vbid is greater than 1024. It appears that at some point the low priority queue in the flusher becomes corrupted, leaving garbage values in the queue.
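
The missing existence check described above can be sketched as follows. This is a minimal illustration, not the ep-engine code: `ToyVBucketMap`, `isValid`, `FlushResult`, and `flushVBucketChecked` are hypothetical names, and the map size of 1024 is taken from the description. Such a guard would only stop the assertion from firing; it would not explain or fix the corrupted queue.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real VBucketMap; only the range check matters here.
class ToyVBucketMap {
public:
    explicit ToyVBucketMap(std::size_t size) : creation_(size, false) {}

    // True only for ids that fall inside the map.
    bool isValid(uint16_t id) const { return id < creation_.size(); }

    // Returns false for an out-of-range id instead of asserting.
    bool isBucketCreation(uint16_t id) const {
        return isValid(id) && creation_[id];
    }

private:
    std::vector<bool> creation_;
};

// Hypothetical result type for the sketch.
enum class FlushResult { Flushed, SkippedInvalidVBucket };

// Validate the vbid before touching per-vBucket state.
FlushResult flushVBucketChecked(const ToyVBucketMap& map, uint16_t vbid) {
    if (!map.isValid(vbid)) {
        // An id such as 65535 is rejected here rather than reaching an assert.
        return FlushResult::SkippedInvalidVBucket;
    }
    (void)map.isBucketCreation(vbid);  // safe: vbid is known to be in range
    return FlushResult::Flushed;
}
```

With a map of size 1024, calling `flushVBucketChecked(map, 65535)` would return `FlushResult::SkippedInvalidVBucket` instead of aborting.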

      Thread 1 (Thread 0x4551f940 (LWP 23028)):
      #0 0x0000003f14230265 in raise () from /lib64/libc.so.6
      #1 0x0000003f14231d10 in abort () from /lib64/libc.so.6
      #2 0x0000003f142296e6 in __assert_fail () from /lib64/libc.so.6
      #3 0x00002aaaaac18050 in VBucketMap::isBucketCreation (this=0x1c6078c8, id=65535) at src/vbucketmap.cc:125
      #4 0x00002aaaaab98798 in EventuallyPersistentStore::flushVBucket (this=0x1c607890, vbid=65535) at src/ep.cc:2293
      #5 0x00002aaaaabd6043 in Flusher::flushVB (this=0x1c627330) at src/flusher.cc:269
      #6 0x00002aaaaabd59ee in Flusher::completeFlush (this=0x1c627330) at src/flusher.cc:206
      #7 0x00002aaaaabd57e0 in Flusher::step (this=0x1c627330, tid=8) at src/flusher.cc:179
      #8 0x00002aaaaac1106e in FlusherTask::run (this=0x1c68b9a0) at src/tasks.cc:78
      #9 0x00002aaaaabdc549 in ExecutorThread::run (this=0x1c678040) at src/scheduler.cc:94
      #10 0x00002aaaaabdbeef in launch_executor_thread (arg=0x1c678040) at src/scheduler.cc:35
      #11 0x00002ba66dc6b49a in platform_thread_wrap (arg=0x1c678c40) at /home/jenkins/couchbase/cmake/platform/src/cb_pthreads.c:18
      #12 0x0000003f14e0673d in start_thread () from /lib64/libpthread.so.0
      #13 0x0000003f142d44bd in clone () from /lib64/libc.so.6
      (gdb)
      (gdb) f 5
      #5 0x00002aaaaabd6043 in Flusher::flushVB (this=0x1c627330) at src/flusher.cc:269
      269 if (store->flushVBucket(vbid) == RETRY_FLUSH_VBUCKET) {
      (gdb) info locals
      vbid = 65535
      __PRETTY_FUNCTION__ = "void Flusher::flushVB()"
      (gdb) print *this
      $1 = {store = 0x1c607890, state = stopping, taskMutex = {_vptr.Mutex = 0x2aaaaaef5590, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
      __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, holder = 47993808496400, held = false}, taskId = 30,
      minSleepTime = 2, flushStart = 1, forceShutdownReceived = {value = false}, hpVbs = std::queue wrapping: std::deque with 0 elements,
      lpVbs = std::queue wrapping: std::deque with -33553603 elements =
      {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1160, 46080, 10922, 0, 1160, 46080, 10922, 0, 8720, 46080, 10922, 0, 8720, 46080, 10922, 0, 25632, 30575, 8302, 27750, 29557, 25960, 8306, 22312, 26994, 25972, 28448, 8294, 27745, 8300, 26980, 29810, 8313, 29801, 28005, 10611, 10, 0, 0, 0, 65535, 65535, 0, 0, 28773, 25439, 27496, 29279, 28005, 30319, 29285, 29535, 26996, 25965, 29184, 29541, 28520, 25708, 42240, 42405, 113, 0, 0, 0, 43024, 46080, 10922, 0, 120, 46080, 10922, 0, 65535, 65535, 0, 0, 53, 48, 0, 0, 48, 0, 0, 0, 68, 0, 0, 0, 8784, 46080, 10922, 0, 6464, 46080, 10922, 0, 43328, 46080, 10922, 0, 12800, 46080, 10922, 0, 8824, 46080, 10922, 0, 8888, 46080, 10922, 0, 240, 0, 0, 0, 68, 0, 0, 0, 5920, 46080, 10922, 0, 12800, 46080, 10922, 0, 21472, 46080, 10922, 0, 0, 0, 0, 0, 5960, 46080, 10922, 0, 65048, 9966, 63, 0, 304, 0, 0, 0, 68, 0, 0, 0, 38080, 46080, 10922, 0, 11, 0, 0, 0, 65535, 65535, 0, 0, 28773, 26975, 26990, 26228, 27753, 101, 26996, 25965, 30063, 116, 30063, 116, 64, 0, 0, 0, 49, 0, 0, 0, 3040, 46080, 10922, 0, 37984, 46080...}

      , doHighPriority = false, numHighPriority = 0, pendingMutation =

      {value = true}

      , shard = 0x1c608300}
      (gdb) quit
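
The elements of the corrupted queue are `unsigned short` values, so each one holds two bytes. A small sketch like the one below (the helper names `printable` and `decodeLittleEndianWords` are made up for illustration) reads each value as a little-endian byte pair, as on x86_64. Applied to the run `28773, 25439, 27496, 29279, 28005, 30319, 29285, 29535, 26996, 25965` from the dump, it yields the ASCII text `ep_chk_remover_stime`. That suggests, without proving it, that the deque is reading memory that now holds string data.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Map a byte to itself if it is printable ASCII, otherwise to '.'.
char printable(unsigned char b) {
    return (b >= 0x20 && b < 0x7F) ? static_cast<char>(b) : '.';
}

// Interpret each 16-bit value as two bytes, low byte first (little-endian).
std::string decodeLittleEndianWords(const std::vector<uint16_t>& words) {
    std::string out;
    out.reserve(words.size() * 2);
    for (uint16_t w : words) {
        out.push_back(printable(static_cast<unsigned char>(w & 0xFF)));
        out.push_back(printable(static_cast<unsigned char>(w >> 8)));
    }
    return out;
}
```

The same helper can be run over other stretches of the dump to see which of the values are text and which look like pointers or counters.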

      Here is the test case failure.

      Running [0028/0232]: test touch (MB-7342) (couchstore)...engine_testapp: src/vbucketmap.cc:125: bool VBucketMap::isBucketCreation(uint16_t) const: Assertion `id < size' failed.
      CORE DUMPED

      I also tried running valgrind, but was unable to get any relevant information. Separately, I saw a double free error at one point in this code path as well, but that one is much rarer.

      Thread 1 (Thread 0x4e151940 (LWP 4022)):
      #0 0x0000003f14230265 in raise () from /lib64/libc.so.6
      #1 0x0000003f14231d10 in abort () from /lib64/libc.so.6
      #2 0x0000003f1426a99b in __libc_message () from /lib64/libc.so.6
      #3 0x0000003f1427245f in _int_free () from /lib64/libc.so.6
      #4 0x0000003f142728bb in free () from /lib64/libc.so.6
      #5 0x00002aaaaab73e18 in __gnu_cxx::new_allocator<unsigned short>::deallocate (this=0x101de278, __p=0x102404e0)
      at /usr/lib/gcc/x86_64-redhat-linux6E/4.4.6/../../../../include/c++/4.4.6/ext/new_allocator.h:95
      #6 0x00002aaaaaba9d7e in std::_Deque_base<unsigned short, std::allocator<unsigned short> >::_M_deallocate_node (this=0x101de278, __p=0x102404e0)
      at /usr/lib/gcc/x86_64-redhat-linux6E/4.4.6/../../../../include/c++/4.4.6/bits/stl_deque.h:450
      #7 0x00002aaaaaba94b0 in std::deque<unsigned short, std::allocator<unsigned short> >::_M_pop_front_aux (this=0x101de278)
      at /usr/lib/gcc/x86_64-redhat-linux6E/4.4.6/../../../../include/c++/4.4.6/bits/deque.tcc:445
      #8 0x00002aaaaaba6338 in std::deque<unsigned short, std::allocator<unsigned short> >::pop_front (this=0x101de278)
      at /usr/lib/gcc/x86_64-redhat-linux6E/4.4.6/../../../../include/c++/4.4.6/bits/stl_deque.h:1241
      #9 0x00002aaaaaba3ece in std::queue<unsigned short, std::deque<unsigned short, std::allocator<unsigned short> > >::pop (this=0x101de278)
      at /usr/lib/gcc/x86_64-redhat-linux6E/4.4.6/../../../../include/c++/4.4.6/bits/stl_queue.h:247
      #10 0x00002aaaaabd6078 in Flusher::flushVB (this=0x101de1c0) at src/flusher.cc:270
      #11 0x00002aaaaabd59ee in Flusher::completeFlush (this=0x101de1c0) at src/flusher.cc:206
      #12 0x00002aaaaabd57e0 in Flusher::step (this=0x101de1c0, tid=67) at src/flusher.cc:179
      #13 0x00002aaaaac110be in VBDeleteTask::run (this=0x43) at src/tasks.cc:86
      #14 0x00002aaaaabdc599 in ExecutorThread::run (this=0x10242640) at src/scheduler.cc:98
      #15 0x00002aaaaabdbf3f in launch_executor_thread (arg=0x10242640) at src/scheduler.cc:38
      #16 0x00002b7d42c4d49a in platform_thread_wrap (arg=0x10243240) at /home/jenkins/couchbase/cmake/platform/src/cb_pthreads.c:18
      #17 0x0000003f14e0673d in start_thread () from /lib64/libpthread.so.0
      #18 0x0000003f142d44bd in clone () from /lib64/libc.so.6

      I'm assigning initially to Sundar because I am out of ideas on how to approach this and maybe a fresh pair of eyes will help. I'm also overloaded with other tasks and don't have a lot of time to look at this.

          People

            sundar Sundar Sridharan (Inactive)
            mikew Mike Wiederhold [X] (Inactive)