Couchbase Server / MB-4394

memcached crash while rebalancing 15 nodes with 30M items (FATAL: Object returned from mccouch with CAS == 0)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None
    • Environment:
      CentOS 5.4 64-bit on ecs

      Description

      Create a cluster of 10 nodes running Couchbase Server 2.0.0r-177.
      Load 30 million items into the cluster so that it reaches about 85% resident.
      Keep the load running and add 5 more 2.0.0r-177 nodes.
      Rebalance the cluster. The rebalance failed.

      1. core.memcached.1420.log
        15 kB
        Thuan Nguyen
      2. core.memcached.21045.log
        14 kB
        Thuan Nguyen
      3. core.memcached.2406.log
        17 kB
        Thuan Nguyen
      4. core.memcached.2408.log
        15 kB
        Thuan Nguyen
      5. core.memcached.2422.log
        15 kB
        Thuan Nguyen

        Activity

        Farshid Ghods (Inactive) added a comment - http://www.couchbase.org/issues/browse/MB-4412

        Farshid Ghods (Inactive) added a comment -

        Looked at the more recent core logs.
        It's a dupe of http://www.couchbase.org/issues/browse/MB-4412.

        Thuan Nguyen added a comment -

        I just got a crash again today and have attached three new core files here.
        I ran 15 Python threads to load 30 million items. The script does all sets.
        python scripts/mixload-allset.py -i manual2.0.ini -p prefix=key_01,size=655,count=2000000 &
        Command to run memcachetest.
        ./memcachetest -h 184.72.85.127:11211 -i 100000 -c 50000 -m 128 -t 2 -l 
        After the 30 million items finish loading, I do 70% set/get, and for the other 30% a delete followed by setting the key again:
        counter_10 = 0
        all_set = False
        i = 0
        # prefix, count, payload, and the memcached client `mc` come from the
        # surrounding mixload script.
        while i < count:
            try:
                key = "{0}-{1}".format(prefix, i)
                if counter_10 >= 7:
                    if all_set == True:
                        mc.delete(key)
                    mc.set(key, 0, 0, payload)
                    if counter_10 == 10:
                        counter_10 = 0
                else:
                    mc.set(key, 0, 0, payload)
                    mc.get(key)
                counter_10 += 1
                i += 1
                if i == int(count):
                    all_set = True
                    i = 0
            except Exception:
                # Exception handling was not included in the original comment.
                pass

        Farshid Ghods (Inactive) added a comment -

        I can't access those VMs anymore; they seem to have been terminated by now.

        Tony,

        Can you please provide the set/get/delete/expire ratio you run the mixloader with, and possibly copy-paste the part of the mix-loader that loops over all the keys and runs those memcached commands?

        Dustin Sallings (Inactive) added a comment -

        This stack, from one of the attached cores, is also very interesting. How many different things are going wrong here?

        Thread 1 (Thread 0x7f8e0c97e700 (LWP 22692)):
        #0 0x0000000000000000 in ?? ()
        #1 0x00007f8e0d294b1c in Task::maxExpectedDuration (this=0x4b2c3000) at dispatcher.hh:152
        #2 0x00007f8e0d2945a2 in Dispatcher::run (this=0xf5b9000) at dispatcher.cc:136
        #3 0x00007f8e0d29479c in launch_dispatcher_thread (arg=0xf5b9000) at dispatcher.cc:28
        #4 0x00007f8e11afc7e1 in start_thread () from /lib64/libpthread.so.0
        #5 0x00007f8e11863ead in clone () from /lib64/libc.so.6
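        For context on what frame #0 at address 0x0000000000000000 usually means: the program transferred control through a null pointer, e.g. a virtual call or callback made from a Task object that has already been freed or overwritten. The snippet below is a hypothetical illustration only (not the ep-engine Dispatcher/Task code): a zeroed-out object makes the indirect call land at address 0, producing exactly this kind of #0 frame in the core.

        // Hypothetical illustration only -- not the ep-engine Task class.
        #include <cstdio>

        struct Task {
            // Stand-in for whatever maxExpectedDuration() calls through.
            double (*maxExpectedDuration)(Task *self);
        };

        int main() {
            // A Task whose memory has been zeroed (e.g. freed and reused)
            // carries a null pointer in the slot the dispatcher calls.
            Task corrupted = {nullptr};

            // This indirect call jumps to address 0 and dies with SIGSEGV;
            // the core then shows "#0 0x0000000000000000 in ?? ()" just as
            // in the backtrace above.
            double d = corrupted.maxExpectedDuration(&corrupted);

            printf("%f\n", d);  // never reached
            return 0;
        }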

        Dustin Sallings (Inactive) added a comment -

        I didn't mean to assign this to myself, but I'm going to pass the baton briefly to Farshid for some reproduction data.

        I'd really like a shard attached that reproduces this so I can try to do the same thing in isolation.

        Dustin Sallings (Inactive) added a comment -

        Found myself commenting on a bug in email. Need to be careful about that.

        The error is misleading. It appears to be one of these two things:

        static bool decodeMeta(const uint8_t *dta, uint32_t &seqno, uint64_t &cas,
                               uint32_t &length, uint32_t &flags) {
            if (*dta != 0x01) {
                // Unsupported meta tag
                return false;
            }
            ++dta;
            if (*dta != 20) {
                // Unsupported size
                return false;
            }
        Considering the prior allocation was 4 GB, I'm guessing that something was read incorrectly and we're already off by this point.

        Any chance this is a small database I can play with?
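
        To make the "off by this point" theory concrete, here is a minimal, self-contained sketch (not the ep-engine source). It assumes the 20-byte block that decodeMeta() checks for is laid out as seqno (4 B), cas (8 B), length (4 B), flags (4 B) in network byte order, matching the reference parameters in the snippet above; that layout is an assumption. With a few garbage bytes in the length field, a caller that trusts the decoded length would attempt an allocation of roughly 4 GB, the same shape as the tcmalloc log line quoted below, and zeroed cas bytes decode to CAS == 0.

        // Hypothetical illustration only -- the field layout is an assumption.
        #include <cstdint>
        #include <cstdio>
        #include <cstring>

        static uint32_t readU32(const uint8_t *p) {
            // Big-endian (network order) 32-bit read.
            return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                   (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
        }

        static uint64_t readU64(const uint8_t *p) {
            // Big-endian (network order) 64-bit read.
            return (uint64_t(readU32(p)) << 32) | readU32(p + 4);
        }

        int main() {
            // Tag byte 0x01, size byte 20, then 20 bytes of metadata (all zero).
            uint8_t meta[22] = {0x01, 20};

            // Simulate corruption of the length field: bytes 0xFF 0xFF 0x90 0x00
            // decode to 4294938624, the figure in the tcmalloc log line.
            const uint8_t badLength[4] = {0xFF, 0xFF, 0x90, 0x00};
            memcpy(meta + 2 + 4 + 8, badLength, sizeof(badLength));

            const uint8_t *dta = meta;
            if (*dta++ != 0x01 || *dta++ != 20) {
                fprintf(stderr, "unsupported meta\n");
                return 1;
            }

            uint32_t seqno  = readU32(dta);  dta += 4;
            uint64_t cas    = readU64(dta);  dta += 8;
            uint32_t length = readU32(dta);  dta += 4;
            uint32_t flags  = readU32(dta);

            // A caller that trusts 'length' would now try to allocate ~4 GB,
            // and the zeroed cas bytes decode to CAS == 0.
            printf("seqno=%u cas=%llu length=%u flags=%u\n",
                   seqno, (unsigned long long)cas, length, flags);
            return 0;
        }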

        Farshid Ghods (Inactive) added a comment -

        Port server memcached on node 'ns_1@10.34.149.233' exited with status 134. Restarting. Messages: Preloaded 5898290 keys (with metadata)
        tcmalloc: large alloc 4294938624 bytes == 0x49c5e000 @
        FATAL: Object returned from mccouch with CAS == 0


          People

          • Assignee:
            Dustin Sallings (Inactive)
          • Reporter:
            Thuan Nguyen
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
            • Updated:
            • Resolved:

              Gerrit Reviews

              There are no open Gerrit changes