Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-6592

[longevity] memcached hangs when aborting during swap rebalance operation and fails to restart ( exit 71 )

    Details

      Description

      Cluster information:

      • 11 centos 6.2 64bit server with 4 cores CPU
      • Each server has 10 GB RAM and 150 GB disk.
      • 8 GB RAM for couchbase server at each node (80% total system memmories)
      • Disk format ext3 on both data and root
      • Each server has its own drive, no disk sharing with other server.
      • Load 9 million items to both buckets
      • Initial indexing, so cpu a little heavy load
      • Cluster has 2 buckets, default (3GB) and saslbucket (3GB)
      • Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)
      • Create cluster with 10 nodes installed couchbase server 2.0.0-1697

      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.24
      10.3.121.25
      10.3.121.23

      • Data path /data
      • View path /data
      • Do swap rebalance. Add node 26 and remove node 25
      • Rebalance failed and saw a lot of error message memcached exited with status 71 in log page.

      Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-memcached-exit-71-20120910.tgz

      Link to atop node 13 https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
      Due to large size of atop file, all other atop files are in /tmp directory of each node

      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        thuan Thuan Nguyen created issue -
        karan Karan Kumar (Inactive) made changes -
        Field Original Value New Value
        Attachment memcached_logfile [ 14919 ]
        Hide
        karan Karan Kumar (Inactive) added a comment -

        From the memcached logs

        Sat Sep 8 17:56:06.159645 3: Connection closed by mccouch
        Sat Sep 8 17:56:06.159864 3: Trying to connect to mccouch: "localhost:11213"
        Sat Sep 8 17:56:06.160432 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.160462 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.165925 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.165961 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.171397 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.171430 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.176813 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.176839 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.182241 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.182267 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.187653 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.187679 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.193033 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.193058 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.198462 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.198518 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.204013 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.204038 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.209451 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.209477 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.214788 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.214813 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.220685 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.220712 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.226387 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.226415 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.231879 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.231909 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.237313 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.237339 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.242731 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.242757 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.248138 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.248164 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.253603 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.253630 3: Failed to connect to: "localhost:11213"
        Sat Sep 8 17:56:06.261017 3: Failed to connect to: "localhost:11213"

        Show
        karan Karan Kumar (Inactive) added a comment - From the memcached logs Sat Sep 8 17:56:06.159645 3: Connection closed by mccouch Sat Sep 8 17:56:06.159864 3: Trying to connect to mccouch: "localhost:11213" Sat Sep 8 17:56:06.160432 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.160462 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.165925 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.165961 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.171397 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.171430 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.176813 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.176839 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.182241 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.182267 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.187653 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.187679 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.193033 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.193058 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.198462 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.198518 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.204013 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.204038 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.209451 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.209477 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.214788 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.214813 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.220685 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.220712 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.226387 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.226415 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.231879 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.231909 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.237313 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.237339 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.242731 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.242757 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.248138 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.248164 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.253603 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.253630 3: Failed to connect to: "localhost:11213" Sat Sep 8 17:56:06.261017 3: Failed to connect to: "localhost:11213"
        karan Karan Kumar (Inactive) made changes -
        Summary [longevity] memcached constantly exited with status 71 on node 13 [longevity] memcached constantly exited with status 71 and unable to restart
        Hide
        karan Karan Kumar (Inactive) added a comment -

        GDB output of the memcached process:-
        https://friendpaste.com/5gFQw9wPBFOgNjue64HfM2

        Not sure if related to
        http://www.couchbase.com/issues/browse/MB-5653

        Show
        karan Karan Kumar (Inactive) added a comment - GDB output of the memcached process:- https://friendpaste.com/5gFQw9wPBFOgNjue64HfM2 Not sure if related to http://www.couchbase.com/issues/browse/MB-5653
        karan Karan Kumar (Inactive) made changes -
        Assignee Karan Kumar [ karan ] Trond Norbye [ trond ]
        Hide
        karan Karan Kumar (Inactive) added a comment -

        We have the core file of the memcached process available if needed.

        Show
        karan Karan Kumar (Inactive) added a comment - We have the core file of the memcached process available if needed.
        Hide
        karan Karan Kumar (Inactive) added a comment -

        10.3.121.13 (root/couchbase)

        /root/karan

        Show
        karan Karan Kumar (Inactive) added a comment - 10.3.121.13 (root/couchbase) /root/karan
        thuan Thuan Nguyen made changes -
        Priority Major [ 3 ] Blocker [ 1 ]
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        Karan,

        shouldn't this be assigned to couchbase bucket team first ?

        Show
        farshid Farshid Ghods (Inactive) added a comment - Karan, shouldn't this be assigned to couchbase bucket team first ?
        Hide
        karan Karan Kumar (Inactive) added a comment -

        Assigning to Chiyoung to see if this is ep-engine related

        Show
        karan Karan Kumar (Inactive) added a comment - Assigning to Chiyoung to see if this is ep-engine related
        karan Karan Kumar (Inactive) made changes -
        Assignee Trond Norbye [ trond ] Chiyoung Seo [ chiyoung ]
        Hide
        chiyoung Chiyoung Seo added a comment -

        Mike, please take a look at this issue.

        Show
        chiyoung Chiyoung Seo added a comment - Mike, please take a look at this issue.
        chiyoung Chiyoung Seo made changes -
        Assignee Chiyoung Seo [ chiyoung ] Mike Wiederhold [ mikew ]
        chiyoung Chiyoung Seo made changes -
        Sprint Status Current Sprint
        Hide
        trond Trond Norbye added a comment -

        Just guessing here, but from the callstack it looks like the crash is in the assign() method for the basic_string in flushOneDelOrSet.. (the other stacks seems somewhat sane)..

        Show
        trond Trond Norbye added a comment - Just guessing here, but from the callstack it looks like the crash is in the assign() method for the basic_string in flushOneDelOrSet.. (the other stacks seems somewhat sane)..
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Given we're spinning, looks similar to a case I hit some time ago when some list in flusher was 'infinite'. I even showed this to Chiyoung but we were unable to make any sense of what we saw

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Given we're spinning, looks similar to a case I hit some time ago when some list in flusher was 'infinite'. I even showed this to Chiyoung but we were unable to make any sense of what we saw
        mikew Mike Wiederhold made changes -
        Component/s couchbase-bucket [ 10173 ]
        Component/s bucket-engine [ 10010 ]
        Hide
        mikew Mike Wiederhold added a comment -

        Trond,

        I think this might be in the logging thread. I found this in the backtrace with gdb.

        Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.'.
        #0 0x0000003d7160b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

        There are a few ep-engine threads at this location also, but since the logging stuff is relatively new I think it might be the cause. Please take a look at this and assign it back to me if you think the problem is in ep-engine.

        Show
        mikew Mike Wiederhold added a comment - Trond, I think this might be in the logging thread. I found this in the backtrace with gdb. Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler.'. #0 0x0000003d7160b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 There are a few ep-engine threads at this location also, but since the logging stuff is relatively new I think it might be the cause. Please take a look at this and assign it back to me if you think the problem is in ep-engine.
        mikew Mike Wiederhold made changes -
        Assignee Mike Wiederhold [ mikew ] Trond Norbye [ trond ]
        Hide
        trond Trond Norbye added a comment -

        I find it relatively hard to believe that it may dump core on that line given that the code for that looks like:

        At file scope we have:

        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

        The call we're currently stuck in looks like:

        pthread_mutex_lock(&mutex);
        while (run) {
        struct timeval tp;
        gettimeofday(&tp, NULL);

        [ ... cut ...]

        gettimeofday(&tp, NULL);
        next = tp.tv_sec + (unsigned int)sleeptime;
        struct timespec ts =

        { .tv_sec = next }

        ;
        pthread_cond_timedwait(&cond, &mutex, &ts); <- This is where we're stuck
        }

        I can't see how we can pass stuff to pthread_cond_timedwait here that may cause it to crash (it could return with EINVAL for invalid input arguments)...

        If only I figured out how to ask gdb to show me the thread that caused the crash (and why.. which signal etc)

        Show
        trond Trond Norbye added a comment - I find it relatively hard to believe that it may dump core on that line given that the code for that looks like: At file scope we have: static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; The call we're currently stuck in looks like: pthread_mutex_lock(&mutex); while (run) { struct timeval tp; gettimeofday(&tp, NULL); [ ... cut ...] gettimeofday(&tp, NULL); next = tp.tv_sec + (unsigned int)sleeptime; struct timespec ts = { .tv_sec = next } ; pthread_cond_timedwait(&cond, &mutex, &ts); <- This is where we're stuck } I can't see how we can pass stuff to pthread_cond_timedwait here that may cause it to crash (it could return with EINVAL for invalid input arguments)... If only I figured out how to ask gdb to show me the thread that caused the crash (and why.. which signal etc)
        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0-beta-refresh [ 10385 ]
        Fix Version/s 2.0-beta [ 10113 ]
        Hide
        chiyoung Chiyoung Seo added a comment -

        Let me take a look at this issue to see if there are anything suspicious in ep-engine.

        Show
        chiyoung Chiyoung Seo added a comment - Let me take a look at this issue to see if there are anything suspicious in ep-engine.
        chiyoung Chiyoung Seo made changes -
        Assignee Trond Norbye [ trond ] Chiyoung Seo [ chiyoung ]
        karan Karan Kumar (Inactive) made changes -
        Labels system-test
        farshid Farshid Ghods (Inactive) made changes -
        Summary [longevity] memcached constantly exited with status 71 and unable to restart [longevity] memcached hangs when aborting during swap rebalance operation and fails to restart
        farshid Farshid Ghods (Inactive) made changes -
        Summary [longevity] memcached hangs when aborting during swap rebalance operation and fails to restart [longevity] memcached hangs when aborting during swap rebalance operation and fails to restart ( exit 71 )
        farshid Farshid Ghods (Inactive) made changes -
        Labels system-test 2.0-beta-release-notes system-test
        Hide
        chiyoung Chiyoung Seo added a comment -

        Tony,

        I was not able to reproduce this issue with 4 node cluster and still don't know why it happened.

        Did you see the same issue recently in your manual and longevity test?

        Show
        chiyoung Chiyoung Seo added a comment - Tony, I was not able to reproduce this issue with 4 node cluster and still don't know why it happened. Did you see the same issue recently in your manual and longevity test?
        chiyoung Chiyoung Seo made changes -
        Assignee Chiyoung Seo [ chiyoung ] Thuan Nguyen [ thuan ]
        Hide
        thuan Thuan Nguyen added a comment -

        I have not seen this issue since then in my system test.

        Show
        thuan Thuan Nguyen added a comment - I have not seen this issue since then in my system test.
        Hide
        chiyoung Chiyoung Seo added a comment -

        Let's close this bug at this time, and create a new bug if we see this issue again. There have been lots of fixes including bucket destroy in ep-engine.

        Show
        chiyoung Chiyoung Seo added a comment - Let's close this bug at this time, and create a new bug if we see this issue again. There have been lots of fixes including bucket destroy in ep-engine.
        chiyoung Chiyoung Seo made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Cannot Reproduce [ 5 ]
        chiyoung Chiyoung Seo made changes -
        Sprint Status Current Sprint
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            thuan Thuan Nguyen
            Reporter:
            thuan Thuan Nguyen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes