Couchbase Server / MB-6592

[longevity] memcached hangs when aborting during swap rebalance operation and fails to restart (exit 71)

    Details

      Description

      Cluster information:

      • 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 10 GB RAM and a 150 GB disk.
      • 8 GB RAM allocated to Couchbase Server on each node (80% of total system memory)
      • Disk format ext3 on both data and root
      • Each server has its own drive; no disk sharing with other servers.
      • Load 9 million items into both buckets
      • Initial indexing in progress, so CPU load is somewhat heavy
      • Cluster has 2 buckets: default (3 GB) and saslbucket (3 GB)
      • Each bucket has one design doc with 2 views per doc (default: d1, saslbucket: d11)
      • Create a cluster of 10 nodes running Couchbase Server 2.0.0-1697:

      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.24
      10.3.121.25
      10.3.121.23

      • Data path: /data
      • View path: /data
      • Perform a swap rebalance: add node 26 and remove node 25 (see the CLI sketch after this list)
      • Rebalance failed, and the log page showed many "memcached exited with status 71" error messages.
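
      For reference, a swap rebalance like the one above can be driven with couchbase-cli. This is only a sketch, not the exact command used in the test: the IP of node 26 (10.3.121.26, inferred from the node list) and the Administrator credentials are assumptions, not taken from the logs.

      couchbase-cli rebalance -c 10.3.121.13:8091 \
        -u Administrator -p password \
        --server-add=10.3.121.26:8091 \
        --server-add-username=Administrator \
        --server-add-password=password \
        --server-remove=10.3.121.25:8091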

      Link to diags of all nodes: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-memcached-exit-71-20120910.tgz

      Link to atop file for node 13: https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
      Due to the large size of the atop files, all other atop files are in the /tmp directory of each node.


        Activity

        Trond Norbye added a comment:

        I find it rather hard to believe that it would dump core on that line, given that the code looks like this:

        At file scope we have:

        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

        The call we're currently stuck in looks like:

        pthread_mutex_lock(&mutex);
        while (run) {
            struct timeval tp;
            gettimeofday(&tp, NULL);

            [ ... cut ... ]

            gettimeofday(&tp, NULL);
            next = tp.tv_sec + (unsigned int)sleeptime;
            struct timespec ts = { .tv_sec = next };
            pthread_cond_timedwait(&cond, &mutex, &ts); <- This is where we're stuck
        }

        I can't see how anything we pass to pthread_cond_timedwait here could make it crash (at worst it would fail with EINVAL for invalid input arguments)...
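
        For reference, here is a minimal self-contained sketch of the same timed-wait pattern with the return value of pthread_cond_timedwait() checked. The names (run, sleeptime, periodic_thread) and the 5-second period are hypothetical stand-ins, not the actual memcached source:

        #include <errno.h>
        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/time.h>
        #include <unistd.h>

        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        static bool run = true;            /* hypothetical shutdown flag */
        static unsigned int sleeptime = 5; /* hypothetical period, in seconds */

        static void *periodic_thread(void *arg) {
            (void)arg;
            pthread_mutex_lock(&mutex);
            while (run) {
                /* ... periodic work elided, as in the original ... */

                /* Build an absolute wakeup time. Leaving tv_nsec
                   zero-initialized is valid input for the call below. */
                struct timeval tp;
                gettimeofday(&tp, NULL);
                struct timespec ts = { .tv_sec = tp.tv_sec + sleeptime };

                int rc = pthread_cond_timedwait(&cond, &mutex, &ts);
                if (rc != 0 && rc != ETIMEDOUT) {
                    /* Bad arguments surface here as EINVAL rather than
                       as a crash inside the call. */
                    fprintf(stderr, "pthread_cond_timedwait: %s\n",
                            strerror(rc));
                    break;
                }
            }
            pthread_mutex_unlock(&mutex);
            return NULL;
        }

        int main(void) {
            pthread_t tid;
            pthread_create(&tid, NULL, periodic_thread, NULL);
            sleep(12);                  /* let the loop tick a couple of times */
            pthread_mutex_lock(&mutex);
            run = false;                /* request shutdown ... */
            pthread_cond_signal(&cond); /* ... and wake the waiter */
            pthread_mutex_unlock(&mutex);
            pthread_join(tid, NULL);
            return 0;
        }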

        If only I could figure out how to ask gdb to show me which thread caused the crash (and why, i.e., which signal, etc.)
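
        For reference, with the core file loaded into a reasonably recent gdb alongside matching symbols, these commands usually answer that (exact output depends on the gdb version and on how the core was produced):

        (gdb) info threads         <- list all threads; on a Linux core, gdb selects the crashing one
        (gdb) thread apply all bt  <- backtrace of every thread
        (gdb) print $_siginfo      <- the signal that was delivered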

        Chiyoung Seo added a comment:

        Let me take a look at this issue to see if there is anything suspicious in ep-engine.

        Chiyoung Seo added a comment:

        Tony,

        I was not able to reproduce this issue with a 4-node cluster and still don't know why it happened.

        Did you see the same issue recently in your manual and longevity tests?

        Thuan Nguyen added a comment:

        I have not seen this issue since then in my system tests.

        Chiyoung Seo added a comment:

        Let's close this bug for now and file a new one if we see this issue again. There have been lots of fixes in ep-engine since then, including around bucket destroy.


          People

          • Assignee: Thuan Nguyen
          • Reporter: Thuan Nguyen
