Couchbase Server / MB-12218

DGM cluster saw "out of memory" errors from couchstore on vbucket snapshot path


Details

    • Bug
    • Resolution: Duplicate
    • Major
    • bug-backlog
    • 3.0-Beta
    • couchbase-bucket
    • Security Level: Public

    Description

      Raising this defect after looking at a large DGM cluster that had a stalled rebalance. It looks like memory allocation failures in couchstore led to memcached terminating and the rebalance stalling, whereas perhaps the error could have been handled and ejection performed instead?

      The cluster is a 4-node "large" scale cluster hosted in Azure. Cihan provided me with access via a private key; I would rather people request it from Cihan than have me spread the key around. At the moment the cluster is stuck, and historical logging data on a number of nodes indicates that memory errors were caught but led to termination and, I suspect, the stall.

      The tail end of the following file shows memory problems are detected and logged:

      Starting at 10:31 we see the following pattern.

      Sat Sep 13 10:31:31.375401 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name=/data/couchbase/b1_full_ejection/1020.couch.1 option=1 rev=1 error=failed to allocate buffer [errno = 12: 'Cannot allocate memory']
      Sat Sep 13 10:31:31.375461 UTC 3: (b1_full_ejection) Warning: failed to open database, name=/data/couchbase/b1_full_ejection/1020.couch.1020
      Sat Sep 13 10:31:31.375474 UTC 3: (b1_full_ejection) Warning: failed to set new state, active, for vbucket 1020
      Sat Sep 13 10:31:31.375398 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name= option=1 rev=1 error=failed to allocate buffer []
      Sat Sep 13 10:31:31.375481 UTC 3: (b1_full_ejection) VBucket snapshot task failed!!! Rescheduling
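To check how widespread this pattern is across the nodes, the log lines can be filtered mechanically. The following is a minimal sketch, assuming the line layout seen in the excerpts in this report; the function names and the marker list are hypothetical helpers for illustration and are not part of any Couchbase tooling.

```python
import re
from collections import Counter

# Assumed line layout, based only on the excerpts in this report:
#   Sat Sep 13 10:31:31.375401 UTC 3: (bucket) message
LINE_RE = re.compile(
    r"^\s*\w{3} (?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) UTC \d+: "
    r"\((?P<bucket>[^)]*)\) (?P<message>.*)$"
)

# Substrings copied from the log excerpts; the list is illustrative, not exhaustive.
ALLOC_MARKERS = (
    "failed to allocate buffer",
    "Cannot allocate memory",
    "std::bad_alloc",
)


def allocation_failures(lines):
    """Yield (timestamp, bucket, message) for lines reporting an allocation failure."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and any(m in match.group("message") for m in ALLOC_MARKERS):
            yield match.group("stamp"), match.group("bucket"), match.group("message")


def failures_per_second(lines):
    """Count allocation-failure lines per whole second, keyed like 'Sep 13 10:31:31'."""
    counts = Counter()
    for stamp, _bucket, _message in allocation_failures(lines):
        counts[stamp.split(".")[0]] += 1
    return counts
```

Feeding it an open log file, for example `failures_per_second(open(path))` with `path` pointing at a memcached.log file, gives a per-second count that makes bursts like the one at 10:31:31 easy to spot.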

      And finally the file ends with:

      Sat Sep 13 10:31:31.577731 UTC 3: (b1_full_ejection) nonio_worker_9: Exception caught in task "Checkpoint Remover on vb 189": std::bad_alloc

      The next version of memcached.log is the following file, which indicates that memcached was restarted:

      Sat Sep 13 10:32:29.783313 UTC 3: (b1_full_ejection) Trying to connect to mccouch: "127.0.0.1:11213"
      Sat Sep 13 10:32:29.787504 UTC 3: (b1_full_ejection) Connected to mccouch: "127.0.0.1:11213"
      Sat Sep 13 10:32:29.797130 UTC 3: (No Engine) Bucket b1_full_ejection registered with low priority
      Sat Sep 13 10:32:29.797244 UTC 3: (No Engine) Spawning 4 readers, 4 writers, 1 auxIO, 1 nonIO threads
      Sat Sep 13 10:32:30.100791 UTC 3: (b1_full_ejection) metadata loaded in 301 ms
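The gap between the last line of the old log and the first line of the new one gives a rough idea of how long memcached was down. A small sketch of that arithmetic follows, assuming the timestamp layout seen in the excerpts; the helper names are hypothetical, and the year is an assumption because the log lines do not carry one.

```python
from datetime import datetime

ASSUMED_YEAR = 2014  # Assumption: the log lines contain no year.


def parse_stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'Sat Sep 13 10:31:31.577731 UTC ...' timestamp of a log line."""
    stamp = line.strip().split(" UTC ")[0]   # 'Sat Sep 13 10:31:31.577731'
    month_day_time = stamp.split(None, 1)[1]  # drop the weekday name
    return datetime.strptime(f"{year} {month_day_time}", "%Y %b %d %H:%M:%S.%f")


def restart_gap_seconds(last_line_before, first_line_after):
    """Seconds between the final line of one log and the first line of the next."""
    delta = parse_stamp(first_line_after) - parse_stamp(last_line_before)
    return delta.total_seconds()
```

Applied to the bad_alloc line at 10:31:31.577731 and the first "Trying to connect to mccouch" line at 10:32:29.783313, this gives a gap of about 58 seconds.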

      cbcollect logs from 3 of the 4 nodes (/tmp is tiny on node 41), which may be useful but do not contain the historical data from the live node shown above:

      http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-43.zip
      http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-42.zip
      http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-40.zip


            People

              dhaikney David Haikney (Inactive)
              jwalker Jim Walker