Details
Bug
Resolution: Duplicate
Major
3.0-Beta
Security Level: Public
Triaged
Unknown
Description
Raising this defect after looking at a large DGM cluster that had a stalled rebalance. It looks like some failures in couchstore (memory allocation failures) led to memcached terminating and the rebalance stalling, whereas the error could perhaps have been handled and ejection performed instead.
The cluster is a 4-node "large" scale cluster hosted in Azure. Cihan provided me access via a private key, which I would rather people request from Cihan than have me spread around. At the moment the cluster is stuck, and historical logging data on a number of nodes indicates that memory errors were caught but led to termination and, I suspect, the stall.
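To make the suggestion concrete, here is a minimal C++ sketch of handling an allocation failure by releasing memory and retrying rather than letting the process terminate. It is not ep-engine or couchstore code: `retry_after_eviction`, the `op` callback and the `free_memory` callback (standing in for item ejection) are all hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>

// Hypothetical helper, not part of ep-engine or couchstore.
// Runs `op`; if it throws std::bad_alloc, calls `free_memory` (assumed to
// eject items and return the number of bytes released) and tries again.
// Returns true once `op` succeeds, false if all attempts fail, so the
// caller can fail or reschedule one task instead of aborting the process.
bool retry_after_eviction(const std::function<void()>& op,
                          const std::function<std::size_t()>& free_memory,
                          int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        try {
            op();
            return true;
        } catch (const std::bad_alloc&) {
            // Give up early when nothing more could be released.
            if (free_memory() == 0) {
                return false;
            }
        }
    }
    return false;
}
```

Whether a retry is safe depends on the operation being repeatable without side effects, which would need to be checked for each real call site.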
The tail end of the following file shows memory problems are detected and logged:
Starting at 10:31 we see the following pattern.
Sat Sep 13 10:31:31.375401 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name=/data/couchbase/b1_full_ejection/1020.couch.1 option=1 rev=1 error=failed to allocate buffer [errno = 12: 'Cannot allocate memory']
Sat Sep 13 10:31:31.375461 UTC 3: (b1_full_ejection) Warning: failed to open database, name=/data/couchbase/b1_full_ejection/1020.couch.1020
Sat Sep 13 10:31:31.375474 UTC 3: (b1_full_ejection) Warning: failed to set new state, active, for vbucket 1020
Sat Sep 13 10:31:31.375398 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name= option=1 rev=1 error=failed to allocate buffer []
Sat Sep 13 10:31:31.375481 UTC 3: (b1_full_ejection) VBucket snapshot task failed!!! Rescheduling
And finally the file ends with:
Sat Sep 13 10:31:31.577731 UTC 3: (b1_full_ejection) nonio_worker_9: Exception caught in task "Checkpoint Remover on vb 189": std::bad_alloc
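For checking how often this pattern occurs in a memcached.log, a small C++ sketch like the following could count the two memory-failure signatures quoted above. The function and struct names are hypothetical, and the substrings matched ("errno = 12" and "std::bad_alloc") are taken only from the excerpts in this report, so other log versions may word these errors differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary type for this sketch.
struct MemoryErrorCounts {
    int enomem = 0;     // lines containing "errno = 12" (ENOMEM)
    int bad_alloc = 0;  // lines containing "std::bad_alloc"
};

// Counts log lines that carry either memory-failure signature.
// `lines` is assumed to hold the log already split into lines.
MemoryErrorCounts count_memory_errors(const std::vector<std::string>& lines) {
    MemoryErrorCounts counts;
    for (const std::string& line : lines) {
        if (line.find("errno = 12") != std::string::npos) {
            ++counts.enomem;
        }
        if (line.find("std::bad_alloc") != std::string::npos) {
            ++counts.bad_alloc;
        }
    }
    return counts;
}
```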
The next version of memcached.log is the following file, which indicates that memcached was restarted:
Sat Sep 13 10:32:29.783313 UTC 3: (b1_full_ejection) Trying to connect to mccouch: "127.0.0.1:11213"
Sat Sep 13 10:32:29.787504 UTC 3: (b1_full_ejection) Connected to mccouch: "127.0.0.1:11213"
Sat Sep 13 10:32:29.797130 UTC 3: (No Engine) Bucket b1_full_ejection registered with low priority
Sat Sep 13 10:32:29.797244 UTC 3: (No Engine) Spawning 4 readers, 4 writers, 1 auxIO, 1 nonIO threads
Sat Sep 13 10:32:30.100791 UTC 3: (b1_full_ejection) metadata loaded in 301 ms
cbcollect logs from 3 of the 4 nodes (/tmp is tiny on node 41), which may be useful but do not include the historical data from the live node shown above:
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-43.zip
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-42.zip
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-40.zip
Attachments
Issue Links
- relates to MB-13091 CLONE - DGM cluster saw "out of memory" errors from couchstore on vbucket snapshot path (Closed)