Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: bug-backlog
Affects Version/s: 3.0-Beta
Component/s: couchbase-bucket
Security Level: Public
Labels:
- error-handling
- memory
Environment:

[info] OS Name : Linux 3.2.0-68-virtual
[info] OS Version : Ubuntu 12.04.5 LTS
[info] CB Version : 3.0.0-1209-rel-enterprise

[info] Architecture : x86_64
[info] Virtual Host : Microsoft HyperV
[ok] Installed CPUs : 4
[ok] Installed RAM : 28140 MB
[ok] Used RAM : 69.9% (19658 / 28139 MB)


Triage:
Triaged
Link to Log File, atop/blg, CBCollectInfo, Core dump:

Some memcached.log files from cbase-43

http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.5.txt
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.4.txt

Is this a Regression?:
Unknown

Description

Raising this defect after looking at a large DGM cluster that had a stalled rebalance. It looks like some failures in couchstore (memory issues) led to memcached termination and a stall of the rebalance, whereas maybe the error could have been handled and ejection performed?

The cluster is a 4-node "large" scale cluster hosted in Azure. Cihan provided me access via a private key, which I would rather people request from Cihan than have me spread the key around. At the moment the cluster is stuck, and there is historical logging data on a number of nodes indicating that memory errors were caught but led to termination and, I suspect, the stall.

The tail end of the following file shows memory problems are detected and logged:

http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.4.txt

Starting at 10:31 we see the following pattern.

Sat Sep 13 10:31:31.375401 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name=/data/couchbase/b1_full_ejection/1020.couch.1 option=1 rev=1 error=failed to allocate buffer [errno = 12: 'Cannot allocate memory']
Sat Sep 13 10:31:31.375461 UTC 3: (b1_full_ejection) Warning: failed to open database, name=/data/couchbase/b1_full_ejection/1020.couch.1020
Sat Sep 13 10:31:31.375474 UTC 3: (b1_full_ejection) Warning: failed to set new state, active, for vbucket 1020
Sat Sep 13 10:31:31.375398 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name= option=1 rev=1 error=failed to allocate buffer []
Sat Sep 13 10:31:31.375481 UTC 3: (b1_full_ejection) VBucket snapshot task failed!!! Rescheduling
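For anyone triaging the linked logs, the pattern above can be counted mechanically. The following is only a minimal sketch: the regular expression is inferred from the two `couchstore_open_db failed` lines quoted above, and the names `OPEN_DB_FAILED`, `SAMPLE`, and `summarize_open_db_failures` are hypothetical helpers, not part of any Couchbase tool.

```python
import re
from collections import Counter

# Pattern inferred from the couchstore_open_db warnings quoted in this report.
# The errno part is optional because one of the quoted lines ends in "[]".
OPEN_DB_FAILED = re.compile(
    r"\((?P<bucket>[^)]+)\) Warning: couchstore_open_db failed, "
    r"name=(?P<name>\S*) option=\d+ rev=\d+ "
    r"error=(?P<error>.*?) "
    r"\[(?:errno = (?P<errno>\d+): '(?P<strerror>[^']*)')?\]"
)

# The two open failures quoted above, plus one unrelated line from the same excerpt.
SAMPLE = [
    "Sat Sep 13 10:31:31.375401 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name=/data/couchbase/b1_full_ejection/1020.couch.1 option=1 rev=1 error=failed to allocate buffer [errno = 12: 'Cannot allocate memory']",
    "Sat Sep 13 10:31:31.375398 UTC 3: (b1_full_ejection) Warning: couchstore_open_db failed, name= option=1 rev=1 error=failed to allocate buffer []",
    "Sat Sep 13 10:31:31.375481 UTC 3: (b1_full_ejection) VBucket snapshot task failed!!! Rescheduling",
]


def summarize_open_db_failures(lines):
    """Count couchstore_open_db failures per (bucket, errno) pair.

    errno is None when the log line carries no errno detail.
    """
    counts = Counter()
    for line in lines:
        match = OPEN_DB_FAILED.search(line)
        if match is None:
            continue
        errno = int(match.group("errno")) if match.group("errno") else None
        counts[(match.group("bucket"), errno)] += 1
    return counts


if __name__ == "__main__":
    for (bucket, errno), n in summarize_open_db_failures(SAMPLE).items():
        print(bucket, errno, n)
```

To run it against a downloaded log, pass an open file object instead of `SAMPLE`; lines that do not match the inferred format are skipped.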

And finally the file ends with:

Sat Sep 13 10:31:31.577731 UTC 3: (b1_full_ejection) nonio_worker_9: Exception caught in task "Checkpoint Remover on vb 189": std::bad_alloc

The next memcached.log file, linked below, indicates that memcached was restarted:

http://customers.couchbase.com.s3.amazonaws.com/jimw/cbase-43-memcached.log.5.txt

Sat Sep 13 10:32:29.783313 UTC 3: (b1_full_ejection) Trying to connect to mccouch: "127.0.0.1:11213"
Sat Sep 13 10:32:29.787504 UTC 3: (b1_full_ejection) Connected to mccouch: "127.0.0.1:11213"
Sat Sep 13 10:32:29.797130 UTC 3: (No Engine) Bucket b1_full_ejection registered with low priority
Sat Sep 13 10:32:29.797244 UTC 3: (No Engine) Spawning 4 readers, 4 writers, 1 auxIO, 1 nonIO threads
Sat Sep 13 10:32:30.100791 UTC 3: (b1_full_ejection) metadata loaded in 301 ms
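The gap between the last entry of the old log and the first entry of the new one gives a rough idea of how long memcached was down. The sketch below computes it from the quoted lines. It assumes the timestamp layout shown in this report and assumes the year is 2014 (the log lines carry no year; 2014 is taken from the issue's creation date). `parse_ts` and `restart_gap_seconds` are hypothetical helper names.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the quoted memcached.log lines,
# e.g. "Sat Sep 13 10:31:31.577731 UTC". The weekday is matched but ignored.
TS = re.compile(r"^\w{3} (\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{6}) UTC")


def parse_ts(line, year=2014):
    """Return the line's timestamp, or None if the line has no timestamp.

    The year is an assumption: the log format does not include one.
    """
    match = TS.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S.%f")


def restart_gap_seconds(old_log_lines, new_log_lines):
    """Seconds between the latest timestamp in the old log and the earliest in the new.

    max/min are used instead of last/first because entries are not strictly
    ordered (the excerpt above contains an out-of-order line).
    Raises ValueError if either input has no timestamped line.
    """
    old = [ts for ts in map(parse_ts, old_log_lines) if ts is not None]
    new = [ts for ts in map(parse_ts, new_log_lines) if ts is not None]
    return (min(new) - max(old)).total_seconds()


if __name__ == "__main__":
    old_tail = [
        'Sat Sep 13 10:31:31.577731 UTC 3: (b1_full_ejection) nonio_worker_9: Exception caught in task "Checkpoint Remover on vb 189": std::bad_alloc',
    ]
    new_head = [
        'Sat Sep 13 10:32:29.783313 UTC 3: (b1_full_ejection) Trying to connect to mccouch: "127.0.0.1:11213"',
    ]
    print(restart_gap_seconds(old_tail, new_head))
```

For the two lines quoted in this report the gap is roughly 58 seconds, which is consistent with a process restart rather than a log rotation during normal operation.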

cbcollect logs from 3 of the 4 nodes (/tmp is tiny on node 41), which may be useful but do not contain the historical data from the live node shown above:

http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-43.zip
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-42.zip
http://customers.couchbase.com.s3.amazonaws.com/jimw/cbbase-40.zip

Issue Links

relates to

MB-13091 CLONE - DGM cluster saw "out of memory" errors from couchstore on vbucket snapshot path

Closed

People

Assignee: David Haikney (Inactive)

Reporter: Jim Walker

Votes: 0

Watchers: 5

Dates

Created: 22/Sep/14 9:03 AM

Updated: 31/Mar/15 12:29 PM

Resolved: 31/Mar/15 12:29 PM

Gerrit Reviews

There are no open Gerrit changes
