Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 2.0-beta-2
Affects Version/s: 2.0-beta-2
Component/s: XDCR
Security Level: Public
Labels:
None
Environment:
2.0.0-1835-rel
Ubuntu EC2
10 (west coast) : 10 (south east)
standard bucket :: west -> southeast
default :: southeast -> west

Description

Set up two unidirectional replications, one in each direction, between the two clusters listed under Environment.
After loading and replicating 75M+ items on the standard bucket and 45M+ items on the default bucket,
nodes on cluster1 (west coast) started going down, first a few and then gradually all of them.
couchbase-server still appeared to be running on each of them.
Some of these nodes had gone through a warmup (for unknown reasons), and after the warmup ns_server could not connect to the vbuckets, leaving many of them in a dead state.

atop output for one of the nodes that went down (c1):
24830 couchbas 20 0 3717m 3.4g 1956 S 14 23.4 2:53.76 memcached
9590 couchbas 20 0 3095m 1.4g 1648 S 10 9.8 1040:28 beam.smp

I was able to collect the diags from a few nodes before all of them went down (attached).

c1: http://ec2-184-169-219-253.us-west-1.compute.amazonaws.com:8091/
c2: http://ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091/