GoXDCR failing during a rebalance (CBAuth stale)

Description

While testing failure scenarios for the Java SDK, I came across a rebalance error in the web console. I was rebalancing a 4-node cluster back to a healthy state with all 4 nodes up, only one of which runs query+index in addition to data.

After that, I retried rebalancing a few times and each time saw the following error (goxdcr failing because cbauth is stale):

I also tried to collect the logs from node 4 (.104), and it restarted on me each time! Note that node1 (.101, the data+query+index node) is also apparently restarting.

I was finally able to collect and upload logs from node3, covering my entire test session today; the logs are attached.

Environment

4.0.0-RC (4.0.0-4047 Enterprise Edition (build-4047)) Ubuntu14

Link to Log File, atop/blg, CBCollectInfo, Core dump

https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.101.zip
https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.102.zip
https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.103.zip
https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.104.zip

Release Notes Description

None

Activity

Raju Suravarjjala October 30, 2015 at 8:25 PM

Simon, feel free to reopen if necessary

Aliaksey Artamonau September 1, 2015 at 4:54 PM

The failures are due to the OOM killer. The machines have only 1 GB of RAM and no swap. Unfortunately, that's not enough for 4.0.0.
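For anyone wanting to check a test VM for the same condition, here is a minimal sketch that reads the standard Linux /proc/meminfo format and flags low RAM and missing swap. The function names and the min_total_kb threshold are hypothetical and chosen for illustration; the threshold is supplied by the caller and is not an official Couchbase requirement.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Values are in kB for fields that carry a kB unit (e.g. MemTotal, SwapTotal).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_warnings(text, min_total_kb):
    """Return a list of reasons the host may be exposed to OOM kills.

    min_total_kb is a caller-chosen threshold (an assumption for this sketch).
    """
    info = parse_meminfo(text)
    total_kb = info.get("MemTotal", 0)
    warnings = []
    if total_kb < min_total_kb:
        warnings.append("low RAM: %d kB" % total_kb)
    if info.get("SwapTotal", 0) == 0:
        warnings.append("no swap configured")
    return warnings
```

On a node, this could be run by passing the contents of /proc/meminfo as text. A non-empty result only suggests the host is at risk; the kernel log remains the place to confirm that the OOM killer actually fired.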

Yu Sui August 31, 2015 at 5:29 PM

xdcr kept restarting because it repeatedly failed to connect to the metakv service. There are two kinds of metakv-related errors in the log files, and both caused xdcr to restart. Re-assigning to the owner of metakv.

1. 2015/08/31 08:42:15 revrpc: Got error (dial tcp 127.0.0.1:8091: connection refused) and will retry in 1s
MetadataService 2015-08-31T08:42:15.311Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=0

2. ReplicationSpecChangeListener 2015-08-31T08:42:14.613Z [INFO] metakv.RunObserveChildren failed, err=Get http://127.0.0.1:8091/_metakv/replicationSpec/?feed=continuous: dial tcp 127.0.0.1:8091: connection refused
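To see how often each kind occurs in a collected log, a minimal sketch like the following could tally them. It matches only on substrings that appear in the two examples above; the function names and category labels are hypothetical, and other goxdcr log formats are not covered.

```python
from collections import Counter


def classify_metakv_error(line):
    """Classify a log line as one of the two metakv error kinds, or None.

    Matching relies only on substrings seen in the quoted log examples.
    """
    if "CBAuth database is stale" in line:
        return "cbauth_stale"  # kind 1: request rejected because cbauth is stale
    if "metakv" in line and "connection refused" in line:
        return "metakv_connection_refused"  # kind 2: direct connection failure
    return None


def count_metakv_errors(lines):
    """Count occurrences of each error kind across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        kind = classify_metakv_error(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

The stale-cbauth check comes first because that message also embeds "connection refused" as its last reason. The revrpc retry line does not mention metakv, so this sketch leaves it unclassified.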

Won't Fix

Details

Is this a Regression?

Unknown

Triage

Untriaged

Operating System

Ubuntu 64-bit

Created August 31, 2015 at 12:53 PM
Updated October 30, 2015 at 8:25 PM
Resolved September 1, 2015 at 4:54 PM