GoXDCR failing during a rebalance (CBAuth stale)

Description

While testing failure scenarios for the Java SDK, I came across a rebalance error in the web console. I was rebalancing a 4-node cluster back to a healthy state with all 4 nodes up; only one of the nodes runs query+index in addition to data.

After that, I retried the rebalance a few times and saw the following error each time (goxdcr, due to cbauth stale):

I also tried to collect the logs from node 4 (.104), and it restarted on me each time! Note that node 1 (.101, the data+query+index node) also appears to be restarting.

I was finally able to collect and upload logs (covering my entire test session today) from node 3; the logs are attached.

Components

Affects versions

4.0.0

Fix versions

4.1.0

Labels

None

Environment

4.0.0-RC (4.0.0-4047 Enterprise Edition (build-4047)) Ubuntu14

Link to Log File, atop/blg, CBCollectInfo, Core dump

https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.101.zip
https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.102.zip
https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.103.zip
https://s3.amazonaws.com/cb-customers/SDK_Team_Simon/collectinfo-2015-08-31T123806-ns_1%40192.168.125.104.zip

Release Notes Description

None

Activity

Raju Suravarjjala October 30, 2015 at 8:25 PM

Simon, feel free to reopen if necessary

Aliaksey Artamonau September 1, 2015 at 4:54 PM

The failures are due to the OOM killer. The machines have only 1 GB of RAM and no swap. Unfortunately, that's not enough for 4.0.0.
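
For anyone wanting to rule out the same cause on their own test VMs, here is a minimal sketch that reads /proc/meminfo-style text and flags a machine with little RAM and no swap. The function names and the `min_ram_kb` threshold are hypothetical; the threshold used below is only an illustrative value, not an official sizing requirement, and the sample text is made up to resemble a 1 GB machine.

```python
def parse_meminfo(text):
    """Map field names to sizes in kB from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <number> ..." lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def oom_risk(fields, min_ram_kb):
    """True when total RAM is below min_ram_kb and no swap is configured."""
    total = fields.get("MemTotal", 0)
    swap = fields.get("SwapTotal", 0)
    return total < min_ram_kb and swap == 0


# Illustrative input only; on a real node, read the text from /proc/meminfo.
SAMPLE = "MemTotal:        1048576 kB\nSwapTotal:             0 kB\n"

if __name__ == "__main__":
    # 2 GB is an assumed threshold chosen for the example.
    print(oom_risk(parse_meminfo(SAMPLE), min_ram_kb=2 * 1024 * 1024))
```

On a live node the same check would be run against the contents of /proc/meminfo instead of the sample string.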

Yu Sui August 31, 2015 at 5:29 PM

xdcr kept restarting because it kept failing to connect to the metakv service. There are two kinds of metakv-related errors in the log files, and both caused xdcr to restart. Re-assigning to the owner of metakv.

1. 2015/08/31 08:42:15 revrpc: Got error (dial tcp 127.0.0.1:8091: connection refused) and will retry in 1s
MetadataService 2015-08-31T08:42:15.311Z [ERROR] metakv.ListAllChildren failed. path=/remoteCluster/, err=Get http://127.0.0.1:8091/_metakv/remoteCluster/: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: connection refused, num_of_retry=0

2. ReplicationSpecChangeListener 2015-08-31T08:42:14.613Z [INFO] metakv.RunObserveChildren failed, err=Get http://127.0.0.1:8091/_metakv/replicationSpec/?feed=continuous: dial tcp 127.0.0.1:8091: connection refused
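
To count how often each of the two kinds occurs in a goxdcr log, a rough sketch like the following could be used. It only matches the substrings visible in the two log lines quoted above; the function name and the returned labels are hypothetical, and other log formats are not handled.

```python
import re

# Matches the "metakv.<Call> failed" fragment seen in both quoted lines.
METAKV_CALL = re.compile(r"metakv\.(\w+) failed")


def classify_metakv_error(line):
    """Return (call, kind) for a metakv failure line, or None otherwise."""
    match = METAKV_CALL.search(line)
    if match is None:
        return None
    call = match.group(1)
    # Check the stale-cbauth case first: that line also ends in
    # "connection refused" as the last reason.
    if "CBAuth database is stale" in line:
        return (call, "cbauth_stale")
    if "connection refused" in line:
        return (call, "connection_refused")
    return (call, "other")
```

Lines that do not mention a failed metakv call, such as the revrpc retry message, return None.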

Won't Fix

Details
Assignee
Aliaksey Artamonau
Reporter
Simon Baslé
Is this a Regression?
Unknown
Triage
Untriaged
Operating System
Ubuntu 64-bit
Priority
Major

Created August 31, 2015 at 12:53 PM

Updated October 30, 2015 at 8:25 PM

Resolved September 1, 2015 at 4:54 PM
