Couchbase C client library libcouchbase: CCBC-766

cbc-pillowfight does not recover after node failover triggered


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.7.3
    • Fix Version/s: 2.7.4
    • Component/s: None
    • Labels:
      None
    • Environment:
      Linux ip-10-0-129-240 4.4.11-23.53.amzn1.x86_64 #1 SMP Wed Jun 1 22:22:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

      Description

      When running cbc-pillowfight, if a node is stopped and a failover is triggered, cbc-pillowfight does not recover.

       

      Version tested as failing: 2.7.3

      Version tested as working: 2.6.4

       

      Steps to reproduce:

      Note: a brief video showing the reproduction steps is here:

      https://drive.google.com/open?id=0Bz1warxYkSR1REpGSEhWell6eVk

       

      • Create 3 node cluster, 1 bucket
      • From a separate application server, run cbc-pillowfight
      • On one node in the cluster, run:
        • service couchbase-server stop
      • The node will show up as 'down' in the UI
        • There are fewer than 1024 active vbuckets.
      • Pillowfight's traffic load will become sporadic due to timeouts (this is expected, I think).
      • Trigger a failover from the UI of the down node
      • There are now 1024 active vbuckets
      • Pillowfight does not recover, traffic remains sporadic.
      • ... ...
      • Start the couchbase node again
        • service couchbase-server start
      • Add the node back in with 'delta recovery' option.
      • Issue rebalance
      • As soon as the rebalance is started (and before it is complete!) pillowfight appears to start running full traffic loads again successfully. 

       

      Further notes:

      • I found the same behaviour regardless of whether Couchbase was shut down cleanly or had a 'hard stop'.
      • My tests showed that this worked successfully on 2.6.4; I will collect additional logs for this version and attach them to the ticket.

       

      Attachments:

      • pillowfight.log: a -vvv log from the pillowfight client, run during the full process outlined above with version 2.7.3.
      • Screenshots showing the traffic load at different stages (pre node down, during node down/failover, after rebalance started)

       

      Server side logs are available:
      https://s3.amazonaws.com/cb-customers/tom/collectinfo-2017-04-01T102252-ns_1%40ec2-34-248-41-12.eu-west-1.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-customers/tom/collectinfo-2017-04-01T102252-ns_1%40ec2-52-30-209-45.eu-west-1.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-customers/tom/collectinfo-2017-04-01T102252-ns_1%40ec2-52-31-18-152.eu-west-1.compute.amazonaws.com.zip


          Activity

          mnunberg Mark Nunberg (Inactive) added a comment -

          Compare:

           

          2.7.0

          43500ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          43501ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          43501ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          43501ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa06920 (ec2-34-248-41-12.eu-west-1.compute.amazonaws.com:11210)
          43502ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          53501ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          53501ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          53501ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          53501ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa092b0 (ec2-52-30-209-45.eu-west-1.compute.amazonaws.com:11210)
          53502ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          63503ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          63503ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          63503ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          63503ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa0a0f0 (ec2-52-31-18-152.eu-west-1.compute.amazonaws.com:11210)
          63519ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          73504ms [I0] {24403} [TRACE] (confmon - L:292) Start refresh requested
          73504ms [I0] {24403} [INFO] (confmon - L:166) Not applying configuration received via CCCP. No changes detected. A.rev=1114, B.rev=1114
          73504ms [I0] {24403} [TRACE] (confmon - L:271) Current provider is CCCP
          73504ms [I0] {24403} [INFO] (cccp - L:119) Re-Issuing CCCP Command on server struct 0xa06920 (ec2-34-248-41-12.eu-west-1.compute.amazonaws.com:11210)

          To (2.7.3):

          92693ms [I0] {3345} [TRACE] (confmon - L:252) Refreshing current cluster map
          92695ms [I0] {3345} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=396, B.rev=396
          92695ms [I0] {3345} [TRACE] (confmon - L:239) Attempting to retrieve cluster map via CCCP
          92695ms [I0] {3345} [INFO] (cccp - L:137) Re-Issuing CCCP Command on server struct 0x156a4a0 (ec2-34-248-41-12.eu-west-1.compute.amazonaws.com:11210)
          92697ms [I0] {3345} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=396, B.rev=396
          102708ms [I0] {3345} [TRACE] (confmon - L:252) Refreshing current cluster map
          102708ms [I0] {3345} [INFO] (confmon - L:145) Not applying configuration received via CCCP. No changes detected. A.rev=396, B.rev=396
          102708ms [I0] {3345} [TRACE] (confmon - L:239) Attempting to retrieve cluster map via CCCP
          198857ms [I0] {3345} [INFO] (confmon - L:153) Setting new configuration. Received via CCCP
          

          ...

          The first snippet spans 30 seconds and shows the config actually being fetched, while the second snippet spans 100 seconds and shows only one fetch, at 92695ms. Presumably when the server responds with the config, the cccp subsystem doesn't unmark itself as being in progress...

           

          I am unsure why our testing does not cover this; maybe we should add a specific test for this?

           

          Akshata Trivedi [X], Jae Park [X]: Do we have a test where we just failover without rebalance?

          It seems that the rebalance accidentally unsticks it because, in that case, the config is received via the payload of a not-my-vbucket response.

          jaekwon.park Jae Park [X] (Inactive) added a comment - edited

          We have a test case that fails over one node (the entry-point node) by calling the /controller/failOver REST API, then ejects it after 30 seconds by calling the /controller/ejectNode REST API, and then rebounds to check for performance changes, but we didn't see significant changes. Here is one example: http://sdk-testresults.couchbase.com.s3.amazonaws.com/SDK-SDK/CB-4.5.1-xxxx/FoEptEject-HYBRID/03-21-17/015669/38fc246faa422dd7187e07a18bc7fdd3-CB.html

          Mark Nunberg, can you please let me know what might need to be changed in the above test? Then we will work on adding a proper test case.

          mnunberg Mark Nunberg (Inactive) added a comment - edited

          Is the cluster configured with replicas? If there are no replicas, a failover would cause the existing vbucket maps to change, triggering an NMVB payload. I have a suspicion that the FoEptEject test does not use replicas.

           

          I realize that the 'show default values' checkbox doesn't work; it used to display the actual default values (replicas being one of them). They are still visible in the page source, which does suggest the test is run with a replica node.

           

          Tom Green - is your test run with a replica node?

           

          Jae Park [X] - can you check if we run this test with or without a replica node?

           

          Furthermore, I'd like to get a negative test replicating Tom's experience. It seems like a valid bug and there's a code change here, but for the code change to be sufficient we need to know that it fixes the issue.

          tom.green Tom Green (Inactive) added a comment -

          The issue is 100% reproducible every time across multiple operating systems for me.

          Just fire up 3 couchbase nodes (or even 2 would probably do) in virtualbox on laptop, create a cluster using all default settings, and default bucket, add the extra nodes in, rebalance to get yourself a stable 2 or 3 node cluster.

          Hit the bucket with pillowfight, take one of the nodes offline then when it shows as 'down' in UI click failover.

          Alternatively you could enable autofailover and allow the failover to be triggered by the system instead of manually, if it's a 3 node cluster.

          I can confirm a replica was used, yes. If you look in the video at about 1min25 onwards, you can see I wave my mouse over the active vbucket count after the failover to show that we've taken the replica vbuckets and converted them to active. In the graphs in that section you can see the replica count has dropped and the active count has gone up to the full 1024, as expected when a failover is triggered.

          How to formally integrate that scenario into the proper test case environments is something I'm not going to be much help with though, as I just don't know enough about our test systems.

           

          Cheers,

          Tom

          mnunberg Mark Nunberg (Inactive) added a comment -

          I've managed to reproduce it, and the patch works: it seems to fix the issue quite nicely.


            People

            • Assignee: mnunberg Mark Nunberg (Inactive)
            • Reporter: tom.green Tom Green (Inactive)
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
