Description
The easiest way to trigger this is by forcing the server into heavy tmp_oom, which causes the streams to crash and restart (as expected). The problem is that while they should all restart very quickly, the whole set can take up to a few minutes.
To reproduce:
- Set up two 4-node clusters with the beer-sample database
- Link that bucket in a bi-directional replication with these settings:
  - Protocol: version 1 or version 2 (both need testing to make sure the fix applies to both)
  - Replicators: >128 (I know it's high, but this is where we see the issue most easily)
  - Restart interval: 1s (the default of 20s shows the same problem, but 1s makes it more obvious and "finishes" faster)
  - Optimistic threshold: 11000 (not sure whether this matters, but it's how I've been running the test)
- Run the following workload from two separate clients, against each bucket simultaneously (using libcouchbase's cbc tool; it shouldn't really matter what client you use, but this is my test):
Client1: cbc pillowfight --host <cluster1> -b beer-sample --num-threads 4 --min-size 10240 --max-size 10240 --ratio 50 -Q 4 -I 30000
Client2: cbc pillowfight --host <cluster2> -b beer-sample --num-threads 4 --min-size 10240 --max-size 10240 --ratio 50 -Q 4 -I 20000
(notice that client1 loads 30k items while client2 loads 20k; this difference makes the mismatch in item counts easy to spot)
- You will quickly observe that many sets fail with temporary failures; this is okay and expected
- Once the disk write queue has finished draining, you will observe a lot more XDCR traffic in very "spiky" intervals, going from 0 to ~200 items/sec transferred once per second
- You will also notice that the item counts do not match and only finish synchronizing many minutes after the workload has stopped
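For convenience, the two-client workload above can be wrapped in a small script that launches both pillowfight runs in parallel. This is only a sketch: the hostnames are placeholders, and the RUN variable keeps it in dry-run (print-only) mode by default.

```shell
#!/bin/sh
# Sketch of the two-client workload; hostnames are placeholders.
CLUSTER1=cluster1.example.com
CLUSTER2=cluster2.example.com
RUN=echo   # set RUN="" to actually execute instead of just printing

# The two commands differ only in target cluster and item count (30k vs 20k)
CMD1="cbc pillowfight --host $CLUSTER1 -b beer-sample --num-threads 4 \
--min-size 10240 --max-size 10240 --ratio 50 -Q 4 -I 30000"
CMD2="cbc pillowfight --host $CLUSTER2 -b beer-sample --num-threads 4 \
--min-size 10240 --max-size 10240 --ratio 50 -Q 4 -I 20000"

# Launch both clients simultaneously, one against each cluster, and wait
# (simple unquoted expansion is fine here: the arguments contain no spaces)
$RUN $CMD1 &
$RUN $CMD2 &
wait
```

Running the two clients concurrently matters: the temporary failures and the spiky post-drain XDCR traffic show up when both buckets are being mutated and replicated at the same time.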
The high-level issue is that we would not expect it to take so much longer for the item counts to catch up.
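To put a number on the catch-up time, one can poll the per-bucket item counts on both clusters and watch for them to converge. A minimal sketch, assuming the standard per-bucket REST endpoint (`/pools/default/buckets/<bucket>`, which reports an `itemCount` under `basicStats`) and placeholder hostnames/credentials:

```shell
#!/bin/sh
# Hypothetical item-count watcher; hosts and credentials are placeholders.
AUTH="Administrator:password"

# Pull the itemCount field out of the bucket JSON (crude, avoids a jq dependency)
extract_item_count() {
  sed -n 's/.*"itemCount":\([0-9][0-9]*\).*/\1/p'
}

# Fetch the beer-sample item count from one cluster's REST API
bucket_items() {
  curl -s --max-time 5 -u "$AUTH" \
    "http://$1:8091/pools/default/buckets/beer-sample" | extract_item_count
}

# Poll both clusters until the counts converge, printing a timestamped line each pass
watch_counts() {
  while :; do
    c1=$(bucket_items cluster1.example.com)
    c2=$(bucket_items cluster2.example.com)
    echo "$(date +%T) cluster1=$c1 cluster2=$c2"
    [ -n "$c1" ] && [ "$c1" = "$c2" ] && break
    sleep 5
  done
}
```

Calling `watch_counts` once the workload stops gives a timestamped record of how long the counts stay mismatched, which is the delay this report is about.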
I'm including both the xdcr and ns_server components. Junyi has verified that the XDCR code is doing "the right thing" by asking a stream to restart within its interval, so the suspicion now is that ns_server is either not scheduling them all to restart at once or is delaying them in some other way.
I've reproduced this with both CAPI and XMEM: CAPI only on 2.2 (because XMEM had another bug there), and both on 2.5-1007.
These logs are from a bi-directional XMEM run with 256 replicators, a 30s restart interval, and an optimistic threshold of 11000 (slightly different from the steps above, but the numbers don't affect the overall behavior):
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster1node1.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster1node2.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster1node3.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster1node4.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster2node1.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster2node2.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster2node3.zip
http://s3.amazonaws.com/customers.couchbase.com/cbse_898/2.5.0_1007xmemcluster2node4.zip
In this output, the XMEM XDCR stream was created at "2013-12-11T10:51:28.327"; the workload started right after that and finished sometime before 2013-12-11T10:53:48.422. The XDCR synchronization continued for a few minutes after that.
Attachments
Issue Links
- is duplicated by: MB-9707 users may see incorrect "Outbound mutations" stat after topology change at source cluster (was: Rebalance in/out operation on Source cluster caused outbound replication mutations != 0 for long time while no write operation on source cluster) (Closed)
- relates to: MB-9819 Replication is quite slow for some tests when XDCR error occurs (Closed)