Details
Bug
Resolution: Duplicate
Blocker
3.0
Security Level: Public
CentOS, 64-bit, 15 GB RAM nodes, 4 cores
Untriaged
No
June 30 - July 18
Description
Setup
--------
8 * 8 clusters with XDCR
3 buckets - 2 non-sasl buckets and 1 saslbucket
1. standardbucket (bi-directional, 6 GB) <--- the bucket under discussion
2. standardbucket1 (uni-directional, 5 GB)
3. saslbucket (no XDCR, 1 GB)
Source cluster - http://172.23.105.44:8091/index.html
Destination cluster - http://172.23.105.55:8091/index.html
Scenario
-------------
- Did the following:
1. Load on both clusters till vb_active_resident_items_ratio < 50.
2. Access phase with 98% gets and 2% sets, running for 3 hours
3. Rebalance-out 1 node at cluster1 with workload
4. Rebalance-in 1 node at cluster1 with workload
5. Failover and remove node at cluster1 with workload
6. Failover and add-back node at cluster1 with workload
7. Rebalance-out 1 node at cluster2 with workload
8. Rebalance-in 1 node at cluster2 with workload
9. Failover and remove node at cluster2 with workload
10. Failover and add-back node at cluster2 with workload
11. Soft restart all nodes in cluster1 one by one with workload
12. Soft restart all nodes in cluster2 one by one with workload
1.5 days after the completion of the test:
number of items in 'standardbucket' in C1 - 67309284
number of items in 'standardbucket' in C2 - 67513466
Missing items : 204182 at C1
No outbound replications on either cluster; XDCR queue size = 0 and XDCR percent complete = 100%. Many checkpointing errors are seen on both C1 and C2.
C1 - http://172.23.105.44:8091/index.html#sec=buckets
C2 - http://172.23.105.54:8091/index.html#sec=buckets
The item counts in standardbucket on the source and destination clusters do not match, yet there are no items in the replication queue on either cluster.
Views with revid info (sorry, I ran out of disk space trying to download and compare 120M items; trying other machines)
-------------------------------
http://172.23.105.44:8092/standardbucket/_design/ddoc/_view/_all_doc?descending=false&stale=false&connection_timeout=60000&limit=100000000&skip=0
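Comparing the two clusters does not require holding both 60M+ item dumps in memory or on the same disk at once. A minimal sketch of a streaming comparison follows. It assumes each cluster's view output has already been reduced to (doc id, revid) pairs and that both streams are sorted by doc id in plain code-point order (view collation may differ, so the dumps may need re-sorting first). The function name `diff_sorted` and the row shape are illustrative, not part of any Couchbase tool.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, str]  # (document id, revision id); shape is an assumption


def diff_sorted(src: Iterable[Row], dst: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Merge-join two streams sorted by doc id and yield every difference.

    Yields (kind, doc_id) where kind is one of
    "missing_on_dst", "missing_on_src" or "rev_mismatch".
    Only one row per stream is held in memory at a time.
    """
    a, b = iter(src), iter(dst)
    x: Optional[Row] = next(a, None)
    y: Optional[Row] = next(b, None)
    while x is not None or y is not None:
        if y is None or (x is not None and x[0] < y[0]):
            # Source has a key the destination lacks.
            yield ("missing_on_dst", x[0])
            x = next(a, None)
        elif x is None or y[0] < x[0]:
            # Destination has a key the source lacks.
            yield ("missing_on_src", y[0])
            y = next(b, None)
        else:
            # Same key on both sides: compare revisions.
            if x[1] != y[1]:
                yield ("rev_mismatch", x[0])
            x, y = next(a, None), next(b, None)
```

Because both inputs are consumed lazily, they can be generators reading line by line from two files or two HTTP responses, which avoids the disk-space problem mentioned above.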
Checkpointing errors seen in logs
---------------------------------------------------
[xdcr:error,2014-04-10T14:19:43.911,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:
[xdcr:error,2014-04-10T14:19:43.912,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep:start_replication:1000]checkpoint commit failure at start of replication for vb 264
[xdcr:error,2014-04-10T14:19:43.912,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep:terminate:534]Replication (XMem mode) `455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1` (`standardbucket1/264` -> `http://*****@172.23.105.57:8092/standardbucket1%2f264%3be6756656d287a83af17925c49bd9c6e0`) failed.Please see ns_server debug log for complete state dump
[xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:220]Checkpointing failed unexpectedly (or could be network problem): {error,500,
<<"
\n">>}
[xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep:start_replication:1000]checkpoint commit failure at start of replication for vb 124
[xdcr:error,2014-04-10T14:19:47.444,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep:terminate:534]Replication (XMem mode) `455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1` (`standardbucket1/124` -> `http://*****@172.23.105.54:8092/standardbucket1%2f124%3be6756656d287a83af17925c49bd9c6e0`) failed.Please see ns_server debug log for complete state dump
[xdcr:error,2014-04-10T14:00:00.006,ns_1@172.23.105.44:<0.5889.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:
[xdcr:error,2014-04-10T12:34:55.818,ns_1@172.23.105.44:<0.6033.0>:xdc_vbucket_rep:handle_info:118]Error initializing vb replicator ({init_state,
{rep,
<<"455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1">>,
<<"standardbucket1">>,
<<"/remoteClusters/455e4a452e237ea5f4d86c543303b49c/buckets/standardbucket1">>,
"xmem",
[
{worker_batch_size,500},
{failure_restart_interval,30},
{doc_batch_size_kb,2048},
{checkpoint_interval,1800},
{max_concurrent_reps,32},
{connection_timeout,180},
{worker_processes,4},
{http_connections,20},
{retries_per_request,2},
{socket_options,
[{keepalive,true},{nodelay,false}]},
{pause_requested,false},
{supervisor_max_r,25},
{supervisor_max_t,5},
{trace_dump_invprob,1000}]},
113,"xmem",<0.6012.0>,<0.6013.0>,
<0.6009.0>}):{exit,
{shutdown,
{gen_server,call,
['ns_memcached-standardbucket1',
{stats, [118,98,117,99,107,101, 116,45,115,101,113,110, 111,32,"113"]},
180000]}}}
[xdcr:error,2014-04-10T12:59:12.767,ns_1@172.23.105.54:<0.5861.0>:xdc_vbucket_rep:terminate:507]Shutting xdcr vb replicator ({init_state,
{rep,
<<"e4e06be20b146e347092d4fb78ba36f5/standardbucket/standardbucket">>,
<<"standardbucket">>,
<<"/remoteClusters/e4e06be20b146e347092d4fb78ba36f5/buckets/standardbucket">>,
"xmem",
[{optimistic_replication_threshold,256}
,
,
,
,
,
,
,
,
,
,
{socket_options,
[
,
{nodelay,false}]},
,
,
,
]},
24,"xmem",<0.5839.0>,<0.5840.0>,<0.5831.0>}) down without ever successfully initializing: {{{badmatch,
{error,
timeout}},
[
,
,
,
,
,
,
,
]},
{gen_server,
call,
['ns_memcached-standardbucket',
,
180000]}}
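The `stats` argument in the first "Error initializing vb replicator" entry above is printed as an Erlang iolist (a list of character codes mixed with a string), which hides which stat call timed out. A small sketch for decoding it, assuming the term has been transcribed by hand into a Python list; `flatten_iolist` is a hypothetical helper name:

```python
def flatten_iolist(term) -> str:
    """Flatten a nested list of character codes and strings into one string."""
    if isinstance(term, int):
        return chr(term)          # a single character code
    if isinstance(term, str):
        return term               # an already-readable fragment
    return "".join(flatten_iolist(t) for t in term)


# Transcribed from the gen_server call in the log entry above.
stats_key = [118, 98, 117, 99, 107, 101, 116, 45, 115, 101, 113, 110, 111, 32, "113"]
print(flatten_iolist(stats_key))  # prints: vbucket-seqno 113
```

The decoded key, `vbucket-seqno 113`, matches the vbucket number 113 shown in the same replicator state, so the call that hit the 180000 ms timeout was a per-vbucket sequence number stats request to `ns_memcached-standardbucket1`.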
A sense of how many "Checkpointing failed unexpectedly" errors are seen on one node:
[root@172.23.105.44 logs]# grep -c "Checkpointing failed unexpectedly" xdcr_errors.1
256
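To see how the other error types compare with that count, the same log can be tallied by the code location that emitted each line instead of by message text. This is a sketch only: the regular expression is inferred from the log lines quoted in this report, and `tally_errors` is an illustrative name, not an existing tool.

```python
import re
from collections import Counter

# [xdcr:error,<timestamp>,<node>:<pid>:<module>:<function>:<line>]<message>
LINE_RE = re.compile(r"^\[xdcr:error,[^,]+,[^:]+:<[^>]+>:(\w+):(\w+):\d+\]")


def tally_errors(lines):
    """Count xdcr error entries per "module:function"; continuation lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[f"{m.group(1)}:{m.group(2)}"] += 1
    return counts
```

A possible usage, with the file name taken from the grep command above:

```python
with open("xdcr_errors.1", errors="replace") as f:
    for location, n in tally_errors(f).most_common():
        print(n, location)
```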
Could be a duplicate of MB-10792. This can be closed if attached logs reveal so, in which case, since there is data loss, this problem should be a blocker.
Cbcollect info
--------------------
Source - https://s3.amazonaws.com/bugdb/jira/MB-10844/source.tar.gz
Destination - https://s3.amazonaws.com/bugdb/jira/MB-10844/dest.tar.gz
Issue Links
- relates to: MB-10856 Persistence and internal replication (TAP/UPR) are broken, affects XDCR (Closed)