Details

Type: Bug
Resolution: Duplicate
Priority: Blocker
Fix Version/s: 3.0
Affects Version/s: 3.0
Component/s: couchbase-bucket, XDCR
Security Level: Public
Labels:
- dataloss
Environment:
CentOS, 64-bit, 15 GB RAM per node, 4 cores

Triage:
Untriaged
Is this a Regression?:
No
Sprint:
June 30 - July 18

Description

Setup
--------
8 * 8 clusters with XDCR
3 buckets - 2 non-SASL buckets and 1 SASL bucket
1. standardbucket (bi-directional - 6GB) <---the bucket under discussion
2. standardbucket1 (uni-directional 5GB)
3. saslbucket (no xdcr 1 GB)

Source cluster - http://172.23.105.44:8091/index.html
Destination cluster - http://172.23.105.55:8091/index.html

Scenario
-------------

Did the following:

1. Load on both clusters till vb_active_resident_items_ratio < 50.
2. Access phase with 98% gets, 2% sets, running for 3 hours
3. Rebalance-out 1 node at cluster1 with workload
4. Rebalance-in 1 node at cluster1 with workload
5. Failover and remove node at cluster1 with workload
6. Failover and add-back node at cluster1 with workload
7. Rebalance-out 1 node at cluster2 with workload
8. Rebalance-in 1 node at cluster2 with workload
9. Failover and remove node at cluster2 with workload
10. Failover and add-back node at cluster2 with workload
11. Soft restart all nodes in cluster1 one by one with workload
12. Soft restart all nodes in cluster2 one by one with workload

1.5 days after the test completed:
Number of items in 'standardbucket' in C1 - 67309284
Number of items in 'standardbucket' in C2 - 67513466

Missing items : 204182 at C1
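
The missing-item figure follows directly from the two counts above. A minimal sketch of the arithmetic in Python (the variable names are illustrative and not part of any Couchbase tool):

```python
# Item counts reported for 'standardbucket' 1.5 days after the test.
c1_items = 67309284  # cluster C1
c2_items = 67513466  # cluster C2

# Items present on C2 but absent on C1.
missing_at_c1 = c2_items - c1_items
# Shortfall expressed as a percentage of the C2 count.
missing_pct = 100.0 * missing_at_c1 / c2_items

print(missing_at_c1)               # 204182
print(f"{missing_pct:.2f}%")       # 0.30%
```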

There are no outbound replications on either cluster, the XDCR queue size is 0, and XDCR percent complete is 100%, yet many checkpointing errors are seen on both C1 and C2.
C1 - http://172.23.105.44:8091/index.html#sec=buckets
C2 - http://172.23.105.54:8091/index.html#sec=buckets

The item counts for standardbucket on the source and destination clusters do not match, yet the replication queue is empty on both clusters.

Views with revid info (sorry, I ran out of disk space trying to download and compare 120M items; trying other machines)
-------------------------------
http://172.23.105.44:8092/standardbucket/_design/ddoc/_view/_all_doc?descending=false&stale=false&connection_timeout=60000&limit=100000000&skip=0

http://172.23.105.54:8092/standardbucket/_design/ddco1/_view/54_docs?descending=false&stale=false&connection_timeout=60000&limit=100000000&skip=0
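
One way to compare the two view outputs without holding 120M rows at once is a streaming merge over sorted rows. The sketch below is only an illustration: the `(doc_id, revid)` row shape and the function name `diff_sorted_rows` are hypothetical, and it assumes both streams have already been sorted by doc id in the order Python's `<` uses on strings (for example by a byte-wise offline sort), which is not necessarily the order the view itself returns.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # (doc_id, revid); hypothetical row shape


def diff_sorted_rows(c1: Iterable[Row], c2: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Yield (doc_id, status) for every row that differs between two id-sorted streams."""
    it1, it2 = iter(c1), iter(c2)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None or r2 is not None:
        if r2 is None or (r1 is not None and r1[0] < r2[0]):
            # Key exists only in the C1 stream.
            yield r1[0], "missing_on_c2"
            r1 = next(it1, None)
        elif r1 is None or r2[0] < r1[0]:
            # Key exists only in the C2 stream.
            yield r2[0], "missing_on_c1"
            r2 = next(it2, None)
        else:
            # Same key on both sides: compare revision ids.
            if r1[1] != r2[1]:
                yield r1[0], "revid_mismatch"
            r1, r2 = next(it1, None), next(it2, None)


# Tiny made-up sample, not data from the clusters.
c1_rows = [("doc1", "3-a"), ("doc2", "1-b"), ("doc4", "2-c")]
c2_rows = [("doc1", "3-a"), ("doc2", "2-z"), ("doc3", "1-d"), ("doc4", "2-c")]
print(list(diff_sorted_rows(c1_rows, c2_rows)))
# [('doc2', 'revid_mismatch'), ('doc3', 'missing_on_c1')]
```

Because only one row per side is held in memory, the inputs can be generators reading from files or HTTP responses rather than lists.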

Checkpointing errors seen in logs
---------------------------------------------------
[xdcr:error,2014-04-10T14:19:43.911,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:
{mismatch, [<<"264435226106585">>, <<"1397159965">>]}
[xdcr:error,2014-04-10T14:19:43.912,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep:start_replication:1000]checkpoint commit failure at start of replication for vb 264
[xdcr:error,2014-04-10T14:19:43.912,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep:terminate:534]Replication (XMem mode) `455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1` (`standardbucket1/264` -> `http://*****@172.23.105.57:8092/standardbucket1%2f264%3be6756656d287a83af17925c49bd9c6e0`) failed.Please see ns_server debug log for complete state dump
[xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:220]Checkpointing failed unexpectedly (or could be network problem): {error,500,
<<"{\"error\":\"unexpected_reason\",\"reason\":\"killed\"}\n">>}
[xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep:start_replication:1000]checkpoint commit failure at start of replication for vb 124
[xdcr:error,2014-04-10T14:19:47.444,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep:terminate:534]Replication (XMem mode) `455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1` (`standardbucket1/124` -> `http://*****@172.23.105.54:8092/standardbucket1%2f124%3be6756656d287a83af17925c49bd9c6e0`) failed.Please see ns_server debug log for complete state dump
[xdcr:error,2014-04-10T14:00:00.006,ns_1@172.23.105.44:<0.5889.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:
{mismatch, [<<"11294974043546">>, <<"1397159968">>]}
[xdcr:error,2014-04-10T12:34:55.818,ns_1@172.23.105.44:<0.6033.0>:xdc_vbucket_rep:handle_info:118]Error initializing vb replicator ({init_state,
{rep,
<<"455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1">>,
<<"standardbucket1">>,
<<"/remoteClusters/455e4a452e237ea5f4d86c543303b49c/buckets/standardbucket1">>,
"xmem",
[{optimistic_replication_threshold,256},
{worker_batch_size,500},
{failure_restart_interval,30},
{doc_batch_size_kb,2048},
{checkpoint_interval,1800},
{max_concurrent_reps,32},
{connection_timeout,180},
{worker_processes,4},
{http_connections,20},
{retries_per_request,2},
{socket_options,
[{keepalive,true},{nodelay,false}]},
{pause_requested,false},
{supervisor_max_r,25},
{supervisor_max_t,5},
{trace_dump_invprob,1000}]},
113,"xmem",<0.6012.0>,<0.6013.0>,
<0.6009.0>}):{exit,
{shutdown,
{gen_server,call,
['ns_memcached-standardbucket1',
{stats, [118,98,117,99,107,101, 116,45,115,101,113,110, 111,32,"113"]},
180000]}}}
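
The `stats` argument in the dump above is printed as an Erlang iolist: a list of integer character codes followed by a string. A minimal sketch for decoding it in Python follows; `decode_iolist` is a hypothetical helper written for this ticket, not part of any Couchbase or Erlang tooling. The decoded value suggests the replicator was requesting the `vbucket-seqno` stat for vb 113 when the call was shut down.

```python
def decode_iolist(term) -> str:
    """Flatten a nested list of character codes and strings into one string."""
    if isinstance(term, int):
        return chr(term)  # a single character code
    if isinstance(term, str):
        return term       # already text
    return "".join(decode_iolist(part) for part in term)


# Copied from the {stats, [...]} term in the log entry above.
stats_arg = [118, 98, 117, 99, 107, 101, 116, 45, 115, 101, 113, 110, 111, 32, "113"]
print(decode_iolist(stats_arg))  # vbucket-seqno 113
```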

[xdcr:error,2014-04-10T12:59:12.767,ns_1@172.23.105.54:<0.5861.0>:xdc_vbucket_rep:terminate:507]Shutting xdcr vb replicator ({init_state,
{rep,
<<"e4e06be20b146e347092d4fb78ba36f5/standardbucket/standardbucket">>,
<<"standardbucket">>,
<<"/remoteClusters/e4e06be20b146e347092d4fb78ba36f5/buckets/standardbucket">>,
"xmem",
[{optimistic_replication_threshold,256},
{worker_batch_size,500},
{failure_restart_interval,30},
{doc_batch_size_kb,2048},
{checkpoint_interval,1800},
{max_concurrent_reps,32},
{connection_timeout,180},
{worker_processes,4},
{http_connections,20},
{retries_per_request,2},
{socket_options,
[{keepalive,true},{nodelay,false}]},
{pause_requested,false},
{supervisor_max_r,25},
{supervisor_max_t,5},
{trace_dump_invprob,1000}]},
24,"xmem",<0.5839.0>,<0.5840.0>,<0.5831.0>}) down without ever successfully initializing: {{{badmatch,
{error,
timeout}},
[{mc_client_binary, cmd_vocal_recv, 5},
{mc_client_binary, create_bucket, 4},
{ns_memcached, ensure_bucket, 2},
{ns_memcached, complete_connection_phase, 2},
{ns_memcached, handle_cast, 2},
{gen_server, 5},
{ns_memcached, init, 1},
{gen_server, init_it, 6}]},
{gen_server,
call,
['ns_memcached-standardbucket',
{stats, [118, 98, 117, 99, 107, 101, 116, 45, 115, 101, 113, 110, 111, 32, "24"]},
180000]}}

To give a sense of how many "Checkpointing failed unexpectedly" errors are seen on a single node:
[root@172.23.105.44 logs]# grep -c "Checkpointing failed unexpectedly" xdcr_errors.1
256
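
The `grep -c` above counts one message type. To see how all XDCR errors in a log are distributed across code sites, the header of each entry can be parsed and tallied. This is a rough sketch that assumes every entry starts with a header shaped like the ones quoted in this ticket (`[xdcr:error,<timestamp>,<node>:<pid>:<module>:<function>:<line>]`); the regular expression and the function name `count_error_sites` are illustrative only.

```python
import re
from collections import Counter

# Header shape assumed from the log excerpts quoted in this ticket.
HEADER = re.compile(
    r"^\[xdcr:error,(?P<ts>[^,]+),(?P<node>[^:]+):<[^>]+>:"
    r"(?P<module>\w+):(?P<function>\w+):(?P<line>\d+)\]"
)


def count_error_sites(lines):
    """Count xdcr error entries per (module, function, line) code site."""
    counts = Counter()
    for line in lines:
        m = HEADER.match(line)
        if m:  # continuation lines of multi-line entries are skipped
            counts[(m["module"], m["function"], int(m["line"]))] += 1
    return counts


# Header lines taken from the excerpts above (messages shortened).
sample = [
    "[xdcr:error,2014-04-10T14:19:43.911,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:",
    "[xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:220]Checkpointing failed unexpectedly (or could be network problem): {error,500,",
    "[xdcr:error,2014-04-10T14:00:00.006,ns_1@172.23.105.44:<0.5889.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:",
]
site_counts = count_error_sites(sample)
for site, n in site_counts.most_common():
    print(site, n)
```

For a real log, the same function can be fed an open file, for example `count_error_sites(open("xdcr_errors.1"))`.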

This could be a duplicate of ~~MB-10792~~ and can be closed if the attached logs confirm that. Otherwise, since there is data loss, this problem should be treated as a blocker.

Cbcollect info
--------------------

Source - https://s3.amazonaws.com/bugdb/jira/MB-10844/source.tar.gz
Destination - https://s3.amazonaws.com/bugdb/jira/MB-10844/dest.tar.gz

Issue Links

relates to

MB-10856 Persistence and internal replication(TAP/UPR) are broken , affects XDCR

Closed

KV+XDCR (TAP) : system test - item count mismatch (memcached connection requests time out)
