  Couchbase Server / MB-9497

Rebalance fails while loading data simultaneously onto 2 active clusters with bidirectional XDCR


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: 2.5.0
    • Fix Version: 2.5.0
    • Component: ns_server
    • Security Level: Public
    • Environment: CentOS 64-bit, 4 CPU cores, 4 GB RAM
    • Operating System: CentOS 64-bit

    Description

      Scenario
      -------------
      I have 2 clusters (3 nodes each) with bidirectional XDCR set up. I pump data (3M items) simultaneously into both clusters. While data is being loaded and replicated to the other cluster, I rebalance out 2 nodes from cluster1. Performance dips and the rebalance runs for a long time, but it fails at the end of the operation. By that time, the data has been loaded into the bucket aruna_bkt186.

      I then try to rebalance out the same two nodes again (a minimal sketch of this step is shown below). The rebalance-out fails again, but this time the data buckets tab shows aruna_bkt186 on 1 node but the default bucket on 3 nodes. Disk usage on the rebalanced-out nodes also drops sharply, so the rebalance-out appears to fail before the default bucket is deleted from the 2 nodes. In addition, the node we are transferring the data to (10.3.4.186) goes down; I'm not sure whether that is because of its RAM usage. Screenshots of the server and data buckets tabs are attached.
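
      The rebalance-out step could be driven through the REST API roughly as sketched below. This is only an illustration: the /controller/rebalance endpoint and the otpNode names come from this report, while the admin credentials and port 8091 are assumptions.

      # Minimal sketch (assumed credentials/port): trigger the rebalance-out
      # of 10.3.4.187 and 10.3.4.188 from cluster1 via the REST API.
      import requests

      BASE = "http://10.3.4.186:8091"        # a node that stays in cluster1
      AUTH = ("Administrator", "password")   # assumed credentials

      known = ",".join(["ns_1@10.3.4.186", "ns_1@10.3.4.187", "ns_1@10.3.4.188"])
      eject = ",".join(["ns_1@10.3.4.187", "ns_1@10.3.4.188"])

      resp = requests.post(f"{BASE}/controller/rebalance",
                           auth=AUTH,
                           data={"knownNodes": known, "ejectedNodes": eject})
      resp.raise_for_status()
      print("rebalance-out started")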

      Error as seen in log
      ------------------------------
      Rebalance exited with reason {bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.4.186',
            {'EXIT',
             {{nodedown,'ns_1@10.3.4.186'},
              {gen_server,call,
               [{'janitor_agent-default','ns_1@10.3.4.186'},
                {if_rebalance,<0.3526.54>,
                 {update_vbucket_state,379,replica,passive,undefined}},
                infinity]}}}}]}
      ns_orchestrator002

      Node 'ns_1@10.3.4.188' saw that node 'ns_1@10.3.4.186' went down. Details: [{nodedown_reason,connection_closed}]
      ns_node_disco005

      <0.31020.54> exited with {bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.4.186',
            {'EXIT',
             {{nodedown,'ns_1@10.3.4.186'},
              {gen_server,call,
               [{'janitor_agent-default','ns_1@10.3.4.186'},
                {if_rebalance,<0.3526.54>,
                 {update_vbucket_state,379,replica,passive,undefined}},
                infinity]}}}}]}
      ns_vbucket_mover000

      Node 'ns_1@10.3.4.187' saw that node 'ns_1@10.3.4.186' went down. Details: [{nodedown_reason,connection_closed}]
      ns_node_disco005

      Server error during processing: ["web request failed",
          {path,"/pools/default/tasks"},
          {type,exit},
          {what,
           {timeout,
            {gen_server,call,
             [{global,ns_rebalance_observer},get_detailed_progress,10000]}}},
          {trace,
           [{gen_server,call,3},
            {ns_rebalance_observer,get_detailed_progress,0},
            {ns_doctor,get_detailed_progress,0},
            {ns_doctor,do_build_tasks_list,4},
            {menelaus_web,handle_tasks,2},
            {request_throttler,do_request,3},
            {menelaus_web,loop,3},
            {mochiweb_http,headers,5}]}]
      menelaus_web019

      Shutting down bucket "aruna_bkt186" on 'ns_1@10.3.4.188' for deletion
      Shutting down bucket "aruna_bkt186" on 'ns_1@10.3.4.187' for deletion
      Bucket "default" rebalance does not seem to be swap rebalance
      Started rebalancing bucket default
      Bucket "aruna_bkt186" rebalance does not seem to be swap rebalance
      ns_vbucket_mover000
      Started rebalancing bucket aruna_bkt186
      Starting rebalance, KeepNodes = ['ns_1@10.3.4.186'], EjectNodes = ['ns_1@10.3.4.188', 'ns_1@10.3.4.187']
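
      The "/pools/default/tasks" timeout above appears to have been hit while rebalance progress was being polled. A minimal polling sketch against that same endpoint (admin credentials and port 8091 are assumptions) could look like this:

      # Minimal sketch (assumed credentials/port): poll /pools/default/tasks,
      # the endpoint that timed out above, and report rebalance progress.
      import time
      import requests

      BASE = "http://10.3.4.186:8091"
      AUTH = ("Administrator", "password")   # assumed credentials

      while True:
          tasks = requests.get(f"{BASE}/pools/default/tasks", auth=AUTH, timeout=30).json()
          rebalance = next((t for t in tasks if t.get("type") == "rebalance"), None)
          if rebalance is None or rebalance.get("status") == "notRunning":
              print("rebalance not running:", rebalance)
              break
          print("rebalance progress: %.1f%%" % rebalance.get("progress", 0.0))
          time.sleep(10)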

      XDCR info
      ----------------
      Bidirectional XDCR between two 3-node clusters.

      [cluster1] - all nodes running 2.5.0-871
      1. 10.3.4.186
      2. 10.3.4.187
      3. 10.3.4.188

      [cluster2] - all nodes running 2.2.0-821
      1. 10.3.4.189
      2. 10.3.4.190
      3. 10.3.4.191

      The error is seen while trying to rebalance out 10.3.4.187 and 10.3.4.188 from cluster1.

      Bucket info
      ----------------
      default and 1 SASL bucket (aruna_bkt186) on cluster1, bidirectionally replicated with default and aruna_bkt189 on cluster2 respectively.
      aruna_bkt186 --> aruna_bkt189 : replication protocol v2
      aruna_bkt186 <-- aruna_bkt189 : v1
      default on 10.3.4.186 --> default on 10.3.4.189 : v2
      default on 10.3.4.186 <-- default on 10.3.4.189 : v1
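
      For context, the replications above could be set up roughly as sketched below. This is only an illustration with assumed admin credentials and port 8091; the per-replication protocol-version choice (v1 vs v2) is not shown, and the reverse direction would be configured the same way from cluster2.

      # Minimal sketch (assumed credentials/port): register cluster2 as a
      # remote cluster on cluster1, then create one continuous XDCR
      # replication per bucket (default -> default, aruna_bkt186 -> aruna_bkt189).
      import requests

      SRC = "http://10.3.4.186:8091"         # cluster1 node
      AUTH = ("Administrator", "password")   # assumed credentials

      requests.post(f"{SRC}/pools/default/remoteClusters",
                    auth=AUTH,
                    data={"name": "cluster2",
                          "hostname": "10.3.4.189:8091",
                          "username": "Administrator",   # assumed
                          "password": "password"}        # assumed
                    ).raise_for_status()

      for from_bucket, to_bucket in [("default", "default"),
                                     ("aruna_bkt186", "aruna_bkt189")]:
          requests.post(f"{SRC}/controller/createReplication",
                        auth=AUTH,
                        data={"fromBucket": from_bucket,
                              "toCluster": "cluster2",
                              "toBucket": to_bucket,
                              "replicationType": "continuous"}
                        ).raise_for_status()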

      Cbcollect info attached.

    People

      Assignee: Aruna Piravi (apiravi, Inactive)
      Reporter: Aruna Piravi (apiravi, Inactive)