Description
- Create 2 clusters with 2 nodes each.
- Create one standard bucket on each cluster and set up a load of 2M items on each.
- Set up bidirectional replication between the 2 clusters.
- With ongoing load on both clusters:
  - Add a server and remove an existing server on cluster1, then rebalance.
  - Add a server and remove an existing server on cluster2, then rebalance.
- After a point, rebalance fails on both clusters, and the failure reasons differ (as noted on the orchestrators):
CLUSTER1: < 10.1.3.235, 10.1.3.236 [remove], 10.3.2.54 [add] >
Rebalance exited with reason {timeout,
{gen_server,call,
[
,
{get_vbucket,114},
60000]}}
CLUSTER2: < 10.1.3.237, 10.1.3.238 [remove], 10.3.2.55 [add] >
Rebalance exited with reason {exited,
{'EXIT',<0.25928.1>,
}}
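The add-one, remove-one (swap) rebalance step in the reproduction above can be sketched as the parameters sent to the cluster's rebalance call. This is a minimal sketch, not the test harness used for this report: the helper names `otp_name` and `swap_rebalance_params` are hypothetical, the `ns_1@<host>` node naming is taken from the logs in this report, and the `knownNodes`/`ejectedNodes` form fields are assumed to match the `POST /controller/rebalance` REST endpoint (worth verifying against the build under test). It assumes the new node has already been added to the cluster before the rebalance is requested.

```python
from urllib.parse import urlencode


def otp_name(host):
    """Return the node name in the ns_1@<host> form seen in the logs."""
    return "ns_1@" + host


def swap_rebalance_params(current_hosts, add_host, remove_host):
    """Build the form fields for a swap rebalance (hypothetical helper).

    Assumption: knownNodes lists every node in the cluster, including the
    node being added and the node being removed; ejectedNodes lists only
    the node being removed.
    """
    if remove_host not in current_hosts:
        raise ValueError("node to remove is not in the cluster: " + remove_host)
    if add_host in current_hosts:
        raise ValueError("node to add is already in the cluster: " + add_host)
    known = list(current_hosts) + [add_host]
    return {
        "knownNodes": ",".join(otp_name(h) for h in known),
        "ejectedNodes": otp_name(remove_host),
    }


# CLUSTER1 from this report: remove 10.1.3.236, add 10.3.2.54.
cluster1 = swap_rebalance_params(
    ["10.1.3.235", "10.1.3.236"], add_host="10.3.2.54", remove_host="10.1.3.236"
)
# Form-encoded body that would be POSTed to the (assumed) rebalance endpoint.
cluster1_body = urlencode(cluster1)
```

Building the parameters separately from sending them keeps the step easy to repeat for cluster2 with `10.1.3.237`, `10.1.3.238` and `10.3.2.55`.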
While replication is in progress (with ongoing load as well), a swap rebalance produces a number of crash reports in the diags, with these reasons:
- badmatch, corrupted data
- db_not_found
- checkpoint_commit_failure, failure on target commit
- missing_checkpoint_stat (per the UI, rebalance appears to have failed because of this)
Rebalance also fails when swap rebalance is done on just one cluster (rather than both), with bidirectional replication between the 2 clusters
and ongoing load on both clusters:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[couchdb:error,2012-09-10T14:55:22.506,ns_1@10.1.3.235:<0.7099.2>:couch_log:error:42]Uncaught error in HTTP request: {exit,
{timeout,
{gen_server,call,
['ns_memcached-bucket',
,
60000]}}}
[ns_server:info,2012-09-10T14:55:22.506,ns_1@10.1.3.235:<0.8252.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.1.3.236': [<<"replication_building_564_'ns_1@10.1.3.235'">>,
<<"replication_building_564_'ns_1@10.3.2.54'">>]
[ns_server:info,2012-09-10T14:55:22.507,ns_1@10.1.3.235:<0.8224.2>:ns_single_vbucket_mover:mover_inner_old_style:199]Got exit message (parent is <0.21723.0>). Exiting...
{'EXIT',<0.8252.2>,{missing_checkpoint_stat,'ns_1@10.1.3.235',564}}
[error_logger:error,2012-09-10T14:55:22.508,ns_1@10.1.3.235:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.8252.2>
registered_name: []
exception exit:
in function ns_replicas_builder:'wait_checkpoint_opened/5-lc$^0/1-0'/2
in call from ns_replicas_builder:wait_checkpoint_opened/5
in call from ns_replicas_builder:'build_replicas_main/6-fun-1'/8
in call from misc:try_with_maybe_ignorant_after/2
in call from ns_replicas_builder:build_replicas_main/6
ancestors: [<0.8224.2>,<0.21723.0>,<0.21409.0>]
messages: []
links: [<0.8224.2>,<0.8253.2>]
dictionary: []
trap_exit: true
status: running
heap_size: 121393
stack_size: 24
reductions: 18730
neighbours:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
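For triaging diags like the excerpt above, it can help to split each bracketed log header into its fields so entries can be filtered by node, module, or time. The following is a rough sketch that assumes every header follows the `[logger:level,timestamp,node:process:module:function:line]` layout seen in this excerpt; the names `LOG_HEADER` and `parse_log_header` are illustrative only and are not part of ns_server.

```python
import re

# Header layout assumed from the excerpt:
# [logger:level,timestamp,node:process:module:function:line]message
LOG_HEADER = re.compile(
    r"^\[(?P<logger>[^:,]+):(?P<level>[^,]+),"
    r"(?P<timestamp>[^,]+),"
    r"(?P<node>[^:]+):(?P<process>[^:]+):"
    r"(?P<module>[^:]+):(?P<function>[^:]+):(?P<line>\d+)\]"
    r"(?P<message>.*)$"
)


def parse_log_header(line):
    """Return the header fields as a dict, or None for continuation lines."""
    match = LOG_HEADER.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])  # source line number in the module
    return fields


# One of the entries from the excerpt above.
sample = (
    "[ns_server:info,2012-09-10T14:55:22.507,ns_1@10.1.3.235:<0.8224.2>:"
    "ns_single_vbucket_mover:mover_inner_old_style:199]"
    "Got exit message (parent is <0.21723.0>). Exiting..."
)
entry = parse_log_header(sample)
```

Lines that do not start with a header, such as the body of the crash report, return `None` and can be attached to the preceding entry.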