Details

    • Type: Technical task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: 3.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Environment:
      CentOS 6.2 64-bit, build 2.0.0-1931

Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD disk.
      • 24.8 GB RAM allocated to Couchbase Server on each node
      • SSD formatted as ext4 and mounted on /data
      • Each server has its own SSD drive; no disk is shared with another server.
      • Created a cluster of 6 nodes running Couchbase Server 2.0.0-1931
      • Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1931-rel.rpm.manifest.xml
      • The cluster has 2 buckets, default and saslbucket (12 GB each, 1 replica), each configured with 64 vbuckets.
      • Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket); see the setup sketch after the node list.

      10.6.2.37
      10.6.2.38
      10.6.2.44
      10.6.2.45
      10.6.2.42
      10.6.2.43
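
      For reference, below is a minimal sketch of how buckets and design docs like the ones above could be created. It assumes the stock couchbase-cli under /opt/couchbase/bin and the view REST API on port 8092; the admin credentials, view names, and map functions are placeholders rather than the ones used in this test, and the 64-vbucket override is configured separately at server start and is not shown here.

        # Create a 12 GB Couchbase bucket with 1 replica.
        /opt/couchbase/bin/couchbase-cli bucket-create -c 10.6.2.37:8091 \
          -u Administrator -p password \
          --bucket=default --bucket-type=couchbase \
          --bucket-ramsize=12288 --bucket-replica=1

        # Publish a design doc (d1) with two views; names and map functions are placeholders.
        curl -X PUT -H 'Content-Type: application/json' \
          http://10.6.2.37:8092/default/_design/d1 \
          -d '{"views":{"v1":{"map":"function(doc, meta){emit(meta.id, null);}"},
                        "v2":{"map":"function(doc, meta){emit(doc.type, null);}"}}}'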

      • Load 20 million items into each bucket; each item is 1024 bytes.
      • After loading completes, wait for the initial indexing.
      • After the initial indexing is done, mutate all items, growing them from 1024 to 1512 bytes.
      • Query all 4 views from the 2 design docs.
      • Add node 44 and rebalance. Passed.
      • Add node 45 and rebalance. Passed.
      • Check that auto-failover is enabled on the cluster.
      • Turn on the firewall on node 40:
        iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        iptables -A OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT
      • Node 40 went down as expected.
      • Auto-failover kicked in after one minute.
      • Disable the firewall on node 40 (see the sketch after this list). The cluster saw node 40 come back up.
      • Add node 40 back to the cluster and rebalance. Within a few seconds, rebalance failed with the error "Failed to wait deletion of some buckets on some nodes." Filed bug MB-7110.
      • Wait about an hour and a half, then rebalance again. Rebalance failed with the error "wait_checkpoint_persisted_failed".
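
      Disabling the firewall in the step above amounts to deleting the two REJECT rules that were added (or flushing the chains). A minimal sketch, assuming node 40 has no other iptables rules that need to be preserved:

        # Delete the REJECT rules added earlier (same rule specs, -A replaced by -D).
        iptables -D INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
        iptables -D OUTPUT -p tcp -o eth0 --sport 1000:60000 -j REJECT

        # Optionally verify that nothing is left blocking the cluster ports.
        iptables -L -n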

      [ns_server:info,2012-11-06T5:42:13.901,ns_1@10.6.2.37:janitor_agent-default<0.30140.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
      [error_logger:error,2012-11-06T5:42:13.901,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.11943.2727>
      registered_name: []
      exception exit: {unexpected_exit,
      {'EXIT',<0.12020.2727>,
      {{wait_checkpoint_persisted_failed,"default",50,3131,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37,2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,50,3131}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.32081.2694>,<0.18896.2646>]
      messages: [{'EXIT',<0.32081.2694>,
      {unexpected_exit,
      {'EXIT',<0.20985.2736>,
      {{wait_checkpoint_persisted_failed,"default",37,2959,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37,2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,37,2959}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
      links: [<0.32081.2694>,<0.17284.2744>]
      dictionary: [{cleanup_list,[<0.11946.2727>,<0.12020.2727>]}]
      trap_exit: true
      status: running
      heap_size: 6765
      stack_size: 24
      reductions: 12015
      neighbours:

      [user:info,2012-11-06T5:42:13.903,ns_1@10.6.2.37:<0.14641.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {unexpected_exit,
      {'EXIT',<0.20985.2736>,
      {{wait_checkpoint_persisted_failed,"default",
      37,2959,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37, 2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,37,2959}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}

      [error_logger:error,2012-11-06T5:42:13.902,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.32081.2694> terminating
      ** Last message in was {'EXIT',<0.20927.2736>,
      {unexpected_exit,
      {'EXIT',<0.20985.2736>,
      {{wait_checkpoint_persisted_failed,"default",37,
      2959,
      [{'ns_1@10.6.2.40',
      {'EXIT',
      {{badmatch,{error,timeout,
      [{mc_client_binary,cmd_binary_vocal_recv,5},
      {mc_client_binary,select_bucket,2},
      {ns_memcached,ensure_bucket,2},
      {ns_memcached,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      {gen_server,call,
      ['ns_memcached-default',
      {wait_for_checkpoint_persistence,37,2959},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.6.2.40'},
      {if_rebalance,<0.32081.2694>,
      {wait_checkpoint_persisted,37,2959}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover,'-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}

      ** When Server state == {state,"default",<0.32082.2694>,
      {dict,8,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[['ns_1@10.6.2.40'|8]],
      [],
      [['ns_1@10.6.2.42'|3]],
      [['ns_1@10.6.2.43'|3]],

      I will upload the cbcollect_info output later.


        Activity

        Steve Yen added a comment -

        Moved to 2.0.1 per bug-scrub meeting.

        Another new bug will be filed by Farshid to bump the timeouts higher again for 2.0.

        Mike Wiederhold added a comment -

        Xiaoqin,

        This issue is likely the same issue that can cause our unit test for checkpoint persistence to fail periodically. When you have time please take a look at this failing unit test.

        Running [0014/0015]: checkpoint: wait for persistence (couchstore)...tests/ep_testsuite.cc:4595 Test failed: `Expected CHECKPOINT_PERSISTENCE_TIMEOUT was adjusted to be greater than 10 secs' (get_int_stat(hp->h, hp->h1, "ep_chk_persistence_timeout") > 10)
        DIED

        Xiaoqin Ma (Inactive) added a comment -

        For the failing unit test, it is a timing issue. If it doesn't happen often, we don't need to fix it.

        Maria McDuff (Inactive) added a comment -

        Bug scrub: Ketaki, are we still seeing this issue? What's the frequency? Can you please update this bug? Thanks.

        Mike Wiederhold added a comment - edited

        I have filed MB-8002 to track these kinds of issues. We may also need help from QE to reproduce this issue later.


          People

          • Assignee: Ketaki Gangal
          • Reporter: Thuan Nguyen
          • Votes: 0
          • Watchers: 5

          Dates

          • Created:
          • Updated:
          • Resolved:

          Gerrit Reviews

          There are no open Gerrit changes