Details

    • Type: Technical task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 3.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
    • Environment:
      Windows 2008 R2 64-bit with build 2.0.0-1952

      Description

      Environment:
      8 Windows Server 2008 R2 64-bit instances in EC2. Each server has 7.5 GB RAM and a 45 GB EBS disk to store data.
      A 6-node Windows cluster was created in EC2.

      Loading phase:

      • Create one default bucket and load 24 million items into it. Each item is 128 to 512 bytes in size.

      Access phase:

      • Mutate items with a creates/updates/gets/deletes/expirations ratio of 10/60/20/5/5.
      • Perform rebalance operations: add node, remove node, failover, and failover with add-back. The last rebalance failed while removing one node and failing over another node at the same time, then rebalancing.
      • The rebalance failed with the error "sync_shutdown_many_i_am_trapping_exits".
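      The access-phase mix above amounts to a weighted operation chooser. A minimal Python sketch of that workload shape (illustrative only; not the actual testrunner workload generator, and all names here are hypothetical):

      ```python
      import random

      # Weighted operation mix from the access phase:
      # creates/updates/gets/deletes/expirations = 10/60/20/5/5.
      OPS = ["create", "update", "get", "delete", "expire"]
      WEIGHTS = [10, 60, 20, 5, 5]

      def sample_ops(n, seed=0):
          """Draw n operation types according to the 10/60/20/5/5 ratio."""
          rng = random.Random(seed)
          return [rng.choices(OPS, weights=WEIGHTS, k=1)[0] for _ in range(n)]

      counts = {op: 0 for op in OPS}
      for op in sample_ops(10_000, seed=42):
          counts[op] += 1
      # Over 10k draws the observed fractions land close to 0.10/0.60/0.20/0.05/0.05.
      ```

      Driving each drawn operation against the bucket then reproduces the mutation pressure described above.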

      [rebalance:debug,2012-11-20T5:02:22.275,ns_1@10.110.206.19:<0.22405.6>:janitor_agent:do_wait_checkpoint_persisted:828]Got etmpfail waiting for checkpoint persistence. Will try again
      [ns_server:info,2012-11-20T5:02:22.478,ns_1@10.110.206.19:ns_port_memcached<0.2135.0>:ns_port_server:log:171]memcached<0.2135.0>: Tue Nov 20 05:02:22.284363 GMT Standard Time 3: Notified the timeout on checkpoint persistence for vbucket 881, cookie 0000000005A16B00

      [ns_server:info,2012-11-20T5:02:23.585,ns_1@10.110.206.19:<0.22552.6>:compaction_daemon:spawn_vbucket_compactor:639]Compacting <<"default/345">>
      [ns_server:info,2012-11-20T5:02:31.370,ns_1@10.110.206.19:<0.10210.2>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.12789.2>,
      {dict,7,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],[],
      [['ns_1@10.111.66.215'|1.0],
      ['ns_1@10.214.194.95'|0.9067357512953368]],
      [['ns_1@10.96.107.176'|0.6649484536082475],
      ['ns_1@10.110.222.56'|0.37058823529411766]],
      [],
      [['ns_1@10.46.181.58'|0.34705882352941175]],
      [['ns_1@10.2.63.249'|0.9642857142857143],
      ['ns_1@10.110.206.19'|1.0]],
      [],[],[],[],[],[]}}},
      ['ns_1@10.2.63.249',
      'ns_1@10.96.107.176',
      'ns_1@10.110.206.19',
      'ns_1@10.111.66.215',
      'ns_1@10.214.194.95',
      'ns_1@10.110.222.56'],
      ['ns_1@10.46.181.58'],
      []}
      [ns_server:debug,2012-11-20T5:02:34.037,ns_1@10.110.206.19:janitor_agent-default<0.2184.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.22405.6> (ok)
      [ns_server:info,2012-11-20T5:02:34.256,ns_1@10.110.206.19:ns_port_memcached<0.2135.0>:ns_port_server:log:171]memcached<0.2135.0>: Tue Nov 20 05:02:34.015714 GMT Standard Time 3: Notified the completion of checkpoint persistence for vbucket 881, cookie 0000000005A16B00

      [ns_server:error,2012-11-20T5:02:35.847,ns_1@10.110.206.19:ns_doctor<0.2076.0>:ns_doctor:update_status:205]The following buckets became not ready on node 'ns_1@10.111.66.215': ["default"], those of them are active []
      [ns_server:debug,2012-11-20T5:02:36.003,ns_1@10.110.206.19:capi_set_view_manager-default<0.2157.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
      [ns_server:debug,2012-11-20T5:02:36.034,ns_1@10.110.206.19:xdc_rdoc_replication_srv<0.2167.0>:xdc_rdoc_replication_srv:handle_info:132]doing replicate_newnodes_docs
      [ns_server:debug,2012-11-20T5:02:36.034,ns_1@10.110.206.19:xdc_rdoc_replication_srv<0.2167.0>:xdc_rdoc_replication_srv:replicate_change_to_node:160]Sending _design/_replicator_info to ns_1@10.111.66.215
      [ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20550.6>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<15693.31507.5>,killed}]
      [ns_server:info,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20550.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.214.194.95': [<<"replication_building_916_'ns_1@10.96.107.176'">>,
      <<"replication_building_916_'ns_1@10.111.66.215'">>]
      [error_logger:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.20550.6> terminating
      ** Last message in was {'EXIT',<0.20543.6>,shutdown}
      ** When Server state == {state,"default",916,'ns_1@10.214.194.95',
      [{'ns_1@10.96.107.176',<15680.9900.6>},
      {'ns_1@10.111.66.215',<15693.31507.5>}]}
      ** Reason for termination ==
      ** {{badmatch,[{<15693.31507.5>,killed}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}

      [ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20543.6>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.20550.6>,
      {{badmatch,[{<15693.31507.5>,killed}]},
      [{misc, sync_shutdown_many_i_am_trapping_exits, 1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}},
      {<0.22253.6>,
      {{wait_checkpoint_persisted_failed,
      "default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,
      249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}]
      [ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.20543.6>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
      {error,
      {badmatch,
      [{<0.20550.6>,
      {{badmatch,[{<15693.31507.5>,killed}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}},
      {<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {ns_single_vbucket_mover,mover,6},
      {proc_lib,init_p_do_apply,3}]}
      [rebalance:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.12877.2>:ns_vbucket_mover:handle_info:252]<0.20543.6> exited with {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,
      249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
      [ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22383.6>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",
      916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}}
      [ns_server:debug,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22388.6>:ebucketmigrator_srv:confirm_sent_messages:710]Going to wait for reception of opaque message ack
      [ns_server:debug,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22664.6>:ebucketmigrator_srv:confirm_sent_messages:705]Sending opaque message to confirm downstream reception
      [ns_server:debug,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22664.6>:ebucketmigrator_srv:confirm_sent_messages:707]Sent fine
      [ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22351.6>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",
      916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}}
      [ns_server:error,2012-11-20T5:02:38.749,ns_1@10.110.206.19:<0.22475.6>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",
      916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}}
      [rebalance:info,2012-11-20T5:02:38.764,ns_1@10.110.206.19:<0.22388.6>:ebucketmigrator_srv:do_confirm_sent_messages:684]Got close ack!

      [error_logger:error,2012-11-20T5:02:38.764,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: new_ns_replicas_builder:init/1
      pid: <0.20550.6>
      registered_name: []
      exception exit: {{badmatch,[{<15693.31507.5>,killed}]},
      [{misc,sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}
      in function gen_server:terminate/6
      ancestors: [<0.20543.6>,<0.12877.2>,<0.12789.2>]
      messages: []
      links: [<0.20543.6>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 317811
      stack_size: 24
      reductions: 32264
      neighbours:

      [error_logger:error,2012-11-20T5:02:38.764,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.20543.6>
      registered_name: []
      exception exit: {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.12877.2>,<0.12789.2>]
      messages: []
      links: [<0.12877.2>]
      dictionary: [{cleanup_list,[<0.20550.6>,<0.22253.6>]}]
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 12536
      neighbours:

      [ns_server:info,2012-11-20T5:02:38.780,ns_1@10.110.206.19:<0.22387.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.2.63.249': [<<"replication_building_881_'ns_1@10.96.107.176'">>,
      <<"replication_building_881_'ns_1@10.110.206.19'">>]
      [error_logger:error,2012-11-20T5:02:38.780,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.22383.6>
      registered_name: []
      exception exit: {unexpected_exit,
      {'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.12877.2>,<0.12789.2>]
      messages: [{'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
      links: [<0.12877.2>]
      dictionary: [{cleanup_list,[<0.22387.6>,<0.22401.6>]}]
      trap_exit: true
      status: running
      heap_size: 377
      stack_size: 24
      reductions: 5554
      neighbours:

      [ns_server:info,2012-11-20T5:02:38.795,ns_1@10.110.206.19:<0.22478.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.96.107.176': [<<"replication_building_990_'ns_1@10.110.222.56'">>]
      [error_logger:error,2012-11-20T5:02:38.795,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.22475.6>
      registered_name: []
      exception exit: {unexpected_exit,
      {'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.12877.2>,<0.12789.2>]
      messages: [{'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
      links: [<0.12877.2>,<0.22663.6>]
      dictionary: [{cleanup_list,[<0.22478.6>,<0.22493.6>]}]
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 4822
      neighbours:

      [ns_server:info,2012-11-20T5:02:38.827,ns_1@10.110.206.19:<0.22355.6>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.46.181.58': [<<"replication_building_679_'ns_1@10.214.194.95'">>,
      <<"replication_building_679_'ns_1@10.96.107.176'">>]
      [error_logger:error,2012-11-20T5:02:38.827,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.22351.6>
      registered_name: []
      exception exit: {unexpected_exit,
      {'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}}
      in function ns_single_vbucket_mover:spawn_and_wait/1
      in call from ns_single_vbucket_mover:mover_inner/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.12877.2>,<0.12789.2>]
      messages: [{'EXIT',<0.12877.2>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}}]
      links: [<0.12877.2>,<0.22671.6>]
      dictionary: [{cleanup_list,[<0.22355.6>,<0.22393.6>]}]
      trap_exit: true
      status: running
      heap_size: 377
      stack_size: 24
      reductions: 5554
      neighbours:

      [ns_server:debug,2012-11-20T5:02:38.827,ns_1@10.110.206.19:<0.12884.2>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.12877.2>} exited with reason {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}
      [user:info,2012-11-20T5:02:38.842,ns_1@10.110.206.19:<0.10210.2>:ns_orchestrator:handle_info:319]Rebalance exited with reason {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",
      916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}

      [ns_server:info,2012-11-20T5:02:38.842,ns_1@10.110.206.19:<0.23178.6>:diag_handler:log_all_tap_and_checkpoint_stats:127]logging tap & checkpoint stats
      [ns_server:info,2012-11-20T5:02:38.858,ns_1@10.110.206.19:janitor_agent-default<0.2184.0>:janitor_agent:handle_info:676]Undoing temporary vbucket states caused by rebalance
      [ns_server:debug,2012-11-20T5:02:38.858,ns_1@10.110.206.19:capi_set_view_manager-default<0.2157.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
      [ns_server:info,2012-11-20T5:02:38.998,ns_1@10.110.206.19:ns_port_memcached<0.2135.0>:ns_port_server:log:171]memcached<0.2135.0>: Tue Nov 20 05:02:38.773775 GMT Standard Time 3: TAP (Consumer) eq_tapq:anon_402 - disconnected

      [error_logger:error,2012-11-20T5:02:38.827,ns_1@10.110.206.19:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.12877.2> terminating

      ** Last message in was {'EXIT',<0.20543.6>,
      {unexpected_exit,
      {'EXIT',<0.22253.6>,
      {{wait_checkpoint_persisted_failed,"default",916,249,
      [{'ns_1@10.111.66.215',
      {'EXIT',
      {noproc,
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.111.66.215'},
      {if_rebalance,<0.12877.2>,
      {wait_checkpoint_persisted,916,249}},
      infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}}}

      Link to the manifest file of this build: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1952-rel.setup.exe.manifest.xml

        • Doing the rebalance again afterwards passed.
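      For context on the error name in the title: misc:sync_shutdown_many_i_am_trapping_exits expects every child process it shuts down to exit with reason shutdown. Here one replica-builder child exited with killed, the pattern match failed ({badmatch,[{Pid,killed}]}), and the crash propagated through ns_single_vbucket_mover up to the orchestrator, aborting the rebalance. A minimal Python rendering of that matching logic (hypothetical names, for illustration only; the real code is Erlang in ns_server's misc module):

      ```python
      # Hypothetical sketch of the failing pattern, not ns_server code.
      # The supervisor collects a (pid, reason) pair for each child it shut
      # down and requires every reason to be 'shutdown'; any other reason
      # ('killed', 'noproc') raises, mirroring the Erlang badmatch that
      # takes down the vbucket mover and the rebalance.
      def sync_shutdown_many(exit_reasons):
          bad = [pair for pair in exit_reasons if pair[1] != "shutdown"]
          if bad:
              # Erlang equivalent: "Shutdown of the following failed: [...]"
              raise RuntimeError("badmatch: shutdown of the following failed: %r" % bad)
          return "ok"
      ```

      With this reading, both log excerpts fit the same shape: one child died with killed (this report) or noproc (the regression run below), and the badmatch turned an individual child failure into a rebalance abort.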

        Activity

        thuan Thuan Nguyen created issue -
        thuan Thuan Nguyen added a comment - Link to collect info of all nodes https://s3.amazonaws.com/bugdb/jira/MB-7225/8nodes-ci-1952-reb-wait_checkpoint_persisted_failed-20121120-131457.tgz
        jin Jin Lim (Inactive) added a comment -

        Assigning this to the NS_SERVER team for triage first from their end. It appears to me that the bucket shutdown got stuck (or failed) somewhere. Please assign it back to me afterwards. Thanks!

        jin Jin Lim (Inactive) made changes -
        Field Original Value New Value
        Assignee Jin Lim [ jin ] Aliaksey Artamonau [ aliaksey artamonau ]
        dipti Dipti Borkar added a comment -

        Bug Scrub: Aliaksey A can you take a look at this for 2.0.1 ?

        farshid Farshid Ghods (Inactive) made changes -
        Parent MB-6595 [ 19671 ]
        Issue Type Bug [ 1 ] Technical task [ 7 ]
        andreibaranouski Andrei Baranouski added a comment -

        Seems like the same issue:
        ./testrunner -i /tmp/rebalance_regression.ini get-logs=True,wait_timeout=100 -t swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=2,num-buckets=3,num-swap=1,keys-count=1000000,swap-orchestrator=False
        http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/151/consoleFull

        [2012-12-06 13:29:05,721] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95
        [2012-12-06 13:29:05,888] - [rest_client:890] INFO - rebalance operation started
        [2012-12-06 13:29:06,158] - [rest_client:986] INFO - rebalance percentage : 0 %
        ....
        2012-12-06 14:02:21,927] - [rest_client:986] INFO - rebalance percentage : 28.3558937832 %
        [2012-12-06 14:02:22,938] - [data_helper:289] INFO - creating direct client 10.3.121.93:11210 bucket-1
        [2012-12-06 14:02:23,048] - [data_helper:289] INFO - creating direct client 10.3.121.92:11210 bucket-2
        [2012-12-06 14:02:23,209] - [data_helper:289] INFO - creating direct client 10.3.121.93:11210 bucket-0
        [2012-12-06 14:02:25,825] - [rest_client:986] INFO - rebalance percentage : 28.3558937832 %
        [2012-12-06 14:02:26,952] - [data_helper:289] INFO - creating direct client 10.3.121.92:11210 bucket-1
        [2012-12-06 14:02:27,191] - [data_helper:289] INFO - creating direct client 10.3.121.93:11210 bucket-2
        [2012-12-06 14:02:27,608] - [data_helper:289] INFO - creating direct client 10.3.121.92:11210 bucket-0
        [2012-12-06 14:02:29,619] - [rest_client:971] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
        ERROR
        RebalanceFailedException: Rebalance Failed: {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
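        The harness failure above comes from testrunner's status check: it polls the rebalance progress over REST and treats a payload whose status is 'none' but which carries an errorMessage as a failed rebalance. A simplified sketch of that check (illustrative names only, not the actual rest_client API):

        ```python
        # Sketch of interpreting a rebalance-status payload like the one in
        # the log above; names are hypothetical, not testrunner's.
        class RebalanceFailedException(Exception):
            pass

        def check_rebalance_status(payload):
            """Return the status string, or raise if the cluster reported an error."""
            if payload.get("errorMessage"):
                # status 'none' together with an errorMessage means the
                # rebalance aborted server-side; surface it to the test.
                raise RebalanceFailedException("Rebalance Failed: %r" % payload)
            return payload.get("status")  # e.g. 'running' or 'none'
        ```

        Polling this check in a loop until status returns to 'none' without an errorMessage is the pattern the quoted rest_client output reflects.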

        [ns_server:error,2012-12-06T14:04:58.426,ns_1@10.3.121.93:<0.9035.81>:misc:inner_wait_shutdown:1426]Expected exit signal from <0.9040.81> but could not get it in 5 seconds. This is a bug, but process we're waiting for is dead (noproc), so trying to ignore...
        [ns_server:debug,2012-12-06T14:04:58.475,ns_1@10.3.121.93:<0.9035.81>:misc:inner_wait_shutdown:1427]Here's messages:
        {messages,[{'EXIT',<0.8898.81>,shutdown}]}
        [ns_server:error,2012-12-06T14:04:58.475,ns_1@10.3.121.93:<0.9035.81>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.9040.81>,noproc}]
        [stats:warn,2012-12-06T14:04:58.811,ns_1@10.3.121.93:<0.15386.38>:stats_collector:latest_tick:223]Dropped 1 ticks
        [stats:warn,2012-12-06T14:04:58.811,ns_1@10.3.121.93:<0.15338.38>:stats_collector:latest_tick:223]Dropped 1 ticks
        [ns_server:info,2012-12-06T14:04:58.837,ns_1@10.3.121.93:<0.9035.81>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.92': [<<"replication_building_736_'ns_1@10.3.121.97'">>,
        <<"replication_building_736_'ns_1@10.3.121.93'">>]
        [ns_server:error,2012-12-06T14:04:58.838,ns_1@10.3.121.93:<0.8898.81>:misc:sync_shutdown_many_i_am_trapping_exits:1408]Shutdown of the following failed: [{<0.9035.81>,
        {{badmatch,[{<0.9040.81>,noproc}]},
        [{misc, sync_shutdown_many_i_am_trapping_exits, 1},
        {misc,try_with_maybe_ignorant_after,2},
        {gen_server,terminate,6},
        {proc_lib,init_p_do_apply,3}]}},
        {<0.9256.81>,
        {{wait_checkpoint_persisted_failed,
        "bucket-2",736,10,
        [{'ns_1@10.3.121.93',
        {'EXIT',
        {{badmatch,{error,closed,
        [{mc_client_binary, cmd_binary_vocal_recv,5},
        {mc_client_binary,select_bucket, 2},
        {ns_memcached,ensure_bucket,2},
        {ns_memcached,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
        ['ns_memcached-bucket-2',
        {set_vbucket,811,replica},
        180000]}},
        {gen_server,call,
        [{'janitor_agent-bucket-2', 'ns_1@10.3.121.93'},
        {if_rebalance,<0.18991.71>,
        {wait_checkpoint_persisted,736,
        10}},
        infinity]}}}}]},
        [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-', 5}]}}]
        [ns_server:error,2012-12-06T14:04:58.840,ns_1@10.3.121.93:<0.8898.81>:misc:try_with_maybe_ignorant_after:1444]Eating exception from ignorant after-block:
        {error,
        {badmatch,
        [{<0.9035.81>,
        {{badmatch,[{<0.9040.81>,noproc}]},
        [{misc,sync_shutdown_many_i_am_trapping_exits,1},
        {misc,try_with_maybe_ignorant_after,2},
        {gen_server,terminate,6},
        {proc_lib,init_p_do_apply,3}]}},
        {<0.9256.81>,
        {{wait_checkpoint_persisted_failed,"bucket-2",736,10,
        [{'ns_1@10.3.121.93',
        {'EXIT',
        {{badmatch,{error,closed,
        [{mc_client_binary,cmd_binary_vocal_recv,5},
        {mc_client_binary,select_bucket,2},
        {ns_memcached,ensure_bucket,2},
        {ns_memcached,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
        ['ns_memcached-bucket-2',
        {set_vbucket,811,replica},
        180000]}},
        {gen_server,call,
        [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
        {if_rebalance,<0.18991.71>,
        {wait_checkpoint_persisted,736,10}},
        infinity]}}}}]},
        [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}]},
        [{misc,sync_shutdown_many_i_am_trapping_exits,1},
        {misc,try_with_maybe_ignorant_after,2},
        {ns_single_vbucket_mover,mover,6},
        {proc_lib,init_p_do_apply,3}]}
        [error_logger:error,2012-12-06T14:04:58.854,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.9035.81> terminating
        ** Last message in was {'EXIT',<0.9040.81>,downstream_closed}
        ** When Server state == {state,"bucket-2",736,'ns_1@10.3.121.92',
        [{'ns_1@10.3.121.97',<20182.18667.52>},
        {'ns_1@10.3.121.93',<0.9040.81>}]}
        ** Reason for termination ==
        ** {{badmatch,[{<0.9040.81>,noproc}]},
        [{misc,sync_shutdown_many_i_am_trapping_exits,1},
        {misc,try_with_maybe_ignorant_after,2},
        {gen_server,terminate,6},
        {proc_lib,init_p_do_apply,3}]}

        [error_logger:error,2012-12-06T14:04:58.856,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: new_ns_replicas_builder:init/1
        pid: <0.9035.81>
        registered_name: []
        exception exit: {{badmatch,[{<0.9040.81>,noproc}]},
        [{misc,sync_shutdown_many_i_am_trapping_exits,1},
        {misc,try_with_maybe_ignorant_after,2},
        {gen_server,terminate,6},
        {proc_lib,init_p_do_apply,3}]}
        in function gen_server:terminate/6
        ancestors: [<0.8898.81>,<0.18991.71>,<0.18809.71>]
        messages: [{'EXIT',<0.8898.81>,shutdown}]
        links: [<0.8898.81>]
        dictionary: []
        trap_exit: true
        status: running
        heap_size: 514229
        stack_size: 24
        reductions: 43190
        neighbours:

        [error_logger:error,2012-12-06T14:04:58.878,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: ns_single_vbucket_mover:mover/6
        pid: <0.8898.81>
        registered_name: []
        exception exit: {unexpected_exit,
        {'EXIT',<0.9256.81>,
        {{wait_checkpoint_persisted_failed,"bucket-2",736,10,
        [{'ns_1@10.3.121.93',
        {'EXIT',
        {{badmatch,{error,closed,
        [{mc_client_binary,cmd_binary_vocal_recv,5},
        {mc_client_binary,select_bucket,2},
        {ns_memcached,ensure_bucket,2},
        {ns_memcached,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
        ['ns_memcached-bucket-2',
        {set_vbucket,811,replica},
        180000]}},
        {gen_server,call,
        [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
        {if_rebalance,<0.18991.71>,
        {wait_checkpoint_persisted,736,10}},
        infinity]}}}}]},
        [{ns_single_vbucket_mover, '-wait_checkpoint_persisted_many/5-fun-1-',5}]}}}
        in function ns_single_vbucket_mover:spawn_and_wait/1
        in call from ns_single_vbucket_mover:mover_inner/6
        in call from misc:try_with_maybe_ignorant_after/2
        in call from ns_single_vbucket_mover:mover/6
        ancestors: [<0.18991.71>,<0.18809.71>]
        messages: [{'EXIT',<0.18991.71>,
        {{bulk_set_vbucket_state_failed,
        [{'ns_1@10.3.121.93',
        {'EXIT',
        {{badmatch,{error,closed,
        [{mc_client_binary,cmd_binary_vocal_recv,5},
        {mc_client_binary,select_bucket,2},
        {ns_memcached,ensure_bucket,2},
        {ns_memcached,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
        ['ns_memcached-bucket-2',
        {set_vbucket,811,replica},
        180000]}},
        {gen_server,call,
        [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
        {if_rebalance,<0.18991.71>,
        {update_vbucket_state,811,replica,undefined,
        'ns_1@10.3.121.95'}},
        infinity]}}}}]},
        [{janitor_agent,bulk_set_vbucket_state,4},
        {ns_vbucket_mover,update_replication_post_move,3},
        {ns_vbucket_mover,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]}}]
        links: [<0.18991.71>,<0.9800.81>]
        dictionary: [{cleanup_list,[<0.9035.81>,<0.9256.81>]}]
        trap_exit: true
        status: running
        heap_size: 6765
        stack_size: 24
        reductions: 14002
        neighbours:

        [user:info,2012-12-06T14:04:58.905,ns_1@10.3.121.93:<0.2696.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason {{bulk_set_vbucket_state_failed,
        [{'ns_1@10.3.121.93',
        {'EXIT',
        {{badmatch,{error,closed}},
        [{mc_client_binary,cmd_binary_vocal_recv,5},
        {mc_client_binary,select_bucket,2},
        {ns_memcached,ensure_bucket,2},
        {ns_memcached,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
        ['ns_memcached-bucket-2',
        {set_vbucket,811,replica},
        180000]}},
        {gen_server,call,
        [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
        {if_rebalance,<0.18991.71>,
        {update_vbucket_state,811,replica,
        undefined,'ns_1@10.3.121.95'}},
        infinity]}}}}]},
        [{janitor_agent,bulk_set_vbucket_state,4},
        {ns_vbucket_mover,update_replication_post_move,3},
        {ns_vbucket_mover,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]}

        [ns_server:debug,2012-12-06T14:04:58.912,ns_1@10.3.121.93:<0.19026.71>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_node_disco_events,<0.18991.71>} exited with reason {{bulk_set_vbucket_state_failed,
        [{'ns_1@10.3.121.93',
        {'EXIT',
        {{{{badmatch,{error,closed}},
        [{mc_client_binary,cmd_binary_vocal_recv,5},
        {mc_client_binary,select_bucket,2},
        {ns_memcached,ensure_bucket,2},
        {ns_memcached,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
        ['ns_memcached-bucket-2',
        {set_vbucket,811,replica},
        180000]}},
        {gen_server,call,
        [{'janitor_agent-bucket-2','ns_1@10.3.121.93'},
        {if_rebalance,<0.18991.71>,
        {update_vbucket_state,811,replica,undefined,
        'ns_1@10.3.121.95'}},
        infinity]}}}}]},
        [{janitor_agent,bulk_set_vbucket_state,4},
        {ns_vbucket_mover,update_replication_post_move,3},
        {ns_vbucket_mover,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]}
        [stats:warn,2012-12-06T14:04:58.920,ns_1@10.3.121.93:<0.15437.38>:stats_collector:latest_tick:223]Dropped 1 ticks
        [ns_server:debug,2012-12-06T14:04:58.998,ns_1@10.3.121.93:capi_set_view_manager-bucket-0<0.15312.38>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
        [ns_server:debug,2012-12-06T14:04:58.999,ns_1@10.3.121.93:capi_set_view_manager-bucket-1<0.15355.38>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
        [ns_server:debug,2012-12-06T14:04:58.999,ns_1@10.3.121.93:capi_set_view_manager-bucket-2<0.15405.38>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
        [ns_server:info,2012-12-06T14:04:59.061,ns_1@10.3.121.93:<0.10836.81>:diag_handler:log_all_tap_and_checkpoint_stats:127]logging tap & checkpoint stats
        [ns_server:debug,2012-12-06T14:04:59.122,ns_1@10.3.121.93:ns_config_rep<0.360.0>:ns_config_rep:do_push_keys:317]Replicating some config keys ([counters,rebalance_status,rebalancer_pid]..)
        [error_logger:error,2012-12-06T14:04:58.885,ns_1@10.3.121.93:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.18991.71> terminating

        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.92-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.93-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.94-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.95-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.96-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.97-diag.txt.gz
        https://s3.amazonaws.com/bugdb/jira/MB-6595/d72d6e9d-b034-4838-b582-0ffdfeb5b204-10.3.121.98-diag.txt.gz
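For anyone re-running this scenario by hand: the status object the test framework polls reports a failed rebalance as status 'none' plus an errorMessage, which is exactly what ends the run above. A minimal Python sketch of that check; the function name and exception text are illustrative, and the response shape is assumed from the rest_client output in this ticket, not taken from testrunner itself:

```python
def check_rebalance(body):
    """Interpret a rebalance-progress response body.

    Assumed shape (from the rest_client output in this ticket):
    {'status': 'running' | 'none', 'errorMessage': '...'}.
    A failed rebalance is reported as status 'none' with an
    errorMessage attached.
    """
    if body.get("status") == "none" and "errorMessage" in body:
        # Mirrors the RebalanceFailedException the test framework raises.
        raise RuntimeError("Rebalance Failed: %s" % body["errorMessage"])
    return body.get("status")
```

With the failing body from the logs this raises; with {'status': 'running'} it simply returns the status for the caller to keep polling.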

        andreibaranouski Andrei Baranouski added a comment -

        test on CentOS 64

        jin Jin Lim (Inactive) added a comment -

        ep engine bt from the crashing thread:

        #0 0x00007ff450f1e2d2 in setQueuedTime (this=0x3d8b79a0, qi=..., checkpointManager=<value optimized out>)
        at src/queueditem.hh:54
        #1 Checkpoint::queueDirty (this=0x3d8b79a0, qi=..., checkpointManager=<value optimized out>) at src/checkpoint.cc:118
        #2 0x00007ff450f20658 in CheckpointManager::closeOpenCheckpoint_UNLOCKED (this=0x96774a8, id=9) at src/checkpoint.cc:315
        #3 0x00007ff450f206d7 in CheckpointManager::closeOpenCheckpoint (this=0x96774a8, id=9) at src/checkpoint.cc:323
        #4 0x00007ff450f7c0b2 in TapConsumer::processCheckpointCommand(<anonymous enum>, uint16_t, uint64_t) (this=0x6ad3930,
        event=TAP_CHECKPOINT_END, vbucket=128, checkpointId=9) at src/tapconnection.cc:1490
        #5 0x00007ff450f50aa2 in EventuallyPersistentEngine::tapNotify(const void *, void *, uint16_t, uint8_t, uint16_t, <anonymous enum>, uint32_t, const void *, size_t, uint32_t, uint32_t, uint64_t, const void *, size_t, uint16_t) (this=0xbcfd600,
        cookie=<value optimized out>, engine_specific=<value optimized out>, nengine=<value optimized out>, tap_flags=1,
        tap_event=TAP_CHECKPOINT_END, tap_seqno=106326, key=0x7a4e028, nkey=14, flags=0, exptime=0, cas=0, data=0x7a4e036,
        ndata=8, vbucket=30850) at src/ep_engine.cc:2010
        #6 0x00007ff450f515a8 in EvpTapNotify(ENGINE_HANDLE *, const void *, void *, uint16_t, uint8_t, uint16_t, <anonymous enum>, uint32_t, const void *, size_t, uint32_t, uint32_t, uint64_t, const void *, size_t, uint16_t) (handle=0xbcfd600,
        cookie=0x6a27b80, engine_specific=0x7a4e028, nengine=8, ttl=254 '\376', tap_flags=1, tap_event=TAP_CHECKPOINT_END,
        tap_seqno=106326, key=0x7a4e028, nkey=14, flags=0, exptime=0, cas=0, data=0x7a4e036, ndata=8,
        vbucket=<value optimized out>) at src/ep_engine.cc:1040
        #7 0x00007ff453a08a04 in bucket_tap_notify (handle=<value optimized out>, cookie=0x6a27b80, engine_specific=0x7a4e028,
        nengine=8, ttl=254 '\376', tap_flags=42136, tap_event=TAP_CHECKPOINT_END, tap_seqno=106326, key=0x7a4e028, nkey=14,
        flags=0, exptime=0, cas=0, data=0x7a4e036, ndata=8, vbucket=<value optimized out>) at bucket_engine.c:1942
        #8 0x0000000000409eb2 in process_bin_tap_packet (event=TAP_CHECKPOINT_END, c=0x6a27b80) at daemon/memcached.c:3031
        #9 0x00000000004121fa in process_bin_packet (c=0x6a27b80) at daemon/memcached.c:3129
        #10 complete_nread_binary (c=0x6a27b80) at daemon/memcached.c:3738
        #11 complete_nread (c=0x6a27b80) at daemon/memcached.c:3820
        #12 conn_nread (c=0x6a27b80) at daemon/memcached.c:5673
        #13 0x00000000004068a5 in event_handler (fd=<value optimized out>, which=<value optimized out>, arg=0x6a27b80)
        at daemon/memcached.c:5936
        #14 0x00007ff4565ef48c in event_process_active_single_queue (base=0x6a84c80, flags=<value optimized out>) at event.c:1308
        #15 event_process_active (base=0x6a84c80, flags=<value optimized out>) at event.c:1375
        #16 event_base_loop (base=0x6a84c80, flags=<value optimized out>) at event.c:1572
        #17 0x0000000000414b94 in worker_libevent (arg=0x221f5f0) at daemon/thread.c:301
        #18 0x00007ff455aead8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
        #19 0x00007ff455834fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
        #20 0x0000000000000000 in ?? ()
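When triaging dumps like this, it can help to pull just the top frames out of a gdb backtrace programmatically before filing or comparing crashes. A rough sketch; the regex is tuned to the backtrace format shown above and nothing more:

```python
import re

# Matches "#N [0xADDR in ]FUNC (" or "#N ... FUNC at file:line"
# as printed by gdb's `bt` command.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(.+?)\s+(?:\(|at\s)")

def top_frames(bt_text, n=3):
    """Return the function names of the first n frames in a gdb backtrace."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(m.group(2))
    return frames[:n]
```

Applied to the trace above it yields setQueuedTime, Checkpoint::queueDirty and CheckpointManager::closeOpenCheckpoint_UNLOCKED, i.e. the crash is inside checkpoint-close handling while consuming TAP_CHECKPOINT_END.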

        jin Jin Lim (Inactive) added a comment -

        Andrei, was node 93 (the one that crashed) a node being removed or failed over? This will give us a clear understanding of why and what this node was doing prior to the crash.
        Thanks much!
        Jin

        jin Jin Lim (Inactive) added a comment -

        Just talked to Chiyoung for his input. Given that these nodes are all Windows, can you please also verify that all nodes have been upgraded with Service Pack 1? We believe this is a memory corruption issue and want to make sure all nodes have a proper Windows environment. Thanks.

        jin Jin Lim (Inactive) made changes -
        Assignee Aliaksey Artamonau [ aliaksey artamonau ] Jin Lim [ jin ]
        jin Jin Lim (Inactive) added a comment -

        Please assign back to Jin (or the EP Engine team) after addressing the above questions. Thanks.

        jin Jin Lim (Inactive) made changes -
        Assignee Jin Lim [ jin ] Andrei Baranouski [ andreibaranouski ]
        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        See Andrei's comment above. He ran the test on CentOS 64.

        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Jin Lim [ jin ]
        jin Jin Lim (Inactive) added a comment -

        EP Engine team will try to reproduce the crash (segfault) with more debugging. In the meantime, QE team, please see whether the manual test steps described above can be automated or incorporated into a new test case. Thanks.

        andreibaranouski Andrei Baranouski added a comment -

        Node 93 was originally the master, and it was never removed from the cluster in this test on CentOS 64.

        farshid Farshid Ghods (Inactive) added a comment -

        Deferring to 2.1 per bug scrub meeting (Dipti & Farshid, December 7th).

        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.1 [ 10414 ]
        Fix Version/s 2.0.1 [ 10399 ]
        jin Jin Lim (Inactive) added a comment -

        Cannot reproduce.

        jin Jin Lim (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            jin Jin Lim (Inactive)
            Reporter:
            thuan Thuan Nguyen
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes