Couchbase Server / MB-6992

rebalance hangs after failing over disconnected node

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      build 10.3.3.59

      Description

      One node went down while loading data on a 22-node cluster (possibly related to the Xen hypervisor, as the node could not ping the gateway and its network interface needed to be restarted).
      While the node was down I tried to fail it over and rebalance.
      However, the rebalance never completes, and there appears to be no rebalance activity occurring on tap.

      Some activity seen in the logs at the time the node went down:

      10.3.3.59 sees .60 nodedown:

      [user:warn,2012-10-22T11:06:38.896,ns_1@10.3.3.59:ns_node_disco:ns_node_disco:handle_info:168]Node 'ns_1@10.3.3.59' saw that node 'ns_1@10.3.3.60' went down.

      At the same timestamp, node .60 shows:

      [ns_server:error,2012-10-22T11:06:00.350,ns_1@10.3.3.60:<0.12281.36>:ns_janitor:cleanup_with_states:84]
      Bucket "default" not yet ready on ['ns_1@10.3.2.84','ns_1@10.3.2.85',
                                         'ns_1@10.3.2.110','ns_1@10.3.2.111',
                                         'ns_1@10.3.2.112','ns_1@10.3.2.113',
                                         'ns_1@10.3.2.114','ns_1@10.3.2.115',
                                         'ns_1@10.3.3.59','ns_1@10.3.3.62',
                                         'ns_1@10.3.3.65','ns_1@10.3.3.66',
                                         'ns_1@10.3.3.69','ns_1@10.3.3.70',
                                         'ns_1@10.3.121.90','ns_1@10.3.121.91',
                                         'ns_1@10.3.2.107','ns_1@10.3.2.108',
                                         'ns_1@10.3.2.109']

      [ns_server:debug,2012-10-22T11:06:07.388,ns_1@10.3.3.60:<0.12508.36>:janitor_agent:new_style_query_vbucket_states_loop:116]
      Exception from query_vbucket_states of "default": 'ns_1@10.3.2.85'
      {'EXIT',{{nodedown,'ns_1@10.3.2.85'},
               {gen_server,call,
                [{'janitor_agent-default','ns_1@10.3.2.85'},
                 query_vbucket_states,infinity]}}}

      Attachments

      1. 10.3.3.59.debug.tar.gz (3.89 MB, Tommie McAfee)
      2. 10.3.3.60.debug.tar.gz (830 kB, Tommie McAfee)

        Activity

        tommie Tommie McAfee created issue -
        tommie Tommie McAfee added a comment -

        Also, when Jin took a look he noticed that most nodes are at 100% rebalance status, while 10.3.2.112 is still at 63%, and the following CRASH report was seen on that host.

        [couchdb:error,2012-10-21T23:07:11.144,ns_1@127.0.0.1:couch_view:couch_log:error:42]Exit on non-updater process: config_change
        [couchdb:error,2012-10-21T23:07:11.144,ns_1@127.0.0.1:couch_set_view:couch_log:error:42]Exit on non-updater process: config_change
        [error_logger:error,2012-10-21T23:07:11.144,ns_1@127.0.0.1:error_logger:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: couch_server:init/1
        pid: <0.216.0>
        registered_name: couch_server
        exception exit: {function_clause,
                         [{couch_server,'-terminate/2-fun-0-',
                           [{<<"_replicator">>,<0.444.0>}]},
                          {lists,foreach,2},
                          {gen_server,terminate,6},
                          {proc_lib,init_p_do_apply,3}]}
        in function gen_server:terminate/6
        ancestors: [couch_primary_services,couch_server_sup,cb_couch_sup,
        ns_server_cluster_sup,<0.60.0>]
        messages: []
        links: [<0.230.0>,<0.444.0>,<0.211.0>]
        dictionary: []
        trap_exit: true
        status: running
        heap_size: 2584
        stack_size: 24
        reductions: 4530
        neighbours:

        mikew Mike Wiederhold added a comment -

        Alk,

        Replication building has completed on 10.3.2.112, but ns_server has not killed the tap stream. Please take a look at this and let me know if there is anything else I can look at.

        Mike-Wiederholds-MacBook-Pro:ep-engine mikewied$ management/cbstats 10.3.2.112:11210 tap | egrep replication_building
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':ack_log_size: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':ack_seqno: 359937
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':ack_window_full: false
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':backfill_completed: true
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':backfill_start_timestamp: 1350926553
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':bg_jobs_completed: 166602
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':bg_jobs_issued: 166602
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':bg_result_size: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':connected: true
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':created: 39985
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':flags: 85 (ack,backfill,vblist,checkpoints)
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':has_queued_item: false
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':idle: true
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':paused: 1
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':pending_backfill: false
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':pending_disconnect: false
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':pending_disk_backfill: false
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':qlen: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':qlen_high_pri: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':qlen_low_pri: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_backfillremaining: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_backoff: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_drain: 359877
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_fill: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_itemondisk: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':queue_memory: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':rec_fetched: 193330
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':recv_ack_seqno: 359936
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':reserved: 1
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':seqno_ack_requested: 359936
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':supports_ack: true
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':suspended: false
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':total_backlog_size: 0
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':total_noops: 3963
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':type: producer
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':vb_filter: { 494 }
        eq_tapq:replication_building_494_'ns_1@10.3.2.108':vb_filters: 1
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':ack_log_size: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':ack_seqno: 97912
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':ack_window_full: false
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':backfill_completed: true
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':backfill_start_timestamp: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':bg_jobs_completed: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':bg_jobs_issued: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':bg_result_size: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':connected: true
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':created: 39985
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':flags: 85 (ack,backfill,vblist,checkpoints)
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':has_queued_item: false
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':idle: true
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':paused: 1
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':pending_backfill: false
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':pending_disconnect: false
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':pending_disk_backfill: false
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':qlen: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':qlen_high_pri: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':qlen_low_pri: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_backfillremaining: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_backoff: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_drain: 97854
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_fill: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_itemondisk: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':queue_memory: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':rec_fetched: 97909
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':recv_ack_seqno: 97911
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':reserved: 1
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':seqno_ack_requested: 97911
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':supports_ack: true
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':suspended: false
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':total_backlog_size: 0
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':total_noops: 3964
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':type: producer
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':vb_filter: { 494 }
        eq_tapq:replication_building_494_'ns_1@10.3.3.65':vb_filters: 1
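
        The conclusion above, that replication building for vbucket 494 is done, follows directly from these producer stats: the backfill has completed, the queue has drained, and the stream is idle. A tiny illustrative predicate over those fields, in C++ since the producer lives in ep-engine (the struct and function names are hypothetical; the field names are just the stat keys from the dump):

        // Illustrative only: when a replication_building TAP producer looks
        // finished from its stats, ns_server is expected to tear the stream down.
        #include <cstddef>

        struct TapProducerStats {
            bool backfill_completed;              // "backfill_completed: true" above
            bool idle;                            // "idle: true" above
            std::size_t queue_backfillremaining;  // "queue_backfillremaining: 0" above
            std::size_t qlen;                     // "qlen: 0" above
        };

        bool replicationBuildingLooksDone(const TapProducerStats &s) {
            return s.backfill_completed && s.idle &&
                   s.queue_backfillremaining == 0 && s.qlen == 0;
        }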

        mikew Mike Wiederhold made changes -
        Assignee: Mike Wiederhold [ mikew ] -> Aleksey Kondratenko [ alkondratenko ]
        Chisheng Chisheng Hong (Inactive) made changes -
        Labels: system-test
        tommie Tommie McAfee added a comment -

        All logs here: https://docs.google.com/folder/d/0B48zH0LlqjKrNm9UYlhuNzZjSnM/edit

        This behavior was noticed during the 1.8.1 use case testing in 2.0 build 1870: http://hub.internal.couchbase.com/confluence/display/QA/2.0+System+Test+Specs#2.0SystemTestSpecs-SystemTest%3A1.8.1usecase%28PlumWorkloadv1.0%29
        farshid Farshid Ghods (Inactive) made changes -
        Priority: Major [ 3 ] -> Blocker [ 1 ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        BTW, I don't have a convenient way to download all of the logs from Google Drive.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We're stuck waiting for checkpoint 67 to be persisted on vbucket 494 on node .65.

        {<17848.6247.34>,
         [{registered_name,[]},
          {status,waiting},
          {initial_call,{proc_lib,init_p,3}},
          {backtrace,
           [<<"Program counter: 0x00007f5e10ee9e48 (gen:do_call/4 + 576)">>,
            <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
            <<"0x00007f5e0b871858 Return addr 0x00007f5e10f68870 (gen_server:call/3 + 128)">>,
            <<"y(0) #Ref<0.0.50.98207>">>,
            <<"y(1) 'ns_1@10.3.3.65'">>,<<"y(2) []">>,
            <<"y(3) infinity">>,
            <<"y(4) {wait_for_checkpoint_persistence,494,67}">>,
            <<"y(5) '$gen_call'">>,<<"y(6) <0.2607.0>">>,<<>>,
            <<"0x00007f5e0b871898 Return addr 0x00007f5e06f9e228 (janitor_agent:do_wait_checkpoint_persist">>,
            <<"y(0) infinity">>,
            <<"y(1) {wait_for_checkpoint_persistence,494,67}">>,
            <<"y(2) 'ns_memcached-default'">>,
            <<"y(3) Catch 0x00007f5e10f68870 (gen_server:call/3 + 128)">>,
            <<>>,
        (that's the janitor_agent child on .65 making that call)

        And stats indicate that it's actually persisted:

        {<<"vb_494:persisted_checkpoint_id">>,<<"92">>},
        {<<"vb_494:checkpoint_extension">>,<<"false">>},
        {<<"vb_494:num_items_for_persistence">>,<<"0">>},
        {<<"vb_494:num_checkpoints">>,<<"1">>},
        {<<"vb_494:num_open_checkpoint_items">>,<<"0">>},
        {<<"vb_494:num_checkpoint_items">>,<<"1">>},
        {<<"vb_494:num_tap_cursors">>,<<"0">>},
        {<<"vb_494:last_closed_checkpoint_id">>,<<"92">>},
        {<<"vb_494:open_checkpoint_id">>,<<"93">>},
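
        In other words, the wait should already be satisfiable: the requested checkpoint id (67) is well below the persisted checkpoint id (92). A minimal illustrative version of the completion check such a request needs, in C++ as on the ep-engine side (the function and parameter names are hypothetical, not the actual ep-engine code):

        // Illustrative only: the condition that should release a waiting
        // checkpoint-persistence request for this vbucket.
        #include <cstdint>

        bool checkpointPersistenceSatisfied(uint64_t persistedCkptId,   // 92 in the stats above
                                            uint64_t requestedCkptId) { // 67 in the backtrace above
            return persistedCkptId >= requestedCkptId;
        }

        // For vb 494, checkpointPersistenceSatisfied(92, 67) is true, so the
        // janitor_agent call should have returned long ago; the hang suggests the
        // request was parked in the engine and never re-checked or notified.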

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee: Aleksey Kondratenko [ alkondratenko ] -> Chiyoung Seo [ chiyoung ]
        mikew Mike Wiederhold added a comment -

        Chiyoung,

        This is an issue with the new checkpoint persistence command you added. I think we should add a stat for checkpoint persistence commands in progress, for example a stat that is incremented when we receive a checkpoint persistence request and decremented when the request completes.
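
        A minimal sketch of what such a counter could look like (the struct, stat name, and hook points below are hypothetical, meant only to illustrate the idea, not the actual ep-engine implementation):

        // Hypothetical sketch: count checkpoint persistence requests that have
        // been accepted but not yet answered, so a hang like this one is visible
        // directly from the stats output.
        #include <atomic>
        #include <cstddef>

        struct CheckpointPersistenceStats {
            std::atomic<std::size_t> inProgress{0};
        };

        // On receiving a checkpoint persistence request (before returning
        // a would-block status to memcached):   stats.inProgress++;
        // When the flusher persists the requested checkpoint and the waiting
        // connection is notified:               stats.inProgress--;
        // A value that stays non-zero while the flusher is idle points at exactly
        // the kind of stuck request seen in this bug.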

        chiyoung Chiyoung Seo added a comment -

        I dumped connection stats from the memcached layer on 10.3.3.65 and found one connection that is still in the ewouldblock state and corresponds to the vb_checkpoint_persistence command:

        STAT conn 0x58cf080
        STAT socket 83
        STAT protocol binary
        STAT transport TCP
        STAT nevents 20
        STAT sasl_conn 0xcbaf5390
        STAT state conn_nread
        STAT substate bin_reading_packet
        STAT registered_in_libevent 0
        STAT ev_flags 12
        STAT which 2
        STAT rbuf 0x58d0000
        STAT rcurr 0x58d0020
        STAT rsize 2048
        STAT rbytes 0
        STAT wbuf 0x58e7800
        STAT wcurr 0x58fb000
        STAT wsize 2048
        STAT wbytes 24
        STAT write_and_go 0x4104f0
        STAT write_and_free (nil)
        STAT ritem 0x58d0020
        STAT rlbytes 0
        STAT item (nil)
        STAT store_op 0
        STAT sbytes 0
        STAT iov 0x58e0800
        STAT iovsize 400
        STAT iovused 0
        STAT msglist 0x58d5440
        STAT msgsize 10
        STAT msgused 1
        STAT msgcurr 0
        STAT msgbytes 0
        STAT ilist 0x58d9100
        STAT isize 200
        STAT icurr 0x58d9100
        STAT ileft 0
        STAT suffixlist 0x58539a0
        STAT suffixsize 20
        STAT suffixcurr 0x58539a0
        STAT suffixleft 0
        STAT noreply 0
        STAT refcount 1
        STAT dynamic_buffer.buffer (nil)
        STAT dynamic_buffer.size 2048
        STAT dynamic_buffer.offset 24
        STAT engine_storage 0xcd6fb0a0
        STAT cas 0
        STAT cmd 177
        STAT opaque 0
        STAT keylen 0
        STAT list_state 0
        STAT next (nil)
        STAT thread 0x10c55f0
        STAT aiostat 0
        STAT ewouldblock 1
        STAT tap_iterator (nil)

        I'm further debugging it now.
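
        For context, "ewouldblock 1" means the engine answered this command with a would-block status, so memcached parked the connection and will not resume it until the engine signals completion for that cookie (in the real server API that wake-up goes through the notify_io_complete callback). A rough sketch of that handshake, with the cookie bookkeeping and the notification trigger heavily simplified (illustrative only, not the actual ep-engine/memcached code):

        // Rough illustration of the would-block / notify-on-completion handshake
        // that the stuck vb_checkpoint_persistence connection is waiting on.
        #include <cstdint>
        #include <map>
        #include <vector>

        enum class Status { Success, WouldBlock };

        // Stand-in for the server callback that wakes a parked connection.
        using NotifyIoComplete = void (*)(const void *cookie, Status status);

        struct CheckpointWaiters {
            // vbucket id -> cookies of connections waiting on checkpoint persistence
            std::map<uint16_t, std::vector<const void *>> waiters;

            // Request path: park the connection until the checkpoint is on disk.
            Status onPersistenceRequest(uint16_t vbid, const void *cookie) {
                waiters[vbid].push_back(cookie);
                return Status::WouldBlock;  // memcached then records ewouldblock=1
            }

            // Flusher path: after persisting the checkpoint, wake every waiter.
            // If this step is skipped, or the checkpoint was already persisted
            // before the request arrived and nothing re-checks that, the connection
            // never leaves the ewouldblock state and the rebalance hangs.
            void onCheckpointPersisted(uint16_t vbid, NotifyIoComplete notify) {
                for (const void *c : waiters[vbid]) {
                    notify(c, Status::Success);
                }
                waiters.erase(vbid);
            }
        };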

        chiyoung Chiyoung Seo made changes -
        Sprint Status: Current Sprint
        Component/s: couchbase-bucket [ 10173 ]
        chiyoung Chiyoung Seo added a comment -

        Tommie,

        As we saw, rebalancing out 10.3.3.65 was successful. While debugging this issue, I had some problems on .65 when doing GETs for non-resident items, which put my connection into the ewouldblock state in the memcached layer. I will continue to debug this.

        Please update the bug if you see the same rebalance hang again.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #451 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/451/)
        MB-6992 Add more informative logs to checkpoint prioritization (Revision 3c719d47ca41285bbcbc61817f719180448f1042)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • src/ep.cc
        chiyoung Chiyoung Seo added a comment -

        http://review.couchbase.org/#/c/22022/
        chiyoung Chiyoung Seo made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        chiyoung Chiyoung Seo made changes -
        Sprint Status: Current Sprint
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #452 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/452/)
        MB-6992 Control the flusher execution by the transaction size (Revision b327be09a1f971145fda5c249b4fa7a8304b1920)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • tests/ep_testsuite.cc
        • src/ep.cc
        • src/ep.hh
        • src/flusher.hh
        • src/flusher.cc
        farshid Farshid Ghods (Inactive) made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee:
            chiyoung Chiyoung Seo
          • Reporter:
            tommie Tommie McAfee
          • Votes: 0
          • Watchers: 1

