  1. Couchbase Server
  2. MB-4760

Node is in unusual state after failed rebalance

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-developer-preview-4
    • Fix Version/s: 2.0-developer-preview-4
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None
    • Environment:
      build 639, .deb
      6 node cluster

      Description

      While loading data and querying the view, an attempt to rebalance out 2 nodes failed. It is now impossible to add the node back to the cluster because the REST API returns:

      curl -u Administrator:password http://10.1.2.108:8091/nodes/self
      "Node is unknown to this cluster."
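The check above can be scripted when several nodes need to be probed. This is a minimal sketch, not part of the original report: the function name `classify_nodes_self` is made up for illustration, and it only recognizes the exact error string quoted above; any other non-empty body is assumed to be a normal response.

```shell
#!/bin/sh
# Hypothetical helper: classify the body returned by GET /nodes/self.
# Prints "unknown" for the error seen in this report, "no-response" for
# an empty body, and "ok" for anything else (assumed to be normal output).
classify_nodes_self() {
    case "$1" in
        *"Node is unknown to this cluster."*) echo "unknown" ;;
        "") echo "no-response" ;;
        *) echo "ok" ;;
    esac
}

# Usage (host and credentials as in the report; adjust as needed):
#   body=$(curl -s -u Administrator:password http://10.1.2.108:8091/nodes/self)
#   classify_nodes_self "$body"
```

Matching on the response body rather than the HTTP status is a deliberate simplification, since the report only shows the body text.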

      If I go to the UI, the node shows 1 cluster node that isn't itself (10.1.2.105), even though I expected it to have been reinitialized.

      I will leave this in the state that it's in for the next hour or so in case it's needed:

      2012-02-03 06:29:54.360 menelaus_web:19:warning:server error during request processing(ns_1@10.1.2.104) - Server error during processing: ["web request failed",
      {path,"//pools/default/buckets/default"},
      {type,error},
      {what,{case_clause,rebalance_running}},
      {trace,
      [{menelaus_web_buckets,handle_bucket_delete,3},
      {menelaus_web,loop,3},
      {mochiweb_http,headers,5},
      {proc_lib,init_p_do_apply,3}]}]
      2012-02-03 06:29:54.431 ns_orchestrator:2:info:message(ns_1@10.1.2.104) - Rebalance exited with reason stopped


        Activity

        Aleksey Kondratenko (Inactive) added a comment:

        http://review.couchbase.org/13032 is not the fix.

        Here's what I'm seeing in logs:

        [couchdb:info] [2012-02-03 7:06:56] [ns_1@10.1.2.104:<0.18429.2>:couch_log:info:39] Set view `default`, main group `_design/dev_test_view-a4ed105`, terminating with reason: {cleaner_died,
        {badarith,
        [{couch_set_view_util,btree_purge_fun,4},
        {couch_btree,kv_guided_purge,5},
        {couch_btree,guided_purge,4},
        {couch_btree,kp_guided_purge,5},
        {couch_btree,guided_purge,4},
        {couch_btree,kp_guided_purge,5},
        {couch_btree,guided_purge,4},
        {couch_btree,guided_purge,3}]}}
        [couchdb:info] [2012-02-03 7:06:56] [ns_1@10.1.2.104:<0.18429.2>:couch_log:info:39] Stopping cleanup process for set view `default`, group `_design/dev_test_view-a4ed105`
        [error_logger:error] [2012-02-03 7:06:56] [ns_1@10.1.2.104:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.20053.2> on node 'ns_1@10.1.2.104' with exit value: {badarith,[{couch_set_view_util,btree_purge_fun,4},{couch_btree,kv_guided_purge,5},{couch_btree,guided_purge,4},{couch_btree,kp_guided_purge,5},{couch_btree,guided_purge,4},{couch_btree,kp_guided_purge,5},{couch_btree...


        And then we have the main group process stuck in 'waiting' status on the already-dead cleaner inside terminate:


        {<0.18429.2>,
        [{registered_name,[]},
        {status,waiting},
        {initial_call,{proc_lib,init_p,5}},
        {backtrace,
        [<<"Program counter: 0x00007f32801145c0 (couch_set_view_group:stop_cleaner/1 + 624)">>,
        <<"CP: 0x0000000000000000 (invalid)">>,
        <<"arity = 0">>,<<>>,
        <<"0x00007f32809de7d0 Return addr 0x00007f328010c760 (couch_set_view_group:terminate/2 + 736)">>,
        <<"y(0) []">>,<<"y(1) []">>,
        <<"y(2) []">>,<<"y(3) []">>,
        <<"y(4) {state,{\"/opt/couchbase/var/lib/couchdb\",<<7 bytes>>,{set_view_group,<<16 bytes>>,">>,
        <<"y(5) {set_view_group,<<16 bytes>>,<0.18433.2>,<<7 bytes>>,<<29 bytes>>,<<10 bytes>>,[],">>,
        <<"y(6) {\"/opt/couchbase/var/lib/couchdb\",<<7 bytes>>,{set_view_group,<<16 bytes>>,nil,<<7">>,
        <<"y(7) {set_view_group_stats,1,0,10,0,2,0,[{[{<<8 bytes>>,3.250037e+01}]}],[],[]}">>,
        <<"y(8) <0.20053.2>">>,<<>>,
        <<"0x00007f32809de820 Return addr 0x00007f328a107788 (gen_server:terminate/6 + 184)">>,
        <<"y(0) []">>,<<"y(1) []">>,
        <<"y(2) {cleaner_died,{badarith,[{couch_set_view_util,btree_purge_fun,4},{couch_btree,kv_g">>,
        <<"y(3) nil">>,<<"y(4) nil">>,
        <<"y(5) {set_view_group,<<16 bytes>>,<0.18433.2>,<<7 bytes>>,<<29 bytes>>,<<10 bytes>>,[],">>,
        <<>>,
        <<"0x00007f32809de858 Return addr 0x00007f3280102440 (couch_set_view_group:init/1 + 488)">>,
        <<"y(0) []">>,
        <<"y(1) {state,{\"/opt/couchbase/var/lib/couchdb\",<<7 bytes>>,{set_view_group,<<16 bytes>>,">>,
        <<"y(2) couch_set_view_group">>,
        <<"y(3) {'EXIT',<0.20053.2>,{badarith,[{couch_set_view_util,btree_purge_fun,4},{couch_btre">>,
        <<"y(4) <0.18429.2>">>,
        <<"y(5) {cleaner_died,{badarith,[{couch_set_view_util,btree_purge_fun,4},{couch_btree,kv_g">>,
        <<"y(6) Catch 0x00007f328a107788 (gen_server:terminate/6 + 184)">>,
        <<>>,
        <<"0x00007f32809de898 Return addr 0x00007f328a088fe8 (proc_lib:init_p_do_apply/3 + 56)">>,
        <<"y(0) Catch 0x00007f3280102460 (couch_set_view_group:init/1 + 520)">>,
        <<"y(1) []">>,
        <<"y(2) {set_view_group,<<16 bytes>>,nil,<<7 bytes>>,<<29 bytes>>,<<10 bytes>>,[],[{set_vi">>,
        <<>>,
        <<"0x00007f32809de8b8 Return addr 0x00000000008a00b8 (<terminate process normally>)">>,
        <<"y(0) Catch 0x00007f328a089008 (proc_lib:init_p_do_apply/3 + 88)">>,
        <<>>]},
        {error_handler,error_handler},
        {garbage_collection,
        [{min_bin_vheap_size,46368},
        {min_heap_size,233},
        {fullsweep_after,0},
        {minor_gcs,0}]},
        {heap_size,1597},
        {total_heap_size,1597},
        {links,[<0.18433.2>,<0.195.0>]},
        {memory,13960},
        {message_queue_len,2},
        {reductions,20570},
        {trap_exit,true}]},

        Filipe Manana added a comment:

        Yes, but http://review.couchbase.org/13032 is based on the observation from the diag logs in this ticket.
        The stack trace you just pasted is another, different issue.

        Filipe Manana added a comment:

        For this last stack trace, it's the same issue as in MB-4774.
        http://review.couchbase.org/#change,13067

        Thuan Nguyen added a comment:

        Integrated in github-couchdb-preview #333 (See http://qa.hq.northscale.net/job/github-couchdb-preview/333/)
        don't wait for EXIT from cleaner when it's dead. MB-4760 (Revision 08e7f872750c5e3ff2104022e97fc8be4859a5b5)

        Result = SUCCESS
        Filipe David Borba Manana :
        Files :

        • src/couch_set_view/src/couch_set_view_group.erl
        Filipe Manana added a comment:

        Both Alk's change and mine were merged. Closing this.
        https://github.com/couchbase/couchdb/commit/6319846fa68c73580e5ead96dbe27868447f730f
        https://github.com/couchbase/couchdb/commit/08e7f872750c5e3ff2104022e97fc8be4859a5b5

          People

          • Assignee: Filipe Manana (Inactive)
          • Reporter: Tommie McAfee
          • Votes: 0
          • Watchers: 0
