Couchbase Server: MB-4970

Undefined set view error when running views after rebalancing out a node which was restarted

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-developer-preview-4
    • Fix Version/s: 2.0-beta
    • Component/s: view-engine
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Windows 2008 R2 64-bit, Ubuntu 11.04.

      Description

      Given:

      • A cluster with 3 boxes (2 Windows 2008 R2, 1 Ubuntu 11.04) and 5 data buckets with no replication.

      When:

      • The Ubuntu box goes down.
      • The couchbase-server service is restarted on that box.
      • The server is added back to the cluster.
      • A rebalance is performed.

      Then:

      • No view can be executed.
      • The admin console shows:

      Subset of nodes failed with the following error:

      [
        {
          "from": "http://10.230.58.238:8092/_view_merge/",
          "reason": "Undefined set view `test` for `_design/dev_CategoryTree` design document."
        }
      ]
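
      For context on the error text: the view-engine keeps one set-view group per bucket/design-document pair, so node 10.230.58.238 is reporting that it holds no group for bucket `test` under `_design/dev_CategoryTree`. A purely illustrative sketch of that kind of lookup (hypothetical module and data layout, not the actual couch_set_view code):

      -module(set_view_lookup).
      -export([lookup/3]).

      %% Groups is assumed here to be a [{{Bucket, DDocId}, Group}] list;
      %% the real view-engine tracks set-view groups per bucket/design doc.
      lookup(Bucket, DDocId, Groups) ->
          case lists:keyfind({Bucket, DDocId}, 1, Groups) of
              {_, Group} ->
                  {ok, Group};
              false ->
                  Msg = io_lib:format("Undefined set view `~s` for `~s` design document.",
                                      [Bucket, DDocId]),
                  {error, iolist_to_binary(Msg)}
          end.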

      Logs:
      [ns_server:info] [2012-03-28 17:11:53] [ns_1@10.230.58.221:<0.17585.189>:ns_vbm_sup:spawn_mover:198] Spawned mover "dev" 125 'ns_1@10.230.58.221' -> 'ns_1@10.230.58.238': <0.17586.189>
      [ns_server:info] [2012-03-28 17:11:53] [ns_1@10.230.58.221:<0.738.0>:ns_port_server:log:161] memcached<0.738.0>: Vbucket <121> is going dead.
      memcached<0.738.0>: Vbucket <122> is going dead.
      memcached<0.738.0>: Vbucket <123> is going dead.
      memcached<0.738.0>: Vbucket <124> is going dead.

      [rebalance:info] [2012-03-28 17:11:53] [ns_1@10.230.58.221:<0.17586.189>:ebucketmigrator_srv:init:135] CheckpointIdsDict:
      {dict,128,26,32,16,130,78,
            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
            {{[[0|1],[32|1],[64|1],[96|1]],
              [[19|1],[51|1],[83|1],[115|1]],
              [[6|1],[38|1],[70|1],[102|1]],
              [[25|1],[57|1],[89|1],[121|1]],
              [[12|1],[44|1],[76|1],[108|1]],
              [[31|1],[63|1],[95|1],[127|1]],
              [[18|1],[50|1],[82|1],[114|1]],
              [[5|2],[37|1],[69|1],[101|1]],
              [[24|1],[56|1],[88|1],[120|1]],
              [[11|1],[43|1],[75|1],[107|1]],
              [[14|1],[30|1],[46|1],[62|1],[78|1],[94|1],[110|1],[126|1]],
              [[1|1],[17|1],[33|1],[49|1],[65|1],[81|1],[97|1],[113|1]],
              [[4|1],[20|1],[36|1],[52|1],[68|1],[84|1],[100|1],[116|1]],
              [[7|1],[23|1],[39|1],[55|1],[71|1],[87|1],[103|1],[119|1]],
              [[10|1],[26|1],[42|1],[58|1],[74|1],[90|1],[106|1],[122|1]],
              [[13|1],[29|1],[45|1],[61|1],[77|1],[93|1],[109|1],[125|1]]},
             {[[16|1],[48|1],[80|1],[112|1]],
              [[3|1],[35|1],[67|1],[99|1]],
              [[22|1],[54|1],[86|1],[118|1]],
              [[9|1],[41|1],[73|1],[105|1]],
              [[28|1],[60|1],[92|1],[124|1]],
              [[15|1],[47|1],[79|1],[111|1]],
              [[2|1],[34|1],[66|1],[98|1]],
              [[21|1],[53|1],[85|1],[117|1]],
              [[8|1],[40|1],[72|1],[104|1]],
              [[27|1],[59|1],[91|1],[123|1]],
              [],[],[],[],[],[]}}}
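
      An aside on reading the dump above: it is the internal representation of an Erlang stdlib dict mapping vbucket id to checkpoint id, where each [VBucket|CheckpointId] cons cell is one entry (every vbucket is at checkpoint 1 except vbucket 5, at checkpoint 2). A minimal decoding sketch, assuming the printed term is pasted back in:

      -module(checkpoint_dump).
      -export([to_pairs/1]).

      %% dict:to_list/1 accepts the opaque dict term and flattens it into
      %% {VBucketId, CheckpointId} pairs; sorting makes the list readable.
      to_pairs(CheckpointIdsDict) ->
          lists:sort(dict:to_list(CheckpointIdsDict)).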

      [rebalance:info] [2012-03-28 17:11:53] [ns_1@10.230.58.221:<0.17586.189>:ebucketmigrator_srv:init:166] Starting tap stream:
      [{vbuckets,"}"},
       {checkpoints,[{125,0}]},
       {name,"rebalance_125"},
       {takeover,true}]
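
      The {vbuckets,"}"} entry is not corruption: the vbucket id list appears to be logged as a plain Erlang list of integers, and a list of printable character codes renders as a string, so [125] prints as "}". That matches {checkpoints,[{125,0}]} and the stream name "rebalance_125". A one-line check in the Erlang shell:

      1> "}" =:= [125].
      true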

      [error_logger:error] [2012-03-28 17:11:53] [ns_1@10.230.58.221:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server auto_failover terminating
      ** Last message in was tick
      ** When Server state == {state,
                               {state,[{node_state,'ns_1@10.230.58.221',0,up,false},
                                       {node_state,'ns_1@10.230.58.238',1,nearly_down,false},
                                       {node_state,'ns_1@10.230.58.37',0,up,false}],
                                      0,3},
                               {interval,#Ref<0.0.3282.124225>},
                               30,0}
      ** Reason for termination ==
      ** {{badmatch,rebalancing},
          [{ns_cluster_membership,failover,1},
           {auto_failover,'-handle_info/2-fun-0-',2},
           {lists,foldl,3},
           {auto_failover,handle_info,2},
           {gen_server,handle_msg,5},
           {proc_lib,init_p_do_apply,3}]}

      [ns_server:info] [2012-03-28 17:11:53] [ns_1@10.230.58.221:<0.17586.189>:ebucketmigrator_srv:init:175] upstream_sender pid: <0.17589.189>
      [error_logger:error] [2012-03-28 17:11:53] [ns_1@10.230.58.221:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
        initial call: auto_failover:init/1
        pid: <0.14760.180>
        registered_name: []
        exception exit: {{badmatch,rebalancing},
                         [{ns_cluster_membership,failover,1},
                          {auto_failover,'-handle_info/2-fun-0-',2},
                          {lists,foldl,3},
                          {auto_failover,handle_info,2},
                          {gen_server,handle_msg,5},
                          {proc_lib,init_p_do_apply,3}]}
          in function gen_server:terminate/6
        ancestors: [mb_master_sup,mb_master,ns_server_sup,ns_server_cluster_sup,
                    <0.51.0>]
        messages: []
        links: [<0.696.0>,<0.160.0>]
        dictionary: [{random_seed,{6571,6892,26285}}]
        trap_exit: false
        status: running
        heap_size: 75025
        stack_size: 24
        reductions: 4294650
        neighbours:

      [error_logger:error] [2012-03-28 17:11:53] [ns_1@10.230.58.221:error_logger:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,mb_master_sup}
      Context:    child_terminated
      Reason:     {{badmatch,rebalancing},
                   [{ns_cluster_membership,failover,1},
                    {auto_failover,'-handle_info/2-fun-0-',2},
                    {lists,foldl,3},
                    {auto_failover,handle_info,2},
                    {gen_server,handle_msg,5},
                    {proc_lib,init_p_do_apply,3}]}
      Offender:   [{pid,<0.14760.180>},
                   {name,auto_failover},
                   {mfargs,{auto_failover,start_link,[]}},
                   {restart_type,permanent},
                   {shutdown,10},
                   {child_type,worker}]

      [rebalance:info] [2012-03-28 17:11:53] [ns_1@10.230.58.221:<0.17586.189>:ebucketmigrator_srv:terminate:202] Skipping close ack for successfull takover
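
      Reading the {badmatch,rebalancing} reports above: auto_failover's tick fired while the rebalance was still in flight, and a strict pattern match inside ns_cluster_membership:failover/1 received the atom rebalancing where it expected an idle cluster, killing the auto_failover server. A minimal sketch of that failure mode (hypothetical names, not the actual ns_server source):

      -module(failover_sketch).
      -export([failover/1]).

      %% Hypothetical stand-ins for the real ns_config lookups.
      rebalance_status() -> rebalancing.   %% the cluster state during this repro
      do_failover(_Node) -> ok.

      failover(Node) ->
          %% A strict match like this crashes with {badmatch,rebalancing}
          %% whenever a rebalance is running; matching that atom explicitly
          %% and returning it to the caller would avoid the crash.
          not_running = rebalance_status(),
          do_failover(Node).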


        Activity

        Farshid Ghods (Inactive) added a comment -

        Can you also attach the diags from :8091/diag?

        Did you wait until the node came back up and the bucket status was ready (green)?

        francares added a comment -

        Yep, I did. If I don't wait until it becomes green, I can't run a rebalance operation, I think.

        Could you tell me what other info from the diag file you need? Its size is about 300 MB; I've added the latest log entries from when the issue manifested.
        In my opinion you should improve the logging functionality, because such big log files are hard to work with.
        Maybe a listener for Splunk or another such tool would be a good fit.

        Farshid Ghods (Inactive) added a comment -

        Can you please bzip2 or zip the file and see if the size is less than 20 MB?

        JIRA allows uploading attachments <= 20 MB.

        Farshid Ghods (Inactive) added a comment -

        Also, can you please update the "When" part: you didn't mention rebalancing or a possible failover in the bug description.

        francares added a comment -

        Added log file.

        I don't have permission to modify the description field, so here it goes:

        When:

        • The Ubuntu box goes down.
        • Restart the couchbase-server service on that box.
        • Add the server again.
        • Do a rebalance.

        Let me know if you need more info...

        Farshid Ghods (Inactive) added a comment -

        Hi Filipe,

        Can you please have a look at this and see if there is anything obvious in the logs?

        Farshid

        Filipe Manana (Inactive) added a comment -

        Can't see anything related to my side.

        However, the following is repeated several times (ep-engine/mccouch):

        [ns_server:info] [2012-03-28 17:32:53] [ns_1@10.230.58.221:<0.11597.191>:menelaus_web:handle_streaming:947] menelaus_web streaming socket closed by client
        [couchdb:info] [2012-03-28 17:32:54] [ns_1@10.230.58.221:<0.11891.190>:couch_log:info:39] FLUSHING ALL THE THINGS!
        [views:info] [2012-03-28 17:32:54] [ns_1@10.230.58.221:'capi_set_view_manager-qa':capi_set_view_manager:apply_map:445] Applying map to bucket qa:
        [{active,[]},
         {passive,[]},
         {replica,[]},
         {ignore,[]}]
        [views:info] [2012-03-28 17:32:54] [ns_1@10.230.58.221:'capi_set_view_manager-qa':capi_set_view_manager:apply_map:450] Classified vbuckets for qa:
        Active: []
        Passive: []
        Cleanup: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,
        26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
        48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,
        70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85]
        Replica: []
        ReplicaCleanup: []
        [couchdb:info] [2012-03-28 17:32:55] [ns_1@10.230.58.221:'capi_set_view_manager-qa':couch_log:error:42] MC daemon: Error opening vb 0 in <<"qa">>: {not_found, no_db_file}
        [couchdb:info] [2012-03-28 17:32:55] [ns_1@10.230.58.221:'capi_set_view_manager-qa':couch_log:error:42] MC daemon: Error opening vb 1 in <<"qa">>: {not_found, no_db_file}

        Isn't there a known issue with bucket flush not working correctly?

        I'm also seeing this (ns_server) repeated many times:

        {error,rest_error,
        <<"Could not connect to \"10.230.58.37\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers.">>,
        {error,econnrefused}}

        [cluster:debug] [2012-03-26 17:13:14] [ns_1@10.230.58.221:ns_cluster:ns_cluster:handle_call:120] add_node("10.230.58.37", 8091, ..) -> {error,
        engage_cluster,
        <<"Prepare join failed. Could not connect to \"10.230.58.37\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers.">>,

        And finally a file permission error when trying to open a vbucket database file:

        [couchdb:info] [2012-03-27 12:29:08] [ns_1@10.230.58.221:<0.9685.158>:couch_log:error:42] Set view `test`, main group `_design/dev_appsByCategory`, doc loader error
        error:      {case_clause,{error,eacces}}
        stacktrace: [{couch_db,fast_reads,2},
                     {couch_set_view_updater,'-load_changes/7-fun-2-',8},
                     {lists,foldl,3},
                     {couch_set_view_updater,load_changes,7},
                     {couch_set_view_updater,'-update/7-fun-4-',10}]

        The latter can be fixed so that it doesn't hard-crash, but that won't solve the underlying issue.
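
        A hedged sketch of that "don't hard crash" idea: {case_clause,{error,eacces}} says the case expression in couch_db:fast_reads/2 has no clause for a permission failure, so adding one would turn the crash into a reportable error (illustrative code only, not the actual couch_db source):

        -module(eacces_sketch).
        -export([open_vbucket_db/1]).

        %% Open a vbucket database file, mapping a permission failure to a
        %% structured error instead of crashing with {case_clause,_}.
        open_vbucket_db(Path) ->
            case file:open(Path, [read, raw, binary]) of
                {ok, Fd} ->
                    {ok, Fd};
                {error, eacces} ->
                    {error, {file_permission_denied, Path}};
                {error, Reason} ->
                    {error, Reason}
            end.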

        Farshid Ghods (Inactive) added a comment -

        Please try out the latest DP4 refresh build, available at http://www.couchbase.com/download


          People

          • Assignee:
            Farshid Ghods (Inactive)
          • Reporter:
            francares
          • Votes:
            0
          • Watchers:
            1

