  1. Couchbase Server
  2. MB-7865

swap rebalance 1.8.1 -> 2.0.1: addNode error 500 reason: unknown /Connection attempt from disallowed node

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      centos-64, 1.8.1-942-rel -> 2.0.1-170-rel

      Description

      ./testrunner -i /tmp/upgrade.ini get-logs=False,upgrade_version=2.0.1-170-rel,initial_vbuckets=64 -t newupgradetests.MultiNodesUpgradeTests.online_upgrade_swap_rebalance,initial_version=1.8.1-942-rel,standard_buckets=1,items=500000,max_verify=1000,GROUP=1_8;ONLINE

      steps:
      1. install 2 nodes with 1.8.1-942: 10.3.3.14 & 10.3.3.11; 2 buckets × 500K items
      2. install 2 nodes with 2.0.1-170-rel: 10.3.3.13 & 10.3.3.16

      3. [2013-03-04 13:32:37,568] - [rest_client:721] INFO - adding remote node @10.3.3.13:8091 to this cluster @10.3.3.11:8091
      [2013-03-04 13:32:39,028] - [rest_client:804] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.3.11&user=Administrator&knownNodes=ns_1%4010.3.3.11%2Cns_1%4010.3.3.14%2Cns_1%4010.3.3.13

      4. [2013-03-04 13:36:55,544] - [newupgradetests:407] INFO - Swap rebalance: rebalance out 1 old version nodes, rebalance in 1 2.0 Nodes
      [2013-03-04 13:36:55,545] - [basetestcase:147] INFO - sleep for 10 secs. ...
      [2013-03-04 13:37:05,691] - [rest_client:819] INFO - /diag/eval status: True content: 'ns_1@10.3.3.13' command: node(global:whereis_name(ns_orchestrator))
      [2013-03-04 13:37:05,799] - [newupgradetests:420] INFO - 2.0 Node 10.3.3.13 becomes the master
      [2013-03-04 13:37:06,558] - [task:242] INFO - adding node 10.3.3.16:8091 to cluster
      [2013-03-04 13:37:06,559] - [rest_client:721] INFO - adding remote node @10.3.3.16:8091 to this cluster @10.3.3.14:8091
      [2013-03-04 13:37:36,569] - [rest_client:578] ERROR - http://10.3.3.14:8091/controller/addNode error 500 reason: unknown ["Unexpected server error, request logged."]
      [2013-03-04 13:37:36,571] - [rest_client:741] ERROR - add_node error : ["Unexpected server error, request logged."]
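The rebalance params logged in step 3 are a plain URL-encoded form body POSTed to ns_server's REST API. A minimal sketch of how such a body is built (hypothetical helper, not testrunner's actual code):

```python
from urllib.parse import urlencode

# Hypothetical helper: builds the form body shown in the step-3 log line.
# ns_server expects knownNodes/ejectedNodes as comma-separated otpNode names.
def build_rebalance_body(user, password, known_nodes, ejected_nodes):
    return urlencode({
        "password": password,
        "ejectedNodes": ",".join(ejected_nodes),
        "user": user,
        "knownNodes": ",".join(known_nodes),
    })

body = build_rebalance_body(
    "Administrator", "password",
    known_nodes=["ns_1@10.3.3.11", "ns_1@10.3.3.14", "ns_1@10.3.3.13"],
    ejected_nodes=["ns_1@10.3.3.11"],
)
# '@' is encoded as %40 and ',' as %2C, matching the logged request
```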

      on 10.3.3.14:

      [ns_server:debug] [2013-03-04 13:41:32] [ns_1@10.3.3.14:<0.14189.0>:ns_node_disco:do_nodes_wanted_updated_fun:182] ns_node_disco: nodes_wanted updated: ['ns_1@10.3.3.13','ns_1@10.3.3.14',
      'ns_1@10.3.3.16'], with cookie: ffswxwgaaiayftjf
      [ns_server:debug] [2013-03-04 13:41:32] [ns_1@10.3.3.14:<0.14189.0>:ns_node_disco:do_nodes_wanted_updated_fun:188] ns_node_disco: nodes_wanted pong: ['ns_1@10.3.3.13','ns_1@10.3.3.14',
      'ns_1@10.3.3.16'], with cookie: ffswxwgaaiayftjf
      [menelaus:warn] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.14035.0>:menelaus_web:loop:334] Server error during processing: ["web request failed",

      {path,"/controller/addNode"}

      ,

      {type,exit}

      ,
      {what,
      {timeout,
      {gen_server,call,
      [ns_cluster,
      {add_node,"10.3.3.16",8091,
      {"Administrator","password"}},
      30000]}}},
      {trace,
      [

      {gen_server,call,3},
      {ns_cluster,add_node,3},
      {menelaus_web,handle_add_node,1},
      {menelaus_web,loop,3},
      {mochiweb_http,headers,5},
      {proc_lib,init_p_do_apply,3}]}]
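The {what,{timeout,{gen_server,call,[ns_cluster,{add_node,...},30000]}}} term above means the web worker gave up after the 30000 ms gen_server:call timeout while ns_cluster kept working in the background — which is why add_node still completes at 13:41:36 even though the HTTP request had already returned 500. The same caller-side-timeout pattern, sketched in Python for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_add_node():
    # stands in for ns_cluster's add_node, which outlives the caller's patience
    time.sleep(0.2)
    return "ok"

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_add_node)
    timed_out = False
    try:
        fut.result(timeout=0.05)  # like gen_server:call/3 with a 30000 ms timeout
    except TimeoutError:
        timed_out = True          # caller reports failure (the HTTP 500)...
    result = fut.result()         # ...but the work itself still finishes
```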
      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:ns_bucket_worker:ns_bucket_sup:update_childs:91] Stopping child for dead bucket: {{per_bucket_sup,"default"},
      <0.630.0>,supervisor,
      [single_bucket_sup]}

      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:ns_config_log:ns_config_log:log_common:111] config change:
      buckets ->
      [{configs,[{"standard_bucket0",
      [{map,[]},
      {fastForwardMap,[]},
      {num_replicas,1},
      {ram_quota,736100352},
      {auth_type,none},
      {moxi_port,11214},
      {type,membase},
      {num_vbuckets,64},
      {servers,['ns_1@10.3.3.14','ns_1@10.3.3.13']}]}]}]
      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.630.0>:single_bucket_sup:top_loop:28] Delegating exit {'EXIT',<0.575.0>,shutdown} to child supervisor: <0.631.0>

      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.643.0>:ns_pubsub:do_subscribe_link:132] Parent process exited with reason shutdown
      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.641.0>:ns_pubsub:do_subscribe_link:132] Parent process exited with reason shutdown
      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.641.0>:ns_pubsub:do_subscribe_link:145] Deleting {ns_pubsub,#Ref<0.0.0.4996>} event handler: ok
      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.643.0>:ns_pubsub:do_subscribe_link:145] Deleting {ns_pubsub,#Ref<0.0.0.5024>} event handler: ok
      [rebalance:warn] [2013-03-04 13:41:33] [ns_1@10.3.3.14:<0.14009.0>:ebucketmigrator_srv:do_confirm_sent_messages:337] Got error while trying to read close ack:{error,closed}

      [ns_server:debug] [2013-03-04 13:41:33] [ns_1@10.3.3.14:ns_config_isasl_sync:ns_config_isasl_sync:writeSASLConf:133] Writing isasl passwd file: "/opt/couchbase/var/lib/couchbase/data/isasl.pw"
      [user:info] [2013-03-04 13:41:33] [ns_1@10.3.3.14:'ns_memcached-default':ns_memcached:terminate:467] Shutting down bucket "default" on 'ns_1@10.3.3.14' for deletion
      [ns_server:info] [2013-03-04 13:41:34] [ns_1@10.3.3.14:ns_port_memcached:ns_port_server:log:166] memcached<0.570.0>: TAP (Consumer) eq_tapq:anon_167 - disconnected
      memcached<0.570.0>: Schedule cleanup of "eq_tapq:anon_167"
      memcached<0.570.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.13 - disconnected, keep alive for 300 seconds
      memcached<0.570.0>: Shutting down tap connections!
      memcached<0.570.0>: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.11"
      memcached<0.570.0>: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.13"
      memcached<0.570.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.11 - Clear the tap queues by force
      memcached<0.570.0>: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.13 - Clear the tap queues by force

      [cluster:debug] [2013-03-04 13:41:36] [ns_1@10.3.3.14:ns_cluster:ns_cluster:do_add_node_engaged_inner:479] Reply from complete_join on "10.3.3.16:8091": {ok,[]}

      [cluster:debug] [2013-03-04 13:41:36] [ns_1@10.3.3.14:ns_cluster:ns_cluster:handle_call:120] add_node("10.3.3.16", 8091, ..) -> {ok,'ns_1@10.3.3.16'}

      [ns_server:info] [2013-03-04 13:41:36] [ns_1@10.3.3.14:ns_port_memcached:ns_port_server:log:166] memcached<0.570.0>: Total memory in memoryDeallocated() >= GIGANTOR !!! Disable the memory tracker...

      [ns_server:debug] [2013-03-04 13:41:37] [ns_1@10.3.3.14:'ns_memcached-default':ns_storage_conf:delete_bucket_db_directory:384] Result of deleting db directory: "/opt/couchbase/var/lib/couchbase/data/default-data": ok
      [ns_server:debug] [2013-03-04 13:41:37] [ns_1@10.3.3.14:<0.630.0>:single_bucket_sup:top_loop:24] per-bucket supervisor for "default" died with reason shutdown



      10.3.3.16

      [ns_doctor:error,2013-03-04T13:41:30.557,ns_1@10.3.3.16:<0.420.0>:ns_doctor:get_nodes:156]Error attempting to get nodes: {exit,
      {timeout,
      {gen_server,call,[ns_doctor,get_nodes]}}}
      [error_logger:error,2013-03-04T13:41:30.559,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server ns_doctor terminating
      ** Last message in was {get_node,'ns_1@10.3.3.16'}
      ** When Server state == {state,{dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[], [],[],[]},
      {{[],[],[],[],[],[],[],[],[],[],[],[],[],
      [],[],[]}}},
      undefined,undefined,undefined}
      ** Reason for termination ==
      ** {badarg,[{dict,fetch,
      ['ns_1@10.3.3.16',
      {dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
      []}}}]},
      {ns_doctor,handle_call,3},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}

      [ns_doctor:error,2013-03-04T13:41:30.559,ns_1@10.3.3.16:<0.420.0>:ns_doctor:get_node:165]Error attempting to get node 'ns_1@10.3.3.16': {exit,
      {{badarg,
      [{dict,fetch,
      ['ns_1@10.3.3.16',
      {dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[], [],[],[],[],[],[],[], [],[]},
      {{[],[],[],[],[],[],[],
      [],[],[],[],[],[],[],
      [],[]}}}]},
      {ns_doctor,handle_call,3},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply, 3}]},
      {gen_server,call,
      [ns_doctor,
      {get_node, 'ns_1@10.3.3.16'}]}}}
      [error_logger:error,2013-03-04T13:41:30.615,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_doctor:init/1
      pid: <0.1431.0>
      registered_name: ns_doctor
      exception exit: {badarg,
      [{dict,fetch,
      ['ns_1@10.3.3.16',
      {dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[], [],[]},
      {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
      [],[]}}}]},
      {ns_doctor,handle_call,3},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      in function gen_server:terminate/6
      ancestors: [ns_server_sup,ns_server_cluster_sup,<0.60.0>]
      messages: []
      links: [<0.346.0>,<0..34.0>,<0.57.0>]
      dictionary: []
      trap_exit: false
      status: running
      heap_size: 377
      stack_size: 24
      reductions: 1113
      neighbours:

      [error_logger:error,2013-03-04T13:41:30.616,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,ns_server_sup}
      Context: child_terminated
      Reason: {badarg,
      [{dict,fetch,
      ['ns_1@10.3.3.16',
      {dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[], []},
      {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
      []}}}]},
      {ns_doctor,handle_call,3},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      Offender: [{pid,<0.1431.0>},
      {name,ns_doctor},
      {mfargs,{ns_doctor,start_link,[]}},
      {restart_type,permanent},
      {shutdown,1000},
      {child_type,worker}]
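The ns_doctor crash is straightforward once decoded: its state dict is empty ({dict,0,...}), and Erlang's dict:fetch/2 throws badarg when the key is absent, so looking up 'ns_1@10.3.3.16' before any heartbeat was recorded kills the server. A Python analogue of the failure and of the non-throwing alternative (dict:find/2):

```python
# ns_doctor's state before any heartbeat arrived: an empty dict
nodes = {}

# dict:fetch/2 on a missing key throws badarg, much like KeyError here
crashed = False
try:
    nodes["ns_1@10.3.3.16"]
except KeyError:
    crashed = True

# dict:find/2 is the non-throwing lookup; .get() is the Python analogue
status = nodes.get("ns_1@10.3.3.16")  # None instead of a crash
```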


      [error_logger:error,2013-03-04T13:41:31.253,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,ns_bucket_sup}
      Context: shutdown_error
      Reason: normal
      Offender: [{pid,<0.467.0>},
      {name,buckets_observing_subscription},
      {mfargs,{ns_bucket_sup,subscribe_on_config_events,[]}},
      {restart_type,permanent},
      {shutdown,1000},
      {child_type,worker}]


      [error_logger:error,2013-03-04T13:41:31.850,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Connection attempt from disallowed node 'ns_1@10.3.3.14' **
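"Connection attempt from disallowed node" is emitted by Erlang's net_kernel when an allow list is in effect: while a join is being engaged, the new node only accepts distribution connections from peers it has been told about, and 10.3.3.14 was not yet on 10.3.3.16's list. A toy model of that gate (hypothetical names, illustration only):

```python
# Toy model of net_kernel's allow list: once set, connections from any
# node not on the list are refused with "Connection attempt from
# disallowed node". Names below are hypothetical.
allowed = {"ns_1@10.3.3.16"}            # joining node knows only itself so far

def accept_connection(peer):
    return peer in allowed

rejected = not accept_connection("ns_1@10.3.3.14")  # the log message above
allowed.add("ns_1@10.3.3.14")                        # after the join engages
accepted = accept_connection("ns_1@10.3.3.14")
```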

      [ns_server:error,2013-03-04T13:41:32.533,ns_1@10.3.3.16:ns_heart<0.1557.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3}

      ,

      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:error,2013-03-04T13:41:32.600,ns_1@10.3.3.16:ns_heart<0.1557.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0}

      ,

      {ns_heart,current_status,0}

      ,

      {ns_heart,handle_call,3}

      ,

      {gen_server,handle_msg,5}

      ,

      {proc_lib,init_p_do_apply,3}

      ]}
      [error_logger:error,2013-03-04T13:41:47.228,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor:

      {local,ns_bucket_sup}

      Context: shutdown_error
      Reason: normal
      Offender: [

      {pid,<0.1639.0>}

      ,

      {name,buckets_observing_subscription}

      ,
      {mfargs,{ns_bucket_sup,subscribe_on_config_events,[]}},

      {restart_type,permanent}

      ,

      {shutdown,1000}

      ,

      {child_type,worker}

      ]

      [error_logger:error,2013-03-04T13:41:47.540,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Connection attempt from disallowed node 'ns_1@10.3.3.14' **

      [error_logger:error,2013-03-04T13:41:47.546,ns_1@10.3.3.16:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Connection attempt from disallowed node 'ns_1@10.3.3.14' **


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        There's something timeoutful about this environment. Most likely not a problem if you continue retrying.

        [stats:warn,2013-03-04T13:41:30.549,ns_1@10.3.3.16:system_stats_collector<0.469.0>:system_stats_collector:handle_info:133]lost 6 ticks
        [ns_server:warn,2013-03-04T13:41:30.549,ns_1@10.3.3.16:mb_master<0.1439.0>:mb_master:handle_info:232]Skipped 5 heartbeats

        [ns_doctor:error,2013-03-04T13:41:30.557,ns_1@10.3.3.16:<0.420.0>:ns_doctor:get_nodes:156]Error attempting to get nodes: {exit,
        {timeout,

        {gen_server,call,[ns_doctor,get_nodes]}

        }}

        Please attach scheduler delay measurement recordings to investigate further.
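The "continue retrying" workaround suggested above amounts to re-issuing the addNode request after the transient 500. A generic sketch of such a wrapper (hypothetical, not part of testrunner):

```python
import time

# Hypothetical retry wrapper for a flaky cluster call; do_add() should
# return True on success and False on a transient 500.
def add_node_with_retry(do_add, attempts=3, delay=0.01):
    for i in range(attempts):
        if do_add():
            return True
        if i < attempts - 1:
            time.sleep(delay)  # back off before the next attempt
    return False

calls = {"n": 0}
def flaky_add():
    calls["n"] += 1
    return calls["n"] >= 3     # fails twice, then the timeout clears

ok = add_node_with_retry(flaky_add)
```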

        jin Jin Lim added a comment -

        Please provide the latest update, if any, on this. The dev team had previously requested some info (scheduler delay measurements).

        andreibaranouski Andrei Baranouski added a comment -

        I don't see the error on 2.0.2 builds. To use the measurement-delay recording, these changes need approval: http://review.couchbase.org/#/c/25110/ http://review.couchbase.org/#/c/25121/

        jin Jin Lim added a comment -

        Thanks much Andrei.

        Assuming a real customer performs a swap rebalance 1.8.1 -> 2.0.1, how easily would they run into this issue? Did you hit it every time you ran the test case?

        In any case, would retrying the failed add-node eventually succeed? As long as this is a corner case and/or there is a workaround (retrying or otherwise), we will consider not fixing it, since 2.0.2 doesn't show the same error.

        andreibaranouski Andrei Baranouski added a comment -

        Will rerun the http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-upgrade/ job against 2.0.1-170 with the measurement-delays script today.

        andreibaranouski Andrei Baranouski added a comment -

        The issue was not reproduced against the 170 build when run with measurement delays:
        http://qa.hq.northscale.net/job/centos-64-2.0-upgrade/101/
        http://qa.hq.northscale.net/job/centos-64-2.0-upgrade/100/

        andreibaranouski Andrei Baranouski added a comment -

        Not reproduced anymore.

        andreibaranouski Andrei Baranouski added a comment -

        Didn't see the issue anymore.


          People

          • Assignee:
            andreibaranouski Andrei Baranouski
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes