Couchbase Server / MB-7847

Online upgrade 2.0.0 -> 2.0.1: addNode fails with "Prepare join failed..." for a node left unused for a long time after installation ("Skipped 1/2/3 heartbeats" logged at that time)

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None

      Description

      http://qa.hq.northscale.net/job/centos-64-2.0-upgrade/89/consoleFull

      ./testrunner -i /tmp/upgrade.ini get-logs=False,upgrade_version=2.0.1-169-rel,initial_vbuckets=64 -t newupgradetests.MultiNodesUpgradeTests.online_upgrade_rebalance_in_out,initial_version=2.0.0-1976-rel,items=50000,expire_time=10000,wait_expire=true,GROUP=2_0;ONLINE

      Steps:
      1. Create a cluster of two 2.0.0-1976 nodes (10.3.3.11, 10.3.3.14).
      2. Load 50000 items with expiration=10000.
      3. Install 2.0.1-169 on 10.3.3.13 and 10.3.3.16.
      4. After installation the test slept for 10000 seconds (~2.8 hours) and then tried to add the new nodes to the cluster:
      [2013-02-28 10:49:37,885] - [basetestcase:147] INFO - sleep for 10 secs. Installation of new version is done. Wait for rebalance ...
      [2013-02-28 10:49:47,896] - [basetestcase:147] INFO - sleep for 10000 secs. ...
      [2013-02-28 13:36:28,396] - [task:242] INFO - adding node 10.3.3.13:8091 to cluster
      [2013-02-28 13:36:28,401] - [rest_client:721] INFO - adding remote node @10.3.3.13:8091 to this cluster @10.3.3.11:8091
      [2013-02-28 13:36:31,520] - [task:242] INFO - adding node 10.3.3.16:8091 to cluster
      [2013-02-28 13:36:31,521] - [rest_client:721] INFO - adding remote node @10.3.3.16:8091 to this cluster @10.3.3.11:8091
      [2013-02-28 13:36:32,538] - [rest_client:578] ERROR - http://10.3.3.11:8091/controller/addNode error 400 reason: unknown ["Prepare join failed. Could not connect to \"10.3.3.16\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers."]
      [2013-02-28 13:36:32,538] - [rest_client:741] ERROR - add_node error : ["Prepare join failed. Could not connect to \"10.3.3.16\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers."]
      ERROR
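The timing in the testrunner log above can be checked with a short sketch. It is only an illustration: the helper name `parse_ts` is hypothetical, and the log lines are shortened copies of the ones quoted in this report.

```python
from datetime import datetime

def parse_ts(line):
    """Parse the leading "[YYYY-MM-DD HH:MM:SS,mmm]" stamp of a testrunner log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")

# Shortened copies of the log lines from the description.
sleep_start = parse_ts("[2013-02-28 10:49:47,896] - [basetestcase:147] INFO - sleep for 10000 secs. ...")
add_start = parse_ts("[2013-02-28 13:36:28,396] - [task:242] INFO - adding node 10.3.3.13:8091 to cluster")
add_error = parse_ts("[2013-02-28 13:36:32,538] - [rest_client:578] ERROR - addNode error 400")

# Time the freshly installed nodes sat idle before the first addNode call.
idle_seconds = (add_start - sleep_start).total_seconds()

# Time from the first addNode call to the "Prepare join failed" error.
failure_delay = (add_error - add_start).total_seconds()

print(f"idle before addNode: {idle_seconds:.1f} s")
print(f"first addNode to error: {failure_delay:.3f} s")
```

So the new nodes were idle for roughly 10000 seconds, and the failure for 10.3.3.16 was reported about 4 seconds after the first addNode call.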

      10.3.3.16 was not added; its logs from that time contain:

      [ns_server:error,2013-02-28T13:36:13.437,ns_1@127.0.0.1:ns_heart<0.26082.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:warn,2013-02-28T13:36:19.112,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 1 heartbeats

      [ns_server:warn,2013-02-28T13:36:26.792,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 3 heartbeats

      [ns_server:warn,2013-02-28T13:36:35.218,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 2 heartbeats

      [error_logger:info,2013-02-28T13:36:35.295,ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================PROGRESS REPORT=========================
      supervisor: {local,ns_server_sup}
      started: [{pid,<0.26105.0>},
      {name,master_activity_events_keeper},
      {mfargs,{master_activity_events_keeper,start_link,[]}},
      {restart_type,permanent},
      {shutdown,brutal_kill},
      {child_type,worker}]

      [ns_server:error,2013-02-28T13:36:35.359,ns_1@127.0.0.1:ns_heart<0.26082.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
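To see how the mb_master warnings line up with the failed addNode, here is a minimal sketch that pulls the "Skipped N heartbeats" entries out of the ns_server log lines quoted above. The function name `skipped_heartbeats` and the regular expression are assumptions made for this illustration, not part of ns_server or testrunner.

```python
import re
from datetime import datetime

# Matches e.g. "[ns_server:warn,2013-02-28T13:36:19.112,...]Skipped 1 heartbeats"
PATTERN = re.compile(r"\[ns_server:warn,([^,]+),.*\]Skipped (\d+) heartbeats")

def skipped_heartbeats(lines):
    """Return (timestamp, skipped_count) for every matching log line."""
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, int(match.group(2))))
    return events

log = [
    "[ns_server:warn,2013-02-28T13:36:19.112,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 1 heartbeats",
    "[ns_server:warn,2013-02-28T13:36:26.792,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 3 heartbeats",
    "[ns_server:warn,2013-02-28T13:36:35.218,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 2 heartbeats",
]

events = skipped_heartbeats(log)
total_skipped = sum(count for _, count in events)
window = (events[-1][0] - events[0][0]).total_seconds()

print(f"{total_skipped} heartbeats skipped within {window:.3f} s")
```

Assuming the clocks on the test machine and on 10.3.3.16 were roughly in sync, the addNode error at 13:36:32 falls inside this window, between the "Skipped 3" and "Skipped 2" warnings.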

      I also see several crashes on the nodes, even though they are clean. Should we fix that, perhaps as a separate bug?


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Please attach relevant measure-sched-delay recordings.

        No timeouts are supposed to occur while a node is outside a cluster, yet you appear to be hitting one.

        maria Maria McDuff (Inactive) added a comment -

        Andrei, is this still happening?

        andreibaranouski Andrei Baranouski added a comment -

        Tested with timeout=6000 sec: passed.

        /usr/bin/python2.7 testrunner -i centos-64-2.0-upgrade.ini -t newupgradetests.MultiNodesUpgradeTests.online_upgrade_rebalance_in_out,initial_version=2.0.0-1976-rel,items=50000,expire_time=6000,wait_expire=true,GROUP=2_0,upgrade_version=2.0.2-769-rel,initial_vbuckets=128


          People

          • Assignee: andreibaranouski Andrei Baranouski
          • Reporter: andreibaranouski Andrei Baranouski
          • Votes: 0
          • Watchers: 3
