Couchbase Server / MB-7847

Online upgrade 2.0.0 -> 2.0.1: addNode of a node left unused for a long time after installation fails with "Prepare join failed..." ("Skipped 1/2/3 heartbeats" logged at that time)


Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 2.1.0
    • Affects Version/s: 2.0.1
    • Component/s: ns_server
    • Security Level: Public
    • None

    Description

      http://qa.hq.northscale.net/job/centos-64-2.0-upgrade/89/consoleFull

      ./testrunner -i /tmp/upgrade.ini get-logs=False,upgrade_version=2.0.1-169-rel,initial_vbuckets=64 -t newupgradetests.MultiNodesUpgradeTests.online_upgrade_rebalance_in_out,initial_version=2.0.0-1976-rel,items=50000,expire_time=10000,wait_expire=true,GROUP=2_0;ONLINE

      Steps:
      1. Create a cluster of two 2.0.0-1976 nodes (10.3.3.11, 10.3.3.14).
      2. Load 50000 items with expiration=10000.
      3. Install 2.0.1-169 on 10.3.3.13 and 10.3.3.16.
      4. After installation, the test slept for 10000 seconds (~3 hours) and then tried to add the new nodes to the cluster:
      [2013-02-28 10:49:37,885] - [basetestcase:147] INFO - sleep for 10 secs. Installation of new version is done. Wait for rebalance ...
      [2013-02-28 10:49:47,896] - [basetestcase:147] INFO - sleep for 10000 secs. ...
      [2013-02-28 13:36:28,396] - [task:242] INFO - adding node 10.3.3.13:8091 to cluster
      [2013-02-28 13:36:28,401] - [rest_client:721] INFO - adding remote node @10.3.3.13:8091 to this cluster @10.3.3.11:8091
      [2013-02-28 13:36:31,520] - [task:242] INFO - adding node 10.3.3.16:8091 to cluster
      [2013-02-28 13:36:31,521] - [rest_client:721] INFO - adding remote node @10.3.3.16:8091 to this cluster @10.3.3.11:8091
      [2013-02-28 13:36:32,538] - [rest_client:578] ERROR - http://10.3.3.11:8091/controller/addNode error 400 reason: unknown ["Prepare join failed. Could not connect to \"10.3.3.16\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers."]
      [2013-02-28 13:36:32,538] - [rest_client:741] ERROR - add_node error : ["Prepare join failed. Could not connect to \"10.3.3.16\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers."]
      ERROR
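
      For manual reproduction outside testrunner, the failing call can be sketched as a plain POST to /controller/addNode, the endpoint shown in the log above. This is a minimal sketch, not the testrunner implementation: the helper name build_add_node_request and the Administrator/password credentials are assumptions for illustration, and the request is only built, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_add_node_request(cluster_host, new_node_host, user, password, port=8091):
    """Build (but do not send) a POST request for /controller/addNode.

    Hypothetical helper; the credentials passed in are placeholders.
    """
    url = "http://%s:%d/controller/addNode" % (cluster_host, port)
    # Form fields: the node to add plus the admin credentials for that node.
    body = urllib.parse.urlencode(
        {"hostname": new_node_host, "user": user, "password": password}
    ).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    # HTTP basic auth against the cluster node that receives the request.
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


if __name__ == "__main__":
    req = build_add_node_request("10.3.3.11", "10.3.3.16", "Administrator", "password")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

      Building the request separately from sending it makes it easy to retry the same call after a delay and compare the responses.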

      10.3.3.16 was not added. Its logs from around that time contain:

      [ns_server:error,2013-02-28T13:36:13.437,ns_1@127.0.0.1:ns_heart<0.26082.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:warn,2013-02-28T13:36:19.112,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 1 heartbeats

      [ns_server:warn,2013-02-28T13:36:26.792,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 3 heartbeats

      [ns_server:warn,2013-02-28T13:36:35.218,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 2 heartbeats

      [error_logger:info,2013-02-28T13:36:35.295,ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================PROGRESS REPORT=========================
      supervisor: {local,ns_server_sup}
      started: [{pid,<0.26105.0>},
      {name,master_activity_events_keeper},
      {mfargs,{master_activity_events_keeper,start_link,[]}},
      {restart_type,permanent},
      {shutdown,brutal_kill},
      {child_type,worker}]

      [ns_server:error,2013-02-28T13:36:35.359,ns_1@127.0.0.1:ns_heart<0.26082.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
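
      To line up the "Skipped N heartbeats" warnings with the failed addNode call, the warnings can be pulled out of the node log with a small script. This is a rough sketch that assumes the log line layout shown above; the function name skipped_heartbeats is made up for illustration, and the sample text is copied from this report.

```python
import re
from datetime import datetime

# Matches ns_server warnings of the form shown in this report:
# [ns_server:warn,<timestamp>,<node>:<process>:<module>:<function>:<line>]Skipped N heartbeats
SKIPPED_RE = re.compile(
    r"\[ns_server:warn,(?P<ts>[^,]+),[^\]]*\]Skipped (?P<count>\d+) heartbeats"
)

SAMPLE = """\
[ns_server:warn,2013-02-28T13:36:19.112,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 1 heartbeats
[ns_server:warn,2013-02-28T13:36:26.792,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 3 heartbeats
[ns_server:warn,2013-02-28T13:36:35.218,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 2 heartbeats
"""


def skipped_heartbeats(log_text):
    """Return a list of (timestamp, skipped_count) tuples found in log_text."""
    events = []
    for match in SKIPPED_RE.finditer(log_text):
        when = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        events.append((when, int(match.group("count"))))
    return events


if __name__ == "__main__":
    for when, count in skipped_heartbeats(SAMPLE):
        print(when.isoformat(), count)
```

      The timestamps come from the node's own clock, so comparing them with the test runner's timestamps is only approximate.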

      I also see several crashes on the nodes, even though they are clean. (Should we fix this? Is it a separate bug?)


          People

            andreibaranouski Andrei Baranouski