Couchbase Server / MB-6315

service fails to start sometimes [was: cluster is broken when reboot all nodes at the same time]


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: 2.0
    • Affects Version: 2.0-beta
    • Component: ns_server
    • Security Level: Public
    • Labels: None

    Description

      build-705
      Steps:
      1. 3 nodes in a cluster with 1 sasl bucket and 10M items (10.3.121.112, 10.3.121.113, 10.3.121.114)
      2. reboot all nodes at the same time

      Result:
      10.3.121.112 and 10.3.121.113 are in pending state; 10.3.121.114 is down with this error in the logs:

      [error_logger:error,2012-08-19T20:31:39.066,ns_1@10.3.121.114:error_logger:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,menelaus_sup}
      Context: child_terminated
      Reason: {noproc,
               {gen_server,call,
                [{'stats_reader-sasl','ns_1@10.3.121.114'},
                 {latest,minute,1}]}}
      Offender: [{pid,<0.4312.0>},
                 {name,menelaus_web_alerts_srv},
                 {mfargs,{menelaus_web_alerts_srv,start_link,[]}},
                 {restart_type,permanent},
                 {shutdown,5000},
                 {child_type,worker}]


      [error_logger:error,2012-08-19T20:40:14.856,ns_1@10.3.121.114:error_logger:ale_error_logger_handler:log_msg:76]** Node 'ns_1@10.3.121.112' not responding **
      ** Removing (timedout) connection **

      [ns_server:error,2012-08-19T20:40:56.438,ns_1@10.3.121.114:ns_doctor:ns_doctor:update_status:203]The following buckets became not ready on node 'ns_1@10.3.121.112': ["sasl"], those of them are active []
      [error_logger:error,2012-08-19T20:42:34.008,ns_1@10.3.121.114:error_logger:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,'ns_vbm_new_sup-sasl'}
      Context: child_terminated
      Reason: normal
      Offender: [{pid,<0.7117.0>},
                 {name,
                  {new_child_id,
                   [171,172,173,174,175,176,177,178,179,180,181,182,
                    183,184,185,186,187,188,189,190,191,192,193,194,
                    195,196,197,198,199,200,201,202,203,204,205,206,
                    207,208,209,210,211,212,213,214,215,216,217,218,
                    219,220,221,222,223,224,225,226,227,228,229,230,
                    231,232,233,234,235,236,237,238,239,240,241,242,
                    243,244,245,246,247,248,249,250,251,252,253,254,
                    255,256,257,258,259,260,261,262,263,264,265,266,
                    267,268,269,270,271,272,273,274,275,276,277,278,
                    279,280,281,282,283,284,285,286,287,288,289,290,
                    291,292,293,294,295,296,297,298,299,300,301,302,
                    303,304,305,306,307,308,309,310,311,312,313,314,
                    315,316,317,318,319,320,321,322,323,324,325,326,
                    327,328,329,330,331,332,333,334,335,336,337,338,
                    339,340,341],
                   'ns_1@10.3.121.112'}},
                 {mfargs,
                  {ebucketmigrator_srv,start_link,
                   [{"10.3.121.112",11209},
                    {"10.3.121.114",11209},
                    [{username,"sasl"},
                     {password,"sasl"},
                     {vbuckets,
                      [171,172,173,174,175,176,177,178,179,180,181,182,
                       183,184,185,186,187,188,189,190,191,192,193,194,
                       195,196,197,198,199,200,201,202,203,204,205,206,
                       207,208,209,210,211,212,213,214,215,216,217,218,
                       219,220,221,222,223,224,225,226,227,228,229,230,
                       231,232,233,234,235,236,237,238,239,240,241,242,
                       243,244,245,246,247,248,249,250,251,252,253,254,
                       255,256,257,258,259,260,261,262,263,264,265,266,
                       267,268,269,270,271,272,273,274,275,276,277,278,
                       279,280,281,282,283,284,285,286,287,288,289,290,
                       291,292,293,294,295,296,297,298,299,300,301,302,
                       303,304,305,306,307,308,309,310,311,312,313,314,
                       315,316,317,318,319,320,321,322,323,324,325,326,
                       327,328,329,330,331,332,333,334,335,336,337,338,
                       339,340,341]},
                     {takeover,false},
                     {suffix,"ns_1@10.3.121.114"}]]}},
                 {restart_type,permanent},
                 {shutdown,60000},
                 {child_type,worker}]

      So, 10.3.121.114 could not find the orchestrator after restarting and never came back up.
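      The noproc in the first supervisor report is the generic Erlang failure mode of calling a registered gen_server that has not been (re)started yet. A minimal sketch for illustration only (the module and stub below are invented, not ns_server code), showing how a caller such as menelaus_web_alerts_srv dies while 'stats_reader-sasl' is still down:

```erlang
%% Minimal sketch, not actual ns_server code: a gen_server:call to a
%% registered name that is not running exits with noproc, matching the
%% Reason in the menelaus_sup report above.
-module(noproc_demo).
-export([run/0]).

run() ->
    try
        %% 'stats_reader-sasl' has not been started on this node yet,
        %% so gen_server:call exits with {noproc, {gen_server,call,...}}
        gen_server:call({'stats_reader-sasl', node()}, {latest, minute, 1})
    catch
        exit:{noproc, _MFA} ->
            %% in the real system the caller is not wrapped in try/catch,
            %% so its supervisor logs child_terminated instead
            noproc
    end.
```

      Until the per-bucket stats_reader child is running, every such call from the web-alerts worker terminates it, which is why menelaus_sup keeps restarting it during startup.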

      Error from the orchestrator, which hangs in the pending state:

      [ns_server:warn,2012-08-19T20:54:16.524,ns_1@10.3.121.112:'capi_ddoc_replication_srv-sasl':cb_generic_replication_srv:handle_info:140]Remote server node {'capi_ddoc_replication_srv-sasl','ns_1@10.3.121.114'} process down: noconnection
      [error_logger:error,2012-08-19T20:54:16.525,ns_1@10.3.121.112:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_memcached:init/1
      pid: <0.655.0>
      registered_name: []
      exception exit: {{badmatch,{error,timeout}},
                       [{mc_client_binary,cmd_binary_vocal_recv,5},
                        {mc_client_binary,create_bucket,4},
                        {ns_memcached,ensure_bucket,2},
                        {ns_memcached,init,1},
                        {gen_server,init_it,6},
                        {proc_lib,init_p_do_apply,3}]}
      in function gen_server:init_it/6
      ancestors: ['ns_memcached_sup-sasl','single_bucket_sup-sasl',<0.552.0>]
      messages: [check_started,check_started,check_started,check_started,
                 check_started,check_started,check_started,
                 {'$gen_call',{<0.462.0>,#Ref<0.0.0.5515>},connected},
                 check_started,check_started,
                 {'$gen_call',{<0.731.0>,#Ref<0.0.0.5674>},topkeys},
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,check_started,check_started,
                 {'$gen_call',{<0.462.0>,#Ref<0.0.0.5963>},connected},
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,
                 {'$gen_call',{<0.462.0>,#Ref<0.0.0.6549>},connected},
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,
                 {'$gen_call',{<0.462.0>,#Ref<0.0.0.6898>},connected},
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,
                 {'$gen_call',{<0.462.0>,#Ref<0.0.0.7388>},connected},
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,check_started,check_started,
                 check_started,check_started,
                 {'$gen_call',{<0.462.0>,#Ref<0.0.0.7797>},connected},
                 check_started,check_started,check_started]
      links: [<0.60.0>,<0.648.0>,#Port<0.7311>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 75025
      stack_size: 24
      reductions: 6393
      neighbours:
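      The crash report above has the classic badmatch shape: ns_memcached appears to pattern-match the mc_client_binary:create_bucket reply against ok, so a timeout reply kills the gen_server during init. A hedged sketch of that mechanism (the module and the stub function are invented for illustration, not the actual ns_memcached source):

```erlang
%% Sketch only: create_bucket_stub/0 stands in for the timed-out
%% mc_client_binary:create_bucket/4 call seen in the stack trace.
-module(badmatch_demo).
-export([run/0]).

create_bucket_stub() ->
    {error, timeout}.          %% memcached did not answer in time

ensure_bucket() ->
    %% matching the reply against ok turns an error reply into a
    %% {badmatch, {error,timeout}} exit, as in the crash report
    ok = create_bucket_stub().

run() ->
    try ensure_bucket()
    catch error:{badmatch, Why} -> {badmatch_exit, Why}
    end.
```

      In the real system nothing catches the badmatch, so gen_server:init_it/6 converts it into the exit reason logged above and the bucket worker is restarted by its supervisor.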

      Attachments

        1. logs12.tar.gz
          9.13 MB
        2. logs14.tar.gz
          8.84 MB
          People

            Assignee: Farshid Ghods (Inactive)
            Reporter: Andrei Baranouski
