Couchbase Server / MB-7113

windows - constant restarts of mb_master during small scale performance tests

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta-2
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      VMs, Windows 64-bit, 4 nodes, HDD, 4 cores, 24GB
      Build 1940

      Description

      Restarts happen on 1 or 2 nodes every time I run tests, usually with the same error.
      No problems with loading data and initial indexing. Why does it happen?

      [ns_server:info,2012-11-06T12:43:31.635,ns_1@10.2.3.31:mb_master<0.18558.13>:mb_master:terminate:288]Synchronously shutting down child mb_master_sup
      [error_logger:error,2012-11-06T12:43:32.745,ns_1@10.2.3.31:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,mb_master_sup}
      Context: shutdown_error
      Reason: killed
      Offender: [{pid,<0.19815.13>},
      {name,ns_orchestrator},
      {mfargs,{ns_orchestrator,start_link,[]}},
      {restart_type,permanent},
      {shutdown,20},
      {child_type,worker}]


      [stats:warn,2012-11-06T12:43:32.651,ns_1@10.2.3.31:system_stats_collector<0.478.0>:system_stats_collector:handle_info:133]lost 7 ticks
      [ns_server:debug,2012-11-06T12:43:33.495,ns_1@10.2.3.31:<0.18559.13>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_config_events,<0.18558.13>} exited with reason {timeout,
      {gen_server,
      call,
      [ns_node_disco,
      nodes_wanted]}}
      [error_logger:error,2012-11-06T12:43:34.073,ns_1@10.2.3.31:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** State machine mb_master terminating
      ** Last message in was send_heartbeat
      ** When State == master
      ** Data == {state,<0.19814.13>,'ns_1@10.2.3.31',
      ['ns_1@10.2.3.31','ns_1@10.2.3.33','ns_1@10.2.3.34',
      'ns_1@10.2.3.35'],
      {1352,234605,120106}}
      ** Reason for termination =
      ** {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}

      [ns_server:debug,2012-11-06T12:43:35.166,ns_1@10.2.3.31:ns_server_sup<0.385.0>:mb_master:check_master_takeover_needed:144]Sending master node question to the following nodes: ['ns_1@10.2.3.35',
      'ns_1@10.2.3.34',
      'ns_1@10.2.3.33']
      [error_logger:error,2012-11-06T12:43:35.276,ns_1@10.2.3.31:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: mb_master:init/1
      pid: <0.18558.13>
      registered_name: mb_master
      exception exit: {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}
      in function gen_fsm:terminate/7
      ancestors: [ns_server_sup,ns_server_cluster_sup,<0.66.0>]
      messages: [send_heartbeat,send_heartbeat,send_heartbeat,send_heartbeat,
      {#Ref<0.0.372.79904>, ['ns_1@10.2.3.31','ns_1@10.2.3.33','ns_1@10.2.3.34', 'ns_1@10.2.3.35']}]
      links: [<0.385.0>,<0.18559.13>,<0.63.0>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 377
      stack_size: 24
      reductions: 147300
      neighbours:

      [error_logger:error,2012-11-06T12:43:35.307,ns_1@10.2.3.31:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,ns_server_sup}
      Context: child_terminated
      Reason: {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}
      Offender: [{pid,<0.18558.13>},
      {name,mb_master},
      {mfargs,{mb_master,start_link,[]}},
      {restart_type,permanent},
      {shutdown,infinity},
      {child_type,supervisor}]

      [ns_server:error,2012-11-06T12:43:35.323,ns_1@10.2.3.31:<0.788.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 10203000 us
      [couchdb:error,2012-11-06T12:43:41.588,ns_1@10.2.3.31:<0.24345.2>:couch_log:error:42]Uncaught error in HTTP request: {exit,
      {timeout,
      {gen_server,call,[ns_config,get]}}}

      Stacktrace: [{diag_handler,diagnosing_timeouts,1},
      {menelaus_auth,check_auth,1},
      {menelaus_auth,bucket_auth_fun,1},
      {menelaus_auth,is_bucket_accessible,2},
      {capi_frontend,do_db_req,2},
      {couch_httpd,handle_request,6},
      {mochiweb_http,headers,5},
      {proc_lib,init_p_do_apply,3}]
      [error_logger:error,2012-11-06T12:43:41.604,ns_1@10.2.3.31:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server disksup terminating
      ** Last message in was timeout
      ** When Server state == [{data,[{"OS",{win32,nt}},
      {"Timeout",60000},
      {"Threshold",80},
      {"DiskData", [{"C:\\",52324348,51}, {"E:\\",268432380,14}]}]}]
      ** Reason for termination ==
      ** {timeout,{gen_server,call,[os_mon_sysinfo,get_disk_info]}}

        Activity

        pavelpaulau Pavel Paulau added a comment -

        Reproduced in build 1956. Diags: https://s3.amazonaws.com/bugdb/jira/MB-7113/202261a9/diags_1956.tar.gz
        siri Sriram Melkote added a comment -

        Looking at perfmon, memory does not seem to be an issue. With no swap enabled, both CPU and memory usage are flat and stable. However, there are many crashes:

        6429 initial call: compaction_daemon:spawn_view_index_compactor/6-fun-0/0
        69 initial call: couch_db:init/1
        51 initial call: couch_file:spawn_writer/2
        15 initial call: disksup:init/1
        42 initial call: memsup:init/1

        Will ping Filipe for information on the first crash. The others appear likely to be related to MB-7180.
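
A tally like the one above can be produced by counting the `initial call:` lines of the crash reports in the collected logs. The following is a minimal Python sketch, not necessarily how the counts in this comment were obtained; the function name `tally_initial_calls` and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter

# The "initial call:" line of an Erlang crash report names the function
# the crashed process was started with.
INITIAL_CALL = re.compile(r"initial call:\s*(\S+)")


def tally_initial_calls(lines):
    """Return a Counter mapping each initial call to its number of crash reports."""
    counts = Counter()
    for line in lines:
        match = INITIAL_CALL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample; real input would be the lines of a log file from the diags.
sample = [
    "crasher:",
    "initial call: mb_master:init/1",
    "pid: <0.18558.13>",
    "initial call: disksup:init/1",
    "initial call: disksup:init/1",
]

# Print the most frequent crashers first, in the same shape as the list above.
for call, count in tally_initial_calls(sample).most_common():
    print(count, "initial call:", call)
```

To run it against a real log, pass an open file object instead of `sample`; the function only needs an iterable of lines.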

        steve Steve Yen added a comment -

        From bug-scrub: this should go away when we switch back to 16 async threads on Windows.

        Bin,
        Please change Erlang on Windows to use 16 async threads (for clarity, Linux remains non-async, with scheduler threads at 16:16).

        bcui Bin Cui added a comment -

        http://review.couchbase.org/#/c/22857/
        steve Steve Yen added a comment -

        http://review.couchbase.org/#/c/22868/ (on the right branch)

          People

          • Assignee: bcui Bin Cui
          • Reporter: pavelpaulau Pavel Paulau
