Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-27594

NS Server will not recover if system runs out of file descriptor

    XMLWordPrintable

Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • 5.5.0
    • 5.5.0
    • ns_server
    • None
    • Local vagrant box
    • Untriaged
    • Ubuntu 64-bit
    • Unknown

    Description

      • Running 4 node cluster with KV, Eventing, Query, Indexing 
      • Execute the following command with root to limit the number of files descriptor echo "100" > /proc/sys/fs/file-max
      • Try to deploy any eventing function. Then we observer Connection Refuse error in eventing logs

      2018-01-16T05:06:09.167+00:00 [Info] CRDP[worker_Test_1:/tmp/127.0.0.1:8091_worker_Test_1.sock:1803] DCP events: DCP_STREAMREQ:341 V8 events: LATENCY_STATS:241 THR_MAP:1 THR_COUNT:1 SOURCE_MAP:1 HANDLER_CODE:1 EXECUTION_STATS:242 FAILURE_STATS:242 LCB_EXCEPTION_STATS:242 V8_INIT:1 V8_LOAD:1 Timer events: Doc: 0 Cron: 0, vbs owned len: 341 vbs owned: [342-682] Plasma stats: Insert: 0 Delete: 0 Lookup: 0
      2018-01-16T05:06:09.167+00:00 [Info] CRDP[worker_Test_1:/tmp/127.0.0.1:8091_worker_Test_1.sock:1803] CPP worker stats. Failure stats: {"bucket_op_exception_count":0,"checkpoint_failure_count":0,"n1ql_op_exception_count":0,"timeout_count":0} execution stats: {"agg_queue_size":0,"doc_timer_create_failure":0,"non_doc_timer_create_failure":0,"on_delete_failure":0,"on_delete_success":0,"on_update_failure":0,"on_update_success":0}
      2018-01-16T05:06:09.169+00:00 [Info] CRDP[worker_Test_0:/tmp/127.0.0.1:8091_worker_Test_0.sock:1807] DCP events: DCP_STREAMREQ:342 V8 events: THR_MAP:1 THR_COUNT:1 V8_LOAD:1 FAILURE_STATS:241 LATENCY_STATS:240 LCB_EXCEPTION_STATS:241 V8_INIT:1 SOURCE_MAP:1 HANDLER_CODE:1 EXECUTION_STATS:241 Timer events: Doc: 0 Cron: 0, vbs owned len: 342 vbs owned: [0-341] Plasma stats: Insert: 0 Delete: 0 Lookup: 0
      2018-01-16T05:06:09.171+00:00 [Info] CRDP[worker_Test_0:/tmp/127.0.0.1:8091_worker_Test_0.sock:1807] CPP worker stats. Failure stats: {"bucket_op_exception_count":0,"checkpoint_failure_count":0,"n1ql_op_exception_count":0,"timeout_count":0} execution stats: {"agg_queue_size":0,"doc_timer_create_failure":0,"non_doc_timer_create_failure":0,"on_delete_failure":0,"on_delete_success":0,"on_update_failure":0,"on_update_success":0}
      2018-01-16T05:06:09.179+00:00 [Info] CRDP[worker_Test_2:/tmp/127.0.0.1:8091_worker_Test_2.sock:1802] DCP events: DCP_STREAMREQ:341 V8 events: THR_MAP:1 THR_COUNT:1 V8_INIT:1 V8_LOAD:1 SOURCE_MAP:1 EXECUTION_STATS:241 HANDLER_CODE:1 FAILURE_STATS:241 LATENCY_STATS:240 LCB_EXCEPTION_STATS:241 Timer events: Doc: 0 Cron: 0, vbs owned len: 341 vbs owned: [683-1023] Plasma stats: Insert: 0 Delete: 0 Lookup: 0
      2018-01-16T05:06:09.179+00:00 [Info] CRDP[worker_Test_2:/tmp/127.0.0.1:8091_worker_Test_2.sock:1802] CPP worker stats. Failure stats: {"bucket_op_exception_count":0,"checkpoint_failure_count":0,"n1ql_op_exception_count":0,"timeout_count":0} execution stats: {"agg_queue_size":0,"doc_timer_create_failure":0,"non_doc_timer_create_failure":0,"on_delete_failure":0,"on_delete_success":0,"on_update_failure":0,"on_update_success":0}
      2018-01-16T05:06:09.203+00:00 [Error] CRCO[Big:worker_Big_1:/tmp/127.0.0.1:8091_worker_Big_1.sock:1679] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.203+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.262+00:00 [Error] CRCO[for:worker_for_0:/tmp/127.0.0.1:8091_worker_for_0.sock:1847] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.262+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.442+00:00 [Error] CRCO[Test:worker_Test_1:/tmp/127.0.0.1:8091_worker_Test_1.sock:1803] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.442+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.684+00:00 [Error] CRCO[Big:worker_Big_0:/tmp/127.0.0.1:8091_worker_Big_0.sock:1677] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.684+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.826+00:00 [Error] CRCO[Test:worker_Test_2:/tmp/127.0.0.1:8091_worker_Test_2.sock:1802] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.826+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.853+00:00 [Error] CRCO[Test:worker_Test_0:/tmp/127.0.0.1:8091_worker_Test_0.sock:1807] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.853+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.932+00:00 [Error] CRCO[for:worker_for_2:/tmp/127.0.0.1:8091_worker_for_2.sock:1857] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.932+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:09.989+00:00 [Error] CRCO[Big:worker_Big_2:/tmp/127.0.0.1:8091_worker_Big_2.sock:1684] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:09.989+00:00 [Debug] RTLP Retrying after 1s
      2018/01/16 05:06:10 revrpc: Got error (dial tcp 127.0.0.1:8091: getsockopt: connection refused) and will retry in 1s
      2018/01/16 05:06:10 revrpc: Got error (dial tcp 127.0.0.1:8091: getsockopt: connection refused) and will retry in 1s
      2018-01-16T05:06:10.204+00:00 [Error] CRCO[Big:worker_Big_1:/tmp/127.0.0.1:8091_worker_Big_1.sock:1679] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.204+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.264+00:00 [Error] CRCO[for:worker_for_0:/tmp/127.0.0.1:8091_worker_for_0.sock:1847] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.264+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.445+00:00 [Error] CRCO[Test:worker_Test_1:/tmp/127.0.0.1:8091_worker_Test_1.sock:1803] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.445+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.618+00:00 [Error] Eventing::main metakv observe error for settings, err: Get http://127.0.0.1:8091/_metakv/eventing/settings/?feed=continuous: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused. Retrying...
      2018-01-16T05:06:10.635+00:00 [Error] Eventing::main metakv observe error for rebalance token, err: Get http://127.0.0.1:8091/_metakv/eventing/rebalanceToken/?feed=continuous: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused. Retrying...
      2018-01-16T05:06:10.636+00:00 [Error] Eventing::main metakv observe error for event handler code, err: Get http://127.0.0.1:8091/_metakv/eventing/apps/?feed=continuous: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused. Retrying...
      2018-01-16T05:06:10.685+00:00 [Error] CRCO[Big:worker_Big_0:/tmp/127.0.0.1:8091_worker_Big_0.sock:1677] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.685+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.827+00:00 [Error] CRCO[Test:worker_Test_2:/tmp/127.0.0.1:8091_worker_Test_2.sock:1802] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.827+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.854+00:00 [Error] CRCO[Test:worker_Test_0:/tmp/127.0.0.1:8091_worker_Test_0.sock:1807] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.854+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.934+00:00 [Error] CRCO[for:worker_for_2:/tmp/127.0.0.1:8091_worker_for_2.sock:1857] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused
      2018-01-16T05:06:10.934+00:00 [Debug] RTLP Retrying after 1s
      2018-01-16T05:06:10.969+00:00 [Warn] CbAuthHandler::GetCredentials error=CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused Retrying (9)
      2018-01-16T05:06:10.989+00:00 [Error] CRCO[Big:worker_Big_2:/tmp/127.0.0.1:8091_worker_Big_2.sock:1684] Failed to grab routable interface, err: CBAuth database is stale: last reason: dial tcp 127.0.0.1:8091: getsockopt: connection refused

      * NS server not recovered when FD's are made available again. NS logs

      =========================ERROR REPORT=========================
      {mochiweb_socket_server,289,{acceptor_error,{error,accept_failed}}}
      [error_logger:error,2018-01-16T05:38:06.771Z,couchdb_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
      =========================ERROR REPORT=========================
      application: mochiweb
      "Accept failed error"
      "{error,enfile}"
       
      [ale_logger:error,2018-01-16T05:38:06.775Z,couchdb_ns_1@127.0.0.1:ale<0.47.0>:ale:handle_info:306]ale_reports_handler terminated with reason {'EXIT',
      {undef,
      [{lib,format_exception,
      [5,exit,
      {error,accept_failed},
      [{mochiweb_acceptor,init,3,
      [{file,
      "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_acceptor.erl"},
      {line,33}]},
      {proc_lib,init_p_do_apply,3,
      [{file,"proc_lib.erl"},
      {line,239}]}],
      #Fun<proc_lib.0.44701605>,
      #Fun<proc_lib.1.44701605>,
      latin1],
      []},
      {proc_lib,format_exception,4,
      [{file,"proc_lib.erl"},
      {line,729}]},
      {proc_lib,format_rep,2,
      [{file,"proc_lib.erl"},
      {line,718}]},
      {proc_lib,format_rep,2,
      [{file,"proc_lib.erl"},
      {line,720}]},
      {proc_lib,format_rep,2,
      [{file,"proc_lib.erl"},
      {line,716}]},
      {proc_lib,format,2,
      [{file,"proc_lib.erl"},
      {line,704}]},
      {ale_error_logger_handler,
      format_report,2,
      [{file,
      "src/ale_error_logger_handler.erl"},
      {line,152}]},
      {ale_error_logger_handler,
      log_report,3,
      [{file,
      "src/ale_error_logger_handler.erl"},
      {line,110}]}]}}; restarting
      [error_logger:error,2018-01-16T05:38:06.872Z,couchdb_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
      =========================ERROR REPORT=========================
      {mochiweb_socket_server,289,{acceptor_error,{error,accept_failed}}}
      [error_logger:error,2018-01-16T05:38:06.872Z,couchdb_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
      =========================ERROR REPORT=========================
      application: mochiweb
      "Accept failed error"
      "{error,enfile}"
       
      [ale_logger:error,2018-01-16T05:38:06.875Z,couchdb_ns_1@127.0.0.1:ale<0.47.0>:ale:handle_info:306]ale_reports_handler terminated with reason {'EXIT',
      {undef,
      [{lib,format_exception,
      [5,exit,
      {error,accept_failed},
      [{mochiweb_acceptor,init,3,
      [{file,
      "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_acceptor.erl"},
      {line,33}]},
      {proc_lib,init_p_do_apply,3,
      [{file,"proc_lib.erl"},
      {line,239}]}],
      #Fun<proc_lib.0.44701605>,
      #Fun<proc_lib.1.44701605>,
      latin1],
      []},
      {proc_lib,format_exception,4,
      [{file,"proc_lib.erl"},
      {line,729}]},
      {proc_lib,format_rep,2,
      [{file,"proc_lib.erl"},
      {line,718}]},
      {proc_lib,format_rep,2,
      [{file,"proc_lib.erl"},
      {line,720}]},
      {proc_lib,format_rep,2,
      [{file,"proc_lib.erl"},
      {line,716}]},
      {proc_lib,format,2,
      [{file,"proc_lib.erl"},
      {line,704}]},
      {ale_error_logger_handler,
      format_report,2,
      [{file,
      "src/ale_error_logger_handler.erl"},
      {line,152}]},
      {ale_error_logger_handler,
      log_report,3,
      [{file,
      "src/ale_error_logger_handler.erl"},
      {line,110}]}]}}; restarting

      System only recover after restart of couchbase server

      Attachments

        1. MB-27594_1.zip
          1.16 MB
        2. MB-27594.zip
          5.56 MB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              dfinlay Dave Finlay
              vikas.chaudhary Vikas Chaudhary
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty