Couchbase Server / MB-55608

couch_config:set crashes with timeout

Details

    • Untriaged
    • 0
    • Yes

    Description

      This considerably slows down the startup of the ns_couchdb node and produces multiple cascading crashes like this one:

      [error_logger:error,2023-02-15T09:01:22.091-08:00,couchdb_n_0@cb.local:cb_config_couch_sync<0.307.0>:ale_error_logger_handler:do_log:101]
      =========================ERROR REPORT=========================
      ** Generic server cb_config_couch_sync terminating 
      ** Last message in was {notable_change,secure_headers}
      ** When Server state == {state}
      ** Reason for termination ==
      ** {{timeout,{gen_server,call,
                               [couch_config,
                                {set,"httpd","extra_headers",
                                     "[{\"X-Content-Type-Options\",\"nosniff\"},\n {\"X-Frame-Options\",\"DENY\"},\n {\"X-Permitted-Cross-Domain-Policies\",\"none\"},\n {\"X-XSS-Protection\",\"1; mode=block\"}]",
                                     false}]}},
          [{gen_server,call,2,[{file,"gen_server.erl"},{line,370}]},
           {cb_config_couch_sync,apply_to_couch_config,4,
                                 [{file,"src/cb_config_couch_sync.erl"},{line,104}]},
           {cb_config_couch_sync,handle_info,2,
                                 [{file,"src/cb_config_couch_sync.erl"},{line,59}]},
           {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,1123}]},
           {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1200}]},
           {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
      

      The commit that caused the regression:
      https://review.couchbase.org/c/couchdb/+/162551

      The timeout happens because the following call takes 5 sec to complete:
      couch_system_event:settings_changed

      On the surface, it looks like the code was designed to make the couch_system_event:settings_changed call fast:

      handle_call({system_event, Data}, _, #state{queue = Queue} = State) ->
          NewQueue = queue_put(Queue, Data),
          State2 = State#state{queue=NewQueue},
          self() ! send,
          {reply, ok, State2};
      

      But since the potentially slow lhttpc:request call is made in the same process, the following happens: multiple send messages pile up in the process's message queue, and the handler for each of them executes lhttpc:request. Meanwhile, the system_event call sits in the queue waiting for all the send handlers to finish, until the 5 sec gen_server:call timeout expires.
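
      The pile-up can be reproduced in isolation. The module below is only a sketch, not the couch_system_event code: the module name, the 300 ms sleep standing in for lhttpc:request, the 1 sec call timeout (shortened from 5 sec so the demo runs quickly) and the plain list used as the queue are all assumptions made for illustration.

```erlang
-module(slow_queue_demo).
-behaviour(gen_server).

-export([system_event/2, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical stand-in for the latency of lhttpc:request.
-define(SLOW_MS, 300).

%% Synchronous API with an explicit (shortened) call timeout.
system_event(Pid, Data) ->
    gen_server:call(Pid, {system_event, Data}, 1000).

init([]) ->
    {ok, []}.

%% Same shape as the handler quoted above: enqueue, schedule a send, reply.
handle_call({system_event, Data}, _From, Queue) ->
    self() ! send,
    {reply, ok, [Data | Queue]}.

handle_cast(_Msg, Queue) ->
    {noreply, Queue}.

%% Every send blocks the server process, as a slow HTTP request would.
handle_info(send, Queue) ->
    timer:sleep(?SLOW_MS),
    {noreply, Queue};
handle_info(_Other, Queue) ->
    {noreply, Queue}.

%% Five pending sends (about 1.5 sec of work) sit in front of the call,
%% so the call cannot be served before its 1 sec timeout expires.
demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    [Pid ! send || _ <- lists:seq(1, 5)],
    Result = try system_event(Pid, event)
             catch exit:{timeout, _} -> timeout
             end,
    exit(Pid, kill),
    Result.
```

      Under these assumptions, slow_queue_demo:demo() returns timeout even though the handle_call clause itself does almost no work.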

      The easiest fix is to make the system_event call a cast.
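
      A minimal sketch of what the cast-based variant could look like, again as a standalone demo rather than a patch to couch_system_event. The module name, the 300 ms sleep and the list-based queue are hypothetical; only the {system_event, Data} message shape comes from the handler quoted above.

```erlang
-module(cast_event_demo).
-behaviour(gen_server).

-export([system_event/2, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Asynchronous API: gen_server:cast returns ok without waiting for the server.
system_event(Pid, Data) ->
    gen_server:cast(Pid, {system_event, Data}).

init([]) ->
    {ok, []}.

handle_call(_Request, _From, Queue) ->
    {reply, {error, unsupported}, Queue}.

%% Same work as the original handle_call clause, but no reply is sent.
handle_cast({system_event, Data}, Queue) ->
    self() ! send,
    {noreply, [Data | Queue]}.

%% Hypothetical stand-in for the slow lhttpc:request.
handle_info(send, Queue) ->
    timer:sleep(300),
    {noreply, Queue};
handle_info(_Other, Queue) ->
    {noreply, Queue}.

%% Even with a backlog of slow send messages, the caller is not blocked.
demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    [Pid ! send || _ <- lists:seq(1, 5)],
    T0 = erlang:monotonic_time(millisecond),
    ok = system_event(Pid, event),
    Elapsed = erlang:monotonic_time(millisecond) - T0,
    exit(Pid, kill),
    Elapsed < 100.
```

      The caller no longer times out, but it also no longer learns whether the event was accepted, and the server's mailbox can still grow while sends are slow.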

      The proper fix would be a redesign of couch_system_event so that lhttpc:request doesn't block the message queue. The current version of the couch_system_event server can accumulate a lot of system_event messages in the queue and then crash, losing a bunch of log entries.
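
      One possible shape for such a redesign, shown as a sketch under stated assumptions rather than as the actual fix: the server only manages the queue, and each slow request runs in a separate monitored process, at most one at a time. The module name, slow_send/1 (a stand-in for lhttpc:request) and the worker field are hypothetical.

```erlang
-module(offload_demo).
-behaviour(gen_server).

-export([system_event/2, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% worker holds the monitor reference of the in-flight sender, if any.
-record(state, {queue = queue:new(), worker = undefined}).

system_event(Pid, Data) ->
    gen_server:call(Pid, {system_event, Data}, 1000).

init([]) ->
    {ok, #state{}}.

%% Enqueue and reply immediately; the slow work never runs in this process.
handle_call({system_event, Data}, _From, #state{queue = Q} = State) ->
    {reply, ok, maybe_send(State#state{queue = queue:in(Data, Q)})}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The worker finished (normally or not): start the next send, if any.
%% In this sketch an event whose worker crashed is simply dropped.
handle_info({'DOWN', Ref, process, _Pid, _Reason}, #state{worker = Ref} = State) ->
    {noreply, maybe_send(State#state{worker = undefined})};
handle_info(_Other, State) ->
    {noreply, State}.

%% Start a worker only when none is running and the queue is non-empty.
maybe_send(#state{worker = undefined, queue = Q} = State) ->
    case queue:out(Q) of
        {{value, Data}, Rest} ->
            {_Pid, Ref} = spawn_monitor(fun() -> slow_send(Data) end),
            State#state{queue = Rest, worker = Ref};
        {empty, _} ->
            State
    end;
maybe_send(State) ->
    State.

%% Hypothetical stand-in for lhttpc:request.
slow_send(_Data) ->
    timer:sleep(300).

%% All five calls return ok promptly although each send takes 300 ms.
demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    Results = [system_event(Pid, N) || N <- lists:seq(1, 5)],
    exit(Pid, kill),
    Results.
```

      Because the server process stays responsive, the queue length is visible to it at all times, so a bound on the queue or a drop policy could be added in handle_call; that part is left out of the sketch.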

          People

            ankit.prabhu Ankit Prabhu
            artem Artem Stemkovski