Details
Bug
Resolution: Fixed
Critical
7.1.0
Untriaged
0
Yes
Description
This considerably slows down the startup of the ns_couchdb node and produces multiple cascading crashes like this one:
[error_logger:error,2023-02-15T09:01:22.091-08:00,couchdb_n_0@cb.local:cb_config_couch_sync<0.307.0>:ale_error_logger_handler:do_log:101]
=========================ERROR REPORT=========================
** Generic server cb_config_couch_sync terminating
** Last message in was {notable_change,secure_headers}
** When Server state == {state}
** Reason for termination ==
** {{timeout,{gen_server,call,
     [couch_config,
      {set,"httpd","extra_headers",
       "[{\"X-Content-Type-Options\",\"nosniff\"},\n {\"X-Frame-Options\",\"DENY\"},\n {\"X-Permitted-Cross-Domain-Policies\",\"none\"},\n {\"X-XSS-Protection\",\"1; mode=block\"}]",
       false}]}},
    [{gen_server,call,2,[{file,"gen_server.erl"},{line,370}]},
     {cb_config_couch_sync,apply_to_couch_config,4,
      [{file,"src/cb_config_couch_sync.erl"},{line,104}]},
     {cb_config_couch_sync,handle_info,2,
      [{file,"src/cb_config_couch_sync.erl"},{line,59}]},
     {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,1123}]},
     {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1200}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
The commit that caused the regression:
https://review.couchbase.org/c/couchdb/+/162551
The timeout happens because the following call takes 5 sec to complete:
couch_system_event:settings_changed
On the surface, it looks like the code was designed with the intent of making the couch_system_event:settings_changed call fast:
handle_call({system_event, Data}, _, #state{queue = Queue} = State) ->
    NewQueue = queue_put(Queue, Data),
    State2 = State#state{queue=NewQueue},
    self() ! send,
    {reply, ok, State2};
But since the potentially slow lhttpc:request call is made in the same process, the following happens: multiple send messages pile up in the process's message queue, and the handler for each of them executes lhttpc:request. Meanwhile, the next system_event call sits in the queue waiting for all the send handlers to finish, and the 5 sec gen_server:call timeout expires.
The easiest fix is to make the system_event call a cast.
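A minimal sketch of what the cast variant could look like. This is not the actual couch_system_event source: the module name, the state record, queue_put/2 and the queued/0 helper are stand-ins for illustration, and the send handler is left as a no-op where the real server would call lhttpc:request.

```erlang
%% Sketch only: names below are hypothetical, not the real couch_system_event.
-module(system_event_cast_sketch).
-behaviour(gen_server).

-export([start_link/0, system_event/1, queued/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {queue = queue:new()}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Fire-and-forget: the caller never waits behind pending send handlers,
%% so it cannot hit the gen_server:call timeout.
system_event(Data) ->
    gen_server:cast(?MODULE, {system_event, Data}).

%% Inspection helper for the sketch: number of events currently queued.
queued() ->
    gen_server:call(?MODULE, queued).

init([]) ->
    {ok, #state{}}.

handle_call(queued, _From, #state{queue = Queue} = State) ->
    {reply, queue:len(Queue), State}.

%% Same body as the original handle_call clause, minus the reply.
handle_cast({system_event, Data}, #state{queue = Queue} = State) ->
    NewQueue = queue_put(Queue, Data),
    self() ! send,
    {noreply, State#state{queue = NewQueue}}.

%% Placeholder: the real handler would drain the queue via lhttpc:request.
handle_info(send, State) ->
    {noreply, State}.

queue_put(Queue, Data) ->
    queue:in(Data, Queue).
```

Note that a cast only removes the caller-side timeout. The slow request still runs inside the server process, so messages can still accumulate in its mailbox.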
The proper fix would be a redesign of couch_system_event so that lhttpc:request doesn't block the message queue. The current version of the couch_system_event server can accumulate a lot of system_event messages in the queue and then crash, losing a bunch of log entries.
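One possible shape for such a redesign, as a rough sketch under stated assumptions: the server keeps the queue but hands each delivery to a monitored helper process, so its own message loop stays responsive. The module name, the state record, and the SendFun argument (standing in for the lhttpc:request call) are all hypothetical; retry and overflow policies are deliberately left out.

```erlang
%% Sketch only: names below are hypothetical, not the real couch_system_event.
-module(system_event_async_sketch).
-behaviour(gen_server).

-export([start_link/1, system_event/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% sender is undefined or the {Pid, MonitorRef} of the in-flight helper.
-record(state, {queue = queue:new(), sender = undefined, send_fun}).

%% SendFun is a 1-arity fun that performs the (possibly slow) delivery.
start_link(SendFun) when is_function(SendFun, 1) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, SendFun, []).

system_event(Data) ->
    gen_server:call(?MODULE, {system_event, Data}).

init(SendFun) ->
    {ok, #state{send_fun = SendFun}}.

%% Enqueue and reply immediately; the slow work happens in another process.
handle_call({system_event, Data}, _From, #state{queue = Queue} = State) ->
    State2 = State#state{queue = queue:in(Data, Queue)},
    {reply, ok, maybe_start_sender(State2)}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The helper finished (or crashed): start the next delivery, if any.
handle_info({'DOWN', Ref, process, Pid, _Reason},
            #state{sender = {Pid, Ref}} = State) ->
    {noreply, maybe_start_sender(State#state{sender = undefined})};
handle_info(_Other, State) ->
    {noreply, State}.

%% At most one helper runs at a time, which preserves event order.
maybe_start_sender(#state{sender = undefined, queue = Queue,
                          send_fun = SendFun} = State) ->
    case queue:out(Queue) of
        {{value, Data}, Rest} ->
            Sender = spawn_monitor(fun() -> SendFun(Data) end),
            State#state{queue = Rest, sender = Sender};
        {empty, _} ->
            State
    end;
maybe_start_sender(State) ->
    State.
```

In this sketch an event whose helper crashes is dropped. A real redesign would also need to decide how to retry failed deliveries and how to bound the queue.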