Details
- Bug
- Resolution: Fixed
- Critical
- Cheshire-Cat
- Untriaged
- 1
- Yes
- KV-Engine 2021-March
Description
Logs: https://supportal.couchbase.com/snapshot/a5c987add8e2ba7de815826265fca1e4::1
Affected node: 172.23.104.16
While debugging an eventing case, we observed an issue where the libcouchbase instance used by eventing continuously times out while trying to fetch the cluster config map from KV:
2021-03-12T00:53:51.988-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b079d","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
Looking at the trigger point of these timeouts, we notice that around the time issues started memcached closed off quite a number of connections:
ag --no-number "unrecoverable error encountered" memcached.log | grep "2021-03-12T" | wc -l
189

2021-03-12T00:19:11.077389-08:00 INFO 118: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
2021-03-12T00:26:24.773140-08:00 INFO 154: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
2021-03-12T00:26:41.333501-08:00 INFO 1175: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
2021-03-12T00:26:52.328011-08:00 INFO 1160: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection
2021-03-12T00:26:52.328043-08:00 INFO 1179: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection
Corresponding to this timestamp, the following crash reports were seen on the local ns_server node. There are two unique crash reports, both of which arise during "get_keys", which I believe is triggered when a user views a bucket's documents via the UI:
1. First form:
[error_logger:error,2021-03-12T00:19:11.073-08:00,ns_1@172.23.104.16:<0.20507.0>:ale_error_logger_handler:do_log:107]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.20507.0>
registered_name: []
exception error: no function clause matching
mc_binary:decode_first_key(<<129,184,0,0,0,0,0,0,0,0,0,
43,0,0,0,0,0,0,0,0,0,0,0,0,
0,41,13,101,118,101,110,116,
105,110,103,58,58,49,51,51,
53,49,49,49,55,57,49,58,...>>) (src/mc_binary.erl, line 281)
in function mc_binary:mk_heap_item/3 (src/mc_binary.erl, line 287)
in call from mc_binary:handle_get_keys_response/4 (src/mc_binary.erl, line 361)
in call from mc_binary:'-do_get_keys/4-fun-1-'/5 (src/mc_binary.erl, line 465)
in call from mc_binary:'-get_keys_recv/5-fun-0-'/5 (src/mc_binary.erl, line 326)
in call from lists:foldl/3 (lists.erl, line 1263)
in call from mc_binary:get_keys_recv/5 (src/mc_binary.erl, line 322)
in call from mc_binary:do_get_keys/4 (src/mc_binary.erl, line 462)
ancestors: ['ns_memcached-metadata',<0.20476.0>,
'single_bucket_kv_sup-metadata',ns_bucket_sup,
ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
<0.254.0>,ns_server_cluster_sup,root_sup,<0.140.0>]
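For reference, the first 24 bytes of the binary passed to decode_first_key form a standard binary-protocol response header. A quick Python sketch (my own illustration, not ns_server code) shows the magic here is the expected 0x81, so this first crash appears to fail while decoding the response body rather than on the magic byte:

```python
import struct

# Leading 24 bytes of the binary from the first crash report.
hdr = bytes([129, 184, 0, 0, 0, 0, 0, 0, 0, 0, 0, 43,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Classic response header layout, all fields big-endian:
# magic(1) opcode(1) key_len(2) ext_len(1) datatype(1)
# status(2) total_body_len(4) opaque(4) cas(8)
(magic, opcode, key_len, ext_len, datatype,
 status, body_len, opaque, cas) = struct.unpack('>BBHBBHIIQ', hdr)

print(hex(magic))   # 0x81 - the response magic ns_server expects
print(hex(opcode))  # 0xb8
print(body_len)     # 43
```

Since the header itself decodes cleanly with the expected magic, the unmatched clause must be hit further into the key/body decoding, which distinguishes this crash from the second form.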
2. Second form:
[error_logger:error,2021-03-12T00:27:03.617-08:00,ns_1@172.23.104.16:<0.5299.24>:ale_error_logger_handler:do_log:107]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.5299.24>
registered_name: []
exception error: no function clause matching
mc_binary:decode_header(res,
<<24,2,3,0,16,0,0,0,0,0,0,19,79,
143,1,0,22,107,138,232,192,126,
0,0>>) (src/mc_binary.erl, line 205)
in function mc_binary:quick_active_recv/3 (src/mc_binary.erl, line 55)
in call from mc_binary:'-get_keys_recv/5-fun-0-'/5 (src/mc_binary.erl, line 324)
in call from lists:foldl/3 (lists.erl, line 1263)
in call from mc_binary:get_keys_recv/5 (src/mc_binary.erl, line 322)
in call from mc_binary:do_get_keys/4 (src/mc_binary.erl, line 462)
in call from mc_binary:get_keys/5 (src/mc_binary.erl, line 440)
in call from ns_memcached:do_handle_call/3 (src/ns_memcached.erl, line 632)
ancestors: ['ns_memcached-metadata',<0.20476.0>,
'single_bucket_kv_sup-metadata',ns_bucket_sup,
ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
<0.254.0>,ns_server_cluster_sup,root_sup,<0.140.0>]
Note the parameters passed to mc_binary:decode_header in the second crash: the magic byte being reported is 24 (0x18).
decode_header: http://src.couchbase.org/source/xref/trunk/ns_server/src/mc_binary.erl#215
Note that ns_server always maps RES_MAGIC to 0x81: http://src.couchbase.org/source/xref/trunk/couchdb/src/couch_audit/src/memcached_calls.erl#27
However, memcached can also respond with magic 0x18: https://github.com/couchbase/kv_engine/blob/6356c8e84562160f4da062d68617f2a58be600fa/include/mcbp/protocol/magic.h#L35
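Decoding those 24 bytes with the alternative ("flexible framing") response layout bears this out. A short Python sketch (illustrative only; the field layout is my reading of the kv_engine protocol, not ns_server code):

```python
import struct

# Header bytes passed to mc_binary:decode_header in the second crash.
hdr = bytes([24, 2, 3, 0, 16, 0, 0, 0, 0, 0, 0, 19,
             79, 143, 1, 0, 22, 107, 138, 232, 192, 126, 0, 0])

# Alternative (0x18) response header layout, all fields big-endian:
# magic(1) opcode(1) framing_extras_len(1) key_len(1) ext_len(1)
# datatype(1) status(2) total_body_len(4) opaque(4) cas(8)
(magic, opcode, framing_len, key_len, ext_len, datatype,
 status, body_len, opaque, cas) = struct.unpack('>BBBBBBHIIQ', hdr)

print(hex(magic))                       # 0x18 - the alternative response magic
print(framing_len + ext_len + key_len)  # 19, matching total_body_len
```

The framing-extras, extras, and key lengths (3 + 16 + 0) sum to the 19-byte total body length, i.e. this looks like a well-formed alternative response that ns_server's decode_header simply has no clause for.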
Such a response crashes the decode, causing ns_memcached to close its connection to kv_engine. This in turn leads memcached to assume that its connection to ns_server has been terminated.
A similar issue was seen earlier in https://issues.couchbase.com/browse/MB-45206; however, it seems the detail that ns_server cannot match this magic was overlooked there.
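Illustratively, a decoder tolerant of both magics has to pick the header layout based on the first byte. A hypothetical Python sketch of that dispatch (not the actual ns_server fix; the 0x18 layout is assumed per the kv_engine header linked above):

```python
import struct

def decode_response_header(hdr: bytes) -> dict:
    """Decode a 24-byte memcached response header, branching on the magic.

    Hypothetical sketch: 0x81 is the classic response magic,
    0x18 the alternative (flexible-framing) response magic.
    """
    magic = hdr[0]
    if magic == 0x81:
        # Classic response: 2-byte key length, no framing extras.
        (magic, opcode, key_len, ext_len, datatype, status,
         body_len, opaque, cas) = struct.unpack('>BBHBBHIIQ', hdr)
        framing_len = 0
    elif magic == 0x18:
        # Alternative response: 1-byte framing-extras length, 1-byte key length.
        (magic, opcode, framing_len, key_len, ext_len, datatype,
         status, body_len, opaque, cas) = struct.unpack('>BBBBBBHIIQ', hdr)
    else:
        raise ValueError(f"unknown magic {magic:#x}")
    return {"magic": magic, "opcode": opcode, "framing_len": framing_len,
            "key_len": key_len, "ext_len": ext_len, "status": status,
            "body_len": body_len, "opaque": opaque, "cas": cas}
```

Running it over the headers from both crash reports decodes each cleanly, whereas a decoder matching only 0x81 (as ns_server's does) has no clause for the second.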