Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-45599

Corrupt mcbp response header from get_all_keys

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • Yes
    • KV-Engine 2021-March

    Description

      Logs : https://supportal.couchbase.com/snapshot/a5c987add8e2ba7de815826265fca1e4::1
      Concerned node : 172.23.104.16

      While debugging an eventing case, we observed an issue where libcouchbase instance used by eventing continuously times out while trying to get the cluster config map from KV:

       2021-03-12T00:53:51.988-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b079d","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
      

      Looking at the trigger point of these timeouts, we notice that around the time issues started memcached closed off quite a number of connections:

      ag --no-number "unrecoverable error encountered" memcached.log | grep "2021-03-12T" | wc -l
           189
      

      2021-03-12T00:19:11.077389-08:00 INFO 118: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
      2021-03-12T00:26:24.773140-08:00 INFO 154: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
      2021-03-12T00:26:41.333501-08:00 INFO 1175: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
      2021-03-12T00:26:52.328011-08:00 INFO 1160: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection
      2021-03-12T00:26:52.328043-08:00 INFO 1179: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection
      

      Corresponding to this timestamp the following crash reports were seen on the local ns_server node. There are two unique crash reports both of which arise during "get_keys" which I believe is when a user tries to view the documents for a bucket via UI:

      1. First form:

       [error_logger:error,2021-03-12T00:19:11.073-08:00,ns_1@172.23.104.16:<0.20507.0>:ale_error_logger_handler:do_log:107]
       =========================CRASH REPORT=========================
         crasher:
           initial call: erlang:apply/2
           pid: <0.20507.0>
           registered_name: []
           exception error: no function clause matching
                            mc_binary:decode_first_key(<<129,184,0,0,0,0,0,0,0,0,0,
                                                         43,0,0,0,0,0,0,0,0,0,0,0,0,
                                                         0,41,13,101,118,101,110,116,
                                                         105,110,103,58,58,49,51,51,
                                                         53,49,49,49,55,57,49,58,...>>) (src/mc_binary.erl, line 281)
             in function  mc_binary:mk_heap_item/3 (src/mc_binary.erl, line 287)
             in call from mc_binary:handle_get_keys_response/4 (src/mc_binary.erl, line 361)
             in call from mc_binary:'-do_get_keys/4-fun-1-'/5 (src/mc_binary.erl, line 465)
             in call from mc_binary:'-get_keys_recv/5-fun-0-'/5 (src/mc_binary.erl, line 326)
             in call from lists:foldl/3 (lists.erl, line 1263)
             in call from mc_binary:get_keys_recv/5 (src/mc_binary.erl, line 322)
             in call from mc_binary:do_get_keys/4 (src/mc_binary.erl, line 462)
           ancestors: ['ns_memcached-metadata',<0.20476.0>,
                         'single_bucket_kv_sup-metadata',ns_bucket_sup,
                         ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                         <0.254.0>,ns_server_cluster_sup,root_sup,<0.140.0>]
      

      2. Second form:

       [error_logger:error,2021-03-12T00:27:03.617-08:00,ns_1@172.23.104.16:<0.5299.24>:ale_error_logger_handler:do_log:107]
       =========================CRASH REPORT=========================
         crasher:
           initial call: erlang:apply/2
           pid: <0.5299.24>
           registered_name: []
           exception error: no function clause matching
                            mc_binary:decode_header(res,
                                                    <<24,2,3,0,16,0,0,0,0,0,0,19,79,
                                                      143,1,0,22,107,138,232,192,126,
                                                      0,0>>) (src/mc_binary.erl, line 205)
             in function  mc_binary:quick_active_recv/3 (src/mc_binary.erl, line 55)
             in call from mc_binary:'-get_keys_recv/5-fun-0-'/5 (src/mc_binary.erl, line 324)
             in call from lists:foldl/3 (lists.erl, line 1263)
             in call from mc_binary:get_keys_recv/5 (src/mc_binary.erl, line 322)
             in call from mc_binary:do_get_keys/4 (src/mc_binary.erl, line 462)
             in call from mc_binary:get_keys/5 (src/mc_binary.erl, line 440)
             in call from ns_memcached:do_handle_call/3 (src/ns_memcached.erl, line 632)
           ancestors: ['ns_memcached-metadata',<0.20476.0>,
                         'single_bucket_kv_sup-metadata',ns_bucket_sup,
                         ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                         <0.254.0>,ns_server_cluster_sup,root_sup,<0.140.0>]
      

      Note the parameters in mc_binary:decode_header for the second crash - The magic being reported is 24 (0x18).

      decode_header - http://src.couchbase.org/source/xref/trunk/ns_server/src/mc_binary.erl#215
      Note here that RES_MAGIC is always mapped to 0x81 by ns_server - http://src.couchbase.org/source/xref/trunk/couchdb/src/couch_audit/src/memcached_calls.erl#27

      While memcached can respond with 0x18 - https://github.com/couchbase/kv_engine/blob/6356c8e84562160f4da062d68617f2a58be600fa/include/mcbp/protocol/magic.h#L35

      This can result in a crash causing ns_memcached closing the connection to the kv_engine.
      This in-turn will result in memcached assuming that connection to ns_server has been terminated.

      Similar issue was also seen earlier - https://issues.couchbase.com/browse/MB-45206 however it seems that this detail regarding ns_server not being able to match the magic was overlooked.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              abhishek.jindal Abhishek Jindal
              abhishek.jindal Abhishek Jindal
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty