Couchbase Server: MB-45599

Corrupt mcbp response header from get_all_keys


Details


    Description

      Logs: https://supportal.couchbase.com/snapshot/a5c987add8e2ba7de815826265fca1e4::1
      Affected node: 172.23.104.16

      While debugging an Eventing case, we observed an issue where the libcouchbase instance used by Eventing continuously times out while trying to get the cluster config map from KV:

       2021-03-12T00:53:51.988-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b079d","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
      

      Looking at the trigger point of these timeouts, we noticed that around the time the issues started, memcached closed quite a number of connections:

      ag --no-number "unrecoverable error encountered" memcached.log | grep "2021-03-12T" | wc -l
           189
      

      2021-03-12T00:19:11.077389-08:00 INFO 118: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
      2021-03-12T00:26:24.773140-08:00 INFO 154: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
      2021-03-12T00:26:41.333501-08:00 INFO 1175: Connection::on_event: unrecoverable error encountered: ["writing","error"], shutting down connection
      2021-03-12T00:26:52.328011-08:00 INFO 1160: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection
      2021-03-12T00:26:52.328043-08:00 INFO 1179: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection
      

      Corresponding to these timestamps, the following crash reports were seen on the local ns_server node. There are two unique crash reports, both of which arise during "get_keys", which I believe is triggered when a user views the documents of a bucket via the UI:

      1. First form:

       [error_logger:error,2021-03-12T00:19:11.073-08:00,ns_1@172.23.104.16:<0.20507.0>:ale_error_logger_handler:do_log:107]
       =========================CRASH REPORT=========================
         crasher:
           initial call: erlang:apply/2
           pid: <0.20507.0>
           registered_name: []
           exception error: no function clause matching
                            mc_binary:decode_first_key(<<129,184,0,0,0,0,0,0,0,0,0,
                                                         43,0,0,0,0,0,0,0,0,0,0,0,0,
                                                         0,41,13,101,118,101,110,116,
                                                         105,110,103,58,58,49,51,51,
                                                         53,49,49,49,55,57,49,58,...>>) (src/mc_binary.erl, line 281)
             in function  mc_binary:mk_heap_item/3 (src/mc_binary.erl, line 287)
             in call from mc_binary:handle_get_keys_response/4 (src/mc_binary.erl, line 361)
             in call from mc_binary:'-do_get_keys/4-fun-1-'/5 (src/mc_binary.erl, line 465)
             in call from mc_binary:'-get_keys_recv/5-fun-0-'/5 (src/mc_binary.erl, line 326)
             in call from lists:foldl/3 (lists.erl, line 1263)
             in call from mc_binary:get_keys_recv/5 (src/mc_binary.erl, line 322)
             in call from mc_binary:do_get_keys/4 (src/mc_binary.erl, line 462)
           ancestors: ['ns_memcached-metadata',<0.20476.0>,
                         'single_bucket_kv_sup-metadata',ns_bucket_sup,
                         ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                         <0.254.0>,ns_server_cluster_sup,root_sup,<0.140.0>]
      

      2. Second form:

       [error_logger:error,2021-03-12T00:27:03.617-08:00,ns_1@172.23.104.16:<0.5299.24>:ale_error_logger_handler:do_log:107]
       =========================CRASH REPORT=========================
         crasher:
           initial call: erlang:apply/2
           pid: <0.5299.24>
           registered_name: []
           exception error: no function clause matching
                            mc_binary:decode_header(res,
                                                    <<24,2,3,0,16,0,0,0,0,0,0,19,79,
                                                      143,1,0,22,107,138,232,192,126,
                                                      0,0>>) (src/mc_binary.erl, line 205)
             in function  mc_binary:quick_active_recv/3 (src/mc_binary.erl, line 55)
             in call from mc_binary:'-get_keys_recv/5-fun-0-'/5 (src/mc_binary.erl, line 324)
             in call from lists:foldl/3 (lists.erl, line 1263)
             in call from mc_binary:get_keys_recv/5 (src/mc_binary.erl, line 322)
             in call from mc_binary:do_get_keys/4 (src/mc_binary.erl, line 462)
             in call from mc_binary:get_keys/5 (src/mc_binary.erl, line 440)
             in call from ns_memcached:do_handle_call/3 (src/ns_memcached.erl, line 632)
           ancestors: ['ns_memcached-metadata',<0.20476.0>,
                         'single_bucket_kv_sup-metadata',ns_bucket_sup,
                         ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                         <0.254.0>,ns_server_cluster_sup,root_sup,<0.140.0>]
      

      Note the arguments to mc_binary:decode_header in the second crash: the magic byte being reported is 24 (0x18).

      decode_header - http://src.couchbase.org/source/xref/trunk/ns_server/src/mc_binary.erl#215
      Note here that RES_MAGIC is always mapped to 0x81 by ns_server - http://src.couchbase.org/source/xref/trunk/couchdb/src/couch_audit/src/memcached_calls.erl#27

      Memcached, however, can respond with 0x18 - https://github.com/couchbase/kv_engine/blob/6356c8e84562160f4da062d68617f2a58be600fa/include/mcbp/protocol/magic.h#L35
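
      For illustration, a minimal C++ sketch (not the actual ns_server or kv_engine code) of decoding a 24-byte mcbp response header that accepts both magics, with the field layout taken from kv_engine's docs/BinaryProtocol.md. ns_server's decode_header only handles the 0x81 layout, so a 0x18 header like the one in the second crash falls through every function clause:

        #include <cstdint>
        #include <optional>

        // Sketch only: response-header layout per docs/BinaryProtocol.md.
        struct ResponseHeader {
            uint8_t magic;          // 0x81 (classic) or 0x18 (alternative/flex)
            uint8_t opcode;
            uint8_t framingExtras;  // only present for magic 0x18
            uint16_t keyLen;
            uint8_t extrasLen;
            uint8_t datatype;
            uint16_t status;
            uint32_t totalBodyLen;
            uint32_t opaque;
            uint64_t cas;
        };

        static uint16_t be16(const uint8_t* p) { return uint16_t((p[0] << 8) | p[1]); }
        static uint32_t be32(const uint8_t* p) {
            return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                   (uint32_t(p[2]) << 8) | uint32_t(p[3]);
        }

        std::optional<ResponseHeader> decodeResponseHeader(const uint8_t (&raw)[24]) {
            ResponseHeader h{};
            h.magic = raw[0];
            h.opcode = raw[1];
            if (h.magic == 0x81) {
                // Classic response: bytes 2-3 hold a 16-bit key length.
                h.framingExtras = 0;
                h.keyLen = be16(raw + 2);
            } else if (h.magic == 0x18) {
                // Alternative response: byte 2 is the framing-extras length,
                // byte 3 is an 8-bit key length.
                h.framingExtras = raw[2];
                h.keyLen = raw[3];
            } else {
                return std::nullopt; // unknown magic: neither 0x81 nor 0x18
            }
            h.extrasLen = raw[4];
            h.datatype = raw[5];
            h.status = be16(raw + 6);
            h.totalBodyLen = be32(raw + 8);
            h.opaque = be32(raw + 12);
            h.cas = (uint64_t(be32(raw + 16)) << 32) | be32(raw + 20);
            return h;
        }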

      This mismatch results in a crash, causing ns_memcached to close its connection to kv_engine.
      This in turn leads memcached to assume that its connection to ns_server has been terminated.

      A similar issue was seen earlier - https://issues.couchbase.com/browse/MB-45206 - however it seems that this detail about ns_server not being able to match the magic was overlooked.


          Activity

            dfinlay Dave Finlay added a comment -

            Hey Dan,

            Can you take a look at this? Is it the case that a response with flexible framing extras is always returned if tracing is enabled, even if the client doesn't handle it? It seems like there should probably be a HELO flag for this, but I didn't have time to look. Can you check?

            drigby Dave Rigby added a comment -

            Dave Finlay: Certainly, if the client negotiates HELO::Tracing then yes, it should expect to receive timing information encoded using the "alternative" response magic of 0x18 - see https://github.com/couchbase/kv_engine/blob/master/docs/BinaryProtocol.md#response-header

            Has ns_server recently started negotiating tracing?

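            For context, a rough sketch of what negotiating Tracing over HELO looks like on the wire. The opcode (0x1f) and feature code (0x0f = Tracing) below are taken from kv_engine's BinaryProtocol.md as I read it and should be treated as assumptions to verify; the point is simply that a client that never advertises Tracing should only ever see 0x81 responses:

              #include <cstdint>
              #include <string>
              #include <vector>

              // Sketch only: build a HELO request advertising a set of features.
              // Assumed constants (verify against docs/BinaryProtocol.md):
              //   0x80 = request magic, 0x1f = HELO opcode, 0x0f = Tracing.
              std::vector<uint8_t> buildHelo(const std::string& agent,
                                             const std::vector<uint16_t>& features) {
                  const uint16_t keyLen = uint16_t(agent.size());
                  const uint32_t bodyLen = keyLen + uint32_t(features.size() * 2);

                  std::vector<uint8_t> packet(24, 0);
                  packet[0] = 0x80;                   // request magic
                  packet[1] = 0x1f;                   // HELO opcode
                  packet[2] = uint8_t(keyLen >> 8);   // key length, big-endian
                  packet[3] = uint8_t(keyLen & 0xff);
                  packet[8] = uint8_t(bodyLen >> 24); // total body length
                  packet[9] = uint8_t(bodyLen >> 16);
                  packet[10] = uint8_t(bodyLen >> 8);
                  packet[11] = uint8_t(bodyLen & 0xff);

                  // Key: agent string. Value: 16-bit feature codes, big-endian.
                  packet.insert(packet.end(), agent.begin(), agent.end());
                  for (uint16_t f : features) {
                      packet.push_back(uint8_t(f >> 8));
                      packet.push_back(uint8_t(f & 0xff));
                  }
                  return packet;
              }

              // e.g. buildHelo("ns_server", {0x0f /* Tracing (assumed code) */});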
            dfinlay Dave Finlay added a comment -

            I'm pretty sure ns_server hasn't negotiated tracing. Abhijeeth Nuthan: can you check to be sure?

            owend Daniel Owen added a comment -

            Hey Trond Norbye, could you take a look from the memcached side?
            My code examination suggests that we should not return the alternative response magic unless it was negotiated,
            but looking through the logs I do not see ns_server HELO'ing with Tracing.

            dfinlay Dave Finlay added a comment -

            Thanks Dan; I'll assign to Trond and set the component to memcached for the time being. If it looks to be an ns_server issue, please reassign.


            Abhijeeth.Nuthan Abhijeeth Nuthan added a comment -

            Dave Finlay: Confirming that we don't negotiate tracing on either the replication connections or the control connection to memcached.

            trond Trond Norbye added a comment -

            The problem is that ep-engine tries to send data over the connection from one of its background threads rather than from the frontend thread, so we end up racing on the thread-local buffer.
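
            To make the failure mode concrete, an illustrative C++ sketch (hypothetical names, not the actual kv_engine SendResponse code; the exact buffer ownership isn't spelled out in this ticket). The contrast is between a scratch buffer that outlives the call, which two senders on the same connection can race on, and a per-call stack buffer, which is what the fix below switches to:

              #include <array>
              #include <cstddef>
              #include <cstdint>
              #include <cstring>

              namespace sketch {

              // Racy pattern: the header is copied into a scratch buffer that is
              // shared beyond a single call. If a background (ep-engine) thread
              // sends a response on the same connection while the frontend thread
              // is still using the buffer, the writes interleave and a partially
              // overwritten header (e.g. a wrong magic byte) can reach the socket.
              std::array<uint8_t, 24> sharedScratch;

              void sendResponseRacy(int fd, const uint8_t* header, std::size_t len) {
                  std::memcpy(sharedScratch.data(), header, len);
                  // write(fd, sharedScratch.data(), len);  // may race with another sender
                  (void)fd;
              }

              // Fixed pattern (per "Use stack buffer in SendResponse"): each call
              // formats into its own stack buffer, so concurrent senders can no
              // longer scribble over each other's half-built header.
              void sendResponseFixed(int fd, const uint8_t* header, std::size_t len) {
                  std::array<uint8_t, 24> stackBuffer{};
                  std::memcpy(stackBuffer.data(), header, len);
                  // write(fd, stackBuffer.data(), len);
                  (void)fd;
              }

              } // namespace sketch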

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4938 contains kv_engine commit a16d072 with commit message:
            MB-45599: Use stack buffer in SendResponse
