Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37074

[Jepsen] DCP Producer is receiving key "not found" from DCP consumer

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Done
    • 6.5.0
    • 6.5.0
    • couchbase-bucket
    • None
    • Untriaged
    • Unknown

    Description

      During the kv-engine-jepsen-nightly-341 we observed this error message on node 172.28.128.42:

      ./20191126T213349.000-0800/172.28.128.42/cbcollect_info_ns_1@172.28.128.42_20191127-053651/memcached.log:2019-11-27T05:36:00.036338+00:00 ERROR 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - Disconnecting. Received status Not found for op:DCP_PREPARE response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":95,"opcode":"DCP_PREPARE","status":"Not found"}
      

      While performing test:
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=rebalance --scenario=sequential-rebalance-out-in --replicas=2 --no-autofailover --disrupt-count=1 --rate=0 --durability=0:100:0:0 --enable-tcp-capture --hashdump --enable-memcached-debug-log-level

      Jepsen Test Description:

      1. Setup 4 node cluster
      2. Perform read and durability Majority writes (this continues thought out rebalances) over a set of keys treating them as registers
      3. After 10 seconds rebalance one node out by stating its an ejected node for the rebalance (in this case 172.28.128.246)
      4. After 10 seconds rebalance the node back into the cluster by adding it to the cluster and then rebalancing it back in (node the add fails in this jepsen run which then means the rebalance fails )
      5. Wait 5 seconds
      6. Analysis the linaraziablity of the read and writes this should pass

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            The cause of the issue a stream going away due to takeover- this code in particular.

            http://src.couchbase.org/source/xref/mad-hatter/kv_engine/engines/ep/src/dcp/consumer.cc#1681-1686

            1681      auto stream = findStream(vbucket);
            1682      if (!stream || stream->getOpaque() != opaque) {
            1683          // No such stream with the given vbucket / opaque - return KEY_ENOENT
            1684          // to indicate that back to peer.
            1685          return ENGINE_KEY_ENOENT;
            1686      }
            

            Here are the logs from the replica side.

            2019-11-27T05:36:00.009358+00:00 INFO 60: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - (vb:897) Attempting to add takeover stream: opaque_:483, start_seqno_:2588, end_seqno_:18446744073709551615, vb_uuid:137925520613486, snap_start_seqno_:2588, snap_end_seqno_:2588, last_seqno:2588, stream_req_value:{"uid":"0"}
            2019-11-27T05:36:00.013473+00:00 INFO 60: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - (vb:897) End stream received with opaque 95 but current opaque for that vb is 483 - ignoring
            

            Logs from the active side - note opaque 95.

            2019-11-27T05:36:00.036338+00:00 ERROR 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - Disconnecting. Received status Not found for op:DCP_PREPARE response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":95,"opcode":"DCP_PREPARE","status":"Not found"}
            

            This causes us to disconnect which it should not as we did nothing wrong. Need to work out how to fix, problem is we sometimes do want to disconnect if consumer response with ENGINE_KEY_ENOENT for a commit or abort.

            ben.huddleston Ben Huddleston added a comment - The cause of the issue a stream going away due to takeover- this code in particular. http://src.couchbase.org/source/xref/mad-hatter/kv_engine/engines/ep/src/dcp/consumer.cc#1681-1686 1681 auto stream = findStream(vbucket); 1682 if (!stream || stream->getOpaque() != opaque) { 1683 // No such stream with the given vbucket / opaque - return KEY_ENOENT 1684 // to indicate that back to peer. 1685 return ENGINE_KEY_ENOENT; 1686 } Here are the logs from the replica side. 2019-11-27T05:36:00.009358+00:00 INFO 60: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - (vb:897) Attempting to add takeover stream: opaque_:483, start_seqno_:2588, end_seqno_:18446744073709551615, vb_uuid:137925520613486, snap_start_seqno_:2588, snap_end_seqno_:2588, last_seqno:2588, stream_req_value:{"uid":"0"} 2019-11-27T05:36:00.013473+00:00 INFO 60: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - (vb:897) End stream received with opaque 95 but current opaque for that vb is 483 - ignoring Logs from the active side - note opaque 95. 2019-11-27T05:36:00.036338+00:00 ERROR 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.28.128.42->ns_1@172.28.128.245:default - Disconnecting. Received status Not found for op:DCP_PREPARE response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":95,"opcode":"DCP_PREPARE","status":"Not found"} This causes us to disconnect which it should not as we did nothing wrong. Need to work out how to fix, problem is we sometimes do want to disconnect if consumer response with ENGINE_KEY_ENOENT for a commit or abort.
            owend Daniel Owen added a comment -

            Marking as a Done as part of MB-36971. Because issue was introduced with patch http://review.couchbase.org/#/c/118214/ and hence the same MB contains a subsequent fix.

            owend Daniel Owen added a comment - Marking as a Done as part of MB-36971 . Because issue was introduced with patch http://review.couchbase.org/#/c/118214/ and hence the same MB contains a subsequent fix.

            People

              ben.huddleston Ben Huddleston
              richard.demellow Richard deMellow
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty