Couchbase Server / MB-45673

DCP Stream Request returns EINVAL for UID body

Details

    • Bug
    • Resolution: Not a Bug
    • Blocker
    • None
    • Cheshire-Cat
    • couchbase-bucket
    • None
    • Triaged
    • 1
    • Yes

    Description

      Since collections were introduced, the DCP documentation for stream-request has stated that, when resuming a stream, a consumer should provide a value:

      Stream-request resumption must also include a collection's manifest-UID in the value

      The value spec states that a uid can be provided:

      uid can be set by the client when they are resuming a stream, the value should be the uid they last observed from a collection's DCP System event.

      This is how XDCR performs stream requests when resuming the main pipeline.

      ND: vbno 434 opaque 65970 bodyLen: 11 mcReqBody: 00000000  7b 22 75 69 64 22 3a 22  37 22 7d                 |{"uid":"7"}|
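
      For reference, here is a minimal sketch of how a body like the one in the log line above could be built and dumped. It is not XDCR's actual code: `stream_request_value` and `hexdump` are hypothetical helper names, and encoding the uid as a hex string without a prefix is an assumption based on the `"7"` seen in the log.

```python
import json


def stream_request_value(manifest_uid: int) -> bytes:
    """Hypothetical helper: encode the resume value as compact JSON."""
    # Assumption: the uid is sent as a hex string with no "0x" prefix.
    uid = format(manifest_uid, "x")
    # Compact separators avoid the spaces json.dumps adds by default.
    return json.dumps({"uid": uid}, separators=(",", ":")).encode("ascii")


def hexdump(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex, like the log line."""
    return " ".join(f"{b:02x}" for b in data)


body = stream_request_value(7)
print(len(body), hexdump(body))
# prints: 11 7b 22 75 69 64 22 3a 22 37 22 7d
```

      The 11-byte length agrees with the `bodyLen: 11` reported in the log.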
      

      I understand that with quorum failover the manifest UID by itself is not sufficient, that a chronicle ID will most likely be used instead, and that KV is moving in that direction.
      I also see that this commit was recently checked in as part of addressing the chronicle issue (MB-45505):
      https://github.com/couchbase/kv_engine/commit/7efc1df4d9d8619e4e65c53766529f16e8d10994

      However, this commit essentially breaks the currently documented DCP protocol, at least for XDCR as a DCP consumer. (Reverting the commit locally in my dev env and recompiling resolves this issue.)

      What happens is that when XDCR replication is paused and resumed, XDCR faithfully sends the StreamReq with the UID body shown above. Note that this behavior has been in place since XDCR first implemented the collections stream request.
      As a result of the commit, memcached is unable to parse the body of the stream request and returns EINVAL.
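
      To make the failing request concrete, the sketch below frames a stream request carrying that body. It assumes the usual memcached binary layout: a 24-byte request header with opcode 0x53, followed by 48 bytes of extras and then the value. `build_stream_request` is a hypothetical helper, the datatype byte is left at 0 as an assumption, and the seqno and vbucket UUID values are placeholders rather than values taken from the log.

```python
import struct

REQ_MAGIC = 0x80        # client request magic
DCP_STREAM_REQ = 0x53   # DCP stream request opcode


def build_stream_request(vbucket, opaque, start_seqno, end_seqno,
                         vb_uuid, snap_start, snap_end, value=b""):
    """Hypothetical helper: frame a stream request with an optional value."""
    # Extras: flags, reserved, start seqno, end seqno, vbucket uuid,
    # snapshot start, snapshot end (4 + 4 + 5 * 8 = 48 bytes).
    extras = struct.pack(">IIQQQQQ", 0, 0, start_seqno, end_seqno,
                         vb_uuid, snap_start, snap_end)
    # Header: magic, opcode, key length, extras length, datatype (assumed 0),
    # vbucket, total body length, opaque, cas (24 bytes).
    header = struct.pack(">BBHBBHIIQ", REQ_MAGIC, DCP_STREAM_REQ, 0,
                         len(extras), 0, vbucket,
                         len(extras) + len(value), opaque, 0)
    return header + extras + value


value = b'{"uid":"7"}'  # the body from the log line
packet = build_stream_request(vbucket=434, opaque=65970,
                              start_seqno=0, end_seqno=2**64 - 1,  # placeholders
                              vb_uuid=0, snap_start=0, snap_end=0,
                              value=value)
print(len(packet))  # 24 + 48 + 11 = 83
```

      Under these assumptions, the only part of the packet that the commit affects is the trailing value; the header and extras are the same as for a request with no body.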

      This leaves XDCR unable to resume replication, so it cannot replicate any more data.

      I want to reiterate that I understand that we’re moving towards chronicle ID… especially given that MANIFEST_AHEAD is no longer relevant in StreamReq cases. But at this stage, this commit seems to cause functional breakage. The goal here should be to at least unblock XDCR's ability to resume replication before the weekly build gets picked up.

      IMO there are two ways to go about unblocking this:
      1. Ask XDCR to change the way it resumes - It’s an acceptable answer, but I’d argue that it is riskier: the goal is to resolve this issue before the Thursday PST build for internal testing, and this option would require changes to the current XDCR <-> DCP communications.
      2. Revert this changeset - I’d like to propose this solution because, according to the commit message, the commit was made in anticipation of quorum failover work and doesn’t seem to fix anything urgent.

            People

              neil.huang Neil Huang
              neil.huang Neil Huang