Couchbase Server / MB-38390

DCP Deletion V2 packets are sent regardless of whether the DELETE_TIMES feature is negotiated


    Details

    • Triage:
      Untriaged
    • Is this a Regression?:
      Yes

      Description

      Steps to Reproduce

      1. Create a single node cluster of Couchbase Server 5.0.1
      2. Add a Couchbase Server 6.5.0 node and rebalance
      3. Create a bucket, my example uses test
      4. Create a document that resides on the 6.5.0 node; this command works if you followed the steps above:

        /opt/couchbase/bin/cbc create -U "couchbase://localhost/test" -u Administrator -P password testing1234 -V test
        

      5. Delete the document that was just created:

        /opt/couchbase/bin/cbc rm -U "couchbase://localhost/test" -u Administrator -P password testing1234
        

      6. Add a new 6.5.0 node and mark the other 6.5.0 node for removal (so that you guarantee the document will move from one 6.5.x node to the other).
      7. Rebalance

      Expected Result
      Rebalance succeeds

      Actual Result

      Rebalance fails due to the error:

      Rebalance exited with reason {mover_crashed,
                                    {unexpected_exit,
                                     {'EXIT',<0.21330.0>,
                                      {{bulk_set_vbucket_state_failed,
                                        [{'ns_1@10.143.171.103',
                                          {'EXIT',
                                           {{{{{child_interrupted,
                                                {'EXIT',<24454.7397.0>,
                                                 socket_closed}},
      

      All subsequent attempts to rebalance also fail.

      Analysis

      This happens because DCP deletion packets sent between the two 6.5.0 nodes are being rejected by the consumer:

      2020-03-22T18:26:15.302067-07:00 WARNING 58: Invalid format specified for "DCP_DELETION" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":32,"cas":1584926732193038336,"datatype":"raw","extlen":21,"keylen":11,"magic":"ClientRequest","opaque":13,"opcode":"DCP_DELETION","vbucket":739}] Reason:"Request must include extras of length 18"
      

      This then closes the DCP connection and ultimately leads to the rebalance failure.
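      The consumer-side length check that produces this rejection can be sketched as follows. This is a minimal illustration, not kv_engine's actual code; the function name and error-string construction are assumptions based on the log above.

```python
# Minimal sketch of the consumer-side extras-length check that produces the
# rejection in the log. Not kv_engine's actual code; names are illustrative.

# Expected extras sizes per the kv_engine DCP deletion documentation:
V1_EXTLEN = 18  # by_seqno(8) + rev_seqno(8) + nmeta(2)
V2_EXTLEN = 21  # by_seqno(8) + rev_seqno(8) + delete_time(4) + 1 trailing byte

def validate_dcp_deletion(extlen, delete_times_negotiated):
    """Return an error string if the packet's extras length does not match
    the format negotiated on this connection, else None."""
    expected = V2_EXTLEN if delete_times_negotiated else V1_EXTLEN
    if extlen != expected:
        return "Request must include extras of length %d" % expected
    return None

# The failing packet: extlen 21 on a connection where DELETE_TIMES was
# never negotiated, reproducing the error message seen in the log.
print(validate_dcp_deletion(21, delete_times_negotiated=False))
```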

      The invalid DCP deletion packet is for the following doc:

      Doc seq: 2
           id: (collection:0x0:default) testing1234
           rev: 2
           content_meta: 3
           size (on disk): 0
           cas: 1584926732193038336, expiry: 1584926732, flags: 0, datatype: 0x00 (raw)
           doc deleted (explicit)
      

      DELETE_TIMES has not been negotiated as part of the DCP stream:

      2020-03-22T18:26:15.108609-07:00 INFO (test) EventuallyPersistentEngine::dcpOpen: opening new DCP Consumer handler - stream name:replication:ns_1@10.143.171.102->ns_1@10.143.171.103:test, opaque:0, seqno:0, flags:0b0000000000000000000000000000100 value:null
      2020-03-22T18:26:15.108615-07:00 INFO 58: DCP connection opened successfully. INCLUDE_XATTRS [ 127.0.0.1:36102 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
      

      This is not negotiated because there is still a 5.0.x node in the cluster that is not DELETE_TIMES aware. However, a packet capture (MB-38390.pcap) shows that the 6.5.0 nodes are sending DCP_DELETION_V2 packets despite the feature not being negotiated:

      Couchbase Protocol, DCP (Key) Deletion Request, Opcode: 0x58, VBucket: 0x2e3
          Magic: Request (0x80)
          Opcode: DCP (Key) Deletion (0x58)
          Key Length: 11
          Extras Length: 21
          Data Type: 0x00
          VBucket: 739 (0x02e3)
          [Value Length: 0]
          Total Body Length: 32
          Opaque: 0x0000002a
          CAS: 0x15fe19cdc8b10000
          Extras
          Key: testing1234
      

      This is rejected by the consumer, which expects an extras length of 18 rather than 21, as covered in https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/commands/deletion.md.
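      The 18-versus-21 byte difference follows directly from the two extras layouts described in that document. A quick sketch of the packing (field names taken from deletion.md; the code itself is purely illustrative, not kv_engine code):

```python
import struct

# Illustrative packing of the two DCP deletion extras layouts, showing where
# the 18 vs 21 byte difference comes from. Field layout per the kv_engine
# deletion.md doc; this is not kv_engine code.

def v1_extras(by_seqno, rev_seqno, nmeta=0):
    # DCP_DELETION (v1): by_seqno(8) + rev_seqno(8) + nmeta(2) = 18 bytes
    return struct.pack(">QQH", by_seqno, rev_seqno, nmeta)

def v2_extras(by_seqno, rev_seqno, delete_time, collection_len=0):
    # DCP_DELETION with delete times (v2): by_seqno(8) + rev_seqno(8)
    # + delete_time(4) + collection_len(1) = 21 bytes
    return struct.pack(">QQIB", by_seqno, rev_seqno, delete_time, collection_len)

# Using the seqno and expiry values from the doc dump above:
print(len(v1_extras(2, 2)))              # 18: what the consumer expected
print(len(v2_extras(2, 2, 1584926732)))  # 21: what the producer sent
```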

      Note: I haven't attached cbcollect_infos as the issue is readily reproducible 100% of the time by following the steps above; I can generate them if needed.
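      The eventual fix (kv_engine commit f612fb6, "MB-38390: Make enable_expiry_opcode respect includeDeleteTime", per the build-team comments below) amounts to keying the deletion format off what the connection negotiated rather than what the producer itself supports. A hypothetical before/after sketch; the real kv_engine code differs and these names are invented for illustration:

```python
# Hypothetical sketch of the producer-side decision the fix corrects.
# The real kv_engine code differs; names here are illustrative only.

def choose_deletion_format_buggy(producer_supports_delete_times,
                                 connection_negotiated_delete_times):
    # Pre-fix: keyed off the producer's own capability, so a 6.5.0 producer
    # emitted the 21-byte V2 extras even to a consumer that negotiated V1.
    return "v2" if producer_supports_delete_times else "v1"

def choose_deletion_format_fixed(producer_supports_delete_times,
                                 connection_negotiated_delete_times):
    # Post-fix: respect the per-connection DELETE_TIMES negotiation.
    return "v2" if connection_negotiated_delete_times else "v1"

# The failing scenario: the producer supports delete times, but the consumer
# never negotiated them (a 5.0.x node is still in the cluster).
print(choose_deletion_format_buggy(True, False))  # "v2" -> rejected
print(choose_deletion_format_fixed(True, False))  # "v1" -> accepted
```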

      Workaround
      This has the net effect that any cluster containing deleted documents cannot be directly upgraded online from Couchbase Server <5.5.0 to Couchbase Server 6.5.0.
      The workaround is to upgrade to an intermediate version first, such as Couchbase Server 6.0.4, which in testing does not show the same issue.



          Activity

          thuan Thuan Nguyen added a comment -

          I tried several ways to upgrade from 5.x.x to 6.5.0 (online, and failover, with a minimum 3-node cluster), with each vbucket having at least one deleted item, and was able to reproduce this issue instantly.
          One way to upgrade directly from 5.x.x to 6.5.0 (in this case, where any 6.5.0 node has deleted items in its vbuckets) is an offline upgrade. After an offline upgrade, I am able to add and remove 6.5.0 nodes.

          thuan Thuan Nguyen added a comment -

          Verified that build 6.5.1-6248 fixes this issue. I tried swap and incremental online upgrades with a 4-node cluster.
          Also created ticket CBQE-5513 to add this scenario to the upgrade test suite.

          james.harrison James Harrison added a comment -

          Sorry for pinging a closed ticket, but I just wanted to note, Thuan Nguyen, that an offline upgrade to 6.5.0 also has caveats. If a spock/vulcan/alice node has written an access log before the upgrade to 6.5.0, the node will crash on warmup because of MB-38327. The workaround is to delete the access log.

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-1703 contains kv_engine commit f612fb6 with commit message:
          MB-38390: Make enable_expiry_opcode respect includeDeleteTime

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.6.0-7519 contains kv_engine commit f612fb6 with commit message:
          MB-38390: Make enable_expiry_opcode respect includeDeleteTime


            People

            Assignee:
            thuan Thuan Nguyen
            Reporter:
            matt.carabine Matt Carabine
