  1. Couchbase Server
  2. MB-35001

Rebalance failed with "duplicate item when vbstate is non-active:3" [ETA 2019/7/17]


Details

    • Untriaged
    • No
    • KV-Engine Mad-Hatter Beta

    Description

      Summary

      During rebalance, the following error is seen on the incoming replica, caused by a duplicate item in a Checkpoint:

      2019-07-04T12:13:07.145526+01:00 ERROR 55: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:65383 - 127.0.0.1:11995 (<ud>@ns_server</ud>) ]","engine_storage":"0x00000001067af018","ewouldblock":false,"packet":{"bodylen":154,"cas":1562238787022946304,"datatype":"raw","extlen":33,"keylen":21,"magic":"ClientRequest","opaque":25,"opcode":"DCP_PREPARE","vbucket":3},"refcount":1}] - closing connection ([ 127.0.0.1:65383 - 127.0.0.1:11995 (<ud>@ns_server</ud>) ]): 
      CheckpointManager::queueDirty(vb:3) - got Ckpt::queueDirty() status:failure:duplicate item when vbstate is non-active:3
      

      After local reproduction, the following scenario appears to cause this error. The active node has the following items on disk and in memory (checkpoint manager):

      Disk:
              1:PRE(a), 2:CMT(a), 3:SET(b)
       
      Memory:
                                  3:CKPT_START
                                  3:SET(b),     4:PRE(a), 5:SET(c)
      

      (Items 1..2 were in a closed, removed checkpoint and no longer in-memory.)

      An ep-engine replica attempting to stream all of this (0..infinity) triggers a backfill of items 1..3, with a checkpoint cursor placed at seqno:4. Note that seqno:4 is not the start of the Checkpoint (which is seqno:3), so the cursor does not point at a checkpoint_start item. As a result, when this is streamed over DCP (up to seqno:4), the consumer will see the following (note the flags sent):

      SNAPSHOT_MARKER(start=1, end=3, flags=DISK|CKPT)
      1:PRE(a)
      2:CMT(a)
      3:SET(b)
      SNAPSHOT_MARKER(start=4, end=5, flags=MEM)
      4:PRE(a)
      [[missing seqno 5]]
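
      The split between backfill and in-memory streaming described above can be sketched as a small simulation. This is a toy model, not kv_engine code: the names `Item`, `simulate_stream` and `sent_up_to` are hypothetical, the CKPT_START meta item is omitted, and the backfill is assumed to cover everything on disk, with the cursor placed at the next seqno.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    seqno: int
    op: str   # "PRE", "CMT" or "SET", as in the listings above
    key: str

    def __str__(self):
        return f"{self.seqno}:{self.op}({self.key})"


def simulate_stream(disk, memory, sent_up_to):
    """Return the messages a consumer would see, as strings.

    Assumption: backfill covers every item on disk, and the in-memory
    cursor starts at the first seqno after the backfill range.
    """
    backfill_end = disk[-1].seqno
    cursor = backfill_end + 1

    out = [f"SNAPSHOT_MARKER(start={disk[0].seqno}, end={backfill_end}, flags=DISK|CKPT)"]
    out += [str(item) for item in disk]

    # Items before the cursor were already delivered by the backfill.
    mem = [item for item in memory if item.seqno >= cursor]
    if mem:
        out.append(f"SNAPSHOT_MARKER(start={mem[0].seqno}, end={mem[-1].seqno}, flags=MEM)")
        # Only part of the memory snapshot may have been sent so far.
        out += [str(item) for item in mem if item.seqno <= sent_up_to]
    return out


DISK = [Item(1, "PRE", "a"), Item(2, "CMT", "a"), Item(3, "SET", "b")]
MEMORY = [Item(3, "SET", "b"), Item(4, "PRE", "a"), Item(5, "SET", "c")]

if __name__ == "__main__":
    for line in simulate_stream(DISK, MEMORY, sent_up_to=4):
        print(line)
```

      Because the cursor lands at seqno:4, the memory snapshot begins mid-Checkpoint, and the consumer receives PRE(a) once from disk and once from memory.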
      

      If the consumer puts all of these mutations in the same Checkpoint, that Checkpoint ends up holding two PRE(a) items (seqno 1 and seqno 4), which breaks the Checkpoint invariant.
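
      The collision can be illustrated with a toy consumer. This is a sketch under stated assumptions, not the kv_engine implementation: `ToyCheckpoint`, `ToyReplica`, `DuplicateItemError` and `outcome` are hypothetical names, and the model assumes that a Checkpoint on a non-active vBucket rejects a second item for the same key, with prepares tracked separately from committed items. Opening a new Checkpoint at each snapshot marker is shown only as one way to avoid the collision in this model; it is not claimed to be the fix adopted for this issue.

```python
class DuplicateItemError(Exception):
    pass


class ToyCheckpoint:
    def __init__(self):
        self.items = []
        self._index = set()

    def queue(self, seqno, op, key):
        # Assumption: prepares and committed items occupy separate key slots.
        slot = ("prepare" if op == "PRE" else "committed", key)
        if slot in self._index:
            raise DuplicateItemError(f"duplicate item {op}({key}) at seqno {seqno}")
        self._index.add(slot)
        self.items.append((seqno, op, key))


class ToyReplica:
    def __init__(self, new_checkpoint_per_snapshot):
        self.new_checkpoint_per_snapshot = new_checkpoint_per_snapshot
        self.checkpoints = [ToyCheckpoint()]

    def on_snapshot_marker(self):
        # Optionally close the current Checkpoint when a new snapshot starts.
        if self.new_checkpoint_per_snapshot and self.checkpoints[-1].items:
            self.checkpoints.append(ToyCheckpoint())

    def on_item(self, seqno, op, key):
        self.checkpoints[-1].queue(seqno, op, key)


# The stream from the description: a disk snapshot, then a memory snapshot.
STREAM = [
    "MARKER", (1, "PRE", "a"), (2, "CMT", "a"), (3, "SET", "b"),
    "MARKER", (4, "PRE", "a"),
]


def outcome(new_checkpoint_per_snapshot):
    """Replay STREAM and report ("ok", checkpoint count) or ("failed", message)."""
    replica = ToyReplica(new_checkpoint_per_snapshot)
    try:
        for event in STREAM:
            if event == "MARKER":
                replica.on_snapshot_marker()
            else:
                replica.on_item(*event)
    except DuplicateItemError as err:
        return ("failed", str(err))
    return ("ok", len(replica.checkpoints))


if __name__ == "__main__":
    print(outcome(new_checkpoint_per_snapshot=False))
    print(outcome(new_checkpoint_per_snapshot=True))
```

      In this model, a single shared Checkpoint fails at seqno 4, while separating the disk and memory snapshots into different Checkpoints lets the second PRE(a) be accepted.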

      Steps to Reproduce
      Exact steps TBC, but seen when rebalancing in a node while modifying the same key(s). This should result in an initial Disk snapshot in which some Key is prepared, followed by a Memory snapshot in which the same Key is prepared again.

      Expected Results
      It shouldn't crash - the subsequent "duplicate" Prepare should be accepted by the replica.

      Actual Results
      Above crash seen.

          People

            drigby Dave Rigby (Inactive)