Couchbase Server
MB-42780

[Upgrade] Rebalance_in failed with reason "bulk_set_vbucket_state_failed :: sync_shutdown_many_i_am_trapping_exits"


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.6.0
    • Fix Version/s: 6.6.1
    • Component/s: couchbase-bucket
    • Environment:
      Enterprise Edition 6.0.1 build 2037 (Existing 4 node cluster)
      Enterprise Edition 6.6.1 build 9182 (New node coming in)

      Description

       Scenario:

      1. 4-node cluster (6.0.1 build 2037)
      2. Create a couchbase bucket with replica=1, size=100M
      3. Load the bucket into DGM using cbworkloadgen:

        ./cbworkloadgen -n 172.23.105.155:8091 -b default -u Administrator -p password --max-items=1200000 -r .95 -l

      4. Rebalance_in the 6.6.1-9182 node (172.23.105.244) into the cluster

      Observation:

      Rebalance failed with the following reason:

      Worker <0.3486.0> (for action {move,{577,
      ['ns_1@172.23.105.212',
      'ns_1@172.23.105.155'],
      ['ns_1@172.23.105.155',
      'ns_1@172.23.105.244'],
      []}}) exited with reason {unexpected_exit,
      {'EXIT', <0.3512.0>,
      {{bulk_set_vbucket_state_failed,
      [{'ns_1@172.23.105.244',
      {'EXIT',
      {{{{{badmatch,
      [{<0.3659.0>,
      {done, exit, {socket_closed,
      {gen_server, call, [<0.2333.0>,
      {setup_streams, [83, 339]}, infinity]}},
      [{gen_server, call, 3,
      [{file, "gen_server.erl"}, {line, 214}]},
      {dcp_replicator, '-spawn_and_wait/1-fun-0-', 1,
      [{file, "src/dcp_replicator.erl"}, {line, 243}]}]}}]},
      [{misc, sync_shutdown_many_i_am_trapping_exits, 1, 
      [{file, "src/misc.erl"}, {line, 1374}]},
      {dcp_replicator, spawn_and_wait, 1,
      [{file, "src/dcp_replicator.erl"}, {line, 265}]},
      {dcp_replicator, handle_call, 3,
      [{file, "src/dcp_replicator.erl"}, {line, 121}]},
      {gen_server, try_handle_call, 4,
      [{file, "gen_server.erl"}, {line, 636}]},
      {gen_server, handle_msg, 6,
      [{file, "gen_server.erl"}, {line, 665}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [<0.2332.0>, get_partitions, infinity]}},
      {gen_server, call,
      ['dcp_replication_manager-default',
      {get_replicator_pid, 83}, infinity]}},
      {gen_server, call,
      [{'janitor_agent-default', 'ns_1@172.23.105.244'},
      {if_rebalance, <0.2246.0>,
      {update_vbucket_state, 577, replica,
      undefined, 'ns_1@172.23.105.212'}}, infinity]}}}}]},
      [{janitor_agent, bulk_set_vbucket_state, 4,
      [{file, "src/janitor_agent.erl"}, {line, 403}]},
      {proc_lib, init_p,3,
      [{file, "proc_lib.erl"}, {line, 232}]}]}}} 

      Note: Hit this while trying to validate MB-41283

      CC Richard deMellow

       


            Activity

            ashwin.govindarajulu Ashwin Govindarajulu created issue -
            paolo.cocchi Paolo Cocchi made changes -
            Assignee: Daniel Owen [ owend ] -> Paolo Cocchi [ paolo.cocchi ]
            paolo.cocchi Paolo Cocchi added a comment - - edited

            ns_server.info.log:

            [ns_server:info,2020-11-16T22:39:20.252-08:00,ns_1@172.23.105.155:janitor_agent-default<0.14362.0>:janitor_agent:handle_info:801]Rebalancer <13742.2246.0> died with reason {unexpected_
            exit,
                                                        {'EXIT',<13742.3512.0>,
                                                         {{bulk_set_vbucket_state_failed,
                                                           [{'ns_1@172.23.105.244',
                                                             {'EXIT',
                                                              {{{{{badmatch,
                                                                   [{<13742.3659.0>,
                                                                     {done,exit,
                                                                      {socket_closed,
            

            memcached error on node .244 (which is the only 6.6.1 node), replica connection being closed:

            2020-11-16T22:39:19.828037-08:00 INFO 53: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.155->ns_1@172.23.105.244:default - (vb:83) Attempting to add stream: opaque_:19, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:267978355066003, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
            ..
            2020-11-16T22:39:19.858135-08:00 ERROR 53: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":43312} - {\"ip\":\"127.0.0.1\",\"port\":11209} (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f4283feac10","ewouldblock":false,"packet":{"bodylen":52,"cas":1605595159858380800,"datatype":"raw","extlen":31,"key":"<ud>pymc1033629</ud>","keylen":11,"magic":"ClientRequest","opaque":19,"opcode":"DCP_MUTATION","vbucket":83},"refcount":1}] - closing connection ([ {"ip":"127.0.0.1","port":43312} - {"ip":"127.0.0.1","port":11209} (<ud>@ns_server</ud>) ]): Checkpoint::queueDirty: Unable to find key in keyIndex with op:mutation seqno:11190for cursor:persistence in current checkpoint.
            

            The producer is node .155 (6.0.1), which is sending a disk snapshot:

            2020-11-16T22:39:19.831717-08:00 INFO 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.155->ns_1@172.23.105.244:default - (vb 83) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615
            2020-11-16T22:39:19.831763-08:00 INFO 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.155->ns_1@172.23.105.244:default - (vb 83) Scheduling backfill from 1 to 12024, reschedule flag : False
            2020-11-16T22:39:19.832021-08:00 INFO 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.155->ns_1@172.23.105.244:default - (vb 83) Sending disk snapshot with start seqno 0 and end seqno 12106
            

            paolo.cocchi Paolo Cocchi added a comment - - edited

            I am keeping an eye on the motivation behind MB-41283. As per the fix description at http://review.couchbase.org/c/kv_engine/+/135630, the issue involves checkpoint marker flags. It may be useful to highlight that from 6.0.x to 6.5.x we have fixed/changed something in the area of the marker flags that are sent from active to replica, which may relate to this issue.

            paolo.cocchi Paolo Cocchi added a comment - - edited

            We fail in:

            int64_t Checkpoint::getMutationId(const CheckpointCursor& cursor) const {
            ..
                auto& keyIndex = (*cursor.currentPos)->isCommitted() ? committedKeyIndex
                                                                     : preparedKeyIndex;
                auto cursor_item_idx = keyIndex.find(makeIndexKey(*cursor.currentPos));
                if (cursor_item_idx == keyIndex.end()) {
                    throw std::logic_error(
                            "Checkpoint::queueDirty: Unable "
                            "to find key in keyIndex with op:" +
                            to_string((*cursor.currentPos)->getOperation()) +
                            " seqno:" + std::to_string((*cursor.currentPos)->getBySeqno()) +
                            " for cursor:" + cursor.name + " in current checkpoint.");
                }
                return cursor_item_idx->second.mutation_id;
            }
            

            Called by:

            QueueDirtyStatus Checkpoint::queueDirty(const queued_item& qi,
                                                    CheckpointManager* checkpointManager) {
            ..
                if (qi->isCheckPointMetaItem()) {
                    ..
                } else {
                    // Check in the appropriate key index if an item already exists.
                    auto& keyIndex =
                            qi->isCommitted() ? committedKeyIndex : preparedKeyIndex;
                    auto it = keyIndex.find(makeIndexKey(qi));
                    ..
                    // Check if this checkpoint already has an item for the same key
                    // and the item has not been expelled.
                    if (it != keyIndex.end()) {
                        if (it->second.mutation_id > highestExpelledSeqno) {
                            // Normal path - we haven't expelled the item. We have a valid
                            // cursor position to read the item and make our de-dupe checks.
                            const auto currPos = it->second.position;
                            if (!(canDedup(*currPos, qi))) {
                                return QueueDirtyStatus::FailureDuplicateItem;
                            }
             
                            rv = QueueDirtyStatus::SuccessExistingItem;
                            const int64_t currMutationId{it->second.mutation_id};
             
                            // Given the key already exists, need to check all cursors in
                            // this Checkpoint and see if the existing item for this key is
                            // to the "left" of the cursor (i.e. has already been
                            // processed).
                            for (auto& cursor : checkpointManager->cursors) {
                                if ((*(cursor.second->currentCheckpoint)).get() == this) {
                                    if (cursor.second->name ==
                                        CheckpointManager::pCursorName) {
                                        int64_t cursor_mutation_id =
                                                getMutationId(*cursor.second);         <------- !!
                                        queued_item& cursor_item =
                                                *(cursor.second->currentPos);
            ..
            

            Not related to the issue, but just a comment on Checkpoint::getMutationId() and its usage.
            Essentially, in Checkpoint::queueDirty() we need to compare 'it->second.mutation_id' to the seqno of the item currently pointed to by each cursor placed in the checkpoint (I'll call it cursor-seqno from now on).
            To get cursor-seqno we call Checkpoint::getMutationId(cursor). The logic there is: (1) get the cursor_item_idx and (2) return cursor_item_idx->second.mutation_id.
            Couldn't we just do 'cursor-seqno = (*(cursor.second->currentPos))->getBySeqno()' in queueDirty() and avoid the call to getMutationId()?

            Note that, given how Checkpoint::queueDirty() and Checkpoint::getMutationId() work, the mutation dumped in the logs is just the mutation pointed to by the persistence cursor at the time of the failure (i.e., not the mutation that the node is processing when we fail).
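            A toy sketch of the suggestion above (hypothetical, self-contained code, not the kv_engine types): in this simplified model, reading the seqno straight from the item under the cursor yields the same value as the key-index lookup, without performing any lookup that could throw.

            // Hypothetical, simplified sketch (not kv_engine code) of the suggested
            // change: read the cursor-seqno from the item the cursor points at
            // instead of going through the key index.
            #include <cassert>
            #include <cstdint>
            #include <map>
            #include <memory>
            #include <string>

            struct Item {
                std::string key;
                int64_t bySeqno;
                int64_t getBySeqno() const { return bySeqno; }
            };
            using queued_item = std::shared_ptr<Item>;

            int main() {
                // Key index as in the quoted code: key -> mutation_id of the queued item.
                std::map<std::string, int64_t> keyIndex;
                queued_item cursorPos = std::make_shared<Item>(Item{"k1", 11190});
                keyIndex[cursorPos->key] = cursorPos->bySeqno;

                // getMutationId()-style path: a key-index lookup, which throws in the
                // real code when the entry is missing.
                const int64_t viaIndex = keyIndex.at(cursorPos->key);

                // Suggested path: take the seqno from the item itself, no lookup involved.
                const int64_t viaItem = cursorPos->getBySeqno();

                assert(viaIndex == viaItem);
                return 0;
            }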

            owend Daniel Owen made changes -
            Priority: Major [ 3 ] -> Critical [ 2 ]
            paolo.cocchi Paolo Cocchi added a comment - - edited

            While in MB-41283 we hit an issue in the Memory->Disk transition, here we are hitting the following in the opposite Disk->Memory transition (note that the Active is 6.0.1 and the Replica is 6.6.1):

            1. Active sends a Disk snapshot [X, Y]
            2. Replica receives it successfully, but since MB-35889 (6.6.0) the replica doesn't add keys to the Checkpoint indexes for Disk snapshots -> no entry for any key in the Disk checkpoint at the replica
            3. Active switches to in-memory -> it sends a SnapMarker with MARKER_FLAG_CHK NOT set (possible in 6.0.1, e.g. MB-32862) + it starts sending mutations within the new Memory snapshot
            4. Replica receives the SnapMarker for the Memory snapshot and turns the existing Disk checkpoint into a Memory checkpoint
            5. Replica receives mutations for the Memory snapshot and queues those mutations in the Checkpoint that already contains mutations from the original Disk snapshot
            6. At some point the Replica receives a mutation for a key that already exists in the Checkpoint -> the deduplication code path above is executed, but we don't have any entry in the index for that key at this point, so we fail

            In the code above we need to compare the mutation-seqno with the cursor-seqno for every cursor in the Checkpoint.
            If we find a cursor that points to a mutation received as part of the original Disk snapshot, then we try to look up the key of that mutation in the keyIndex. Given that in 6.6.1 we don't add keys to the index for Disk snapshots, we don't find that key and we throw, as seen in this MB (see the toy model below).

            Note: I expect this issue to also affect any upgrade from pre-6.5 (i.e. 6.0.x, where we had issues like MB-32862) to 6.6.0, as MB-35889 was introduced in 6.6.0.
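            Below is a self-contained toy model of this sequence (hypothetical code written for illustration, not kv_engine source; ToyCheckpoint and its members are made-up names): the Disk checkpoint is built without key-index entries, the same checkpoint is then extended with Memory items because MARKER_FLAG_CHK is missing, and the de-duplication path finally fails to find the persistence cursor's key in the index, mirroring the exception in the logs.

            // Toy model (hypothetical, not kv_engine code) of the Disk->Memory
            // upgrade scenario described above.
            #include <cstddef>
            #include <cstdint>
            #include <iostream>
            #include <map>
            #include <stdexcept>
            #include <string>
            #include <utility>
            #include <vector>

            struct ToyCheckpoint {
                bool isDiskCheckpoint = true;
                std::vector<std::pair<std::string, int64_t>> items; // (key, seqno)
                std::map<std::string, int64_t> keyIndex;            // key -> mutation_id
                std::size_t persistenceCursor = 0;                  // index into 'items'

                void queueDirty(const std::string& key, int64_t seqno) {
                    // Mimic MB-35889: keys are not indexed while this is a Disk checkpoint.
                    if (!isDiskCheckpoint) {
                        if (keyIndex.count(key) != 0) {
                            // De-dup path: as in Checkpoint::getMutationId(), look up the
                            // key of the item under the persistence cursor in the index.
                            const std::string& cursorKey = items[persistenceCursor].first;
                            if (keyIndex.count(cursorKey) == 0) {
                                throw std::logic_error(
                                        "Unable to find key in keyIndex for cursor item '" +
                                        cursorKey + "'");
                            }
                        }
                        keyIndex[key] = seqno;
                    }
                    items.emplace_back(key, seqno);
                }
            };

            int main() {
                ToyCheckpoint ckpt;

                // 1-2. Replica receives a Disk snapshot; no key-index entries are created.
                ckpt.queueDirty("keyA", 1);
                ckpt.queueDirty("keyB", 2);

                // 3-4. Active moves to in-memory without MARKER_FLAG_CHK, so the replica
                // reuses (extends) the same checkpoint for the Memory snapshot.
                ckpt.isDiskCheckpoint = false;

                // 5. Memory mutations are queued into that same checkpoint.
                ckpt.queueDirty("keyC", 3);

                // 6. A second mutation for "keyC" hits the de-dup path; the persistence
                // cursor still points at a Disk item ("keyA") that was never indexed, so
                // the lookup fails, as in the exception seen in this MB.
                try {
                    ckpt.queueDirty("keyC", 4);
                } catch (const std::logic_error& e) {
                    std::cout << "throws: " << e.what() << '\n';
                }
                return 0;
            }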

            paolo.cocchi Paolo Cocchi added a comment -

            Test at http://review.couchbase.org/c/kv_engine/+/140532 encodes the failure scenario described in my previous comment. Fix in progress.

            paolo.cocchi Paolo Cocchi made changes -
            Affects Version/s 6.6.0 [ 16787 ]
            paolo.cocchi Paolo Cocchi added a comment -

            On a separate thread, MB-42805.

            paolo.cocchi Paolo Cocchi made changes -
            Link This issue is triggering MB-42805 [ MB-42805 ]
            drigby Dave Rigby added a comment -

            Thanks for the details.

            Paolo Cocchi One concern I have with reverting MB-35889: it has already shipped in 6.6.0, so we will have to deal with code which doesn't maintain the KeyIndex for Disk checkpoints in upgrade paths going forward (i.e. 6.6.0 -> newer). What is the consequence of that, i.e. are there any scenarios where, during a cluster upgrade, a 6.6.0 node could have problems?

            drigby Dave Rigby added a comment -

            Also, assuming we use the above patch: what's the consequence of just using Item::getBySeqno for the mutation_id? Do we always wake the flusher to re-persist even if strictly not necessary?

            owend Daniel Owen made changes -
            Due Date 19/Nov/20
            owend Daniel Owen made changes -
            Labels: functional-test -> approved-for-6.6.1 functional-test
            owend Daniel Owen made changes -
            Is this a Regression?: Unknown [ 10452 ] -> Yes [ 10450 ]
            owend Daniel Owen made changes -
            Link This issue blocks MB-40528 [ MB-40528 ]
            richard.demellow Richard deMellow made changes -
            Link This issue relates to MB-41283 [ MB-41283 ]
            richard.demellow Richard deMellow made changes -
            Link This issue causes CBSE-9239 [ CBSE-9239 ]
            paolo.cocchi Paolo Cocchi made changes -
            Triage: Untriaged [ 10351 ] -> Triaged [ 10350 ]
            Assignee: Paolo Cocchi [ paolo.cocchi ] -> Ashwin Govindarajulu [ ashwin.govindarajulu ]
            Resolution: Fixed [ 1 ]
            Status: Open [ 1 ] -> Resolved [ 5 ]
            paolo.cocchi Paolo Cocchi added a comment -

            Hi Ashwin Govindarajulu, fix submitted, assigning back to you for verification, thanks.

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.1-9193 contains kv_engine commit c4454a5 with commit message:
            MB-42780: Make replica resilient to missing MARKER_FLAG_CHK

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.1-9194 contains kv_engine commit 675007e with commit message:
            MB-42780: Expand tests and improve comments

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Not seeing this crash while trying to rebalance_in a 6.6.1-9194 node on top of a 6.0.1-2037 cluster.

            Closing this ticket.

            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Status: Resolved [ 5 ] -> Closed [ 6 ]
            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Sure Richard deMellow. Already started working on that test.

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4069 contains kv_engine commit 675007e with commit message:
            MB-42780: Expand tests and improve comments

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4069 contains kv_engine commit c4454a5 with commit message:
            MB-42780: Make replica resilient to missing MARKER_FLAG_CHK

            drigby Dave Rigby made changes -
            Affects Version/s 6.6.1 [ 17002 ]
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4127 contains kv_engine commit eba2ffd with commit message:
            MB-42780: Logically revert MB-41283

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4129 contains kv_engine commit cc90dc9 with commit message:
            MB-42780: CM allows extending only Memory checkpoints

            richard.demellow Richard deMellow made changes -
            Link This issue relates to CBSE-9397 [ CBSE-9397 ]
            James Flather James Flather made changes -
            Link This issue blocks CBSE-9711 [ CBSE-9711 ]

              People

              Assignee:
              ashwin.govindarajulu Ashwin Govindarajulu
              Reporter:
              ashwin.govindarajulu Ashwin Govindarajulu
              Votes: 0
              Watchers: 6
