Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37103

[System test]: Disk Checkpoint does not have an initialised HCS

    XMLWordPrintable

Details

    Description

      Build: 6.5.0-4908 not seen on 4890

      Test: MH longevity with durability

      Cycle: 2nd

      Day: 1st

      Seeing following crash 

      Service 'memcached' exited with status 134. Restarting. Messages:
      2019-11-29T21:06:49.818637-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f83e4b3d000+0x8f213]
      2019-11-29T21:06:49.818649-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f83df688000+0x74098]
      2019-11-29T21:06:49.818656-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f83df688000+0x77434]
      2019-11-29T21:06:49.818665-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f83df688000+0x77843]
      2019-11-29T21:06:49.818671-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f83df688000+0x77924]
      2019-11-29T21:06:49.818677-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f83df688000+0x809f9]
      2019-11-29T21:06:49.818687-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f83df688000+0x12f964]
      2019-11-29T21:06:49.818691-08:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f83e69e6000+0x8ee7]
      2019-11-29T21:06:49.818695-08:00 CRITICAL /lib64/libpthread.so.0() [0x7f83e4408000+0x7dc5]
      2019-11-29T21:06:49.818720-08:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f83e4047000+0xf621d] hidens_log 000ns_1@172.23.99.25 9:06:52 PM   29 Nov, 2019
      
      

      Test step 

      [2019-11-29T20:28:41-08:00, sequoiatools/couchbase-cli:6.5:0c44c4] server-add -c 172.23.108.103:8091 --server-add https://172.23.106.100 -u Administrator -p password --server-add-username Administrator --server-add-password password --services data
      [2019-11-29T20:42:41-08:00, sequoiatools/couchbase-cli:6.5:4b9505] failover -c 172.23.108.103:8091 --server-failover 172.23.104.5:8091 -u Administrator -p password
      [2019-11-29T21:03:27-08:00, sequoiatools/couchbase-cli:6.5:b143b9] failover -c 172.23.108.103:8091 --server-failover 172.23.99.20:8091 -u Administrator -p password --force
      [2019-11-29T21:04:21-08:00, sequoiatools/couchbase-cli:6.5:ba2ca6] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      → 
       
       
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
       
       
      docker logs ba2ca6
      docker start ba2ca6
       
       
      *Unable to display progress bar on this os
      JERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-11-29T21:07:15-08:00, sequoiatools/cmd:1e7d4b] 60 

      Rebalance failed with during same period 

       [user:error,2019-11-29T21:06:52.374-08:00,ns_1@172.23.108.103:<0.12064.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {mover_crashed,
                                    {unexpected_exit,
                                     {'EXIT',<0.7202.822>,
                                      {{{{{child_interrupted,
                                           {'EXIT',<25874.23227.201>,socket_closed}},
                                          [{dcp_replicator,spawn_and_wait,1,
                                            [{file,"src/dcp_replicator.erl"},
                                             {line,266}]},
                                           {dcp_replicator,handle_call,3,
                                            [{file,"src/dcp_replicator.erl"},
                                             {line,121}]},
                                           {gen_server,try_handle_call,4,
                                            [{file,"gen_server.erl"},{line,636}]},
                                           {gen_server,handle_msg,6,
                                            [{file,"gen_server.erl"},{line,665}]},
                                           {proc_lib,init_p_do_apply,3,
                                            [{file,"proc_lib.erl"},{line,247}]}]},
                                         {gen_server,call,
                                          [<25874.23258.201>,
                                           {setup_replication,
                                            [21,22,106,163,336,760,939,940,941,942,
                                             943,944,945]},
                                           infinity]}},
                                        {gen_server,call,
                                         ['replication_manager-STOCK',
                                          {change_vbucket_replication,21,
                                           'ns_1@172.23.99.25'},
                                          infinity]}},
                                       {gen_server,call,
                                        [{'janitor_agent-STOCK',
                                          'ns_1@172.23.104.61'},
                                         {if_rebalance,<0.5737.821>,
                                          {update_vbucket_state,47,active,
                                           undefined,undefined,
                                           [['ns_1@172.23.104.61',
                                             'ns_1@172.23.97.122']]}},
                                         infinity]}}}}}.
      Rebalance Operation Id = e3a4b107de62a1338edd76e84547484d

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            owend Daniel Owen added a comment -

            On node 172.23.99.25 seeing the following:

            GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1185
            

            owend Daniel Owen added a comment - On node 172.23.99.25 seeing the following: GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1185
            owend Daniel Owen added a comment -

            The KV change between last working build (4890) and the build seeing the issue (4908) is as follows:

             
             * Commit: 5e3f2c1d4b4f538244ac7b73a5c842728bc9018b in build: 6.5.0-4906
               MB-36923: Allow "default" and "disk_io_optimized" for ThreadCount
               Change-Id: I56c62e624f7f067168a2b074fb079602d1287983
               Reviewed-on: http://review.couchbase.org/118511
             
             * Commit: 8d8ad4ccdda0193d5a9befd40489c1381be7de56 in build: 6.5.0-4902
               MB-37013: Update DcpSnapShotMarker V2 to allow for an extra seqno
               Change-Id: I886503d6353d01b284b04af730d581f6be6784c9
               Reviewed-on: http://review.couchbase.org/118387
               
             * Commit: 0861963a1590a0196c7939ea2dc571609cbeee5b in build: 6.5.0-4901
               MB-36971: Return KEY_EEXISTS if stream opaque incorrect
               Change-Id: I511aa2108da93dfabde4cbcb500bc52952aa777b
               Reviewed-on: http://review.couchbase.org/118560
               
             * Commit: f17fdd7b1bcbe892e13302a1f94e03b53b205bf9 in build: 6.5.0-4900
               MB-36971: Send the HCS when streaming a Disk-Checkpoint
               Change-Id: I6e360cc81a6f05ea4165c82d754b3af1112bac5b
               Reviewed-on: http://review.couchbase.org/118350
               
             * Commit: 72e74070611b0bd90ca0ec2cd6554d37fbcb2a39 in build: 6.5.0-4899
               MB-37036: Cap automatic num_reader_threads to 64
               Change-Id: I857144d64b238923bec9ffefcd0e32e86bc0cfad
               Reviewed-on: http://review.couchbase.org/118447
             
             * Commit: 19210da3d889943346cf8ec9743eaf783c4eb553 in build: 6.5.0-4892
               MB-36971: Ensure that DCP Producer handles KeyEnoent correctly
               Change-Id: Ie694f73c8e668298dd14d205ca753762cab7fab4
               Reviewed-on: http://review.couchbase.org/118214
             
               
             * Commit: 25a1e3d19f4ae957a4ae7dec313819fcb318df80 in build: 6.5.0-4891
               MB-36923: Use correct value for num_writer_thread listener
               Change-Id: If5332d453d0abf1d27f4c63dbcc1a27acd206bc0
               Reviewed-on: http://review.couchbase.org/118448
            

            owend Daniel Owen added a comment - The KV change between last working build (4890) and the build seeing the issue (4908) is as follows:   * Commit: 5e3f2c1d4b4f538244ac7b73a5c842728bc9018b in build: 6.5.0-4906 MB-36923: Allow "default" and "disk_io_optimized" for ThreadCount Change-Id: I56c62e624f7f067168a2b074fb079602d1287983 Reviewed-on: http://review.couchbase.org/118511   * Commit: 8d8ad4ccdda0193d5a9befd40489c1381be7de56 in build: 6.5.0-4902 MB-37013: Update DcpSnapShotMarker V2 to allow for an extra seqno Change-Id: I886503d6353d01b284b04af730d581f6be6784c9 Reviewed-on: http://review.couchbase.org/118387 * Commit: 0861963a1590a0196c7939ea2dc571609cbeee5b in build: 6.5.0-4901 MB-36971: Return KEY_EEXISTS if stream opaque incorrect Change-Id: I511aa2108da93dfabde4cbcb500bc52952aa777b Reviewed-on: http://review.couchbase.org/118560 * Commit: f17fdd7b1bcbe892e13302a1f94e03b53b205bf9 in build: 6.5.0-4900 MB-36971: Send the HCS when streaming a Disk-Checkpoint Change-Id: I6e360cc81a6f05ea4165c82d754b3af1112bac5b Reviewed-on: http://review.couchbase.org/118350 * Commit: 72e74070611b0bd90ca0ec2cd6554d37fbcb2a39 in build: 6.5.0-4899 MB-37036: Cap automatic num_reader_threads to 64 Change-Id: I857144d64b238923bec9ffefcd0e32e86bc0cfad Reviewed-on: http://review.couchbase.org/118447   * Commit: 19210da3d889943346cf8ec9743eaf783c4eb553 in build: 6.5.0-4892 MB-36971: Ensure that DCP Producer handles KeyEnoent correctly Change-Id: Ie694f73c8e668298dd14d205ca753762cab7fab4 Reviewed-on: http://review.couchbase.org/118214   * Commit: 25a1e3d19f4ae957a4ae7dec313819fcb318df80 in build: 6.5.0-4891 MB-36923: Use correct value for num_writer_thread listener Change-Id: If5332d453d0abf1d27f4c63dbcc1a27acd206bc0 Reviewed-on: http://review.couchbase.org/118448
            owend Daniel Owen added a comment -

            The Expect that is failing is the following:

            1181          // If the stream supports SyncRep then send the HCS for CktpType::disk
            1182          const auto sendHCS = supportSyncReplication() && isCkptTypeDisk;
            1183          const auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
            1184          if (sendHCS) {
            1185              Expects(hcsToSend.is_initialized());     <<<<<<<<<<<<<<<<<<<<<<<<<< Failing expect
            1186              log(spdlog::level::level_enum::info,
            1187                  "{} ActiveStream::snapshot: Sending disk snapshot with start "
            1188                  "seqno {}, end seqno {}, and"
            1189                  " high completed seqno {}",
            1190                  logPrefix,
            1191                  snapStart,
            1192                  snapEnd,
            1193                  hcsToSend);
            1194          }
            

            The code was introduced in

             * Commit: f17fdd7b1bcbe892e13302a1f94e03b53b205bf9 in build: 6.5.0-4900
               MB-36971: Send the HCS when streaming a Disk-Checkpoint
               Change-Id: I6e360cc81a6f05ea4165c82d754b3af1112bac5b
               Reviewed-on: http://review.couchbase.org/118350
            

            owend Daniel Owen added a comment - The Expect that is failing is the following: 1181 // If the stream supports SyncRep then send the HCS for CktpType::disk 1182 const auto sendHCS = supportSyncReplication() && isCkptTypeDisk; 1183 const auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none; 1184 if (sendHCS) { 1185 Expects(hcsToSend.is_initialized()); <<<<<<<<<<<<<<<<<<<<<<<<<< Failing expect 1186 log(spdlog::level::level_enum::info, 1187 "{} ActiveStream::snapshot: Sending disk snapshot with start " 1188 "seqno {}, end seqno {}, and" 1189 " high completed seqno {}", 1190 logPrefix, 1191 snapStart, 1192 snapEnd, 1193 hcsToSend); 1194 } The code was introduced in * Commit: f17fdd7b1bcbe892e13302a1f94e03b53b205bf9 in build: 6.5.0-4900 MB-36971: Send the HCS when streaming a Disk-Checkpoint Change-Id: I6e360cc81a6f05ea4165c82d754b3af1112bac5b Reviewed-on: http://review.couchbase.org/118350
            drigby Dave Rigby added a comment - - edited

            As per analysis of MB-37109, the same problem seen there is occurring here:

            // If the stream supports SyncRep then send the HCS in the
                    // SnapshotMarker if it is not 0
                    auto sendHCS = supportSyncReplication() && highCompletedSeqno &&
                                   *highCompletedSeqno != 0;
                    auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
                    log(spdlog::level::level_enum::info,
                        "{} ActiveStream::markDiskSnapshot: Sending disk snapshot with "
                        "start seqno {}, end seqno {}, and"
                        " high completed seqno {}",
                        logPrefix,
                        startSeqno,
                        endSeqno,
                        hcsToSend);
            

            On a bucket which has never performed any SyncWrites, the HCS will indeed be zero. This will cause sendHCS to be false, and hence hcsToSend will be boost::none. This results in the Checkpoint on the recipient being marked as a Disk Snapshot but without a HCS, which ultimately causes the crash seen when this node later becomes Active and attempts to stream out a Snapshot marker.
            Note that http://review.couchbase.org/#/c/113492/ added this clause, to avoid an assertion in the consumer (consumer) flusher attempting to flush a HCS not greater than the current HCS - which is false for the initial snapshot (0 > 0):

                            if (hcs) {
                                Expects(hcs > vbstate.persistedCompletedSeqno);
            

            drigby Dave Rigby added a comment - - edited As per analysis of MB-37109 , the same problem seen there is occurring here: // If the stream supports SyncRep then send the HCS in the // SnapshotMarker if it is not 0 auto sendHCS = supportSyncReplication() && highCompletedSeqno && *highCompletedSeqno != 0; auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none; log (spdlog::level::level_enum::info, "{} ActiveStream::markDiskSnapshot: Sending disk snapshot with " "start seqno {}, end seqno {}, and" " high completed seqno {}" , logPrefix, startSeqno, endSeqno, hcsToSend); On a bucket which has never performed any SyncWrites, the HCS will indeed be zero. This will cause sendHCS to be false, and hence hcsToSend will be boost::none. This results in the Checkpoint on the recipient being marked as a Disk Snapshot but without a HCS, which ultimately causes the crash seen when this node later becomes Active and attempts to stream out a Snapshot marker. Note that http://review.couchbase.org/#/c/113492/ added this clause, to avoid an assertion in the consumer (consumer) flusher attempting to flush a HCS not greater than the current HCS - which is false for the initial snapshot (0 > 0): if (hcs) { Expects(hcs > vbstate.persistedCompletedSeqno);

            Build couchbase-server-6.5.0-4916 contains kv_engine commit c4f9f45 with commit message:
            MB-37103: Send HCS from backfill even if zero

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4916 contains kv_engine commit c4f9f45 with commit message: MB-37103 : Send HCS from backfill even if zero

            Build couchbase-server-7.0.0-1102 contains kv_engine commit c4f9f45 with commit message:
            MB-37103: Send HCS from backfill even if zero

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1102 contains kv_engine commit c4f9f45 with commit message: MB-37103 : Send HCS from backfill even if zero

            People

              vikas.chaudhary Vikas Chaudhary
              vikas.chaudhary Vikas Chaudhary
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty