Couchbase Server / MB-45825

[Windows][Collections] - Minidumps seen on collections CRUD + failover on Windows

Details

    • Untriaged
    • Windows 64-bit
    • 1
    • No

    Description

      Script to Repro

      {'change_ephemeral_purge_age_and_interval': 'False',
       'cluster_name': 'win10-bucket-ops-11111222223334444',
       'conf_file': 'conf/collections/collections_failover_crud_on_collections.conf',
       'crash_warning': 'True',
       'get-cbcollect-info': 'True',
       'ini': '/tmp/win10-bucket-ops-11111222223334444.ini',
       'num_nodes': 5,
       'rerun': 'False',
       'spec': 'collections_failover_crud_on_collections'}
      

      Minidump is 5cb17598-d26c-4e2f-a12c-2b5cd2ad2f21.dmp on 172.23.104.247:

      Administrator@WIN-1T98IIFH727 /cygdrive/c/Program Files/couchbase/server/var/lib/couchbase/crash
      $ ls -lrt
      total 176
      -rwxrwx---+ 1 Administrators SYSTEM 179602 Apr 20 20:32 5cb17598-d26c-4e2f-a12c-2b5cd2ad2f21.dmp
       
      Administrator@WIN-1T98IIFH727 /cygdrive/c/Program Files/couchbase/server/var/lib/couchbase/crash
      $ cd ../logs/
       
      Administrator@WIN-1T98IIFH727 /cygdrive/c/Program Files/couchbase/server/var/lib/couchbase/logs
      $ grep CRITICAL *
      memcached.log.000529.txt:2021-04-20T20:32:30.458617-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-4985). Writing crash dump to c:/Program Files/Couchbase/Server/var/lib/couchbase/crash\5cb17598-d26c-4e2f-a12c-2b5cd2ad2f21.dmp before terminating.
      memcached.log.000529.txt:2021-04-20T20:32:30.459472-07:00 CRITICAL Stack backtrace of crashed thread:
      memcached.log.000529.txt:2021-04-20T20:32:30.465599-07:00 CRITICAL     #0  c:\Program Files\Couchbase\Server\bin\memcached.exe(magma::File::File+8155721) [0x00007FF630A1E9EB]
      memcached.log.000529.txt:2021-04-20T20:32:30.465645-07:00 CRITICAL     #1  c:\Program Files\Couchbase\Server\bin\memcached.exe(magma::File::File+8395231) [0x00007FF630A59181]
      memcached.log.000529.txt:2021-04-20T20:32:30.467617-07:00 CRITICAL     #2  C:\Windows\System32\KERNEL32.DLL(BaseThreadInitThunk+20) [0x00007FFEE57784D4]
      memcached.log.000529.txt:2021-04-20T20:32:30.469073-07:00 CRITICAL     #3  C:\Windows\SYSTEM32\ntdll.dll(RtlUserThreadStart+33) [0x00007FFEE599E8B1]
      grep: rebalance: Is a directory
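
The backtrace lines that Breakpad writes to memcached.log have a regular shape: frame number, module path, `symbol+offset` in parentheses, and the absolute address in brackets. The offsets are decimal (`BaseThreadInitThunk+20` here corresponds to `+0x14` in a debugger). Below is a minimal C++ sketch for pulling those fields out of such a line. The `Frame` struct and `parseFrame` function are hypothetical names used only for illustration, and the pattern assumes the module path contains no parentheses.

```cpp
#include <cstdint>
#include <optional>
#include <regex>
#include <string>

// One frame of a Breakpad backtrace as printed in memcached.log.
struct Frame {
    int index = 0;              // frame number after '#'
    std::string module;         // path of the module containing the address
    std::string symbol;         // nearest symbol name reported by the logger
    std::uint64_t offset = 0;   // decimal offset from that symbol
    std::uint64_t address = 0;  // absolute address (hex in the log)
};

// Returns the parsed frame, or std::nullopt if the line is not a frame line.
std::optional<Frame> parseFrame(const std::string& line) {
    static const std::regex pattern(
            R"re(#(\d+)\s+(.+?)\((.+)\+(\d+)\)\s+\[0x([0-9A-Fa-f]+)\])re");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return std::nullopt;
    }
    Frame frame;
    frame.index = std::stoi(m[1].str());
    frame.module = m[2].str();
    frame.symbol = m[3].str();
    frame.offset = std::stoull(m[4].str());
    frame.address = std::stoull(m[5].str(), nullptr, 16);
    return frame;
}
```

A very large offset, such as `magma::File::File+8155721`, probably means the logger fell back to the nearest symbol it could find, so the symbol name in those frames should be treated with caution; the minidump analysis is the more reliable source for the real call stack.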
      

          Activity

            owend Daniel Owen added a comment -

            The following message is repeated until just before the memcached restart:

            [ns_server:info,2021-04-20T20:33:12.866-07:00,babysitter_of_ns_1@cb.local:<0.143.0>:ns_port_server:log:218]saslauthd_port<0.143.0>: 2021/04/20 20:33:12 revrpc: Got error (dial tcp 127.0.0.1:8091: connectex: No connection could be made because the target machine actively refused it.) and will retry in 1s
             
            [ns_server:info,2021-04-20T23:53:19.725-07:00,babysitter_of_ns_1@cb.local:<0.143.0>:ns_port_server:log:218]saslauthd_port<0.143.0>: 2021/04/20 23:53:19 revrpc: Got error (dial tcp 127.0.0.1:8091: connectex: No connection could be made because the target machine actively refused it.) and will retry in 1s
            

            richard.demellow Richard deMellow added a comment -

            Analysis of 5cb17598-d26c-4e2f-a12c-2b5cd2ad2f21.dmp

            0:006> !analyze -v
            *******************************************************************************
            *                                                                             *
            *                        Exception Analysis                                   *
            *                                                                             *
            *******************************************************************************
             
            *** WARNING: Unable to verify checksum for memcached.exe
            *** WARNING: Unable to verify timestamp for event_core.dll
            *** WARNING: Unable to verify timestamp for platform_so.dll
             
            KEY_VALUES_STRING: 1
             
                Key  : Analysis.CPU.Sec
                Value: 0
             
                Key  : Analysis.DebugAnalysisProvider.CPP
                Value: Create: 8007007e on WINGMAN
             
                Key  : Analysis.DebugData
                Value: CreateObject
             
                Key  : Analysis.DebugModel
                Value: CreateObject
             
                Key  : Analysis.Elapsed.Sec
                Value: 4
             
                Key  : Analysis.Memory.CommitPeak.Mb
                Value: 185
             
                Key  : Analysis.System
                Value: CreateObject
             
                Key  : Timeline.Process.Start.DeltaSec
                Value: 3301
             
             
            CONTEXT:  (.ecxr)
            rax=0000000000000000 rbx=000000cfd5bfec10 rcx=0000000000000000
            rdx=0000000000000000 rsi=000000cfd5bff240 rdi=000000cfd5bfdaa0
            rip=00007ffee2904c48 rsp=000000cfd5bfccb0 rbp=000000cfd5bfd340
             r8=0000000000000000  r9=0000000000000000 r10=0000000000000000
            r11=0000000000000000 r12=000000cfd5bfcdf0 r13=0000000000000000
            r14=000000cfd5bfe0c8 r15=0000000000000000
            iopl=0         nv up ei pl nz na po nc
            cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
            KERNELBASE!RaiseException+0x68:
            00007ffe`e2904c48 488b8c24c0000000 mov     rcx,qword ptr [rsp+0C0h] ss:000000cf`d5bfcd70=0000abf8b02a5a29
            Resetting default scope
             
            EXCEPTION_RECORD:  (.exr -1)
            ExceptionAddress: 00007ffee2904c48 (KERNELBASE!RaiseException+0x0000000000000068)
               ExceptionCode: e06d7363 (C++ EH exception)
              ExceptionFlags: 00000001
            NumberParameters: 4
               Parameter[0]: 0000000019930520
               Parameter[1]: 000000cfd5bff390
               Parameter[2]: 00007ff630e8add8
               Parameter[3]: 00007ff630200000
             
            PROCESS_NAME:  memcached.exe
             
            ERROR_CODE: (NTSTATUS) 0xe06d7363 - <Unable to get error code text>
             
            EXCEPTION_CODE_STR:  e06d7363
             
            EXCEPTION_PARAMETER1:  0000000019930520
             
            EXCEPTION_PARAMETER2:  000000cfd5bff390
             
            EXCEPTION_PARAMETER3:  00007ff630e8add8
             
            EXCEPTION_PARAMETER4: 7ff630200000
             
            STACK_TEXT:  
            000000cf`d5bfccb0 00007ffe`dcaa3351 : 00007ff6`30bc99d0 000000cf`d5bfe0c8 0000026c`00000100 0000026c`95808830 : KERNELBASE!RaiseException+0x68
            000000cf`d5bfcd90 00007ffe`e59d9b03 : 0000ba6d`00000000 000000cf`d5bff240 00000000`00000000 00000000`00000000 : VCRUNTIME140!__FrameHandler3::CxxCallCatchBlock+0x151
            000000cf`d5bfce40 00007ff6`3093254e : 0000026c`8a3a20a8 0000026c`8a3a2020 0000026c`8a3a20a8 0000026c`00000001 : ntdll!RcConsolidateFrames+0x3
            000000cf`d5bff420 00007ff6`30934ddd : 0000026c`9bbb0648 0000026c`8a3a20a8 0000026c`96c34240 0000026c`955a05b0 : memcached!std::vector<prometheus::ClientMetric,std::allocator<prometheus::ClientMetric> >::vector<prometheus::ClientMetric,std::allocator<prometheus::ClientMetric> >+0xfe
            000000cf`d5bff480 00007ff6`30971d80 : 00000000`00000078 0000026c`889f4240 00000000`00000078 0000026c`96d80670 : memcached!cb::prometheus::MetricServer::KVCollectable::Collect+0x17d
            000000cf`d5bff590 00007ff6`309700d6 : 000000cf`d5bff678 0000026c`96c303c8 0000026c`975b0f68 00003903`af4e39d0 : memcached!prometheus::detail::CollectMetrics+0xc0
            000000cf`d5bff630 00007ff6`309529c6 : 0000026c`975b0f68 00000000`00000000 00000000`00000000 00000000`00000000 : memcached!prometheus::detail::MetricsHandler::handleGet+0x96
            000000cf`d5bff6d0 00007ff6`30958300 : 0000026c`98ae09f0 0000026c`98ae09f0 000000cf`d5bff830 00000000`00000001 : memcached!CivetServer::requestHandler+0xc6
            000000cf`d5bff730 00007ff6`30965ae4 : 00000000`00000000 00000000`00000000 0000026c`98b4e5e0 00007ff6`30cb6560 : memcached!vsscanf_l+0x4300
            000000cf`d5bffbb0 00007ff6`30969d88 : 00000000`00000001 0000026c`98b4e5e0 0000026c`98ae09f0 00000000`00000000 : memcached!mg_write+0x21f4
            000000cf`d5bffca0 00007ff6`30969b09 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : memcached!sscanf+0x948
            000000cf`d5bffd30 00007ffe`e20bf4a0 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : memcached!sscanf+0x6c9
            000000cf`d5bffd60 00007ffe`e57784d4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ucrtbase!thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x40
            000000cf`d5bffd90 00007ffe`e599e8b1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
            000000cf`d5bffdc0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21
             
             
            FAULTING_SOURCE_LINE:  d:\agent\_work\2\s\src\vctools\crt\vcruntime\src\eh\frame.cpp
             
            FAULTING_SOURCE_FILE:  d:\agent\_work\2\s\src\vctools\crt\vcruntime\src\eh\frame.cpp
             
            FAULTING_SOURCE_LINE_NUMBER:  1534
             
            FAULTING_SOURCE_CODE:  
            No source found for 'd:\agent\_work\2\s\src\vctools\crt\vcruntime\src\eh\frame.cpp'
             
             
            SYMBOL_NAME:  vcruntime140!__FrameHandler3::CxxCallCatchBlock+151
             
            MODULE_NAME: VCRUNTIME140
             
            IMAGE_NAME:  VCRUNTIME140.dll
             
            STACK_COMMAND:  ~6s ; .ecxr ; kb
             
            FAILURE_BUCKET_ID:  CPP_EXCEPTION_e06d7363_VCRUNTIME140.dll!__FrameHandler3::CxxCallCatchBlock
             
            OS_VERSION:  10.0.14393.2430
             
            BUILDLAB_STR:  rs1_release_inmarket_aim
             
            OSPLATFORM_TYPE:  x64
             
            OSNAME:  Windows 10
             
            FAILURE_ID_HASH:  {785a6fcc-a7a7-1220-6869-a9b6948e71fa}
             
            Followup:     MachineOwner
            ---------
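
The EXCEPTION_RECORD above follows the layout the Microsoft C++ runtime uses when it raises a C++ exception: the code 0xE06D7363 has the ASCII bytes "msc" in its low three bytes, Parameter[0] is a magic number, Parameter[1] is the address of the thrown object, Parameter[2] points to the ThrowInfo that describes its type, and on 64-bit builds Parameter[3] is the base address of the module that threw. The following C++ sketch decodes the values from this dump under those conventions; the function names are made up for illustration and are not part of WinDbg or memcached.

```cpp
#include <array>
#include <cstdint>

// Exception code used by the Microsoft C++ runtime for `throw`.
constexpr std::uint32_t kMsvcCppExceptionCode = 0xE06D7363;

constexpr bool isMsvcCppException(std::uint32_t code) {
    return code == kMsvcCppExceptionCode;
}

// The low three bytes of the code are ASCII characters ("msc").
constexpr std::array<char, 4> exceptionTag(std::uint32_t code) {
    return {static_cast<char>((code >> 16) & 0xFF),
            static_cast<char>((code >> 8) & 0xFF),
            static_cast<char>(code & 0xFF),
            '\0'};
}

// On x64 the ThrowInfo is located relative to the throwing module's base
// (Parameter[3]); subtracting gives an offset that does not depend on
// where the image was loaded.
constexpr std::uint64_t throwInfoRva(std::uint64_t throwInfoAddress,
                                     std::uint64_t imageBase) {
    return throwInfoAddress - imageBase;
}

// Values taken from the EXCEPTION_RECORD in this dump.
static_assert(isMsvcCppException(0xe06d7363));
static_assert(throwInfoRva(0x00007ff630e8add8, 0x00007ff630200000) == 0xC8ADD8);
```

Parameter[3] (00007ff630200000) lies just below the memcached frame addresses in STACK_TEXT (00007ff6`309xxxxx), which suggests the exception was thrown from code inside memcached.exe itself rather than from one of its DLLs.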
             
            
            

            richard.demellow Richard deMellow added a comment - - edited

            Looks like it's coming from the line of code I've marked //<----- here:

            class MetricServer::KVCollectable : public ::prometheus::Collectable {
            public:
                KVCollectable(Cardinality cardinality, GetStatsCallback getStatsCB)
                    : cardinality(cardinality), getStatsCB(std::move(getStatsCB)) {
                }
                /**
                 * Gathers high or low cardinality metrics
                 * and returns them in the prometheus required structure.
                 */
                [[nodiscard]] std::vector<::prometheus::MetricFamily> Collect()
                        const override {
                    std::unordered_map<std::string, ::prometheus::MetricFamily> statsMap;
                    PrometheusStatCollector collector(statsMap);
                    getStatsCB(collector, cardinality);
                    // KVCollectable interface requires a vector of metric families,
                    // but during collection it is necessary to frequently look up
                    // families by name, so they are stored in a map.
                    // Unpack them into a vector.
                    std::vector<::prometheus::MetricFamily> result;
                    result.reserve(statsMap.size());
                    for (const auto& statEntry : statsMap) {
                        result.push_back(statEntry.second /* MetricFamily */); //<----- here 
                    }
                    return result;
                }
            private:
                Cardinality cardinality;
                // function to call on every incoming request to generate stats
                GetStatsCallback getStatsCB;
            };
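
The loop marked above copies every MetricFamily out of the map, and each copy allocates, so any of those allocations can throw. The sketch below shows the same map-to-vector unpacking with moves instead of copies. It uses a simplified stand-in `MetricFamily` struct and a hypothetical `unpackFamilies` helper rather than the real prometheus-cpp types, and it is an illustration of the pattern only, not a proposed fix for this crash.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-in for ::prometheus::MetricFamily (assumption: the real
// type also owns strings and a vector of metrics, so it is cheap to move).
struct MetricFamily {
    std::string name;
    std::vector<double> samples;
};

// Unpacks the families collected in a map into the vector the caller needs.
// The map is taken by rvalue reference because it is a local scratch
// structure that is discarded afterwards, so its values can be moved from.
std::vector<MetricFamily> unpackFamilies(
        std::unordered_map<std::string, MetricFamily>&& statsMap) {
    std::vector<MetricFamily> result;
    result.reserve(statsMap.size());  // single allocation for the vector
    for (auto& entry : statsMap) {
        // Moving transfers ownership of the family's buffers; no per-family
        // deep copy is made.
        result.push_back(std::move(entry.second));
    }
    return result;
}
```

Moving removes the per-element deep copies but not every allocation: `reserve` can still throw `std::bad_alloc`, so the code serving the metrics request would still need to handle exceptions escaping from collection.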
            

            Also, here is a link on how to debug Windows exceptions: https://devblogs.microsoft.com/oldnewthing/20100730-00/?p=13273

            owend Daniel Owen added a comment -

            I believe this is caused by the issues highlighted in MB-45061.
            Closing as incomplete.
            If the issue persists after MB-45061 is resolved, please reopen.


            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Bulk closing all non-fixed issues. Please reopen if needed.

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
              Votes: 0
              Watchers: 6
