Couchbase Server / MB-45824

[Windows][Collections] - Minidumps seen on collections crud + rebalance on Windows


Details

    • Untriaged
    • Windows 64-bit
    • 1
    • No

    Description

      Script to Repro
      Global Test input params:

      {'change_ephemeral_purge_age_and_interval': 'False',
       'cluster_name': 'win10-bucket-ops-11111222223334444555555555',
       'conf_file': 'conf/collections/collections_rebalance_crud_on_collections.conf',
       'crash_warning': 'True',
       'get-cbcollect-info': 'True',
       'ini': '/tmp/win10-bucket-ops-11111222223334444555555555.ini',
       'num_nodes': 5,
       'rerun': 'False',
       'spec': 'collections_rebalance_crud_on_collections'}
      

      The minidump is 2d18f71d-1efb-4906-9072-462523dcb9b5.dmp on 172.23.106.120.
      Not sure how to get the full backtrace (bt full) on Windows, so for now I am just attaching cbcollects. The logs seem to have rolled over.


          Activity

            owend Daniel Owen added a comment - - edited

            On 172.23.106.120, see the babysitter log:

            [error_logger:error,2021-04-20T06:41:14.272-07:00,babysitter_of_ns_1@cb.local:<0.140.0>:ale_error_logger_handler:do_log:101]
            =========================ERROR REPORT=========================
            ** Generic server <0.140.0> terminating 
            ** Last message in was {#Port<0.16>,{exit_status,-529697949}}
            ** When Server state == {state,#Port<0.16>,
                                        {memcached,
                                            "c:/Program Files/Couchbase/Server/bin/memcached",
                                            ["-C",
                                             "c:/Program Files/Couchbase/Server/var/lib/couchbase/config/memcached.json"],
                                            [{env,
                                                 [{"EVENT_NOSELECT","1"},
                                                  {"MEMCACHED_TOP_KEYS","5"},
                                                  {"CBSASL_PWFILE",
                                                   "c:/Program Files/Couchbase/Server/var/lib/couchbase/isasl.pw"},
                                                  {"JE_MALLOC_CONF","narenas:1"}]},
                                             use_stdio,stderr_to_stdout,exit_status,
                                             stream]},
                                        {ringbuffer,1046,1024,
                                            {[{<<"[*** LOG ERROR ***] [2021-04-20 11:39:38] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:38:36] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:37:36] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:36:17] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:35:16] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:34:16] [globalBucketLogger] argument index out of range">>,
                                               90}],
                                             [{<<"W0420 11:04:29.689070  3228 HazptrDomain.h:671] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object">>,
                                               236},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:30:16] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:31:17] [globalBucketLogger] argument index out of range">>,
                                               90},
                                              {<<"[*** LOG ERROR ***] [2021-04-20 11:32:17] [globalBucketLogger] argument index out of range">>,
                                               90}]}},
                                        undefined,undefined,[],0}
            ** Reason for termination ==
            ** {abnormal,-529697949}
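
The exit status above is easier to recognise in hex. The sketch below is only an illustration, not code from memcached or ns_server, and the helper names are hypothetical. It reinterprets the signed 32-bit value as unsigned, which gives 0xE06D7363, the code the MSVC runtime uses when it raises a C++ exception. That suggests memcached exited on an unhandled C++ exception rather than, say, an access violation.

```cpp
#include <cstdint>

// Exception code the MSVC runtime uses for C++ exceptions:
// 0xE0 followed by the ASCII bytes 'm', 's', 'c'.
constexpr std::uint32_t kMsvcCppException = 0xE06D7363u;

// Reinterpret a signed 32-bit exit status (as printed in the babysitter log)
// as the unsigned value Windows tools normally display.
// Hypothetical helper name, used here for illustration only.
constexpr std::uint32_t to_unsigned_status(std::int32_t exit_status) {
    return static_cast<std::uint32_t>(exit_status);
}

// True when the exit status matches the MSVC C++ exception code.
constexpr bool is_msvc_cpp_exception(std::int32_t exit_status) {
    return to_unsigned_status(exit_status) == kMsvcCppException;
}
```

Passing -529697949 to is_msvc_cpp_exception returns true, which lines up with the std::bad_alloc found later in the minidump.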
            
            


            Balakumaran.Gopal Balakumaran Gopal added a comment -

            guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-11111222223334444555555555.ini rerun=False,get-cbcollect-info=True,crash_warning=True,change_ephemeral_purge_age_and_interval=False -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_swap_rebalance,nodes_init=4,nodes_swap=2,override_spec_params=durability;replicas,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,replicas=2,bucket_spec=multi_bucket.buckets_all_membase_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,quota_percent=80,skip_validations=False,GROUP=rebalance_with_collection_crud_durability_MAJORITY_AND_PERSIST_TO_ACTIVE'

            This bug is hit after a multi-node swap rebalance (2 nodes) running alongside a durability data load. After that, 172.23.106.120 never came back up (http://node:8091 was not accessible). To run cbcollect for this bug I had to reboot the box to get it working again.

            owend Daniel Owen added a comment - - edited

            Thanks to Richard deMellow and his Windows debugging skills, we have the backtrace:

            FAULTING_SOURCE_CODE:  
                68: 
                69: void* operator new(std::size_t count) {
                70:     void* result = cb_malloc(count);
                71:     if (result == nullptr) {
            >   72:         throw std::bad_alloc();
                73:     }
                74:     return result;
                75: }
                76: 
                77: void* operator new(std::size_t count, std::align_val_t al) {
            

            Here is the call graph:

            Call Site

              00 ntdll!NtWaitForSingleObject+0x14
              01 KERNELBASE!WaitForSingleObjectEx+0x8f
              02 memcached!google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread+0x85
              03 memcached!google_breakpad::ExceptionHandler::HandleException+0xf8
              04 KERNELBASE!UnhandledExceptionFilter+0x157
              05 ntdll!RtlUserThreadStart$filt$0+0x38
              06 ntdll!_C_specific_handler+0x96
              07 ntdll!RtlpExecuteHandlerForException+0xd
              08 ntdll!RtlDispatchException+0x373
              09 ntdll!RtlRaiseException+0x2d9
              0a KERNELBASE!RaiseException+0x68
              0b VCRUNTIME140!_CxxThrowException(void * pExceptionObject = 0x00000080`ac7fefd0, struct _s__ThrowInfo * pThrowInfo = <Value unavailable error>)+0xad
              0c memcached!operator new(unsigned int64 count = <Value unavailable error>)+0x2f
              0d memcached!std::_Default_allocate_traits::_Allocate(void)+0x5 (Inline Function @ 00007ff7`96a0159b)
              0e memcached!std::_Allocate_manually_vector_aligned(void)+0x17 (Inline Function @ 00007ff7`96a0159b)
              0f memcached!std::_Allocate(void)+0x1f (Inline Function @ 00007ff7`96a0159b)
              10 memcached!std::allocator<char>::allocate(unsigned int64 _Count = <Value unavailable error>)+0x1f (Inline Function @ 00007ff7`96a0159b)
              11 memcached!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Reallocate_for(void)+0x66 (Inline Function @ 00007ff7`96a0159b)
              12 memcached!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign(char * _Ptr = 0x0000021a`72570020 "--- memory read error at address 0x0000021a`72570020 ---", unsigned int64 _Count = 0x80000)+0xab
              13 memcached!prometheus::Serializer::Serialize+0x13b
              14 memcached!prometheus::detail::MetricsHandler::handleGet+0x12a
              15 memcached!CivetServer::requestHandler+0xc6
              16 memcached!vsscanf_l+0x4300
              17 memcached!mg_write+0x21f4
              18 memcached!sscanf+0x948
              19 memcached!sscanf+0x6c9
              1a ucrtbase!thread_start<unsigned int +0x40
              1b kernel32!BaseThreadInitThunk+0x14
              1c ntdll!RtlUserThreadStart+0x21
              

            We threw because of a bad alloc.
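
The backtrace shows the std::bad_alloc travelling from operator new straight to UnhandledExceptionFilter, so it seems nothing in the Prometheus request path caught it. As a rough sketch only (try_serialize is a hypothetical helper, not part of memcached or prometheus-cpp), a request handler could contain an allocation failure like this and fail the one request instead of the whole process:

```cpp
#include <new>
#include <string>
#include <utility>

// Hypothetical helper for illustration; not an API of memcached or prometheus-cpp.
// Runs `produce` (any callable returning std::string) and reports an allocation
// failure to the caller instead of letting std::bad_alloc escape the thread.
// Exceptions other than std::bad_alloc still propagate.
template <typename Producer>
bool try_serialize(Producer&& produce, std::string& out) {
    try {
        // `out` is only assigned if produce() returns normally.
        out = std::forward<Producer>(produce)();
        return true;
    } catch (const std::bad_alloc&) {
        // Allocation failed while building the response body;
        // `out` keeps its previous contents.
        return false;
    }
}
```

A handler built this way could answer with an error status when try_serialize returns false. Whether that is the right fix here is a separate question: the failed request in frame 12 was for only 0x80000 bytes (512 KiB), which suggests the memory pressure came from elsewhere.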

            owend Daniel Owen added a comment - - edited

            Therefore resolving as an environmental issue. The box only has 6 GB of RAM:

            [warn]  Installed RAM       : 6140 MB - less than recommended (16384 MB)
            

            Therefore will resolve as "Not a Bug" and suggest running with more memory.
            Changed my mind: this may be a bug (although it might not be in KV).

            owend Daniel Owen added a comment - - edited

            Decided to reopen, as I am looking at a similar error in MB-45861,
            where we are also getting std::bad_alloc.

            owend Daniel Owen added a comment -

            Believe this to be caused by issues highlighted in MB-45061.
            Closing as incomplete.
            If issues persist after MB-45061 is resolved then please reopen.


            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Bulk closing all non-fixed issues. Please reopen if needed.

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal