Uploaded image for project: 'Couchbase C client library libcouchbase'
  1. Couchbase C client library libcouchbase
  2. CCBC-972

Core at stop when connection with the cluster is broken

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.9.5
    • 2.9.3, 2.9.4
    • library
    • None
    • Linux
    • 1

    Description

      We are facing some cores at stop time when implementing a new unit test for validating the fix for CCBC-948.

      The purpose of the test is to ensure that Libcouchbase doesn't end up in an infinite retry loop when the remote cluster gets unreachable. To do so, we just start a CouchbaseMock instance, perform a couple of Store/Get operations, kill the mock, and perform the same operations again.

      Thanks to the mentioned fix, the last two operations are now returning a network error, as expected. We are no longer stuck in an infinite loop. But randomly, we get a core at the end of the test. For example, my last core occurrence happened after 20 successful runs.

      The core happens when calling lcb_destroy with a unreachable cluster (or mock). It looks like some items are in the retry queue and lcb_destroy tries to connect again when flushing the items from the queue.

      Here is the callstack I got from gdb:

      #0  0x00007ffff232b964 in strcmp () from /lib64/libc.so.6
      #1  0x0000000000917719 in lcb_host_equals (a=0x3e0, b=0x424fa60)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/hostlist.cc:161
      #2  0x00000000008fcd13 in lcb_st::find_server (this=0x423f270, host=...)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/instance.cc:648
      #3  0x0000000000903b42 in CccpProvider::schedule_next_request(<anonymous enum>, bool) (this=0x423fae0, err=LCB_ERROR,
          can_rollover=<value optimized out>) at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/bucketconfig/bc_cccp.cc:141
      #4  0x0000000000903f45 in CccpProvider::mcio_error(<anonymous enum>) (this=0x423fae0, err=LCB_ERROR)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/bucketconfig/bc_cccp.cc:169
      #5  0x000000000090404a in lcb::clconfig::cccp_update(const void *, <anonymous enum>, const void *, size_t, const lcb_host_t *) (cookie_=0x4240e80,
          err=LCB_ERROR, bytes=0x0, nbytes=0, origin=0x1000)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/bucketconfig/bc_cccp.cc:237
      #6  0x0000000000913a26 in ext_callback_proxy(mc_PIPELINE *, mc_PACKET *, <anonymous enum>, const void *) (pl=0x7fffffffa530,
          req=<value optimized out>, rc=LCB_ERROR, resdata=0x7fffffffa770)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/getconfig.cc:33
      #7  0x000000000091615c in H_config (pipeline=0x7fffffffa530, req=0x4241b90, res=0x7fffffffa770, immerr=LCB_ERROR)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/handler.cc:878
      #8  mcreq_dispatch_response(mc_PIPELINE *, mc_PACKET *, lcb::MemcachedResponse *, <anonymous enum>) (pipeline=0x7fffffffa530, req=0x4241b90,
          res=0x7fffffffa770, immerr=LCB_ERROR) at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/handler.cc:986
      #9  0x000000000094132a in lcb::RetryQueue::fail(lcb::RetryOp *, <anonymous enum>) (this=0x4240210, op=0x4241a00, err=LCB_ERROR)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/retryq.cc:164
      #10 0x000000000094141e in lcb::RetryQueue::~RetryQueue (this=0x4240210, __in_chrg=<value optimized out>)
          at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/retryq.cc:429
      #11 0x00000000008fd7e1 in lcb_destroy (instance=0x423f270) at /gctmp/gmolleda/_src/mdw/LibCouchbase/13-8-4-3/libcouchbase-2.8.4/src/instance.cc:587
      #12 0x00007ffff784c619 in kvclient::CouchbaseCluster::disconnect (this=0x423de90) at src/kvclient/CouchbaseCluster.cpp:608
      

      I've attached the trace of my gdb session for debugging purpose. The display of the lcb::Server object in lcb_st::find_server may be interesting (especially the curhost attribute).

      I was not able to reproduce the same cores with Libcouchbase's unit tests (it happens when the released memory was reused, so it depends on test conditions), but we can easily highlight the memory corruptions using Valgrind/Memcheck.

      I modified the "GetUnitTest.testGetHit" unit test just to kill the mock before performing the read operations. Of course, the test returns an error, but the purpose is only to check how Libcouchbase behaves at exit time.

      Here is how to proceed:

      1. apply the KillMockTest.patch file to the libcouchbase repository.
        Basically, it creates a "MockEnvironment::Stop()" method that kills the mock, and calls it inside the "GetUnitTest.testGetHit" unit test.
      2. Builds the unit test using the "make alltests" command.
        I used the "--enable-debug --enable-embedded-libevent-plugin" config options for building the library, I didn't test what happened in release mode or without libevent. I don't expect many changes.
      3. Run the unit test with Valgrind using the following command:

      valgrind --tool=memcheck  --trace-children=yes --trace-children-skip=*/start_mock.sh bin/check-all "--srcdir" "/home/centos/dev/couchbase/libcouchbase_origin" "--testdir" "/home/centos/dev/couchbase/libcouchbase_origin/build/bin" "--libdir" "/home/centos/dev/couchbase/libcouchbase_origin/build/lib" "--plugins" "select" "--tests" "unit-tests" "--verbose" "--" "--gtest_print_time=1" "–gtest_filter=GetUnitTest.testGetHit"
      

      Valgrind returns a list of invalid memory accesses.

      The first one in the list I got was:

      ==8778== Invalid read of size 8
      ==8778==    at 0x5EF7AC: lcb::Server::get_host() const (mcserver.h:102)
      ==8778==    by 0x5EB8CB: lcb_st::find_server(lcb_host_st const&) const (instance.cc:689)
      ==8778==    by 0x6A79B0: CccpProvider::schedule_next_request(lcb_error_t, bool) (bc_cccp.cc:141)
      ==8778==    by 0x6A8264: CccpProvider::mcio_error(lcb_error_t) (bc_cccp.cc:169)
      ==8778==    by 0x6A88BB: lcb::clconfig::cccp_update(void const*, lcb_error_t, void const*, unsigned long, lcb_host_st const*) (bc_cccp.cc:237)
      ==8778==    by 0x6AD184: ext_callback_proxy(mc_pipeline_st*, mc_packet_st*, lcb_error_t, void const*) (getconfig.cc:31)
      ==8778==    by 0x612A3D: H_config(mc_pipeline_st*, mc_packet_st*, lcb::MemcachedResponse*, lcb_error_t) (handler.cc:906)
      ==8778==    by 0x60E5B6: mcreq_dispatch_response (handler.cc:1022)
      ==8778==    by 0x64B76E: lcb::RetryQueue::fail(lcb::RetryOp*, lcb_error_t) (retryq.cc:164)
      ==8778==    by 0x64C9D1: lcb::RetryQueue::~RetryQueue() (retryq.cc:429)
      ==8778==    by 0x5EB179: lcb_destroy (instance.cc:614)
      ==8778==    by 0x5C4CCE: HandleWrap::destroy() (mock-environment.cc:536)
      ==8778==  Address 0x7e610c0 is 560 bytes inside a block of size 568 free'd
      ==8778==    at 0x4C2B18D: operator delete(void*) (vg_replace_malloc.c:576)
      ==8778==    by 0x63F109: lcb::Server::start_errored_ctx(lcb::Server::State) (mcserver.cc:952)
      ==8778==    by 0x63F245: lcb::Server::close() (mcserver.cc:927)
      ==8778==    by 0x5EB07D: lcb_destroy (instance.cc:603)
      ==8778==    by 0x5C4CCE: HandleWrap::destroy() (mock-environment.cc:536)
      ==8778==    by 0x5C4D85: HandleWrap::~HandleWrap() (mock-environment.cc:548)
      ==8778==    by 0x5CB2A1: GetUnitTest_testGetHit_Test::TestBody() (t_get.cc:119)
      ==8778==    by 0x6E5602: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2078)
      ==8778==    by 0x6D7CB0: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2114)
      ==8778==    by 0x6BE6C6: testing::Test::Run() (gtest.cc:2150)
      ==8778==    by 0x6BF51B: testing::TestInfo::Run() (gtest.cc:2326)
      ==8778==    by 0x6BFD44: testing::TestCase::Run() (gtest.cc:2444)
      ==8778==  Block was alloc'd at
      ==8778==    at 0x4C2A203: operator new(unsigned long) (vg_replace_malloc.c:334)
      ==8778==    by 0x621FF5: lcb_update_vbconfig (newconfig.cc:346)
      ==8778==    by 0x5F6DBA: lcb::Bootstrap::config_callback(lcb::clconfig::EventType, lcb::clconfig::ConfigInfo*) (bootstrap.cc:83)
      ==8778==    by 0x5F6437: lcb::Bootstrap::clconfig_lsn(lcb::clconfig::EventType, lcb::clconfig::ConfigInfo*) (bootstrap.cc:125)
      ==8778==    by 0x5FC3C6: lcb::clconfig::Confmon::invoke_listeners(lcb::clconfig::EventType, lcb::clconfig::ConfigInfo*) (confmon.cc:359)
      ==8778==    by 0x5FC256: lcb::clconfig::Confmon::do_set_next(lcb::clconfig::ConfigInfo*, bool) (confmon.cc:179)
      ==8778==    by 0x5FC7FE: lcb::clconfig::Confmon::provider_got_config(lcb::clconfig::Provider*, lcb::clconfig::ConfigInfo*) (confmon.cc:241)
      ==8778==    by 0x6A870B: CccpProvider::update(char const*, char const*) (bc_cccp.cc:212)
      ==8778==    by 0x6A928B: CccpProvider::on_io_read() (bc_cccp.cc:391)
      ==8778==    by 0x6A98FB: io_read_handler(lcbio_CTX*, unsigned int) (bc_cccp.cc:345)
      ==8778==    by 0x695FD4: invoke_read_cb (ctx.c:274)
      ==8778==    by 0x696083: E_handler (ctx.c:303)
      

      This is exactly the same memory corruption that happens in the cores I got. It's just not failing in the unit test as the memory was not reused yet.

      Attachments

        1. gdb_trace.txt
          14 kB
          Guillaume Molléda
        2. KillMockTest.patch
          1 kB
          Guillaume Molléda
        For Gerrit Dashboard: CCBC-972
        # Subject Branch Project Status CR V

        Activity

          People

            avsej Sergey Avseyev
            gmolleda Guillaume Molléda
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty