Uploaded image for project: 'Couchbase C client library libcouchbase'
  1. Couchbase C client library libcouchbase
  2. CCBC-707

completion-io: crash on async destruction with pending writes

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.6.2
    • Component/s: None
    • Labels:
      None

      Description

      I do not have a condensed test case that can reproduce this, but it is happening in a few of our applications. Here is a backtrace:

      (lldb) bt    
      * thread #1: tid = 0, 0x00007f623f2905b9 couchbase_impl.node`on_flush_done(ctx=0x0000000004076720, expected=2543, actual=2543) + 56 at mcserver.c:84, name = 'node', stop reason = signal SIGSEGV
        * frame #0: 0x00007f623f2905b9 couchbase_impl.node`on_flush_done(ctx=0x0000000004076720, expected=2543, actual=2543) + 56 at mcserver.c:84
          frame #1: 0x00007f623f2cfce8 couchbase_impl.node`Cw_ex_handler(sd=0x0000000004019ca0, status=0, wdata=0x00000000000009ef) + 68 at ctx.c:527
          frame #2: 0x00007f623f28fbb5 couchbase_impl.node`write2_callback(req=0x0000000004035e30, status=0) + 95 at plugin-libuv.c:383
          frame #3: 0x0000000001b0f9e4 node`uv__write_callbacks(stream=0x0000000004019cf8) + 328 at stream.c:940
          frame #4: 0x0000000001b0e7e9 node`uv__stream_destroy(stream=0x0000000004019cf8) + 312 at stream.c:450
          frame #5: 0x0000000001b019f5 node`uv__finish_close(handle=0x0000000004019cf8) + 145 at core.c:256
          frame #6: 0x0000000001b01af9 node`uv__run_closing_handles(loop=0x000000000285c3e0) + 68 at core.c:286
          frame #7: 0x0000000001b01d17 node`uv_run(loop=0x000000000285c3e0, mode=UV_RUN_ONCE) + 202 at core.c:356
          frame #8: 0x0000000001953fec node`node::StartNodeInstance(arg=0x00007ffd822ccbe0) + 727 at node.cc:4394
          frame #9: 0x00000000019542f1 node`node::Start(argc=2, argv=0x0000000003f87bc0) + 241 at node.cc:4474
          frame #10: 0x000000000198918e node`main(argc=3, argv=0x00007ffd822ccd48) + 62 at node_main.cc:53
          frame #11: 0x00007f6241578b45 libc.so.6`__libc_start_main + 245
          frame #12: 0x0000000000ded529 node
      (lldb) f 0
      frame #0: 0x00007f623f2905b9 couchbase_impl.node`on_flush_done(ctx=0x0000000004076720, expected=2543, actual=2543) + 56 at mcserver.c:84
         81  	    mc_SERVER *server = lcbio_ctx_data(ctx);
         82  	    lcb_U64 now = 0;
         83  	
      -> 84  	    if (LCBT_SETTING(server->instance, readj_ts_wait)) {
         85  	        now = gethrtime();
         86  	    }
         87  	
      

      I have a core dump (and would be happy to share if that will help) and it looks like server->instance->settings is garbage when this is hit.

        Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

          Hide
          mnunberg Mark Nunberg (Inactive) added a comment - - edited

          And I'm assuming the crash happens at the end of the log?

          That crash dump is unfortunately not very useful without having access to the actual binaries which generated it. I suppose if your C-fu is up to par, you could go around poking at some of the internal structures. I'd be interested in seeing what the original request was in the first place; i.e. that which write() is in response to; perhaps deconstructing the buffers (even knowing how large the buffers are might give a hint)

          Show
          mnunberg Mark Nunberg (Inactive) added a comment - - edited And I'm assuming the crash happens at the end of the log? That crash dump is unfortunately not very useful without having access to the actual binaries which generated it. I suppose if your C-fu is up to par, you could go around poking at some of the internal structures. I'd be interested in seeing what the original request was in the first place; i.e. that which write() is in response to; perhaps deconstructing the buffers (even knowing how large the buffers are might give a hint)
          Hide
          evanlucas Evan Lucas added a comment -

          Ok, so it looks to be a memcached request (sorry for not having more details at this point). I've got a (somewhat) reproducible test case that can trigger this (and it is setup to run inside lldb) I'll attach a tarball that includes it all.

          Show
          evanlucas Evan Lucas added a comment - Ok, so it looks to be a memcached request (sorry for not having more details at this point). I've got a (somewhat) reproducible test case that can trigger this (and it is setup to run inside lldb) I'll attach a tarball that includes it all.
          Hide
          evanlucas Evan Lucas added a comment - - edited

          This is a tarball of a somewhat reproducible test case. Thanks!

          EDIT: Look above for the attachments section

          Show
          evanlucas Evan Lucas added a comment - - edited This is a tarball of a somewhat reproducible test case. Thanks! EDIT: Look above for the attachments section
          Hide
          mnunberg Mark Nunberg (Inactive) added a comment -

          So the issue seems to be caused by a delayed write-done notification from libuv after the library handle has already been closed. This is a bug on our part. I'm still trying to figure out a graceful way to solve this, but in the meantime spinning the event loop for a bit before calling .disconnect() might help; after all it's really just a matter of ensuring libcouchbase ends up receiving the write-done callback before lcb_destroy_async is called.

          Show
          mnunberg Mark Nunberg (Inactive) added a comment - So the issue seems to be caused by a delayed write-done notification from libuv after the library handle has already been closed. This is a bug on our part. I'm still trying to figure out a graceful way to solve this, but in the meantime spinning the event loop for a bit before calling .disconnect() might help; after all it's really just a matter of ensuring libcouchbase ends up receiving the write-done callback before lcb_destroy_async is called.
          Hide
          mnunberg Mark Nunberg (Inactive) added a comment -

          I've checked and this can be fixed fairly simply by removing reference to the `instance` (`lcb_t`); which is deliberately decoupled from `lcb_settings`

          Instead of line 84 being:

              if (LCBT_SETTING(server->instance, readj_ts_wait)) {
          

          It should be

            if (server->settings->readj_ts_wait) {
          

          The reason the server structure maintains its own dedicated reference to `settings` is self-evident here

          Show
          mnunberg Mark Nunberg (Inactive) added a comment - I've checked and this can be fixed fairly simply by removing reference to the `instance` (`lcb_t`); which is deliberately decoupled from `lcb_settings` Instead of line 84 being: if (LCBT_SETTING(server->instance, readj_ts_wait)) { It should be if (server->settings->readj_ts_wait) { The reason the server structure maintains its own dedicated reference to `settings` is self-evident here

            People

            • Assignee:
              mnunberg Mark Nunberg (Inactive)
              Reporter:
              evanlucas Evan Lucas
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes

                  PagerDuty

                  Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.