Uploaded image for project: 'Couchbase C client library libcouchbase'
  1. Couchbase C client library libcouchbase
  2. CCBC-1328

gethrtime appears to have a race.

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 3.0.6
    • library
    • 1

    Description

      One of the c++ transaction beta customers was doing some benchmarking, and ran into some issues.  When exploring their issues, I did some things like create 100 threads, each of which creates a cluster, ultimately calling lcb_connect on their own instances.  I found that I'd sometimes get an EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0) and crash.

       

      Looking at the stack trace, it seems possible that 2 or more threads will call gethrtime simultaneously.  Ultimately, it seems that one thread can call mach_timebase_info(&tmbi) while another one is using that (as they both raced on start, both seeing 0).  My guess is that call results in the tmbi.denom briefly being 0 (before it puts a 1 in there), and that's the divide-by-zero we see.

      The stack trace looks like:

      Process 44694 stopped
      * thread #2, stop reason = EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0)    frame #0: 0x0000000100412cf5 libtransactions_cxx.1.dylib`gethrtime at gethrtime.c:58:40
         55      }   
         56   
         57      now = mach_absolute_time();
      -> 58      return ((now - start) * tmbi.numer / tmbi.denom) + CLOCK_START_OFFSET;   
         59   
         60   #elif defined(HAVE_CLOCK_GETTIME)   
         61      struct timespec tm;'
      Target 0: (bench2) stopped.
      (lldb) bt
      * thread #2, stop reason = EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0)  
      * frame #0: 0x0000000100412cf5 libtransactions_cxx.1.dylib`gethrtime at gethrtime.c:58:40    
       frame #1: 0x0000000100455a6c libtransactions_cxx.1.dylib`lcb::Bootstrap::bootstrap(this=0x0000000104605f50, options=2) at bootstrap.cc:224:20    
      frame #2: 0x00000001004a1583 libtransactions_cxx.1.dylib`lcb_st::bootstrap(this=0x0000000101c046a0, options=2) at internal.h:177:26    
      frame #3: 0x0000000100446c5a libtransactions_cxx.1.dylib`::lcb_connect(instance=0x0000000101c046a0) at instance.cc:740:22    
      frame #4: 0x00000001003c9436 libtransactions_cxx.1.dylib`couchbase::cluster::connect(this=0x000070000e8c8e70) at cluster.cxx:89:10    
      frame #5: 0x00000001003c8f33 libtransactions_cxx.1.dylib`couchbase::cluster::cluster(this=0x000070000e8c8e70, cluster_address=<unavailable>, user_name="", password="") at cluster.cxx:21:5    frame #6: 0x00000001003c9725 libtransactions_cxx.1.dylib`couchbase::cluster::cluster(this=0x000070000e8c8e70, cluster_address=<unavailable>, user_name=<unavailable>, password=<unavailable>) at cluster.cxx:19:1    
      frame #7: 0x0000000100029577 bench2`main::$_0::operator(this=0x0000000101e041a8)() const at bench2.cxx:72:25    
      frame #8: 0x00000001000294ad bench2`decltype(__f=0x0000000101e041a8)()) std::__1::__invoke<main::$_0>(main::$_0&&) at type_traits:4361:1

      The code looks like:

       36 hrtime_t gethrtime(void)
       37 {
       38 #ifdef __APPLE__
       39
       40     /* Most things expect a pretty large timestamp - even though a smaller one
       41      * may be perfectly valid. Initialize the default with an offset of one day,
       42      * in nanoseconds
       43      */
       44
       45     /* Use the various mach stuff:
       46      * https://developer.apple.com/library/mac/qa/qa1398/_index.html */
       47
       48     static uint64_t start = 0;
       49     uint64_t now;
       50     static mach_timebase_info_data_t tmbi;
       51
       52     if (start == 0) {
       53         start = mach_absolute_time();
       54         mach_timebase_info(&tmbi);
       55     }
       56
       57     now = mach_absolute_time();
       58     return ((now - start) * tmbi.numer / tmbi.denom) + CLOCK_START_OFFSET;
       59
       60 #elif defined(HAVE_CLOCK_GETTIME)
      

      which is just the apple part.  There are statics in the other #defines below this, for any OS that has no gethrtime builtin (presumably that is reentrant).  Could be the HAVE_QUERYPERFORMANCE_COUNTER section has this issue as well, on the LARGE_INTEGER pf as I'm just a bit suspicious of QueryPerformanceFrequency.

      Since this is C, there is a compare-and-swap available for all the platforms, but platform dependent.  Doing a compare-and-swap on start seems sufficient.  Maybe there is a better way?  Maybe LCB already has a compare-and-swap implemented for various platforms, so we don't have to make that?

       

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            david.kelly David Kelly (Inactive)
            david.kelly David Kelly (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty