Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
3.0.6
-
1
Description
One of the c++ transaction beta customers was doing some benchmarking, and ran into some issues. When exploring their issues, I did some things like create 100 threads, each of which creates a cluster, ultimately calling lcb_connect on their own instances. I found that I'd sometimes get an EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0) and crash.
Looking at the stack trace, it seems possible that 2 or more threads will call gethrtime simultaneously. Ultimately, it seems that one thread can call mach_timebase_info(&tmbi) while another one is using that (as they both raced on start, both seeing 0). My guess is that call results in the tmbi.denom briefly being 0 (before it puts a 1 in there), and that's the divide-by-zero we see.
The stack trace looks like:
Process 44694 stopped
|
* thread #2, stop reason = EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0) frame #0: 0x0000000100412cf5 libtransactions_cxx.1.dylib`gethrtime at gethrtime.c:58:40
|
55 }
|
56
|
57 now = mach_absolute_time();
|
-> 58 return ((now - start) * tmbi.numer / tmbi.denom) + CLOCK_START_OFFSET;
|
59
|
60 #elif defined(HAVE_CLOCK_GETTIME)
|
61 struct timespec tm;'
|
Target 0: (bench2) stopped.
|
(lldb) bt
|
* thread #2, stop reason = EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0)
|
* frame #0: 0x0000000100412cf5 libtransactions_cxx.1.dylib`gethrtime at gethrtime.c:58:40
|
frame #1: 0x0000000100455a6c libtransactions_cxx.1.dylib`lcb::Bootstrap::bootstrap(this=0x0000000104605f50, options=2) at bootstrap.cc:224:20
|
frame #2: 0x00000001004a1583 libtransactions_cxx.1.dylib`lcb_st::bootstrap(this=0x0000000101c046a0, options=2) at internal.h:177:26
|
frame #3: 0x0000000100446c5a libtransactions_cxx.1.dylib`::lcb_connect(instance=0x0000000101c046a0) at instance.cc:740:22
|
frame #4: 0x00000001003c9436 libtransactions_cxx.1.dylib`couchbase::cluster::connect(this=0x000070000e8c8e70) at cluster.cxx:89:10
|
frame #5: 0x00000001003c8f33 libtransactions_cxx.1.dylib`couchbase::cluster::cluster(this=0x000070000e8c8e70, cluster_address=<unavailable>, user_name="", password="") at cluster.cxx:21:5 frame #6: 0x00000001003c9725 libtransactions_cxx.1.dylib`couchbase::cluster::cluster(this=0x000070000e8c8e70, cluster_address=<unavailable>, user_name=<unavailable>, password=<unavailable>) at cluster.cxx:19:1
|
frame #7: 0x0000000100029577 bench2`main::$_0::operator(this=0x0000000101e041a8)() const at bench2.cxx:72:25
|
frame #8: 0x00000001000294ad bench2`decltype(__f=0x0000000101e041a8)()) std::__1::__invoke<main::$_0>(main::$_0&&) at type_traits:4361:1
|
The code looks like:
36 hrtime_t gethrtime(void) |
37 { |
38 #ifdef __APPLE__ |
39 |
40 /* Most things expect a pretty large timestamp - even though a smaller one |
41 * may be perfectly valid. Initialize the default with an offset of one day,
|
42 * in nanoseconds
|
43 */
|
44 |
45 /* Use the various mach stuff: |
46 * https://developer.apple.com/library/mac/qa/qa1398/_index.html */
|
47 |
48 static uint64_t start = 0; |
49 uint64_t now; |
50 static mach_timebase_info_data_t tmbi; |
51 |
52 if (start == 0) { |
53 start = mach_absolute_time(); |
54 mach_timebase_info(&tmbi); |
55 } |
56 |
57 now = mach_absolute_time(); |
58 return ((now - start) * tmbi.numer / tmbi.denom) + CLOCK_START_OFFSET; |
59 |
60 #elif defined(HAVE_CLOCK_GETTIME) |
which is just the apple part. There are statics in the other #defines below this, for any OS that has no gethrtime builtin (presumably that is reentrant). Could be the HAVE_QUERYPERFORMANCE_COUNTER section has this issue as well, on the LARGE_INTEGER pf as I'm just a bit suspicious of QueryPerformanceFrequency.
Since this is C, there is a compare-and-swap available for all the platforms, but platform dependent. Doing a compare-and-swap on start seems sufficient. Maybe there is a better way? Maybe LCB already has a compare-and-swap implemented for various platforms, so we don't have to make that?