  Couchbase Server / MB-27422

Undefined behavior in mapreduce_nif.cc resulting in couchdb vm failing to terminate (or taking a long time to terminate)


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: 5.5.0
    • Fix Version/s: 5.5.0
    • Component/s: view-engine
    • Labels: None
    • Triage: Untriaged
    • Is this a Regression?: Yes

    Description

      I recently noticed that the couchdb vm takes a long time to terminate, or doesn't terminate at all. In my case it affects the cluster_run environment, but from code inspection it seems that it should also apply to regular server termination.

      In cluster_run specifically, we skip the graceful termination and just call erlang:halt/2. Internally, this results in beam.smp calling the exit() function, which in turn runs the atexit() handlers. The following backtrace indicates that one of the atexit handlers locks up:

      Thread 16 (Thread 0x7f7fc10be700 (LWP 1499)):
      #0  0x00007f7fc4d5ee39 in pthread_cond_destroy@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
      #1  0x00007f7fc47c9448 in __run_exit_handlers () from /usr/lib/libc.so.6
      #2  0x00007f7fc47c949a in exit () from /usr/lib/libc.so.6
      #3  0x000055cd4d157904 in erl_exit_vv (n=0, flush_async=flush_async@entry=0, fmt=fmt@entry=0x55cd4d2f8ed8 "", args1=args1@entry=0x7f7fc10bdc18, args2=args2@entry=0x7f7fc10bdc30) at beam/erl_init.c:1884
      #4  0x000055cd4d157bee in erl_exit (n=n@entry=0, fmt=fmt@entry=0x55cd4d2f8ed8 "") at beam/erl_init.c:1893
      #5  0x000055cd4d192c6e in halt_2 (A__p=0x7f7fc3b40328, BIF__ARGS=<optimized out>) at beam/bif.c:3994
      #6  0x000055cd4d26a7d7 in process_main () at beam/beam_emu.c:2579
      #7  0x000055cd4d1aa877 in sched_thread_func (vesdp=0x7f7fc434ddc0) at beam/erl_process.c:5801
      #8  0x000055cd4d2d84c5 in thr_wrapper (vtwd=0x7ffc203fca20) at pthread/ethread.c:106
      #9  0x00007f7fc4d5908a in start_thread () from /usr/lib/libpthread.so.0
      #10 0x00007f7fc488842f in clone () from /usr/lib/libc.so.6
      

      Further investigation showed that the pthread_cond_destroy call that locks up corresponds to the statically allocated condition variable used by mapreduce_nif.cc (http://src.couchbase.org/source/xref/vulcan/couchdb/src/mapreduce/mapreduce_nif.cc#50):

      #0  0x00007fe165e4d2b9 in futex_wait (private=<optimized out>, expected=12, futex_word=0x7fe15cb3f644 <cv+36>) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
      #1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0x7fe15cb3f644 <cv+36>) at ../sysdeps/nptl/futex-internal.h:135
      #2  __pthread_cond_destroy (cond=0x7fe15cb3f620 <cv>) at pthread_cond_destroy.c:54
      #3  0x00007fe1658b9258 in __run_exit_handlers (status=status@entry=0, listp=0x7fe165c326f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:83
      

      So it's cv's destructor that is the culprit. The problem is that calling pthread_cond_destroy on a condition variable that is still in use is undefined behavior. And that is the case here: the condition variable is still being waited on by terminatorLoop():

      Thread 20 (Thread 0x7facd24ec700 (LWP 5290)):
      #0  0x00007fad14833756 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
      #1  0x00007facde3082a5 in __gthread_cond_timedwait (__abs_timeout=0x7facd24ebe60, __mutex=<optimized out>, __cond=0x7facde5455e0 <cv>) at /usr/include/c++/7.2.1/x86_64-pc-linux-gnu/bits/gthr-default.h:871
      #2  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=<synthetic pointer>..., this=0x7facde5455e0 <cv>) at /usr/include/c++/7.2.1/condition_variable:166
      #3  std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=<synthetic pointer>..., this=0x7facde5455e0 <cv>) at /usr/include/c++/7.2.1/condition_variable:106
      #4  std::condition_variable::wait_for<long, std::ratio<1l, 1000l> > (__rtime=..., __lock=<synthetic pointer>..., this=0x7facde5455e0 <cv>) at /usr/include/c++/7.2.1/condition_variable:138
      #5  terminatorLoop (args=<optimized out>) at /home/shaleny/dev/membase/repo-master/couchdb/src/mapreduce/mapreduce_nif.cc:542
      #6  0x0000558eb9b5e4c5 in thr_wrapper (vtwd=0x7fad1137eb00) at pthread/ethread.c:106
      #7  0x00007fad1482d08a in start_thread () from /usr/lib/libpthread.so.0
      #8  0x00007fad1435c42f in clone () from /usr/lib/libc.so.6
      

      So in my particular case it locks up the vm. Some other system might launch nuclear missiles.
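
      To illustrate the failure mode outside of couchdb, here is a minimal standalone C++ sketch (not the actual mapreduce_nif.cc code; all names are illustrative): a statically allocated condition variable is destroyed by the exit handlers while a detached thread is still waiting on it.

      #include <chrono>
      #include <condition_variable>
      #include <cstdlib>
      #include <mutex>
      #include <thread>

      // Statically allocated synchronization objects: their destructors run
      // from the exit handlers, just like cv in mapreduce_nif.cc.
      static std::mutex mtx;
      static std::condition_variable cv;

      // Stand-in for terminatorLoop(): waits on cv with a timeout, forever.
      static void terminator_loop() {
          std::unique_lock<std::mutex> lock(mtx);
          for (;;) {
              cv.wait_for(lock, std::chrono::milliseconds(100));
          }
      }

      int main() {
          std::thread(terminator_loop).detach();
          std::this_thread::sleep_for(std::chrono::milliseconds(50));

          // exit() runs the atexit handlers and static destructors; cv's
          // destructor (pthread_cond_destroy underneath) is invoked while
          // terminator_loop() is still waiting on cv, which is undefined
          // behavior and can hang, as observed in the backtraces above.
          std::exit(0);
      }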

      Probably the easiest way to avoid the issue (and that's the workaround I applied locally) is to not have statically allocated objects at all and instead allocate them dynamically once the onLoad function is called. But I'm not insisting on any specific fix.
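
      As a rough sketch of that workaround (names below are illustrative, not the actual mapreduce_nif.cc identifiers), the synchronization state could be heap-allocated from the load callback and intentionally never destroyed, so no destructor can race with threads that still use it at process exit:

      #include <condition_variable>
      #include <mutex>

      // Synchronization state bundled together; illustrative only.
      struct TerminatorState {
          std::mutex mtx;
          std::condition_variable cv;
      };

      // Heap-allocated once and intentionally never deleted, so no destructor
      // (and hence no pthread_cond_destroy) runs at process exit; the OS
      // reclaims the memory anyway.
      static TerminatorState *terminatorState = nullptr;

      // Hypothetical init hook, to be called from the NIF's onLoad callback
      // before the terminator thread is started.
      static void initTerminatorState() {
          if (terminatorState == nullptr) {
              terminatorState = new TerminatorState();
          }
      }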


          People

            Assignee: Harsha HS (Inactive)
            Reporter: Aliaksey Artamonau (Inactive)
            Votes: 0
            Watchers: 8

