Couchbase Server / MB-37109

Rebalance fails and memcached crashes seen in Ephemeral rebalance-in/out tests

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 6.5.0
    • Fix Version/s: None
    • Component/s: couchbase-bucket
    • Build: 6.5.0-4908

    Description

      Script to Repro

      ./testrunner -i /tmp/testexec.6198.ini -p get-cbcollect-info=False,bucket_type=ephemeral,GROUP=P1_Set2,get-cbcollect-info=True -t rebalance.rebalanceinout.RebalanceInOutTests.test_incremental_rebalance_in_out_with_mutation_and_expiration,items=100000,value_size=512,max_verify=100000,zone=2,GROUP=IN_OUT;P1;P1_Set2
      

      Test to Repro

      Rebalances nodes into and out of the cluster while performing mutations
      and expirations. Use the 'zone' param (zone > 1) to divide the nodes
      into server groups.

      This test begins by loading a given number of items into the cluster.
      It then adds one node, rebalances that node into the cluster, and then
      rebalances it back out. During the rebalancing we update half of the
      items in the cluster and expire the other half. Once the node has been
      removed and added back, we recreate the expired items, wait for the
      disk queues to drain, and then verify that there has been no data loss
      and that sum(curr_items) matches curr_items_total. We then remove and
      add back two nodes at a time, and so on, until we have reached the
      point where we are adding back and removing at least half of the nodes.
      

      Rebalance failure

      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_status_and_progress] {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      2019-12-01 22:08:23 | INFO | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.104.211:
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@172.23.104.216\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:20.343Z', u'module': u'ns_memcached', u'tstamp': 1575266900343, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u"Control connection to memcached on 'ns_1@172.23.104.216' disconnected. Check logs for details.", u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:19.303Z', u'module': u'ns_memcached', u'tstamp': 1575266899303, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u"Service 'memcached' exited with status 134. Restarting. Messages:\n2019-12-01T22:08:19.277234-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f6273aac000+0x8f213]\n2019-12-01T22:08:19.277277-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x74098]\n2019-12-01T22:08:19.277296-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77434]\n2019-12-01T22:08:19.277314-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77843]\n2019-12-01T22:08:19.277334-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77924]\n2019-12-01T22:08:19.277352-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x809f9]\n2019-12-01T22:08:19.277373-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x12f964]\n2019-12-01T22:08:19.277385-08:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f6275955000+0x8ee7]\n2019-12-01T22:08:19.277401-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f6273377000+0x7dd5]\n2019-12-01T22:08:19.277475-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f6272faa000+0xfdead]", u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:19.297Z', u'module': u'ns_log', u'tstamp': 1575266899297, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.211', u'code': 0, u'text': u'auto-reprovision is disabled as maximum number of nodes (1) that can be auto-reprovisioned has been reached.', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:18.669Z', u'module': u'auto_reprovision', u'tstamp': 1575266898669, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.211', u'code': 0, u'text': u'Bucket "default" has been reprovisioned on following nodes: [\'ns_1@172.23.104.220\']. Nodes on which the data service restarted: [\'ns_1@172.23.104.220\',\n                                                                                                                                 \'ns_1@172.23.104.243\'].', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:18.668Z', u'module': u'auto_reprovision', u'tstamp': 1575266898668, u'type': u'info'}
      

      Backtrace from gdb

      (gdb) bt
      #0  0x00007f6272fe0207 in __gconv_transform_internal_ucs2reverse () from /usr/lib64/libc-2.17.so
      #1  0x0000000000000006 in ?? ()
      #2  0x00007f6273025dc3 in wprintf () from /usr/lib64/libc-2.17.so
      #3  0x0000000000000001 in ?? ()
      #4  0x0000000a3affb1f0 in ?? ()
      #5  0x000000020000000e in ?? ()
      #6  0x00007f623affd600 in ?? ()
      #7  0x00007f623affb190 in ?? ()
      #8  0x00007f6271b5f400 in ?? ()
      #9  0x0000000000000068 in ?? ()
      #10 0x000000003affd600 in ?? ()
      #11 0x00007f623affb230 in ?? ()
      #12 0x00007f623affbe20 in ?? ()
      #13 0x0000000000000068 in ?? ()
      #14 0x00007f6272a00980 in ?? ()
      #15 0x00007f6274e5fd58 in tcache_alloc_small (slow_path=false, zero=false, binind=10, size=0, tcache=0x7f62730258ce <putwc_unlocked+30>, arena=<optimized out>, tsd=<optimized out>) at include/jemalloc/internal/tcache_inlines.h:60
      #16 arena_malloc (slow_path=false, tcache=0x7f62730258ce <putwc_unlocked+30>, zero=false, ind=10, size=0, arena=0x0, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:165
      #17 iallocztm (slow_path=false, arena=0x0, is_internal=false, tcache=0x7f62730258ce <putwc_unlocked+30>, zero=false, ind=10, size=0, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
      #18 imalloc_no_sample (ind=10, usize=0, size=0, tsd=0x7f627336d3a0 <_IO_obstack_jumps+128>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:1949
      #19 imalloc_body (tsd=0x7f627336d3a0 <_IO_obstack_jumps+128>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2123
      #20 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2258
      #21 je_malloc_default (size=<optimized out>) at src/jemalloc.c:2289
      #22 0x00007f627596043c in cb_malloc (size=0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_malloc.cc:51
      #23 0x00007f6276a000b9 in operator new (count=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/global_new_replacement.cc:71
      #24 0x00007f626e4faf71 in MutationResponse (sid=..., enableExpiryOut=Yes, includeCollectionID=(unknown: 32), includeDeleteTime=(unknown: 162), includeXattrs=Yes, includeVal=Yes, opaque=2, item=..., this=0x7f6238814c10)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/response.h:429
      #25 make_unique<MutationResponse, SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > const&, unsigned int const&, IncludeValue const&, IncludeXattrs const&, IncludeDeleteTime const&, DocKeyEncodesCollectionId const&, EnableExpiryOutput const&, cb::mcbp::DcpStreamId const&> () at /usr/local/include/c++/7.3.0/bits/unique_ptr.h:825
      #26 ActiveStream::makeResponseFromItem (this=<optimized out>, item=..., sendCommitSyncWriteAs=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1029
      #27 0x00007f626e4ff434 in ActiveStream::processItems (this=0x7f623affb3b0, this@entry=0x7f6238814c10, outstandingItemsResult=..., streamMutex=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1101
      #28 0x00007f626e4ff843 in ActiveStream::nextCheckpointItemTask (this=this@entry=0x7f6238814c10, streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:868
      #29 0x00007f626e4ff924 in ActiveStream::nextCheckpointItemTask (this=0x7f6238814c10) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:858
      #30 0x00007f626e5089f9 in ActiveStreamCheckpointProcessorTask::run (this=0x7f6238819110) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream_checkpoint_processor_task.cc:56
      #31 0x00007f626e5b7964 in ExecutorThread::run (this=0x7f6271b97960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:187
      #32 0x00007f627595dee7 in run (this=0x7f6271a6e670) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
      #33 platform_thread_wrap (arg=0x7f6271a6e670) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
      #34 0x00007f627337edd5 in start_thread () from /usr/lib64/libpthread-2.17.so
      #35 0x00007f62730a7ead in tdestroy_recurse () from /usr/lib64/libc-2.17.so
      #36 0x0000000000000000 in ?? ()
      (gdb) 
      

      cbcollect_info attached.
      Last successful run was on 6.5.0-4897.

          Activity

            drigby Dave Rigby added a comment -

            First Critical error message seen on .243:

            2019-12-01T22:08:14.137299-08:00 CRITICAL *** Fatal error encountered during exception handling ***
            2019-12-01T22:08:14.137371-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1185
            

            Which results in the process terminating:

            2019-12-01T22:08:14.203342-08:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.0-4908). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/6a82f3ec-1a47-1e49-28df0b3e-4ea8477a.dmp before terminating.
            2019-12-01T22:08:14.203385-08:00 CRITICAL Stack backtrace of crashed thread:
            2019-12-01T22:08:14.203636-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x13243d]
            2019-12-01T22:08:14.203664-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ce) [0x400000+0x14a47e]
            2019-12-01T22:08:14.203675-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0x94) [0x400000+0x14a794]
            2019-12-01T22:08:14.203696-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f573267a000+0xf5d0]
            2019-12-01T22:08:14.203725-08:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f57322ad000+0x36207]
            2019-12-01T22:08:14.203750-08:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f57322ad000+0x378f8]
            2019-12-01T22:08:14.203799-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f5732daf000+0x91195]
            2019-12-01T22:08:14.203821-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x145d72]
            2019-12-01T22:08:14.203848-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f5732daf000+0x8ef86]
            2019-12-01T22:08:14.203885-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f5732daf000+0x8efd1]
            2019-12-01T22:08:14.203911-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f5732daf000+0x8f213]
            2019-12-01T22:08:14.203937-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x74098]
            2019-12-01T22:08:14.203946-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x77434]
            2019-12-01T22:08:14.203964-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x77843]
            2019-12-01T22:08:14.203972-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x77924]
            2019-12-01T22:08:14.203978-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x809f9]
            2019-12-01T22:08:14.203988-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x12f964]
            2019-12-01T22:08:14.203996-08:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f5734c58000+0x8ee7]
            2019-12-01T22:08:14.204006-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f573267a000+0x7dd5]
            2019-12-01T22:08:14.204181-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f57322ad000+0xfdead]
            

            The aforementioned precondition failure is in ActiveStream::snapshot:

                    // If the stream supports SyncRep then send the HCS for CktpType::disk
                    const auto sendHCS = supportSyncReplication() && isCkptTypeDisk;
                    const auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
                    if (sendHCS) {
                        Expects(hcsToSend.is_initialized());
                        log(spdlog::level::level_enum::info,
                            "{} ActiveStream::snapshot: Sending disk snapshot with start "
                            "seqno {}, end seqno {}, and"
                            " high completed seqno {}",
                            logPrefix,
                            snapStart,
                            snapEnd,
                            hcsToSend);
                    }
            
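            To make the failure mode concrete, the following is a minimal,
            self-contained sketch (not the actual kv_engine code; Expects and
            fail_fast here are simplified stand-ins for the gsl-lite versions
            the server uses) showing how this precondition fires when a
            SyncRep-capable stream handles a disk checkpoint whose HCS was
            never initialised:

            #include <boost/optional.hpp>
            #include <cstdint>
            #include <stdexcept>

            // Simplified stand-ins for gsl-lite's fail_fast / Expects.
            struct fail_fast : std::logic_error {
                using std::logic_error::logic_error;
            };

            static void Expects(bool cond) {
                if (!cond) {
                    throw fail_fast("GSL: Precondition failure");
                }
            }

            // Mirrors the quoted logic: the decision to send the HCS ignores
            // whether the optional actually holds a value.
            void snapshot(bool supportSyncReplication,
                          bool isCkptTypeDisk,
                          boost::optional<uint64_t> highCompletedSeqno) {
                const auto sendHCS = supportSyncReplication && isCkptTypeDisk;
                const auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
                if (sendHCS) {
                    Expects(hcsToSend.is_initialized()); // throws if HCS empty
                }
            }

            int main() {
                // SyncRep stream + Disk checkpoint, but no HCS supplied -
                // matches frame #8 below (m_initialized = false, sendHCS = true).
                // The uncaught fail_fast terminates the process, as memcached did.
                snapshot(true, true, boost::none);
            }
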

            The previously reported GDB backtrace looks incorrect, probably due to a wrong debug symbols package. Investigating...

            drigby Dave Rigby added a comment - edited

            Duplicate of MB-37103

            Edit: This MB is with an Ephemeral bucket; the other is with a persistent bucket, so they might actually have different root causes...

            drigby Dave Rigby added a comment -

            (Continuing analysis on this instance of the bug, given I already have an environment set up for those logs...)

            Backtrace with symbols is:

            (gdb) bt full
            ...
            #6  0x00007f5732e3e213 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f56ec0009f0, tinfo=tinfo@entry=0x7f572dd9d110 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7f572d8dfc90 <gsl::fail_fast::~fail_fast()>)
                at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
                    globals = <optimized out>
                    header = 0x7f56ec000970
            #7  0x00007f572d8fc098 in fail_fast_assert (message=0x7f572daf0a48 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1185", cond=false)
                at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
            No locals.
            #8  ActiveStream::snapshot (this=this@entry=0x7f56f311b910, checkpointType=<optimized out>, items=..., highCompletedSeqno=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1185
                    flags = <optimized out>
                    snapEnd = 99
                    hcsToSend = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = false, m_storage = 140012460016000}, <No data fields>}
                    isCkptTypeDisk = <optimized out>
                    snapStart = 0
                    sendHCS = true
            #9  0x00007f572d8ff434 in ActiveStream::processItems (this=this@entry=0x7f56f311b910, outstandingItemsResult=..., streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1101
                    mutations = {<std::_Deque_base<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> >, std::allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > > >> = {
                        _M_impl = {<std::allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > >> = {<__gnu_cxx::new_allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > >> = {<No data fields>}, <No data fields>}, 
                          _M_map = 0x7f57176d1400, _M_map_size = 8, _M_start = {_M_cur = 0x7f56f2b35000, _M_first = 0x7f56f2b35000, _M_last = 0x7f56f2b35200, _M_node = 0x7f57176d1418}, _M_finish = {_M_cur = 0x7f56f2b35718, _M_first = 0x7f56f2b35600, _M_last = 0x7f56f2b35800, 
                            _M_node = 0x7f57176d1420}}}, <No data fields>}
            #10 0x00007f572d8ff843 in ActiveStream::nextCheckpointItemTask (this=this@entry=0x7f56f311b910, streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:868
                    res = {checkpointType = Disk, items = {<std::_Vector_base<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, std::allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > > >> = {
                          _M_impl = {<std::allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > >> = {<__gnu_cxx::new_allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x7f571770a800, 
                            _M_finish = 0x7f571770ab40, _M_end_of_storage = 0x7f571770ac00}}, <No data fields>}, highCompletedSeqno = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, 
                          m_initialized = false, m_storage = 140012460016000}, <No data fields>}}
                    vbucket = {<std::__shared_ptr<VBucket, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<VBucket, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f56fcddda00, _M_refcount = {_M_pi = 0x7f56fcddc440}}, <No data fields>}
            #11 0x00007f572d8ff924 in ActiveStream::nextCheckpointItemTask (this=0x7f56f311b910) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:858
                    lh = {_M_device = @0x7f56f311b978}
            #12 0x00007f572d9089f9 in ActiveStreamCheckpointProcessorTask::run (this=0x7f56f31ffa10) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream_checkpoint_processor_task.cc:56
                    rh = {<StreamContainer<std::shared_ptr<Stream> >::Iterable<std::_Fwd_list_const_iterator<std::shared_ptr<Stream> > >> = {itr = {_M_node = 0x7f5730e39a20}, endItr = {_M_node = 0x0}, before = {_M_node = <optimized out>}}, readLock = {_M_pm = 0x7f571770dd28, 
                        _M_owns = <optimized out>}, container = <optimized out>}
                    streams = {<std::__shared_ptr<StreamContainer<std::shared_ptr<Stream> >, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<StreamContainer<std::shared_ptr<Stream> >, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, 
                        _M_ptr = 0x7f571770dd10, _M_refcount = {_M_pi = 0x7f571770dd00}}, <No data fields>}
                    iterations = 0
                    expected = 16
            #13 0x00007f572d9b7964 in ExecutorThread::run (this=0x7f5730f9d960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:187
                    curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7f56f31a8c80 <Address 0x7f56f31a8c80 out of bounds>}, _M_string_length = 107, {
                        _M_local_buf = "k\000\000\000\000\000\000\000\333\324\354\062W\177\000", _M_allocated_capacity = 107}}
                    woketime = <optimized out>
                    scheduleOverhead = <optimized out>
                    again = <optimized out>
                    runtime = <optimized out>
                    q = <optimized out>
                    tick = 173 '\255'
            #14 0x00007f5734c60ee7 in run (this=0x7f5730e82fd0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
            No locals.
            #15 platform_thread_wrap (arg=0x7f5730e82fd0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
                    context = {_M_t = {
                        _M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7f5730e82fd0}, <No data fields>}, <No data fields>}}}
            #16 0x00007f5732681dd5 in start_thread (arg=0x7f56fe7fc700) at pthread_create.c:307
                    __res = <optimized out>
                    pd = 0x7f56fe7fc700
                    now = <optimized out>
                    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140011613701888, 4224282076292804119, 0, 8392704, 0, 140011613701888, -4309286418460742121, -4309435791131447785}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, 
                          canceltype = 0}}}
                    not_first_call = <optimized out>
                    pagesize_m1 = <optimized out>
                    sp = <optimized out>
                    freesize = <optimized out>
            #17 0x00007f57323aaead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
            

            Many of the variables have been optimised out; however, we can see that snapEnd=99.

            The logs immediately before the crash were for vb:850 (consumer at .242), so this is likely to be the vBucket affected:

            2019-12-01T22:08:14.136425-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
            2019-12-01T22:08:14.136487-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default" backfill:true, seqno:1
            2019-12-01T22:08:14.136509-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2019-12-01T22:08:14.137299-08:00 CRITICAL *** Fatal error encountered during exception handling ***
            

            Looking at the consumer node (.242) at the same time, we do see the connection to .243 (producer) get closed, with the streams for vbs 849, 850 and 852 getting closed:

            2019-12-01T22:08:14.212202-08:00 INFO 55: (No Engine) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - Removing connection [ 127.0.0.1:42686 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-12-01T22:08:14.212235-08:00 WARNING 55: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:852) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            2019-12-01T22:08:14.212242-08:00 WARNING 55: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            2019-12-01T22:08:14.212248-08:00 WARNING 55: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:849) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            

            Assuming it was vb:850 which hit this problem, the history of it on node .243 (producer) is:

            # Created as Pending:
            2019-12-01T22:07:05.874926-08:00 INFO (default) VBucket: created vb:850 with state:pending initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:122965773455347 topology:null
            # Populated from node .211, up to seqno 99:
            2019-12-01T22:07:05.877268-08:00 INFO 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Attempting to add stream: opaque_:42, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:122965773455347, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
            ...
            2019-12-01T22:07:06.132700-08:00 INFO 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Setting stream to dead state, last_seqno is 99, unAckedBytes is 0, status is The stream closed due to a close stream message
            ...
            # Changed to Active:
            2019-12-01T22:07:06.142155-08:00 INFO (default) VBucket::setState: transitioning vb:850 with high seqno:99 from:pending to:active
            # Topology added:
            2019-12-01T22:07:06.258430-08:00 INFO (default) VBucket::setState: transitioning vb:850 with high seqno:99 from:active to:active meta:{"topology":[["ns_1@172.23.104.243","ns_1@172.23.104.245"]]}
            ...
            # Crash occurs when setting up stream to .242:
            2019-12-01T22:08:14.136509-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2019-12-01T22:08:14.137299-08:00 CRITICAL *** Fatal error encountered during exception handling ***
            

            Following the sequence back, node .243 obtained vb:850 from node .211 at:

            2019-12-01T22:07:05.877268-08:00 INFO 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Attempting to add stream: opaque_:42, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:122965773455347, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
            

            Looking at node .211 we see the following snapshot range (note this is Ephemeral):

            2019-12-01T22:07:05.879986-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
            2019-12-01T22:07:05.880014-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default" backfill:true, seqno:100
            2019-12-01T22:07:05.880074-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Scheduling backfill from 1 to 99, reschedule flag : False
            2019-12-01T22:07:05.880576-08:00 INFO (default) vb:850 Created range iterator from 1 to 100
            2019-12-01T22:07:05.880601-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::markDiskSnapshot: Sending disk snapshot with start seqno 0, end seqno 99, and high completed seqno --
            2019-12-01T22:07:05.880655-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default", backfill:true, seqno:100
            2019-12-01T22:07:05.881167-08:00 INFO (default) vb:850 Releasing the range iterator
            2019-12-01T22:07:05.881313-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Backfill complete, 0 items read from disk, 99 from memory, last seqno read: 99, pendingBackfill : False
            

            Note this log message, which highlights the problem - the high completed seqno is null (--):

            2019-12-01T22:07:05.880601-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::markDiskSnapshot: Sending disk snapshot with start seqno 0, end seqno 99, and high completed seqno --
            

            drigby Dave Rigby added a comment -

            I suspect the following code is the cause of the issue - note the *highCompletedSeqno != 0 clause:

            // If the stream supports SyncRep then send the HCS in the
            // SnapshotMarker if it is not 0
            auto sendHCS = supportSyncReplication() && highCompletedSeqno &&
                           *highCompletedSeqno != 0;
            auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
            log(spdlog::level::level_enum::info,
                "{} ActiveStream::markDiskSnapshot: Sending disk snapshot with "
                "start seqno {}, end seqno {}, and"
                " high completed seqno {}",
                logPrefix,
                startSeqno,
                endSeqno,
                hcsToSend);
            

            On a bucket which has never performed any SyncWrites, the HCS will indeed be zero. This causes sendHCS to be false, and hence hcsToSend will be boost::none. This results in the Checkpoint on the recipient being marked as a Disk Snapshot but without an HCS, which ultimately causes the crash seen when this node later becomes Active and attempts to stream out a Snapshot marker.
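
            As an illustration of the suspected mismatch (a sketch, not the
            actual kv_engine code; the function names here are made up), the
            producer-side suppression and the later snapshot() precondition
            disagree precisely when the HCS is 0:

            #include <boost/optional.hpp>
            #include <cassert>
            #include <cstdint>

            using Hcs = boost::optional<uint64_t>;

            // Producer side (as quoted above): an HCS of 0 is sent as boost::none.
            Hcs hcsForDiskSnapshot(bool supportSyncReplication, Hcs highCompletedSeqno) {
                const bool sendHCS = supportSyncReplication && highCompletedSeqno &&
                                     *highCompletedSeqno != 0;
                return sendHCS ? highCompletedSeqno : boost::none;
            }

            // Recipient side, once it becomes Active: a disk checkpoint on a
            // SyncRep stream is expected to carry an HCS (active_stream.cc:1185).
            bool snapshotPreconditionHolds(bool supportSyncReplication,
                                           bool isCkptTypeDisk,
                                           Hcs hcs) {
                const bool sendHCS = supportSyncReplication && isCkptTypeDisk;
                return !sendHCS || hcs.is_initialized();
            }

            int main() {
                // Bucket that has never seen a SyncWrite: HCS == 0.
                Hcs received = hcsForDiskSnapshot(true, Hcs(0)); // -> boost::none
                // The recipient marks the checkpoint as Disk but with no HCS,
                // so the precondition is violated when it streams it onwards.
                assert(!snapshotPreconditionHolds(true, true, received));
            }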

            Note that http://review.couchbase.org/#/c/113492/ added this clause, to avoid an assertion in the consumer's flusher when it attempts to flush an HCS not greater than the current persisted completed seqno - which fails for the initial snapshot (0 > 0):

                            if (hcs) {
                                Expects(hcs > vbstate.persistedCompletedSeqno);
            

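            For completeness, a tiny sketch (with assumed types; not the
            actual flusher code) of why that clause was needed: for the
            initial snapshot both values are 0, so the strict comparison
            fails:

            #include <boost/optional.hpp>
            #include <cstdint>

            int main() {
                boost::optional<uint64_t> hcs = 0;    // HCS carried by the snapshot
                uint64_t persistedCompletedSeqno = 0; // fresh vbucket state
                if (hcs) {
                    // Equivalent of Expects(hcs > vbstate.persistedCompletedSeqno):
                    // 0 > 0 is false - the assertion the clause was added to avoid.
                    bool holds = hcs > persistedCompletedSeqno; // false
                    (void)holds;
                }
            }
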
            Balakumaran.Gopal Balakumaran Gopal added a comment -

            Validated this on 6.5.0-4917. Logs: http://qa.sc.couchbase.com/job/temp_rebalance_even/1770/consoleText

            People

              Assignee: Balakumaran.Gopal Balakumaran Gopal
              Reporter: Balakumaran.Gopal Balakumaran Gopal