
Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.0.0
Affects Version/s: Cheshire-Cat
Component/s: couchbase-bucket
Labels:
Environment: 7.0.0-4374-enterprise
Triage: Untriaged
Operating System: Centos 64-bit
Epic Link: KV: Collections
Story Points: 1
Is this a Regression?: Yes
Sprint: KV-Engine 2021-Feb

Description

Script to Repro

guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,data_load_stage=during,quota_percent=80,nodes_failover=2,recovery_type=full,rerun=False,nodes_init=5,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections'

Steps to Repro
1) Create a 5 node cluster
2021-02-03 20:25:50,934 | test | INFO | pool-1-thread-6 | [table_view:display:72] Rebalance Overview
------------------------------------
Nodes	Services	Status
------------------------------------
172.23.98.196	kv	Cluster node
172.23.98.195	None	<--- IN ---
172.23.121.10	None	<--- IN ---
172.23.104.186	None	<--- IN ---
172.23.120.206	None	<--- IN ---
------------------------------------

2) Create buckets/scopes/collections/data
-------------------------------------------------------------------------
Bucket	Type	Replicas	Durability	TTL	Items	RAM Quota	RAM Used	Disk Used
-------------------------------------------------------------------------

bucket1	couchbase	3	none	3000	1048576000	218057960	322905355
bucket2	ephemeral	3	none	3000	1048576000	329473336	170
default	couchbase	3	none	250000	5242880000	470890808	460819585

-------------------------------------------------------------------------

3) Hard failover 2 nodes.

2021-02-03 20:29:36,849 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:600] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.206 port:8091 ssh_username:root]

2021-02-03 20:29:50,240 | test  | INFO    | pool-1-thread-23 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.05900001526 seconds to complete

2021-02-03 20:29:50,243 | test  | INFO    | pool-1-thread-23 | [common_lib:sleep:22] Sleep 8.05900001526 seconds. Reason: Wait after rebalance complete

2021-02-03 20:31:58,346 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 1 nodes failed over as expected in 0.0409998893738 seconds

2021-02-03 20:32:10,351 | test  | INFO    | pool-1-thread-8 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.07899999619 seconds to complete

2021-02-03 20:32:10,355 | test  | INFO    | pool-1-thread-8 | [common_lib:sleep:22] Sleep 8.07899999619 seconds. Reason: Wait after rebalance complete

2021-02-03 20:34:18,476 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 2 nodes failed over as expected in 0.0379998683929 seconds

4) Do full recovery and rebalance

2021-02-03 20:34:44,459 | test  | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.104.186 - Node not part of cluster inactiveFailed

2021-02-03 20:34:44,459 | test  | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.120.206 - Node not part of cluster inactiveFailed

Rebalance fails and we see a lot of minidumps. The one of interest is shown below.
grep CRITICAL on 172.23.121.10

memcached.log.000015.txt:2021-02-03T20:36:33.577265-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): decodeManifest: duplicate collection:0xce in stored data

memcached.log.000015.txt:2021-02-03T20:36:33.577964-08:00 CRITICAL *** Fatal error encountered during exception handling ***

memcached.log.000015.txt:2021-02-03T20:36:33.602241-08:00 CRITICAL *** Fatal error encountered during exception handling ***

memcached.log.000015.txt:2021-02-03T20:36:33.618348-08:00 CRITICAL *** Fatal error encountered during exception handling ***

memcached.log.000015.txt:2021-02-03T20:36:33.985955-08:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-4374). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/92e07269-f018-44cc-04ca8aa4-cdf08df0.dmp before terminating.

memcached.log.000015.txt:2021-02-03T20:36:33.985973-08:00 CRITICAL Stack backtrace of crashed thread:

memcached.log.000015.txt:2021-02-03T20:36:33.986765-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x145bbd]

memcached.log.000015.txt:2021-02-03T20:36:33.986791-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x15b3fa]

memcached.log.000015.txt:2021-02-03T20:36:33.986814-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x15b738]

memcached.log.000015.txt:2021-02-03T20:36:33.986830-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f23971e1000+0xf630]

memcached.log.000015.txt:2021-02-03T20:36:33.986873-08:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f2396e13000+0x36387]

memcached.log.000015.txt:2021-02-03T20:36:33.986915-08:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f2396e13000+0x37a78]

memcached.log.000015.txt:2021-02-03T20:36:33.986972-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f2397916000+0x91195]

memcached.log.000015.txt:2021-02-03T20:36:33.986995-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x155632]

memcached.log.000015.txt:2021-02-03T20:36:33.987042-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2397916000+0x8ef86]

memcached.log.000015.txt:2021-02-03T20:36:33.987088-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2397916000+0x8efd1]

memcached.log.000015.txt:2021-02-03T20:36:33.987112-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x16f2f3]

memcached.log.000015.txt:2021-02-03T20:36:33.987133-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x169352]

memcached.log.000015.txt:2021-02-03T20:36:33.987157-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x2e9bd6]

memcached.log.000015.txt:2021-02-03T20:36:33.987186-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x2d20ca]

memcached.log.000015.txt:2021-02-03T20:36:33.987210-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x2eccf9]

memcached.log.000015.txt:2021-02-03T20:36:33.987231-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x167793]

memcached.log.000015.txt:2021-02-03T20:36:33.987297-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2397916000+0xb9dcf]

memcached.log.000015.txt:2021-02-03T20:36:33.987311-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f23971e1000+0x7ea5]

memcached.log.000015.txt:2021-02-03T20:36:33.987365-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f2396e13000+0xfe8dd]
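
When triaging a backtrace like the one above, it can help to pull the exception message and the per-frame module offsets out of the grep output. The following is a minimal sketch that assumes the log line layout shown in this ticket; the helper names `parse_frames` and `exception_messages` are hypothetical and are not part of testrunner or cbcollect_info.

```python
import re

# Matches the frame lines printed after "Stack backtrace of crashed thread:".
# Groups: module path, the (possibly empty) "symbol+0xoff" text inside the
# parentheses, the module load base, and the frame offset within the module.
FRAME_RE = re.compile(
    r"CRITICAL\s+(?P<module>/\S+?)\((?P<symbol>[^)]*)\)\s+"
    r"\[0x(?P<base>[0-9a-f]+)\+0x(?P<offset>[0-9a-f]+)\]"
)

# Matches the text that follows "what(): " on the unhandled-exception line.
WHAT_RE = re.compile(r"what\(\): (?P<message>.*)$")


def parse_frames(lines):
    """Return one dict per backtrace frame found in the given log lines."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue
        # Drop the "+0x..." suffix; an empty symbol means the frame is unsymbolized.
        symbol = match.group("symbol").split("+")[0] or None
        frames.append({
            "module": match.group("module"),
            "symbol": symbol,
            "base": int(match.group("base"), 16),
            "offset": int(match.group("offset"), 16),
        })
    return frames


def exception_messages(lines):
    """Return the what() text of every logged exception line."""
    messages = []
    for line in lines:
        match = WHAT_RE.search(line)
        if match is not None:
            messages.append(match.group("message"))
    return messages
```

The unsymbolized libep.so frames are the ones of interest here. Their offsets can be passed to a symbolizer together with matching debug symbols for build 7.0.0-4374; whether the symbolizer expects the bare offset or base plus offset depends on how the module was linked.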

cbcollect_info attached. This was not seen on 7.0.0-4342.
This bug could be related to ~~MB-44097~~, as it is the same test and the same minidumps as in ~~MB-44097~~ are also seen here.
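
The exception in the memcached log reports that collection ID 0xce appears more than once in the stored manifest. As a rough illustration of that invariant only, and not the KV-engine decodeManifest implementation, the sketch below scans a list of collection entries for repeated IDs. The entry layout (a dict with a hex string under `"uid"`) and the function name `find_duplicate_collections` are assumptions made for this example; the actual on-disk format and the layout of the attached vb12_open.json may differ.

```python
def find_duplicate_collections(entries):
    """Return collection IDs that occur more than once, in first-seen order.

    `entries` is assumed to be an iterable of dicts whose "uid" value is a
    hex string such as "ce" or "0xce" (hypothetical layout for illustration).
    """
    seen = set()
    duplicates = []
    for entry in entries:
        cid = int(entry["uid"], 16)  # accepts both "ce" and "0xce"
        if cid in seen and cid not in duplicates:
            duplicates.append(cid)
        seen.add(cid)
    return duplicates
```

A non-empty result for the collections stored against a vbucket would correspond to the condition the exception message describes.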

Attachments

bt_full.txt	8 kB	03/Feb/21 9:14 PM
consoleText.txt	370 kB	03/Feb/21 9:16 PM
info_threads.txt	4 kB	03/Feb/21 9:14 PM
thread_apply_all_bt.txt	94 kB	03/Feb/21 9:14 PM
vb12_open.json	46 kB	04/Feb/21 3:22 AM

Issue Links

relates to

MB-44097 Crash when collection disk size underflows with concurrent flush & compaction

Closed

[Collections] : decodeManifest: duplicate collection:0xce in stored data ------ collection CRUD + multi node hard failover + full recovery + rebalance
