Details
- Bug
- Resolution: Fixed
- Blocker
- Cheshire-Cat
- 7.0.0-4374-enterprise
- Untriaged
- Centos 64-bit
- 1
- Yes
- KV-Engine 2021-Feb
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,data_load_stage=during,quota_percent=80,nodes_failover=2,recovery_type=full,rerun=False,nodes_init=5,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections'
Steps to Repro
1) Create a 5 node cluster
2021-02-03 19:29:07,426 | test | INFO | pool-1-thread-6 | [table_view:display:72] Rebalance Overview
Nodes          | Services | Status       |
172.23.98.196  | kv       | Cluster node |
172.23.98.195  | None     | <--- IN ---  |
172.23.121.10  | None     | <--- IN ---  |
172.23.104.186 | None     | <--- IN ---  |
172.23.120.206 | None     | <--- IN ---  |
2) Create buckets/scopes/collections/data.
2021-02-03 19:34:17,336 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
Bucket  | Type      | Replicas | Durability | TTL | Items  | RAM Quota   | RAM Used  | Disk Used |
bucket1 | couchbase | 3        | none       | 0   | 3000   | 1048576000  | 218121544 | 314572404 |
bucket2 | ephemeral | 3        | none       | 0   | 3000   | 1048576000  | 331036392 | 170       |
default | couchbase | 3        | none       | 0   | 500000 | 10485760000 | 706696616 | 558261158 |
3) Hard failover 2 nodes.
2021-02-03 19:34:24,032 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:600] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.206 port:8091 ssh_username:root]
2021-02-03 19:34:36,875 | test | INFO | pool-1-thread-20 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.33500003815 seconds to complete
2021-02-03 19:34:36,887 | test | INFO | pool-1-thread-20 | [common_lib:sleep:22] Sleep 8.33500003815 seconds. Reason: Wait after rebalance complete
2021-02-03 19:36:45,301 | test | INFO | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 1 nodes failed over as expected in 0.0710000991821 seconds
2021-02-03 19:36:59,030 | test | INFO | pool-1-thread-25 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.6819999218 seconds to complete
2021-02-03 19:36:59,039 | test | INFO | pool-1-thread-25 | [common_lib:sleep:22] Sleep 8.6819999218 seconds. Reason: Wait after rebalance complete
2021-02-03 19:39:08,802 | test | INFO | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 2 nodes failed over as expected in 1.07400012016 seconds
4) Do full recovery + rebalance. Rebalance fails.
2021-02-03 19:39:53,246 | test | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.104.186 - Node not part of cluster inactiveFailed
2021-02-03 19:39:53,249 | test | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.120.206 - Node not part of cluster inactiveFailed
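Steps 3 and 4 can also be driven by hand through the cluster REST interface. The following is a minimal sketch, not the test framework's own code: it only builds the sequence of POST requests (path plus form body) and does not send them. The helper names are hypothetical, and the `ns_1@<ip>` node naming is an assumption about how this cluster identifies its nodes.

```python
from urllib.parse import urlencode


def otp_node(ip):
    # Assumption: nodes are registered under the usual "ns_1@<ip>" name.
    return f"ns_1@{ip}"


def failover_recovery_requests(known_ips, failed_ips):
    """Return (path, form_body) pairs for hard failover, full recovery, rebalance."""
    requests = []
    # Step 3: hard failover, one request per node.
    for ip in failed_ips:
        requests.append(("/controller/failOver",
                         urlencode({"otpNode": otp_node(ip)})))
    # Step 4a: mark each failed-over node for full recovery.
    for ip in failed_ips:
        requests.append(("/controller/setRecoveryType",
                         urlencode({"otpNode": otp_node(ip),
                                    "recoveryType": "full"})))
    # Step 4b: rebalance with every node kept in the cluster.
    known = ",".join(otp_node(ip) for ip in known_ips)
    requests.append(("/controller/rebalance",
                     urlencode({"knownNodes": known, "ejectedNodes": ""})))
    return requests


if __name__ == "__main__":
    nodes = ["172.23.98.196", "172.23.98.195", "172.23.121.10",
             "172.23.104.186", "172.23.120.206"]
    for path, body in failover_recovery_requests(nodes, nodes[-2:]):
        print("POST", path, body)
```

Each pair would be sent to port 8091 of a healthy node with administrator credentials; the rebalance request is the one that fails in this ticket.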
We see the following coredumps on 172.23.98.196, 172.23.98.195 and 172.23.121.10.
Output of grep for CRITICAL in the memcached logs on 172.23.98.196 (crash dump 97539118-64f3-442a-bb4c8ab6-c98e1f02.dmp):
[root@s81706 logs]# grep CRITICAL memcached.log.0000*
memcached.log.000011.txt:2021-02-03T19:39:45.584027-08:00 CRITICAL *** Fatal error encountered during exception handling ***
memcached.log.000011.txt:2021-02-03T19:39:45.585012-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ThrowExceptionUnderflowPolicy current:0 arg:1
memcached.log.000011.txt:2021-02-03T19:39:46.065505-08:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-4374). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/97539118-64f3-442a-bb4c8ab6-c98e1f02.dmp before terminating.
memcached.log.000011.txt:2021-02-03T19:39:46.065583-08:00 CRITICAL Stack backtrace of crashed thread:
memcached.log.000011.txt:2021-02-03T19:39:46.065909-08:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x145bbd]
memcached.log.000011.txt:2021-02-03T19:39:46.065948-08:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x15b3fa]
memcached.log.000011.txt:2021-02-03T19:39:46.065969-08:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x15b738]
memcached.log.000011.txt:2021-02-03T19:39:46.065989-08:00 CRITICAL /lib64/libpthread.so.0() [0x7f98c92b1000+0xf630]
memcached.log.000011.txt:2021-02-03T19:39:46.066055-08:00 CRITICAL /lib64/libc.so.6(gsignal+0x37) [0x7f98c8ee3000+0x36387]
memcached.log.000011.txt:2021-02-03T19:39:46.066112-08:00 CRITICAL /lib64/libc.so.6(abort+0x148) [0x7f98c8ee3000+0x37a78]
memcached.log.000011.txt:2021-02-03T19:39:46.066190-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f98c99e6000+0x91195]
memcached.log.000011.txt:2021-02-03T19:39:46.066215-08:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x155632]
memcached.log.000011.txt:2021-02-03T19:39:46.066259-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f98c99e6000+0x8ef86]
memcached.log.000011.txt:2021-02-03T19:39:46.066292-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f98c99e6000+0x8efd1]
memcached.log.000011.txt:2021-02-03T19:39:46.066315-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x16f2f3]
memcached.log.000011.txt:2021-02-03T19:39:46.066330-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x169352]
memcached.log.000011.txt:2021-02-03T19:39:46.066352-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x2e9bd6]
memcached.log.000011.txt:2021-02-03T19:39:46.066372-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x2d20ca]
memcached.log.000011.txt:2021-02-03T19:39:46.066397-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x2eccf9]
memcached.log.000011.txt:2021-02-03T19:39:46.066422-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x167793]
memcached.log.000011.txt:2021-02-03T19:39:46.066501-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f98c99e6000+0xb9dcf]
memcached.log.000011.txt:2021-02-03T19:39:46.066517-08:00 CRITICAL /lib64/libpthread.so.0() [0x7f98c92b1000+0x7ea5]
memcached.log.000011.txt:2021-02-03T19:39:46.066774-08:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f98c8ee3000+0xfe8dd]
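Most frames in the backtrace above carry only a module path and an offset, so triage usually starts by extracting those pairs. The sketch below is one way to do that with the Python standard library; the function name and the returned dictionary layout are illustrative choices, and the regular expression is derived only from the line format visible in this log.

```python
import re

# Matches frames such as:
#   ... CRITICAL /lib64/libc.so.6(abort+0x148) [0x7f98c8ee3000+0x37a78]
#   ... CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x16f2f3]
FRAME_RE = re.compile(
    r"CRITICAL\s+(?P<module>/\S+?)\((?P<symbol>[^)]*)\)\s+"
    r"\[0x(?P<base>[0-9a-fA-F]+)\+0x(?P<offset>[0-9a-fA-F]+)\]"
)


def parse_frames(lines):
    """Extract module, symbol, load base and offset from backtrace lines."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame (e.g. the "Stack backtrace" header)
        frames.append({
            "module": match.group("module"),
            "symbol": match.group("symbol") or None,  # empty "()" means no symbol
            "base": int(match.group("base"), 16),
            "offset": int(match.group("offset"), 16),
        })
    return frames


if __name__ == "__main__":
    sample = [
        "memcached.log.000011.txt:2021-02-03T19:39:46.065583-08:00 CRITICAL Stack backtrace of crashed thread:",
        "memcached.log.000011.txt:2021-02-03T19:39:46.066112-08:00 CRITICAL /lib64/libc.so.6(abort+0x148) [0x7f98c8ee3000+0x37a78]",
        "memcached.log.000011.txt:2021-02-03T19:39:46.066315-08:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x16f2f3]",
    ]
    for frame in parse_frames(sample):
        print(frame["module"], hex(frame["offset"]), frame["symbol"])
```

The unsymbolized `libep.so` offsets are the interesting ones here; with debug symbols matching build 7.0.0-4374 they could be handed to a symbolizer such as `addr2line`.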
cbcollect_info attached. This was not seen on 7.0.0-4342.
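The `ThrowExceptionUnderflowPolicy current:0 arg:1` message in the memcached log reads like a non-negative counter being decremented below zero. The sketch below illustrates that behaviour in Python as a reading aid only. It is not the KV-Engine implementation: the class and exception names are hypothetical, and it assumes `current` is the stored value and `arg` is the amount being subtracted.

```python
class CounterUnderflow(ArithmeticError):
    """Raised when a subtraction would take the counter below zero."""


class NonNegativeCounter:
    """Hypothetical counter that throws instead of wrapping or clamping."""

    def __init__(self, value=0):
        if value < 0:
            raise ValueError("initial value must be non-negative")
        self._value = value

    @property
    def value(self):
        return self._value

    def add(self, arg):
        self._value += arg

    def subtract(self, arg):
        if arg > self._value:
            # Message shape copied from the log line in this ticket.
            raise CounterUnderflow(
                f"ThrowExceptionUnderflowPolicy current:{self._value} arg:{arg}"
            )
        self._value -= arg


if __name__ == "__main__":
    counter = NonNegativeCounter()
    try:
        counter.subtract(1)  # decrementing an already-empty counter
    except CounterUnderflow as exc:
        print(exc)
```

Under this reading, the crash means some statistic was decremented once more than it had been incremented, and the exception escaped uncaught, which is why memcached aborted rather than logging a warning.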
Issue Links
- blocks
  - MB-44021 [Collections] - AddressSanitizer: seen during graceful failover + full recovery (Closed)
- depends on
  - MB-43818 Expand information captured from exceptions which are (probably) fatal (Closed)
- is duplicated by
  - MB-44137 CouchStore: 1-Scope, 100 Collections: ThrowExceptionUnderflowPolicy current:0 arg:-422 (Closed)
  - MB-44172 CouchStore: Swap rebalance failed due to mover crashed during dcp_takeover (Closed)
  - MB-44194 Couchstore: Rebalance failed due to bad replicas. (Closed)
- relates to
  - MB-44098 [Collections] : decodeManifest: duplicate collection:0xce in stored data ------ collection CRUD + multi node hard failover + full recovery + rebalance (Closed)
  - MB-44102 [Collections] - Seeing CRITICAL messages with CouchKVStore::maybePatchOnDiskPrepares() (Closed)