VBucket::abort (vb:156) - active failed as HashTable value is not CommittedState::Pending
Activity
Ashwin Govindarajulu May 27, 2021 at 6:47 PM
Not seeing this issue in the latest weekly runs. Closing the ticket.
Dave Rigby February 5, 2021 at 3:08 PM
Given this hasn't been seen since August and a number of fixes have gone into HashTable state tracking, resolving as "Cannot Reproduce".
If this re-occurs please re-open and we can investigate further.
James Harrison January 18, 2021 at 4:46 PM
The provided pcap zips are rather small, and appear to contain just an empty pcaps directory:
$ unzip -t 10.112.191.101_pcaps.zip
Archive: 10.112.191.101_pcaps.zip
testing: pcaps/ OK
No errors detected in compressed data of 10.112.191.101_pcaps.zip.
$ unzip -t 10.112.191.102_pcaps.zip
Archive: 10.112.191.102_pcaps.zip
testing: pcaps/ OK
No errors detected in compressed data of 10.112.191.102_pcaps.zip.
$ ls -lah 10.112.191.10*
-rw-r--r--@ 1 jamesharrison staff 162B 18 Jan 13:47 10.112.191.101_pcaps.zip
-rw-r--r--@ 1 jamesharrison staff 162B 18 Jan 13:47 10.112.191.102_pcaps.zip
The logs from 2020-08-20 are very similar to previous presentations: a clean node creates an active vbucket, and later an abort fails because the expected prepare was not found in the HashTable. Nothing suspicious is logged: no restarts, no re-warmups, no vb takeovers.
One mild difference:
VBucket::abort (vb:1009) - active failed as HashTable value is not CommittedState::Pending - <ud> SV @0x7f9728e85ae0 .CJ ..R.Cp temp: seq:2 rev:2 cas:1597908477626941440 key:"cid:0x0:test_collections-3, size:19" exp:0 age:1 fc:5 vallen:83 val age:1
.CJ ..R.Cp
C - Compressed
J - Json
R - Resident
Cp - Committed via Prepare
The committed value found in the hashtable was committed by prepare, rather than by mutation this time.
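For reference, a minimal sketch of the precondition the error message implies. The CommittedState names come from the log output above; the surrounding types are simplified stand-ins for illustration, not the actual ep-engine code.

#include <iostream>
#include <optional>
#include <string>

// Simplified stand-in for the HashTable value state named in the log.
enum class CommittedState {
    Pending,              // an in-flight prepare (what abort expects to find)
    CommittedViaMutation,
    CommittedViaPrepare   // the "Cp" flag decoded above
};

struct StoredValue {
    std::string key;
    CommittedState state;
};

// Sketch of the check implied by the error: an abort must find a Pending
// prepare for the key in the HashTable; finding only a committed value
// (as happened here) is treated as a failure.
bool canAbort(const std::optional<StoredValue>& htValue) {
    return htValue && htValue->state == CommittedState::Pending;
}

int main() {
    // The state observed in this ticket: only a committed-via-prepare value.
    std::optional<StoredValue> sv{
            StoredValue{"test_collections-3", CommittedState::CommittedViaPrepare}};
    if (!canAbort(sv)) {
        std::cerr << "abort failed as HashTable value is not CommittedState::Pending\n";
    }
}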
Going by the test log,
2020-08-20 12:58:05,259 | test | INFO | pool-2-thread-2 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Update Key | Exception |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| test_collections-3 | com.couchbase.client.core.error.DurabilityAmbiguousException | The server returned with a durability ambiguous response on this request |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
...
2020-08-20 12:58:05,262 | test | ERROR | pool-2-thread-10 | [task:call:3398] Failed to load 1 docs from 2 to 3
2020-08-20 12:58:07,232 | test | INFO | pool-2-thread-5 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Delete Key | Exception |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| test_collections-3 | {'error': 'com.couchbase.client.core.error.RequestCanceledException | RemoveRequest, Reason: NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | 10.112.191.102:11210 | reason:NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | retryAttempts:0 | retryReasons:None', 'cas': 0, 'value': {}} |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
the document had been stored previously, then a subsequent update received a durability ambiguous response, before the attempted sync delete led to the crash.
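To make that sequence concrete, here is a hedged reconstruction as a toy model (illustrative only, not ep-engine code) of what the HashTable slot for the key would hold at each step, and why the final abort finds no Pending value:

#include <iostream>
#include <optional>

// Toy model of one HashTable slot: at most one committed value plus at most
// one in-flight prepare per key.
enum class CommittedState { CommittedViaMutation, CommittedViaPrepare, Pending };

struct Slot {
    std::optional<CommittedState> committed; // last committed value, if any
    std::optional<CommittedState> prepare;   // in-flight prepare, if any
};

int main() {
    Slot slot;

    // 1. test_collections-3 is stored and committed.
    slot.committed = CommittedState::CommittedViaMutation;

    // 2. Durable update: a prepare is created; the client times out and sees
    //    DurabilityAmbiguousException, but the server goes on to commit it,
    //    which matches the ".Cp" (Committed via Prepare) flag in the log.
    slot.prepare = CommittedState::Pending;
    slot.committed = CommittedState::CommittedViaPrepare;
    slot.prepare.reset(); // the prepare is consumed by the commit

    // 3. The sync delete is cancelled in flight; if its prepare never reached
    //    the HashTable, the abort that follows finds no Pending value:
    if (!slot.prepare) {
        std::cout << "abort failed as HashTable value is not CommittedState::Pending\n";
    }
}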
This is from some time ago, and a lot has changed since. Is this still reproducible?
Ashwin Govindarajulu August 20, 2020 at 8:34 AM
@James Harrison reproduced the same issue using 7.0.0-2241.
Please find the pcaps + cbcollect logs:
[^collectinfo-2020-08-20T073015-ns_1@10.112.191.102.zip]
[^collectinfo-2020-08-20T072825-ns_1@10.112.191.101.zip]
Ashwin Govindarajulu August 6, 2020 at 7:07 AM
Update:
Tried reproducing using the same build, but was not able to reproduce the same crash on the server VMs. However, seeing a new issue: https://couchbasecloud.atlassian.net/browse/MB-40779.
Will update the ticket after a few more retries.
Build: 7.0.0-2241
Scenario:
Two node cluster, Couchbase bucket (replicas=1)
Performing durability abort scenario tests by sending SIGSTOP to the memcached process on each node (see the sketch below)
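For reference, a minimal sketch of the fault-injection mechanic the scenario describes, assuming the intent is to pause memcached long enough for in-flight SyncWrites to time out and be aborted. The PID lookup is left to the caller; this is not the actual test-harness code.

#include <cstdio>    // fprintf, perror
#include <cstdlib>   // atoi, EXIT_*
#include <signal.h>  // kill, SIGSTOP, SIGCONT
#include <unistd.h>  // sleep, pid_t

int main(int argc, char** argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <memcached-pid>\n", argv[0]);
        return EXIT_FAILURE;
    }
    const pid_t pid = static_cast<pid_t>(std::atoi(argv[1]));

    // Pause the process so in-flight SyncWrites cannot complete...
    if (kill(pid, SIGSTOP) != 0) { perror("SIGSTOP"); return EXIT_FAILURE; }

    // ...wait past the SyncWrite timeout so the prepares get aborted
    // (10s is an assumed value, meant to exceed the durability timeout)...
    sleep(10);

    // ...then resume the node.
    if (kill(pid, SIGCONT) != 0) { perror("SIGCONT"); return EXIT_FAILURE; }
    return EXIT_SUCCESS;
}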
Service 'memcached' exited with status 134. Restarting. Messages:
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdb910]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd5bae]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd61a3]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x1a3810]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdd5ad]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x146623]
2020-06-04T03:51:04.496614-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x13cf8f]
2020-06-04T03:51:04.496620-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f914a67d000+0x10777]
2020-06-04T03:51:04.496625-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f9147cc3000+0x7df3]
2020-06-04T03:51:04.496654-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f9147902000+0xf61ad]

Compactor for database `eJG-dFwcwK86fvITBOA_ytghCVkrl` (pid [{type,database},
  {important,true},
  {name,<<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>},
  {fa,{#Fun<compaction_daemon.4.9063353>,
       [<<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>,
        {config,{30,undefined},{30,undefined},undefined,false,false,
                {daemon_config,30,131072,20971520}},
        false,
        {[{type,bucket}]}]}}]) terminated unexpectedly:
{{{badmatch,{error,timeout}},
  [{mc_client_binary,stats_recv,4,[{file,"src/mc_client_binary.erl"},{line,168}]},
   {mc_client_binary,stats,4,[{file,"src/mc_client_binary.erl"},{line,419}]},
   {ns_memcached,do_handle_call,3,[{file,"src/ns_memcached.erl"},{line,463}]},
   {ns_memcached,worker_loop,3,[{file,"src/ns_memcached.erl"},{line,236}]},
   {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},
 {gen_server,call,
  [{'ns_memcached-eJG-dFwcwK86fvITBOA_ytghCVkrl','ns_1@10.112.191.103'},
   {raw_stats,<<"diskinfo">>,#Fun<compaction_daemon.18.9063353>,{<<"0">>,<<"0">>}},
   180000]}}
Observation:
Following the crash, the docs on the server are not accessible.
cbcollect logs:
https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.103.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.104.zip
Testcase:
epengine.collection_crud_negative.CollectionDurabilityTests: test_durability_abort,num_items=100,replicas=1,nodes_init=2,override_spec_params=replicas;num_items,GROUP=P0