VBucket::abort (vb:156) - active failed as HashTable value is not CommittedState::Pending

Description

Build: 7.0.0-2241

Scenario:

Two node cluster, Couchbase bucket (replicas=1)
Performing durability abort scenario tests using sigstop on memcached process on each node

Service 'memcached' exited with status 134. Restarting. Messages:
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdb910]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd5bae]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd61a3]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x1a3810]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdd5ad]
2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x146623]
2020-06-04T03:51:04.496614-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x13cf8f]
2020-06-04T03:51:04.496620-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f914a67d000+0x10777]
2020-06-04T03:51:04.496625-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f9147cc3000+0x7df3]
2020-06-04T03:51:04.496654-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f9147902000+0xf61ad]

Compactor for database `eJG-dFwcwK86fvITBOA_ytghCVkrl` (pid [{type,database},
{important,true},
{name,
<<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>},
{fa,
{#Fun<compaction_daemon.4.9063353>,
[<<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>,
{config,
{30, undefined},
{30, undefined},
undefined,
false,false,
{daemon_config, 30,131072, 20971520}}, false,
{[{type, bucket}]}]}}]) terminated unexpectedly: {{{badmatch,
{error, timeout}},
[{mc_client_binary, stats_recv, 4,
[{file, "src/mc_client_binary.erl"}, {line, 168}]},
{mc_client_binary, stats, 4,
[{file, "src/mc_client_binary.erl"}, {line, 419}]},
{ns_memcached, do_handle_call, 3,
[{file, "src/ns_memcached.erl"}, {line, 463}]},
{ns_memcached, worker_loop, 3,
[{file, "src/ns_memcached.erl"}, {line, 236}]},
{proc_lib, init_p_do_apply, 3,
[{file, "proc_lib.erl"}, {line, 247}]}]},
{gen_server, call,
[{'ns_memcached-eJG-dFwcwK86fvITBOA_ytghCVkrl', 'ns_1@10.112.191.103'},
{raw_stats, <<"diskinfo">>,
#Fun<compaction_daemon.18.9063353>,
{<<"0">>, <<"0">>}}, 180000]}} hide

Observation:

Followed by the crash, the docs in the server are not accessible.

Cb_collect_logs:
https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.103.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.104.zip
Testcase:

epengine.collection_crud_negative.CollectionDurabilityTests:
  test_durability_abort,num_items=100,replicas=1,nodes_init=2,override_spec_params=replicas;num_items,GROUP=P0

Components

couchbase-bucket

Affects versions

Cheshire-Cat

Fix versions

None

Labels

collectionsdurabilityfunctional-test

Environment

Enterprise Edition 7.0.0 build 2241 Centos7 on Vagrant box

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

20 Aug 2020, 08:33 AM
20 Aug 2020, 08:33 AM
20 Aug 2020, 08:32 AM
20 Aug 2020, 08:32 AM
20 Aug 2020, 08:32 AM
04 Jun 2020, 11:12 AM
04 Jun 2020, 11:10 AM

Linked issues

relates to

MB-40286

VBucket::commit (vb:924) failed as no HashTable item found with key

Activity

Show:

Ashwin Govindarajulu May 27, 2021 at 6:47 PM

Not seeing this issue in latest weekly runs. Closing the ticket.

Dave Rigby February 5, 2021 at 3:08 PM

Given this hasn't been seen since August and a number of fixed have gone into HashTable state tracking, resolving as "Cannot Reproduce".

If this re-occurs please re-open and we can investigate further.

James Harrison January 18, 2021 at 4:46 PM

The provided pcap zips are rather small, and appear to contain just a pcaps directory

$ unzip -t 10.112.191.101_pcaps.zip
Archive:  10.112.191.101_pcaps.zip
    testing: pcaps/                   OK
No errors detected in compressed data of 10.112.191.101_pcaps.zip.
$ unzip -t 10.112.191.102_pcaps.zip
Archive:  10.112.191.102_pcaps.zip
    testing: pcaps/                   OK
No errors detected in compressed data of 10.112.191.102_pcaps.zip.
$ ls -lah 10.112.191.10*
-rw-r--r--@ 1 jamesharrison  staff   162B 18 Jan 13:47 10.112.191.101_pcaps.zip
-rw-r--r--@ 1 jamesharrison  staff   162B 18 Jan 13:47 10.112.191.102_pcaps.zip

The logs from 2020-08-20 are very similar to previous presentations - clean node creates active vbucket and later an abort fails as the expected prepare was not found in the hashtable. Nothing suspicious logged, no restarts, no re-warmups, no vb takeovers.

One mild difference:

VBucket::abort (vb:1009) - active failed as HashTable value is not CommittedState::Pending - <ud> SV @0x7f9728e85ae0 .CJ ..R.Cp temp:    seq:2 rev:2 cas:1597908477626941440 key:"cid:0x0:test_collections-3, size:19" exp:0 age:1 fc:5 vallen:83 val age:1


.CJ ..R.Cp
C - Compressed
J - Json
R - Resident
Cp - Committed via Prepare

The committed value found in the hashtable was committed by prepare, rather than by mutation this time.

Going by the test log,

2020-08-20 12:58:05,259 | test  | INFO    | pool-2-thread-2 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Update Key         | Exception                                                                                                                               |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| test_collections-3 | com.couchbase.client.core.error.DurabilityAmbiguousException | The server returned with a durability ambiguous response on this request |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
...
2020-08-20 12:58:05,262 | test  | ERROR   | pool-2-thread-10 | [task:call:3398] Failed to load 1 docs from 2 to 3
2020-08-20 12:58:07,232 | test  | INFO    | pool-2-thread-5 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Delete Key         | Exception                                                                                                                                                                                                                                                                                     |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| test_collections-3 | {'error': 'com.couchbase.client.core.error.RequestCanceledException | RemoveRequest, Reason: NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | 10.112.191.102:11210 | reason:NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | retryAttempts:0 | retryReasons:None', 'cas': 0, 'value': {}} |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

the document has been stored previously, then a subsequent update received a durability ambiguous response, before the attempted sync delete lead to the crash.

This is from some time ago, and a lot has changed since. Is this still reproducible?

Ashwin Govindarajulu August 20, 2020 at 8:34 AM

@James Harrison reproduced the same issue using 7.0.0-2241.

Please find the pcaps + cbcollect logs,

[^collectinfo-2020-08-20T073015-ns_1@10.112.191.102.zip]

[^collectinfo-2020-08-20T072825-ns_1@10.112.191.101.zip]

Ashwin Govindarajulu August 6, 2020 at 7:07 AM

Update:

Tried reproducing using the same build and not able to reproduce the same crash on server VMs. But seeing a new issue https://couchbasecloud.atlassian.net/browse/MB-40779#icft=MB-40779.

Will update the ticket after few more retries.

Cannot Reproduce

Pinned fields

Click on the next to a field label to start pinning.

Details

Assignee

Ashwin Govindarajulu

Reporter

Ashwin Govindarajulu

Is this a Regression?

Unknown

Triage

Triaged

Operating System

Centos 64-bit

Story Points

Priority

Critical

Instabug

Open Instabug

PagerDuty

Sentry

Zendesk Support

Created June 4, 2020 at 11:14 AM

Updated May 27, 2021 at 6:47 PM

Resolved February 5, 2021 at 3:08 PM

Configure

Instabug