VBucket::abort (vb:156) - active failed as HashTable value is not CommittedState::Pending

Description

Build: 7.0.0-2241

Scenario:

  • Two node cluster, Couchbase bucket (replicas=1)

  •  Performing durability abort scenario tests using sigstop on memcached process on each node

Service 'memcached' exited with status 134. Restarting. Messages: 2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdb910] 2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd5bae] 2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd61a3] 2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x1a3810] 2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdd5ad] 2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x146623] 2020-06-04T03:51:04.496614-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x13cf8f] 2020-06-04T03:51:04.496620-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f914a67d000+0x10777] 2020-06-04T03:51:04.496625-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f9147cc3000+0x7df3] 2020-06-04T03:51:04.496654-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f9147902000+0xf61ad] Compactor for database `eJG-dFwcwK86fvITBOA_ytghCVkrl` (pid [{type,database}, {important,true}, {name, <<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>}, {fa, {#Fun<compaction_daemon.4.9063353>, [<<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>, {config, {30, undefined}, {30, undefined}, undefined, false,false, {daemon_config, 30,131072, 20971520}}, false, {[{type, bucket}]}]}}]) terminated unexpectedly: {{{badmatch, {error, timeout}}, [{mc_client_binary, stats_recv, 4, [{file, "src/mc_client_binary.erl"}, {line, 168}]}, {mc_client_binary, stats, 4, [{file, "src/mc_client_binary.erl"}, {line, 419}]}, {ns_memcached, do_handle_call, 3, [{file, "src/ns_memcached.erl"}, {line, 463}]}, {ns_memcached, worker_loop, 3, [{file, "src/ns_memcached.erl"}, {line, 236}]}, {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 247}]}]}, {gen_server, call, [{'ns_memcached-eJG-dFwcwK86fvITBOA_ytghCVkrl', 'ns_1@10.112.191.103'}, {raw_stats, <<"diskinfo">>, #Fun<compaction_daemon.18.9063353>, {<<"0">>, <<"0">>}}, 180000]}} hide

Observation:

Followed by the crash, the docs in the server are not accessible. 

Cb_collect_logs:
https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.103.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.104.zip
Testcase:

epengine.collection_crud_negative.CollectionDurabilityTests: test_durability_abort,num_items=100,replicas=1,nodes_init=2,override_spec_params=replicas;num_items,GROUP=P0

Affects versions

Fix versions

None

Environment

Enterprise Edition 7.0.0 build 2241 Centos7 on Vagrant box

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

7
  • 20 Aug 2020, 08:33 AM
  • 20 Aug 2020, 08:33 AM
  • 20 Aug 2020, 08:32 AM
  • 20 Aug 2020, 08:32 AM
  • 20 Aug 2020, 08:32 AM
  • 04 Jun 2020, 11:12 AM
  • 04 Jun 2020, 11:10 AM

Activity

Show:

Ashwin Govindarajulu May 27, 2021 at 6:47 PM

Not seeing this issue in latest weekly runs. Closing the ticket.

Dave Rigby February 5, 2021 at 3:08 PM

Given this hasn't been seen since August and a number of fixed have gone into HashTable state tracking, resolving as "Cannot Reproduce".

If this re-occurs please re-open and we can investigate further.

James Harrison January 18, 2021 at 4:46 PM

The provided pcap zips are rather small, and appear to contain just a pcaps directory

$ unzip -t 10.112.191.101_pcaps.zip Archive: 10.112.191.101_pcaps.zip testing: pcaps/ OK No errors detected in compressed data of 10.112.191.101_pcaps.zip. $ unzip -t 10.112.191.102_pcaps.zip Archive: 10.112.191.102_pcaps.zip testing: pcaps/ OK No errors detected in compressed data of 10.112.191.102_pcaps.zip. $ ls -lah 10.112.191.10* -rw-r--r--@ 1 jamesharrison staff 162B 18 Jan 13:47 10.112.191.101_pcaps.zip -rw-r--r--@ 1 jamesharrison staff 162B 18 Jan 13:47 10.112.191.102_pcaps.zip

The logs from 2020-08-20 are very similar to previous presentations - clean node creates active vbucket and later an abort fails as the expected prepare was not found in the hashtable. Nothing suspicious logged, no restarts, no re-warmups, no vb takeovers.

One mild difference:

VBucket::abort (vb:1009) - active failed as HashTable value is not CommittedState::Pending - <ud> SV @0x7f9728e85ae0 .CJ ..R.Cp temp: seq:2 rev:2 cas:1597908477626941440 key:"cid:0x0:test_collections-3, size:19" exp:0 age:1 fc:5 vallen:83 val age:1 .CJ ..R.Cp C - Compressed J - Json R - Resident Cp - Committed via Prepare

The committed value found in the hashtable was committed by prepare, rather than by mutation this time.

Going by the test log,

2020-08-20 12:58:05,259 | test | INFO | pool-2-thread-2 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ | Update Key | Exception | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ | test_collections-3 | com.couchbase.client.core.error.DurabilityAmbiguousException | The server returned with a durability ambiguous response on this request | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ ... 2020-08-20 12:58:05,262 | test | ERROR | pool-2-thread-10 | [task:call:3398] Failed to load 1 docs from 2 to 3 2020-08-20 12:58:07,232 | test | INFO | pool-2-thread-5 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Delete Key | Exception | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | test_collections-3 | {'error': 'com.couchbase.client.core.error.RequestCanceledException | RemoveRequest, Reason: NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | 10.112.191.102:11210 | reason:NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | retryAttempts:0 | retryReasons:None', 'cas': 0, 'value': {}} | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

the document has been stored previously, then a subsequent update received a durability ambiguous response, before the attempted sync delete lead to the crash.

This is from some time ago, and a lot has changed since. Is this still reproducible?

Ashwin Govindarajulu August 20, 2020 at 8:34 AM

reproduced the same issue using 7.0.0-2241.

Please find the pcaps + cbcollect logs,

[^collectinfo-2020-08-20T073015-ns_1@10.112.191.102.zip]

[^collectinfo-2020-08-20T072825-ns_1@10.112.191.101.zip]

Ashwin Govindarajulu August 6, 2020 at 7:07 AM

Update:

Tried reproducing using the same build and not able to reproduce the same crash on server VMs. But seeing a new issue https://couchbasecloud.atlassian.net/browse/MB-40779#icft=MB-40779.

Will update the ticket after few more retries.

Cannot Reproduce
Pinned fields
Click on the next to a field label to start pinning.

Details

Assignee

Ashwin Govindarajulu

Reporter

Is this a Regression?

Unknown

Triage

Triaged

Operating System

Centos 64-bit

Story Points

1

Priority

Instabug

Open Instabug

PagerDuty

Sentry

Zendesk Support

Created June 4, 2020 at 11:14 AM
Updated May 27, 2021 at 6:47 PM
Resolved February 5, 2021 at 3:08 PM
Instabug