Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-39744

VBucket::abort (vb:156) - active failed as HashTable value is not CommittedState::Pending

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Cannot Reproduce
    • Cheshire-Cat
    • None
    • couchbase-bucket
    • Enterprise Edition 7.0.0 build 2241
      Centos7 on Vagrant box
    • Triaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Build: 7.0.0-2241

      Scenario:

      • Two node cluster, Couchbase bucket (replicas=1)
      •  Performing durability abort scenario tests using sigstop on memcached process on each node

      Service 'memcached' exited with status 134. Restarting. Messages:
      2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdb910]
      2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd5bae]
      2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xd61a3]
      2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x1a3810]
      2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0xdd5ad]
      2020-06-04T03:51:04.496609-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x146623]
      2020-06-04T03:51:04.496614-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f914bb3f000+0x13cf8f]
      2020-06-04T03:51:04.496620-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f914a67d000+0x10777]
      2020-06-04T03:51:04.496625-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f9147cc3000+0x7df3]
      2020-06-04T03:51:04.496654-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f9147902000+0xf61ad]
       
      Compactor for database `eJG-dFwcwK86fvITBOA_ytghCVkrl` (pid [{type,database},
      {important,true},
      {name,
      <<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>},
      {fa,
      {#Fun<compaction_daemon.4.9063353>,
      [<<"eJG-dFwcwK86fvITBOA_ytghCVkrl">>,
      {config,
      {30, undefined},
      {30, undefined},
      undefined,
      false,false,
      {daemon_config, 30,131072, 20971520}}, false,
      {[{type, bucket}]}]}}]) terminated unexpectedly: {{{badmatch,
      {error, timeout}},
      [{mc_client_binary, stats_recv, 4,
      [{file, "src/mc_client_binary.erl"}, {line, 168}]},
      {mc_client_binary, stats, 4,
      [{file, "src/mc_client_binary.erl"}, {line, 419}]},
      {ns_memcached, do_handle_call, 3,
      [{file, "src/ns_memcached.erl"}, {line, 463}]},
      {ns_memcached, worker_loop, 3,
      [{file, "src/ns_memcached.erl"}, {line, 236}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [{'ns_memcached-eJG-dFwcwK86fvITBOA_ytghCVkrl', 'ns_1@10.112.191.103'},
      {raw_stats, <<"diskinfo">>,
      #Fun<compaction_daemon.18.9063353>,
      {<<"0">>, <<"0">>}}, 180000]}} hide

      Observation:

      Followed by the crash, the docs in the server are not accessible. 

      Cb_collect_logs:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.103.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-04T105726-ns_1%4010.112.191.104.zip
      Testcase:

      epengine.collection_crud_negative.CollectionDurabilityTests:
        test_durability_abort,num_items=100,replicas=1,nodes_init=2,override_spec_params=replicas;num_items,GROUP=P0
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Update:

            Tried reproducing using the same build and not able to reproduce the same crash on server VMs. But seeing a new issue MB-40779.

            Will update the ticket after few more retries.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Update: Tried reproducing using the same build and not able to reproduce the same crash on server VMs. But seeing a new issue MB-40779 . Will update the ticket after few more retries.

            James Harrison reproduced the same issue using 7.0.0-2241.

            Please find the pcaps + cbcollect logs,

            [^collectinfo-2020-08-20T073015-ns_1@10.112.191.102.zip]

            [^collectinfo-2020-08-20T072825-ns_1@10.112.191.101.zip]

            10.112.191.102_pcaps.zip

            10.112.191.101_pcaps.zip

            test.log

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - James Harrison reproduced the same issue using 7.0.0-2241. Please find the pcaps + cbcollect logs, [^collectinfo-2020-08-20T073015-ns_1@10.112.191.102.zip] [^collectinfo-2020-08-20T072825-ns_1@10.112.191.101.zip] 10.112.191.102_pcaps.zip 10.112.191.101_pcaps.zip test.log

            The provided pcap zips are rather small, and appear to contain just a pcaps directory

            $ unzip -t 10.112.191.101_pcaps.zip
            Archive:  10.112.191.101_pcaps.zip
                testing: pcaps/                   OK
            No errors detected in compressed data of 10.112.191.101_pcaps.zip.
            $ unzip -t 10.112.191.102_pcaps.zip
            Archive:  10.112.191.102_pcaps.zip
                testing: pcaps/                   OK
            No errors detected in compressed data of 10.112.191.102_pcaps.zip.
            $ ls -lah 10.112.191.10*
            -rw-r--r--@ 1 jamesharrison  staff   162B 18 Jan 13:47 10.112.191.101_pcaps.zip
            -rw-r--r--@ 1 jamesharrison  staff   162B 18 Jan 13:47 10.112.191.102_pcaps.zip
            

            The logs from 2020-08-20 are very similar to previous presentations - clean node creates active vbucket and later an abort fails as the expected prepare was not found in the hashtable. Nothing suspicious logged, no restarts, no re-warmups, no vb takeovers.

            One mild difference:

            VBucket::abort (vb:1009) - active failed as HashTable value is not CommittedState::Pending - <ud> SV @0x7f9728e85ae0 .CJ ..R.Cp temp:    seq:2 rev:2 cas:1597908477626941440 key:"cid:0x0:test_collections-3, size:19" exp:0 age:1 fc:5 vallen:83 val age:1
             
             
            .CJ ..R.Cp
            C - Compressed
            J - Json
            R - Resident
            Cp - Committed via Prepare
            

            The committed value found in the hashtable was committed by prepare, rather than by mutation this time.

            Going by the test log,

            2020-08-20 12:58:05,259 | test  | INFO    | pool-2-thread-2 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default
            +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
            | Update Key         | Exception                                                                                                                               |
            +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
            | test_collections-3 | com.couchbase.client.core.error.DurabilityAmbiguousException | The server returned with a durability ambiguous response on this request |
            +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
            ...
            2020-08-20 12:58:05,262 | test  | ERROR   | pool-2-thread-10 | [task:call:3398] Failed to load 1 docs from 2 to 3
            2020-08-20 12:58:07,232 | test  | INFO    | pool-2-thread-5 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default
            +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Delete Key         | Exception                                                                                                                                                                                                                                                                                     |
            +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | test_collections-3 | {'error': 'com.couchbase.client.core.error.RequestCanceledException | RemoveRequest, Reason: NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | 10.112.191.102:11210 | reason:NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | retryAttempts:0 | retryReasons:None', 'cas': 0, 'value': {}} |
            +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            

            the document has been stored previously, then a subsequent update received a durability ambiguous response, before the attempted sync delete lead to the crash.

            This is from some time ago, and a lot has changed since. Is this still reproducible?

            james.harrison James Harrison added a comment - The provided pcap zips are rather small, and appear to contain just a pcaps directory $ unzip -t 10.112.191.101_pcaps.zip Archive: 10.112.191.101_pcaps.zip testing: pcaps/ OK No errors detected in compressed data of 10.112.191.101_pcaps.zip. $ unzip -t 10.112.191.102_pcaps.zip Archive: 10.112.191.102_pcaps.zip testing: pcaps/ OK No errors detected in compressed data of 10.112.191.102_pcaps.zip. $ ls -lah 10.112.191.10* -rw-r--r--@ 1 jamesharrison staff 162B 18 Jan 13:47 10.112.191.101_pcaps.zip -rw-r--r--@ 1 jamesharrison staff 162B 18 Jan 13:47 10.112.191.102_pcaps.zip The logs from 2020-08-20 are very similar to previous presentations - clean node creates active vbucket and later an abort fails as the expected prepare was not found in the hashtable. Nothing suspicious logged, no restarts, no re-warmups, no vb takeovers. One mild difference: VBucket::abort (vb:1009) - active failed as HashTable value is not CommittedState::Pending - <ud> SV @0x7f9728e85ae0 .CJ ..R.Cp temp: seq:2 rev:2 cas:1597908477626941440 key:"cid:0x0:test_collections-3, size:19" exp:0 age:1 fc:5 vallen:83 val age:1     .CJ ..R.Cp C - Compressed J - Json R - Resident Cp - Committed via Prepare The committed value found in the hashtable was committed by prepare, rather than by mutation this time. Going by the test log, 2020-08-20 12:58:05,259 | test | INFO | pool-2-thread-2 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ | Update Key | Exception | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ | test_collections-3 | com.couchbase.client.core.error.DurabilityAmbiguousException | The server returned with a durability ambiguous response on this request | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ ... 2020-08-20 12:58:05,262 | test | ERROR | pool-2-thread-10 | [task:call:3398] Failed to load 1 docs from 2 to 3 2020-08-20 12:58:07,232 | test | INFO | pool-2-thread-5 | [table_view:display:72] Keys failed in YuZxX8l_FTdbPYoGu41XD9IaWn:_default:_default +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Delete Key | Exception | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | test_collections-3 | {'error': 'com.couchbase.client.core.error.RequestCanceledException | RemoveRequest, Reason: NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | 10.112.191.102:11210 | reason:NO_MORE_RETRIES (CHANNEL_CLOSED_WHILE_IN_FLIGHT) | retryAttempts:0 | retryReasons:None', 'cas': 0, 'value': {}} | +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ the document has been stored previously, then a subsequent update received a durability ambiguous response, before the attempted sync delete lead to the crash. This is from some time ago, and a lot has changed since. Is this still reproducible?
            drigby Dave Rigby added a comment -

            Given this hasn't been seen since August and a number of fixed have gone into HashTable state tracking, resolving as "Cannot Reproduce".

            If this re-occurs please re-open and we can investigate further.

            drigby Dave Rigby added a comment - Given this hasn't been seen since August and a number of fixed have gone into HashTable state tracking, resolving as "Cannot Reproduce". If this re-occurs please re-open and we can investigate further.

            Not seeing this issue in latest weekly runs. Closing the ticket.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Not seeing this issue in latest weekly runs. Closing the ticket.

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              ashwin.govindarajulu Ashwin Govindarajulu
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty