MB-52056: Sync Gateway testing causing bucket flush failures


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: 7.2.4
    • Affects Version/s: 7.1.0
    • Component/s: ns_server
    • Labels: None
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      Now that Couchbase Server 7.1 is the latest version on Docker Hub (couchbase/server:latest), we have noticed that our tests are failing consistently due to bucket flush errors, reported in the Sync Gateway (SGW) logs as:

      Error flushing bucket: {"_":"Flush failed with unexpected error. Check server logs for details."}  Will retry. -- base.(*Collection).Flush.func1() at collection.go:631
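
      For context, the "Will retry" in the log above means SGW re-issues the flush in a loop until it succeeds. A minimal sketch of an equivalent retry loop using gocb v2's bucket-management API (bucket name and credentials below are placeholders, and this is not SGW's actual collection.go implementation):

      package main

      import (
          "fmt"
          "log"
          "time"

          "github.com/couchbase/gocb/v2"
      )

      func main() {
          cluster, err := gocb.Connect("couchbase://127.0.0.1", gocb.ClusterOptions{
              Authenticator: gocb.PasswordAuthenticator{
                  Username: "Administrator", // placeholder credentials
                  Password: "password",
              },
          })
          if err != nil {
              log.Fatal(err)
          }

          // Re-issue the flush until it succeeds, mirroring the "Will retry"
          // behaviour seen in the SGW log above. Flush must be enabled on the
          // bucket for the call to be accepted at all.
          const maxAttempts = 5
          for attempt := 1; attempt <= maxAttempts; attempt++ {
              err = cluster.Buckets().FlushBucket("sg_int_1", nil) // placeholder bucket name
              if err == nil {
                  fmt.Println("flush succeeded")
                  return
              }
              log.Printf("flush attempt %d failed: %v -- will retry", attempt, err)
              time.Sleep(time.Second)
          }
          log.Fatalf("flush still failing after %d attempts: %v", maxAttempts, err)
      }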
      

      This happens both locally and on Jenkins. Looking at the logs, I noticed a large number of crash reports in the ns_server debug logs:

      =========================CRASH REPORT=========================
        crasher:
          initial call: misc:turn_into_gen_server/4
          pid: <15922.25127.141>
          registered_name: 'capi_set_view_manager-sg_int_1_1651663419116586663'
          exception throw: {file_already_opened,
                               "/opt/couchbase/var/lib/couchbase/data/@indexes/sg_int_1_1651663419116586663/main_72cc6e6eba2986295f83acae24e19759.view.1"}
            in function  couch_set_view:get_group_server/2 (/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view.erl, line 437)
            in call from couch_set_view:define_group/4 (/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view.erl, line 143)
            in call from timer:tc/3 (timer.erl, line 197)
            in call from capi_set_view_manager:maybe_define_group/2 (src/capi_set_view_manager.erl, line 292)
            in call from capi_set_view_manager:'-init/1-lc$^1/1-0-'/2 (src/capi_set_view_manager.erl, line 175)
            in call from capi_set_view_manager:init/1 (src/capi_set_view_manager.erl, line 176)
            in call from misc:turn_into_gen_server/4 (src/misc.erl, line 503)
          ancestors: [<0.11095.28>,
                        'single_bucket_kv_sup-sg_int_1_1651663419116586663',
                        ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
                        ns_server_nodes_sup,<0.270.0>,ns_server_cluster_sup,
                        root_sup,<0.145.0>]
          message_queue_len: 0
          messages: []
          links: [<0.11095.28>,<15922.25143.141>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 4185
          stack_size: 29
          reductions: 28307
        neighbours:
      

      as well as errors in the ns_server error logs:

      [ns_server:error,2022-05-04T11:28:22.050Z,ns_1@127.0.0.1:<0.8103.8>:menelaus_util:reply_server_error_before_close:210]Server error during processing: ["web request failed",
                                       {path,
                                        "/pools/default/buckets/sg_int_2_1651663419116586663/controller/doFlush"},
                                       {method,'POST'},
                                       {type,exit},
                                       {what,
                                        {{{badmatch,
                                           {error,
                                            {failed_nodes,['ns_1@127.0.0.1']}}},
                                          [{ns_janitor,cleanup_apply_config_body,4,
                                            [{file,"src/ns_janitor.erl"},
                                             {line,295}]},
                                           {ns_janitor,
                                            '-cleanup_apply_config/4-fun-0-',4,
                                            [{file,"src/ns_janitor.erl"},
                                             {line,215}]},
                                           {async,'-async_init/4-fun-1-',3,
                                            [{file,"src/async.erl"},{line,191}]}]},
                                         {gen_statem,call,
                                          [{via,leader_registry,ns_orchestrator},
                                           {flush_bucket,
                                            "sg_int_2_1651663419116586663"},
                                           infinity]}}},
                                       {trace,
                                        [{gen,do_call,4,
                                          [{file,"gen.erl"},{line,220}]},
                                         {gen,do_for_proc,2,
                                          [{file,"gen.erl"},{line,381}]},
                                         {gen_statem,call_dirty,4,
                                          [{file,"gen_statem.erl"},{line,684}]},
                                         {menelaus_web_buckets,
                                          do_handle_bucket_flush,2,
                                          [{file,"src/menelaus_web_buckets.erl"},
                                           {line,703}]},
                                         {request_tracker,request,2,
                                          [{file,"src/request_tracker.erl"},
                                           {line,40}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,221}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,153}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,226}]}]}]
      [ns_server:error,2022-05-04T11:28:22.274Z,ns_1@127.0.0.1:ns_doctor<0.882.0>:ns_doctor:update_status:303]The following buckets became not ready on node 'ns_1@127.0.0.1': ["sg_int_0_1651663419116586663",
                                                                        "sg_int_2_1651663419116586663"], those of them are active ["sg_int_0_1651663419116586663",
                                                                                                                                   "sg_int_2_1651663419116586663"]
      

      and warnings in the memcached logs:

      WARNING (sg_int_1_1651664472706223954) CouchKVStore::unlinkCouchFile: remove error:2, vb:446, rev:42, fname:/opt/couchbase/var/lib/couchbase/data/sg_int_1_1651664472706223954/446.couch.42.
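
      The failing operation in the ns_server error log is the bucket-flush REST endpoint itself (POST /pools/default/buckets/<bucket>/controller/doFlush), so the error can be observed outside of SGW by calling that endpoint directly. A minimal sketch against a single node on localhost (placeholder bucket name and credentials):

      package main

      import (
          "fmt"
          "io"
          "log"
          "net/http"
      )

      func main() {
          // The same endpoint that appears in the ns_server error log above.
          url := "http://127.0.0.1:8091/pools/default/buckets/sg_int_2/controller/doFlush"
          req, err := http.NewRequest(http.MethodPost, url, nil)
          if err != nil {
              log.Fatal(err)
          }
          req.SetBasicAuth("Administrator", "password") // placeholder credentials

          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              log.Fatal(err)
          }
          defer resp.Body.Close()

          body, _ := io.ReadAll(resp.Body)
          // On the failures described here, the response body carries the same
          // "Flush failed with unexpected error..." JSON that SGW logs.
          fmt.Printf("status: %s\nbody: %s\n", resp.Status, body)
      }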

      The Docker image enterprise-7.0.3 and other versions have shown similar errors in the past, but only rarely and never this consistently across all tests. We are running a single node. Sync Gateway and cbcollect logs are attached.

      Could you please advise on what is going wrong and whether this is a potential bug in Couchbase Server?

      Attachments

        1. cbcollect.zip (75.65 MB)
        2. verbose_int.out.raw (5.84 MB)

            People

              Assignee: Tor Colvin
              Reporter: Isaac Lambat
              Votes: 0
              Watchers: 14

