Couchbase Server / MB-44598

Exceeded retries count trying to get consistent keys/values for config replication



    Description

      In an FTS test, we attempt to create 1000 indexes. During the index creation phase, we start to see lots of these errors:

      "Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs"

      Furthermore, the UI for Search fails to load with a "Gateway Timeout".

      This is the test run with the failure: http://perf.jenkins.couchbase.com/job/atlas_collections/247/console

      These are the logs:

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas_collections-247/172.23.99.211.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas_collections-247/172.23.99.39.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas_collections-247/172.23.99.40.zip

      build: 7.0.0-4502
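
      For context, the test drives index creation through the FTS REST API. A minimal sketch of that loop (Python; the host, credentials, and trimmed-down index definition are assumptions for illustration, not taken from the actual test harness):

          # Minimal sketch: create many FTS indexes over REST. Host, credentials,
          # and the index definition are assumptions, not from the real test.
          import requests

          FTS_HOST = "http://172.23.99.211:8094"  # assumed FTS node and port
          AUTH = ("Administrator", "password")    # assumed credentials

          def create_index(name, bucket):
              # PUT /api/index/{name} creates a full-text index over the bucket.
              body = {
                  "type": "fulltext-index",
                  "sourceType": "couchbase",
                  "sourceName": bucket,
              }
              resp = requests.put(f"{FTS_HOST}/api/index/{name}", json=body, auth=AUTH)
              resp.raise_for_status()

          for i in range(1000):
              create_index(f"fts-index-{i}", "bucket-1")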


          Activity

            korrigan.clark Korrigan Clark added a comment - Dave Finlay, I think this may be an ns_server bug; it seems to be complaining about the following:

            Prometheus http request failed:
            URL: http://127.0.0.1:9123/api/v1/query

            [ns_server:error,2021-02-25T17:57:49.276-08:00,ns_1@172.23.99.211:service_status_keeper_worker<0.2391.0>:rest_utils:get_json:59]Request to (fts) api/nsstatus failed: {error,timeout}

            [ns_server:error,2021-02-25T17:57:49.276-08:00,ns_1@172.23.99.211:service_status_keeper-fts<0.2396.0>:service_status_keeper:handle_cast:104]Service service_fts returned incorrect status
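
            For anyone reproducing this, the failing Prometheus call can be replayed by hand against the same node-local endpoint. A minimal sketch (Python; the metric selector is abbreviated from the logged request body, and any local authentication Couchbase layers on top of Prometheus is omitted):

                # Minimal sketch: replay the failing query against the node-local
                # Prometheus. Port 9123 and the endpoint come from the trace above;
                # the selector is abbreviated, and auth (if required) is omitted.
                import requests

                resp = requests.post(
                    "http://127.0.0.1:9123/api/v1/query",
                    data={"query": "{name=~`kv_curr_items|kv_curr_items_tot`}"},
                    timeout=30,
                )
                print(resp.status_code, resp.json().get("status"))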


            korrigan.clark Korrigan Clark added a comment - Also this in ns_server.fts.log:

            2021-02-25T18:17:24.392-08:00 [WARN] ns_server: retrieve partition seqs: CBPartitionSeqs, Stats (collections-details) failed, err: operation has timed out – cbft.RunSourcePartitionSeqs() at ns_server.go:1096


            korrigan.clark Korrigan Clark added a comment - In memcached.log:

            2021-02-25T17:30:31.634830-08:00 WARNING 279: Invalid password specified for [<ud>@fts</ud>]. Mechanism:[SCRAM-SHA512], UUID:[eab06aed-0c9d-48ee-5eea-6c4c61d7e72f]
            2021-02-25T17:30:31.635074-08:00 INFO 279: select_bucket failed - Not authenticated. {"cid":"2e18543a9d0d2ddf/d373c7c1b7fe1251/6","connection":"[ {"ip":"172.23.99.38","port":59392} - {"ip":"172.23.99.211","port":11210} (not authenticated) ]","bucket":"bucket-1"}
            2021-02-25T17:30:31.635244-08:00 INFO 279: Connection::on_event: unrecoverable error encountered: ["reading","error"], shutting down connection

            dfinlay Dave Finlay added a comment - edited

            Hi Korry

            With respect to this log message:

            Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs

            See the comment I wrote on MB-43518, an issue that relates to the same log trace. That comment explains why the trace is displayed; I concluded that, even though the log is shown, we shouldn't actually change the config synchronization logic. I'm still of that opinion. The one thing we should perhaps do is drop the log message: if we're not going to make a change, there is no point in alerting users. I've filed a task to consider this: MB-44600.
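
            For illustration, the message comes from a bounded-retry consistent-read pattern: read the config keys and values, read them again, and retry if anything changed in between. A minimal sketch of that pattern (Python; ns_server's actual implementation is Erlang, and every name and the retry bound here are invented):

                # Illustrative sketch of the bounded-retry consistent-read pattern
                # behind the warning; NOT ns_server's actual code.
                MAX_RETRIES = 10  # assumed bound

                def consistent_snapshot(read_keys_with_revs):
                    # read_keys_with_revs() -> {key: (revision, value)}
                    before = read_keys_with_revs()
                    for _ in range(MAX_RETRIES):
                        after = read_keys_with_revs()
                        if after == before:   # nothing changed mid-read: consistent
                            return after
                        before = after        # config churned underneath; try again
                    # Under heavy churn (e.g. creating 1000 indexes) the bound can be
                    # exceeded; replication still proceeds safely, hence "everything
                    # is safe" in the warning text.
                    raise RuntimeError("Exceeded retries count trying to get "
                                       "consistent keys/values for config replication")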

            dfinlay Dave Finlay added a comment -

            With respect to these log traces:

            [ns_server:error,2021-02-25T17:30:03.856-08:00,ns_1@cb.local:<0.1510.0>:prometheus:post_async:194]Prometheus http request failed:
            URL: http://127.0.0.1:9123/api/v1/query
            Body: query=%7Bname%3D~%60kv_curr_items%7Ckv_curr_items_tot%7Ckv_vb_replica_curr_items%7Ckv_mem_used_bytes%7Ccouch_docs_...
            

            I don't think this is a bug - there are just a handful of these messages, and they all occur in the first 30 seconds after the node was restarted, while Prometheus is still getting going.

            [ns_server:info,2021-02-25T17:29:42.855-08:00,nonode@nohost:<0.139.0>:ns_server:init_logging:151]Started & configured logging
            [ns_server:info,2021-02-25T17:29:42.859-08:00,nonode@nohost:<0.139.0>:ns_server:log_pending:33]Static config terms:
            [{error_logger_mf_dir,"/opt/couchbase/var/lib/couchbase/logs"},
            ...
            

            On the other issues - ns_server timing out when calling FTS, and FTS having trouble working with memcached - the FTS team should investigate.
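
            One way tests or tooling could avoid tripping over that startup window is to poll Prometheus's standard readiness endpoint before issuing queries. A minimal sketch (Python; the 30-second budget mirrors the window described above):

                # Minimal sketch: wait for the node-local Prometheus to report ready
                # before querying, to sidestep the startup-window failures above.
                import time
                import requests

                def wait_for_prometheus(base="http://127.0.0.1:9123", budget_s=30):
                    deadline = time.time() + budget_s
                    while time.time() < deadline:
                        try:
                            # /-/ready returns 200 once Prometheus can serve queries.
                            if requests.get(f"{base}/-/ready", timeout=2).status_code == 200:
                                return True
                        except requests.RequestException:
                            pass  # not up yet; keep polling
                        time.sleep(1)
                    return False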

            lynn.straus Lynn Straus added a comment -

            Per review in the Cheshire Cat Ops meeting and per email updates, all remaining majors/minors/trivials are being moved out of Cheshire Cat on March 10. The purpose is to allow teams to focus on Critical bugs for the remainder of the release. Teams can "pull back" deferred bugs as schedule and bandwidth allow.

            Changing target fix version to CheshireCat.next and adding "deferred-from-Cheshire-Cat" label.


            keshav Keshav Murthy added a comment - For Neo (7.1), we've improved the number of indexes supported in FTS. Officially, we support 500 indexes. There have been tests that create 1000 indexes successfully, but not all operations are supported for 1000 yet.

            People

              Sreekanth Sivasankaran Sreekanth Sivasankaran
              korrigan.clark Korrigan Clark
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty