Couchbase Server / MB-50861

[BP MB-50770 to 7.0.x] Query log is flooded with messages


Details


    Description

      Repro: Same as MB-50766, plus the following steps.

      After travel-sample failed:
      1. Create a default bucket (no data).
      2. Create a primary index (takes 17s+).
      3. Drop the default bucket.

      query.log then keeps flooding with:

      127.0.0.1:8091/pools/default/b/default": Requested resource not found.^M
      ) .. Retrying(122)
      2022-02-02T20:03:55.221-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 1
      2022-02-02T20:03:55.227-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 2
      2022-02-02T20:03:55.233-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 3
      2022-02-02T20:03:55.239-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 4
      2022-02-02T20:03:55.245-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 5
      2022-02-02T20:03:57.252-08:00 [Info] Error occurred during cluster info update (HTTP error 404 Object Not Found getting "http://127.0.0.1:8091/pools/default/b/default": Requested resource not found.^M
      ) .. Retrying(123)
      2022-02-02T20:03:57.254-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 1
      2022-02-02T20:03:57.260-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 2
      2022-02-02T20:03:57.266-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 3
      2022-02-02T20:03:57.271-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 4
      2022-02-02T20:03:57.277-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 5
      2022-02-02T20:03:59.284-08:00 [Info] Error occurred during cluster info update (HTTP error 404 Object Not Found getting "http://127.0.0.1:8091/pools/default/b/default": Requested resource not found.^M
      ) .. Retrying(124)
      2022-02-02T20:03:59.288-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 1
      2022-02-02T20:03:59.294-08:00 [Warn] cluster_info: Out of sync for bucket default. Retrying to getTerseBucket. retry count 2
      2022-02-02T20:03:59.300-08:00 [Warn] cluster......
      
      

      The bucket has already been dropped.


          Activity

            Copied from parent MB-50770:

            The fix

            From the code analysis above, it is clear that FetchForBucket() should not be using a RetryHelper object; that is what creates the middle of the three retry loops, which iterates 301 times:

            1. The checkForTokens() outer retry loop executes once every 30 seconds and terminates the first time the keyspace is found to be invalid.
            2. The FetchForBucket() middle retry loop executes 301 times and is not needed. (This routine also has its own inner retry loop, but that one is triggered only by an invalid ClusterInfoCache object, not by an invalid keyspace.)
            3. The RefreshBucket() (and RefreshManifest(), called just after it) inner retry loops are low-level routines that do 5x retries in tight loops. These retries are needed for the general robustness of these routines, including for other callers, but they will not flood the log once the middle retry loop is removed.

            Removing the middle retry loop means this scenario produces only 5 instances of the log message instead of 1,505, and it makes the code match the original intent.
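As an illustration only (not the actual indexing code; all names and the simulation are hypothetical), the multiplication of the retry layers described above can be sketched like this, using the loop counts from the analysis:

```go
package main

import "fmt"

// Simplified model of the nested retry layers. One keyspace check
// (one outer-loop iteration) runs the middle loop, and each middle
// iteration runs the inner 5x tight-loop retries, each of which
// emits one "Out of sync ... retry count N" warning.
const (
	middleRetries = 301 // FetchForBucket() RetryHelper, removed by the fix
	innerRetries  = 5   // RefreshBucket() tight-loop retries, kept
)

func logMessagesPerKeyspaceCheck(withMiddleLoop bool) int {
	middle := 1
	if withMiddleLoop {
		middle = middleRetries
	}
	msgs := 0
	for i := 0; i < middle; i++ {
		for j := 0; j < innerRetries; j++ {
			msgs++ // one warning per inner attempt
		}
	}
	return msgs
}

func main() {
	fmt.Println(logMessagesPerKeyspaceCheck(true))  // before the fix: 1505
	fmt.Println(logMessagesPerKeyspaceCheck(false)) // after the fix: 5
}
```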

            FetchForBucket() (cluster_info.go) was created in 7.1.0 via MB-46245 change set https://review.couchbase.org/c/indexing/+/155866, creating the middle 301x retry loop, so it is a regression and the root cause of the current MB.

            This change was then backported to 7.0.2 via MB-47635 change set https://review.couchbase.org/c/indexing/+/159091, thus the bug now exists there too.

            checkValidKeyspace() (metadata_provider.go) and the every-30-seconds check were added in 7.0.0 via MB-46058 change set https://review.couchbase.org/c/indexing/+/153284. These are not causes of the bug.

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment

            Steps to verify

            1. Create a cluster with n0:kv+n1ql, n1:index.
            2. Set indexer.debug.enableBackgroundIndexCreation to false to disable scheduled index creation.
            3. Create an index, and while it is being created, create a few more indexes so they go into the scheduled state.
            4. Delete the bucket on which the indexes are created.
            5. Set indexer.debug.enableBackgroundIndexCreation to true.
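Steps 2 and 5 toggle an indexer debug setting. Assuming it is applied through the indexer's settings REST interface (the exact endpoint and port vary by deployment, so treat this as a sketch rather than a verified command), the setting payload would be a JSON fragment like:

```json
{
  "indexer.debug.enableBackgroundIndexCreation": false
}
```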

            We should not see any log flooding on the query side. The query client checks the keyspace once every 30 seconds.
            After 30 seconds we should see errors like the ones below, but at most 5 times per index in the scheduled state:

             
            2022-04-07T20:07:35.662+05:30 [Warn] Pool::RefreshBucket: Retried 6 times due to out of sync for bucket default. Final err: HTTP error 404 Object Not Found getting "http://127.0.0.1:9000/pools/default/b/default": Requested resource not found.

            2022-04-07T20:07:37.663+05:30 [Info] Error occurred during cluster info update (HTTP error 404 Object Not Found getting "http://127.0.0.1:9000/pools/default/b/default": Requested resource not found.
            ) .. Retrying(1)
            
            

            sai.teja Sai Krishna Teja added a comment

            Build couchbase-server-7.0.4-7245 contains indexing commit 61aa8d0 with commit message:
            MB-50861 [BP MB 50770 to 7.0.x] Reduce retries under waitForScheduledIndex()

            build-team Couchbase Build Team added a comment

            Sai Krishna Teja, I followed the steps below; however, I still see a high number of "out of sync" messages. Let me know if this number is expected.

             

            Steps to verify

            1. Create a cluster with n0:kv+n1ql, n1:index.
            2. Set indexer.debug.enableBackgroundIndexCreation to false to disable scheduled index creation.
            3. Create 10 indexes in parallel; 3 indexes went into the scheduled state and the rest were built one by one.
            4. Delete the bucket on which the indexes are created.
            5. Set indexer.debug.enableBackgroundIndexCreation to true. Check for the "out of sync" messages in query.log.

            Logs attached

            node1-cb660-centos7.vagrants.zip

            node2-cb660-centos7.vagrants.zip

            hemant.rajput Hemant Rajput added a comment

            Hemant Rajput, this looks fine. Every scheduled index validates the bucket independently once every 30 seconds while waiting for the scheduled index, and tries up to 5 times to account for transient failures, so this should not flood the logs now.

            sai.teja Sai Krishna Teja added a comment

            Validated on 7.0.4-7261. Closing as per Sai's comment.

            hemant.rajput Hemant Rajput added a comment

            People

              sai.teja Sai Krishna Teja
              kevin.cherkauer Kevin Cherkauer (Inactive)
              Votes: 0
              Watchers: 4

