Couchbase Server
MB-43518

Seeing "Exceeded retries count trying to get consistent keys/values for config replication"


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.6.1
    • Fix Version/s: Neo.next
    • Component/s: fts
    • Environment: Enterprise Edition 6.6.1 build 1000012378

    Description

      Build: 6.6.1 build 1000012378

      Scenario:

      1. Cluster with 5 kv, 2 index+n1ql, 2 search nodes
      2. 6 buckets with 5000 docs each
      3. Built 200 GSI indexes with 1 replica (50 indexes on each of 4 buckets)
      4. Created 30 FTS custom indexes (10 indexes on each of 3 buckets), just to add more entries to metakv
      5. Created and dropped 100 GSI indexes sequentially on each of 4 buckets (adding create/drop entries for 400 indexes in total)
      6. Created and dropped 50 FTS indexes on 3 buckets (a rough sketch of such a create/drop loop follows this list).
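
      For illustration, below is a minimal Go sketch of the kind of create/drop loop used in step 6. It assumes the standard FTS REST endpoints (PUT and DELETE on /api/index/{name}, port 8094); the host, credentials, bucket names, and the per-bucket loop counts are placeholders rather than the exact harness used in this test.

      package main

      // Hypothetical driver for scenario step 6: repeatedly create and then
      // drop FTS index definitions through the FTS REST API. The host,
      // credentials, bucket names and loop counts are placeholders.

      import (
          "bytes"
          "fmt"
          "net/http"
      )

      const ftsBase = "http://127.0.0.1:8094" // placeholder FTS node

      func ftsRequest(method, path, body string) error {
          req, err := http.NewRequest(method, ftsBase+path, bytes.NewBufferString(body))
          if err != nil {
              return err
          }
          req.SetBasicAuth("Administrator", "password") // placeholder credentials
          req.Header.Set("Content-Type", "application/json")
          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode >= 300 {
              return fmt.Errorf("%s %s: %s", method, path, resp.Status)
          }
          return nil
      }

      func main() {
          buckets := []string{"bucket1", "bucket2", "bucket3"} // placeholder bucket names
          for _, bucket := range buckets {
              // 50 create/drop cycles per bucket; the exact distribution used
              // in the original run is not specified in the ticket.
              for i := 0; i < 50; i++ {
                  name := fmt.Sprintf("fts_%s_%d", bucket, i)
                  def := fmt.Sprintf(`{"type":"fulltext-index","sourceType":"couchbase","sourceName":"%s"}`, bucket)
                  if err := ftsRequest(http.MethodPut, "/api/index/"+name, def); err != nil {
                      fmt.Println("create failed:", err)
                      continue
                  }
                  if err := ftsRequest(http.MethodDelete, "/api/index/"+name, ""); err != nil {
                      fmt.Println("drop failed:", err)
                  }
              }
          }
      }

      Each create and drop turns into metakv writes, which is what drives the config churn analyzed in the comments below.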

      After step 6, the following messages appear in the logs:

      Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs


          Activity

            Dave Finlay added a comment -

            First, I should explain this behavior. This log trace:

            Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs
            

            occurs during config replication (aka config synchronization) when there are too many independent changes to the configuration over a short time interval. What happens is that:

            1. Some key in the configuration changes and the config replicator is notified to replicate the change to other nodes.
            2. Just before it replicates, the replicator checks whether it has been notified of other changes. If not, the changes get replicated. Otherwise it picks up the new changes, adds them to the batch to be sent, and repeats this step.
            3. If the replicator checks and accumulates changes 10 times, it gives up and just replicates the entire configuration.

            The reason for checking and accumulating is to batch replicated changes as much as possible, which is more efficient. The reason to give up after 10 iterations is to avoid accumulating potentially hundreds of changes one by one, which would be slow. It is in this case - when the replicator has checked and accumulated changes 10 times - that the log trace is emitted.
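
            For illustration only, here is a rough Go sketch of that accumulate-then-give-up loop. This is not the actual ns_server implementation (which is Erlang); the function and channel names below are invented for the example.

            package main

            import "fmt"

            // Illustrative sketch of the batching logic described above:
            // accumulate pending config changes, re-checking for new ones
            // before sending; after maxRetries rounds of accumulation, give
            // up and replicate the full configuration instead.
            const maxRetries = 10

            func replicate(pending chan string, sendKeys func([]string), sendFull func()) {
                batch := []string{<-pending} // block until the first change notification

                for attempt := 0; attempt < maxRetries; attempt++ {
                    select {
                    case key := <-pending:
                        // More changes arrived before anything was sent: add
                        // them to the batch and loop around to check again.
                        batch = append(batch, key)
                    default:
                        // Nothing new is pending: replicate just the batched keys.
                        sendKeys(batch)
                        return
                    }
                }

                // Too many rounds of accumulation - warn and fall back to
                // sending the entire configuration, which is what produces
                // the log line quoted above.
                fmt.Println("Exceeded retries count trying to get consistent keys/values for config replication.")
                sendFull()
            }

            func main() {
                pending := make(chan string, 32)
                pending <- "/fts/cbgt/cfg/someKey"
                pending <- "/fts/cbgt/cfg/anotherKey"
                replicate(pending,
                    func(keys []string) { fmt.Println("replicating keys:", keys) },
                    func() { fmt.Println("replicating full config") })
            }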

            As the log trace says, it's not a bug but it does indicate that the config is undergoing a lot of change at this time. And indeed that's the case.

            We see these logs consistently during the time period from 13:26 to 13:48:

            2021-01-07T13:26:17.510-08:00, ns_config_rep:0:warning:message(ns_1@172.23.121.46) - Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs
            2021-01-07T13:26:32.190-08:00, ns_config_rep:0:warning:message(ns_1@172.23.121.47) - Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs
            ...
            2021-01-07T13:48:50.985-08:00, ns_config_rep:0:warning:message(ns_1@172.23.121.47) - Exceeded retries count trying to get consistent keys/values for config replication. This is minor bug. Everything is safe, but please file bug and attach logs
            

            During this time, FTS is writing heavily to metakv. E.g. on node 47 we see the per-minute request counts below (the second listing shows similar numbers from another node's access log):

            $ grep -E "(PUT|DELETE|POST).*metakv.*fts" ns_server.http_access_internal.log  | grep -o ...Jan.2021:..:.. | uniq -c 
               3 07/Jan/2021:11:24
             472 07/Jan/2021:11:49
             784 07/Jan/2021:13:26
             646 07/Jan/2021:13:27
             512 07/Jan/2021:13:28
             599 07/Jan/2021:13:29
             442 07/Jan/2021:13:30
             462 07/Jan/2021:13:31
             496 07/Jan/2021:13:32
             402 07/Jan/2021:13:33
             340 07/Jan/2021:13:34
             385 07/Jan/2021:13:35
             375 07/Jan/2021:13:36
             393 07/Jan/2021:13:37
             331 07/Jan/2021:13:38
             365 07/Jan/2021:13:39
             298 07/Jan/2021:13:40
             328 07/Jan/2021:13:41
             331 07/Jan/2021:13:42
             263 07/Jan/2021:13:43
             268 07/Jan/2021:13:44
             243 07/Jan/2021:13:45
             256 07/Jan/2021:13:46
             231 07/Jan/2021:13:47
             232 07/Jan/2021:13:48
             247 07/Jan/2021:13:49
             222 07/Jan/2021:13:50
             162 07/Jan/2021:13:51
            

             $ grep -E "(PUT|DELETE|POST).*metakv.*fts" ns_server.http_access_internal.log  | grep -o ...Jan.2021:..:.. | uniq -c 
               3 07/Jan/2021:11:24
             525 07/Jan/2021:11:49
             839 07/Jan/2021:13:26
             666 07/Jan/2021:13:27
             538 07/Jan/2021:13:28
             591 07/Jan/2021:13:29
             478 07/Jan/2021:13:30
             477 07/Jan/2021:13:31
             504 07/Jan/2021:13:32
             443 07/Jan/2021:13:33
             383 07/Jan/2021:13:34
             376 07/Jan/2021:13:35
             373 07/Jan/2021:13:36
             406 07/Jan/2021:13:37
             371 07/Jan/2021:13:38
             340 07/Jan/2021:13:39
             342 07/Jan/2021:13:40
             342 07/Jan/2021:13:41
             371 07/Jan/2021:13:42
             309 07/Jan/2021:13:43
             278 07/Jan/2021:13:44
             206 07/Jan/2021:13:45
             274 07/Jan/2021:13:46
             303 07/Jan/2021:13:47
             339 07/Jan/2021:13:48
             241 07/Jan/2021:13:49
             211 07/Jan/2021:13:50
             136 07/Jan/2021:13:51
            

            Adding these numbers together and graphing them shows that FTS is consistently writing to metakv between 500 and 1,600 times per minute. A sustained write rate like that pretty easily explains cases where the config replicator checks and accumulates changes 10 times, bails out, and sends the entire configuration.

            Essentially, the root cause of this issue is that there are probably too many FTS keys in the configuration, and when FTS needs to write it writes heavily, so the end result is heavy config synchronization. All of this is by design - but it would be useful if there weren't quite so many keys. (In this case the config has something like 19,000 keys, most of which are FTS related.) Of course, FTS probably created a lot of keys precisely to allow independent updates without conflicts.

            And on that topic, FTS is seeing quite a number of config conflicts on its keys.

            [user:debug,2021-01-07T11:49:16.532-08:00,ns_1@172.23.121.46:ns_config_rep_merger<0.4219.0>:ns_config:merge_values_using_timestamps:1339]Conflicting configuration changes to field {metakv,
                                                        <<"/fts/cbgt/cfg/curMetaKvPlanKey">>}:
            [{'_vclock',[{<<"0d3b83acfd354a73398c217b0ad5314b">>,{3,63777268156}},
                         {<<"b65d5e2ad66ff9066bd07c8a8d8c9c76">>,{4,63777268156}}]}|
             <<"{\"path\":\"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-5730f7120353abd502a37be3d1cf35ba-1610048956465/\",\"uuid\":\"25530c19941a7ea8\",\"implVersion\":\"5.5.0\"}">>] and
            [{'_vclock',[{<<"0d3b83acfd354a73398c217b0ad5314b">>,{4,63777268156}},
                         {<<"b65d5e2ad66ff9066bd07c8a8d8c9c76">>,{3,63777268156}}]}|
             <<"{\"path\":\"/fts/cbgt/cfg/planPIndexesLean/planPIndexesLean-ce1788f34884bf79266533e961bd9425-1610048956420/\",\"uuid\":\"6b0d1b504e51cea4\",\"implVersion\":\"5.5.0\"}">>], choosing the former.
            

            This conflict looks like it might be problematic, but perhaps it's OK - the FTS team should know.

            $ fgrep -A 1 Conflicting ns_server.debug.log | fgrep -B 1 /fts/cbgt/cfg | fgrep /fts/cbgt  | wc -l
                 187
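
            For context, ns_config resolves these conflicts using the timestamps carried in the values' vector clocks, and the timestamps in the log above are whole seconds (63777268156), which is consistent with the 1-second granularity Sreekanth mentions below; two writes from different nodes landing in the same second can therefore tie, and one of them simply loses. Below is a very rough, hypothetical Go sketch of that kind of resolution - it is not the actual merge_values_using_timestamps implementation, and the type and field names are made up.

            package main

            import "fmt"

            // Hypothetical sketch of timestamp-based conflict resolution:
            // each value carries a vector clock whose entries hold
            // {counter, timestamp}, with timestamps in whole seconds. The
            // value with the newer timestamp wins; on a tie (two writes
            // within the same second) the first argument is kept, mirroring
            // "choosing the former" in the log above.
            type clockEntry struct {
                Counter   int
                Timestamp int64 // e.g. 63777268156
            }

            type value struct {
                Clock []clockEntry
                Body  string
            }

            func latestTimestamp(v value) int64 {
                var newest int64
                for _, e := range v.Clock {
                    if e.Timestamp > newest {
                        newest = e.Timestamp
                    }
                }
                return newest
            }

            func resolve(a, b value) value {
                if latestTimestamp(b) > latestTimestamp(a) {
                    return b
                }
                return a // newer or tied: keep the former
            }

            func main() {
                a := value{Clock: []clockEntry{{3, 63777268156}, {4, 63777268156}}, Body: "plan A"}
                b := value{Clock: []clockEntry{{4, 63777268156}, {3, 63777268156}}, Body: "plan B"}
                fmt.Println("winner:", resolve(a, b).Body) // timestamps tie, so "plan A" wins
            }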
            

            In summary, I would say that:

            • It would be great if the number of FTS config keys could be reduced, though this may not be very easy to do
            • FTS should take a look at the config conflicts and check that they are ok
            • Otherwise, I don't believe that sending the full config after 10 attempts to replicate only the changed keys is a behavior that strongly warrants change, so I think we should leave it as it currently is.

            Sreekanth Sivasankaran added a comment -

            Thanks again Dave Finlay for the insightful analysis here.

            We made a trade-off between `the number of keys` vs `the value size of a key` and chose to go with more keys rather than putting more information into a few key values, back around the Alice release. In this new context, it now looks like both could be pain points.

            Is there a recommendation from the metakv team between these two options (many small keys vs large keys) for a client?

            (Fixing/reducing the number of keys when there are higher numbers of indexes won't be an easy problem for FTS either.)
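
            To make that trade-off concrete, here is a small hypothetical Go sketch of the two layouts - many small keys, one per index plan, versus a single large aggregated key. The paths, struct and field names are illustrative only and are not FTS's actual metakv schema.

            package main

            import (
                "encoding/json"
                "fmt"
            )

            // planEntry is a stand-in for per-index plan metadata.
            type planEntry struct {
                UUID        string `json:"uuid"`
                ImplVersion string `json:"implVersion"`
            }

            func main() {
                plans := map[string]planEntry{
                    "idx-1": {UUID: "25530c19941a7ea8", ImplVersion: "5.5.0"},
                    "idx-2": {UUID: "6b0d1b504e51cea4", ImplVersion: "5.5.0"},
                }

                // Option A: one metakv key per index plan. Writers touch
                // independent keys, so they rarely conflict, but the key
                // count grows with the number of indexes and config
                // replication gets chattier.
                for name, p := range plans {
                    val, _ := json.Marshal(p)
                    fmt.Printf("PUT /fts/cbgt/cfg/plan/%s -> %s\n", name, val)
                }

                // Option B: one key holding all plans. Only one key to
                // replicate, but every update rewrites a large value and
                // concurrent writers on different nodes conflict on the
                // same key.
                all, _ := json.Marshal(plans)
                fmt.Printf("PUT /fts/cbgt/cfg/plans -> %s\n", all)
            }

            Neither option is free: the first trades replication chatter for fewer conflicts, and the second does the reverse.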

             

            We have all been anticipating the new metakv in CC, hoping that it would solve most of our consistent meta store headaches.

            Could you please share the latest updates on how the new metakv (the CC one) handles these edge use cases?

            For example:

            1. Any max limit on the size of a key/value?
            2. Any limit on the number of keys?
            3. What would be the minimum conflict resolution interval? (IIRC, the current metakv conflict resolution happens at 1-second granularity, so anything that happens within the same sub-second window is liable to be lost to conflict resolution.)

             

            Ashwin Govindarajulu added a comment -

            Update: Able to reproduce the same issue on 6.6.2 build 9439.

            Logs:
            https://supportal.couchbase.com/snapshot/45fc73b30c379b964536e719dd7c7731::0
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.106.245.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.107.104.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.107.19.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.107.87.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.121.21.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.121.41.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.121.45.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.121.46.zip
            https://uploads.couchbase.com/MB-12345/collectinfo-2021-01-20T104242-ns_1%40172.23.121.47.zip

            People

              Assignee: Sreekanth Sivasankaran
              Reporter: Ashwin Govindarajulu
              Votes: 0
              Watchers: 9
