Couchbase Server: MB-50231

[Backport MB-44158 to 6.6.5] Increase watcher timeout to 30 sec in DDL service manager


Details


    Description

      The watch-metadata algorithm waits 1 second for metadata to be synced. If 1 second is not enough for the metadata to sync, a retry helper is invoked. This retry helper waits for the existing watcher to become in-sync, subject to its own timeout.

      DDL service manager closes the retry helper after a timeout.

      Before version 6.6.2, the timeout was 0.5 seconds.
      In version 6.6.2, the timeout was increased to 5 seconds.

      In the attached CBSE (using version 6.6.0), the metadata file size is already large. So, to be on the safer side, the timeout should be increased to 30 seconds (which is the default from version 7.0 onwards).

      Note that no such timeout exists in the case of the query client.

      CC Deepkaran Salooja, Jeelan Poola, Wayne Siu, Mihir Kamdar.

      Note that the workaround in this case is disruptive as the node with large metadata file size needs to be removed from the cluster.


          Activity

            jeelan.poola Jeelan Poola added a comment -

            Wayne Siu It would be great to get this included in 6.6.5 as we have seen customers run into this. The fix is only bumping up a timeout value. Cc Keshav Murthy


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.5-10072 contains indexing commit d640f9f with commit message:
            MB-50231: Increase timeout for watcher to go live to 30 Sec

            amit.kulkarni Amit Kulkarni added a comment -

            QE Verification:

            Perform a large number of metadata operations. In the presence of a large metadata file (generated by a large number of metadata operations such as index create/delete), look out for messages like the following.

            DDLServiceMgr: cannot connect to node
            

            And

            Failed to initialize metadata provider.
            

            Please report the occurrences of such messages.

            Thanks.
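            The grep-based check described above can be scripted; a minimal sketch is shown below. The sample log lines are fabricated for illustration only; a real run would point the grep at the node's indexer.log* files instead of a temp file.

            ```shell
            # Fabricated sample of indexer.log lines, for illustration only.
            tmp=$(mktemp -d)
            cat > "$tmp/indexer.log" <<'EOF'
            2022-01-05T20:01:18.162-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:05:00.000-08:00 [Info] some unrelated line
            2022-01-05T20:06:01.000-08:00 [Warn] Failed to initialize metadata provider
            EOF
            # Same two patterns named in the QE verification steps above;
            # -c counts the lines matching either pattern.
            hits=$(grep -c -e "cannot connect to node" -e "Failed to initialize metadata provider" "$tmp/indexer.log")
            echo "matches: $hits"
            rm -rf "$tmp"
            ```

            On real cluster logs, a non-zero count here is what should be reported back on this ticket.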


            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Thanks Amit for the details on validating the issue. I have written a script that constantly creates and drops indexes for a given time. With this, I was able to see the below-mentioned warnings once the FDB MetadataStore size was 350 MB+. This was with build 6.6.5-10071 on a 2-node cluster. Can you please confirm whether this is a correct reproduction of the issue? Once confirmed, I will proceed with validation of the fix.

            [root@s12235 logs]# zgrep -e "cannot connect to node" -e "Failed to initialize metadata provider" -i indexer.log*
            indexer.log:2022-01-05T20:01:18.162-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            indexer.log:2022-01-05T20:25:05.881-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            indexer.log:2022-01-05T20:33:15.797-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            indexer.log:2022-01-05T20:33:24.255-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            indexer.log:2022-01-05T20:38:36.260-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            indexer.log:2022-01-05T20:42:05.369-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            indexer.log:2022-01-05T20:42:15.556-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            indexer.log:2022-01-05T20:43:38.117-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            indexer.log:2022-01-05T20:45:20.308-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            

            Logs :
            https://cb-jira.s3.us-east-2.amazonaws.com/cb-engineering/MB-50231_QE_Repro/collectinfo-2022-01-06T060425-ns_1%40172.23.107.56.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/cb-engineering/MB-50231_QE_Repro/collectinfo-2022-01-06T060425-ns_1%40172.23.107.75.zip

            amit.kulkarni Amit Kulkarni added a comment - edited

            Hi Mihir Kamdar,

            Thanks for the logs.

            The experiment looks good. Following are the observations.

            Indexer starts on both the nodes at 2022-01-05T17:29.

            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "Indexer started with command line"
            2022-01-05T17:29:11.377-08:00 [Info] Indexer started with command line: [/opt/couchbase/bin/indexer -adminPort=9100 -scanPort=9101 -httpPort=9102 -streamInitPort=9103 -streamCatchupPort=9104 -streamMaintPort=9105 --httpsPort=19102 --certFile=/opt/couchbase/var/lib/couchbase/config/memcached-cert.pem --keyFile=/opt/couchbase/var/lib/couchbase/config/memcached-key.pem -vbuckets=1024 -cluster=127.0.0.1:8091 -storageDir=/opt/couchbase/var/lib/couchbase/data/@2i -diagDir=/opt/couchbase/var/lib/couchbase/crash -nodeUUID=6b2030cfb869eacb2e872d221d48906e -ipv6=false -isEnterprise=true]
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "Indexer started with command line"
            2022-01-05T17:29:38.152-08:00 [Info] Indexer started with command line: [/opt/couchbase/bin/indexer -adminPort=9100 -scanPort=9101 -httpPort=9102 -streamInitPort=9103 -streamCatchupPort=9104 -streamMaintPort=9105 --httpsPort=19102 --certFile=/opt/couchbase/var/lib/couchbase/config/memcached-cert.pem --keyFile=/opt/couchbase/var/lib/couchbase/config/memcached-key.pem -vbuckets=1024 -cluster=127.0.0.1:8091 -storageDir=/opt/couchbase/var/lib/couchbase/data/@2i -diagDir=/opt/couchbase/var/lib/couchbase/crash -nodeUUID=1dbbc4224709a9eafb9b4b0a6cf1daa7 -ipv6=false -isEnterprise=true]
            

            Around 2022-01-05T18:23, the initial 1-second wait starts to become insufficient.

            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "WatchMetadata(): unable to reach indexer at 172.23.107.75" > Retry_75_75.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "WatchMetadata(): unable to reach indexer at 172.23.107.56" > Retry_75_56.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "WatchMetadata(): unable to reach indexer at 172.23.107.56" > Retry_56_56.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "WatchMetadata(): unable to reach indexer at 172.23.107.75" > Retry_56_75.log
            

            MacBook-Pro:MB-50231 amithk$ head -3 Retry_75_75.log 
            2022-01-05T18:23:27.570-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.75:9100. Retry in background.
            2022-01-05T18:29:57.235-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.75:9100. Retry in background.
            2022-01-05T18:36:19.814-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.75:9100. Retry in background.
            MacBook-Pro:MB-50231 amithk$ head -3 Retry_75_56.log 
            2022-01-05T18:25:14.978-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.56:9100. Retry in background.
            2022-01-05T18:27:07.089-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.56:9100. Retry in background.
            2022-01-05T18:30:54.931-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.56:9100. Retry in background.
            MacBook-Pro:MB-50231 amithk$ head -3 Retry_56_56.log 
            2022-01-05T18:26:53.703-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.56:9100. Retry in background.
            2022-01-05T18:27:09.238-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.56:9100. Retry in background.
            2022-01-05T18:28:55.518-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.56:9100. Retry in background.
            MacBook-Pro:MB-50231 amithk$ head -3 Retry_56_75.log 
            2022-01-05T18:23:02.272-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.75:9100. Retry in background.
            2022-01-05T18:26:02.304-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.75:9100. Retry in background.
            2022-01-05T18:26:07.285-08:00 [Info] WatchMetadata(): unable to reach indexer at 172.23.107.75:9100. Retry in background.
            

            Around 2022-01-05T20, the total 6-second delay (1 second initial + 5 seconds during retry) starts to become insufficient.

            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "DDLServiceMgr: cannot connect to node 172.23.107.75" > CannotConnect_75_75.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "DDLServiceMgr: cannot connect to node 172.23.107.56" > CannotConnect_75_56.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "DDLServiceMgr: cannot connect to node 172.23.107.56" > CannotConnect_56_56.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "DDLServiceMgr: cannot connect to node 172.23.107.75" > CannotConnect_56_75.log
            

            MacBook-Pro:MB-50231 amithk$ head -3 CannotConnect_75_75.log 
            2022-01-05T19:52:02.939-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:01:02.956-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:01:17.323-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            MacBook-Pro:MB-50231 amithk$ head -3 CannotConnect_75_56.log 
            2022-01-05T20:05:14.390-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            2022-01-05T20:08:39.121-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            2022-01-05T20:11:32.958-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            MacBook-Pro:MB-50231 amithk$ head -3 CannotConnect_56_56.log 
            2022-01-05T20:25:05.881-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            2022-01-05T20:42:15.556-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            MacBook-Pro:MB-50231 amithk$ head -3 CannotConnect_56_75.log 
            2022-01-05T20:01:18.162-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:33:15.797-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:33:24.255-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            

            During the timeframe from 2022-01-05T20:00 to 2022-01-05T20:59, the DDL service manager tries connecting to the admin port hundreds of times.

            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "MetadataProvider.WatchMetadata" | grep "172.23.107.75" > WatchMetadata_75_75.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.75_20220106-060428/ns_server.indexer.log | grep "MetadataProvider.WatchMetadata" | grep "172.23.107.56" > WatchMetadata_75_56.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "MetadataProvider.WatchMetadata" | grep "172.23.107.56" > WatchMetadata_56_56.log
            MacBook-Pro:MB-50231 amithk$ less cbcollect_info_ns_1\@172.23.107.56_20220106-060428/ns_server.indexer.log | grep "MetadataProvider.WatchMetadata" | grep "172.23.107.75" > WatchMetadata_56_75.log
            

            MacBook-Pro:MB-50231 amithk$ less WatchMetadata_75_75.log | grep -c "2022-01-05T20"
            348
            MacBook-Pro:MB-50231 amithk$ less WatchMetadata_75_56.log | grep -c "2022-01-05T20"
            348
            MacBook-Pro:MB-50231 amithk$ less WatchMetadata_56_56.log | grep -c "2022-01-05T20"
            350
            MacBook-Pro:MB-50231 amithk$ less WatchMetadata_56_75.log | grep -c "2022-01-05T20"
            350
            

            Of these attempts, it fails only a few times.

            MacBook-Pro:MB-50231 amithk$ less CannotConnect_75_75.log | grep -c "2022-01-05T20"
            10
            MacBook-Pro:MB-50231 amithk$ less CannotConnect_75_56.log | grep -c "2022-01-05T20"
            7
            MacBook-Pro:MB-50231 amithk$ less CannotConnect_56_56.log | grep -c "2022-01-05T20"
            2
            MacBook-Pro:MB-50231 amithk$ less CannotConnect_56_75.log | grep -c "2022-01-05T20"
            7
            

            After the fix, the initial wait of 1 second remains the same, but the total wait increases from 6 seconds to 31 seconds.

            So, the expected observations after the fix are:
            1. The problem (the "DDLServiceMgr: cannot connect to node" error) should start appearing much later in the experiment than before the fix (3 and a half hours after indexer start).
            2. The total frequency of the problem should be lower than before the fix.

            The DDL service manager keeps trying to connect to nodes regularly until it is done with its task, so these errors themselves are not too bad if the next iteration succeeds.

            As long as the frequency of the problem is reduced after the fix, we should be good.
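            The per-hour counting used in the analysis above can be reproduced with plain grep on the timestamp prefix; a minimal sketch (the sample lines are taken from the CannotConnect_75_75.log excerpt shown earlier):

            ```shell
            tmp=$(mktemp -d)
            cat > "$tmp/CannotConnect_sample.log" <<'EOF'
            2022-01-05T19:52:02.939-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:01:02.956-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            2022-01-05T20:01:17.323-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            EOF
            # Count warnings falling in the 20:00-20:59 hour bucket by matching
            # the timestamp prefix, as done in the analysis above.
            in_hour=$(grep -c "2022-01-05T20" "$tmp/CannotConnect_sample.log")
            echo "warnings in hour 20: $in_hour"
            rm -rf "$tmp"
            ```

            Comparing such per-hour counts before and after the fix is what the expected-observation check boils down to.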

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - edited

            Thanks for the analysis, Amit Kulkarni. I ran the same test with the fix, using build 6.6.5-10072. Here are my observations after the FDB MetadataStore file size exceeded 450 MB.

            On 172.23.107.75

            [root@localhost @2i]# ls -lah MetadataStore
            -rw-rw----. 1 couchbase couchbase 460M Jan  6 16:38 MetadataStore
            [root@localhost @2i]# zgrep -e "unable to reach indexer" -i ../../logs/indexer.log* | wc -l
            1072
            [root@localhost @2i]# zgrep -e "Successfully connected to indexer at 172.23.107.56:9100 after retry" -i ../../logs/indexer.log* | wc -l
            625
            [root@localhost @2i]# zgrep -e "Successfully connected to indexer at 172.23.107.75:9100 after retry" -i ../../logs/indexer.log* | wc -l
            445
            [root@localhost @2i]# zgrep -e "cannot connect to node" -e "Failed to initialize metadata provider" -i ../../logs/indexer.log*
            ../../logs/indexer.log:2022-01-06T16:22:56.577-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.75:9100
            ../../logs/indexer.log:2022-01-06T16:24:59.114-08:00 [Warn] DDLServiceMgr: cannot connect to node 172.23.107.56:9100
            

            On 172.23.107.56

            [root@s12235 @2i]# ls -lah MetadataStore
            -rw-rw----. 1 couchbase couchbase 456M Jan  6 16:39 MetadataStore
            [root@s12235 @2i]# zgrep -e "unable to reach indexer" -i ../../logs/indexer.log* | wc -l
            1112
            [root@s12235 @2i]# zgrep -e "Successfully connected to indexer at 172.23.107.56:9100 after retry" -i ../../logs/indexer.log* | wc -l
            585
            [root@s12235 @2i]# zgrep -e "Successfully connected to indexer at 172.23.107.75:9100 after retry" -i ../../logs/indexer.log* | wc -l
            527
            [root@s12235 @2i]# zgrep -e "cannot connect to node" -e "Failed to initialize metadata provider" -i indexer.log*
            gzip: indexer.log*.gz: No such file or directory
            [root@s12235 @2i]# zgrep -e "cannot connect to node" -e "Failed to initialize metadata provider" -i ../../logs/indexer.log*
            [root@s12235 @2i]# exit
            

            Can you please confirm that this is as per expectation and that the fix worked? Here are the logs:
            https://cb-jira.s3.us-east-2.amazonaws.com/cb-engineering/MB-50231_QE_FixValidation/collectinfo-2022-01-07T015129-ns_1%40172.23.107.56.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/cb-engineering/MB-50231_QE_FixValidation/collectinfo-2022-01-07T015129-ns_1%40172.23.107.75.zip

            amit.kulkarni Amit Kulkarni added a comment - edited

            Thanks Mihir Kamdar for the verification.

            As per the new logs (which include the fix), metadata sync was tried more than 8000 times (more than in the previous experiment, where metadata sync was tried 6400 times).

            MacBook-Pro:newLogs amithk$ grep -c "MetadataProvider.WatchMetadata" cbcollect_info_ns_1\@172.23.107.56_20220107-015132/ns_server.indexer.log
            4110
            MacBook-Pro:newLogs amithk$ grep -c "MetadataProvider.WatchMetadata" cbcollect_info_ns_1\@172.23.107.75_20220107-015132/ns_server.indexer.log
            4103
            

            The metadata sync completed successfully before the timeout all but 2 times.

            MacBook-Pro:newLogs amithk$ grep -c "DDLServiceMgr: cannot connect to node" cbcollect_info_ns_1\@172.23.107.75_20220107-015132/ns_server.indexer.log
            2
            MacBook-Pro:newLogs amithk$ grep -c "DDLServiceMgr: cannot connect to node" cbcollect_info_ns_1\@172.23.107.56_20220107-015132/ns_server.indexer.log
            0
            

            As the error frequency is drastically reduced and the indexer retry mechanism is built in, the verification appears successful. The fix has worked as expected.


            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Thanks Amit Kulkarni for the validation. I will close this bug as Fix Verified.


            People

              mihir.kamdar Mihir Kamdar (Inactive)
              amit.kulkarni Amit Kulkarni