Couchbase Server / MB-34931

[high-bucket] FTS: swap node rebalance failed under high bucket density tests


Details


    Description

      Spawned this ticket from https://issues.couchbase.com/browse/MB-32547, which fixed a boltdb panic.

      Recent runs from Mahesh Mandhare on build 6.5.0-3633 show an error when accessing the /api/stats REST endpoint, so this separate ticket was created to keep things clear.
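
      For reference, a minimal sketch of how such a stats poll against the FTS /api/stats REST endpoint might look in Go. This is an assumption-level illustration, not the test harness code; the host, port, and credentials below are placeholders.

      package main

      import (
      	"fmt"
      	"io"
      	"net/http"
      	"time"
      )

      func main() {
      	// Placeholder node address and credentials - not values from this ticket.
      	const statsURL = "http://fts-node.example.com:8094/api/stats"

      	client := &http.Client{Timeout: 30 * time.Second}

      	req, err := http.NewRequest("GET", statsURL, nil)
      	if err != nil {
      		panic(err)
      	}
      	req.SetBasicAuth("Administrator", "password") // placeholder credentials

      	resp, err := client.Do(req)
      	if err != nil {
      		// This is the kind of failure the high-bucket-density runs reported.
      		fmt.Println("stats request failed:", err)
      		return
      	}
      	defer resp.Body.Close()

      	body, _ := io.ReadAll(resp.Body)
      	fmt.Println(resp.Status, len(body), "bytes of stats JSON")
      }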

      Attachments

        Issue Links


          Activity

            steve Steve Yen added a comment -

            The last set of cbcollect-info's from MB-32547 are pertinent.


            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Mahesh Mandhare, can you please re-test these on the latest MH builds, as other relevant fixes have gone in.

            Can we get this retested on build 6.5.0-4522 or on the latest MH?
            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Build 6.5.0-4558

            Observed that the swap rebalance of the FTS node takes a huge amount of time. In the run below it had been running for more than 24 hours and progress was still at 58%. I had to stop the test to use the cluster for other tests. I collected logs before stopping the test; here are the details:

            Job: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/325

            Logs:
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.96.20.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.96.23.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.12.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.13.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.14.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.15.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.177.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.19.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_swap_stuck/collectinfo-2019-10-25T032842-ns_1%40172.23.97.20.zip

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Mahesh Mandhare, one quick doubt here - what is the configuration of the system?

            Mainly, how much FTS memory quota is allotted here?

            How many documents are indexed?

            How many FTS indexes are there?
            lynn.straus Lynn Straus added a comment -

            Added due date field (preset to Nov 15). Please update the due date to the current ETA for a fix. Thanks.


            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Sreekanth Sivasankaran,

            FTS memory quota: 40GB

            FTS indexes: 30 indexes, 1 index per bucket

            Docs: ~1M per index
            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            Checked the logs and the test scenario is as follows.

             

            1. It does some complex setup with 30 FTS indexes, each with 6 partitions.
            2. It does rebalances for different services.
            3. Then it comes to the FTS rebalance: first rebalance in an FTS node, then rebalance out an FTS node, then swap rebalance an FTS node.

            The first two rebalance attempts succeed, but the third attempt (the swap rebalance) is reported to be taking too much time.

            The swap rebalance has to move around 180 partitions to complete.

            From the logs, it never looked blocked; it just progresses slowly.

            Checking further, there were some temporary bucket connect errors from cbdatasource, which got resolved after a while.

            eg:

            2019-10-24T18:27:13.283-07:00 [INFO] feed_dcp: OnError, name: bucket-19-fts_less_words_2d5c87fcd687c27f_f4e0a48a: bucketName: bucket-19, bucketUUID: , err: could not connect to any serverURL: []string{"http://127.0.0.1:8091"}
             
             
            2019-10-24T19:45:06.258-07:00 [INFO] feed_dcp: OnError, name: bucket-19-fts_less_words_2d5c87fcd687c27f_f4e0a48a: bucketName: bucket-19, bucketUUID: , err: could not connect to any serverURL: []string{"http://127.0.0.1:8091"}  

            This error occurred 422 times and blocked the partition build for about 1 hour 15 min, and then it self-healed via the refresh workers in cbdatasource.

            Similar errors can be observed with other partitions as well, but a few get resolved within a few minutes (~3 min) for partitions like

            bucket-1-fts_less_words_3e0e4ea008a374f1_13aa53f3 

             

            Another partition, bucket-13-fts_less_words_4296b9023224f55d_18572d87, took almost 3 hours to self-heal these connection refresh errors.

            2019-10-24T15:19:19.246-07:00 [INFO] feed_dcp: OnError, name: bucket-13-fts_less_words_4296b9023224f55d_18572d87: bucketName: bucket-13, bucketUUID: , err: could not connect to any serverURL: []string{"http://127.0.0.1:8091"}
             
            2019-10-24T18:19:40.698-07:00 [INFO] feed_dcp: OnError, name: bucket-13-fts_less_words_4296b9023224f55d_18572d87: bucketName: bucket-13, bucketUUID: , err: could not connect to any serverURL: []string{"http://127.0.0.1:8091"} 

             

            Perhaps this hints that something is wrong with the refreshWorker/configs part of cbdatasource.

            This needs to be checked further...

            Today, we don't do anything explicit on the cbft side for these server connect errors; ref: https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L781
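
            To illustrate, here is a minimal, simplified sketch under the assumption that the feed reports such errors through an OnError-style callback. This is not the actual cbft receiver code; the errorReceiver/loggingReceiver names are hypothetical. The point is that a receiver which only logs the error leaves all retry/recovery to cbdatasource's own refresh loop, matching the behaviour described above.

            package main

            import (
            	"errors"
            	"log"
            )

            // errorReceiver is a hypothetical stand-in for just the error-reporting part
            // of a cbdatasource-style receiver; the real interface carries more methods.
            type errorReceiver interface {
            	OnError(err error)
            }

            // loggingReceiver only records the error; it does not force a reconnect or
            // change any feed configuration, leaving recovery to the library's refresh loop.
            type loggingReceiver struct {
            	name string
            }

            func (r *loggingReceiver) OnError(err error) {
            	log.Printf("feed_dcp: OnError, name: %s, err: %v", r.name, err)
            }

            func main() {
            	var r errorReceiver = &loggingReceiver{name: "bucket-19-fts_less_words_2d5c87fcd687c27f_f4e0a48a"}
            	r.OnError(errors.New(`could not connect to any serverURL: []string{"http://127.0.0.1:8091"}`))
            }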

            steve Steve Yen added a comment -

            No a-ha, just some thoughts / notes...

            Perhaps swap rebalance is part of the factor, compared to the other clean rebalance-out and rebalance-in?

            Looks like this test has been failing for a while – it last worked in May 2019? http://perf.jenkins.couchbase.com/job/arke-multi-bucket/

            One thought is to compare with some other failing runs? And, to compare with the last successful run? That is, are the clues or potential fingerprints that you're seeing in the latest logs also showing up in the older runs?

            > This error occurred 422 times and blocked the partition build for about 1 hour 15 min and then it self healed over refreshworkers in cbdatasource.
            > Another partition bucket-13-fts_less_words_4296b9023224f55d_18572d8 took almost 3 hours to self heal these connection refresh errors.

            Wondering if stats error counters during those timeframes are showing which codepaths in refreshworkers are being taken? Like, is it perhaps always going through the connection-error codepath... https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L679 ...or some other codepath?

            The loop is also in an exponential-backoff outer loop – wondering what the sleep time of that exponential backoff is? Could it be we're sleeping for many hours at a time?

            As mentioned, these aren't a-ha's, but more hmmm's, in the hopes of sparking more ideas.

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            Yes Steve Yen, it is always going through the connection error code path and the exponential loop for refreshCluster().

            https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L679

            https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L781

            And the stats confirmed it.

            Stats snippet for the above bucket-13-fts_less_words_4296b9023224f55d_18572d87 partition:

            "bucketDataSourceStats": {
                "TotStart": 1,
                "TotKick": 0,
                "TotKickDeduped": 0,
                "TotKickOk": 0,
                "TotRefreshCluster": 974,
                "TotRefreshClusterConnectBucket": 974,
                "TotRefreshClusterConnectBucketErr": 973,
                "TotRefreshClusterConnectBucketOk": 0,
                "TotRefreshClusterBucketUUIDErr": 0,
                "TotRefreshClusterVBMNilErr": 0,
                "TotRefreshClusterKickWorkers": 0,
                "TotRefreshClusterKickWorkersClosed": 0,
                "TotRefreshClusterKickWorkersStopped": 0,
                "TotRefreshClusterKickWorkersOk": 0,
                "TotRefreshClusterStopped": 0,
                "TotRefreshClusterAwokenClosed": 0,
                "TotRefreshClusterAwokenStopped": 0,
                "TotRefreshClusterAwokenRestart": 0,
                "TotRefreshClusterAwoken": 0,
                "TotRefreshClusterAllServerURLsConnectBucketErr": 973,
                "TotRefreshClusterDone": 0,
                "TotRefreshWorkersStarted": 1,
                "TotRefreshWorkers": 0,
                "TotRefreshWorkersClusterChKicks": 0,
                "TotRefreshWorkersSecurityUpdates": 0,
                "TotRefreshWorkersVBMNilErr": 0,
                "TotRefreshWorkersVBucketIDErr": 0,
                "TotRefreshWorkersServerIdxsErr": 0,
                "TotRefreshWorkersMasterIdxErr": 0,
                "TotRefreshWorkersMasterServerErr": 0,
                "TotRefreshWorkersRemoveWorker": 0,
                "TotRefreshWorkersAddWorker": 0,
                "TotRefreshWorkersKickWorker": 0,
                "TotRefreshWorkersCloseWorker": 0,
                "TotRefreshWorkersLoop": 0,
                "TotRefreshWorkersLoopDone": 0,
                "TotRefreshWorkersDone": 0, 

             

            > The loop is also in an exponential-backoff outer loop – wondering what the sleep time of that exponential backoff is? Could it be we're sleeping for many hours at a time?

            The sleep time is capped in code (ref: https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L2220), and we can also see that the retry happens at around a ~10 sec interval, like below.

            2019-10-24T18:18:23.001-07:00 [INFO] feed_dcp: OnError, name: bucket-13-fts_less_words_4296b9023224f55d_18572d87: bucketName: bucket-13, bucketUUID: , err: invalid character '<' looking for beginning of value
             
            2019-10-24T18:18:34.112-07:00 [INFO] feed_dcp: OnError, name: bucket-13-fts_less_words_4296b9023224f55d_18572d87: bucketName: bucket-13, bucketUUID: , err: invalid character '<' looking for beginning of value
             
            2019-10-24T18:18:45.179-07:00 [INFO] feed_dcp: OnError, name: bucket-13-fts_less_words_4296b9023224f55d_18572d87: bucketName: bucket-13, bucketUUID: , err: invalid character '<' looking for beginning of value
             
            2019-10-24T18:18:56.393-07:00 [INFO] feed_dcp: OnError, name: bucket-13-fts_less_words_4296b9023224f55d_18572d87: bucketName: bucket-13, bucketUUID: , err: invalid character '<' looking for beginning of value 
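
            To make the retry cadence concrete, here is a minimal, self-contained sketch of a capped exponential-backoff loop. This is an assumption-level illustration of the pattern, not the cbdatasource implementation: the sleep grows by a factor but is clamped to a maximum, which is why the retries above settle at a roughly fixed ~10 sec interval rather than sleeping for hours.

            package main

            import (
            	"errors"
            	"fmt"
            	"time"
            )

            // retryWithBackoff retries attempt() until it succeeds, sleeping between tries.
            // The sleep grows by 'factor' each time but is clamped to maxSleep, so retries
            // keep happening at a bounded interval instead of stretching out indefinitely.
            func retryWithBackoff(attempt func() error, startSleep, maxSleep time.Duration, factor float64) {
            	sleep := startSleep
            	for {
            		if err := attempt(); err == nil {
            			return // success - stop retrying
            		}
            		time.Sleep(sleep)
            		sleep = time.Duration(float64(sleep) * factor)
            		if sleep > maxSleep {
            			sleep = maxSleep // cap reached: roughly fixed retry cadence from here on
            		}
            	}
            }

            func main() {
            	tries := 0
            	retryWithBackoff(func() error {
            		tries++
            		fmt.Println("connect attempt", tries)
            		if tries < 5 {
            			return errors.New("could not connect to any serverURL")
            		}
            		return nil
            	}, 500*time.Millisecond, 10*time.Second, 2.0)
            }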

             

            From these temporary errors, one thing is confirmed: we are having difficulty connecting to the ns_server endpoint (http://127.0.0.1:8091), and the above errors mostly confirm that we are getting a non-JSON response from that endpoint, which resolves on its own upon retry without any config/endpoint changes from cbft.
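
            As a minimal sketch (an illustration, not go-couchbase code) of why a non-JSON/HTML response surfaces as exactly that message: Go's encoding/json reports the first unexpected byte of the body.

            package main

            import (
            	"encoding/json"
            	"fmt"
            )

            func main() {
            	// A hypothetical HTML body, standing in for the redirect page returned
            	// instead of the expected /pools JSON.
            	htmlBody := []byte("<html><head><title>301 Moved Permanently</title></head></html>")

            	var pools map[string]interface{}
            	err := json.Unmarshal(htmlBody, &pools)
            	fmt.Println(err) // invalid character '<' looking for beginning of value
            }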

            Against these timestamps, we can see 301 HTTP response codes (permanent URL redirection) in the ns_server internal access logs (no other error codes/abnormal traces noted so far).

            127.0.0.1 - - [24/Oct/2019:18:18:22 -0700] "GET / HTTP/1.1" 301 236 - Go-http-client/1.1 
             
            127.0.0.1 - - [24/Oct/2019:18:18:34 -0700] "GET / HTTP/1.1" 301 236 - Go-http-client/1.1
             
            127.0.0.1 - - [24/Oct/2019:18:18:45 -0700] "GET / HTTP/1.1" 301 236 - Go-http-client/1.1
             
            127.0.0.1 - - [24/Oct/2019:18:18:56 -0700] "GET / HTTP/1.1" 301 236 - Go-http-client/1.1

            Tagging ns_server team as they would be able to interpret this better or have more insights from these errors.

            Hi Dave Finlay, can the ns_server team please help us decipher these ns_server access errors here? (This is for node 172.23.97.15, which is getting swapped in.)

             

             

            dfinlay Dave Finlay added a comment -

            Timofey Barmin - would you mind taking a look into this issue when you have a chance? Thanks!


            Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment - - edited

            > Against these timestamps, we can see 301 HTTP response codes (permanent URL redirection) in the ns_server internal access logs (no other error codes/abnormal traces noted so far).

            The resource that is being requested is '/', which redirects to /ui/index.html. There's not going to be any JSON there.
            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            Thanks Aliaksey Artamonau for the quick response,

            A few follow-up questions here would be:

            • Why, in this case, is there no client/caller information following the host IP, e.g. "127.0.0.1 - <missing here??> - "?
            • If we don't see an entry for an HTTP REST call from the client service in the HTTP internal access log, does that confirm that the request hasn't even reached ns_server's web server?

             

            Though I am yet to connect all the dots here, it looks like other clients like GSI also experience similar 301 redirects in their ns_server interactions and receive an HTML response to their REST calls; ref: MB-36464. Unlike FTS's case, GSI's case is marked with a 500 server error in the preceding REST calls in the HTTP internal logs.

            Nevertheless, both FTS's and GSI's calls are experiencing a similar 301 redirect from ns_server.

            A quick-glance snippet from ns_server's internal HTTP logs for the GSI case is added here.

            127.0.0.1 - - [13/Oct/2019:17:57:23 -0700] "GET /pools/default?uuid=3c530c1043b30b777ea2c873664e1736 HTTP/1.1" 500 44 - Go-http-client/1.1
            127.0.0.1 - - [13/Oct/2019:17:57:23 -0700] "GET / HTTP/1.1" 301 236 - Go-http-client/1.1
            |
            |
            127.0.0.1 - - [13/Oct/2019:18:32:00 -0700] "GET /pools/default?uuid=3c530c1043b30b777ea2c873664e1736 HTTP/1.1" 500 44 - Go-http-client/1.1
            127.0.0.1 - - [13/Oct/2019:18:32:00 -0700] "GET / HTTP/1.1" 301 236 - Go-http-client/1.1

             

            dfinlay Dave Finlay added a comment -

            Sreekanth Sivasankaran:

            > Why, in this case, is there no client/caller information following the host IP, e.g. "127.0.0.1 - <missing here??> - "?

            It means the request is anonymous - it didn't authenticate. Any request against "/" won't authenticate - it will just be immediately redirected.

            > If we don't see an entry for an HTTP REST call from the client service in the HTTP internal access log, does that confirm that the request hasn't even reached ns_server's web server?

            Basically, yes. The trace is written after the request is finished, so in theory it's possible the request is stuck spinning inside ns_server, but this would be very odd and I can't recall us seeing cases like this. I think it's reasonable to assume that if it's not in the access log, ns_server didn't get it.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4821 contains go-couchbase commit b1b83d4 with commit message:
            MB-34931 - fts swap rebalance failures

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-1044 contains go-couchbase commit b1b83d4 with commit message:
            MB-34931 - fts swap rebalance failures

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Mahesh Mandhare, can you please re-test this on MH-4841 or later? We have some improved error handling and logging added.

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            As the perf team (Mahesh Mandhare) is yet to confirm the reproducibility in the latest 6.5.0 builds, we are yet to work on/prioritise this for 6.5.1. The scope and gravity of the fix can only be assessed after analysing the next reproduction's logs.
            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Build 6.5.0-4959

            Observed swap rebalance failure in our last run - http://perf.jenkins.couchbase.com/job/arke-multi-bucket/345

            Logs:

            Rebalance-in node:
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.15.zip

            Rebalance-out node:
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.96.20.zip

            Other nodes:
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.96.16.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.96.17.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.96.23.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.12.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.13.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.14.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.177.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.19.zip
            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-345/172.23.97.20.zip
            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            Summarising the test steps here for clarity,

            Step 1 - Rebalance in FTS node.

            15:43:33 2019-12-26T02:13:33 [INFO] Rebalancing in fts node, STARTING 
            15:43:33 2019-12-26T02:13:33 [INFO] Adding 172.23.97.15 as fts 
            15:43:33 2019-12-26T02:13:33 [INFO] Adding new node: 172.23.97.15 

            Step 2 - Rebalance out FTS node.

            16:49:55 2019-12-26T03:19:55 [INFO] Rebalancing in fts node, COMPLETED.
            16:49:55 2019-12-26T03:19:55 [INFO] Rebalance time: 66.3618005434672 min
            16:59:55 2019-12-26T03:29:55 [INFO] Rebalancing out fts node
            16:59:55 2019-12-26T03:29:55 [INFO] ejected_nodes ['172.23.97.15'] 

            Step 3 - Swap rebalance FTS nodes.

            21:59:22 2019-12-27T08:29:23 [INFO] Rebalancing out fts node, COMPLETED.
            21:59:22 2019-12-27T08:29:23 [INFO] Rebalance time: 1739.46331512928 min
            22:09:23 2019-12-27T08:39:23 [INFO] Swapping in fts node, STARTING
            22:09:23 2019-12-27T08:39:23 [INFO] ejected_nodes ['172.23.96.20']
            22:09:23 2019-12-27T08:39:23 [INFO] Adding 172.23.97.15 as fts
            22:09:23 2019-12-27T08:39:23 [INFO] Adding new node: 172.23.97.15 

            Step 4 - Rebalance in FTS node.

            06:02:24 2019-12-27T16:32:24 [INFO] Rebalance completed
            06:12:24 2019-12-27T16:42:24 [INFO] Adding 172.23.96.20 as fts
            06:12:24 2019-12-27T16:42:24 [INFO] Adding new node: 172.23.96.20
            06:12:26 2019-12-27T16:42:26 [WARNING] ["Prepare join failed. Node is already part of cluster."] 

            At this time, as per the QE checks, the rebalance operation is completed, but the next "Prepare join" operation failed for the cluster. This needs to be checked further.

             

            Checking the latest FTS logs attached, a few things noticed during the really slow rebalance progress are:

            1. There are heavy app_herder interventions slowing ingestion during the rebalance-out operation on FTS node 20.
            2. There are a lot of connection-timeout-exceeded errors (18055 times) in the FTS logs right from the beginning of the slow swap rebalance operation (4082 times during the swap rebalance itself).

            ref

            2019-12-27T08:51:09.715-08:00 [INFO] feed_dcp: OnError, name: bucket-1-fts_less_words_5408fdd306d70c44_13aa53f3: bucketName: bucket-1, bucketUUID: , err: connectBucket failed for server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-1, err: Get http://127.0.0.1:8091/pools/default?uuid=e8c82739628dbf57add5acdaf2e204af: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

            This most likely added to the slowness of the overall rebalance. Checking this further...
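
            For context, a small self-contained sketch (assumption-level, using a stand-in test server rather than ns_server) showing how that net/http error arises whenever the endpoint responds more slowly than the client's configured timeout:

            package main

            import (
            	"fmt"
            	"net/http"
            	"net/http/httptest"
            	"time"
            )

            func main() {
            	// Hypothetical slow endpoint standing in for a busy ns_server REST port.
            	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            		time.Sleep(2 * time.Second) // respond slower than the client allows
            	}))
            	defer slow.Close()

            	client := &http.Client{Timeout: 500 * time.Millisecond}
            	_, err := client.Get(slow.URL + "/pools/default")

            	// err mentions "Client.Timeout exceeded while awaiting headers",
            	// as seen in the feed_dcp log line above.
            	fmt.Println(err)
            }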

            2019-12-27T16:42:26.962-08:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.14:11210, err: Unable to find given hostport in cbauth database: `172.23.97.14:11210'
            2019-12-27T16:42:26.962-08:00 [INFO] feed_dcp: OnError, name: bucket-18-fts_more_words_2fb6ca18f9ddf96b_aa574717: bucketName: bucket-18, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.14:11210, err: Unable to find given hostport in cbauth database: `172.23.97.14:11210'
            2019-12-27T16:42:26.962-08:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.14:11210, err: Unable to find given hostport in cbauth database: `172.23.97.14:11210'
            2019-12-27T16:42:26.988-08:00 [INFO] feed_dcp: OnError, name: bucket-18-fts_more_words_2fb6ca18f9ddf96b_f4e0a48a: bucketName: bucket-18, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.14:11210, err: Unable to find given hostport in cbauth database: `172.23.97.14:11210'
            2019-12-27T16:42:26.988-08:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.14:11210, err: Unable to find given hostport in cbauth database: `172.23.97.14:11210' 

            Around the slow swap rebalance timestamps, we could see "Connection down" messages in the ns_server debug logs, like below.

            [ns_server:debug,2019-12-27T08:39:27.501-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Connection down: {con,#Ref<0.2846215055.861405203.147736>,
                                           inet_tcp_dist,<0.9202.441>,
                                           #Ref<0.2846215055.862191634.96055>}
            [error_logger:info,2019-12-27T08:39:27.501-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
            =========================INFO REPORT=========================
            {net_kernel,{'EXIT',<0.9202.441>,shutdown}}
            [ns_server:debug,2019-12-27T08:39:27.621-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Connection down: {con,#Ref<0.2846215055.861405203.147585>,
                                           inet_tcp_dist,<0.6626.441>,
                                           #Ref<0.2846215055.861405203.147587>}
            [error_logger:error,2019-12-27T08:39:27.621-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]** Connection attempt from disallowed node 'ns_1@172.23.97.20' ** [error_logger:info,2019-12-27T08:39:27.622-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
            =========================INFO REPORT=========================
            {net_kernel,{'EXIT',<0.6626.441>,{recv_challenge_reply_failed,bad_cookie}}}
            [error_logger:info,2019-12-27T08:39:27.622-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
            =========================INFO REPORT=========================
            {net_kernel,{net_kernel,913,nodedown,'ns_1@172.23.97.20'}}
            [ns_server:debug,2019-12-27T08:39:27.622-08:00,ns_1@172.23.97.15:net_kernel<0.2863.0>:cb_dist:info_msg:754]cb_dist: Setting up new connection to 'ns_1@172.23.97.20' using inet_tcp_dist
            [error_logger:info,2019-12-27T08:39:27.622-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
            =========================INFO REPORT=========================
            {net_kernel,{connect,normal,'ns_1@172.23.97.20'}}
            [ns_server:debug,2019-12-27T08:39:27.622-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Added connection {con,#Ref<0.2846215055.866648072.133803>,
                                           inet_tcp_dist,undefined,undefined}
            [ns_server:debug,2019-12-27T08:39:27.622-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Updated connection: {con,#Ref<0.2846215055.866648072.133803>,
                                              inet_tcp_dist,<0.9415.441>,
                                              #Ref<0.2846215055.866648072.133804>}
            [ns_server:debug,2019-12-27T08:39:27.633-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Accepted new connection from <0.2864.0>: {con,
                                                               #Ref<0.2846215055.861405203.147745>,
                                                               inet_tcp_dist,undefined,
                                                               undefined}
            [ns_server:debug,2019-12-27T08:39:27.634-08:00,ns_1@172.23.97.15:net_kernel<0.2863.0>:cb_dist:info_msg:754]cb_dist: Accepting connection from acceptor <0.2864.0> using module inet_tcp_dist
            [ns_server:debug,2019-12-27T08:39:27.634-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Updated connection: {con,#Ref<0.2846215055.861405203.147745>,
                                              inet_tcp_dist,<0.7018.441>,
                                              #Ref<0.2846215055.861405203.147747>}
            [ns_server:debug,2019-12-27T08:39:27.634-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Connection down: {con,#Ref<0.2846215055.866648072.133803>,
                                           inet_tcp_dist,<0.9415.441>,
                                           #Ref<0.2846215055.866648072.133804>}
            [ns_server:debug,2019-12-27T08:39:28.257-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Accepted new connection from <0.2864.0>: {con, 

            And the same connection-down traces appear during the final rebalance-in operation as well.

            [ns_server:debug,2019-12-27T16:44:11.547-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Connection down: {con,#Ref<0.2846215055.876347393.200158>,
                                           inet_tcp_dist,<0.20026.492>,
                                           #Ref<0.2846215055.865075213.111559>}
            [ns_server:debug,2019-12-27T16:44:12.087-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Accepted new connection from <0.2864.0>: {con,
                                                               #Ref<0.2846215055.864813070.245151>,
                                                               inet_tcp_dist,undefined,
                                                               undefined}
            [ns_server:debug,2019-12-27T16:44:12.087-08:00,ns_1@172.23.97.15:net_kernel<0.2863.0>:cb_dist:info_msg:754]cb_dist: Accepting connection from acceptor <0.2864.0> using module inet_tcp_dist
            [ns_server:debug,2019-12-27T16:44:12.087-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Updated connection: {con,#Ref<0.2846215055.864813070.245151>,
                                              inet_tcp_dist,<0.20188.492>,
                                              #Ref<0.2846215055.865337356.205878>}
            [ns_server:debug,2019-12-27T16:44:16.144-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Connection down: {con,#Ref<0.2846215055.864813070.245151>,
                                           inet_tcp_dist,<0.20188.492>,
                                           #Ref<0.2846215055.865337356.205878>}
            [error_logger:error,2019-12-27T16:44:16.144-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]** Connection attempt from disallowed node 'ns_1@172.23.97.18' ** [error_logger:info,2019-12-27T16:44:16.144-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
            =========================INFO REPORT=========================
            {net_kernel,{'EXIT',<0.20188.492>,{recv_challenge_reply_failed,bad_cookie}}}
            [error_logger:info,2019-12-27T16:44:16.145-08:00,ns_1@172.23.97.15:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
            =========================INFO REPORT=========================
            {net_kernel,{net_kernel,913,nodedown,'ns_1@172.23.97.18'}}
            [ns_server:info,2019-12-27T16:44:16.201-08:00,ns_1@172.23.97.15:ns_config_rep<0.9456.441>:ns_config_rep:pull_one_node:352]Pulling config from: 'ns_1@172.23.97.19'
            [ns_server:debug,2019-12-27T16:44:17.081-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Accepted new connection from <0.2864.0>: {con,
                                                               #Ref<0.2846215055.876347393.200665>,
                                                               inet_tcp_dist,undefined,
                                                               undefined}
            [ns_server:debug,2019-12-27T16:44:17.081-08:00,ns_1@172.23.97.15:net_kernel<0.2863.0>:cb_dist:info_msg:754]cb_dist: Accepting connection from acceptor <0.2864.0> using module inet_tcp_dist
            [ns_server:debug,2019-12-27T16:44:17.081-08:00,ns_1@172.23.97.15:cb_dist<0.2860.0>:cb_dist:info_msg:754]cb_dist: Updated connection: {con,#Ref<0.2846215055.876347393.200665>,
                                              inet_tcp_dist,<0.21935.492>,
                                              #Ref<0.2846215055.870580226.149215>} 

             

            From couchbase.log, during the final rebalance-in operation, an interruption due to auto-failover looks suspicious too:

            {log_entry,{1577,464785,930372},
                        'ns_1@172.23.97.12',ns_orchestrator,0,
                        [<<"Starting rebalance, KeepNodes = ['ns_1@172.23.97.12','ns_1@172.23.97.13',\n                                 'ns_1@172.23.97.14','ns_1@172.23.97.19',\n                                 'ns_1@172.23.97.20','ns_1@172.23.97.177',\n                                 'ns_1@172.23.96.23','ns_1@172.23.97.15'], EjectNodes = ['ns_1@172.23.96.20'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 667f8fb9a7c0d2f26211ff9e47f922b6">>],
                        [],info,
                        {{2019,12,27},{8,39,45}}},
             
             
            {log_entry,{1577,493142,268948},
                        'ns_1@172.23.97.12',ns_orchestrator,0,
                        [<<"Rebalance interrupted due to auto-failover of nodes ['ns_1@172.23.97.14'].\nRebalance Operation Id = 667f8fb9a7c0d2f26211ff9e47f922b6">>],
                        [],info,
                        {{2019,12,27},{16,32,22}}},
             
             
            {log_entry,{1577,493648,365457},
                        'ns_1@172.23.97.14',ns_memcached,0,
                        [<<"Shutting down bucket \"bucket-1\" on 'ns_1@172.23.97.14' for deletion">>],
                        [],info,
                        {{2019,12,27},{16,40,48}}},
             {log_entry,{1577,493660,404583},
                        'ns_1@172.23.97.12',failover,0,
                        [<<"Failover couldn't complete on some nodes:\n['ns_1@172.23.97.14']">>],
                        [],crit,
                        {{2019,12,27},{16,41,0}}},
             {log_entry,{1577,493664,160525},
                        'ns_1@172.23.97.12',failover,0,
                        [<<"Deactivating failed over nodes ['ns_1@172.23.97.14']">>],
                        [],info,
                        {{2019,12,27},{16,41,4}}},
             {log_entry,{1577,493664,539493},
                        'ns_1@172.23.97.12',ns_orchestrator,0,
                        [<<"Failover completed successfully.\nRebalance Operation Id = f368c09eba51bafa9cde85a1db489cbd">>],
                        [],info,
                        {{2019,12,27},{16,41,4}}},
             {log_entry,{1577,493664,539672},
                        'ns_1@172.23.97.12',auto_failover,0,
                        [<<"Node ('ns_1@172.23.97.14') was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold. ">>],
                        [],info,
                        {{2019,12,27},{16,41,4}}},
             {log_entry,{1577,493746,364689},
                        'ns_1@172.23.97.12',ns_cluster,5,
                        [<<"Failed to add node 172.23.96.20:18091 to cluster. Prepare join failed. Node is already part of cluster.">>],
                        [],info,
                        {{2019,12,27},{16,42,26}}},
             

             


            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Tracking the swap rebalance operation further, it looks like the nsstats endpoint is still failing to respond: timeout errors are visible in the ns_server error logs of the incoming node .15.

            [ns_server:error,2019-12-27T16:28:58.703-08:00,ns_1@172.23.97.15:service_stats_collector-fts<0.15750.441>:rest_utils:get_json:62]Request to (fts) api/nsstats failed: {error,timeout}
            [ns_server:error,2019-12-27T16:30:04.630-08:00,ns_1@172.23.97.15:service_stats_collector-fts<0.15750.441>:rest_utils:get_json:62]Request to (fts) api/nsstats failed: {error,timeout}
            [ns_server:error,2019-12-27T16:31:45.631-08:00,ns_1@172.23.97.15:service_stats_collector-fts<0.15750.441>:rest_utils:get_json:62]Request to (fts) api/nsstats failed: {error,timeout}
            [ns_server:error,2019-12-27T16:32:22.265-08:00,ns_1@172.23.97.15:service_agent-fts<0.16568.441>:service_agent:handle_info:287]Rebalancer <25658.11575.1303> died unexpectedly: shutdown 

            and the ns_server debug logs show:

            [ns_server:error,2019-12-27T16:32:22.265-08:00,ns_1@172.23.97.15:service_agent-fts<0.16568.441>:service_agent:handle_info:287]Rebalancer <25658.11575.1303> died unexpectedly: shutdown
            [ns_server:info,2019-12-27T16:32:22.265-08:00,ns_1@172.23.97.15:rebalance_agent<0.7679.441>:rebalance_agent:handle_down:296]Rebalancer process <25658.25753.1300> died (reason shutdown).
            [ns_server:debug,2019-12-27T16:32:22.267-08:00,ns_1@172.23.97.15:service_agent-fts<0.16568.441>:service_agent:cleanup_service:553]Cleaning up stale tasks: 

            Most likely because of this, the rebalance operation gets cancelled here:

            2019-12-27T16:32:22.267-08:00 [INFO] ctl/manager: CancelTask, taskId: rebalance:7c34c23200aaab2ca0ef32ba2b20785e, taskRev: 
            2019-12-27T16:32:22.267-08:00 [INFO] ctl/manager: stop taskHandle, ctlTopology.Rev: 6 

            Since the previous swap rebalance operation (Step 3, as mentioned above) was cancelled, the final "Rebalance In" operation fails because node .20, which should already have been ejected, is still part of the cluster.
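            A quick way to confirm that state from outside the logs is to list node membership from the pools API; a rough sketch along the same lines as the earlier one (cluster address and credentials are again placeholder assumptions):

// membership_check.go: print each node's clusterMembership from
// /pools/default to verify whether the to-be-ejected node (.20) is still
// registered in the cluster after the cancelled swap rebalance.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type poolsDefault struct {
	Nodes []struct {
		Hostname          string `json:"hostname"`
		ClusterMembership string `json:"clusterMembership"` // e.g. "active", "inactiveAdded"
		Status            string `json:"status"`
	} `json:"nodes"`
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	req, err := http.NewRequest("GET", "http://172.23.97.12:8091/pools/default", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("Administrator", "password") // placeholder credentials

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	var pd poolsDefault
	if err := json.NewDecoder(resp.Body).Decode(&pd); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, n := range pd.Nodes {
		fmt.Printf("%-25s membership=%-15s status=%s\n", n.Hostname, n.ClusterMembership, n.Status)
	}
}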

            Checking further into why the nsstats endpoint is still timing out. Tagging Abhi Dangeti too, as he also worked on changes to this endpoint some time back.
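            While digging into that, a simple way to separate "FTS is slow to serve nsstats" from "ns_server gives up too early" is to hit the endpoint directly with an explicit client timeout, roughly mirroring what rest_utils:get_json does. A minimal sketch follows; the node address, FTS port (8094), credentials and the 30s timeout are placeholder assumptions.

// nsstats_probe.go: time a direct GET of the FTS nsstats endpoint on the
// incoming node, with a hard client-side timeout, to see whether the endpoint
// itself is slow under this bucket/index density.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 30 * time.Second} // assumed timeout, not ns_server's actual value

	req, err := http.NewRequest("GET", "http://172.23.97.15:8094/api/nsstats", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("Administrator", "password") // placeholder credentials

	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		fmt.Printf("nsstats request failed after %v: %v\n", time.Since(start), err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("nsstats responded in %v: status %s, %d bytes\n",
		time.Since(start), resp.Status, len(body))
}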

             

             


            build-team Couchbase Build Team added a comment -

            Build sync_gateway-3.0.0-2 contains go-couchbase commit b1b83d4 with commit message:
            MB-34931 - fts swap rebalance failures

            lynn.straus Lynn Straus added a comment -

            Per review in the Cheshire Cat Ops meeting and per email updates, all remaining majors/minors/trivials are being moved out of Cheshire Cat on March 10. The purpose is to allow teams to focus on Critical bugs for the remainder of the release. Teams can "pull back" deferred bugs as schedule and bandwidth allow.

            Changing target fix version to CheshireCat.next and adding "deferred-from-Cheshire-Cat" label.

            jyotsna.nayak Jyotsna Nayak added a comment - - edited

            Have rerun this test on CC build 7.0.0-4554.
            Swap rebalance completed. Time taken for swap rebalance: 468.9639 min (~7.81 hrs).
            Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/63/consoleFull


            keshav Keshav Murthy added a comment -

            Anything pending here, Jyotsna Nayak?


            jyotsna.nayak Jyotsna Nayak added a comment -

            No, Keshav Murthy.
            This issue is not seen in the recent runs. Closing the issue.


            People

              jyotsna.nayak Jyotsna Nayak
              mahesh.mandhare Mahesh Mandhare (Inactive)
              Votes: 0
              Watchers: 12


                Gerrit Reviews

                  There are no open Gerrit changes
