MB-46112: Rebalance failed with reason "agent died"


    Description

      Build: 7.0.0-5095

      Scenario:

      • Cluster with index, eventing, fts, and backup services actively running
      • 3 FTS indexes created (2 with indexPartition=1 and 1 with indexPartition=6)
      • Rebalance in an extra FTS node along with a few other nodes running other services

      Observation:

      FTS rebalance failed because the FTS worker died:

       

      Rebalance exited with reason {service_rebalance_failed,fts,
      {worker_died,
      {'EXIT',<0.12316.37>,
      {rebalance_failed,inactivity_timeout}}}}.
      Rebalance Operation Id = b5ce1e972169e0377441008a937cb498
      ..
      ..
      Starting rebalance, KeepNodes = ['ns_1@172.23.105.175','ns_1@172.23.106.233',
      'ns_1@172.23.106.236','ns_1@172.23.106.238',
      'ns_1@172.23.106.250','ns_1@172.23.106.251',
      'ns_1@172.23.107.43','ns_1@172.23.107.44',
      'ns_1@172.23.107.45','ns_1@172.23.107.58',
      'ns_1@172.23.121.74','ns_1@172.23.121.78'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = b5ce1e972169e0377441008a937cb49

      Snapshot: https://supportal.couchbase.com/snapshot/ff8cc1ad0a5b1fdc2c01fdd52cec4165::0

       

          Activity

            abhinav Abhi Dangeti added a comment -

            Ashwin Govindarajulu In tests conducted on multiple nodes, it's necessary to point out which nodes host the service of interest!

            abhinav Abhi Dangeti added a comment -

            Per the screenshot, it looks like 172.23.106.251 and 172.23.107.58 are the nodes hosting FTS in this cluster.

            abhinav Abhi Dangeti added a comment -

            It seems the rebalancer has been stuck getting a stats sample for one of the pindexes, `hotels_b48a9b6f1e7e824d_54820232`, for about 10 minutes (a minimal sketch of this failure mode follows the log excerpt):

            2021-05-04T11:15:25.323-07:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex hotels_b48a9b6f1e7e824d_54820232, formerPrimaryNode f2a3da64a4d6793b6ca03605e6f601fe
            2021-05-04T11:15:26.321-07:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex hotels_b48a9b6f1e7e824d_54820232, formerPrimaryNode f2a3da64a4d6793b6ca03605e6f601fe
            2021-05-04T11:15:26.323-07:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex hotels_b48a9b6f1e7e824d_54820232, formerPrimaryNode f2a3da64a4d6793b6ca03605e6f601fe 
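
            To illustrate the failure shape: the monitor keeps polling for a fresh stats sample and, when none arrives within the inactivity window, gives up with the timeout that ns_server surfaces as {rebalance_failed,inactivity_timeout}. A minimal Go sketch under assumed names and intervals (awaitStatsSample is hypothetical), not cbgt's actual implementation:

            package main

            import (
                "errors"
                "fmt"
                "time"
            )

            var errInactivityTimeout = errors.New("rebalance_failed,inactivity_timeout")

            // awaitStatsSample models a monitor like waitAssignPIndexDone: it keeps
            // asking for a fresh stats sample and gives up once none has arrived
            // within the inactivity window. Names and intervals are illustrative.
            func awaitStatsSample(sample func() (bool, error), window time.Duration) error {
                deadline := time.Now().Add(window)
                for time.Now().Before(deadline) {
                    ok, err := sample()
                    if err != nil {
                        return err
                    }
                    if ok {
                        return nil // got a sample; progress observed
                    }
                    fmt.Println("rebalance: awaiting a stats sample grab ...")
                    time.Sleep(time.Second)
                }
                // No sample for the whole window: the worker gives up, and
                // ns_server reports {rebalance_failed,inactivity_timeout}.
                return errInactivityTimeout
            }

            func main() {
                // A source that never yields a sample, as with the stuck pindex.
                neverReady := func() (bool, error) { return false, nil }
                fmt.Println("err:", awaitStatsSample(neverReady, 3*time.Second))
            }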

            abhinav Abhi Dangeti added a comment -

            This ticket looks quite similar to MB-44725 (which has been marked as "cannot reproduce").

            The last commit for MB-44725 was to increase logging in this area:

            commit 82a3dd83e5182a6dc682c17ad661342c68b779d5
            Author: Sreekanth Sivasankaran <sreekanth.sivasankaran@couchbase.com>
            Date:   Wed Apr 7 13:37:03 2021 +0530
             
                MB-44725 - FTS rebalance_failed,inactivity_timeout
             
                Adding more logs and err handling in the
                pindex restart paths.
             
                Change-Id: Iaff4419685623cd258ccca71d973b92263bce8d7
                Reviewed-on: http://review.couchbase.org/c/cbgt/+/150572
                Well-Formed: Build Bot <build@couchbase.com>
                Reviewed-by: Abhinav Dangeti <abhinav@couchbase.com>
                Tested-by: Sreekanth Sivasankaran <sreekanth.sivasankaran@couchbase.com> 

            The log messages now do confirm that rollback is indeed restarting the pindexes - so that's good.

            2021-05-04T11:05:13.444-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_f4e0a48a
            2021-05-04T11:05:13.448-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
            2021-05-04T11:05:13.448-07:00 [INFO] pindex: restartPIndex starts for pindex: flights_24af63d4d95b77ef_4c1c5584
            2021-05-04T11:05:13.483-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:13.483-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:13.486-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_18572d87
            2021-05-04T11:05:17.421-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:17.476-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:17.853-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:18.209-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:18.250-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:18.721-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:18.863-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:19.110-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:19.246-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:19.367-07:00 [INFO] pindex: restartPIndex starts for pindex: flights_24af63d4d95b77ef_4c1c5584
            2021-05-04T11:05:19.370-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_18572d87
            2021-05-04T11:05:19.413-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
            2021-05-04T11:05:19.414-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
            2021-05-04T11:05:19.610-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:20.650-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_f4e0a48a
            2021-05-04T11:05:20.680-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
            2021-05-04T11:05:21.201-07:00 [INFO] pindex: restartPIndex starts for pindex: flights_24af63d4d95b77ef_4c1c5584
            2021-05-04T11:05:21.753-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
            2021-05-04T11:05:22.277-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584 

            As for the rebalancer being stuck between 2021-05-04T11:05:25.400-07:00 and 2021-05-04T11:15:26.323-07:00 while obtaining the stats sample for pindex `hotels_b48a9b6f1e7e824d_54820232`: Sreekanth Sivasankaran, would you take a look into the logs from node 172.23.106.251?


            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Upon taking a closer look at the erring partition hotels_b48a9b6f1e7e824d_54820232 at timestamp 2021-05-04T11:05:20: the partition got rolled back and closed. All subsequent rollback requests were failing with the "bleve: BleveDest already closed" error, as seen in the logs (a toy sketch of this closed-destination behavior follows the excerpt).

            2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_54820232] Received rollback, for vb: 442, seqno requested: 583
            2021-05-04T11:05:20.679-07:00 [INFO] bleve: dest rollback, partition: 442, key: "<ud></ud>", seq: 0, err: <nil>
            2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
            2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
            2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
            2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
             
            2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 233, seqno requested: 597
            2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_54820232] Received rollback, for vb: 443, seqno requested: 658
            2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 230, seqno requested: 607
             
            2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve_scorch_rollback: path: /opt/couchbase/var/lib/couchbase/data/@fts/hotels_b48a9b6f1e7e824d_54820232.pindex, totSnapshotExamined: 0
            2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 232, seqno requested: 571
            2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve_rollback: path: /opt/couchbase/var/lib/couchbase/data/@fts/hotels_b48a9b6f1e7e824d_54820232.pindex, wasClosed: true, wasPartial: false, err: <nil>
             
            2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 225, seqno requested: 535
            2021-05-04T11:05:20.680-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 229, seqno requested: 256
            2021-05-04T11:05:20.680-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_54820232] Received rollback, for vb: 441, seqno requested: 317
            2021-05-04T11:05:20.680-07:00 [WARN] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Rollback to seqno: 0, vbuuid: 0 for vb: 233, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1524 
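
            The error is consistent with a destination that rejects all operations once closed. A toy Go sketch of that guard, with hypothetical names (dest, Rollback) standing in for the real cbgt/bleve types:

            package main

            import (
                "errors"
                "fmt"
                "sync"
            )

            var errDestClosed = errors.New("bleve: BleveDest already closed")

            // dest is a toy stand-in for a pindex destination: once closed,
            // every later rollback is rejected with the error seen in the logs.
            type dest struct {
                mu     sync.Mutex
                closed bool
            }

            func (d *dest) Close() {
                d.mu.Lock()
                defer d.mu.Unlock()
                d.closed = true
            }

            func (d *dest) Rollback(vb uint16, seq uint64) error {
                d.mu.Lock()
                defer d.mu.Unlock()
                if d.closed {
                    return errDestClosed
                }
                // ... rewind the partition's state back to seq ...
                return nil
            }

            func main() {
                d := &dest{}
                fmt.Println(d.Rollback(442, 583)) // <nil>: dest still open
                d.Close()                         // handling the rollback closes the dest
                fmt.Println(d.Rollback(443, 658)) // bleve: BleveDest already closed
            }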

            And it subsequently tries to restart the partition.

            2021-05-04T11:05:20.680-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232 

            But the restart request for this partition never makes it into the janitor work loop, as the loop was busy processing other ongoing requests. This is confirmed by the logs too.

            The last successful restart attempt to reach the janitor was at:

            2021-05-04T11:05:20.633-07:00 [INFO] janitor: awakes, op: janitor_close_pindex, msg: api-ClosePIndex:hotels_b48a9b6f1e7e824d_6ddbfb54 
            2021-05-04T11:05:20.678-07:00 [INFO] janitor: awakes, op: kick, msg: restart-pindex
            2021-05-04T11:05:21.500-07:00 [INFO] janitor: awakes, op: kick, msg: restart-pindex

            So the bleve partition is closed, but the pindex is still registered with the manager due to the unsuccessful restart attempt (a sketch of how a busy work loop can drop such a kick follows).
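
            One plausible way a restart request can be missed: if wake-ups are delivered with a non-blocking send on a small buffered channel, kicks arriving while the loop is busy get dropped. This is an assumed mechanism sketched in Go (kick, kicks are hypothetical), not the actual cbgt janitor code:

            package main

            import (
                "fmt"
                "time"
            )

            func main() {
                // Wake-ups delivered with a non-blocking send on a 1-slot buffer:
                // kicks arriving while the loop is busy and the buffer is full
                // are simply lost, so a restart may never be processed.
                kicks := make(chan string, 1)

                kick := func(msg string) {
                    select {
                    case kicks <- msg: // delivered (or buffered)
                    default:
                        fmt.Println("kick lost:", msg) // loop busy, buffer full
                    }
                }

                go func() { // the janitor work loop
                    for msg := range kicks {
                        fmt.Println("janitor: awakes, op: kick, msg:", msg)
                        time.Sleep(500 * time.Millisecond) // busy with other requests
                    }
                }()

                kick("restart-pindex")             // picked up by the loop
                time.Sleep(100 * time.Millisecond) // let the loop get busy
                kick("restart-pindex")             // sits in the buffer
                kick("restart-pindex")             // lost: never reaches the loop
                time.Sleep(2 * time.Second)
            }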

            And the situation continues until the rebalance, which starts at:

            2021-05-04T11:05:25.284-07:00 [INFO] ctl/manager: StartTopologyChange, started 

            This causes the rebalance stats monitoring endpoint to fail later, and eventually the rebalance failure as well.

            Checking further ..


            These sorts of issues ought to self-heal once the janitor advances through its task list. But the janitor work loop is deadlocked here, as we can't see any further kick wake-up messages in the logs.

            The last janitor kick processed was at:

            2021-05-04T11:05:21.500-07:00 [INFO] janitor: awakes, op: kick, msg: restart-pindex
            2021-05-04T11:05:21.583-07:00 [INFO] janitor: pindexes to remove: 0
            2021-05-04T11:05:21.583-07:00 [INFO] janitor: pindexes to add: 0
            2021-05-04T11:05:21.583-07:00 [INFO] janitor: pindexes to restart: 0
            2021-05-04T11:05:21.583-07:00 [INFO] janitor: feeds to remove: 0
            2021-05-04T11:05:21.583-07:00 [INFO] janitor: feeds to add: 1
            2021-05-04T11:05:21.583-07:00 [INFO]   flights_24af63d4d95b77ef_4c1c5584

            A deadlocked janitor was blocking all further progress here.

            My analysis here is that the recent deadlock introduced with the fix for MB-45989 is causing this (a sketch of the self-kick deadlock pattern follows).
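
            The eventual fix title for MB-45989 ("Kicking the janitor from within the JanitorLoop is prohibited") suggests the shape of the deadlock: a synchronous kick issued from inside the loop's own handler waits on the loop itself. A toy Go sketch, with hypothetical names (janitor, kickSync):

            package main

            import "fmt"

            // janitor delivers synchronous kicks: the caller blocks until the
            // loop has picked up and finished handling the request.
            type janitor struct {
                kicks chan chan struct{}
            }

            func (j *janitor) kickSync(from string) {
                done := make(chan struct{})
                j.kicks <- done // blocks until the loop receives it
                <-done          // blocks until the loop finishes handling it
                fmt.Println("kick handled:", from)
            }

            func (j *janitor) loop() {
                for done := range j.kicks {
                    // If the handling below were to call j.kickSync, the send in
                    // kickSync could never be serviced: the loop's only goroutine
                    // is stuck right here, waiting on itself.
                    // j.kickSync("from-within-loop") // <- the prohibited pattern
                    close(done)
                }
            }

            func main() {
                j := &janitor{kicks: make(chan chan struct{})}
                go j.loop()
                j.kickSync("external-caller") // safe: issued from outside the loop
            }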

            Let me merge the latest fix and get it retested before more tickets pile up.

             


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Thanks Sreekanth..

            Just an update: able to hit the same issue while rebalancing out the FTS node

            +----------------+-------------+-----------------------+---------------+--------------+
            | Nodes          | Services    | Version               | CPU           | Status       |
            +----------------+-------------+-----------------------+---------------+--------------+
            | 172.23.107.58  | eventing    | 7.0.0-5095-enterprise | 68.0274805126 | Cluster node |
            | 172.23.107.44  | fts         | 7.0.0-5095-enterprise | 2.35664776307 | Cluster node |
            | 172.23.105.175 | kv          | 7.0.0-5095-enterprise | 83.7956961369 | Cluster node |
            | 172.23.107.45  | fts         | 7.0.0-5095-enterprise | 2.54600453743 | --- OUT ---> |
            | 172.23.107.47  | kv          | 7.0.0-5095-enterprise | 84.332764729  | Cluster node |
            | 172.23.107.84  | eventing    | 7.0.0-5095-enterprise | 59.6821976838 | Cluster node |
            | 172.23.121.78  | cbas        | 7.0.0-5095-enterprise | 35.1969676459 | Cluster node |
            | 172.23.106.250 | index, n1ql | 7.0.0-5095-enterprise | 18.1203515889 | Cluster node |
            | 172.23.107.88  | index, n1ql | 7.0.0-5095-enterprise | 3.36781029263 | Cluster node |
            | 172.23.107.43  | eventing    | 7.0.0-5095-enterprise | 63.9989329065 | Cluster node |
            | 172.23.106.236 | kv          | 7.0.0-5095-enterprise | 86.4183158712 | Cluster node |
            | 172.23.107.85  | index, n1ql | 7.0.0-5095-enterprise | 26.646389844  | Cluster node |
            | 172.23.106.251 | backup      | 7.0.0-5095-enterprise | 2.22697533971 | Cluster node |
            | 172.23.106.233 | kv          | 7.0.0-5095-enterprise | 87.605560608  | Cluster node |
            | 172.23.106.238 | index, n1ql | 7.0.0-5095-enterprise | 3.05016385178 | Cluster node |
            | 172.23.107.54  | kv          | 7.0.0-5095-enterprise | 85.4177545692 | Cluster node |
            | 172.23.107.78  | cbas        | 7.0.0-5095-enterprise | 33.6209220814 | Cluster node |
            | 172.23.121.74  | cbas        | 7.0.0-5095-enterprise | 38.1596571122 | Cluster node |
            | 172.23.107.107 | fts         | 7.0.0-5095-enterprise | 2.31650509883 | Cluster node |
            | 172.23.107.91  | kv          | 7.0.0-5095-enterprise | 73.7971478075 | Cluster node |
            +----------------+-------------+-----------------------+---------------+--------------+

             

            Snapshot: http://supportal.couchbase.com/snapshot/73a3f4bbca65dfdfd545c1c60852951b::0

            cbcollect logs:
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.105.175.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.233.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.236.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.238.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.250.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.251.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.107.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.43.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.44.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.45.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.47.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.54.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.58.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.78.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.84.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.85.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.88.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.91.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.121.74.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.121.78.zip

            abhinav Abhi Dangeti added a comment -

            Builds 7.0.0-5103+ have the fix for MB-45989 ..

            Build couchbase-server-7.0.0-5103 contains cbgt commit a8cb699 with commit message:
            MB-45989: Kicking the janitor from within the JanitorLoop is prohibited 


            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Awaiting the test results after the fixes.


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Still seeing the issue on 7.0.0-5103 during rebalance-out of the FTS node (172.23.107.91)

            +----------------+-------------+-----------------------+---------------+--------------+
            | Nodes          | Services    | Version               | CPU           | Status       |
            +----------------+-------------+-----------------------+---------------+--------------+
            | 172.23.107.58  | eventing    | 7.0.0-5103-enterprise | 66.2251655629 | Cluster node |
            | 172.23.107.44  | eventing    | 7.0.0-5103-enterprise | 67.2144182348 | Cluster node |
            | 172.23.105.175 | kv          | 7.0.0-5103-enterprise | 85.5087627518 | Cluster node |
            | 172.23.107.45  | index, n1ql | 7.0.0-5103-enterprise | 2.42401406682 | Cluster node |
            | 172.23.107.47  | index, n1ql | 7.0.0-5103-enterprise | 23.6120312707 | Cluster node |
            | 172.23.107.84  | kv          | 7.0.0-5103-enterprise | 83.7305699482 | Cluster node |
            | 172.23.121.78  | index, n1ql | 7.0.0-5103-enterprise | 20.5553413035 | Cluster node |
            | 172.23.106.250 | cbas        | 7.0.0-5103-enterprise | 33.8083075682 | Cluster node |
            | 172.23.107.88  | kv          | 7.0.0-5103-enterprise | 82.7855008445 | Cluster node |
            | 172.23.107.43  | cbas        | 7.0.0-5103-enterprise | 33.1234088168 | Cluster node |
            | 172.23.106.236 | kv          | 7.0.0-5103-enterprise | 86.6442345024 | Cluster node |
            | 172.23.107.85  | fts         | 7.0.0-5103-enterprise | 2.82627810577 | Cluster node |
            | 172.23.106.251 | backup      | 7.0.0-5103-enterprise | 2.52006018054 | Cluster node |
            | 172.23.106.233 | kv          | 7.0.0-5103-enterprise | 85.7810673938 | Cluster node |
            | 172.23.106.238 | cbas        | 7.0.0-5103-enterprise | 40.3115348462 | Cluster node |
            | 172.23.107.54  | index, n1ql | 7.0.0-5103-enterprise | 2.32499685811 | Cluster node |
            | 172.23.107.78  | kv          | 7.0.0-5103-enterprise | 85.9544749095 | Cluster node |
            | 172.23.121.74  | eventing    | 7.0.0-5103-enterprise | 70.3625576796 | Cluster node |
            | 172.23.107.107 | fts         | 7.0.0-5103-enterprise | 3.03916581892 | Cluster node |
            | 172.23.107.91  | fts         | 7.0.0-5103-enterprise | 16.6752114842 | --- OUT ---> |
            +----------------+-------------+-----------------------+---------------+--------------+

            Rebalance exited with reason {service_rebalance_failed,fts,
            {worker_died, {'EXIT',<0.25658.83>, {rebalance_failed,inactivity_timeout}}}}. 
            Rebalance Operation Id = 36814e67c68dbb5851868791a2c40c4f
            

            Snapshot: http://supportal.couchbase.com/snapshot/b8190367e0e24aa99eff3e294d23a2be::0

            Cbcollect logs:
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.233.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.236.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.238.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.250.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.251.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.107.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.43.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.44.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.45.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.47.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.54.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.58.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.78.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.84.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.85.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.88.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.91.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.121.74.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.121.78.zip
            https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.105.175.zip
             

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - edited

            Abhi Dangeti, the manager notification is making the feed establishment fall into a recursive infinite loop here. We need to revisit the manager-kick fix (reverting would also be an option; otherwise it may trigger a lot of other issues). A truncated sketch of the cycle follows.
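
            To make the cycle concrete: feed setup keeps failing against the already-closed destination, each failure unconditionally notifies the manager, and the resulting janitor kick re-attempts the same feed. A minimal Go sketch with hypothetical names (startFeed, notify), bounded to three iterations for demonstration:

            package main

            import (
                "errors"
                "fmt"
            )

            var errDestClosed = errors.New("bleve: BleveDest already closed")

            // startFeed models the erring path: feed setup hits the already-closed
            // destination, fails, and unconditionally notifies the manager, whose
            // janitor kick re-attempts the very same feed. Nothing in the cycle
            // reopens the destination, so it recurses indefinitely.
            func startFeed(attempt int, notify func()) error {
                err := errDestClosed // the destination never reopens
                fmt.Printf("attempt %d: StartGocbcoreDCPFeed, err: %v\n", attempt, err)
                notify() // unconditional notify -> janitor kick -> retry
                return err
            }

            func main() {
                for attempt := 1; attempt <= 3; attempt++ { // loop truncated at 3
                    startFeed(attempt, func() {
                        fmt.Println("janitor: awakes, op: kick, msg: gocbcore-feed")
                    })
                }
            }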


            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            The FTS nodes are 172.23.107.85, 172.23.107.91, and 172.23.107.107, and the error is seen on 172.23.107.91 (it's the same rollback followed by the feed-erring loop).

             2021-05-05T05:54:06.285-07:00 [WARN] feed_dcp_gocbcore: [a1_users_377bff4da3dcab77_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 1023, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
            2021-05-05T05:54:06.286-07:00 [INFO] feed_dcp_gocbcore: releaseAgent, ref count decremented for DCPagent (key: travel-sample:969c10b537adcc0a20800cf3d619a7c0, agent: 0xc008b7f500, ref count: 3, number of agents for key: 1)
            2021-05-05T05:54:06.286-07:00 [INFO] feed_dcp_gocbcore: Close, name: a4_users_562107192a6809c6_4c1c5584, notify manager
            2021-05-05T05:54:06.289-07:00 [INFO] feed_dcp_gocbcore: releaseAgent, ref count decremented for DCPagent (key: travel-sample:969c10b537adcc0a20800cf3d619a7c0, agent: 0xc008b7f500, ref count: 2, number of agents for key: 1)
            2021-05-05T05:54:06.289-07:00 [INFO] feed_dcp_gocbcore: Close, name: a1_users_377bff4da3dcab77_4c1c5584, notify manager
            2021-05-05T05:54:06.320-07:00 [INFO] janitor: pindexes to remove: 0
            2021-05-05T05:54:06.320-07:00 [INFO] janitor: pindexes to add: 0
            2021-05-05T05:54:06.320-07:00 [INFO] janitor: pindexes to restart: 0
            2021-05-05T05:54:06.320-07:00 [INFO] janitor: feeds to remove: 0
            2021-05-05T05:54:06.320-07:00 [INFO] janitor: feeds to add: 2
            2021-05-05T05:54:06.320-07:00 [INFO]   a1_users_377bff4da3dcab77_4c1c5584
            2021-05-05T05:54:06.320-07:00 [INFO]   a4_users_562107192a6809c6_4c1c5584
            2021-05-05T05:54:06.321-07:00 [WARN] janitor: JanitorOnce, err: janitor: JanitorOnce errors: 2, []string{"#0: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a1_users, err: bleve: BleveDest already closed", "#1: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a4_users, err: bleve: BleveDest already closed"} -- cbgt.(*Manager).JanitorLoop() at manager_janitor.go:97
            2021-05-05T05:54:06.321-07:00 [INFO] janitor: awakes, op: kick, msg: gocbcore-feed
            2021-05-05T05:54:06.332-07:00 [INFO] pindex: restartPIndex starts for pindex: a2_users_4389e9ba8dc93303_4c1c5584
            2021-05-05T05:54:06.332-07:00 [WARN] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 1023, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
            2021-05-05T05:54:06.402-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] DCP stream [118] for vb: 257, closed due to `state changed`, reconnecting ...
            2021-05-05T05:54:06.402-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] DCP stream [116] for vb: 257, closed due to `state changed`, reconnecting ...
            2021-05-05T05:54:06.410-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] DCP stream [118] for vb: 339, closed due to `state changed`, reconnecting ...
            2021-05-05T05:54:06.410-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] DCP stream [116] for vb: 339, closed due to `state changed`, reconnecting ...
            2021-05-05T05:54:06.410-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Received rollback, for vb: 257, seqno requested: 222457
            2021-05-05T05:54:06.410-07:00 [WARN] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 257, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
            2021-05-05T05:54:06.414-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Received rollback, for vb: 257, seqno requested: 234028
            2021-05-05T05:54:06.414-07:00 [WARN] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 257, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
            2021-05-05T05:54:06.438-07:00 [INFO] janitor: pindexes to remove: 0
            2021-05-05T05:54:06.438-07:00 [INFO] janitor: pindexes to add: 0
            2021-05-05T05:54:06.438-07:00 [INFO] janitor: pindexes to restart: 0
            2021-05-05T05:54:06.438-07:00 [INFO] janitor: feeds to remove: 0
            2021-05-05T05:54:06.438-07:00 [INFO] janitor: feeds to add: 2
            2021-05-05T05:54:06.438-07:00 [INFO]   a4_users_562107192a6809c6_4c1c5584
            2021-05-05T05:54:06.438-07:00 [INFO]   a1_users_377bff4da3dcab77_4c1c5584
            2021-05-05T05:54:06.439-07:00 [WARN] janitor: JanitorOnce, err: janitor: JanitorOnce errors: 2, []string{"#0: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a4_users, err: bleve: BleveDest already closed", "#1: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a1_users, err: bleve: BleveDest already closed"} -- cbgt.(*Manager).JanitorLoop() at manager_janitor.go:97
            2021-05-05T05:54:06.439-07:00 [INFO] janitor: awakes, op: kick, msg: gocbcore-feed
            2021-05-05T05:54:06.486-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Received rollback, for vb: 339, seqno requested: 201006
            2021-05-05T05:54:06.486-07:00 [WARN] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 339, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
            2021-05-05T05:54:06.487-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Received rollback, for vb: 339, seqno requested: 211545
            2021-05-05T05:54:06.487-07:00 [WARN] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 339, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
            2021-05-05T05:54:06.544-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] DCP stream [118] for vb: 1021, closed due to `state changed`, reconnecting ...
            2021-05-05T05:54:06.544-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] DCP stream [116] for vb: 1021, closed due to `state changed`, reconnecting ...
            2021-05-05T05:54:06.546-07:00 [INFO] janitor: pindexes to remove: 0
            2021-05-05T05:54:06.546-07:00 [INFO] janitor: pindexes to add: 0
            2021-05-05T05:54:06.546-07:00 [INFO] janitor: pindexes to restart: 0
            2021-05-05T05:54:06.547-07:00 [INFO] janitor: feeds to remove: 0
            2021-05-05T05:54:06.547-07:00 [INFO] janitor: feeds to add: 2
            2021-05-05T05:54:06.547-07:00 [INFO]   a4_users_562107192a6809c6_4c1c5584
            2021-05-05T05:54:06.547-07:00 [INFO]   a1_users_377bff4da3dcab77_4c1c5584
            2021-05-05T05:54:06.548-07:00 [WARN] janitor: JanitorOnce, err: janitor: Janit

            abhinav Abhi Dangeti added a comment -

            Interesting, let me take a look into this as well.

            abhinav Abhi Dangeti added a comment -

            Sreekanth Sivasankaran, accurate findings on this issue.

            I've uploaded a couple of changes for this:

            http://review.couchbase.org/c/cbgt/+/152939 - improves logging in the area
            http://review.couchbase.org/c/cbgt/+/152942 - conditionally notify the manager when feed initiation fails

            The idea is to not notify the manager always, but only on an agent setup failure - which is essentially the only situation in which the feed initiation should be re-attempted. This fix will ensure that MB-45989 doesn't occur as well (a sketch of the conditional notification follows).
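
            A minimal Go sketch of that decision, assuming sentinel errors and a hypothetical onFeedError hook (the actual cbgt predicate may differ):

            package main

            import (
                "errors"
                "fmt"
            )

            var (
                errAgentSetup = errors.New("gocbcore: agent setup failed")
                errDestClosed = errors.New("bleve: BleveDest already closed")
            )

            // onFeedError sketches the shape of the fix: re-kick the manager only
            // for agent setup failures, where a retry can actually succeed. For
            // errors like a closed destination, retrying just reproduces the loop.
            func onFeedError(err error, notify func()) {
                if errors.Is(err, errAgentSetup) {
                    notify() // transient: feed initiation should be re-attempted
                    return
                }
                fmt.Println("not notifying manager for:", err) // break the cycle
            }

            func main() {
                notify := func() { fmt.Println("notify manager: retry feed") }
                onFeedError(errAgentSetup, notify)
                onFeedError(errDestClosed, notify)
            }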


            abhinav Abhi Dangeti added a comment -

            Hoping the above fixes will address this issue. A build with them should become available soon.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5109 contains cbgt commit 52db00a with commit message:
            MB-46112: Notify manager only on agent error on feed initiation


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5109 contains cbgt commit fa82fd7 with commit message:
            MB-46112: Log feed name when StartGocbcoreDCPFeed fails


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Hitting MB-46154 while trying to validate the fix.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - edited

            Not seeing this issue in the latest run using 7.0.0-5112-enterprise.

            The test has run for ~8 hrs so far (and will continue to run for some more time).

            Closing this ticket.

