Couchbase Server
MB-36400

[high-bucket] FTS rebalance out in high bucket density test takes really long hours (28hrs+)


Details

    • Triage: Untriaged
    • Is this a Regression?: No

    Description

      Build 6.5.0-4380

      Observed that rebalancing out an FTS node in the high bucket density test with 30 buckets takes 764.42 min (~12.7 hours) and fails with the following reason.

      Each bucket has 1 FTS index.

      Rebalance failure reason-

      Rebalance exited with reason {service_rebalance_failed,fts,
      {worker_died,
      {'EXIT',<0.13606.525>,
      {{linked_process_died,<27016.24380.381>,
      {timeout,
      {gen_server,call,
      [<27016.22501.381>,
      {call,"ServiceAPI.GetCurrentTopology",
      #Fun<json_rpc_connection.0.102434519>},
      60000]}}},
      {gen_server,call,
      [{'service_agent-fts','ns_1@172.23.97.15'},
      {if_rebalance,<0.19462.525>,
      {start_rebalance,
      <<"f81721335b42662e837612d6b366cca3">>,
      rebalance,
      [{[{node_id,
      <<"827b0f5298e3447fa469c0ebdcad3da0">>},
      {priority,0},
      {opaque,null}],
      full}],
      [[{node_id,
      <<"592321897f0f721188d778a77f4c9d9b">>},
      {priority,0},
      {opaque,null}]],
      <0.13606.525>}},
      90000]}}}}}.
      Rebalance Operation Id = f5116259df628e7619516c5b6cba1997
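
      The failing call in this trace is ns_server's service_agent asking the fts service manager for its current topology over json-rpc and giving up after 60 s (the 60000 in the trace), which in turn fails the outer start_rebalance call (the 90000 ms one). A minimal Go sketch of that bounded long-poll pattern, using simplified stand-in types and scaled-down timeouts rather than the actual cbauth/service definitions:

      {code:go}
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Topology is a simplified stand-in for the topology answer the service
// manager returns; the real types live in couchbase/cbauth/service.
type Topology struct {
	Rev   []byte
	Nodes []string
}

// getCurrentTopology stands in for "ServiceAPI.GetCurrentTopology": a
// long-poll that answers when topology changes. Under heavy load (e.g.
// hundreds of pindexes in motion) it may not answer before the deadline.
func getCurrentTopology(ctx context.Context) (*Topology, error) {
	busyFor := 100 * time.Millisecond // pretend the service is overloaded
	select {
	case <-time.After(busyFor):
		return &Topology{Rev: []byte("1"), Nodes: []string{"ns_1@172.23.97.15"}}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	// 60 ms here stands in for ns_server's 60,000 ms gen_server:call timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Millisecond)
	defer cancel()

	if _, err := getCurrentTopology(ctx); errors.Is(err, context.DeadlineExceeded) {
		// ns_server surfaces this as {timeout,{gen_server,call,...}} and the
		// rebalance exits with service_rebalance_failed.
		fmt.Println("GetCurrentTopology timed out:", err)
	}
}
      {code}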
      

       On build 6.5.0-2082, this rebalance out took 79 min.

      Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/322/

      Logs-
      FTS node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.20.zip
      FTS node going out- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.15.zip

      Other nodes-
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.16.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.17.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.23.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.12.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.14.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.177.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.19.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.20.zip


          Activity

            mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Summary: 'FTS rebalance out in high bucket density test takes huge time' → 'FTS rebalance out in high bucket density test takes 12 hours'
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Description: updated to quantify the rebalance time as 764.42 min (previously 'taking huge time') and to drop the line 'I terminated test and collected logs.' The note 'At the time of log collection it took 11 hours and rebalance progress was ~50%' was kept at this point.
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Summary: 'FTS rebalance out in high bucket density test takes 12 hours' → 'FTS rebalance out in high bucket density test takes 12 hours and fails'
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Description: updated to the text shown in the current description above, adding the rebalance failure reason and the perf-artifacts log links, and dropping the note 'At the time of log collection it took 11 hours and rebalance progress was ~50%.'
            wayne Wayne Siu made changes -
            Summary: 'FTS rebalance out in high bucket density test takes 12 hours and fails' → '[high-bucket] FTS rebalance out in high bucket density test takes 12 hours and fails'
            raju Raju Suravarjjala made changes -
            Priority: Major → Critical
            raju Raju Suravarjjala made changes -
            Fix Version/s: Mad-Hatter
            Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
            Assignee: Keshav Murthy → Sreekanth Sivasankaran

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Mahesh Mandhare, it looks like this issue would have been fixed/improved by the change set for MB-34911.

            Can we get this retested on build 6.5.0-4522 or the latest?
            Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
            Assignee: Sreekanth Sivasankaran → Mahesh Mandhare
            lynn.straus Lynn Straus made changes -
            Labels: Performance high-bucket-density → Performance approved-for-mad-hatter high-bucket-density
            mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
            Assignee: Mahesh Mandhare → Sreekanth Sivasankaran
            lynn.straus Lynn Straus added a comment -

            Added due date field (preset to Nov 15). Please update the due date to the current ETA for a fix. Thanks.

            lynn.straus Lynn Straus made changes -
            Due Date: 15/Nov/19

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Good to know that the rebalance passed now.

            Given that this is a 2-node rebalance in/out operation for 30 indexes, I wouldn't be surprised that it took 10 hours. But this needs to be investigated further, mostly for CC, as we don't have any immediate fixes available for this.
            Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
            Is this a Regression?: Yes → No
            keshav Keshav Murthy made changes -
            Fix Version/s: Mad-Hatter → Cheshire-Cat

            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

            Build 6.5.0-4959

            On this build, rebalance out of the FTS node took 1739.46 min (~29 hours).

            Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/345

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Shivani Gupta, unlike what the ticket title says, this rebalance passes now; it just progresses really slowly. The scaling fix for this needs to be a well-thought-through and comprehensive one. We are not planning any performance improvement fixes in 6.5.1.

            shivani.gupta Shivani Gupta added a comment -

            Sreekanth Sivasankaran, thanks. Can you please fix the title so that it does not cause confusion going forward?
            Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
            Summary: '[high-bucket] FTS rebalance out in high bucket density test takes 12 hours and fails' → '[high-bucket] FTS rebalance out in high bucket density test takes really long hours (1'
            Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
            Summary: '[high-bucket] FTS rebalance out in high bucket density test takes really long hours (1' → '[high-bucket] FTS rebalance out in high bucket density test takes really long hours (28hrs+)'
            shivani.gupta Shivani Gupta added a comment - edited

            Sreekanth Sivasankaran, in response to your comment:

            "Given that this is a 2-node rebalance in/out operation for 30 indexes, I wouldn't be surprised that it took 10 hours. But this needs to be investigated further, mostly for CC, as we don't have any immediate fixes available for this."

            This is not expected behavior. As mentioned by Mahesh Mandhare, on 6.5.0-2082 this rebalance out took 79 min.

            Clearly there is some regression here. Please see if this can be addressed for 6.5.1.

            Keshav Murthy FYI.

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment -

            Thanks Shivani Gupta for bringing this point right to my radar; I was completely oblivious (missed that point in the description) to the fact that it used to work in older builds. Let me check this asap.
            abhinav Abhinav Dangeti made changes -
            Link: This issue relates to MB-34911
            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - edited

            In the recent logs of the rebalance out operation (the 345 job run), I can see heavy app_herder intervention slowing things down. Working with Mahesh Mandhare to get the older logs to verify/debug how it used to work earlier without the memory pressure.

            e.g.:

            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: persister progress, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408568, preIndexingMemory: 561352, indexes: 152, waiting: 301
            2019-12-27T03:38:22.586-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408175, preIndexingMemory: 560959, indexes: 152, waiting: 301 
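
            For context on these lines: memUsed sitting above indexQuota with hundreds of batches in "waiting" means the herder is blocking index mutations until persisters free memory, which is what stalls rebalance progress. A minimal Go sketch of that admission-control idea, assuming a simplified quota gate rather than the actual cbft app_herder code:

            {code:go}
package main

import (
	"fmt"
	"sync"
	"time"
)

// herder is a simplified memory-quota gate in the spirit of the app_herder
// log lines above; it is illustrative, not the real cbft implementation.
type herder struct {
	mu      sync.Mutex
	cond    *sync.Cond
	quota   int64 // cf. indexQuota
	memUsed int64 // cf. memUsed
	waiting int   // cf. waiting: N
}

func newHerder(quota int64) *herder {
	h := &herder{quota: quota}
	h.cond = sync.NewCond(&h.mu)
	return h
}

// admit blocks an indexing batch of size n while the quota is exceeded,
// mirroring "app_herder: indexing over indexQuota ... waiting: N".
func (h *herder) admit(n int64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for h.memUsed+n > h.quota {
		h.waiting++
		fmt.Printf("indexing over quota: memUsed=%d quota=%d waiting=%d\n",
			h.memUsed, h.quota, h.waiting)
		h.cond.Wait()
		h.waiting--
	}
	h.memUsed += n
}

// persisted releases memory once a flush completes, mirroring
// "app_herder: persister progress", and wakes any waiting writers.
func (h *herder) persisted(n int64) {
	h.mu.Lock()
	h.memUsed -= n
	h.mu.Unlock()
	h.cond.Broadcast()
}

func main() {
	h := newHerder(1000)
	h.admit(800) // first batch fits under quota

	done := make(chan struct{})
	go func() { h.admit(500); close(done) }() // over quota: must wait

	time.Sleep(50 * time.Millisecond) // let the writer block
	h.persisted(800)                  // persister frees memory
	<-done
	fmt.Println("waiting batch admitted after persister progress")
}
            {code}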

            Also noted with the 325 job run: there is some really slow progress on the rebalance side without any app_herder intervention, too.

            2019-10-23T23:26:03.726-07:00 [INFO] ctl/manager: revNum: 14621, progress: 0.455556
            2019-10-23T23:26:03.726-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 14620, changed, rv: &{Rev:[49 52 54 50 50] Tasks:[{Rev:[49 52 54 50 49] ID:rebalance:9396080039e9aacca18d08878b143b72 Type:task-rebalance Status:task-running IsCancelable:true Progress:0.45555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:9396080039e9aacca18d08878b143b72 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:4a705a0cdc76bd9f03491477bdd47bc4 Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:1b30978cfc3e26e1963452e02afaf888 Priority:0 Opaque:<nil>}]}]}]}
            2019-10-23T23:26:03.836-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-15-fts_less_words_4facf2d636f76065_54820232-2f78619e, worker, looping beg, vbucketState: "running" (has 86 vbuckets), 342-427
            2019-10-23T23:26:03.844-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-15-fts_less_words_4facf2d636f76065_54820232-2f78619e, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 428-512
            2019-10-23T23:26:03.860-07:00 [INFO] janitor: pindexes to remove: 0
            2019-10-23T23:26:03.860-07:00 [INFO] janitor: pindexes to add: 0
            2019-10-23T23:26:03.860-07:00 [INFO] janitor: pindexes to restart: 0
            2019-10-23T23:26:03.861-07:00 [INFO] janitor: feeds to remove: 0
            2019-10-23T23:26:03.861-07:00 [INFO] janitor: feeds to add: 0
            2019-10-23T23:26:03.861-07:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: planPIndexes
            2019-10-23T23:26:03.987-07:00 [INFO] ctl/manager: revNum: 14623, progress: 0.455556
            2019-10-23T23:26:03.987-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 14622, changed, rv: &{Rev:[49 52 54 50 52] Tasks:[{Rev:[49 52 54 50 51] ID:rebalance:9396080039e9aacca18d08878b143b72 Type:task-rebalance Status:task-running IsCancelable:true Progress:0.45555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:9396080039e9aacca18d08878b143b72 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:4a705a0cdc76bd9f03491477bdd47bc4 Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:1b30978cfc3e26e1963452e02afaf888 Priority:0 Opaque:<nil>}]}]}]}
            2019-10-23T23:26:04.222-07:00 [INFO] janitor: pindexes to remove: 0
            2019-10-23T23:26:04.222-07:00 [INFO] janitor: pindexes to add: 0
            2019-10-23T23:26:04.222-07:00 [INFO] janitor: pindexes to restart: 0
            2019-10-23T23:26:04.223-07:00 [INFO] janitor: feeds to remove: 0
            2019-10-23T23:26:04.223-07:00 [INFO] janitor: feeds to add: 0
            2019-10-23T23:26:04.223-07:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: planPIndexes
            2019-10-23T23:26:04.238-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-1-fts_less_words_3e0e4ea008a374f1_f4e0a48a-52f3cc6d, worker, looping beg, vbucketState: "running" (has 1 vbuckets), 256
            2019-10-23T23:26:04.238-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-1-fts_less_words_3e0e4ea008a374f1_f4e0a48a-52f3cc6d, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 171-255
            2019-10-23T23:26:04.243-07:00 [INFO] cbdatasource: server: 172.23.97.14:11210, uprOpenName: fts:bucket-1-fts_less_words_3e0e4ea008a374f1_f4e0a48a-52f3cc6d, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 257-341
            2019-10-23T23:26:04.391-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-16-fts_more_words_3ec762f1cbd8aac3_13aa53f3-14b18c81, worker, looping beg, vbucketState: "running" (has 84 vbuckets), 87-170
            2019-10-23T23:26:04.392-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-16-fts_more_words_3ec762f1cbd8aac3_13aa53f3-14b18c81, worker, looping beg, vbucketState: "running" (has 87 vbuckets), 0-86
            2019-10-23T23:26:04.593-07:00 [INFO] janitor: pindexes to remove: 0 
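
            For context on the repeating janitor lines: each cfg "kick" makes the janitor diff the planned pindexes against what the node currently runs, and the add/remove/restart counts in the logs are the sizes of those diffs (all zero here, i.e. no-op passes). A hedged sketch of that reconcile step, with simplified types rather than the actual cbft janitor:

            {code:go}
package main

import "fmt"

// reconcile diffs planned pindexes (name -> plan revision) against the
// pindexes a node currently runs. Simplified stand-in for cbft's janitor.
func reconcile(planned, current map[string]string) (toAdd, toRemove, toRestart []string) {
	for name, plan := range planned {
		switch cur, ok := current[name]; {
		case !ok:
			toAdd = append(toAdd, name) // planned but not running
		case cur != plan:
			toRestart = append(toRestart, name) // plan changed under it
		}
	}
	for name := range current {
		if _, ok := planned[name]; !ok {
			toRemove = append(toRemove, name) // running but no longer planned
		}
	}
	return
}

func main() {
	// Identical plan and state: the no-op pattern seen in the logs.
	planned := map[string]string{"bucket-1-fts_less_words": "rev-1"}
	current := map[string]string{"bucket-1-fts_less_words": "rev-1"}
	add, rem, restart := reconcile(planned, current)
	fmt.Printf("janitor: pindexes to add: %d, remove: %d, restart: %d\n",
		len(add), len(rem), len(restart))
}
            {code}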

            This needs to be investigated further.

            keshav Keshav Murthy made changes -
            Assignee: Sreekanth Sivasankaran → Jyotsna Nayak

            keshav Keshav Murthy added a comment -

            The same high bucket density test ran into this as well as MB-34911. Please rerun the test and let us know.
            keshav Keshav Murthy made changes -
            Priority: Critical → Major
            lynn.straus Lynn Straus added a comment -

            Per review in the Cheshire Cat Ops meeting and per email updates, all remaining majors/minors/trivials are being moved out of Cheshire Cat on March 10. The purpose is to allow teams to focus on Critical bugs for the remainder of the release. Teams can "pull back" deferred bugs as schedule and bandwidth allow.

            Changing target fix version to CheshireCat.next and adding "deferred-from-Cheshire-Cat" label.

            lynn.straus Lynn Straus made changes -
            Fix Version/s: Cheshire-Cat → CheshireCat.Next
            Labels: Performance approved-for-mad-hatter high-bucket-density → Performance approved-for-mad-hatter deferred-from-Cheshire-Cat high-bucket-density
            jyotsna.nayak Jyotsna Nayak made changes -
            Status: Open → In Progress

            jyotsna.nayak Jyotsna Nayak added a comment -

            Have rerun this test on CC build 7.0.0-4554.
            Out rebalance completed. Time taken for swap rebalance: 78.15 min (1 hr 18 min).
            Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/63/consoleFull
            jyotsna.nayak Jyotsna Nayak made changes -
            Resolution: Fixed
            Status: In Progress → Closed

            People

              Assignee: jyotsna.nayak Jyotsna Nayak
              Reporter: mahesh.mandhare Mahesh Mandhare (Inactive)
              Votes: 0
              Watchers: 8

