Couchbase Server / MB-36400

[high-bucket] FTS rebalance out in high bucket density test takes very long (28 hrs+)


Details

    • Untriaged
    • No

    Description

      Build 6.5.0-4380

      Observed that rebalancing out an FTS node in the high bucket density test with 30 buckets takes 764.42 min (~12.7 hours) and fails with the following reason.

      Each bucket has 1 FTS index.

      Rebalance failure reason:

      Rebalance exited with reason {service_rebalance_failed,fts,
      {worker_died,
      {'EXIT',<0.13606.525>,
      {{linked_process_died,<27016.24380.381>,
      {timeout,
      {gen_server,call,
      [<27016.22501.381>,
      {call,"ServiceAPI.GetCurrentTopology",
      #Fun<json_rpc_connection.0.102434519>},
      60000]}}},
      {gen_server,call,
      [{'service_agent-fts','ns_1@172.23.97.15'},
      {if_rebalance,<0.19462.525>,
      {start_rebalance,
      <<"f81721335b42662e837612d6b366cca3">>,
      rebalance,
      [{[{node_id,
      <<"827b0f5298e3447fa469c0ebdcad3da0">>},
      {priority,0},
      {opaque,null}],
      full}],
      [[{node_id,
      <<"592321897f0f721188d778a77f4c9d9b">>},
      {priority,0},
      {opaque,null}]],
      <0.13606.525>}},
      90000]}}}}}.
      Rebalance Operation Id = f5116259df628e7619516c5b6cba1997
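
      For reference, the failure above is ns_server timing out a JSON-RPC call into the FTS service ("ServiceAPI.GetCurrentTopology", 60000 ms), which then kills the rebalance worker. A minimal Go sketch of that call-with-timeout pattern (the endpoint, request shape, and timeout handling here are illustrative assumptions, not the actual ns_server/cbauth plumbing):

      package main

      import (
          "fmt"
          "net/rpc/jsonrpc"
          "time"
      )

      func main() {
          // Hypothetical endpoint; in the cluster the connection is managed by
          // ns_server's json_rpc_connection to the fts service on each node.
          client, err := jsonrpc.Dial("tcp", "127.0.0.1:9999")
          if err != nil {
              fmt.Println("dial error:", err)
              return
          }
          defer client.Close()

          var topology interface{}
          call := client.Go("ServiceAPI.GetCurrentTopology", struct{}{}, &topology, nil)

          select {
          case <-call.Done:
              fmt.Println("topology:", topology, "err:", call.Error)
          case <-time.After(60 * time.Second):
              // This branch corresponds to the {timeout, {gen_server, call, ...}}
              // seen in the rebalance failure above.
              fmt.Println("GetCurrentTopology timed out after 60s")
          }
      }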
      

       For comparison, on build 6.5.0-2082 this rebalance out took 79 min (roughly a 9.7x difference).

      Job: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/322/

      Logs:
      FTS node: https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.20.zip
      FTS node going out: https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.15.zip

      Other nodes:
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.16.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.17.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.96.23.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.12.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.14.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.177.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.19.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-322/172.23.97.20.zip

      Attachments

        Issue Links


          Activity

            jyotsna.nayak Jyotsna Nayak added a comment -

            Have rerun this test on CC build 7.0.0-4554. The rebalance out completed. Time taken for the swap rebalance: 78.15 min (1 hr 18 min).
            Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/63/consoleFull

            lynn.straus Lynn Straus added a comment -

            Per review in the Cheshire Cat Ops meeting and per email updates, all remaining majors/minors/trivials are being moved out of Cheshire Cat on March 10. The purpose is to allow teams to focus on Critical bugs for the remainder of the release. Teams can "pull back" deferred bugs as schedule and bandwidth allow.

            Changing target fix version to CheshireCat.next and adding "deferred-from-Cheshire-Cat" label.


            keshav Keshav Murthy added a comment -

            The same high bucket density test ran into this as well as MB-34911. Please rerun the test and let us know.

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            In the recent logs of the rebalance out operation (job run 345), I can see heavy app_herder intervention slowing things down. Working with Mahesh Mandhare to get the older logs to verify/debug how this used to work earlier without the memory pressure.

            For example:

            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408524, preIndexingMemory: 561308, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: persister progress, indexes: 152, waiting: 301
            2019-12-27T03:38:22.585-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408568, preIndexingMemory: 561352, indexes: 152, waiting: 301
            2019-12-27T03:38:22.586-08:00 [INFO] app_herder: indexing over indexQuota: 31457280000, memUsed: 34548408175, preIndexingMemory: 560959, indexes: 152, waiting: 301 
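
            For context, app_herder is FTS's memory throttle: when memory used by indexing exceeds the index quota, new indexing batches are made to wait until the persister frees memory, which is what the repeated "indexing over indexQuota" lines (with ~301 batches waiting) show. A simplified Go sketch of that admission check (hypothetical types and field names, not the actual cbft app_herder implementation):

            package main

            import (
                "fmt"
                "sync"
            )

            // appHerder is a simplified stand-in for the FTS memory throttle:
            // indexing batches must pass an admission check before proceeding.
            type appHerder struct {
                mu         sync.Mutex
                cond       *sync.Cond
                indexQuota uint64 // e.g. 31457280000 bytes in the logs above
                memUsed    uint64 // e.g. 34548408524 bytes in the logs above
                waiting    int    // e.g. ~301 waiters in the logs above
            }

            func newAppHerder(quota uint64) *appHerder {
                h := &appHerder{indexQuota: quota}
                h.cond = sync.NewCond(&h.mu)
                return h
            }

            // admit blocks an indexing batch while usage is over quota, which is
            // the condition that produces the "indexing over indexQuota" log lines.
            func (h *appHerder) admit(batchBytes uint64) {
                h.mu.Lock()
                defer h.mu.Unlock()
                for h.memUsed+batchBytes > h.indexQuota {
                    h.waiting++
                    fmt.Printf("app_herder: indexing over indexQuota: %d, memUsed: %d, waiting: %d\n",
                        h.indexQuota, h.memUsed, h.waiting)
                    h.cond.Wait()
                    h.waiting--
                }
                h.memUsed += batchBytes
            }

            // release is called when persisted memory is freed, waking waiters.
            func (h *appHerder) release(batchBytes uint64) {
                h.mu.Lock()
                h.memUsed -= batchBytes
                h.mu.Unlock()
                h.cond.Broadcast()
            }

            func main() {
                h := newAppHerder(31457280000)
                h.admit(1 << 20) // admitted immediately while under quota
                h.release(1 << 20)
            }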

             

            Also noted in the job 325 run: there are stretches of very slow progress on the rebalance side without any app_herder intervention at all.

            2019-10-23T23:26:03.726-07:00 [INFO] ctl/manager: revNum: 14621, progress: 0.455556
            2019-10-23T23:26:03.726-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 14620, changed, rv: &{Rev:[49 52 54 50 50] Tasks:[{Rev:[49 52 54 50 49] ID:rebalance:9396080039e9aacca18d08878b143b72 Type:task-rebalance Status:task-running IsCancelable:true Progress:0.45555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:9396080039e9aacca18d08878b143b72 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:4a705a0cdc76bd9f03491477bdd47bc4 Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:1b30978cfc3e26e1963452e02afaf888 Priority:0 Opaque:<nil>}]}]}]}
            2019-10-23T23:26:03.836-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-15-fts_less_words_4facf2d636f76065_54820232-2f78619e, worker, looping beg, vbucketState: "running" (has 86 vbuckets), 342-427
            2019-10-23T23:26:03.844-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-15-fts_less_words_4facf2d636f76065_54820232-2f78619e, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 428-512
            2019-10-23T23:26:03.860-07:00 [INFO] janitor: pindexes to remove: 0
            2019-10-23T23:26:03.860-07:00 [INFO] janitor: pindexes to add: 0
            2019-10-23T23:26:03.860-07:00 [INFO] janitor: pindexes to restart: 0
            2019-10-23T23:26:03.861-07:00 [INFO] janitor: feeds to remove: 0
            2019-10-23T23:26:03.861-07:00 [INFO] janitor: feeds to add: 0
            2019-10-23T23:26:03.861-07:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: planPIndexes
            2019-10-23T23:26:03.987-07:00 [INFO] ctl/manager: revNum: 14623, progress: 0.455556
            2019-10-23T23:26:03.987-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 14622, changed, rv: &{Rev:[49 52 54 50 52] Tasks:[{Rev:[49 52 54 50 51] ID:rebalance:9396080039e9aacca18d08878b143b72 Type:task-rebalance Status:task-running IsCancelable:true Progress:0.45555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:9396080039e9aacca18d08878b143b72 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:4a705a0cdc76bd9f03491477bdd47bc4 Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:1b30978cfc3e26e1963452e02afaf888 Priority:0 Opaque:<nil>}]}]}]}
            2019-10-23T23:26:04.222-07:00 [INFO] janitor: pindexes to remove: 0
            2019-10-23T23:26:04.222-07:00 [INFO] janitor: pindexes to add: 0
            2019-10-23T23:26:04.222-07:00 [INFO] janitor: pindexes to restart: 0
            2019-10-23T23:26:04.223-07:00 [INFO] janitor: feeds to remove: 0
            2019-10-23T23:26:04.223-07:00 [INFO] janitor: feeds to add: 0
            2019-10-23T23:26:04.223-07:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: planPIndexes
            2019-10-23T23:26:04.238-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-1-fts_less_words_3e0e4ea008a374f1_f4e0a48a-52f3cc6d, worker, looping beg, vbucketState: "running" (has 1 vbuckets), 256
            2019-10-23T23:26:04.238-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-1-fts_less_words_3e0e4ea008a374f1_f4e0a48a-52f3cc6d, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 171-255
            2019-10-23T23:26:04.243-07:00 [INFO] cbdatasource: server: 172.23.97.14:11210, uprOpenName: fts:bucket-1-fts_less_words_3e0e4ea008a374f1_f4e0a48a-52f3cc6d, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 257-341
            2019-10-23T23:26:04.391-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-16-fts_more_words_3ec762f1cbd8aac3_13aa53f3-14b18c81, worker, looping beg, vbucketState: "running" (has 84 vbuckets), 87-170
            2019-10-23T23:26:04.392-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-16-fts_more_words_3ec762f1cbd8aac3_13aa53f3-14b18c81, worker, looping beg, vbucketState: "running" (has 87 vbuckets), 0-86
            2019-10-23T23:26:04.593-07:00 [INFO] janitor: pindexes to remove: 0 
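
            A small decoding note on the GetTaskList entries above: the Rev fields are byte slices of ASCII digits, so Rev:[49 52 54 50 50] is simply the revision string "14622" (and [49 52 54 50 49] is "14621"), matching the revNum lines, while the rebalance Progress itself stays pinned at 0.4555... across these revisions. For example:

            package main

            import "fmt"

            func main() {
                // Rev values in the GetTaskList output are ASCII digit bytes.
                rev := []byte{49, 52, 54, 50, 50}
                fmt.Println(string(rev)) // prints "14622"
            }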

            This needs to be investigated further.


            Thanks Shivani Gupta for bringing this point to my attention; I completely missed the point in the description that this used to work in older builds. Let me check this ASAP.


            People

              jyotsna.nayak Jyotsna Nayak
              mahesh.mandhare Mahesh Mandhare (Inactive)
              Votes: 0
              Watchers: 8


                Gerrit Reviews

                  There are no open Gerrit changes
