Couchbase Server
MB-6726

Rebalance is slow when indexing/compaction and query load are going on in parallel

    Details

    • Sprint:
      PCI Team - Sprint 1, PCI Team - Sprint 5

      Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 32 GB RAM and a 400 GB SSD disk.
      • SSD disk formatted ext4, mounted on /data
      • Each server has its own drive; no disk sharing with other servers.
      • Cluster has 2 buckets, default (11 GB) and saslbucket (11 GB), with consistent views enabled. The 2 buckets together use only 68% of total system RAM.
      • Load 12 million items into saslbucket and 45 million items into the default bucket. Each key has a size from 512 bytes to 1500 bytes.
      • Each bucket has one design doc with 2 views per doc (default: d1, saslbucket: d11)
      • Create a cluster of 4 nodes with couchbase server 2.0.0-1746 installed

      10.6.2.37
      10.6.2.38
      10.6.2.39
      10.6.2.40

      • Data path /data
      • View path /data
      • Keep the load running at 6K ops/sec, with 400 to 500 queries per second
      • Add 4 nodes to the cluster and rebalance
        10.6.2.42
        10.6.2.43
        10.6.2.44
        10.6.2.45
      • Rebalance hung. Filed bug MB-6706
      • Tried rebalance again. The cluster rebalanced saslbucket first, which took 9 hours to complete
      • The default bucket took 32 hours to complete

      Link to collect_info of all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1746-reb-slow-constn-enable-20120925-141314.tgz

      Each atop file is 1.5 GB to 2.5 GB in size. If the atop files are needed, I will upload them on request.
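
      The query load in this setup exercises consistent views: a stale=false view query forces the index to catch up before returning rows, which is exactly what interacts with rebalance here. A minimal sketch of such a request, for illustration only (the view name "v1" is hypothetical; ddoc "d1" and the 8092 view port follow the cluster description above and Couchbase 2.0 conventions):

```python
# Sketch of a consistent (stale=false) view query like those in the test
# load. The view name "v1" is an assumption; ddoc "d1" and port 8092
# follow the setup described above.
from urllib.parse import urlencode

def view_query_url(host, bucket, ddoc, view, consistent=True):
    """Build the URL for a Couchbase 2.0-style view query."""
    params = {"stale": "false" if consistent else "ok", "limit": 10}
    return (f"http://{host}:8092/{bucket}/_design/{ddoc}/_view/{view}"
            f"?{urlencode(params)}")

url = view_query_url("10.6.2.37", "default", "d1", "v1")
print(url)
# http://10.6.2.37:8092/default/_design/d1/_view/v1?stale=false&limit=10
```

      Passing consistent=False ("stale=ok") skips the index catch-up, which is the behaviour without consistent views.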

      Attachments:

      • master_events.log.svg (246 kB, Aleksey Kondratenko)
      • master_events.tar.gz (1.05 MB, Pavel Paulau)
      • rebalance_progress.png (37 kB)


          Activity

          chiyoung Chiyoung Seo added a comment -

          http://www.couchbase.com/issues/browse/MB-6714

          The fix was just merged today. Please test it with the next build.

          ketaki Ketaki Gangal added a comment -

          Seeing slower rebalance, as in MB-6769, with the latest ep-engine changes.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Ketaki, XDCR is a very different beast. I've seen it cause very severe CPU load and weird memory usage spikes, which could affect anything.

          I need:

          a) diags

          b) I think brand new bug is a good idea here

          ketaki Ketaki Gangal added a comment -

          Ok, I had filed bug 6769 for the same issue. But after some discussion w/ other folks, I closed that as a duplicate of this one.

          We can have an XDCR-specific bug for understanding this behaviour.
          Closing this one and re-opening 6769, with more logs.

          ketaki Ketaki Gangal added a comment -

          This bug was opened by Tony for slowness on rebalance and was fixed with http://www.couchbase.com/issues/browse/MB-6714.

          Closed this bug since we have another rebalance-xdcr slow bug MB-6769.

          Please re-open this, if this case is valid on the above system-test scenarios.

          farshid Farshid Ghods (Inactive) added a comment -

          Blocker for 2.0 beta-2, as consistent views need to be functional and perform reasonably fast for 2.0 beta-2.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          The original bug was indeed fixed by the ep-engine folks.

          I've heard we're still having some slowness with indexes. Please open a new bug with new and relevant diagnostics.

          karan Karan Kumar (Inactive) added a comment -

          We are still investigating slowness with indexes+rebalance+consistent views

          sharon Sharon Barr (Inactive) added a comment -

          Isn't this a duplicate of MB-6796?

          farshid Farshid Ghods (Inactive) added a comment -

          Once that's fixed we can retest.

          steve Steve Yen added a comment -

          pavel's test yesterday shows views + rebalance performance of...

          > Rebalance from 3 to 4 nodes with views (consistent view enabled).
          > just 4.7h (2.0.0-1858) vs. 5.3 h (2.0.0-1792).

          https://www.yammer.com/couchbase.com/#/Threads/show?threadId=224286191

          thuan Thuan Nguyen added a comment -

          Re-tested orange cluster with build 2.0.0-1862; swap rebalance took 43 hours to complete. Saslbucket took 20 hours and the default bucket took 23.

          Cluster has 2 buckets (2 replicas), default and saslbucket. Each bucket has 15 million items with key sizes from 512 to 1024 bytes.
          Do a swap rebalance: add 2 nodes and remove 2 nodes. During rebalance, maintain a load of about 6K ops on each bucket and about 200 to 300 queries per second on each bucket.

          thuan Thuan Nguyen added a comment -

          Link to manifest file of build 2.0.0-1862: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1862-rel.rpm.manifest.xml
          chiyoung Chiyoung Seo added a comment -

          Farshid,

          I don't think there is anything that I can do in ep-engine to resolve this issue.

          We know that this slowness is caused by enabling indexing / querying during rebalance with consistent views.

          Please assign it to Filipe or Alk for further investigations.

          farshid Farshid Ghods (Inactive) added a comment -

          Assigning this to Siri/Mike for further investigation and triage.

          steve Steve Yen added a comment -

          assigning to mike as sriram is on a plane at the moment. need to get alk + damien on this.

          steve Steve Yen added a comment - - edited

          latest update: pavel, alk & steve have agreed on a new larger test with more isolation. awaiting results.

          steve Steve Yen added a comment -

          notes from bug-scrub discussion...

          • farshid: QA now trying system testing with fewer vbuckets (64), which is an unsupported configuration but needed so that test results can be turned around faster.
          • alk is considering batching – on a small scale his tests show it might help (but unknown what would happen at larger scale).
          • alk: consider recommending users disable consistent views before they rebalance.
          • yaseen: need to find out the boundary conditions around this.
          pavelpaulau Pavel Paulau added a comment -

          Results for Alk's spec:

          https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdFVINGEyZ0hzOGUtbWVHeUU3T3Rlc1E#gid=0

          We should use more ddocs/views. Otherwise indexing is too fast.

          pavelpaulau Pavel Paulau added a comment -

          Conclusions:
          – indexing is not a bottleneck
          – compaction doesn't affect results significantly (again)
          – we have to try three additional dimensions: more items (20M), more design ddocs (2), fewer vbuckets (512).

          steve Steve Yen added a comment -

          Alk says no more experiment ideas at the moment for Pavel.

          steve Steve Yen added a comment -

          options reviewed in bug-scrub...

          • sequential vs parallel fix for rebalancing (start next phase right after backfill's done) [alk]
          • vbucket-id as prefix to back-index [filipe] - early laptop results look promising
          steve Steve Yen added a comment -

          bug-scrub moved to 2.0.1

          mikew Mike Wiederhold added a comment -

          Editing the component to "view engine" and "ns_server". If there is anything that needs to be done for this on the ep-engine side please let us know.

          farshid Farshid Ghods (Inactive) added a comment -

          system test team needs to rerun the view tests ( 8 node SSD )

          pavelpaulau Pavel Paulau added a comment -

          As of build 121 there is no significant improvement in perf. tests.

          farshid Farshid Ghods (Inactive) added a comment -

          can you paste the link to the results ? ( comparing this build against 2.0 GA )

          pavelpaulau Pavel Paulau added a comment -

          2.0.0-1971: http://qa.hq.northscale.net/job/eperf-graph-loop/1217/artifact/reb-vperf-10M.loop_2.0.0-1971-rel-enterprise_2.0.0-1971-rel-enterprise_run_1_Dec-04-2012_22:54:50.pdf
          2.0.1-121: http://qa.hq.northscale.net/view/eperf-graphs/job/eperf-graph-loop/1334/artifact/reb-vperf-10M.loop_2.0.1-121-rel-enterprise_2.0.1-121-rel-enterprise_run_1_Jan-09-2013_06%3A40%3A40.pdf

          It's not quite a fair comparison, but rebalance time is still ~4 hours.
          farshid Farshid Ghods (Inactive) added a comment -

          rebalance is 10% faster, but that could be environment noise

          looks better:
          cpu util 5% better
          the memory usage numbers don't look quite right (ns-server avg reports 21235692 MB == 21 GB?)

          looks worse:
          runtime is 8 hours longer
          query ops/sec dropped from 537 to 339, but query latency is far better (48 msec in 2.0.1 compared to 58 in 2.0)
          drain rate is slower by 15%

          we need to find out why query ops/sec is so slow.

          reb-vperf-10M.conf

          1. Perf-rebalance test with views
          2. 1 design ddoc, 8 views
          3. 8K ops/sec
          4. 80% reads, 20% writes (12% updates/deletes, 8% inserts)
          5. 10M dataset
          6. Rebalance from 3 to 4 nodes
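
          One way to read the odd memory figure above: 21235692 only matches "about 21 GB" if the stat is actually reported in kilobytes rather than the labelled MB. A quick arithmetic check (my interpretation of the unit mix-up, not confirmed against the ns_server code):

```python
# The reported ns-server average: labelled MB, but the magnitude only
# makes sense if the figure is actually in KB.
reported = 21235692

as_gb_if_kb = reported / 1024 / 1024  # interpret the figure as KB
print(round(as_gb_if_kb, 2))          # 20.25 -> roughly the quoted 21 GB

as_gb_if_mb = reported / 1024         # interpret it as MB, as labelled
print(round(as_gb_if_mb))             # 20738 GB -> absurd for a 32 GB node
```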
          farshid Farshid Ghods (Inactive) added a comment -

          Alk/Filipe,

          there are initial results available for the litmus test. Can you also review the results and comment on what we need to change in the test case?

          From my discussion with Alk/Filipe I heard adding more ddocs would help. Is there anything else we want to change in the test?

          FilipeManana Filipe Manana (Inactive) added a comment -

          I do have measured results, and documented steps to reproduce, for rebalance-out cases.
          What I see is that, especially for the 1 ddoc with 1 view case, the view engine (indexing and compaction) is poorly utilized:

          https://docs.google.com/document/d/1eUV53B5pXj-5X5FPcbOA3jz1U6jOjh1hnZmZa9bUNs0/edit

          So you have a bit more than theory there.

          pavelpaulau Pavel Paulau added a comment -

          As I mentioned before, it is not a fair comparison.

          After the 2.0 release I revised the config, and rebalance tests now use 3 ddocs with 3 views per ddoc.
          Moreover, after the 2.0 release we discovered an issue with one of the ssd drives, so for many reasons we need a re-run.

          But in any case, 4 hours doesn't look like something ultra fast.

          We are doing more experiments; for instance, rebalance out 4 -> 3 with no ops/queries took 7h (40M items) and 14h (100M items). Is that something that sounds good?

          FilipeManana Filipe Manana (Inactive) added a comment -

          New results, for build 2.0.1-126:

          https://docs.google.com/document/d/1aHHhY-ami84-aEQyx42h95rlp7Vi6WTwlozaIstFgfA/edit

          In the meanwhile Pavel already re-triggered the tests he mentioned for this new build.

          pavelpaulau Pavel Paulau added a comment -

          Sorry, it's not faster at all.
          The same regression as Ronnie reports:

          https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0

          farshid Farshid Ghods (Inactive) added a comment - - edited

          summarizing what i see in the google doc above :

          rebalance out, 10M items: 2.0.1 (+A) is 15% slower than 2.0 GA (+S16)
          rebalance in, 10M items: 2.0.1 (+A) is 40% slower than 2.0 GA (+S16)

          Pavel is planning on rerunning some tests with +A option and post more results.

          pavelpaulau Pavel Paulau added a comment -

          we should thank Filipe, w/o latest improvements it'd be much worse.

          FilipeManana Filipe Manana (Inactive) added a comment -

          evperf results for build 2.0.1-139:

          https://docs.google.com/document/d/16BQu-sD32U4qZxEK8KcvVZMdcqH_klV90DEbnoCu-_E/edit

          Especially for the 40M, 4 -> 3 rebalance out without load in parallel, indexing + index compaction is still far from being a bottleneck.

          Previously, until a month or so ago, indexing was triggered sequentially on every node for rebalance out, so improving that would theoretically give close to 3x faster rebalance. But that didn't happen, as the rebalancer now triggers compactions and waits for them to finish before issuing more indexing work, amongst other details possibly. However, index compaction doesn't justify not seeing a significantly faster rebalance, let's say around 2x for example.

          Talking with Pavel, no significant speedups were observed for the evperf rebalance tests after those improvements were merged.

          There's still something missing somewhere.

          farshid Farshid Ghods (Inactive) added a comment -

          based on recent perf runs from Pavel the 2.0.1-139-rel runs are much faster than 2.0.0 but are marked as not consistent.

          we are still waiting for Pavel to rerun these tests.

          farshid Farshid Ghods (Inactive) added a comment -

          looking at the chart
          2.0.1-139 IN 10M 3 x 3 ASYNC 2063
          2.0.1-139 OUT 10M 3 x 3 ASYNC 2401

          but both are marked as not consistent
          compared to 2.0.0 1976 GA

          2.0.0-1976 OUT 10M 3 x 3 ASYNC 18491, 15391
          2.0.0-1976 IN 10M 3 x 3 ASYNC 12035

          pavelpaulau Pavel Paulau added a comment -

          "not consistent" means rebalance aware index disabled.

          Alk requested these results.

          FilipeManana Filipe Manana (Inactive) added a comment -

          Slightly related, MB-7413 brought a 3x speedup to incremental index updates, which was thought to be the bottleneck. But after it was added, it didn't seem to have any significant impact on the rebalance time.

          There's now an evperf test to verify this, outside the rebalance context (added by CBD-767, thanks Pavel!).

          It confirms the 3x speedup:

            • build 2.0.1-120 (before MB-7413)
              jenkins: http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/62/console
              incremental update time: 5676.96956992 seconds

            • build 2.0.1-121 (when MB-7413 was first introduced)
              jenkins: http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/63/console
              incremental update time: 1812.51270294 seconds

            • build 2.0.1-140 (latest 2.0.1)
              jenkins: http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/64/console
              incremental update time: 1939.70245409 seconds
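
          The quoted times can be checked against the "3x" claim directly:

```python
# Verify the claimed ~3x incremental-update speedup from the runs above.
t_120 = 5676.96956992  # 2.0.1-120, before MB-7413
t_121 = 1812.51270294  # 2.0.1-121, MB-7413 merged
t_140 = 1939.70245409  # 2.0.1-140, latest 2.0.1

print(round(t_120 / t_121, 2))  # 3.13 -> matches the "3x speedup"
print(round(t_120 / t_140, 2))  # 2.93 -> still roughly 3x on the latest build
```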

          pavelpaulau Pavel Paulau added a comment -

          Build 143 master events logs:
          in 3->4, out 4->3, swap 3->3.

          jin Jin Lim added a comment -

          Pavel, please provide the latest performance numbers based on the latest build. I believe we have made good progress since build 145?

          pavelpaulau Pavel Paulau added a comment - - edited

          Since build 148 actually.

          http://dashboard.hq.couchbase.com/litmus/dashboard/
          (search for reb-vperf)

          + another summary

          https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0

          In some cases we observe up to 20% improvement, but basically we just fixed the MB-7676 regression.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Please point me to the logs and master events from these runs.

          pavelpaulau Pavel Paulau added a comment -

          For instance, rebalance-in 60M items:

          4DDOCS 2.0.0-1976: http://172.23.96.10:8080/job/apollo-views/26/artifact/
          4DDOCS 2.0.1-148: http://172.23.96.10:8080/job/apollo-views/25/artifact/
          1DDOC 2.0.0-1976: http://172.23.96.10:8080/job/apollo-views/23/artifact/
          1DDOC 2.0.1-148: http://172.23.96.10:8080/job/apollo-views/22/artifact/
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          May I have all of them and not just "for instance" ?

          pavelpaulau Pavel Paulau added a comment -

          Like 30 logs?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Sure. If there's some way for me to find all of them, then feel free to explain to me how.

          pavelpaulau Pavel Paulau added a comment -

          Well, I get these logs from 2 jobs:
          http://172.23.96.10:8080/job/apollo-views/
          http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/

          You can check the parameters and artifacts of every build. Or you can wait, so that later I can grab all of them and upload a single archive.
          This is something that I'd not recommend doing manually.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Unfortunately I cannot make sense of the stuff you linked to above.

          I will not necessarily need all the logs. I'll start by looking at |2.0.1-153 IN 60M 1 x 3 YES 19806 5.50| and its corresponding 2.0.0 events. But I may find other stuff useful, and thus I may need the logs of any test run.

          pavelpaulau Pavel Paulau added a comment -

          I added a logs column to that spreadsheet. In some cases there are no master events in the build artifacts.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Thanks a lot. I'll take a look

          alkondratenko Aleksey Kondratenko (Inactive) added a comment - edited

          See master_events.log.svg, generated from the master events of the rebalance (cut-and-pasted from the spreadsheet):

          2.0.1-153 IN 60M 1 x 3 YES 19806 5.5 no replica index http://172.23.96.10:8080/job/apollo-views/28/artifact/

          Empty intervals (I've confirmed from the logs) represent view compactions. Surprisingly, they eat about half of all time, at least judging by the picture.

          As a workaround I'd like you to try the same setup with an increased value of vbucket moves before forced view compaction: http://i.imgur.com/T38iPCS.png The internal setting for this variable is named rebalanceMovesBeforeCompaction.

          Let's try bumping it up a lot, say to 256, and see if that helps.

          cc-ing Filipe, in case he wants to add anything.
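          A minimal sketch of applying this change over the /internalSettings REST endpoint (host, port, and credentials below are placeholders for your cluster; it only runs against a live Couchbase node):

```shell
# Bump rebalanceMovesBeforeCompaction (default 16) to 256.
# Replace host, port, and credentials with your cluster's values.
curl -u Administrator:password \
     --data 'rebalanceMovesBeforeCompaction=256' \
     http://127.0.0.1:8091/internalSettings
```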

          pavelpaulau Pavel Paulau added a comment -

          Moving it to 2.1 since we see noticeable improvements in system tests, and even performance tests demonstrate some positive changes. Also, we don't expect any particular fix from Filipe and Alk for 2.0.1 and 2.0.2.

          So it's not fast enough yet, but this bug is obviously not a 2.0.1/2.0.2 blocker anymore.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment - edited

          I disagree with 2.0.2 a bit. There may still be reasonably low-hanging fruit left. I.e., it currently appears we're spending a bit too much time on view compaction. At this moment we can at least trade higher disk space usage during rebalance for faster rebalance time.

          pavelpaulau Pavel Paulau added a comment -

          I've tried 2.0.1-153 with rebalanceMovesBeforeCompaction=256. Brief comparison:
          – rebalance time: 3h, vs 5.5h with default settings and 5.4h with 2.0.0-1976
          – peak disk size: 415GB, vs 366GB with default settings and 613GB with 2.0.0-1976

          Build artifacts:
          http://172.23.96.10:8080/job/apollo-views/28/artifact/

          PDF Report:
          http://172.23.96.10:8080/job/apollo-graph-loop/25/artifact/reb-vperf-60M-in-1d-256rmbc.loop_2.0.1-153-rel-enterprise_2.0.1-153-rel-enterprise_APPOLO_Feb-14-2013_19%3A52%3A00.pdf

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          415 vs 366 includes data; the views difference is apparently quite a lot bigger.

          May I ask you to do a few runs with 32, 64 and 128?

          It appears that we should consider bumping that value for a great gain in rebalance time, but it remains to be seen which value works best. It's going to be a trade-off between disk space usage and rebalance time.

          pavelpaulau Pavel Paulau added a comment -

          No problem, will do.

          pavelpaulau Pavel Paulau added a comment -

          I observe very strange results in recent builds. For instance, the "60M + 1 ddoc + load" case is 2x slower compared with 2.0. See the spreadsheet for details (rows 16-18):
          https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0

          Alk, do you have any idea why it happens?

          pavelpaulau Pavel Paulau added a comment -

          + chart for rebalance progress 153 vs 160.

          ketaki Ketaki Gangal added a comment -

          Behaviour similar to http://www.couchbase.com/issues/browse/MB-7785

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Just spotted Pavel's message.

          May I have the manifest of build 160, where we supposedly jumped backwards about 2x perf-wise?

          pavelpaulau Pavel Paulau added a comment - edited

          Ok, a re-run of 160 was just 1.2x slower.

          But it's still hard to tell whether it's environment variation or something software-specific.

          jin Jin Lim added a comment -

          From NS SRV team (Alk):

          wget --post-data='rebalanceMovesBeforeCompaction=256' --user=Administrator --password=asdasd http://lh:9000/internalSettings

          So that's a usual POST to the usual /internalSettings API with the field name rebalanceMovesBeforeCompaction.
          The default value is 16, and that's how many in- or out-going vbuckets will be moved per node before pausing all moves, triggering view compaction, and awaiting its completion.
          View compaction is forbidden during vbucket moves, so larger values mean less frequent view compactions. That means less time "spent" on compaction, but it also means greater disk usage by views.
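          As a back-of-the-envelope illustration (not Couchbase code), the number of forced compaction pauses per node falls roughly linearly as this setting grows; the move count below is hypothetical:

```python
def compaction_pauses(vbucket_moves_per_node, moves_before_compaction):
    """View compaction is triggered after every `moves_before_compaction`
    in- or out-going vbucket moves on a node, so a larger setting means
    fewer (but bigger) compaction pauses during a rebalance."""
    return vbucket_moves_per_node // moves_before_compaction

# Hypothetical rebalance where a node moves 512 vbuckets:
print(compaction_pauses(512, 16))   # default of 16 -> 32 pauses
print(compaction_pauses(512, 256))  # bumped to 256 -> 2 pauses
```

          The flip side, as noted above, is that fewer compactions let view files grow larger between pauses, so peak disk usage rises.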

          dipti Dipti Borkar added a comment -

          For this PCI sprint, we need to find the best default and better understand the impact of changing this parameter. Pavel to drive.

          thuan Thuan Nguyen added a comment -

          Integrated in ui-testing #11 (See http://qa.hq.northscale.net/job/ui-testing/11/)
          MB-6726: add config with rebalance_moves_before_compaction=256 (Revision 7fc2baf69f359946f978d6d59f0b7f0458805ac1)
          MB-6726: add config with tuned reb. moves before comp. (Revision 043ce1685781d5b7fa207ffc265698cd62db906d)

          Result = SUCCESS
          pavelpaulau :
          Files :

          • conf/perf/reb-vperf-60M-in-1d-256rmbc.conf

          pavelpaulau :
          Files :

          • conf/perf/reb-vperf-60M-in-1d-128rmbc.conf
          • conf/perf/reb-vperf-60M-in-1d-32rmbc.conf
          • conf/perf/reb-vperf-60M-in-1d-64rmbc.conf
          pavelpaulau Pavel Paulau added a comment -

          I've created a separate task for rebalanceMovesBeforeCompaction.

          maria Maria McDuff (Inactive) added a comment -

          Deferred out of the 2.0.2 release.
          Aliaksey and team need further investigation.

          maria Maria McDuff (Inactive) added a comment -

          Tony,

          Please refer to MB-8499. Pavel has verified the recommended number of replicators (now 64). Please verify whether you are still seeing the slowness in rebalance when indexing/compaction is happening.

          pavelpaulau Pavel Paulau added a comment -

          Please note that it's still slow: 10-15x slower than the same setup without views.

          Also, what do you mean by "number of replicators"? I verified the impact of rebalanceMovesBeforeCompaction.

          maria Maria McDuff (Inactive) added a comment -

          Re-open if this still exists in 2.5.


            People

            • Assignee: thuan Thuan Nguyen
            • Reporter: thuan Thuan Nguyen
            • Votes: 0
            • Watchers: 14
