Publish autoscaling best practices and recommended metrics

Description

This ticket tracks documentation specifically related to the QE-tested metrics and best practices for autoscaling (that is, defining the set of metrics we recommend scaling on). The main documentation for autoscaling (e.g. how-tos, concepts, reference) is handled in .

Documentation Plan

Introduction. Guidelines and Best Practices

  • -modify - document any exceptions to existing best practices that don’t apply when auto-scaling is enabled (e.g. server groups, anti-affinity, etc)-

(NEW PAGE) Learn. Couchbase Cluster Concepts. Auto-scaling Best Practices

  • -New page - will include individual sections covering each service and the tested scaling metrics.-

    • Sections:

    • Introduction

    • Data Service

    • Index Service

    • Query Service

 – 

Tommie is currently producing a table of data that includes the thresholds/settings for each test scenario, along with the test results for a select number of metrics. Tommie also presented a number of graphs showing the raw test results.

There seemed to be a consensus that once Tommie finalizes the test results for the Data Service, he should provide the following:

  1. The finished table of test scenarios and selected results

  2. A final set of graphs, each having annotated labels along the X-axis describing what/when relevant events occurred in the cluster (e.g. workload generated, rebalance start/stop, compaction start/stop, HPA window start/end, etc.)

  3. An opinionated statement describing the best practices that can be drawn from the test scenarios and graphs, along with any relevant caveats or suggestions that a customer might use to extrapolate the results for their own cluster configurations and workloads. For example: “A larger average document size than those tested may cause longer rebalance times, which may require reducing the scaling threshold for X metric.”

With the above, my hope is that we can create a best practices guide that presents a curated approach to the data – one where we try to only show the necessary graphs and data points to effectively justify our recommendations, rather than presenting a report full of raw data analysis.

Tommie also noted that some of the test settings he is using are best estimates and may not reflect real-world customer scenarios. For example, the tests assumed something like a 30% write rate. Tommie noted that it would be good to get early feedback from a wide audience to try to elicit opinions on whether the test scenarios and settings we are using accurately reflect what we’ve observed in customer environments. This might be a good incentive for us to quickly finish the best practices guide for the Data Service so that we can start passing it around internally within the company to get early feedback on both the data and the design of the guide.

Draft Documentation

Learn. Couchbase Cluster Concepts. Couchbase Cluster Auto-scaling. Auto-scaling Best Practices

  • New page documenting best practices and recommendations for Couchbase cluster auto-scaling

 

Release Notes Description

None

Activity

Roo Thorp June 1, 2021 at 3:41 PM

This went through extensive review and looks good to me, so I'm happy to sign off on it and close this ticket.

Eric Schneider June 1, 2021 at 7:56 AM

Draft documentation is available on staging: https://docs-staging.couchbase.com/operator/2.2/concept-couchbase-autoscaling-best-practices.html

 

Assigning to for further QE review.

Matt Ingenthron March 15, 2021 at 9:10 PM

Request from , can we make sure that we document a bit how a user would schedule scaling in conjunction with Operator-based autoscaling? Specifically, if the user were to externally scale a cluster based on workload planning, how should this be considered in autoscaling? Quick answer is that the autoscaling thresholds would need to be set so that they avoid scaling the cluster back down, and this should be covered in the best practices.

Done
Details

Created February 17, 2021 at 6:50 PM
Updated July 21, 2021 at 4:43 PM
Resolved June 1, 2021 at 7:56 AM