  1. Couchbase Documentation
  2. DOC-92

Docs: Document "administrative task" of regular, planned server maintenance

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.x
    • Fix Version/s: 4.1
    • Component/s: admin
    • Labels:

      Description

      From a customer request:
      We are obliged to follow a regular OS patching schedule for all our servers and have a maintenance window every Friday night.
      How would you recommend we deal with our Couchbase clusters for patching?

      From reading the Couchbase 2.0 Manual it looks like we have two options, one being a failover, and the other removing the node then re-adding it.
      What steps would you recommend we do when taking a node out to do maintenance on it? We plan to do this during our regular maintenance window when load on the servers would be really light.

      And the answer:
      Our best practice would be a graceful remove and rebalance so I would recommend that first. If you find it takes too long, you could do a failover. The danger with that is that some data would not be replicated and so an unexpected failure during that time would introduce a situation where you need to manually recover data. The graceful remove doesn't introduce that.

      Given that these are VMs, it would actually be best to spin up one or more new nodes and swap them into the cluster; that way you never reduce capacity.
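
      The trade-offs in the answer above can be summarized in a small model. This is only an illustrative sketch: the strategy names, the `Outcome` type, and the `maintenance_outcome` function are hypothetical and are not part of any Couchbase tool or API. It encodes just the three claims made above: remove-and-rebalance keeps data replicated, failover may leave some data without a replica, and swapping in a new node never reduces capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of what the cluster looks like while one node is out."""
    min_nodes_serving: int   # fewest nodes serving data at any point
    replicas_at_risk: bool   # True if some data may be unreplicated meanwhile


def maintenance_outcome(strategy: str, cluster_size: int) -> Outcome:
    """Model taking one node out of a cluster for patching (illustrative only)."""
    if cluster_size < 2:
        raise ValueError("need at least two nodes to take one out")
    if strategy == "remove_rebalance":
        # Data is moved off the node before it leaves, so replication is
        # preserved, but the cluster runs one node short until it is re-added.
        return Outcome(cluster_size - 1, replicas_at_risk=False)
    if strategy == "failover":
        # Quicker, but some data may not be replicated until the node returns.
        return Outcome(cluster_size - 1, replicas_at_risk=True)
    if strategy == "swap":
        # A replacement node joins as the old one leaves: no capacity loss.
        return Outcome(cluster_size, replicas_at_risk=False)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("remove_rebalance", "failover", "swap"):
        print(name, maintenance_outcome(name, cluster_size=4))
```

      For a four-node cluster, only the "swap" strategy keeps all four nodes serving, and only "failover" flags a replication risk, which matches the order of preference given in the answer.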


        Activity

        marija Marija Jovanovic added a comment -

        Resolved the issue by entering the proposed solution in the document under a new section "Planned server maintenance"
        "Planned server maintenance
        Customers who are obligated to follow a regular OS patching schedule with scheduled maintenance windows have two options: fail over the server node, or remove the node and re-add it.
        In addition to scheduling such maintenance during periods when the server load is light, the best practice is to perform a graceful removal and rebalance.
        If this takes too long, you could do a failover. The danger with this solution is that some data might not be replicated, so an unexpected failure during that time would leave you needing to recover data manually. The graceful removal doesn't introduce such problems.
        If the nodes are VMs, the best solution is to spin up one or more new nodes and swap them into the cluster, so that capacity is never reduced."
        If this is satisfactory, the bug will be closed after the document is published on the web.

        perry Perry Krug added a comment - - edited

        Thanks Marija, I think this is a good start but could use a little bit more.

        With 3.0, we introduced "graceful failover" and "delta node recovery", which are probably now our best practice for this, but that will need some confirmation from PM. So that is net-new content that needs to get added.

        More general improvements could be:
        - Providing some examples of when customers will need to think about this would be helpful, e.g., security patching. Also link to or refer to other situations that follow similar practices, such as OS upgrades and Couchbase Server upgrades.
        - This would be even better if we included links to the relevant sections on rebalance, failover, swap rebalance, etc., and provided high-level step-by-step instructions for how to accomplish it, accompanied by links on best practices and what to monitor while doing so.
        - We'll want to call out the differences between physical, VM, and cloud environments in terms of how easy (or not) it is to spin up new instances.
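
        As a rough illustration of what such high-level step-by-step instructions might look like for the graceful failover and delta node recovery approach, here is a sketch that generates a rolling plan, one node at a time. The function name `rolling_patch_plan` and the node names are hypothetical, the step strings are descriptive labels rather than real CLI commands, and the sequence itself is an assumption that still needs the PM confirmation mentioned above.

```python
from typing import List


def rolling_patch_plan(nodes: List[str]) -> List[str]:
    """Return an ordered, human-readable plan for patching nodes one at a time.

    Assumed sequence per node (not confirmed by the ticket): gracefully fail
    the node over, patch it, add it back using delta recovery, then rebalance
    before moving on to the next node.
    """
    plan: List[str] = []
    for node in nodes:
        plan.append(f"graceful failover: {node}")
        plan.append(f"patch OS and restart: {node}")
        plan.append(f"add back with delta recovery: {node}")
        # Finish the rebalance before touching the next node so that only
        # one node is ever out of the cluster at a time.
        plan.append(f"rebalance and wait for completion: {node}")
    return plan


if __name__ == "__main__":
    for step_number, step in enumerate(rolling_patch_plan(["node-a", "node-b"]), 1):
        print(f"{step_number}. {step}")
```

        Handling one node per iteration keeps the reduction in capacity to a single node at any time; in environments where new instances are cheap to create, a swap-based plan could replace the failover and recovery steps.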

        perry Perry Krug added a comment -

        Reopening as per my feedback in the last comment

        marija Marija Jovanovic added a comment -

        Kirk was also expected to give use cases for this issue

        don Don Pinto added a comment - - edited

        Anil Kumar – Can you please help to triage this / close if not still an issue?


          People

          • Assignee: Anil Kumar
          • Reporter: Perry Krug
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
