Couchbase Server / MB-7746

failover alert for cases where users might lose data should be more quantifiable

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0, 2.0.1, 2.1.0
    • Fix Version/s: feature-backlog
    • Component/s: ns_server
    • Security Level: Public
    • Labels:

      Description

      A user recently filed a support ticket about this alert message (http://grab.by/jPZm), which they got when clicking the failover button.

      This alert should, if possible, be more accurate about how much data we might lose.
      The user had no replication issues before the node failed and no issues after failing over the node, but this alert created anxiety and concern.

      We might want to at least show the replica and active item counts. The more challenging part is knowing how many updates/deletes will be lost in this case.

      Assigning to Dipti for prioritization.


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        The current code takes ep-engine's items_to_replicate stat and the replication rate and estimates replication lag in seconds. It then smooths this metric.

        It then roughly labels failover safeness as green if the replication lag is less than 2 seconds; otherwise it is yellow.

        Additionally, if any TAP replication from the given node is missing (not started, or recently failed), failover safeness is red.
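
        The labeling described above can be sketched roughly as follows. This is an illustration in Python, not the ns_server code: the names LagEstimator and classify are hypothetical, and the smoothing method (an exponential moving average), its factor, and the cap on the raw lag are assumptions, since the comment does not say how the metric is smoothed.

```python
from dataclasses import dataclass
from typing import Optional

GREEN_LAG_SECONDS = 2.0   # threshold mentioned in the comment above
MAX_LAG_SECONDS = 3600.0  # assumed cap so a stalled rate cannot poison the average


@dataclass
class LagEstimator:
    """Smoothed replication-lag estimate (hypothetical helper)."""
    alpha: float = 0.5                 # assumed smoothing factor
    smoothed: Optional[float] = None   # last smoothed lag in seconds

    def update(self, items_to_replicate: int, rate_items_per_sec: float) -> float:
        # Raw lag: seconds needed to drain the backlog at the current rate.
        if items_to_replicate <= 0:
            raw = 0.0
        elif rate_items_per_sec <= 0:
            raw = MAX_LAG_SECONDS
        else:
            raw = min(items_to_replicate / rate_items_per_sec, MAX_LAG_SECONDS)
        # Exponential moving average; the first sample seeds the estimate.
        if self.smoothed is None:
            self.smoothed = raw
        else:
            self.smoothed = self.alpha * raw + (1 - self.alpha) * self.smoothed
        return self.smoothed


def classify(lag_seconds: float, all_replications_running: bool) -> str:
    """Map a smoothed lag and replication health to a safeness label."""
    if not all_replications_running:
        return "red"      # a replication stream is missing or recently failed
    if lag_seconds < GREEN_LAG_SECONDS:
        return "green"
    return "yellow"
```

        For example, feeding the estimator a backlog of 100 items at 100 items/s gives a lag of about one second, which classify labels green as long as every replication stream from the node is running.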

        dipti Dipti Borkar added a comment -

        Alk, do we have enough information to add to this error message? This is a candidate for 2.0.2.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We can add a somewhat imprecise estimate of how many items are unreplicated and how much time we believe replicating them needs.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        After discussion with Dipti, the idea is to display the replication lag as a time estimate, plus possibly the replication lag as a percentage of items, to help users decide.
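
        One possible shape for such a message is sketched below in Python. The function name describe_lag, its parameters, and the wording of the text are all hypothetical; the percentage is taken relative to the active item count, which is an assumption about what "% of items" means here.

```python
def describe_lag(unreplicated_items: int, active_items: int, lag_seconds: float) -> str:
    """Build a human-readable summary of replication lag for a failover alert."""
    # Guard against an empty bucket to avoid dividing by zero.
    if active_items > 0:
        pct = 100.0 * unreplicated_items / active_items
    else:
        pct = 0.0
    return (
        f"About {unreplicated_items} of {active_items} items ({pct:.1f}%) "
        f"are not yet replicated; estimated replication lag is {lag_seconds:.1f} s."
    )
```

        Both numbers are estimates, so wording such as "about" keeps the message from implying more precision than the underlying stats provide.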

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We should also point the user to some metric in the UI (which, AFAIK, still needs to be added).


          People

          • Assignee: Anil Kumar (anil)
          • Reporter: Sharon Barr (sharon) (Inactive)
          • Votes: 0
          • Watchers: 5
