Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 3.0-Beta, 3.0
    • Component/s: UI, XDCR
    • Security Level: Public
    • Labels:
      None

      Description

      After seeing XDCR in action, I would like to propose a few enhancements:

      -Put certain statistics in the XDCR screen as well as on the graph page:
      -Percentage complete/caught up. While backfilling replication this would describe the number of items already sent to the remote side out of the total in the bucket. Once running, it would show whether there is a significant amount of backup in the queue
      -Items per second to see speed of each stream and in total
      -Bandwidth in use. According to a customer, the most important thing with XDCR is going to be the (possibly cross-country) internet bandwidth, and they will need to monitor it for each replication stream and in total

      -On the graph page of outgoing, I would recommend removing "mutations checked", "mutations replicated", "data replication", "active vb reps", "waiting vb reps", "secs in replicating", "secs in checkpointing", "checkpoints issued" and "checkpoints failed". These stats really aren't useful from the perspective of someone trying to monitor or troubleshoot the current state of their cluster.
      -On the graph page of outbound, there's a bit of confusion over the difference between "mutations to replicate", "mutations in queue" and "queue size". Unless they are showing significantly (and usefully) different metrics, I recommend removing all but one
      -On the graph page of incoming, I recommend putting "total ops/sec" on the far left to line up with the "ops/sec" in the summary section
      -"XDCR dest ops per sec" is confusing because this cluster is the "destination" yet the stat implies the other way around. Recommend "Incoming XDCR ops per sec"
      -"XDCR docs to replicate" is a little confusing because it doesn't match the name of the same stat in the "outbound" section. I recommend changing "mutations to replicate" to "XDCR docs to replicate"
      -It would also be good to see outbound ops/sec in the summary section alongside the number remaining to replicate
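
      The three proposed statistics (percentage complete, items per second, bandwidth in use) could be derived from cumulative per-stream counters roughly as follows. This is only a sketch: the `Sample` fields and function names are hypothetical and do not correspond to any existing Couchbase stat name or API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    """One poll of a replication stream (all field names are hypothetical)."""
    timestamp: float   # seconds
    items_sent: int    # cumulative items sent to the remote side
    bytes_sent: int    # cumulative bytes sent to the remote side
    items_total: int   # total items in the source bucket


def percent_complete(s: Sample) -> float:
    """Items already sent out of the total in the bucket, capped at 100."""
    if s.items_total == 0:
        return 100.0
    return min(100.0, 100.0 * s.items_sent / s.items_total)


def rates(prev: Sample, curr: Sample) -> Tuple[float, float]:
    """Return (items per second, bytes per second) between two polls."""
    dt = curr.timestamp - prev.timestamp
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return ((curr.items_sent - prev.items_sent) / dt,
            (curr.bytes_sent - prev.bytes_sent) / dt)
```

      The "in total" figures would then be the sum of the per-stream rates.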


          Activity

          perry Perry Krug added a comment -

          Just adding the comment from 9218:

          Putting "outbound XDCR mutations" on one side and "incoming XDCR mutations" on the other side makes the two seem very related. Perhaps "outbound XDCR mutations" should be "XDCR backlog" to make it clearer that it is not a rate and should not match the number on the other side.

          perry Perry Krug added a comment - - edited

          To summarize the conversation and provide next steps:

          My primary goal here is to provide meaningful and "actionable" statistics to our customers. I recognize that there may be various other stats that are useful for testing and development, but not necessarily for the end customer. The determining factor in my mind is whether we can explain "what to do" when a particular number is high or low. If we do not have that, then I suggest the statistic does not need to be displayed in the UI. Much the same way we do not expose the 300 statistics available with cbstats, I think the same logic should be applied here.

          So...my requests are:
          -Change "outbound XDCR mutations" to "XDCR backlog" to indicate that this is the number of mutations within the source cluster that have not yet been replicated to the destination. This stat is shown both in the "summary" as well as the per-stream "outbound xdcr operations" sections
          -Change "mutations replicated optimistically" from an incrementing counter to a "per second" rate
          -Remove from "outbound xdcr operations" sections:
          -mutations checked*
          -mutations replicated*
          -data replicated*
          -active vb reps±
          -waiting vb reps±
          -secs in replicating*
          -secs in checkpointing*
          -checkpoints issued±
          -checkpoints failed±
          -mutations in queue~
          -XDCR queue size~
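
          As an illustration of the counter-to-rate change requested above for "mutations replicated optimistically", an incrementing counter can be turned into a "per second" rate by differencing consecutive samples. This is a minimal sketch, not the actual UI or server code; the function name and the (timestamp, counter) input shape are assumptions.

```python
def counter_to_rate(samples):
    """Turn (timestamp, counter) pairs into (timestamp, per-second rate) pairs."""
    out = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0
        if delta < 0:
            # Counter went backwards: assume it was reset and restarted from zero.
            delta = c1
        out.append((t1, delta / dt))
    return out
```

          For example, samples of 1000, 1040 and 1100 taken one second apart yield rates of 40 and 60 per second, which stay readable no matter how long the replication has been running.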

          To provide some more explanation:
          (*) - These stats are constantly incrementing and therefore, after weeks or months, are not useful for describing any behavior or problem
          (±) - These stats are internal implementation details, and also do not signal to the user that they should take specific action
          (~) - These stats are "bounded parameters". Therefore they should never be higher than what the parameter is set to. Even if they are higher or lower, we don't have a recommendation on "what to do" back to the customer
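
          The removal list and its markers can be summarized as a small lookup, sketched below. It reads `*` as the marker for constantly incrementing counters; the dictionary and function names are hypothetical and only illustrate how the UI could hide these stats while other consumers still see them.

```python
# Stat name -> marker used in the removal list above.
REMOVAL_CANDIDATES = {
    "mutations checked": "*",
    "mutations replicated": "*",
    "data replicated": "*",
    "active vb reps": "±",
    "waiting vb reps": "±",
    "secs in replicating": "*",
    "secs in checkpointing": "*",
    "checkpoints issued": "±",
    "checkpoints failed": "±",
    "mutations in queue": "~",
    "XDCR queue size": "~",
}

# Marker -> reason the stat is not considered actionable in the UI.
REASONS = {
    "*": "constantly incrementing counter",
    "±": "internal implementation detail",
    "~": "bounded parameter",
}


def ui_stats(all_stats):
    """Keep only the stats that are not on the removal list, preserving order."""
    return [name for name in all_stats if name not in REMOVAL_CANDIDATES]


def removal_reason(name):
    """Return why a stat is proposed for removal, or None if it stays."""
    marker = REMOVAL_CANDIDATES.get(name)
    return REASONS[marker] if marker is not None else None
```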

          The stats I am suggesting to remove should still be available via the REST API, but I think they are not as useful in the UI. In the field, we sometimes need to explain not only what each stat means, but "what to do" based upon the value of these statistics. I don't feel that these statistics represent something the customer needs to be concerned about or act on.

          cihan Cihan Biyikoglu (Inactive) added a comment -

          We will consider the feedback, but UPR work has priority and we are the long pole for the release. Moving to backlog and assigning to myself.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment - - edited

          [Minor edit on ordering...new layout below]

          Discussed the stats layout with Anil and Perry yesterday. Below is Anil's capture of the result of that conversation. I may add some more stats to this, however (I'm thinking about %utilization, which might be quite useful and easily doable).

          Hi Alk,

          Here is what we discussed on XDCR stats -
          First row
          Outbound XDCR mutations
          Percent completed
          Active vb reps
          Waiting vb reps

          Second row
          Mutation replication rate
          Data replication rate
          Mutation replicated optimistically rate
          Mutations checked rate

          Third row
          Meta ops latency
          Doc ops latency
          New stats
          New stats

          Thanks!

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          http://review.couchbase.org/#/c/40094/

          This will need some more naming/placement advice, but at least it works nicely now.


            People

            • Assignee:
              alkondratenko Aleksey Kondratenko (Inactive)
              Reporter:
              perry Perry Krug