Couchbase Server / MB-6934

Displaying XDCR Replication error messages/warnings.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: UI, XDCR
    • Security Level: Public
    • Labels:
      None
    • Environment:
      2.0-1856

      Description

      Hi,

      With the new error logging code, we now display the "recent 10 errors". A screenshot is added at the end of this email.

      At any point, the last 10 errors are displayed on the replication as "10 errors", and these may or may not still be valid at the current time.

      This issue needs to be addressed at two levels:
      1. Level of error logging: currently too much information is displayed, which also gives a misleading idea of the state of replication.
      2. Classification of errors vs. warnings.

      Having lower-level information in the ns_logs can help troubleshooting, but having all of that information on the web console might just confuse and overwhelm the end user, IMO.

      XDCR can have an error at any of the following levels:

      • xdc vbucket replicators - timing out, checkpoint failures, db_not_found
      • xdc replication manager
      • ns_server level - where it is unable to talk to the other remote cluster and so on.

      In some recent trials of the new code, we see a lot of errors at the level of the vbucket replicators, say vbucket XXX commit_checkpoint_failure.
      But the replication is continuing as expected. Replication has not failed; it is continuing despite the above checkpoint failure.

      It might be nicer to classify errors vs. warnings.

      Errors - When XDCR has finally stopped working. No more data is being sent over to the destination.
      Replication will be attempted X number of times and then finally given up?

      Warnings - When there are timeouts, but the situation is recoverable.
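
The proposed split could be sketched roughly as follows. This is only an illustration of the idea, not XDCR code: the names `Severity`, `ReplicationEvent`, `classify` and the retry limit `MAX_ATTEMPTS` are all hypothetical, and the rule (error only once replication has stopped and retries are exhausted) is one reading of the proposal above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # recoverable; replication is still continuing
    ERROR = "error"      # replication has stopped sending data


# Hypothetical value for the "X number of times" mentioned in the report.
MAX_ATTEMPTS = 5


@dataclass
class ReplicationEvent:
    message: str             # e.g. "vbucket 12 commit_checkpoint_failure"
    attempts: int            # how many times replication has been retried
    still_replicating: bool  # True while data is still flowing


def classify(event: ReplicationEvent) -> Severity:
    """Return ERROR only when replication has given up, else WARNING."""
    if not event.still_replicating and event.attempts >= MAX_ATTEMPTS:
        return Severity.ERROR
    return Severity.WARNING
```

Under this sketch, a checkpoint failure on a replication that is still making progress would show up as a warning rather than an error.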

      -Ketaki

      Screenshot

        Activity

        junyi Junyi Xie (Inactive) added a comment -

        fixes on gerrit

        junyi Junyi Xie (Inactive) added a comment -

        All fixes are on gerrit:

        http://review.couchbase.org/#/c/21694/
        http://review.couchbase.org/#/c/21903/
        http://review.couchbase.org/#/c/21904/2

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        A commit to filter out errors that are too old is in gerrit. I've also implemented Dipti's proposal to display the errors link in the normal color rather than red.
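
For illustration only, age-based filtering of this kind might look like the sketch below. It is not the actual commit: it assumes each error is stored with a Unix timestamp in seconds, and the function name `recent_errors` and the cutoff `MAX_AGE_SECONDS` are hypothetical, since the ticket does not say how old counts as "too old".

```python
import time

# Hypothetical cutoff; the ticket does not specify the real threshold.
MAX_AGE_SECONDS = 3600


def recent_errors(errors, now=None, max_age=MAX_AGE_SECONDS):
    """Keep only (timestamp, message) pairs no older than max_age seconds."""
    if now is None:
        now = time.time()
    return [(ts, msg) for ts, msg in errors if now - ts <= max_age]
```

Passing `now` explicitly makes the function easy to check with fixed timestamps.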

        junyi Junyi Xie (Inactive) added a comment -

        Alk,

        Change within XDCR is at

        http://review.couchbase.org/#/c/21694/

        Now the error returned to ns_server is a pair

        {Time, ErrorString}

        instead of a string.

        Please go ahead and modify UI code accordingly. Thanks.
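
As a rough sketch of what a consumer of the new format has to handle, the snippet below accepts either the old plain string or the new pair and produces a display line. It is written in Python purely for illustration and is not the ns_server or UI code; the helper names `normalize` and `render` are hypothetical, and treating `Time` as Unix epoch seconds is an assumption, since the comment does not state the time format.

```python
from datetime import datetime, timezone


def normalize(entry):
    """Accept a legacy plain string or a (time, message) pair."""
    if isinstance(entry, str):
        return (None, entry)  # legacy format: no timestamp available
    ts, msg = entry
    return (ts, msg)


def render(entry):
    """Format one entry as a single display line."""
    ts, msg = normalize(entry)
    if ts is None:
        return msg
    # Assumption: ts is Unix epoch seconds; shown here in UTC.
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} {msg}"
```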

        ketaki Ketaki Gangal added a comment -

        Comments from Product Mgmt
        Hi Junyi,

        Is there a log level for the XDCR error messages?
        Are the last 10 errors the only errors tracked?
        Do these include info and warning messages or only errors in this list?
        Do we clean up this error log periodically? (there is no way for ns_server to know if the error is relevant any more)

        Aliaksey, as we discussed, at a minimum we need to change the "10 errors" link, which appears the first time this message buffer gets populated, to a link in aqua blue (like the IP address in the cluster reference) that says "Recent XDCR log messages".

        Junyi, if you can provide more visibility from the replicator side about warnings vs. errors vs. info messages, we can do something better, if not in 2.0 then sometime in the future. But this basic level of error handling doesn't give users enough visibility into what is going on.


          People

          • Assignee:
            junyi Junyi Xie (Inactive)
            Reporter:
            ketaki Ketaki Gangal