Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
      None

      Description

      • Placeholder for adding error logging on the source XDCR cluster.
      • Will add more as we come across more use cases and scenarios.

      Replication failures due to the following reasons should be logged. This will help support and end users troubleshoot errors.
      Today most replication failures are debugged using ns_server logs; we should display the errors in the UI as well.

      Errors
      -----------------

      • Source replication cluster reference cannot be deleted if a replication is set up (another bug tracks this: MB-6843)
      • Source replication is deleted - issue a user-visible log message saying "Replication has been deleted" on the source cluster
      • Source bucket is deleted - (1) issue a warning to the user on bucket deletion that replication is in progress; (2) issue a user-visible log message saying "Bucket has been deleted, replication to remote bucket has stopped" on the source cluster; (3) show an error message on the XDCR page for the impacted replication saying "Bucket has been deleted, XDCR has stopped"
      • Source bucket is flushed - Cannot be done.

      For errors on the destination node and for replication timeouts: raise the error on the source node.

        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Fixed & approved but still sits in gerrit: http://review.couchbase.org/#/c/21459/

        it is naive, but so are 'errors' from xdcr

        junyi Junyi Xie (Inactive) added a comment -

        Nothing to do within XDCR. All ns_server work.

        junyi Junyi Xie (Inactive) added a comment -

        Hi Dipti,

        Thanks for organizing the meeting. XDCR already has an API to expose errors to ns_server. The API is in the XDCR replication manager (xdc_rep_manager:latest_errors()); when called, it returns the last 10 errors for each bucket that is actively replicating. Alk will expose these errors (or at least some of them) in the UI. Alk will also determine where to show these messages.

        If users find these error messages hard to understand, we can change them later to be more user-friendly. At this time, we just need to ask Alk to expose them in the UI. Let me know if you have any other questions.

        Thanks!

        Junyi
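
The "last 10 errors for each bucket" behavior described in the comment above can be illustrated with a small sketch. This is not the Erlang implementation of xdc_rep_manager:latest_errors(); the class, method names, and bucket names below are hypothetical, and the only detail taken from the comment is the limit of 10 errors per bucket.

```python
from collections import defaultdict, deque

MAX_ERRORS_PER_BUCKET = 10  # the comment above mentions "the last 10 errors"


class ErrorLog:
    """Hypothetical stand-in for a per-bucket store of recent replication errors."""

    def __init__(self, limit=MAX_ERRORS_PER_BUCKET):
        self._limit = limit
        # One bounded queue per bucket, created on first use.
        self._errors = defaultdict(lambda: deque(maxlen=self._limit))

    def record(self, bucket, message):
        # Appending to a full deque drops the oldest entry automatically.
        self._errors[bucket].append(message)

    def latest_errors(self):
        # Return plain lists, oldest first, keyed by bucket name.
        return {bucket: list(errs) for bucket, errs in self._errors.items()}


log = ErrorLog()
for i in range(12):
    log.record("default", f"error {i}")
log.record("bucket_b", "db_not_found")
snapshot = log.latest_errors()
```

A bounded queue keeps memory use fixed regardless of how often a replication fails, which is one plausible reason for capping the history at a small number.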

        dipti Dipti Borkar added a comment -

        Pasting a useful message from MB-5611 listing the failures we need to provide better error messages for.

        Some common XDCR failure reasons:

        1 db_not_found error: when a node is unresponsive, e.g.:
        "could not open http://Administrator:*****@10.3.3.28:8092/default%2f120%3b093c0a978eb59342ea52d87eae424bb3/"

        2 badmatch, {error,corrupted_data}, Erlang-related corruption
        [{couch_compress,decompress,1},
         {couch_doc,with_uncompressed_body,1},
         {couch_doc,to_json_base64,1},
         {xdc_vbucket_rep_worker,maybe_flush_docs,3},
         {lists,foldl,3},
         {xdc_vbucket_rep_worker,local_process_batch,5},
         {xdc_vbucket_rep_worker,queue_fetch_loop,4}]

        3 checkpoint_commit_failure
        {bad_return_value,
         {checkpoint_commit_failure,
          <<"Failure on target commit: {error,<<\"not_found\">>}">>}}

        4 http_request_failed
        xdc_replicator:handle_info:282] Worker <0.11173.72> died with reason: {http_request_failed,"POST",
        "http://10.3.121.33:8092/default%2F684/_bulk_docs",
        {error,{code,500}}}

        Replicator: couldn't write document
        xdc_replicator_worker:flush_docs:111] Replicator: couldn't write document ``, revision ``,
        to target database `http://10.3.121.33:8092/default%2F683/`. Error: ``, reason: ``.

        5 replicator_died
        {replicator_died, {'EXIT',<15849.2212.0>, {badmatch,{error,closed}}}}

        6 bulk_set_vbucket_state_failed
        General error seen when rebalance fails, possibly because the vbucket map is not ready;
        this may cause replication to fail.
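
One way to turn the raw failures listed above into messages suitable for the UI is a simple lookup on the error atom. The sketch below is an assumption for illustration only: the atoms come from the list above, but the function name and the user-facing wording are hypothetical and are not taken from ns_server or XDCR.

```python
# Error atoms from the list above, mapped to hypothetical user-facing wording.
FRIENDLY_MESSAGES = {
    "db_not_found": "The remote node is unresponsive or the target database could not be opened.",
    "corrupted_data": "A document could not be decompressed; the stored data may be corrupted.",
    "checkpoint_commit_failure": "The replication checkpoint could not be committed on the target.",
    "http_request_failed": "An HTTP request to the target cluster failed.",
    "replicator_died": "A replicator process exited unexpectedly.",
    "bulk_set_vbucket_state_failed": "Setting vbucket state failed, possibly during a rebalance.",
}

UNKNOWN = ("unknown", "Replication failed for an unrecognized reason; check the ns_server logs.")


def classify(raw_error):
    """Return (atom, friendly message) for the first known atom found in raw_error."""
    for atom, message in FRIENDLY_MESSAGES.items():
        if atom in raw_error:
            return atom, message
    return UNKNOWN


died = classify("{replicator_died, {'EXIT',<15849.2212.0>, {badmatch,{error,closed}}}}")
commit = classify('{bad_return_value, {checkpoint_commit_failure, <<"Failure on target commit">>}}')
other = classify("{some_new_error, timeout}")
```

Substring matching is crude, but it keeps the raw Erlang term intact so the original text can still be logged next to the friendly message for support.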

        abhinav Abhinav Dangeti added a comment - fyi, .. http://www.couchbase.com/issues/browse/MB-5611

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            ketaki Ketaki Gangal
          • Votes:
            0
            Watchers:
            2

              Gerrit Reviews

              There are no open Gerrit changes