Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
      None

      Description

      • Placeholder for adding error logging on the source XDCR cluster.
      • Will add more as we come across more use cases/scenarios.

      Replication failures due to the following reasons should be logged. This will be helpful both for support troubleshooting and for end users.
      • Today most replication failures are debugged using ns_server logs; we should surface these errors in the UI as well.

      Errors
      -----------------

      • Source replication cluster reference cannot be deleted while a replication is set up (tracked separately in MB-6843)
      • Source replication is deleted - issue a user-visible log message on the source cluster saying "Replication has been deleted"
      • Source bucket is deleted - (1) warn the user on bucket deletion that replication is in progress; (2) issue a user-visible log message on the source cluster saying "Bucket has been deleted, replication to remote bucket has stopped"; (3) show an error on the XDCR page for the affected replication saying "Bucket has been deleted, XDCR has stopped"
      • Source bucket is flushed - cannot be done.

      For errors on the destination node, or when replication times out: raise the error on the source node.
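A minimal sketch of the source-side message mapping requested above (the dictionary, function, and condition keys are hypothetical illustrations, not the actual ns_server code):

```python
# Hypothetical sketch: map source-side XDCR failure conditions to the
# user-visible messages requested in this issue. Names are illustrative.
SOURCE_XDCR_MESSAGES = {
    "replication_deleted": "Replication has been deleted",
    "source_bucket_deleted": (
        "Bucket has been deleted, replication to remote bucket has stopped"
    ),
}

def user_visible_message(condition: str) -> str:
    """Return the message to log on the source cluster for a known
    failure condition, or a generic fallback otherwise."""
    return SOURCE_XDCR_MESSAGES.get(condition, "XDCR replication error")
```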


        Activity

        ketaki Ketaki Gangal created issue -
        abhinav Abhinav Dangeti added a comment - FYI: http://www.couchbase.com/issues/browse/MB-5611
        dipti Dipti Borkar added a comment -

        Pasting, from MB-5611, the list of failures we need to provide better errors for:

        Some common XDCR failure reasons:

        1. db_not_found error: when a node is unresponsive, e.g.:
        "could not open http://Administrator:*****@10.3.3.28:8092/default%2f120%3b093c0a978eb59342ea52d87eae424bb3/"

        2. badmatch, {error,corrupted_data}: Erlang-related data corruption:

        [{couch_compress,decompress,1},
         {couch_doc,with_uncompressed_body,1},
         {couch_doc,to_json_base64,1},
         {xdc_vbucket_rep_worker,maybe_flush_docs,3},
         {lists,foldl,3},
         {xdc_vbucket_rep_worker,local_process_batch,5},
         {xdc_vbucket_rep_worker,queue_fetch_loop,4}]

        3. checkpoint_commit_failure:

        {bad_return_value,
         {checkpoint_commit_failure,
          <<"Failure on target commit: {error,<<\"not_found\">>}">>}}

        4. http_request_failed:

        xdc_replicator:handle_info:282] Worker <0.11173.72> died with reason:
        {http_request_failed,"POST",
         "http://10.3.121.33:8092/default%2F684/_bulk_docs",
         {error,{code,500}}}

        Replicator: couldn't write document:

        xdc_replicator_worker:flush_docs:111] Replicator: couldn't write document ``, revision ``, to target database `http://10.3.121.33:8092/default%2F683/`. Error: ``, reason: ``.

        5. replicator_died:

        {replicator_died, {'EXIT',<15849.2212.0>, {badmatch,{error,closed}}}}

        6. bulk_set_vbucket_state_failed: a general error seen when rebalance fails, possibly because the vbucket map is not ready, which may cause replication to fail.
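A minimal sketch of how raw error lines like the ones above could be bucketed into these six categories for display. The matching rules (substring match on the category token) are an assumption for illustration, not the shipped logic:

```python
# Hypothetical sketch: classify a raw XDCR error line into one of the
# six failure categories listed above by matching on the category token.
CATEGORIES = [
    "db_not_found",
    "corrupted_data",
    "checkpoint_commit_failure",
    "http_request_failed",
    "replicator_died",
    "bulk_set_vbucket_state_failed",
]

def classify_xdcr_error(raw: str) -> str:
    """Return the first category whose token appears in the raw error
    line, or "unknown" if none match."""
    for category in CATEGORIES:
        if category in raw:
            return category
    return "unknown"
```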

        ketaki Ketaki Gangal made changes -
        Field Original Value New Value
        Priority Major [ 3 ] Critical [ 2 ]
        dipti Dipti Borkar made changes -
        Priority Critical [ 2 ] Blocker [ 1 ]
        dipti Dipti Borkar made changes -
        Description * Placeholder for adding error logging on the source XDCR cluster.
        - Will add more as we come across more use cases/scenarios.

        Replication failures due to the following reasons should be logged. This will be helpful both for support troubleshooting and for end users.
        * Today most replication failures are debugged using ns_server logs; we should surface these errors in the UI as well.

        Errors
        -----------------
        - Delete bucket on destination - display error "Replication failed due to no bucket to replicate to on the destination"
        - Flush bucket on destination - display error "Replication failed due to flush executed on the destination"
        - Destination cluster is down/unreachable - display error "Replication failed, unable to reach destination"
        - Timeouts while trying to reach the destination cluster - this is the most common case and can occur under most conditions.
        Are there different levels of timeouts, and can we classify errors based on these? That would be very useful when debugging failures.

        - Source replication reference is deleted - Replication failed, missing replication reference?
        - Source replication is deleted - Replication failed, missing replication link
        - Source bucket is deleted - Replication failed, missing source bucket
        - Source bucket is flushed - error/warning for "Replication failed, flush on source; recreate replication"
        - Source cluster is down/


        Warnings/Errors
        ------------------------
        - Replication attempt on a mixed 1.8.x and 2.0 cluster
        - Duplicate replication attempt between the same bucket/cluster sets.
        - Warning if we see unusually high [some percentage] gets, or higher conflicts on the destination - this would indicate something is wrong with the replication.

        * Placeholder for adding error logging on the source XDCR cluster.
        - Will add more as we come across more use cases/scenarios.

        Replication failures due to the following reasons should be logged. This will be helpful both for support troubleshooting and for end users.
        * Today most replication failures are debugged using ns_server logs; we should surface these errors in the UI as well.

        Errors
        -----------------
        - Source replication cluster reference cannot be deleted while a replication is set up (tracked separately in MB-6843)
        - Source replication is deleted - issue a user-visible log message on the source cluster saying "Replication has been deleted"
        - Source bucket is deleted - (1) warn the user on bucket deletion that replication is in progress; (2) issue a user-visible log message on the source cluster saying "Bucket has been deleted, replication to remote bucket has stopped"; (3) show an error on the XDCR page for the affected replication saying "Bucket has been deleted, XDCR has stopped"
        - Source bucket is flushed - cannot be done.


        For errors on the destination node, or when replication times out: raise the error on the source node.


        junyi Junyi Xie (Inactive) added a comment -

        Hi Dipti,

        Thanks for organizing the meeting. XDCR already has an API to expose errors to ns_server: xdc_rep_manager:latest_errors(), in the XDCR replication manager. When called, it returns the last 10 errors for each bucket that is actively replicating. Alk will expose these errors (or at least some of them) on the UI, and will also determine where to surface them.

        If users find these error messages hard to understand, we can make them more user-friendly later. For now, we just need to ask Alk to expose them in the UI. Let me know if you have any other questions.

        Thanks!

        Junyi
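The per-bucket "last 10 errors" behavior Junyi describes can be sketched as follows. This is an illustrative Python analogue of the described semantics, not the Erlang code inside xdc_rep_manager:

```python
from collections import defaultdict, deque

# Hypothetical sketch of the latest_errors() behavior described above:
# keep only the last 10 errors per bucket; older entries are dropped
# automatically by the bounded deque.
MAX_ERRORS_PER_BUCKET = 10
_errors = defaultdict(lambda: deque(maxlen=MAX_ERRORS_PER_BUCKET))

def record_error(bucket: str, message: str) -> None:
    """Append an error for a bucket, evicting the oldest past the cap."""
    _errors[bucket].append(message)

def latest_errors() -> dict:
    """Return the retained errors per bucket, analogous to what
    xdc_rep_manager:latest_errors() is described to return."""
    return {bucket: list(msgs) for bucket, msgs in _errors.items()}
```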

        junyi Junyi Xie (Inactive) added a comment -

        Nothing to do within XDCR. All ns_server work.

        junyi Junyi Xie (Inactive) made changes -
        Assignee Junyi Xie [ junyi ] Aleksey Kondratenko [ alkondratenko ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Fixed and approved, but still sitting in Gerrit: http://review.couchbase.org/#/c/21459/

        It is naive, but so are the 'errors' from XDCR.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            ketaki Ketaki Gangal
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes