Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta-2, 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: tools
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Linux
    • Sprint:
      PCI Team - Sprint 4, PCI Team - Sprint 8

      Description

      I'm trying to back up my cluster via cbbackup. I start the backup via
      /opt/couchbase/bin/cbbackup couchbase://Administrator:password@mymachine:8091 /backups/couchbase_backup_test
      The backup appears to work fine and a progress bar appears, but then exceeds 100% progress and never stops! Example:
      ^Cinterrupted.###############################] 210.8% (26141950/12403784 msgs)
      (I ctrl-C'd to kill it after 200% because it seems that this can't possibly work. Note that both the percent and the number of messages are off). I only have ~12 million items in the bucket, but it went right past that limit when backing up.
      Help?


        Activity

        mikew Mike Wiederhold created issue -
        mikew Mike Wiederhold made changes -
        Field Original Value New Value
        Assignee Mike Wiederhold [ mikew ]
        steve Steve Yen made changes -
        Assignee Mike Wiederhold [ mikew ] Steve Yen [ steve ]
        steve Steve Yen made changes -
        Labels 2.0-release-notes
        steve Steve Yen made changes -
        Fix Version/s 2.0.2 [ 10418 ]
        Fix Version/s 2.0 [ 10114 ]
        Affects Version/s 2.0-beta-2 [ 10385 ]
        steve Steve Yen added a comment -

        There are a couple of things going on here that probably need documentation...

        • cbbackup first contacts the server to get the # of items. But if the cluster is changing (there are item mutations), that # of items will just be an estimate.
        • Then cbbackup uses the TAP protocol to perform the backup. But under some conditions (not all item values are resident in memory), the TAP protocol might actually send duplicate messages. That's why cbbackup reports "msgs" for progress instead of "items" in its numerator, but uses "items" in its denominator. That can lead to >100% in some cases.

        Whether it leads to >200% is somewhat unexpected, but it depends on the situation and what couchbase server is doing in generating the TAP stream.
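
The mismatch between numerator and denominator described above can be illustrated with a minimal Python sketch. The function name `progress_percent` is hypothetical and is not taken from the cbbackup source; it only shows how dividing a message count by an estimated item count produces a figure above 100%.

```python
def progress_percent(msgs_sent: int, estimated_items: int) -> float:
    """Progress as a percentage of the *estimated* item count.

    msgs_sent counts TAP messages (which may include duplicates),
    while estimated_items is the item count sampled before the backup.
    """
    if estimated_items <= 0:
        return 0.0
    return 100.0 * msgs_sent / estimated_items


# Figures taken from the progress bar quoted in the description.
print(f"{progress_percent(26141950, 12403784):.1f}%")  # 210.8%
```

Because nothing caps the numerator at the denominator, the percentage keeps growing for as long as messages keep arriving.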

        steve Steve Yen added a comment -

        The 2.0.2 filter isn't correct at the moment. Putting this into 2.0.1 for the moment as it'll be revisited again.

        steve Steve Yen made changes -
        Fix Version/s 2.0.1 [ 10399 ]
        Fix Version/s 2.0.2 [ 10418 ]
        dipti Dipti Borkar made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        MichaelL Michael L added a comment -

        (I am the original poster)

        Changing the line above to use an IP address rather than hostname seems to have fixed the problem. My backups now run to 100% and then complete as expected.

        As for the root cause: I don't believe it has anything to do with the cluster changing, since I first encountered this when trying to back up an essentially idle cluster.

        dipti Dipti Borkar made changes -
        Priority Critical [ 2 ] Blocker [ 1 ]
        steve Steve Yen made changes -
        Assignee Steve Yen [ steve ] Bin Cui [ bcui ]
        bcui Bin Cui added a comment -

        Verified on a multi-node cluster that cbtransfer gets the total item count correctly, and failed to reproduce the bug on an idle cluster.

        mikew Mike Wiederhold added a comment -

        I asked the user on the forums for more information to reproduce this issue. I will post the information here if and when he responds.

        farshid Farshid Ghods (Inactive) added a comment -

        Deferring to 2.1 per bug scrub meeting (Dipti & Farshid, December 7th).

        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.1 [ 10414 ]
        Fix Version/s 2.0.1 [ 10399 ]
        pauluswaulus Paul Janssen added a comment -

        I have the same problem.
        Observed progress over 32000%.
        The expected msgs to save is far less than the actual msgs saved.
        Restore will load the same number of msgs as were actually saved.
        This will impact backup and restore time.
        This will impact disk space.

        pauluswaulus Paul Janssen added a comment -

        Version info: 2.0.0 community edition (build-1723)

        pauluswaulus Paul Janssen added a comment - - edited

        Using an IP address (local, external) or hostname (localhost) does not make any difference; the issue remains.

        dipti Dipti Borkar made changes -
        Fix Version/s 2.0.2 [ 10418 ]
        Fix Version/s 2.1 [ 10414 ]
        maria Maria McDuff (Inactive) added a comment -

        bug scrub: Bin – have you had a chance to take a look? pls update.
        thanks.

        bcui Bin Cui added a comment -

        Cannot reproduce it in house.

        anil Anil Kumar made changes -
        Sprint PCI Team - Sprint 4 [ 7 ]
        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Rank Ranked higher
        maria Maria McDuff (Inactive) made changes -
        Assignee Bin Cui [ bcui ] Abhinav Dangeti [ abhinav ]
        maria Maria McDuff (Inactive) added a comment -

        per bug scrub: abhinav – can you please repro in latest 2.0.2 build? thanks.

        abhinav Abhinav Dangeti added a comment -

        Cannot reproduce on 2.0.2-749-rel.

        • 3 nodes, 2 buckets
          [root@orange-11601 ~]# /opt/couchbase/bin/cbbackup couchbase://Administrator:password@localhost:8091 ~/backup
          ################### 100.0% (8557766/8558766 msgs)
          bucket: default, msgs transferred...
          : total | last | per sec
          batch : 10657 | 10657 | 14.7
          byte : 1131794345 | 1131794345 | 1562011.5
          msg : 8558766 | 8558766 | 11812.1
          ################### 100.0% (2024739/2024739 msgs)
          bucket: saslbucket, msgs transferred...
          : total | last | per sec
          batch : 14775 | 14775 | 86.0
          byte : 1367390279 | 1367390279 | 7955075.3
          msg : 10583505 | 10583505 | 61571.7
          done
        abhinav Abhinav Dangeti made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        maria Maria McDuff (Inactive) added a comment -

        not reproducible.

        maria Maria McDuff (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        bala.sharma Bala Sharma (Inactive) made changes -
        Resolution Fixed [ 1 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        Assignee Abhinav Dangeti [ abhinav ] Steve Yen [ steve ]
        bcui Bin Cui added a comment -

        First, the error itself is harmless. The tool tried to transfer design docs and the source cluster doesn't contain any. Since 2.0.2, customers can specify the --data-only option for the cbtransfer/cbbackup/cbrestore tools.

        But we still don't know the root cause of such a big difference between the initial msgs to be sent and the final msgs that are transferred.

        bcui Bin Cui added a comment - - edited

        One possible explanation for the deviant number:

        1. the estimated number is the total number of active items
        2. the actual msgs transferred = total(tap_mutations + tap_delete)

        For the above customer case where they have 2 million items deleted, we will transfer not only the current active items, but also any deleted items.
        Again, more msgs will be transferred if any key has repeated sets/deletions.
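
The arithmetic behind this explanation can be sketched as follows. This is only an illustration of the formula quoted above, not code from the tools: the function name is hypothetical, the 2 million deletions come from the comment, and the 10 million active items are an assumed figure chosen to make the example concrete.

```python
def overshoot_percent(active_items: int, tap_mutations: int, tap_deletes: int) -> float:
    """Progress the bar would show once every TAP message has arrived.

    The denominator is the active-item estimate; the numerator adds
    deletion messages on top of mutation messages.
    """
    msgs_transferred = tap_mutations + tap_deletes
    return 100.0 * msgs_transferred / active_items


# Assumed: 10 million active items, each sent once, plus 2 million deletions.
print(overshoot_percent(10_000_000, 10_000_000, 2_000_000))  # 120.0
```

Under this model the overshoot grows with the share of deleted items, which matches the observation that the estimate and the final message count diverge.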

        bcui Bin Cui made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        bcui Bin Cui made changes -
        Assignee Steve Yen [ steve ] Bin Cui [ bcui ]
        perry Perry Krug added a comment -

        Reopening for visibility. Whether the tool is doing the "right" thing or not, there is still a major impact on the user in terms of disk space taken up, time taken, and confidence in the backup.

        perry Perry Krug made changes -
        Resolution Won't Fix [ 2 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        maria Maria McDuff (Inactive) added a comment -

        Anil to work with Bin on customer use case.

        anil Anil Kumar added a comment - - edited

        Talked to Bin, here's the update.

        There are 2 issues here…
        1). In case of heavy DGM, the tool only captures the 'total active items' in memory and does not include the items on disk. The fix for this is to also consider the resident ratio to get the current_item. As per Bin, we already have the stats and this should be a low-risk fix. He will be making this fix for 2.0.2.
        2). In case of deletes on items, the tool currently only captures the snapshot of 'active items' and doesn't consider any items getting deleted. Hence when it transfers, it transfers not only current active items but also any deleted items, which is unnecessary. To fix this we require some changes on the EP-Engine side to provide stats on deleted items so that the tool can smartly ignore those. Considering the timeframe for release this won't make it for 2.0.2, but we will have documentation explaining this to users. [Doc ticket on Karen]
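
One possible reading of "consider also resident-ratio" in item 1 is to scale the in-memory item count by the resident ratio. The sketch below illustrates that reading only; it is not the actual patch, and the function name and the sample numbers are assumptions.

```python
def estimate_total_items(resident_items: int, resident_ratio: float) -> int:
    """Scale an in-memory item count up to an estimated total.

    resident_ratio is the fraction (0 < ratio <= 1) of items whose
    values are held in memory; the rest are assumed to be on disk.
    """
    if not 0.0 < resident_ratio <= 1.0:
        raise ValueError("resident_ratio must be in the range (0, 1]")
    return round(resident_items / resident_ratio)


# Assumed numbers: 4.5 million resident items at a 45% resident ratio.
print(estimate_total_items(4_500_000, 0.45))  # 10000000
```

A denominator corrected this way would address item 1 only. Deleted items (item 2) would still push the message count above the estimate.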

        abhinav Abhinav Dangeti added a comment -

        When cbbackup was run on a node with 6104188 items (~45% active resident ratio),

        root@plum-009:~# /opt/couchbase/bin/cbbackup http://localhost:8091 /data/backup
        ############################# 147.6% (9007021/6104188 msgs)
        bucket: default, msgs transferred...
        : total | last | per sec
        batch : 29230 | 29230 | 33.1
        byte : 10030064183 | 10030064183 | 11365547.3
        msg : 9007021 | 9007021 | 10206.3
        done
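
For anyone scripting around this, a small Python sketch can pull the two counts out of a progress line and flag an overshoot. The regular expression assumes the line format shown in the output quoted in this ticket; other versions of the tool may print something different, and the helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the "(sent/estimated msgs)" suffix of a progress line.
PROGRESS_RE = re.compile(r"\((\d+)/(\d+) msgs\)")


def parse_progress(line: str) -> Optional[Tuple[int, int]]:
    """Return (msgs_sent, estimated_items), or None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def exceeds_estimate(line: str) -> bool:
    """True when more messages were sent than the estimate predicted."""
    parsed = parse_progress(line)
    return parsed is not None and parsed[0] > parsed[1]


line = "############################# 147.6% (9007021/6104188 msgs)"
print(parse_progress(line))    # (9007021, 6104188)
print(exceeds_estimate(line))  # True
```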

        bcui Bin Cui added a comment -

        http://review.couchbase.org/#/c/26431/

        maria Maria McDuff (Inactive) added a comment -

        Per bug triage, this fix from Bin only addresses item 1 (see below):
        1). In case of heavy DGM, the tool only captures the 'total active items' in memory and does not include the items on disk. The fix for this is to also consider the resident ratio to get the current_item. As per Bin, we already have the stats and this should be a low-risk fix. He will be making this fix for 2.0.2.

        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Sprint PCI Team - Sprint 4 [ 7 ] PCI Team - Sprint 4, PCI Team - Sprint 8 [ 7, 18 ]
        bcui Bin Cui made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        kzeller kzeller made changes -
        Labels 2.0-release-notes 2.0-release-notes documentation
        kzeller kzeller made changes -
        Summary cbbackup loops infinitely [RN 2.1.0]cbbackup loops infinitely
        anil Anil Kumar added a comment - - edited

        created new bug to track it here : http://www.couchbase.com/issues/browse/MB-8377

        anil Anil Kumar made changes -
        Summary [RN 2.1.0]cbbackup loops infinitely [DOC 2.1.0] cbbackup loops infinitely
        anil Anil Kumar made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Assignee Bin Cui [ bcui ] Karen Zeller [ kzeller ]
        anil Anil Kumar made changes -
        Priority Blocker [ 1 ] Major [ 3 ]
        anil Anil Kumar made changes -
        Component/s documentation [ 10012 ]
        pavelpaulau Pavel Paulau made changes -
        Rank Ranked higher
        ingenthr Matt Ingenthron made changes -
        Rank Ranked higher
        kzeller kzeller made changes -
        Summary [DOC 2.1.0] cbbackup loops infinitely [DOC 2.2?] cbbackup loops infinitely
        kzeller kzeller made changes -
        Rank Ranked lower
        kzeller kzeller made changes -
        Story Points 0.25
        anil Anil Kumar added a comment -

        Karen: we need to add this to 2.1 manual

        anil Anil Kumar made changes -
        Summary [DOC 2.2?] cbbackup loops infinitely [DOC 2.1.0] cbbackup loops infinitely
        anil Anil Kumar made changes -
        Fix Version/s 2.2.0 [ 10620 ]
        anil Anil Kumar made changes -
        Labels 2.0-release-notes documentation
        Assignee Karen Zeller [ kzeller ] Bin Cui [ bcui ]
        Fix Version/s 2.2.0 [ 10620 ]
        Affects Version/s 2.1.0 [ 10418 ]
        Component/s documentation [ 10012 ]
        anil Anil Kumar made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        anil Anil Kumar made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            bcui Bin Cui
            Reporter:
            mikew Mike Wiederhold
          • Votes:
            0
            Watchers:
            6
