Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta-2, 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: tools
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Linux
    • Sprint:
      PCI Team - Sprint 4, PCI Team - Sprint 8

      Description

      I'm trying to backup my cluster via cbbackup. I start the backup via
      /opt/couchbase/bin/cbbackup couchbase://Administrator:password@mymachine:8091 /backups/couchbase_backup_test
      The backup appears to work fine and a progress bar appears, but then exceeds 100% progress and never stops! Example:
      ^Cinterrupted.###############################] 210.8% (26141950/12403784 msgs)
      (I ctrl-C'd to kill it after 200% because it seems that this can't possibly work. Note that both the percent and the number of messages are off). I only have ~12 million items in the bucket, but it went right past that limit when backing up.
      Help?

        Activity

        steve Steve Yen added a comment -

        There are a couple of things going on here that probably need documentation...

        • cbbackup first contacts the server to get the # of items. But, if the cluster is changing (there are item mutations), that # of items will just be an estimate.
        • then, cbbackup uses the TAP protocol to perform the backup. But, under some conditions (not all item values are resident in memory), the TAP protocol might actually send duplicate messages. That's why cbbackup reports "msgs" for progress instead of "items" in its numerator, but uses "items" in its denominator. That can lead to >100% in some cases (see the sketch after this comment).

        Whether it leads to >200% is somewhat unexpected, but it depends on the situation and what couchbase server is doing in generating the TAP stream.
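
        A minimal sketch of that arithmetic (not the actual cbbackup code; the only non-hypothetical numbers are the ones from the report above): the progress percentage divides msgs actually transferred, which may include duplicate TAP messages, by the up-front item estimate, so it is not capped at 100%.

        # Sketch of the progress arithmetic described above (not the real cbbackup code).
        # Numerator: msgs actually transferred (TAP may resend some when values are not
        # memory-resident); denominator: the up-front estimate of item count.
        def progress_line(msgs_transferred, estimated_items):
            pct = 100.0 * msgs_transferred / estimated_items
            return "%.1f%% (%d/%d msgs)" % (pct, msgs_transferred, estimated_items)

        # Numbers taken from the report above: ~12.4M estimated items, ~26.1M msgs sent.
        print(progress_line(26141950, 12403784))  # -> 210.8% (26141950/12403784 msgs)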

        steve Steve Yen added a comment -

        The 2.0.2 filter isn't correct at the moment. Putting this into 2.0.1 for now, as it'll be revisited.

        MichaelL Michael L added a comment -

        (I am the original poster)

        Changing the line above to use an IP address rather than hostname seems to have fixed the problem. My backups now run to 100% and then complete as expected.

        As for the root cause: I don't believe it has anything to do with the cluster changing, since I first encountered this when trying to backup an essentially idle cluster.

        bcui Bin Cui (Inactive) added a comment -

        Verified on a multi-node cluster that cbtransfer gets the total item number correctly, and failed to reproduce the bug on an idle cluster.

        mikew Mike Wiederhold added a comment -

        I asked the user on the forums for more information to reproduce this issue. I will post the information here if and when he responds.

        farshid Farshid Ghods (Inactive) added a comment -

        deferring to 2.1 per bug scrub meeting ( Dipti & Farshid -December 7th )

        pauluswaulus Paul Janssen added a comment -

        I have the same problem.
        Observed progress over 32000%.
        The expected number of msgs to save is far lower than the number actually saved.
        Restore will load the same number of msgs as were actually saved.
        This impacts both backup/restore time and disk space.

        pauluswaulus Paul Janssen added a comment -

        Version info: 2.0.0 community edition (build-1723)

        pauluswaulus Paul Janssen added a comment - - edited

        Using an IP address (local or external) or a hostname (localhost) does not make any difference; the issue remains.

        maria Maria McDuff (Inactive) added a comment -

        bug scrub: Bin – have you had a chance to take a look? pls update.
        thanks.

        bcui Bin Cui (Inactive) added a comment -

        Cannot reproduce it in house.

        maria Maria McDuff (Inactive) added a comment -

        per bug scrub: abhinav – can you please repro in latest 2.0.2 build? thanks.

        abhinav Abhinav Dangeti added a comment -

        Cannot reproduce on 2.0.2-749-rel.

        • 3 nodes, 2 buckets
          [root@orange-11601 ~]# /opt/couchbase/bin/cbbackup couchbase://Administrator:password@localhost:8091 ~/backup
          ################### 100.0% (8557766/8558766 msgs)
          bucket: default, msgs transferred...
          : total | last | per sec
          batch : 10657 | 10657 | 14.7
          byte : 1131794345 | 1131794345 | 1562011.5
          msg : 8558766 | 8558766 | 11812.1
          ################### 100.0% (2024739/2024739 msgs)
          bucket: saslbucket, msgs transferred...
          : total | last | per sec
          batch : 14775 | 14775 | 86.0
          byte : 1367390279 | 1367390279 | 7955075.3
          msg : 10583505 | 10583505 | 61571.7
          done
        maria Maria McDuff (Inactive) added a comment -

        not reproducible.

        bcui Bin Cui (Inactive) added a comment -

        First, the error itself is harmless. The tool tried to transfer design docs, and the source cluster doesn't contain any. Since 2.0.2, customers can specify the --data-only option for the cbtransfer/cbbackup/cbrestore tools.

        But we still don't know the root cause of why there is such a big difference between the initial estimate of msgs to be sent and the final number of msgs that are transferred.

        bcui Bin Cui (Inactive) added a comment - - edited

        One possible explanation for the deviant number:

        1. the estimated number is the total active item count
        2. the actual msgs transferred = total(tap_mutations + tap_deletes)

        For the above customer case, where they have 2 million items deleted, we will transfer not only the current active items but also any deleted items.
        Again, there will be more msgs transferred if any key has repeated sets/deletions (a rough worked example follows below).
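
        A minimal worked sketch of that accounting (the helper and all numbers except the ~2 million deletions are illustrative, not measured): the estimate counts only active items, while the TAP stream carries one msg per mutation and per deletion, so re-sets and deletes inflate the actual count.

        # Sketch of the accounting above (hypothetical helper, illustrative numbers).
        # estimate    = current active item count only
        # actual msgs = tap_mutations + tap_deletes; every re-set or delete of a key
        #               adds another message to the TAP stream.
        def backup_msg_ratio(active_items, tap_mutations, tap_deletes):
            estimated = active_items
            actual = tap_mutations + tap_deletes
            return "%d/%d msgs = %.1f%%" % (actual, estimated, 100.0 * actual / estimated)

        # Illustrative: 12M active items, 22M mutation msgs (re-sets included),
        # plus the ~2M deletion msgs mentioned above.
        print(backup_msg_ratio(12000000, 22000000, 2000000))  # -> 24000000/12000000 msgs = 200.0%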

        perry Perry Krug added a comment -

        Reopening for visibility. Whether the tool is doing the "right" thing or not, there is still a major impact on the user in terms of disk space consumed, time taken, and confidence in the backup.

        maria Maria McDuff (Inactive) added a comment -

        Anil to work with Bin on customer use case.

        anil Anil Kumar added a comment - - edited

        Talked to Bin, here's the update.

        There are 2 issues here…
        1). In the case of heavy DGM – the tool only captures the 'total active items' in memory and does not include the items on disk. The fix is to also consider the resident ratio to get the current item count (see the sketch after this list). As per Bin, we already have the stats and this should be a low-risk fix. He will be making this fix for 2.0.2.
        2). In the case of deletes on items – the tool currently only captures a snapshot of 'active items' and doesn't consider any items getting deleted. Hence, when it transfers, it transfers not only the current active items but also any deleted items, which is unnecessary. To fix this we require some changes on the EP-Engine side to provide stats on deleted items so that the tool can smartly ignore those. Considering the timeframe for the release this won't make it for 2.0.2, but we will have documentation explaining this to users. [Doc ticket on Karen ]
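
        A minimal sketch of the resident-ratio adjustment in item 1), assuming (this is an illustration, not the actual patch) that the total item count is approximated by dividing the memory-resident active item count by the active resident ratio; the numbers loosely echo Abhinav's ~45%-resident run below.

        # Sketch of the resident-ratio-based estimate from item 1) above
        # (assumed formula for illustration; not the code from the actual fix).
        def estimate_total_items(resident_active_items, active_resident_ratio_pct):
            # If only ~45% of the active items are memory-resident, the total
            # (memory + disk) is roughly the resident count divided by that ratio.
            return int(resident_active_items / (active_resident_ratio_pct / 100.0))

        # Illustrative numbers: ~2.75M resident items at ~45% resident ratio
        # gives an estimate close to the 6104188 items in the run reported below.
        print(estimate_total_items(2746885, 45.0))  # -> 6104188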

        abhinav Abhinav Dangeti added a comment -

        When cbbackup was run on a node with 6104188 items (~45% active resident ratio),

        root@plum-009:~# /opt/couchbase/bin/cbbackup http://localhost:8091 /data/backup
        ############################# 147.6% (9007021/6104188 msgs)
        bucket: default, msgs transferred...
        : total | last | per sec
        batch : 29230 | 29230 | 33.1
        byte : 10030064183 | 10030064183 | 11365547.3
        msg : 9007021 | 9007021 | 10206.3
        done

        bcui Bin Cui (Inactive) added a comment -

        http://review.couchbase.org/#/c/26431/

        maria Maria McDuff (Inactive) added a comment -

        Per bug triage, this fix from Bin only addresses item 1 (see below):
        1). In the case of heavy DGM – the tool only captures the 'total active items' in memory and does not include the items on disk. The fix is to also consider the resident ratio to get the current item count. As per Bin, we already have the stats and this should be a low-risk fix. He will be making this fix for 2.0.2.

        anil Anil Kumar added a comment - - edited

        created new bug to track it here : http://www.couchbase.com/issues/browse/MB-8377

        anil Anil Kumar added a comment -

        Karen: we need to add this to 2.1 manual


          People

          • Assignee:
            bcui Bin Cui (Inactive)
            Reporter:
            mikew Mike Wiederhold
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Agile

                Gerrit Reviews

                There are no open Gerrit changes