Couchbase Server / MB-6945

DocInfo.rev_seq is declared as 64-bit, but only 32 bits are persisted

    Details

    • Flagged: Release Note

      Description

      couch_common.h declares the field DocInfo.rev_seq as type uint64_t. However, the database only persists it as a 32-bit value (see assemble_seq_index_value, couch_save.c:19). This means that clients trying to set values larger than 2^32-1 will find the upper 32 bits truncated when they read the document.

      The field should be changed to type uint32_t. This could cause warnings in client code that assigns a 64-bit value to it, but I don't believe any of our code currently uses this field at all.
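
      A minimal C sketch of the truncation described above; the cast stands in for the narrowing that happens when the 64-bit field is written into the 32-bit on-disk slot (illustrative only, not the actual assemble_seq_index_value code):

      ```c
      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
          uint64_t rev_seq = (1ULL << 32) + 1;     /* a client sets a value > 2^32 - 1 */
          uint32_t persisted = (uint32_t)rev_seq;  /* on disk, only the low 32 bits survive */
          printf("written %llu, read back %u\n",
                 (unsigned long long)rev_seq, persisted);  /* prints 4294967297, then 1 */
          return 0;
      }
      ```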


        Activity

        Jens Alfke added a comment - edited

        Fixing this will break binary compatibility of the libcouchstore shared library because it will change field offsets in the DocInfo struct.

        It may also cause compile or runtime errors in language bindings — the only one we depend on is Python, and if our importer code doesn't set rev_seq it should be OK.

        Dipti Borkar added a comment

        does this mean XDCR is using only 32 bits?

        Dipti Borkar added a comment

        Junyi, Damien, can you take a quick look and see if this needs to be fixed for 2.0?

        is this needed for XDCR?

        Junyi Xie (Inactive) added a comment

        XDCR uses a 64-bit seq number; on the source, XDCR gets this info from CouchDB, while on the destination, we get it from ep_engine.

        Damien, can you please clarify how the issue is going to impact XDCR? Thanks.

        Jens Alfke added a comment

        If we had to store all 64 bits of the sequence number, it would require an incompatible change in the CouchStore file format. We'd need to make a corresponding change to our CouchDB code so it can read the same databases, and probably come up with some migration tool for people who have databases created with the beta.

        Dipti Borkar added a comment

        Junyi, Damien,

        Have you confirmed that XDCR doesn't get impacted by persisting only a 32-bit sequence number?

        Junyi Xie (Inactive) added a comment

        XDCR will get impacted if the number of changes is over 2^32 ≈ 4 billion.

        Damien and Aaron, I assumed the storage layer should be able to store a 64-bit seq number; could you please clarify?

        Aaron Miller (Inactive) added a comment

        Which sequence number are we talking about?

        The rev_seq field is the revision counter for the item and is local to the key. It is incremented when a key is updated. It is 32 bits in the file format.

        The db_seq field is the identifier in the sequence index; it is the one that identifies an item within a couchstore file, and is 48 bits in the file format.
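
        For illustration, a 48-bit field like db_seq can be packed into six big-endian bytes and widened back to a uint64_t on read. This is a hedged sketch with invented helper names, not the actual couchstore index code:

        ```c
        #include <stdint.h>

        /* Store the low 48 bits of val, most-significant byte first. */
        static void write_uint48(uint8_t buf[6], uint64_t val) {
            for (int i = 0; i < 6; i++)
                buf[i] = (uint8_t)(val >> (8 * (5 - i)));
        }

        /* Read six big-endian bytes back into a 64-bit integer. */
        static uint64_t read_uint48(const uint8_t buf[6]) {
            uint64_t val = 0;
            for (int i = 0; i < 6; i++)
                val = (val << 8) | buf[i];
            return val;
        }
        ```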

        Farshid Ghods (Inactive) added a comment

        Aaron,

        Is the 2^32 ≈ 4B limitation per vbucket or cluster-wide?

        Jens Alfke added a comment

        Aaron: this is about rev_seq, not db_seq.

        Farshid: The 32-bit limit is per document. Assuming ep-engine increments rev_seq every time it updates a document, you'll be able to update any one document 4 billion times before this rolls over. I don't know how XDCR works, but this doesn't seem like it should be an issue for it.

        damien added a comment - edited

        Yes, we need to fix this. I think raising the stored size to 48 bits should be plenty. We'll need to increment the header version # so that old files persisting the smaller size are not opened; instead we give an error on opening.

        Dipti, when we fix this, old files created during the beta will not be openable by the GA code. Is this acceptable? Fixing it so new code can open old files would require large changes in couchstore that I don't feel comfortable with, but if we need to do it we should know soon so we can scope out the work.

        Steve Yen added a comment

        I talked with Aaron about this, too. It is indeed a per-doc thing. A single-datacenter situation (non-XDCR) should see no effect, since nobody except XDCR reads the field. In the XDCR case, one scenario is a "window of remote-datacenter downtime" during bi-directional replication, where an older doc might win in the (very low probability) event of value wrap-around. Not a blocker, imo, after learning all this.

        Jens Alfke added a comment

        Damien: I'm not sure I understand why this is so important. The issue I see is that if a document is updated 4 billion times between replications, its rev_seq will roll over and may cause it not to be transferred. Is that something that will occur in real usage? (I'm asking sincerely, not rhetorically; I don't know the use cases of CouchStore as well as CouchDB.)

        Note that any fix has to be made to both CouchStore and CouchDB simultaneously.

        damien added a comment

        Yes, because when a document is created, it's assigned the high-water rev of the last deletion in the whole vbucket. This way, when we create new docs that we previously deleted, we don't need to look on disk for the metadata to ensure the doc now has a higher rev_seq than the deletion (if we don't, the new creation will "lose" to the remote deletion in an XDCR situation). So it's not that a single doc will be updated 4 billion times; it's that all docs combined might be updated a cumulative 4 billion times. It's still unlikely, but possible, so we need more than a 32-bit number for storing the rev_seq.
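
        A hypothetical sketch of the high-water behavior described here, with invented names (this is not the actual ep-engine code):

        ```c
        #include <stdint.h>

        typedef struct {
            uint64_t high_deleted_rev_seq;  /* highest rev_seq among all deletions in this vbucket */
        } vbucket_state;

        /* On delete: raise the vbucket-wide high-water mark if needed. */
        static void note_deletion(vbucket_state *vb, uint64_t deleted_rev_seq) {
            if (deleted_rev_seq > vb->high_deleted_rev_seq)
                vb->high_deleted_rev_seq = deleted_rev_seq;
        }

        /* On create: start a brand-new doc at the high-water mark instead of 0,
           so it cannot "lose" to a remote deletion during XDCR conflict resolution. */
        static uint64_t initial_rev_seq(const vbucket_state *vb) {
            return vb->high_deleted_rev_seq;
        }
        ```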

        Junyi Xie (Inactive) added a comment

        Probably I did not explain clearly. We bumped up the key seq number used by XDCR from 32 bits to 64 bits because of some special handling of deletion in XDCR, which is pretty dirty and dark and you may not want to know.

        In short, we need to maintain a "high watermark" of the key sequence number across all keys in the vbucket. That is, this field should be big enough to hold the max key seqno in the vbucket. Having talked to Damien, we will make the key seq number and the global seq number both 48 bits.

        No change is needed on the ep_engine and XDCR side. We still use a 64-bit seq number, although only 48 bits are really effective.

        Junyi Xie (Inactive) added a comment

        Per discussion with Damien, hand over to Aaron.

        Dipti Borkar added a comment

        I don't understand why we use a 64-bit seq number in memory but only 48 bits are effective and persisted. This is likely a bug we will hit in the future. You guys are the experts and understand this better than I do, but shouldn't this be consistent across the board?

        Breaking compatibility from Beta to GA is really not advised, but I think we need to ensure that XDCR and the way sequences are handled is foolproof. Can someone please write up the design for this and paste it in the bug to make sure everyone is on the same page?

        damien added a comment

        Dipti, it's an optimization, so we don't store unneeded bits on disk. Keeping the on-disk representation as small as possible makes reads, writes and compaction fast.

        And we aren't in any danger of overflowing 48 bits. If we continuously did 100,000 updates per sec per vbucket, it would take 60+ years before we could possibly overflow 48 bits.
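
        A back-of-envelope check of that figure, assuming a constant 100,000 persisted updates per second per vbucket:

        ```c
        #include <stdio.h>

        int main(void) {
            double max_seq  = 281474976710656.0;              /* 2^48 */
            double per_year = 100000.0 * 60 * 60 * 24 * 365;  /* updates per year */
            printf("~%.0f years to overflow\n", max_seq / per_year);  /* ~89 years */
            return 0;
        }
        ```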

        Aaron Miller (Inactive) added a comment

        This value is not actually kept around in memory. It's only used in the interface to the persistence layer.

        I was previously unaware of the high-water rev_seq behavior that was added above couchstore to accommodate XDCR. I grok that it works like so:

        • We keep a per-vbucket "high water" deleted rev_seq.
        • When an item is deleted, if its rev_seq is higher than this, we set this value to that item's rev_seq.
        • Brand-new items, when created, have their rev_seq set to this high water mark, rather than to 0.

        This is so that deleting and recreating an item will not set its rev_seq lower than before, thereby preventing it from ever winning in a XDCR conflict situation.

        This behavior essentially makes rev_seq global across the data file, instead of local to the document as we expected when we designed the file format, and as such it is not unreasonable for it to get large enough to wrap around.

        It appears this was done for efficiency: metadata for deleted items is not available in RAM, and we didn't want to have to ask Couchstore whether an item had previously been deleted every time we do a create. However, I think we can eliminate this behavior (and once again make rev_seq local to the doc) by adding an option to couchstore to do the work of finding the rev_seq of the last deleted item at insert time. Since couchstore will have to pass over the item's indexed "tombstone" to create it anyway, this should not actually be any less efficient.

        Looking into this now.

        Another option would be to increase the size of this field in the file format, but that would break file format compatibility with all the previous releases.
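
        A hypothetical sketch of the rev_seq selection described two paragraphs up, kept independent of the real couchstore API (the tombstone lookup is assumed to happen during the insert's pass over the index):

        ```c
        #include <stdint.h>
        #include <stdbool.h>

        /* Pick the rev_seq for an insert: if the key's on-disk tombstone carries a
           rev_seq at or above the requested one, continue from the tombstone so the
           recreated doc still outranks the deletion in XDCR conflict resolution. */
        static uint64_t choose_rev_seq(uint64_t requested,
                                       bool has_tombstone,
                                       uint64_t tombstone_rev_seq) {
            if (has_tombstone && tombstone_rev_seq >= requested)
                return tombstone_rev_seq + 1;
            return requested;
        }
        ```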

        Jens Alfke added a comment

        Dipti: "I don't understand why we use a 64 bit seq number in memory but only 48 bits are effective and persisted."

        The C language and the CPU don't support 48-bit integers. The next size up from 32 bits is 64. So the sequence number is fundamentally a 48-bit number, but is represented in memory as a 64-bit value.

        We should be more clear in the API docs about what the limits of the different values are, though. And range-check them in API calls.
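
        A minimal sketch of such a range check, assuming a 48-bit persisted width; the constant and the error convention are illustrative, not couchstore's actual API:

        ```c
        #include <stdint.h>

        #define MAX_PERSISTED_SEQ ((1ULL << 48) - 1)  /* largest value the file format can hold */

        /* Return 0 if seq fits in the persisted width, nonzero otherwise. */
        static int validate_seq(uint64_t seq) {
            return seq <= MAX_PERSISTED_SEQ ? 0 : -1;
        }
        ```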

        Dipti Borkar added a comment

        Thank you for all the details. I would defer to your judgement (Damien, Aaron, Jens) to fix this. It seems that, given these are global per vBucket, there is a chance of wrapping, particularly in use cases like session stores that frequently delete and re-insert documents. If we need to make any storage changes, we should have a quick discussion with QE and the leads to understand the impact. We will need to document / release note it.

        The main question is: will backup and restore still work from a beta build to GA? (Upgrade is not supported.)

        QE will need to retest these functional tests.

        damien added a comment

        Aaron is currently reworking the way we keep the rev seq and the high deleted rev, as it simplifies the code and can require less space on disk and in memory. Once he has a patch ready, we'll evaluate the risk of merging the patch this late.

        Steve Yen added a comment

        Thanks Damien.

        Thought I'd add some notes from talking with Damien & Aaron, where solution #1 is the currently favored approach they're evaluating...

        solution #1: Aaron is enhancing the couchstore API so that couchstore can choose the right rev-seq, even for recreated items.

        +++ Pros: no changes to the current 32-bit field in the couchstore file format. This would also move us from a per-vbucket-level "highwatermark" design to a per-document level of rev-seq tracking.
        --- Cons: changes to couchstore and to ep-engine.

        solution #2: change from a 32-bit field to a 48-bit field in the couchstore file format.

        +++ Pros: easy to understand this kind of change, so likely lower risk in implementing it.
        --- Cons: file-format incompatibility with 2.0-beta, so can't support in-place, offline upgrades from 2.0-beta to 2.0-GA. Also, changes to couchstore and couchdb.

        solution #3: the math tells us this is unlikely to happen soon, as the seq-num is only updated at persistence time, not at item mutation time. That is, 400K ops/sec will not produce 400K seq-num increments/sec. Instead, the rev-seq # is incremented only when a change gets through couchstore and is persisted, so the speed of persistence limits the growth of this #.

        +++ Pros: easiest to understand, as it's the do-nothing approach.
        --- Cons: this only delays the inevitable, as we still have to fix it, such as by implementing solution #1 in some ".next" release.

        And, Aaron & Damien discussed doing both #1 and #2.

        Finally, looking at the backup/restore code, it transfers the full 64 bits of the rev-seq #, so it should be OK already.

        Steve Yen added a comment

        Notes from another sync-up mtg that Damien called (with Aaron, Farshid, Steve)...

        The agreed-upon, favored solution is now #2, increasing the 32-bit field to a 48-bit field. A new toy-build will be forthcoming soon. Additional unit test(s) forthcoming. The change will start from Jens's patch that cleans up some of the magic numbers currently spread amongst the couchstore code. Damien will handle the changes on the couchdb side.

        On the other solutions, solution idea #1 won't work due to intra-cluster (TAP) replication. Solution idea #3 will run out of bits too soon (e.g., 100 days).

        Steve Yen added a comment

        Update: toy build from Aaron is now in testing.

        Ketaki Gangal added a comment

        Tested this on a small 2:2-node cluster; XDCR works as expected.

        Steve Yen added a comment

        From bug scrub mtg...

        For Release Note...

        The fix changes the on-disk representation from 2.0 beta2 to 2.0 GA.

        Aaron, after you're done / have resolved this, please reassign to Chisheng...

        Chisheng, after this fix goes in (or testing on the toy build), backup of 2.0 beta2 and restore to 2.0 needs retest.

        Junyi Xie (Inactive) added a comment

        Ketaki, do we understand why the replication rate is so slow in your test yesterday? This change-set is a fundamental change in storage, and it may break all XDCR functionality, so we need to be very careful before merging this one.

        Thuan Nguyen added a comment

        Integrated in github-couchdb-preview #531 (See http://qa.hq.northscale.net/job/github-couchdb-preview/531/)
        MB-6945: Increase size of rev seq from 32 bits to 48 bits (Revision 4cf0a82487e198389e19b41f93e6bf6fdd80f782)

        Result = SUCCESS
        steve:
        Files:

        • src/couchdb/couch_db_updater.erl
        • src/couchdb/couch_db.hrl
        kzeller added a comment

        RN: Document revision numbers had been stored as 32-bit values but are now stored as 48-bit values. This supports the functioning of XDCR and a larger number of document revisions.


          People

          • Assignee: Chisheng Hong (Inactive)
          • Reporter: Jens Alfke
          • Votes: 0
          • Watchers: 5

