Couchbase Gateway / CBG-280

Sync Gateway doesn't import all documents while importing 50M docs (build 2.1.3-4)


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.3
    • Fix Version/s: 2.6.0
    • Component/s: SyncGateway
    • Security Level: Public
    • Labels:
    • Sprint: CBG Sprint 18, CBG Sprint 19
    • Story Points: 8

      Description

      Sync Gateway import hangs after importing ~560K docs out of 1M docs.

      I do not see anything specific in the error log folder.

      However, I do see crc32c_doc_unmarshal_count and crc32c_mismatch_count at ~987558:

      "syncGateway_import": {"crc32c_doc_unmarshal_count": 987558, "crc32c_mismatch_count": 987558, "import_count": 504360},

       

      Can you please take a look at the attached log files: logs_172.23.97.21.tar.gz

        Attachments

          Issue Links


            Activity

            adamf Adam Fraser added a comment -

            In the latest test (http://perf.jenkins.couchbase.com/job/magma_sg_import/182) 55 documents weren't imported, and there are a relatively large number of kv ops failing in a 2-3 minute window.  A server node failed over during this test, so the suspicion is that these operations failed during that failover.  Some of these failures were network errors/retry loop errors during incr of _sync:seq, others were during WriteCasWithXattr:

            8 instances of "RetryLoop for WriteCasWithXattr with key [] giving up after 11 attempts"
            8 instances of "RetryLoop for CouchbaseBucketGoCB Set() giving up after 11 attempts"
            4 instances of "RetryLoop for CouchbaseBucketGoCB AddRaw() giving up after 11 attempts"
            31 instances of "RetryLoop for Incr with key: _sync:seq giving up after 11 attempts" (2 other incr errors were caused by network error)

            This isn't enough to account for all 55 non-imported documents, but it does suggest that the server failover may be related to the failed imports.  Given that incr is failing, I suspect this may be related to the increased pressure on the _sync:seq node - this will be alleviated with CBG-258.
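            For reference, the "giving up after 11 attempts" messages come from the generic retry loop Sync Gateway wraps around these KV operations. A minimal sketch of that pattern, assuming a worker callback that reports whether its error is retryable (illustrative only, not the actual RetryLoop implementation):

            // Minimal sketch of a bounded KV retry loop with doubling backoff, matching the
            // "giving up after 11 attempts" / "retrying ... after 5 ms" log lines in this ticket.
            // This is an assumption about the general pattern, not the actual RetryLoop code.
            package main

            import (
                "errors"
                "fmt"
                "time"
            )

            // retryWorker performs one attempt and reports whether a failure should be retried.
            type retryWorker func() (shouldRetry bool, err error)

            // retryLoop retries worker with a doubling backoff, giving up after maxAttempts.
            func retryLoop(description string, worker retryWorker, maxAttempts int) error {
                sleep := 5 * time.Millisecond
                for attempt := 1; attempt <= maxAttempts; attempt++ {
                    shouldRetry, err := worker()
                    if err == nil || !shouldRetry {
                        return err
                    }
                    if attempt == maxAttempts {
                        return fmt.Errorf("RetryLoop for %s giving up after %d attempts: %w", description, maxAttempts, err)
                    }
                    fmt.Printf("RetryLoop retrying %s after %v.\n", description, sleep)
                    time.Sleep(sleep)
                    sleep *= 2
                }
                return nil
            }

            func main() {
                // Example: a worker that always times out, e.g. while a server node is failing over.
                err := retryLoop("Incr with key: _sync:seq", func() (bool, error) {
                    return true, errors.New("operation timed out")
                }, 11)
                fmt.Println(err)
            }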

            Other notable details from the logs that may merit further investigation:

            • the skipped sequence queue is extremely large.  Sample log output:
              • Cache: Issuing skipped sequence clean query for 50 sequences, 22823191 remain pending (db:db)
              • We should probably increase max_num_pending and max_wait_pending for high volume import tests
            • Info logging is excessively verbose during import for 2.1.3-4 (100MB info logs were being rotated several times a minute). 
              • This was primarily the two lines:
                • Import: Created new rev ID 1-d8bbe478d3786b02d5c270de6154104e and
                • [INF] CRUD: Stored doc "00000000000044709267" / "1-d8bbe478d3786b02d5c270de6154104e"
              • The first line was modified in 2.5 to include the doc ID, but it is still largely redundant with the second. The second is probably a candidate to be moved to debug logging.
            adamf Adam Fraser added a comment -

            The latest run doesn't show any network errors, and import count shows 4999994/5000000.

            http://perf.jenkins.couchbase.com/job/magma_sg_import/183

            Looking at the debug logs, there are 6 documents showing the following log message on the import node:

            434808:2019-04-05T13:01:34.575-07:00 [DBG] Import+: Not importing mutation - document 00000000000014335137 has been subsequently updated and will be imported based on that mutation.
            598991:2019-04-05T13:21:45.843-07:00 [DBG] Import+: Not importing mutation - document 00000000000024543612 has been subsequently updated and will be imported based on that mutation.
            661478:2019-04-05T13:22:23.726-07:00 [DBG] Import+: Not importing mutation - document 00000000000019906912 has been subsequently updated and will be imported based on that mutation.
            374471:2019-04-05T13:23:14.901-07:00 [DBG] Import+: Not importing mutation - document 00000000000014888187 has been subsequently updated and will be imported based on that mutation.
            573097:2019-04-05T13:24:44.860-07:00 [DBG] Import+: Not importing mutation - document 00000000000029662654 has been subsequently updated and will be imported based on that mutation.
            672187:2019-04-05T13:25:41.101-07:00 [DBG] Import+: Not importing mutation - document 00000000000049990789 has been subsequently updated and will be imported based on that mutation.

            Checking these documents in the bucket, they have all been imported and have the _sync xattr populated.
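            For anyone re-running that check, the _sync xattr can be read with a subdocument lookup. A rough sketch assuming the gocb v1.x SDK (the connection details, credentials, and bucket name below are placeholders, and the API surface is an assumption based on that SDK generation):

            // Rough sketch: check whether a document has a _sync xattr (i.e. has been imported),
            // using a gocb v1.x-style subdoc lookup. API details are assumptions; adjust for the
            // SDK generation actually in use.
            package main

            import (
                "log"

                "gopkg.in/couchbase/gocb.v1"
            )

            // hasSyncXattr reports whether a document carries a _sync xattr.
            func hasSyncXattr(bucket *gocb.Bucket, key string) bool {
                frag, err := bucket.LookupIn(key).
                    GetEx("_sync", gocb.SubdocFlagXattr).
                    Execute()
                if err != nil || frag == nil {
                    log.Printf("subdoc lookup for %q failed: %v", key, err)
                    return false
                }
                var syncXattr map[string]interface{}
                if err := frag.Content("_sync", &syncXattr); err != nil {
                    return false // no _sync xattr: the document hasn't been imported
                }
                return true
            }

            func main() {
                // Placeholder connection details: substitute the real server host, credentials, and bucket.
                cluster, err := gocb.Connect("couchbase://couchbase-server-host")
                if err != nil {
                    log.Fatal(err)
                }
                if err := cluster.Authenticate(gocb.PasswordAuthenticator{Username: "Administrator", Password: "password"}); err != nil {
                    log.Fatal(err)
                }
                bucket, err := cluster.OpenBucket("data-bucket", "")
                if err != nil {
                    log.Fatal(err)
                }
                log.Println("imported:", hasSyncXattr(bucket, "00000000000014335137"))
            }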

            The debug logs also show the following message for each of these keys:

            434161:2019-04-05T13:01:33.531-07:00 [DBG] RetryLoop retrying WriteCasWithXattr with key 00000000000014335137 after 5 ms.
            513386:2019-04-05T13:21:43.620-07:00 [DBG] RetryLoop retrying WriteCasWithXattr with key 00000000000024543612 after 5 ms.
            660726:2019-04-05T13:22:22.046-07:00 [DBG] RetryLoop retrying WriteCasWithXattr with key 00000000000019906912 after 5 ms.
            303222:2019-04-05T13:23:13.705-07:00 [DBG] RetryLoop retrying WriteCasWithXattr with key 00000000000014888187 after 5 ms.
            517850:2019-04-05T13:24:43.616-07:00 [DBG] RetryLoop retrying WriteCasWithXattr with key 00000000000029662654 after 5 ms.
            658811:2019-04-05T13:25:40.887-07:00 [DBG] RetryLoop retrying WriteCasWithXattr with key 00000000000049990789 after 5 ms.

            I checked the logs for all three Sync Gateway nodes, and the other two (non-import) nodes don't show anything unexpected for these keys (i.e. the documents don't appear to have been imported by another node).  I'm wondering whether the import is in fact succeeding after a non-CAS-related retry, but we're misidentifying the retry as a CAS failure and so not incrementing the import count.  Need to try to reproduce the RetryLoop behaviour and identify whether there is such a path in the code.

            adamf Adam Fraser added a comment -

            I attempted to reproduce this by forcing an error (and the subsequent retry loop), and the import count is incremented as expected.  I believe this means that the initial import is succeeding, but the operation response is timing out.  This would produce the following sequence of events (sketched in code after the list):

            1. The operation succeeds and the document is imported, but the response times out
            2. gocb returns ErrTimeout to us
            3. Timeout is a recoverable error, so we retry
            4. The retry attempt gets a CAS error (correctly, as the document has already been updated) and ends import processing
            5. The Sync Gateway node can't confirm that it's the one that completed the import, so the import_count expvar isn't incremented
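            A minimal, self-contained Go sketch of that sequence (the helper names, error values, and simulated attempts are assumptions for illustration, not Sync Gateway's actual import code):

            // Self-contained sketch of the sequence above. All names here are illustrative
            // assumptions, not Sync Gateway's actual implementation.
            package main

            import (
                "errors"
                "expvar"
                "fmt"
            )

            var (
                errTimeout     = errors.New("operation timed out") // stands in for gocb's timeout error
                errCasMismatch = errors.New("CAS mismatch")

                importCount = expvar.NewInt("import_count")
            )

            // writeCasWithXattr simulates the two attempts: the first write is applied server-side
            // but its response times out; the second attempt therefore sees a CAS mismatch.
            func writeCasWithXattr(attempt int) error {
                if attempt == 1 {
                    return errTimeout
                }
                return errCasMismatch
            }

            func importDoc() {
                err := writeCasWithXattr(1)
                if errors.Is(err, errTimeout) { // timeout is recoverable, so we retry
                    err = writeCasWithXattr(2)
                }
                switch {
                case errors.Is(err, errCasMismatch):
                    // We can't tell whether this node's first attempt actually won the import or
                    // whether another writer got there first, so import_count is not incremented.
                    // This is the shortfall the perf test observes.
                case err == nil:
                    importCount.Add(1) // only counted when this node's write is known to have succeeded
                }
            }

            func main() {
                importDoc()
                fmt.Println("import_count:", importCount.Value()) // prints 0: the doc was imported but not counted
            }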

            I don't see any obvious solution for this - we can't safely increment the import_count after step 4 in this series of events, because we can't identify that step 1 actually succeeded (i.e. we might have timed out without succeeding, and another node may have performed an on-demand import for this document, resulting in double-counting the import).

            Adding an additional 'import_cancelled_doc_already_imported' stat isn't feasible, as it would require an additional kv op to retrieve the xattr after CAS failure during feed processing, which has unacceptable performance overhead (particularly in scenarios where the document is being rapidly updated).

            adamf Adam Fraser added a comment -

            As discussed above, there isn't a way for Sync Gateway to differentiate between the 'import succeeded but response timed out' scenario and other CAS failures ('import was performed by another node', 'document was updated via the SDK before it was imported'), so there's no way to fix the import_count expvar after hitting this scenario.

            However, we've added a new expvar ('import_cancel_cas') to make it possible to identify when the import_count hasn't been incremented due to CAS failure (as opposed to hitting other errors while importing).  Given that this performance test imports on a single node, and only performs one write per document, it should be able to use the combination of 'import_count' and 'import_cancel_cas' as pass/fail criteria.
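            A rough sketch of what that pass/fail check could look like in the perf test, assuming the stats are read from the admin /_expvar endpoint (the endpoint path, port, and JSON layout are assumptions based on the syncGateway_import block quoted in the description):

            // Rough sketch of the perf-test pass/fail check using import_count + import_cancel_cas.
            // Endpoint, port, and JSON layout are assumptions based on the stats quoted in this ticket.
            package main

            import (
                "encoding/json"
                "fmt"
                "log"
                "net/http"
            )

            func main() {
                const expectedDocs = 5000000 // e.g. the 5M-doc run referenced above

                resp, err := http.Get("http://localhost:4985/_expvar") // Sync Gateway admin port (assumed)
                if err != nil {
                    log.Fatalf("failed to fetch expvars: %v", err)
                }
                defer resp.Body.Close()

                var expvars struct {
                    SyncGatewayImport struct {
                        ImportCount     int64 `json:"import_count"`
                        ImportCancelCas int64 `json:"import_cancel_cas"`
                    } `json:"syncGateway_import"`
                }
                if err := json.NewDecoder(resp.Body).Decode(&expvars); err != nil {
                    log.Fatalf("failed to decode expvars: %v", err)
                }

                stats := expvars.SyncGatewayImport
                // Every document should be accounted for either as a successful import or as an
                // import cancelled on CAS failure (e.g. import succeeded but the response timed out).
                if stats.ImportCount+stats.ImportCancelCas == expectedDocs {
                    fmt.Println("PASS")
                } else {
                    fmt.Printf("FAIL: import_count=%d import_cancel_cas=%d expected=%d\n",
                        stats.ImportCount, stats.ImportCancelCas, expectedDocs)
                }
            }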

            adamf Adam Fraser added a comment -

            Sharath Sulochana Please review the comments around root cause, and we can discuss enhancements to the perf test as needed.


              People

              Assignee: sharath.sulochana Sharath Sulochana (Inactive)
              Reporter: sharath.sulochana Sharath Sulochana (Inactive)
              Votes: 0
              Watchers: 12


