Couchbase Server
MB-37340

Index scan timeout in RC4 build intermittently during syncgateway load testing



    Description

      Observing intermittent indexer failures in syncgateway load tests, which eventually result in some query failures on the syncgateway side.

      cbcollect_info logs:

      [^collectinfo-2019-12-23T214220-ns_1@172.23.100.190.zip]

      Attachments

        Issue Links


          Activity

            sharath.sulochana Sharath Sulochana (Inactive) created issue -
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Field Original Value New Value
            Affects Version/s Mad-Hatter [ 15037 ]
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Fix Version/s Mad-Hatter [ 15037 ]
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Component/s secondary-index [ 11211 ]
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Assignee Deepkaran Salooja [ deepkaran.salooja ]
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Labels performance secondary-index
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Summary Intermittent Indexer crashing in RC4 build during syncgateway load testing Indexer crashing in RC4 build Intermittently during syncgateway load testing
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Link This issue is a backport of CBG-650 [ CBG-650 ]

            deepkaran.salooja Deepkaran Salooja added a comment -

            Sharath Sulochana, I didn't find a crash/panic in the indexer logs of any of the four nodes. Please provide more details about what's failing.

            deepkaran.salooja Deepkaran Salooja made changes -
            Assignee Deepkaran Salooja [ deepkaran.salooja ] Sharath Sulochana [ sharath.sulochana ]

            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Basically, N1QL queries running from syncgateway against Couchbase Server are intermittently timing out (after 75 sec).

            Debugging further, there were no query errors/failures, but I am seeing these indexer error messages. My assumption is that the failures on the syncgateway side are related to these indexer errors (I may be wrong), but I don't see these errors in successful runs.

            Could you change the bug title accordingly?

            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Assignee Sharath Sulochana [ sharath.sulochana ] Deepkaran Salooja [ deepkaran.salooja ]
            deepkaran.salooja Deepkaran Salooja made changes -
            Summary Indexer crashing in RC4 build Intermittently during syncgateway load testing index scan timeout in RC4 build Intermittently during syncgateway load testing

            deepkaran.salooja Deepkaran Salooja added a comment -

            On 172.23.100.192, I see one instance of index scan timeout recorded for index sg_access_1:

            2019-12-23T13:27:52.078-08:00 [Info] SCAN##611 REQUEST defnId:8465836998817855857, instId:1513771677507661399, index:bucket-1/sg_access_1, type:scan, partitions:[0], scans: <ud>([{["sg-user-172"] ["sg-user-172"] 3 range [{[{"sg-user-172" "sg-user-172" 3}] ["sg-user-172"] ["sg-user-172"] 3 }] <nil>}])</ud>, limit:9223372036854775807, consistency:session_consistency_strict, requestId:cdfdaf24-3055-481b-9ef1-3b88ba0d0707
             
            2019-12-23T13:27:52.089-08:00 [Info] SCAN##611 RESPONSE status:(error = Index scan timed out), requestId: cdfdaf24-3055-481b-9ef1-3b88ba0d0707
            


            deepkaran.salooja Deepkaran Salooja added a comment -

            1. The scan timeout only happens on the replica index sg_access_1 (Replica 1) on 172.23.100.192. All scans work fine on 172.23.100.193.

            2. The retry of the scan after timeout goes to 172.23.100.193 and is successful.

            3. Comparing the last PeriodicStats of the index (where the retry succeeds) and its replica (where the scan times out), the last stat before the timeout is at 2019-12-23T13:27:27:

            stat                      sg_access_1        sg_access_1 (Replica 1)
                                      172.23.100.193     172.23.100.192
            num_requests              988                1013
            num_completed_requests    988                1012
            num_scan_timeouts         0                  0
            num_open_snapshots        1                  1
            num_snapshots             1                  1
            num_snapshot_waiters      0                  1
            num_docs_pending          0                  0
            num_docs_queued           0                  0
            items_count               0                  0
            mutation_queue_size       0                  0
            flush_queue_size          0                  0
            num_docs_indexed          2102870            2102878
            num_docs_processed        2103107            2103107
            num_flush_queued          2102870            2102878
            num_items_flushed         0                  0

            a. items_count is 0.
            b. Nothing is pending/queued to be indexed.
            c. Stats are very similar. The difference of 8 for num_docs_indexed/num_flush_queued can be attributed to dedup.

            The next stats, after the timeout on .192 and the successful retry on .193, at 2019-12-23T13:28:27:

            stat                      sg_access_1        sg_access_1 (Replica 1)
                                      172.23.100.193     172.23.100.192
            num_requests              989                1013
            num_completed_requests    989                1013
            num_scan_timeouts         0                  2
            num_open_snapshots        1                  1
            num_snapshots             1                  1
            num_snapshot_waiters      0                  1
            num_docs_pending          0                  0
            num_docs_queued           0                  0
            items_count               0                  0
            mutation_queue_size       0                  0
            flush_queue_size          0                  0
            num_docs_indexed          2102870            2102878
            num_docs_processed        2103107            2103107
            num_flush_queued          2102870            2102878
            num_items_flushed         0                  0

            a. Nothing more was indexed (no new snapshot created) when the scan succeeded on 172.23.100.193.
            b. num_requests/num_completed_requests increment by 1 on 172.23.100.193, recording the successful retry.

            4. The indexer is using session_consistency_strict for the scan. This needs to be looked into further, as there doesn't seem to be any rollback on KV. There could be a potential issue here.


            prathibha Prathibha Bisarahalli (Inactive) added a comment -

            Deepkaran Salooja: You are right in your suspicion around session_consistency_strict. I could not reproduce this with simple scans on an empty index; I got a timeout when I instrumented the code to always force session_consistency_strict mode.

            When we enter this mode, we create a new Timestamp (TS) with seq numbers and vbuuids from the bucket. But the Bucket field of this TS is not set, leading to the AsRecent comparison always returning false. The fix would be to set the Bucket field in the new TS created in session_consistency_strict mode. Given the severity of this issue, we either need to make this fix and do targeted testing around it, or disable the enable_session_consistency_strict setting.
            The reason this test case entered session_consistency_strict mode even without a rollback could be the caching of bucket seq nos used for request timestamp calculation.
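
            To make the failure mode described above concrete, here is a minimal, self-contained Go sketch. It is not the actual indexer code: the Timestamp struct and asRecent helper are simplified, hypothetical stand-ins for the indexer's timestamp comparison, used only to show why an unset Bucket field makes the recency check fail every time, so the scan keeps waiting and times out.

            package main

            import "fmt"

            // Timestamp is a simplified stand-in for the indexer's per-bucket
            // timestamp (bucket name plus per-vbucket seqnos and vbuuids).
            type Timestamp struct {
                Bucket  string
                Seqnos  []uint64
                Vbuuids []uint64
            }

            // asRecent reports whether ts is at least as recent as the request
            // timestamp req. Like the comparison described above, it refuses to
            // compare timestamps that do not belong to the same bucket.
            func (ts Timestamp) asRecent(req Timestamp) bool {
                if ts.Bucket != req.Bucket { // an unset Bucket on req always fails here
                    return false
                }
                if len(ts.Seqnos) != len(req.Seqnos) {
                    return false
                }
                for i := range ts.Seqnos {
                    if ts.Seqnos[i] < req.Seqnos[i] {
                        return false
                    }
                }
                return true
            }

            func main() {
                // Snapshot the index already has.
                snapshotTS := Timestamp{Bucket: "bucket-1", Seqnos: []uint64{100, 200}, Vbuuids: []uint64{7, 8}}

                // Request TS built in session_consistency_strict mode: seqnos and
                // vbuuids are copied from the bucket, but Bucket is left unset.
                reqTS := Timestamp{Seqnos: []uint64{100, 200}, Vbuuids: []uint64{7, 8}}
                fmt.Println(snapshotTS.asRecent(reqTS)) // false: the scan waits for a "newer" snapshot and times out

                // Proposed fix: populate Bucket on the newly created TS.
                reqTS.Bucket = "bucket-1"
                fmt.Println(snapshotTS.asRecent(reqTS)) // true: the existing snapshot satisfies the request
            }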

            jeelan.poola Jeelan Poola made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            lynn.straus Lynn Straus added a comment -

            Sharath Sulochana, is this a regression? The previous comment mentions that the proposed fix would minimize risk; does it fully address the issues? Is there an issue targeting this for 6.5.1, where it sounds like this area will be fully tested?


            prathibha Prathibha Bisarahalli (Inactive) added a comment -

            Lynn Straus - A bug targeting 6.5.1 will be opened once the fix to disable the strict consistency scans is in. The fix is here, reviewed and tested: http://review.couchbase.org/#/c/119749/ .

            This is merged to the unstable branch. Waiting for approval to merge the fix to the MH branch (http://review.couchbase.org/#/c/119776/).


            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Lynn Straus - This looks to be a critical bug for any N1QL query execution, and it is an edge case. The issue showed up while testing syncgateway. In SG testing it happens very randomly and there is no fallback option.

            I guess Deepkaran Salooja and Prathibha Bisarahalli are best placed to judge its overall impact and the history of changes.


            raju Raju Suravarjjala added a comment -

            Amit Kulkarni - Thanks for the detailed explanation. Can you make sure that I get only this bug fix as part of the next RC? I do not want any other changes.
            Prathibha Bisarahalli - Please confirm the above.


            prathibha Prathibha Bisarahalli (Inactive) added a comment -

            Raju Suravarjjala - The only change that we would be making is the fix for this bug (http://review.couchbase.org/#/c/119776/).

            lynn.straus Lynn Straus made changes -
            Labels performance secondary-index approved-for-mad-hatter performance secondary-index
            lynn.straus Lynn Straus made changes -
            Link This issue blocks MB-36676 [ MB-36676 ]

            raju Raju Suravarjjala added a comment -

            Thanks Prathibha Bisarahalli, please check it in.


            prathibha Prathibha Bisarahalli (Inactive) added a comment -

            Raju Suravarjjala - It is merged to the mad-hatter branch.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.1-6025 contains indexing commit fc2e1b7 with commit message:
            MB-37340: [Backport to MH] Disable session_consistency_strict scans

            prathibha Prathibha Bisarahalli (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4960 contains indexing commit fc2e1b7 with commit message:
            MB-37340: [Backport to MH] Disable session_consistency_strict scans


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-1006.5.1-1008 contains indexing commit fc2e1b7 with commit message:
            MB-37340: [Backport to MH] Disable session_consistency_strict scans

            ajay.bhullar Ajay Bhullar added a comment -

            In the comments I see two scenarios the fix pertains to:
            In Alice we have a known bug, MB-32087, where we can return stale results even with Session Consistent scans (when there is a bucket flush). The fix for MB-32087 is done in MH. Now, identifying a bucket flush for every scan request would result in throughput degradation. So, to avoid all scans getting affected, an optimisation was introduced with the fix for MB-32087. This optimisation affects scans in two use cases:
            (1) an actual bucket flush, AND
            (2) the index snapshot TS being ahead of the "reused" value of the bucket TS, without a bucket flush (sketched below).

            Are there any other scenarios? Also, how would I go about recreating the scenarios mentioned above? Thanks
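
            As a hypothetical illustration (not indexer code), the two quoted use cases amount to a check like the following Go sketch; the function name, parameters, and the simple per-vbucket comparison are assumptions made only to show when a scan would fall into session_consistency_strict mode.

            package main

            import "fmt"

            // needsStrictMode mirrors the two conditions quoted above under which the
            // MB-32087 optimisation switches a session-consistent scan into
            // session_consistency_strict mode.
            func needsStrictMode(bucketFlushDetected bool, snapshotSeqnos, cachedBucketSeqnos []uint64) bool {
                // (1) an actual bucket flush
                if bucketFlushDetected {
                    return true
                }
                // (2) the index snapshot TS is ahead of the "reused" (cached) bucket TS,
                // which can happen without any flush because the cached seqnos lag behind.
                for i := range snapshotSeqnos {
                    if snapshotSeqnos[i] > cachedBucketSeqnos[i] {
                        return true
                    }
                }
                return false
            }

            func main() {
                // No flush, but the cached bucket seqnos are stale relative to the snapshot:
                // the scan still enters strict mode, matching what was seen in this test.
                fmt.Println(needsStrictMode(false, []uint64{105, 200}, []uint64{100, 200})) // true
                fmt.Println(needsStrictMode(false, []uint64{100, 200}, []uint64{100, 200})) // false
            }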

            ajay.bhullar Ajay Bhullar made changes -
            Labels approved-for-mad-hatter performance secondary-index approved-for-mad-hatter performance request-dev-verify secondary-index

            prathibha Prathibha Bisarahalli (Inactive) added a comment -

            Hi Ajay Bhullar,

            This bug fix reverses the fix done for MB-32087. For the current fix, it would be good to cover the scenarios below:
            1. Regression tests covering all three consistency options (Any, Request Plus and Query consistency).
            2. Without any rollback or bucket flush: run repeated scans with all three consistency options while the indexer generates snapshots rapidly (see the sketch after this comment), for example:
            a. MOI storage with a simple index on a single small field.
            b. MOI storage with an array index on an array that does not exist in the documents. Ex: CREATE INDEX `idx1` ON `bucket-1`(ALL array1), where array1 is missing in the documents; since array1 is missing, there is not much work done on storage, ensuring fast snapshots.
            3. Bucket flush: a request_plus scan right after a bucket flush will return results instead of 0. Please note, this point is not a validation of the fix but just an old behavior we are reverting to. The tracking bug for this issue is MB-32087.

            Sharath Sulochana: Please update when the Sync Gateway load tests have run successfully for multiple iterations with the fix.
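
            As a rough illustration of scenario 2 above (repeated scans with different consistency options), a sketch using the Couchbase Go SDK might look like the following. The connection string, credentials, bucket and field names, and the choice of gocb v2 are assumptions for illustration, not part of the test plan; the "Query"/at_plus variant would typically supply a MutationState via QueryOptions.ConsistentWith and is omitted here.

            package main

            import (
                "fmt"
                "log"
                "time"

                "github.com/couchbase/gocb/v2"
            )

            func main() {
                // Hypothetical cluster and credentials; adjust for the actual test environment.
                cluster, err := gocb.Connect("couchbase://172.23.100.190", gocb.ClusterOptions{
                    Authenticator: gocb.PasswordAuthenticator{Username: "Administrator", Password: "password"},
                })
                if err != nil {
                    log.Fatal(err)
                }
                defer cluster.Close(nil)

                // Two of the three consistency options; the third ("Query"/at_plus)
                // would use QueryOptions.ConsistentWith with a MutationState.
                consistencies := map[string]gocb.QueryScanConsistency{
                    "not_bounded":  gocb.QueryScanConsistencyNotBounded,
                    "request_plus": gocb.QueryScanConsistencyRequestPlus,
                }

                statement := "SELECT META().id FROM `bucket-1` WHERE ANY a IN array1 SATISFIES a = 'x' END LIMIT 10"

                // Repeated scans, alternating consistency levels, watching for timeouts
                // or result drift between iterations.
                for i := 0; i < 100; i++ {
                    for name, sc := range consistencies {
                        start := time.Now()
                        res, err := cluster.Query(statement, &gocb.QueryOptions{ScanConsistency: sc})
                        if err != nil {
                            log.Printf("iteration %d %s: query failed: %v", i, name, err)
                            continue
                        }
                        rows := 0
                        for res.Next() {
                            rows++
                        }
                        if err := res.Err(); err != nil {
                            log.Printf("iteration %d %s: row iteration failed: %v", i, name, err)
                            continue
                        }
                        fmt.Printf("iteration %d %s: %d rows in %v\n", i, name, rows, time.Since(start))
                    }
                }
            }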


            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Prathibha Bisarahalli - All Sync Gateway tests are successful. No more intermittent timeout issues.

            ajay.bhullar Ajay Bhullar added a comment -

            All the above test cases have been tested. All queries were run for ~2 hours for each scan_consistency; they always returned the same results each time, and the amount of time each query took was similar (for 2a and 2b). For the third test case, Girish has confirmed we see the behavior detailed in MB-32087 before the fix went in. Regression looks clean.

            ajay.bhullar Ajay Bhullar added a comment -

            Closing as all the testing detailed above went well and SG tests are no longer intermittently failing.

            ajay.bhullar Ajay Bhullar made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-1194 contains indexing commit f631c36 with commit message:
            MB-37340: Disable session_consistency_strict scans

            wayne Wayne Siu made changes -
            Link This issue is caused by MB-32087 [ MB-32087 ]

            People

              Assignee: deepkaran.salooja Deepkaran Salooja
              Reporter: sharath.sulochana Sharath Sulochana (Inactive)
              Votes: 0
              Watchers: 13
