Couchbase Server / MB-45081

FTS index build gets stuck with gocbcore@9.1.3


Details

    • Untriaged
    • 1
    • Yes

    Description

      Build : 7.0.0-4708
      Last known good build : 7.0.0-4706

      Steps to repro :
      1. 1 node cluster with kv, index, query, search services
      2. Install beer-sample bucket
      3. Create an FTS index with default mapping with 6 partitions.
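
      Step 3 can also be scripted against the Search service REST API. The following is a minimal sketch in Go, not the procedure used in this ticket. It assumes the Search service listens on port 8094 of the local node, that the index is called `beer-idx` (a hypothetical name), that `sourceType` is `couchbase`, and that the partition count is set through `planParams.indexPartitions`. The request is only built, not sent, and credentials are omitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// indexDef is a trimmed-down index definition. Leaving out "params"
// is assumed to give the default mapping.
type indexDef struct {
	Name       string         `json:"name"`
	Type       string         `json:"type"`
	SourceType string         `json:"sourceType"`
	SourceName string         `json:"sourceName"`
	PlanParams map[string]int `json:"planParams"`
}

// newIndexRequest builds (but does not send) the PUT request that would
// create the index. baseURL and name are illustrative values.
func newIndexRequest(baseURL, name, bucket string, partitions int) (*http.Request, error) {
	def := indexDef{
		Name:       name,
		Type:       "fulltext-index",
		SourceType: "couchbase", // assumption; see the note above
		SourceName: bucket,
		PlanParams: map[string]int{"indexPartitions": partitions},
	}
	body, err := json.Marshal(def)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/api/index/"+name, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newIndexRequest("http://127.0.0.1:8094", "beer-idx", "beer-sample", 6)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

      To actually create the index, the request would still need basic-auth credentials and would be sent with `http.DefaultClient.Do(req)`.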

      Indexing gets stuck with 7156 docs at 97.99%. See screenshot.

      Changelog between the builds : http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.0.0&fromBuild=4706&toVersion=7.0.0&toBuild=4708&f_cbft=on&f_cbftx=on&f_cbgt=on&f_kv_engine=on&f_n1fty=on&f_ns_server=on&f_plasma=on&f_product-metadata=on&f_query=on&f_testrunner=on

      Logs : https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_idx_stuck/collectinfo-2021-03-18T222546-ns_1%40127.0.0.1.zip

          Activity

            mihir.kamdar Mihir Kamdar created issue -
            abhinav Abhinav Dangeti made changes -
            Field Original Value New Value
            Assignee Keshav Murthy [ keshav ] Abhinav Dangeti [ abhinav ]
            mihir.kamdar Mihir Kamdar made changes -
            Description Added the Logs link; all other text unchanged.

            abhinav Abhinav Dangeti added a comment -

            Higher memory consumption is the reason: the app herder has floored the brake.

            abhinav Abhinav Dangeti made changes -
            Attachment run-memory-revert-aa0eefe.pprof [ 131750 ]
            Attachment run-memory-9.1.3.pprof [ 131751 ]
            abhinav Abhinav Dangeti made changes -
            Attachment Screen Shot 2021-03-18 at 5.37.35 PM.png [ 131752 ]
            Attachment Screen Shot 2021-03-18 at 5.44.02 PM.png [ 131753 ]
            abhinav Abhinav Dangeti made changes -
            Attachment run-memory-revert-aa0eefe.pprof [ 131750 ]
            abhinav Abhinav Dangeti made changes -
            Attachment Screen Shot 2021-03-18 at 5.37.35 PM.png [ 131752 ]
            abhinav Abhinav Dangeti made changes -
            Attachment Screen Shot 2021-03-18 at 5.44.02 PM.png [ 131753 ]
            abhinav Abhinav Dangeti made changes -
            Attachment run-memory-9.1.3.pprof [ 131751 ]
            abhinav Abhinav Dangeti made changes -
            Attachment run-memory-9.1.3.pprof [ 131759 ]
            Attachment Screen Shot 2021-03-18 at 6.01.01 PM.png [ 131760 ]
            abhinav Abhinav Dangeti added a comment - - edited

            Ok, I'm glad we caught this early.

            Not only do I see higher memory consumption that's causing VERY SLOW indexing progress, some of the DCP mutations aren't making it into FTS at all.

            I've collected heap profiles during the index build, and the heap is almost completely dominated by gocbcore's dialMemdClient.

            I've found the change within gocbcore that's responsible for this (sorry James Lee, not-so-good news from FTS on your change):

            commit aa0eefea765225169e0bce8e6026e70be6aab364
            Author: James Lee <james.lee@couchbase.com>
            Date:   Mon Feb 8 16:58:25 2021 +0000

                GOCBC-1056 Reduce CPU usage
                
                Motivation
                ----------
                After doing some performance testing of cbbackupmgr we noticed that we
                were spending a significant amount of CPU time in gocbcore. We'd like to
                reduce this as much as possible to free up CPU time for cbbackupmgr.
                
                Changes
                -------
                1) Avoid unnecessary allocations of DCP packets/TCP write buffers
                2) Avoid unnecessary branching using if statements in the golden path
                   where a switch statement would suffice.
                3) Rewrote the memdopmap implementation to use a map structure instead
                   of a doubly linked list structure. This avoids linear traversal for
                   each request which depending on the scenario could result in
                   not-insignificant overhead.
                4) Added the ability to disable buffer acknowledgement via a option
                   using the DCP agent.
                5) Use a 20MB buffered reader on each memcached connection, the same
                   size as used by KV engine.
            ...
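
            Item 5 of the commit message is worth a rough sanity check against the heap profile, because a buffered reader allocates its whole buffer up front. A minimal sketch in Go using the standard library's `bufio.NewReaderSize` follows. The connection count of 6 is purely illustrative and is not taken from this ticket, and whether this buffer is what actually dominates `dialMemdClient` is an assumption, not something confirmed in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readerBufSize mirrors the 20MB figure stated in the commit message.
const readerBufSize = 20 * 1024 * 1024

// readBufferFootprint returns the number of bytes reserved for read
// buffers when each of conns connections gets its own buffered reader.
func readBufferFootprint(conns int) int {
	total := 0
	for i := 0; i < conns; i++ {
		// strings.NewReader stands in for a real memcached connection.
		r := bufio.NewReaderSize(strings.NewReader(""), readerBufSize)
		total += r.Size() // Size reports the size of the underlying buffer
	}
	return total
}

func main() {
	// 6 is a hypothetical connection count chosen for illustration.
	fmt.Printf("%d MiB\n", readBufferFootprint(6)/(1024*1024))
}
```

            The point of the sketch is only that the cost scales linearly with the number of connections, so a process holding many memcached connections pays the buffer size many times over.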

            On reverting the change above, index progress is quick again, back to how it was before we upgraded to gocbcore@9.1.3. I also grabbed a heap profile during the index build this time around.

            Brett Lawson, Charles Dixon: what's the play here? Should FTS:

            • Wait for a quick fix within gocbcore and an immediate release after
            • Revert aa0eefe within gocbcore and an immediate release after (and perhaps try adding the fixed/corrected change back later)
            • Move FTS back to gocbcore@9.1.2 until a fix is made to this and then skip 9.1.3 altogether.

            Please be advised that we'll need to unblock QE as quickly as possible.

            abhinav Abhinav Dangeti made changes -
            Assignee Abhinav Dangeti [ abhinav ] Brett Lawson [ brett19 ]
            abhinav Abhinav Dangeti made changes -
            Labels build-sanity functional-test build-sanity functional-test gocbcore
            brett19 Brett Lawson added a comment -

            Hey Abhinav Dangeti,

            Looking at the code submitted by James Lee, I'm now seeing that there is a slight difference in semantics with regard to the handling of DCP shutdown, which is likely leading to this problem. Specifically, some of the notification channels are now being signalled indiscriminately rather than only under certain circumstances. However, I also think there may have been a bug here before the changes went in, and it was simply less likely to be triggered. I'm going to take a look at this and should have a potential fix (at least if this is indeed the issue) by tomorrow afternoon.

            P.S. We currently rely heavily on the services using us to test the edges of DCP functionality due to the challenges that exist in testing it. Once I have a potential fix tomorrow, is it going to be possible to rerun the testing and unambiguously validate that it corrected the problem?

            Cheers, Brett
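
            To make the distinction in this comment concrete, here is a deliberately simplified Go sketch of signalling a notification channel unconditionally versus only when a condition holds. The type, field, and method names are hypothetical and are not gocbcore's. It illustrates the general pattern only, not the actual bug or its fix.

```go
package main

import "fmt"

// stream is a toy stand-in for a DCP stream with a close-notification channel.
type stream struct {
	closeNotify chan struct{}
	userClosed  bool // true only when the consumer asked for the close
}

// signalAlways notifies on every shutdown path, whether or not anyone asked.
func (s *stream) signalAlways() {
	select {
	case s.closeNotify <- struct{}{}:
	default: // a signal is already pending, so drop this one
	}
}

// signalIfRequested notifies only when the consumer initiated the close.
func (s *stream) signalIfRequested() {
	if !s.userClosed {
		return
	}
	s.signalAlways()
}

// pending reports how many notifications are waiting to be read.
func (s *stream) pending() int { return len(s.closeNotify) }

func main() {
	a := &stream{closeNotify: make(chan struct{}, 1)}
	a.signalAlways()

	b := &stream{closeNotify: make(chan struct{}, 1)}
	b.signalIfRequested()

	fmt.Println(a.pending(), b.pending()) // prints "1 0"
}
```

            A consumer that treats any value on the channel as "the close I requested has completed" would behave differently under the two variants, which is the kind of semantic drift described above.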


            abhinav Abhinav Dangeti added a comment -

            Sounds good, Brett Lawson. Absolutely, I'll be able to quickly test your fix; ping me when you have it up on Gerrit.

            keshav Keshav Murthy added a comment -

            Abhinav Dangeti: QE would like to have a full run shortly. Will there be any regression/breakage if you revert to gocbcore@9.1.2 now and move to 9.1.3 after verifying and testing the fix?

            abhinav Abhinav Dangeti added a comment -

            Keshav Murthy, we've reverted to using gocbcore@9.1.2 while we wait for Brett.
            brett19 Brett Lawson made changes -
            Link This issue relates to GOCBC-1073 [ GOCBC-1073 ]
            abhinav Abhinav Dangeti made changes -
            Summary FTS index build gets stuck FTS index build gets stuck with gocbcore@9.1.3
            abhinav Abhinav Dangeti made changes -
            Priority Test Blocker [ 6 ] Critical [ 2 ]

            abhinav Abhinav Dangeti added a comment -

            Reducing this to "Critical" for now, as QE is no longer blocked by this.
            abhinav Abhinav Dangeti made changes -
            Assignee Brett Lawson [ brett19 ] Abhinav Dangeti [ abhinav ]
            abhinav Abhinav Dangeti made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4752 contains cbft commit 8de7e56 with commit message:
            MB-45081: Upgrade to gocbcore@9.1.3 + fixes

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4752 contains cbftx commit 18f134c with commit message:
            MB-45081: Upgrade to gocbcore@9.1.3 + fixes

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4752 contains cbgt commit 8e5e597 with commit message:
            MB-45081: Upgrade to gocbcore@9.1.3 + fixes

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4752 contains n1fty commit c4dc5a2 with commit message:
            MB-45081: Upgrade to gocbcore@9.1.3 + fixes

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4752 contains query commit df4fa9a with commit message:
            MB-45081: Upgrade to gocbcore@9.1.3 + fixes
            mihir.kamdar Mihir Kamdar added a comment -

            Closing as verified on the recent Cheshire-Cat builds. The failures no longer appear in build sanity.
            mihir.kamdar Mihir Kamdar made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

            People

              abhinav Abhinav Dangeti
              mihir.kamdar Mihir Kamdar