Couchbase Server · MB-45989

FTS: rollback not happening - FTS not catching up with KV



    Description

      Build: 7.0.0-5017
      Test Suite: moving-topology-scorch_5.5_P1
      Rollback Tests:

      ./testrunner -i /tmp/testexec.21146.ini -p get-cbcollect-info=False,disable_HTP=True,get-logs=False,stop-on-failure=False,GROUP=P1,index_type=scorch,fts_quota=990 -t fts.moving_topology_fts.MovingTopFTS.partial_rollback,cluster=D,D,F,items=50000,update=True,upd=30,upd_del_fields=['dept'],GROUP=P1,get-cbcollect-info=False,disable_HTP=True,get-logs=False,stop-on-failure=False,index_type=scorch,fts_quota=990

      ./testrunner -i /tmp/testexec.21146.ini -p get-cbcollect-info=False,disable_HTP=True,get-logs=False,stop-on-failure=False,GROUP=P1,index_type=scorch,fts_quota=990 -t fts.moving_topology_fts.MovingTopFTS.partial_rollback,cluster=D,D,F,items=50000,delete=True,del=10,GROUP=P1,get-cbcollect-info=False,disable_HTP=True,get-logs=False,stop-on-failure=False,index_type=scorch,fts_quota=990

      Test steps:

      • Cluster setup: n1:kv, n2:kv, n3:search
      • Load 50,000 docs with field and value mutated:0
      • Create default index
      • Stop persistence on both KV nodes
      • Update 15,000 docs to mutated:1
      • Get the query count before rollback with {"query": "mutated:>0"}; it returns 15,000
      • Kill memcached on n1
      • Start persistence on n2
      • Failover n2
      • KV rolled back: the 15,000 updated docs reverted to their pre-update state, i.e. mutated:0
      • We expect the FTS index to roll back and update along with KV, but it does not
      • The query {"query": "mutated:>0"} still returns 15,000 hits, while KV has 0 matching docs
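The failing check above boils down to comparing the FTS hit count against KV: the search response's total_hits should drop to 0 once KV rolls back. A minimal Go sketch of extracting that count (totalHits is a hypothetical helper written for illustration; the JSON sample is illustrative, not captured from this run, though the top-level total_hits field matches the FTS search response):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// totalHits pulls the total_hits field out of an FTS search response
// body. Hypothetical helper; only the top-level total_hits field of the
// search REST response is assumed here.
func totalHits(body []byte) (int, error) {
	var resp struct {
		TotalHits int `json:"total_hits"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	return resp.TotalHits, nil
}

func main() {
	// Illustrative response fragment: what the index still reports
	// after KV has rolled the mutations back.
	sample := []byte(`{"total_hits": 15000, "took": 1234}`)
	ftsCount, err := totalHits(sample)
	if err != nil {
		panic(err)
	}
	kvCount := 0 // KV has 0 docs matching mutated:>0 after rollback
	fmt.Printf("fts=%d kv=%d mismatch=%v\n", ftsCount, kvCount, ftsCount != kvCount)
	// prints: fts=15000 kv=0 mismatch=true
}
```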

      curl -XGET  http://172.23.96.141:8094/api/nsstats -u Administrator:password | jq | grep rollback
        "default:default_index:timer_rollback_count": 0,
        "tot_rollback_full": 0,
        "tot_rollback_partial": 0,
      

      Attached a screenshot where the query mutated:1 returns 15,000 results, but inspecting the docs shows mutated is actually 0.

      Logs:

      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rollback/collectinfo-2021-04-29T003740-ns_1%40172.23.96.141.zip [FTS]
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rollback/collectinfo-2021-04-29T003740-ns_1%40172.23.97.211.zip [KV]
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/rollback/collectinfo-2021-04-29T003740-ns_1%40172.23.97.212.zip [KV]

          Activity

Abhinav Dangeti added a comment:

Charles Dixon, that looks right. It seems WaitUntilReady didn't return at that spot either, and then the index was deleted at ..

            2021-05-03T17:02:22.873-07:00 [INFO] rest_delete_index: delete index request received for default_index 

            , but gocbcore continues to log ..

            2021-05-03T17:02:42.451-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc0063cfbc0 failed to bootstrap: authentication failure | {"status_code":36,"document_key":"default","bucket":"default","error_name":"EACCESS","error_description":"Not authorized for command","opaque":3,"context":"Not authenticated","last_dispatched_to":"172.23.97.212:11207","last_dispatched_from":"172.23.96.141:56750","last_connection_id":"9843479e4a95ef5f/7076482a6e26c31b"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
            2021-05-03T17:02:47.463-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc0063cfbc0 failed to bootstrap: authentication failure | {"status_code":36,"document_key":"default","bucket":"default","error_name":"EACCESS","error_description":"Not authorized for command","opaque":3,"context":"Not authenticated","last_dispatched_to":"172.23.97.212:11207","last_dispatched_from":"172.23.96.141:56752","last_connection_id":"9843479e4a95ef5f/cc1a8bf1bf825a01"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

            .. and then the process was killed and restarted at 2021-05-03T17:04:38.

            Debug logging might reveal more here.

Girish Benakappa, the test looks small enough to enable debug logging and get a re-run. Would you run this command on the FTS node (every time the node is restarted; I don't think this setting is persisted):

            curl -X PUT -u <username>:<password> http://<ip>:8094/api/managerOptions -d '{"logLevel":"DEBU"}' 


Abhinav Dangeti added a comment:

Looks like this is rather easy to reproduce.

I just killed memcached on a node (with cluster_run), with FTS running on the same node. Here are the debug logs (towards the very end), where gocbcore blocks indefinitely on "creating new agent".

            fts.log


Abhinav Dangeti added a comment:

I inadvertently introduced a deadlock with my previous fix here. The correct fix is to asynchronously "kick" the manager (and by extension the janitor) from inside the JanitorLoop, to let the manager know to re-initiate a feed that failed because KV was not ready. Here's the fix, which I've verified to work:

            http://review.couchbase.org/c/cbgt/+/152817
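The deadlock and its fix can be illustrated with a toy version of the pattern (Manager, kickCh, and the reason strings are hypothetical stand-ins for the cbgt types, not the code in the review above): a loop that services an unbuffered kick channel must never send itself a kick synchronously, because nothing is left to receive it; re-kicking from a fresh goroutine lets the loop get back to its receive.

```go
package main

import "fmt"

// Toy stand-ins for cbgt's manager/janitor (names hypothetical).
type Manager struct {
	kickCh chan string   // unbuffered: a kick blocks until the janitor receives it
	done   chan struct{} // closed once the retry has been handled (demo only)
}

// Kick asks the janitor to do work; safe from any goroutine EXCEPT
// synchronously from inside JanitorLoop itself.
func (m *Manager) Kick(reason string) {
	m.kickCh <- reason
}

// JanitorLoop services kicks. A synchronous m.Kick here would deadlock:
// this goroutine would block sending on the very channel only it reads.
func (m *Manager) JanitorLoop() {
	for reason := range m.kickCh {
		fmt.Println("janitor:", reason)
		switch reason {
		case "feed failed: KV not ready":
			// The fix pattern: kick asynchronously so the loop can
			// return to receiving.
			go m.Kick("retry feed")
		case "retry feed":
			close(m.done)
			return
		}
	}
}

func main() {
	m := &Manager{kickCh: make(chan string), done: make(chan struct{})}
	go m.JanitorLoop()
	m.Kick("feed failed: KV not ready")
	<-m.done
	fmt.Println("feed re-initiated without deadlock")
}
```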


Couchbase Build Team added a comment:

Build couchbase-server-7.0.0-5103 contains cbgt commit a8cb699 with commit message:
            MB-45989: Kicking the janitor from within the JanitorLoop is prohibited

Girish Benakappa added a comment (edited):

Verified that partial_rollback tests are passing with 7.0.0-5103. Closing this issue.


People

  Abhinav Dangeti
  Girish Benakappa
  Votes: 0
  Watchers: 6
