Couchbase Server / MB-51289

[System Test] Queries stuck in timeout/submitted state for 9+ hrs, blocking rebalance


Details

    Description

      Build : 7.1.0-2416
      Test : -test tests/integration/neo/test_neo_couchstore_milestone4.yml -scope tests/integration/neo/scope_couchstore.yml
      Scale : 3
      Iteration : 1st

      There are currently 144 queries stuck in the "timeout" or "submitted" state, and they are blocking a rebalance operation to add a new query node (172.23.104.137) to the cluster. The rebalance has been in progress for 5.5+ hrs, more than 5 hrs of which have been spent in the query service rebalance phase due to this issue.

      select state,count(*) from system:active_requests where state!="running" group by state
      [
        {
          "$1": 128,
          "state": "timeout"
        },
        {
          "$1": 16,
          "state": "submitted"
        }
      ]
      

      1. Do we really need a graceful shutdown when adding a new query node to the cluster?
      2. What is causing these queries to time out or remain in the submitted state?

      Query nodes : 172.23.104.137, 172.23.104.155, 172.23.104.157

      Attached :
      1. cbcollect
      2. active_requests dumps from all 3 nodes

      Not sure if this is a regression or related to a recent change in the longevity test to run N1QL statements in JS UDFs. The previous run of the same test with 7.1.0-2400 did not show this issue.

      UPDATE: rebalance completed successfully after I manually cancelled all the above queries.
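
      For reference, the manual cancellation can be done through the query service itself: deleting entries from system:active_requests cancels the corresponding requests. A minimal example (adjust the predicate as needed before running it against a live cluster):

      ```sql
      -- Cancel every request that is not actively running.
      -- Here state != "running" matches the stuck "timeout" and "submitted" requests.
      DELETE FROM system:active_requests WHERE state != "running";
      ```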

      Attachments

        Issue Links


          Activity

            Donald.haggart Donald Haggart added a comment - - edited

            I'd have thought a query timeout would help and the shutdown would proceed; specifically, it was the "submitted" requests that were halting proceedings.

            https://github.com/couchbase/query/blob/9c49deb6997970b3e841039a634886eca78eac67/server/server.go#L1483

            The query.log on the nodes in question contain messages indicating waiting for statements...

            However, graceful shutdown only applies to a node removal - yet this was apparently during an add operation?

            [edit: not on the add, on the remove... we have a sequence:
            We have 2 nodes:

            2022-03-02T09:19:12.405-08:00 [INFO] Topology: d9cd7ba03d15cd7484ee364bf219632c[172.23.104.155:8091] 5fb043c7c911e9ea38fac7f66e61eaf0[172.23.104.157:8091]

            Then ~1 hour later we have notification of 3 nodes:

            2022-03-02T10:34:20.370-08:00 [INFO] Topology: 619857388f2e9d05b5d98efd5f9649dc[172.23.104.137:8091] d9cd7ba03d15cd7484ee364bf219632c[172.23.104.155:8091] 5fb043c7c911e9ea38fac7f66e61eaf0[172.23.104.157:8091]

            Then back to 2 ~30 mins later (which we react to with the graceful shutdown):

            2022-03-02T11:07:27.263-08:00 [INFO] Topology: d9cd7ba03d15cd7484ee364bf219632c[172.23.104.155:8091] 5fb043c7c911e9ea38fac7f66e61eaf0[172.23.104.157:8091] 
            2022-03-02T11:07:27.367-08:00 [INFO] Topology change: shutdown initiated on: 619857388f2e9d05b5d98efd5f9649dc[172.23.104.137:8091]

            ]

            I can see on .137:

             

            2022-03-02T11:52:14.495-08:00 [INFO] Topology change: shutdown initiated on: 5fb043c7c911e9ea38fac7f66e61eaf0[172.23.104.157:8091]

            And on .157:

            2022-03-02T11:52:14.395-08:00 [INFO] Graceful shutdown initiated.
            2022-03-02T11:52:14.396-08:00 [INFO] Shutdown: Waiting for 16 active request(s) and 0 active transaction(s) to complete.
            ...
            2022-03-02T16:42:25.744-08:00 [INFO] Shutdown: Waiting for 16 active request(s) and 0 active transaction(s) to complete.
            

             

            -> Those are the 16 "submitted" requests that aren't moving out of that state to run, complete, and allow the shutdown to continue.

            I guess we could ignore SUBMITTED requests too, but that seems like masking over a problem - why didn't they get to run? (Perhaps for the same reason that the 128 others reached timeout?) And logically, if they were issued and accepted before the shutdown was initiated, why should the graceful shutdown deny them the opportunity to run?

            An example "submitted" statement:

            {"clientContextID":"query_thread_7","elapsedTime":"5h21m55.374380698s","executionTime":"5h21m55.374333826s","requestId":"f0e5712e-4ce7-43f7-9746-9cad34adaab4","requestTime":"2022-03-02T11:17:08.741-08:00","scanConsistency":"scan_plus","state":"submitted","statement":"EXECUTE FUNCTION run_n1ql_query('bucket5')","users":"Administrator"}

            shows it was requested 30+ minutes before the shutdown was attempted.

            We do see:

             

            16 @ 0xd72c65 0xd847e5 0xd847ce 0xda67c5 0xdc64e5 0x1fa7233 0x1fab1ca 0x1faaf65 0x1faa6a5 0x1fa9867 0x20e398a 0x14f2e33 0x10a9863 0x10a4c8d 0xdaa861
            # 0xda67c4  sync.runtime_Semacquire+0x44            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/runtime/sema.go:56
            # 0xdc64e4  sync.(*WaitGroup).Wait+0x64           /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/sync/waitgroup.go:130
            # 0x1fa7232 github.com/couchbase/query/server.(*BaseRequest).sleep+0x32   /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/request.go:1145
            # 0x1fab1c9 github.com/couchbase/query/server.(*runQueue).addRequest+0x1a9    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:833
            # 0x1faaf64 github.com/couchbase/query/server.(*runQueue).enqueue+0x84    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:785
            # 0x1faa6a4 github.com/couchbase/query/server.(*Server).handlePlusRequest+0x44  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:672                                                                      
            # 0x1fa9866 github.com/couchbase/query/server.(*Server).PlusServiceRequest+0x86 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:602
            # 0x20e3989 github.com/couchbase/query/server/http.(*HttpEndpoint).ServeHTTP+0x329  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/http/service_endpoint.go:445
            # 0x14f2e32 github.com/gorilla/mux.(*Router).ServeHTTP+0xd2       /home/couchbase/.cbdepscache/gomodcache/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210
            # 0x10a9862 net/http.serverHandler.ServeHTTP+0xa2         /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/net/http/server.go:2867
            # 0x10a4c8c net/http.(*conn).serve+0x8cc            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/net/http/server.go:1932
            

             

            16 goroutines seemingly stuck here - it seems the servicer gate is not being signalled.
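
            The queueing behaviour these stacks show can be sketched in miniature (hypothetical, heavily simplified names - not the actual runQueue implementation): a fixed pool of servicer slots, with surplus requests parked until a running request releases a slot. If the running requests never finish, the parked ones stay "submitted" until their own timeout fires:

            ```go
            package main

            import (
            	"fmt"
            	"sync"
            	"time"
            )

            // runQueue is a toy stand-in for the server's run queue: a fixed number
            // of servicer slots, with requests beyond that parked until signalled.
            type runQueue struct {
            	mu      sync.Mutex
            	free    int             // servicer slots currently available
            	waiters []chan struct{} // parked ("submitted") requests
            }

            func newRunQueue(servicers int) *runQueue {
            	return &runQueue{free: servicers}
            }

            // enqueue blocks until a servicer slot is available or the timeout
            // fires; it returns true only if the request got to run.
            func (q *runQueue) enqueue(timeout time.Duration) bool {
            	q.mu.Lock()
            	if q.free > 0 {
            		q.free--
            		q.mu.Unlock()
            		return true
            	}
            	wake := make(chan struct{})
            	q.waiters = append(q.waiters, wake)
            	q.mu.Unlock()
            	select {
            	case <-wake:
            		return true
            	case <-time.After(timeout):
            		return false // still "submitted": no servicer ever signalled us
            	}
            }

            // release is called when a running request finishes; it hands the slot
            // to the oldest parked request, if any. It is never called in this demo.
            func (q *runQueue) release() {
            	q.mu.Lock()
            	defer q.mu.Unlock()
            	if len(q.waiters) > 0 {
            		wake := q.waiters[0]
            		q.waiters = q.waiters[1:]
            		close(wake)
            		return
            	}
            	q.free++
            }

            func main() {
            	q := newRunQueue(1)
            	q.enqueue(0) // first request takes the only servicer and never finishes
            	// The second request parks; since release() is never called, it stays
            	// queued until its timeout, like the 16 "submitted" requests here.
            	ran := q.enqueue(100 * time.Millisecond)
            	fmt.Println("second request ran:", ran)
            }
            ```

            The parked request only gives up when its own timeout fires, which matches the observed behaviour: the 16 submitted requests could not run because every servicer was occupied by a blocked request.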

             

            Donald.haggart Donald Haggart added a comment - - edited

            So, of course, there are plenty of requests already running, which would mean these 16 don't get serviced:

            9 @ 0xd72c65 0xd3e68f 0xd3e2cb 0x20f249c 0x1fabf72 0x1faa7a5 0x1fa9867 0x20e398a 0x14f2e33 0x10a9863 0x10a4c8d 0xdaa861
            # 0x20f249b github.com/couchbase/query/server/http.(*httpRequest).Execute+0x4fb /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/http/service_response.go:159
            # 0x1fabf71 github.com/couchbase/query/server.(*Server).serviceRequest+0x8d1  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:1112
            # 0x1faa7a4 github.com/couchbase/query/server.(*Server).handlePlusRequest+0x144 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:683
            # 0x1fa9866 github.com/couchbase/query/server.(*Server).PlusServiceRequest+0x86 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:602
            # 0x20e3989 github.com/couchbase/query/server/http.(*HttpEndpoint).ServeHTTP+0x329  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/http/service_endpoint.go:445
            # 0x14f2e32 github.com/gorilla/mux.(*Router).ServeHTTP+0xd2       /home/couchbase/.cbdepscache/gomodcache/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210
            # 0x10a9862 net/http.serverHandler.ServeHTTP+0xa2         /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/net/http/server.go:2867
            # 0x10a4c8c net/http.(*conn).serve+0x8cc            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/net/http/server.go:1932
            

            119 @ 0xd72c65 0xd3e68f 0xd3e2cb 0x20f249c 0x1fabf72 0x1faa807 0x1fa9867 0x20e398a 0x14f2e33 0x10a9863 0x10a4c8d 0xdaa861
            # 0x20f249b github.com/couchbase/query/server/http.(*httpRequest).Execute+0x4fb /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/http/service_response.go:159
            # 0x1fabf71 github.com/couchbase/query/server.(*Server).serviceRequest+0x8d1  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:1112
            # 0x1faa806 github.com/couchbase/query/server.(*Server).handlePlusRequest+0x1a6 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:687
            # 0x1fa9866 github.com/couchbase/query/server.(*Server).PlusServiceRequest+0x86 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/server.go:602
            # 0x20e3989 github.com/couchbase/query/server/http.(*HttpEndpoint).ServeHTTP+0x329  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/server/http/service_endpoint.go:445
            # 0x14f2e32 github.com/gorilla/mux.(*Router).ServeHTTP+0xd2       /home/couchbase/.cbdepscache/gomodcache/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210
            # 0x10a9862 net/http.serverHandler.ServeHTTP+0xa2         /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/net/http/server.go:2867
            # 0x10a4c8c net/http.(*conn).serve+0x8cc            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/net/http/server.go:1932
            

             

             

            But really it seems we're stuck on the keyspace mutex, at least according to the stacks, with many blocked at couchbase.go:1014

             

            For keyspace lock holders we have:

            1 @ 0xd72c65 0xd847e5 0xd847ce 0xda68e7 0x20ad68a 0x20ad618 0x20af125 0x20ace47 0x20ad35f 0x215654e 0x1da17d2 0xdaa861
            # 0xda68e6  sync.runtime_SemacquireMutex+0x46             /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/runtime/sema.go:71
            # 0x20ad689 sync.(*RWMutex).RLock+0xa9                /home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.6/go/src/sync/rwmutex.go:63
            # 0x20ad617 github.com/couchbase/query/datastore/couchbase.(*namespace).getPool+0x37    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/datastore/couchbase/couchbase.go:1096
            # 0x20af124 github.com/couchbase/query/datastore/couchbase.newKeyspace+0x44       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/datastore/couchbase/couchbase.go:1304
            # 0x20ace46 github.com/couchbase/query/datastore/couchbase.(*namespace).keyspaceByName+0x446  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/datastore/couchbase/couchbase.go:1028
            # 0x20ad35e github.com/couchbase/query/datastore/couchbase.(*namespace).KeyspaceById+0x3e   /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/datastore/couchbase/couchbase.go:1065
            # 0x215654d github.com/couchbase/query/datastore/system.(*indexIndex).ScanEntries+0x62d   /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/datastore/system/system_keyspace_indexes.go:512
            # 0x1da17d1 github.com/couchbase/query/execution.(*PrimaryScan).scanEntries+0x151     /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/query/execution/scan_primary.go:202
            

            Waiting on read acquisition of the pool nslock.

            And:

            <<< waiting for nslock >>>
            # 0x20af046 github.com/couchbase/query/datastore/couchbase.(*namespace).reload2+0x5e6   query/datastore/couchbase/couchbase.go:1259
            # 0x20ae035 github.com/couchbase/query/datastore/couchbase.(*namespace).reload+0x195    query/datastore/couchbase/couchbase.go:1176
            # 0x20afc1b github.com/couchbase/query/datastore/couchbase.newKeyspace+0xb3b      query/datastore/couchbase/couchbase.go:1312
            # 0x20ace46 github.com/couchbase/query/datastore/couchbase.(*namespace).keyspaceByName+0x446  query/datastore/couchbase/couchbase.go:1028   <<< entry lock held to get here >>>
            # 0x20ac91e github.com/couchbase/query/datastore/couchbase.(*namespace).KeyspaceByName+0x3e  query/datastore/couchbase/couchbase.go:893
            # 0x13a2f38 github.com/couchbase/query/datastore.GetKeyspace+0x218          query/datastore/datastore.go:298
            # 0x15a3e17 github.com/couchbase/query/plannerbase.NewBaseKeyspace+0x157        query/plannerbase/keyspace.go:62
            ... 

            Waiting on write acquisition of the pool nslock. (There are 2 routines waiting on this lock here.)

            These hold the pool nslocks:

            <<< waiting for entry lock >>>
            # 0x20ad011 github.com/couchbase/query/datastore/couchbase.(*namespace).keyspaceByName+0x611  datastore/couchbase/couchbase.go:1014
            # 0x20ac76b github.com/couchbase/query/datastore/couchbase.(*namespace).KeyspaceByName+0x42b  query/datastore/couchbase/couchbase.go:893
            # 0x20ac74e github.com/couchbase/query/datastore/couchbase.(*namespace).Objects+0x40e   query/datastore/couchbase/couchbase.go:857   <<< nslock.RLock held to get here >>>
            # 0x21560cd github.com/couchbase/query/datastore/system.(*indexIndex).ScanEntries+0x1ad   query/datastore/system/system_keyspace_indexes.go:505
            # 0x1da17d1 github.com/couchbase/query/execution.(*PrimaryScan).scanEntries+0x151     query/execution/scan_primary.go:202
            

            And are waiting on the keyspace locks - so we have a deadlock.
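
            The lock cycle above reduces to a classic ABBA pattern. A minimal sketch (hypothetical names standing in for the real locks in couchbase.go, not the actual code): one goroutine takes nslock for read and then wants the entry lock, while the other holds the entry lock and then wants nslock for write, so neither can proceed:

            ```go
            package main

            import (
            	"fmt"
            	"sync"
            	"time"
            )

            func main() {
            	var nslock sync.RWMutex  // stands in for the pool's namespace lock
            	var entryLock sync.Mutex // stands in for the per-keyspace entry lock

            	aHolds := make(chan struct{})
            	bHolds := make(chan struct{})
            	done := make(chan struct{})

            	// Goroutine A (like Objects -> keyspaceByName): holds nslock.RLock,
            	// then wants the entry lock.
            	go func() {
            		nslock.RLock()
            		close(aHolds)
            		<-bHolds         // wait until B holds the entry lock
            		entryLock.Lock() // blocks forever: B holds it
            		entryLock.Unlock()
            		nslock.RUnlock()
            		close(done)
            	}()

            	// Goroutine B (like keyspaceByName -> reload): holds the entry lock,
            	// then wants nslock for write, which A's read lock blocks.
            	go func() {
            		entryLock.Lock()
            		close(bHolds)
            		<-aHolds      // wait until A holds the read lock
            		nslock.Lock() // blocks forever behind A's RLock
            		nslock.Unlock()
            		entryLock.Unlock()
            	}()

            	select {
            	case <-done:
            		fmt.Println("completed")
            	case <-time.After(500 * time.Millisecond):
            		fmt.Println("deadlock: neither goroutine can make progress")
            	}
            }
            ```

            The channels only force the fatal interleaving deterministically; in the real system the same cycle arises by timing, which is why it only shows up under sustained load.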

             


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2441 contains query commit 0aed5d3 with commit message:
            MB-51289 remove namespace / keyspace stall

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.2.0-1008 contains query commit 0aed5d3 with commit message:
            MB-51289 remove namespace / keyspace stall

            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Not seen in the test run with 7.1.0-2447, which went on for 3 days and 4 iterations.

            People

              mihir.kamdar Mihir Kamdar (Inactive)