Couchbase Python Client Library / PYCBC-950

N1QL query connections remain open after completion


Details

    • Type: Bug
    • Status: Closed
    • Priority: Test Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.2
    • Fix Version/s: 3.0.2
    • Component/s: None
    • 1

    Description

       

      I am working on a new N1QL perf test and have been having issues with the Python client's N1QL query latencies and throughput. If I create a bucket with N items and then use one client to issue sequential N1QL queries, I see that one of the machines in the cluster has a linearly increasing number of connections in TIME_WAIT. The queries look like the following:

      SELECT * FROM default:`bucket-1`.`scope-1`.`collection-1` USE KEYS['n1ql-000000000056']

      The target can also be the default scope and collection (if the documents are stored there).
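
      For reference, here is a minimal sketch of the kind of sequential query loop described above. This is not the attached repro.py; the host, credentials, and exact import paths are assumptions (in some early 3.0.x builds PasswordAuthenticator is imported from couchbase_core.cluster instead).

      # Minimal sketch of a sequential N1QL query loop against the collection above.
      from couchbase.cluster import Cluster, ClusterOptions, QueryOptions
      from couchbase.auth import PasswordAuthenticator

      cluster = Cluster('couchbase://172.23.100.70',
                        ClusterOptions(PasswordAuthenticator('Administrator', 'password')))
      bucket = cluster.bucket('bucket-1')  # open the bucket so the client can bootstrap

      for i in range(10000):
          key = 'n1ql-{:012d}'.format(i)
          result = cluster.query(
              'SELECT * FROM default:`bucket-1`.`scope-1`.`collection-1` USE KEYS[$1]',
              QueryOptions(positional_parameters=[key]))
          list(result)  # drain the rows so the request completes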

      Here is a graph showing this: 

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-2351_access_227c#f087785f256442b0b8bdd8d9b9ba328a

      My client gets about 200 queries/sec, and you can see that after 10 seconds there are around 2,000 connections in TIME_WAIT. Even if I set a query timeout of 1000000 (which is 1 million nanoseconds, or 1 millisecond), the connections are not killed.
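
      For context, a per-request query timeout can be set along these lines in SDK 3 (a sketch reusing the cluster object from the snippet above; whether the timeout here was set this way or via the connection string is not stated):

      # Sketch: per-request query timeout via QueryOptions. As discussed below, this
      # bounds the request itself, not the lifetime of the underlying HTTP socket.
      from datetime import timedelta
      from couchbase.cluster import QueryOptions

      result = cluster.query(
          'SELECT * FROM default:`bucket-1`.`scope-1`.`collection-1` USE KEYS[$1]',
          QueryOptions(positional_parameters=['n1ql-000000000056'],
                       timeout=timedelta(milliseconds=1)))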

       

      This is the commit of the python client that I am using: 8d5820a297b35378d314eb399c041b214aa6a7db

       

      This is the lcb logging for a single op:

      45615ms [Ic41f35ec74693510] {11415} [TRACE] (http-io - L:401) <172.23.100.70:8091> GET  http://172.23.100.70:8091/pools. Body=0 bytes 
      45615ms [Ic41f35ec74693510] {11415} [DEBUG] (lcbio_mgr - L:481) <172.23.100.70:8091> (HE=0x32f1200) Creating new connection because none are available in the pool 
      45616ms [Ic41f35ec74693510] {11415} [TRACE] (lcbio_mgr - L:393) <172.23.100.70:8091> (HE=0x32f1200) New pool entry: I=0x42d5b40 
      45616ms [Ic41f35ec74693510] {11415} [INFO] (connection - L:487) <172.23.100.70:8091> (SOCK=5e90741e7c6c856a) Starting. Timeout=75000000us 
      45616ms [Ic41f35ec74693510] {11415} [DEBUG] (connection - L:252) <172.23.100.70:8091> (SOCK=5e90741e7c6c856a) Created new socket with FD=34 
      45616ms [Ic41f35ec74693510] {11415} [TRACE] (connection - L:352) <172.23.100.70:8091> (SOCK=5e90741e7c6c856a) Scheduling I/O watcher for asynchronous connection completion. 
      45616ms [Ic41f35ec74693510] {11415} [INFO] (connection - L:155) <172.23.100.70:8091> (SOCK=5e90741e7c6c856a) Connected established 
      45616ms [Ic41f35ec74693510] {11415} [DEBUG] (connection - L:108) <172.23.100.70:8091> (SOCK=5e90741e7c6c856a) Successfully set TCP_NODELAY 
      45616ms [Ic41f35ec74693510] {11415} [DEBUG] (connection - L:108) <172.23.100.70:8091> (SOCK=5e90741e7c6c856a) Successfully set TCP_KEEPALIVE 
      45616ms [Ic41f35ec74693510] {11415} [DEBUG] (lcbio_mgr - L:349) <172.23.100.70:8091> (HE=0x32f1200) Received result for I=0x42d5b40,C=(nil); E=0x0 
      45616ms [Ic41f35ec74693510] {11415} [DEBUG] (lcbio_mgr - L:309) <172.23.100.70:8091> (HE=0x32f1200) Assigning R=0x3d1bb40 SOCKET=0x4bd9610 
      45616ms [Ic41f35ec74693510] {11415} [DEBUG] (ioctx - L:95) <172.23.100.70:8091> (CTX=0x3b167a0,unknown) Pairing with SOCK=5e90741e7c6c856a 
      45618ms [Ic41f35ec74693510] {11415} [DEBUG] (ioctx - L:140) <172.23.100.70:8091> (CTX=0x3b167a0,mgmt/capi) Destroying context. Pending Writes=0, Entered=true, Socket Refcount=1 
      45618ms [I32775f9223f27d7a] {11415} [TRACE] (http-io - L:401) <172.23.100.55:8093> POST  http://172.23.100.55:8093/query/service. Body=240 bytes 
      45618ms [I32775f9223f27d7a] {11415} [DEBUG] (lcbio_mgr - L:472) <172.23.100.55:8093> (HE=0x3667700) Found ready connection in pool. Reusing socket and not creating new connection 
      45618ms [I32775f9223f27d7a] {11415} [DEBUG] (lcbio_mgr - L:309) <172.23.100.55:8093> (HE=0x3667700) Assigning R=0x370a5a0 SOCKET=0x34bfd10 
      45618ms [I32775f9223f27d7a] {11415} [DEBUG] (ioctx - L:95) <172.23.100.55:8093> (CTX=0x4d29740,unknown) Pairing with SOCK=7df24133bd864f92 
      45620ms [I32775f9223f27d7a] {11415} [DEBUG] (ioctx - L:140) <172.23.100.55:8093> (CTX=0x4d29740,mgmt/capi) Destroying context. Pending Writes=0, Entered=true, Socket Refcount=1 
      45620ms [I32775f9223f27d7a] {11415} [DEBUG] (lcbio_mgr - L:536) <172.23.100.55:8093> (HE=0x3667700) Placing socket back into the pool. I=0x3729840,C=0x34bfd10
      

      The cluster has a data node and a query node, among others. .70 is the data node and .55 is the query node. You can see from the logging that, for each query, lcb first creates a new connection to the data node (.70) and then reuses a connection to the query node (.55). The query connection appears to be reusing sockets, but I see no indication that the data connection is reusing sockets; instead, it creates a new connection per query and does not destroy it after completion.

      Attachments

        repro.py

        Activity

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Sergey Avseyev, can you take a look at this?
          Ellis.Breen Ellis Breen added a comment -

          Hi, apologies for the delay in response - that does sound on the face of it like a libcouchbase issue (though perhaps PYCBC is misconfiguring/misusing LCB) - I'm quite busy with urgent CBSEs/prospects at the moment so hopefully Sergey Avseyev could spot any issues more quickly.

          Will come back to this ASAP otherwise - probably a day or two.


          avsej Sergey Avseyev added a comment -

          The query timeout controls the amount of time allocated for the request, but the connection might still live for a while afterwards.

          If you want to switch this behaviour off, try adding `http_poolsize=0` to the connection string (or LCB_CNTL_HTTP_POOLSIZE for the lcb_cntl function). When you set it to zero, the library will skip the pool for HTTP connections.
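
          For example, applied to the connection string used in this report (a sketch; host and credentials are placeholders, and the option could equally be appended to an existing connection string):

          # Sketch: disabling the LCB HTTP connection pool via the connection string,
          # as suggested above.
          from couchbase.cluster import Cluster, ClusterOptions
          from couchbase.auth import PasswordAuthenticator

          cluster = Cluster('couchbase://172.23.100.70?http_poolsize=0',
                            ClusterOptions(PasswordAuthenticator('Administrator', 'password')))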

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          repro.py

          You can use this script to see the problem.

          First, install the correct Python SDK 3: pip install git+https://github.com/couchbase/couchbase-python-client@8d5820a297b35378d314eb399c041b214aa6a7db

          You must create a cluster with 1 data node and 1 query node. Create a bucket called bucket-1, a scope called scope-1 in bucket-1, and a collection called collection-1 inside scope-1. Be sure to change the cluster IP in the script. While the query loop in the script is running, go to the data node machine and run:

          netstat | grep 'TIME_WAIT' |wc -l

          You will see this grow to 1000+ within a minute or so. This is the case with SDK 3 and not SDK 2.
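
          For reference, the scope and collection can also be created from the SDK's collection management API instead of the web UI (a sketch, assuming the 3.0.x management interface; bucket-1 must already exist):

          # Sketch: creating scope-1/collection-1 via the SDK 3 collection manager.
          from couchbase.cluster import Cluster, ClusterOptions
          from couchbase.auth import PasswordAuthenticator
          from couchbase.management.collections import CollectionSpec

          cluster = Cluster('couchbase://172.23.100.70',
                            ClusterOptions(PasswordAuthenticator('Administrator', 'password')))
          coll_mgr = cluster.bucket('bucket-1').collections()
          coll_mgr.create_scope('scope-1')
          coll_mgr.create_collection(CollectionSpec('collection-1', scope_name='scope-1'))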


          ingenthr Matt Ingenthron added a comment -

          Korrigan Clark: can you try the connstr change suggested by Sergey Avseyev to see if that has the desired effect? We may still need to make changes, but this would unblock you.

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          I tried with the suggested connection string params and it now seems worse... Per query, it now creates a new TCP connection to the data node and a new connection to the query node, so there are 2 new connections created and destroyed per query. Here is the lcb logging for a single query with connection string couchbase://172.23.100.70?password=password&http_poolsize=0

          220ms [I876576e1b49c19c8] {16780} [TRACE] (http-io - L:401) <172.23.100.70:8091> GET  http://172.23.100.70:8091/pools. Body=0 bytes 
          220ms [I876576e1b49c19c8] {16780} [DEBUG] (lcbio_mgr - L:481) <172.23.100.70:8091> (HE=0x238a070) Creating new connection because none are available in the pool 
          220ms [I876576e1b49c19c8] {16780} [TRACE] (lcbio_mgr - L:393) <172.23.100.70:8091> (HE=0x238a070) New pool entry: I=0x1767b70 
          220ms [I876576e1b49c19c8] {16780} [INFO] (connection - L:487) <172.23.100.70:8091> (SOCK=db31a141164a4798) Starting. Timeout=75000000us 
          220ms [I876576e1b49c19c8] {16780} [DEBUG] (connection - L:252) <172.23.100.70:8091> (SOCK=db31a141164a4798) Created new socket with FD=28 
          220ms [I876576e1b49c19c8] {16780} [TRACE] (connection - L:352) <172.23.100.70:8091> (SOCK=db31a141164a4798) Scheduling I/O watcher for asynchronous connection completion. 
          220ms [I876576e1b49c19c8] {16780} [INFO] (connection - L:155) <172.23.100.70:8091> (SOCK=db31a141164a4798) Connected established 
          220ms [I876576e1b49c19c8] {16780} [DEBUG] (connection - L:108) <172.23.100.70:8091> (SOCK=db31a141164a4798) Successfully set TCP_NODELAY 
          220ms [I876576e1b49c19c8] {16780} [DEBUG] (connection - L:108) <172.23.100.70:8091> (SOCK=db31a141164a4798) Successfully set TCP_KEEPALIVE 
          220ms [I876576e1b49c19c8] {16780} [DEBUG] (lcbio_mgr - L:349) <172.23.100.70:8091> (HE=0x238a070) Received result for I=0x1767b70,C=(nil); E=0x0 
          221ms [I876576e1b49c19c8] {16780} [DEBUG] (lcbio_mgr - L:309) <172.23.100.70:8091> (HE=0x238a070) Assigning R=0x1fd57b0 SOCKET=0x2172f70 
          221ms [I876576e1b49c19c8] {16780} [DEBUG] (ioctx - L:95) <172.23.100.70:8091> (CTX=0x273d680,unknown) Pairing with SOCK=db31a141164a4798 
          222ms [I876576e1b49c19c8] {16780} [DEBUG] (ioctx - L:140) <172.23.100.70:8091> (CTX=0x273d680,mgmt/capi) Destroying context. Pending Writes=0, Entered=true, Socket Refcount=1 
          223ms [Ifafb43dc4b475c0e] {16780} [TRACE] (http-io - L:401) <172.23.100.55:8093> POST  http://172.23.100.55:8093/query/service. Body=244 bytes 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (lcbio_mgr - L:481) <172.23.100.55:8093> (HE=0x22d2c10) Creating new connection because none are available in the pool 
          223ms [Ifafb43dc4b475c0e] {16780} [TRACE] (lcbio_mgr - L:393) <172.23.100.55:8093> (HE=0x22d2c10) New pool entry: I=0x2461f30 
          223ms [Ifafb43dc4b475c0e] {16780} [INFO] (connection - L:487) <172.23.100.55:8093> (SOCK=21c67a168cbbc78d) Starting. Timeout=75000000us 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (connection - L:252) <172.23.100.55:8093> (SOCK=21c67a168cbbc78d) Created new socket with FD=28 
          223ms [Ifafb43dc4b475c0e] {16780} [TRACE] (connection - L:352) <172.23.100.55:8093> (SOCK=21c67a168cbbc78d) Scheduling I/O watcher for asynchronous connection completion. 
          223ms [Ifafb43dc4b475c0e] {16780} [INFO] (connection - L:155) <172.23.100.55:8093> (SOCK=21c67a168cbbc78d) Connected established 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (connection - L:108) <172.23.100.55:8093> (SOCK=21c67a168cbbc78d) Successfully set TCP_NODELAY 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (connection - L:108) <172.23.100.55:8093> (SOCK=21c67a168cbbc78d) Successfully set TCP_KEEPALIVE 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (lcbio_mgr - L:349) <172.23.100.55:8093> (HE=0x22d2c10) Received result for I=0x2461f30,C=(nil); E=0x0 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (lcbio_mgr - L:309) <172.23.100.55:8093> (HE=0x22d2c10) Assigning R=0x18adb20 SOCKET=0x2749f30 
          223ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (ioctx - L:95) <172.23.100.55:8093> (CTX=0x26f70e0,unknown) Pairing with SOCK=21c67a168cbbc78d 
          226ms [Ifafb43dc4b475c0e] {16780} [DEBUG] (ioctx - L:140) <172.23.100.55:8093> (CTX=0x26f70e0,mgmt/capi) Destroying context. Pending Writes=0, Entered=true, Socket Refcount=1
          

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-2351_access_72e1#f087785f256442b0b8bdd8d9b9ba328a

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          It looks like, from this ticket: https://issues.couchbase.com/browse/MB-39728, the SDK may be asking for the collection manifest more than once per connection, even for plain KV get/set.

          avsej Sergey Avseyev added a comment -

          There was a fix in libcouchbase 3.0.2: CCBC-1158. Which libcouchbase is used in this ticket?

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          3.0.1... I will upgrade the test to 3.0.2 and rerun. Thanks for pointing this out.
          Ellis.Breen Ellis Breen added a comment - edited

          Installing from PYCBC 3.0.1 or master (currently) should give you LCB 3.0.2, to be clear.


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Upgrading PYCBC to the commit with LCB 3.0.2 is actually causing more issues now. Here is a graph of the same build number comparing before and after the SDK upgrade: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_700-2416_access_ec65&snapshot=ares_700-2416_access_b451

          Note that the ops trail off and the TIME_WAIT connections increase. Sergey Avseyev Ellis Breen

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Another interesting comparison is that this KV test showed LCB_ERR_DURABLE_WRITE_IN_PROGRESS errors before and now shows LCB_ERR_SOCKET_SHUTDOWN errors.

          http://perf.jenkins.couchbase.com/job/ares/14882/console

          http://perf.jenkins.couchbase.com/job/ares/14952/console
          Ellis.Breen Ellis Breen added a comment -

          Odd - it looks like some latencies increase and some decrease, but overall throughput is diminished, and it degrades over time. I don't think there was really much in the way of modification to N1QL behaviour on the Python side, so I suspect this relates more to the LCB upgrade. The connection count looks fairly stable according to the cbmonitor stats, so it sounds more like a change in how LCB handles operation queues. Any thoughts, Sergey Avseyev?

          Ellis.Breen Ellis Breen added a comment - edited

          Korrigan Clark - would it be possible to have remote access to your perf test cluster, or a similar setup? I did try to reproduce this locally with a version of TxBench updated for SDK3 and modified to do N1QL but I only see TimeoutExceptions.


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          The repro.py script should allow you to reproduce this locally. I ran it on my laptop and the repro worked fine. Please try that first, as our perf clusters are very busy and not really necessary for the repro.

          From the comment above:

          repro.py

          You can use this script to see the problem.

          First, install the correct Python SDK 3: pip install git+https://github.com/couchbase/couchbase-python-client@8d5820a297b35378d314eb399c041b214aa6a7db

          You must create a cluster with 1 data node and 1 query node. Create a bucket called bucket-1, a scope called scope-1 in bucket-1, and a collection called collection-1 inside scope-1. Be sure to change the cluster IP in the script. While the query loop in the script is running, go to the data node machine and run:

          netstat | grep 'TIME_WAIT' | wc -l

          You will see this grow to 1000+ within a minute or so. This is the case with SDK 3 and not SDK 2.
          Ellis.Breen Ellis Breen added a comment -

          Thanks, Korrigan - will try this today.

          Ellis.Breen Ellis Breen added a comment -

          Hi, I've set up Cheshire Cat and adapted your script to set up the requisite scopes/collections, but I'm not sure how to set up a N1QL index on a collection yet - I get the following as is:

          couchbase.exceptions.CouchbaseException: <RC=0x192[LCB_ERR_INDEX_FAILURE (402)], HTTP Request failed. Examine 'objextra' for full result, Results=1, C Source=(src/pycbc_http.c,212), OBJ=ViewResult<rc=0x192[LCB_ERR_INDEX_FAILURE (402)], value={'requestID': 'd1f4052a-ca78-417f-b844-995739741e9d', 'clientContextID': '43056814368a39d4', 'errors': [{'code': 12021, 'msg': 'Scope not found in CB datastore default:default.scope-1'}], 'status': 'fatal'}, http_status=0, tracing_context=0, tracing_output=None>, Context={'first_error_code': 12021, 'http_response_code': 500, 'first_error_message': 'Scope not found in CB datastore default:default.scope-1', 'statement': 'SELECT * FROM default:`default`.`scope-1`.`collection-1` USE KEYS[$1];', 'client_context_id': '43056814368a39d4', 'query_params': '', 'http_response_body': '', 'endpoint': '10.143.210.101:8093', 'type': 'QueryErrorContext'}, Tracing Output={":nokey:0": null}>
          

          What are you using to set up the index? Just the web UI?


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Ellis Breen, the N1QL queries provided in the script do not need an index, but you can create an index like this:

          CREATE PRIMARY INDEX `primary_idx` ON default:`bucket-1`.`scope-1`.`collection-1`

          CREATE INDEX `primary_idx` ON default:`bucket-1`.`scope-1`.`collection-1`(field1)
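
          If you prefer doing it from the script rather than the web UI, the same statement can be executed through the SDK (a sketch, reusing a connected Cluster object such as the one in repro.py):

          # Sketch: run the CREATE PRIMARY INDEX statement above via the query service.
          list(cluster.query(
              'CREATE PRIMARY INDEX `primary_idx` '
              'ON default:`bucket-1`.`scope-1`.`collection-1`'))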

          Ellis.Breen Ellis Breen added a comment -

          Just to confirm - I have now reproduced this. It appears that the 8091 socket used by PYCBC/LCB for '/pools' is being continuously recreated rather than reused (as opposed to, for example, the 8093 socket for N1QL queries). The LCB log shows that the 8091 socket appears not to be placed back into the pool, whereas the 8093 one is. Will look at this more tomorrow, but would be very grateful for any input from Sergey Avseyev.


          ingenthr Matt Ingenthron added a comment -

          I think we have another PYCBC issue related to this. There should be no situation with a Couchbase bucket type where regular data usage (not management usage) visits that service on port 8091. In other words, it's probably more of a PYCBC bug, even if there is a pooling bug in CCBC.
          Ellis.Breen Ellis Breen added a comment - edited

          Hi Matt - which PYCBC issue? I had a quick look and couldn't find it - maybe something old that was believed to be fixed? Will dig in further tomorrow anyway.

          ingenthr Matt Ingenthron added a comment - edited

          The other bug is PYCBC-954, which fell out of RBAC restrictions on CBCloud. See also PYCBC-866 which we deferred.

          The SDK-RFC on this specifies that GCCCP should be tried in any case, and then fall back to requiring that a bucket be opened (or failing back to the app, saying a bucket needs to be opened). Checking /pools is not a good idea for a number of reasons, including this access control restriction.

          Ellis.Breen Ellis Breen added a comment -

          Ah, perfect - they do indeed look similar. Thanks for all the pointers!


          ingenthr Matt Ingenthron added a comment -

          If PYCBC-954 is resolved, then does that mean Korrigan Clark should try the 3.0.3 release, Ellis Breen?
          Ellis.Breen Ellis Breen added a comment -

          It looks like running off master (which will be PYCBC 3.0.2) will include this fix, yes. Korrigan Clark - please let us know if this helps.


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Everything works perfectly now.

          You can see the TIME_WAIT connections are back to normal:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-2477_access_e0f6#f087785f256442b0b8bdd8d9b9ba328a

          Thank you. Closing the ticket.

          People

            Assignee: Ellis.Breen Ellis Breen
            Reporter: korrigan.clark Korrigan Clark (Inactive)
            Votes: 0
            Watchers: 5


              Gerrit Reviews

                There are no open Gerrit changes
