Couchbase Server / MB-53357

[FTS] gocbcore fails to establish new connections to server


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Elixir
    • Fix Version/s: Elixir
    • Component/s: fts
    • Labels: None
    • Environment:
      /opt/couchbase/bin/couchbase-server -v
      Couchbase Server 7.2.0-1820 (EE)

       cat /etc/couchbase.d/config_profile
      serverless
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      Setup:

      • A nine (9) node c6gd.2xlarge AWS system with 20 buckets. 3 KV, 2 query, 2 index, 2 fts.
      • Load 35190 docs (replicated 10X travel-sample._default._default with UUID keys)
      • Build a Search index on all 20 buckets concurrently (as per "Elixir Performance & Scalability")

      I used a modified HTTP benchmarking tool (similar to 'ab'), based on the 'bombardier' tool.

      Randomly select a bucket via 125 concurrent clients and make 1M Search requests in total.

       

      [ec2-user@ip-10-0-0-133 query02]$ ./cb_fts_bench -m POST -H "Content-Type: application/json" -t 15s -c 125 -k -a -u admin:$PASSWORD -n 1000000   -b '{"query": { "query": "+balcony +whales +ritz" }}' http://10.0.1.28:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query
      Bombarding http://10.0.1.28:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query with 1000000 request(s) using 125 connection(s)
       1000000 / 1000000 [===================================================] 100.00% 16104/s 1m2s
      Done!
      Statistics        Avg      Stdev        Max
        Reqs/sec     16148.31    2623.01   21654.98
        Latency        7.74ms     6.28ms   580.30ms
        HTTP codes:
          1xx - 0, 2xx - 1000000, 3xx - 0, 4xx - 0, 5xx - 0
          others - 0
        Throughput:    35.24MB/s
        HTTP 429 retries            0, pct.     0.000% across 1000000 reqs
        Ave Sizes
          resp.body:            1893 bytes (across   1000000 reqs)
          resp.body.hit[]:      1541 bytes (across   1000000 reqs having hits [100.00%])
        Other
          total_hits            10000000, ave    10.000
          tot_hits_docreads     10000000, ave    10.000

      The above shows 100% success (all 2xx codes) with good latencies.
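
      For reference, each templated benchmark request expands to an ordinary FTS query against one index; a single such request can be reproduced with curl (a sketch, substituting one concrete index name matching the ts[[SEQ:1:20]] pattern, e.g. bucket ts01):

      # One-off equivalent of a single benchmark request (concrete index name assumed from the ts01 bucket)
      curl -s -u admin:$PASSWORD -X POST \
        -H "Content-Type: application/json" \
        http://10.0.1.28:8094/api/index/ts01._default.ts01_fts_01/query \
        -d '{"query": { "query": "+balcony +whales +ritz" }}'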

      Repeat a few times (more runs of 1M Search requests each), but adjust the query body:

       

      '{"query": { "query": "+balcony +ritz" }}'
       
      or
       
      '{"query": { "query": "+balcony" }}'

      After 3-4 runs the system exhibits hard timeouts (I eventually hit Ctrl-C); note the per-request timeout is set to 15 seconds via the "-t 15s" flag.

      [ec2-user@ip-10-0-0-133 query02]$ ./cb_fts_bench -m POST -H "Content-Type: application/json" -t 15s -c 125 -k -a -u admin:$PASSWORD -n 1000000   -b '{"query": { "query": "+balcony +whales +ritz" }}' http://10.0.1.28:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query
      Bombarding http://10.0.1.28:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query with 1000000 request(s) using 125 connection(s)
       375 / 1000000 [>------------------------]   0.04% 8/s 1d9h22m41s 1000000 / 1000000 [=========================================] 100.00% 22085/s 45s
      Done!
      Statistics        Avg      Stdev        Max
        Reqs/sec         8.42     219.19    6322.17
        Latency        15.00s   369.61us     15.00s
        HTTP codes:
          1xx - 0, 2xx - 0, 3xx - 0, 4xx - 0, 5xx - 0
          others - 500
        Errors:
             timeout - 500
        Throughput:     2.01KB/s
        HTTP 429 retries            0, pct.       NaN% across 0 reqs
        Ave Sizes
          resp.body:             n/a bytes (across         0 reqs)
          resp.body.hit[]:       n/a bytes (across         0 reqs having hits [ NaN%])
        Other
          total_hits                   0, ave       NaN
          tot_hits_docreads            0, ave       NaN

      The system became very unresponsive; even the UI, when I go to the Search page, merely shows "Gateway Timeout".

      There are some very odd WARN messages in the fts.log file:

      2022-08-11T18:08:23.955+00:00 [WARN] (GOCBCORE) Pipeline Client 0x40001fe1c0 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":7942,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.1.72:11210","LastDispatchedFrom":"10.0.1.28:47652","LastConnectionID":"e211ae45503c88a0/41a5c677f50a0dba"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:24.615+00:00 [WARN] ns_server: retrieve partition seqs: gocbcore_utils: CBPartitionSeqs, fetchClient err: gocbcore_utils: createAgents (1), setup err: agent setup failed, err: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":5711,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.42:11210","LastDispatchedFrom":"10.0.1.28:39100","LastConnectionID":"e211ae45503c88a0/7975e2682b4795e2"} -- cbft.RunSourcePartitionSeqs() at ns_server.go:1121
      2022-08-11T18:08:27.623+00:00 [WARN] (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:30.623+00:00 [WARN] (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:31.624+00:00 [WARN] (GOCBCORE) Pipeline Client 0x400053a0e0 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":5645,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.1.72:11210","LastDispatchedFrom":"10.0.1.28:46532","LastConnectionID":"bc5dff67559585a6/444b9ac50706da8f"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:31.624+00:00 [WARN] (GOCBCORE) Pipeline Client 0x400053a070 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":10306,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.42:11210","LastDispatchedFrom":"10.0.1.28:41644","LastConnectionID":"bc5dff67559585a6/56307db006506e17"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:31.624+00:00 [WARN] (GOCBCORE) Pipeline Client 0x400053a000 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":10199,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.20:11210","LastDispatchedFrom":"10.0.1.28:50380","LastConnectionID":"bc5dff67559585a6/041ff9f08e481174"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:35.294+00:00 [WARN] (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:38.294+00:00 [WARN] (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:39.294+00:00 [WARN] (GOCBCORE) Pipeline Client 0x4000454230 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":10108,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.1.72:11210","LastDispatchedFrom":"10.0.1.28:46542","LastConnectionID":"e64d3b08de2d94de/adf3cbb128b938bc"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:39.294+00:00 [WARN] (GOCBCORE) Pipeline Client 0x40004541c0 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":18986,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.42:11210","LastDispatchedFrom":"10.0.1.28:41660","LastConnectionID":"e64d3b08de2d94de/3825347be8d0730a"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:39.294+00:00 [WARN] (GOCBCORE) Pipeline Client 0x4000454070 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":15376,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.20:11210","LastDispatchedFrom":"10.0.1.28:50384","LastConnectionID":"e64d3b08de2d94de/22c002e7ed08c2ee"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:42.957+00:00 [WARN] (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:46.957+00:00 [WARN] (GOCBCORE) Pipeline Client 0x4000338540 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":13169,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.42:11210","LastDispatchedFrom":"10.0.1.28:39216","LastConnectionID":"3bb9c755d2e793c1/e1a22e1561cae4d6"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:47.621+00:00 [WARN] ns_server: retrieve partition seqs: gocbcore_utils: CBPartitionSeqs, fetchClient err: gocbcore_utils: createAgents (1), setup err: agent setup failed, err: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x6","TimeObserved":13169,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"10.0.0.42:11210","LastDispatchedFrom":"10.0.1.28:39216","LastConnectionID":"3bb9c755d2e793c1/e1a22e1561cae4d6"} -- cbft.RunSourcePartitionSeqs() at ns_server.go:1121
      2022-08-11T18:08:50.630+00:00 [WARN] (GOCBCORE) CCCPPOLL: Failed to retrieve CCCP config. ambiguous timeout -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:748
      2022-08-11T18:08:54.868+00:00 [WARN] janitor: JanitorOnce, err: janitor: JanitorOnce errors: 2, []string{"#0: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, name: ts05._default.ts05_fts_01_1f1e2ba304a8de31_4c1c5584, server: http://127.0.0.1:8091, bucketName: ts05, indexName: ts05._default.ts05_fts_01, err: newGocbcoreDCPFeed: error in setting up feed's stream options, err: agent setup failed, err: gocbcore_utils: createAgents (1), setup err: agent setup failed, err: ambiguous timeout | {\"InnerError\":{\"InnerError\":{\"InnerError\":{},\"Message\":\"ambiguous timeout\"}},\"OperationID\":\"CMD_SASLAUTH\",\"Opaque\":\"0x6\",\"TimeObserved\":10166,\"RetryReasons\":null,\"RetryAttempts\":0,\"LastDispatchedTo\":\"10.0.0.42:11210\",\"LastDispatchedFrom\":\"10.0.1.28:60876\",\"LastConnectionID\":\"fa1d034c7329ef90/09d2d78f77089071\"}", "#1: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, name: ts19._default.ts19_fts_01_40870f3600bd713b_4c1c5584, server: http://127.0.0.1:8091, bucketName: ts19, indexName: ts19._default.ts19_fts_01, err: newGocbcoreDCPFeed: error in setting up feed's stream options, err: agent setup failed, err: gocbcore_utils: createAgents (1), setup err: agent setup failed, err: ambiguous timeout | {\"InnerError\":{\"InnerError\":{\"InnerError\":{},\"Message\":\"ambiguous timeout\"}},\"OperationID\":\"CMD_SASLAUTH\",\"Opaque\":\"0x6\",\"TimeObserved\":18986,\"RetryReasons\":null,\"RetryAttempts\":0,\"LastDispatchedTo\":\"10.0.0.42:11210\",\"LastDispatchedFrom\":\"10.0.1.28:41660\",\"LastConnectionID\":\"e64d3b08de2d94de/3825347be8d0730a\"}"} -- cbgt.(*Manager).JanitorLoop() at manager_janitor.go:93
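
      For reference, the frequency of these warnings can be tallied straight from the log (a quick sketch; the log path assumes a default Linux install):

      # Count "ambiguous timeout" warnings per minute in fts.log (path assumed for a default Linux install)
      grep "ambiguous timeout" /opt/couchbase/var/lib/couchbase/logs/fts.log | cut -c1-16 | sort | uniq -c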

      I took two (2) cbcollect_info's from the two FTS nodes.

      node1

      /opt/couchbase/bin/cbcollect_info --tmp-dir=/tmp/space /tmp/cbcollect_fts_node1_of_2_gocbcore_sdk_issue
       
      curl -s --upload-file  /tmp/cbcollect_fts_node1_of_2_gocbcore_sdk_issue.zip https://s3.amazonaws.com/customers.couchbase.com/jonstrabala/

      node2

      /opt/couchbase/bin/cbcollect_info --tmp-dir=/tmp/space /tmp/cbcollect_fts_node1_of_2_gocbcore_sdk_issue
       
      curl -s --upload-file  /tmp/cbcollect_fts_node1_of_2_gocbcore_sdk_issue.zip https://s3.amazonaws.com/customers.couchbase.com/jonstrabala/

       

       

      Attachments


        Activity

          Michael Blow added a comment (edited):

          Abhi Dangeti / Thejas Orkombu, just FYI: the registry-related panics in regulator are a dup of MB-53358, which is fixed in 7.2.0-1827.

          Abhi Dangeti added a comment:

          "there's not a lot of information that I can give you based on these logs."

          Hey Charles Dixon, I suspected so. Let's try and get more logs by enabling DEBUG logging to understand this better.

          Jon Strabala, here's how you set the log level to DEBUG. You'll need to do this on all FTS nodes prior to starting your test:

          curl -X PUT -u <username>:<password> http://<ip>:8094/api/managerOptions -d '{"logLevel":"DEBU"}' 


          "the registry related panics in regulator are a dup of MB-53358 which is fixed in 7.2.0-1827."

          Thanks, Michael Blow.
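
          (For completeness: once the test run is done, the same endpoint can be used to bring the verbosity back down; a sketch, assuming INFO is the level to return to.)

          curl -X PUT -u <username>:<password> http://<ip>:8094/api/managerOptions -d '{"logLevel":"INFO"}'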

          Jon Strabala added a comment:

          Abhi Dangeti, okay, I did a stop/start of the cluster (all nodes, no upgrades), then waited a few minutes to let it warm up until the buckets looked healthy (no nodes pending).

          Next, on both FTS nodes I ran the following; both returned "ok" as a response:

           

          curl -X PUT -u $CB_USERNAME:$CB_PASSWORD http://localhost:8094/api/managerOptions -d '{"logLevel":"DEBU"}'
          {"status":"ok"}

          I started a query as before

           

          ./cb_fts_bench -m POST -H "Content-Type: application/json" -t 15s -c 125 -k -a -u admin:$CB_PASSWORD -n 1000000 -b '{"query": { "query": "+balcony +whales +ritz" }}' http://10.0.1.28:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query
          Bombarding http://10.0.1.28:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query with 1000000 request(s) using 125 connection(s)
           1000000 / 1000000 [=================================================================================================] 100.00% 1161/s 14m20s
          Done!
          Statistics        Avg      Stdev        Max
            Reqs/sec      1159.95    3887.23   22380.70
            Latency      107.59ms   160.20ms      1.00s
            HTTP codes:
              1xx - 0, 2xx - 1000000, 3xx - 0, 4xx - 0, 5xx - 0
              others - 0
            Throughput:     2.54MB/s
            HTTP 429 retries            0, pct.     0.000% across 1000000 reqs
            Ave Sizes
              resp.body:            1893 bytes (across   1000000 reqs)
              resp.body.hit[]:      1541 bytes (across   1000000 reqs having hits [100.00%])
            Other
              total_hits            10000000, ave    10.000
              tot_hits_docreads     10000000, ave    10.000

           

          It seemed to run much slower (about 14X), roughly 1159 reqs/sec with DEBUG enabled [but no errors], as opposed to 16104 reqs/sec in my first run in this MB (I assume things are nicely reset and this is due to DEBUG). So instead of 1m2s, this test took about 14m to complete.

          I ran 3 more tests at 100K reqs each instead of 1M, about 77 seconds each, with similar performance and latencies, and no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "balcony +ritz" }}' and run 100K reqs twice (2 times); no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "balcony" }}' and run 100K reqs twice (2 times); no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "+balcony" }, "size": 100}' and run 100K reqs once (1 time); no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "+balcony" }, "size": 1000}' and run 100K reqs once (1 time); no 429 errors and no timeouts [no errors].

          So at this point I switch to the other FTS node 10.0.1.251 (previously I was accessing 10.0.1.28)

          ./cb_fts_bench -m POST -H "Content-Type: application/json" -t 15s -c 125 -k -a -u admin:$CB_PASSWORD -n 1000000 -b '{"query": { "query": "+balcony +whales +ritz" }}' http://10.0.1.251:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query
          Bombarding http://10.0.1.251:8094/api/index/ts[[SEQ:1:20]]._default.ts[[SEQ:1:20]]_fts_01/query with 1000000 request(s) using 125 connection(s)
           1000000 / 1000000 [====================================================================================================] 100.00% 16918/s 59s
          Done!
          Statistics        Avg      Stdev        Max
            Reqs/sec     16970.04    2767.06   22303.62
            Latency        7.37ms     1.51ms    65.80ms
            HTTP codes:
              1xx - 0, 2xx - 1000000, 3xx - 0, 4xx - 0, 5xx - 0
              others - 0
            Throughput:    37.24MB/s
            HTTP 429 retries            0, pct.     0.000% across 1000000 reqs
            Ave Sizes
              resp.body:            1905 bytes (across   1000000 reqs)
              resp.body.hit[]:      1541 bytes (across   1000000 reqs having hits [100.00%])
            Other
              total_hits            10000000, ave    10.000
              tot_hits_docreads     10000000, ave    10.000
          

          Now I got almost 17K reqs/sec. This seems a bit odd; I wouldn't expect going to a replica to slow things down this much, quite a difference between the two nodes. So the slowdown above was not due to DEBUG.

          I ran the same query again, still at 1M; no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "balcony +ritz" }}' and run 1M reqs once (1 time); no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "balcony" }}' and run 1M reqs once (1 time); no 429 errors and no timeouts [no errors].

          Now I alter the query to '{"query": { "query": "+balcony" }, "size": 1000}' and run 100K reqs once (1 time); no 429 errors and no timeouts [no errors].

          Instead of 20 buckets, I limited things to 2 buckets, for size 10 and size 1000:

          ./cb_fts_bench -m POST -H "Content-Type: application/json" -t 15s -c 125 -k -a -u admin:$CB_PASSWORD -n 100000 -b '{"query": { "query": "+balcony" }}' http://10.0.1.251:8094/api/index/ts[[SEQ:1:2]]._default.ts[[SEQ:1:2]]_fts_01/query
          Bombarding http://10.0.1.251:8094/api/index/ts[[SEQ:1:2]]._default.ts[[SEQ:1:2]]_fts_01/query with 100000 request(s) using 125 connection(s)
           100000 / 100000 [===========================================================================================================================================] 100.00% 10182/s 9s
          Done!
          Statistics        Avg      Stdev        Max
            Reqs/sec     10270.55    1872.90   14800.69
            Latency       12.17ms     4.57ms    97.35ms
            HTTP codes:
              1xx - 0, 2xx - 100000, 3xx - 0, 4xx - 0, 5xx - 0
              others - 0
            Throughput:    22.27MB/s
            HTTP 429 retries            0, pct.     0.000% across 100000 reqs
            Ave Sizes
              resp.body:            1893 bytes (across    100000 reqs)
              resp.body.hit[]:      1541 bytes (across    100000 reqs having hits [100.00%])
            Other
              total_hits           156000000, ave  1560.000
              tot_hits_docreads      1000000, ave    10.000
           
          ./cb_fts_bench -m POST -H "Content-Type: application/json" -t 15s -c 125 -k -a -u admin:$CB_PASSWORD -n 100000 -b '{"query": { "query": "+balcony" }, "size": 1000}' http://10.0.1.251:8094/api/index/ts[[SEQ:1:2]]._default.ts[[SEQ:1:2]]_fts_01/query
          Bombarding http://10.0.1.251:8094/api/index/ts[[SEQ:1:2]]._default.ts[[SEQ:1:2]]_fts_01/query with 100000 request(s) using 125 connection(s)
           100000 / 100000 [===========================================================================================================================================] 100.00% 2341/s 42s
          Done!
          Statistics        Avg      Stdev        Max
            Reqs/sec      2348.25     456.14    3567.00
            Latency       53.18ms    57.25ms   816.41ms
            HTTP codes:
              1xx - 0, 2xx - 100000, 3xx - 0, 4xx - 0, 5xx - 0
              others - 0
            Throughput:   347.59MB/s
            HTTP 429 retries            0, pct.     0.000% across 100000 reqs
            Ave Sizes
              resp.body:          154816 bytes (across    100000 reqs)
              resp.body.hit[]:    154461 bytes (across    100000 reqs having hits [100.00%])
            Other
              total_hits           156000000, ave  1560.000
              tot_hits_docreads    100000000, ave  1000.000
          

          I modified two indexes in bucket ts01 (so they are rebuilding) but just couldn't reproduce the issue.

          But I am concerned about the disparity in performance going to one node vs. the other (are all of the active, non-replica index partitions on the same node?).
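
          One way to sanity-check that is to dump the cbgt plan and see which node each index partition (and replica) is assigned to; a rough sketch, assuming the /api/cfg endpoint and the planPIndexes layout it returns (field names may differ by build):

          # Sketch: list each planned pindex and the nodes it is assigned to
          # (in cbgt's plan, priority 0 is typically the active copy, higher values are replicas)
          curl -s -u admin:$PASSWORD http://10.0.1.28:8094/api/cfg | \
            jq '.planPIndexes.planPIndexes | to_entries[] | {pindex: .key, nodes: .value.nodes}'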

           

          Jon Strabala added a comment:

          I tried again with a new build, Couchbase Server 7.2.0-1856 (EE). This build seems much more stable and the numbers look a lot better between the two FTS nodes:

          QUERY LOOP: '{"query": { "query": "+ritz +whales +balcony" }, "size": 10}'
          IP 10.0.1.28
          Statistics        Avg      Stdev        Max
            Reqs/sec     13622.09    3672.77   18920.04
            Latency        9.79ms     4.53ms    36.70ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
          QUERY LOOP: '{"query": { "query": "+ritz +whales +balcony" }, "size": 10}'
          IP 10.0.1.251
          Statistics        Avg      Stdev        Max
            Reqs/sec     13255.43    4114.09   20702.82
            Latency       10.15ms     4.36ms    30.05ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
           
           
          QUERY LOOP: '{"query": { "query": "+ritz +balcony" }, "size": 10}'
          IP 10.0.1.28
          Statistics        Avg      Stdev        Max
            Reqs/sec     11798.66    1414.39   13786.87
            Latency       10.34ms     5.22ms    25.09ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
          QUERY LOOP: '{"query": { "query": "+ritz +balcony" }, "size": 10}'
          IP 10.0.1.251
          Statistics        Avg      Stdev        Max
            Reqs/sec     10049.72    1685.59   11933.66
            Latency       12.19ms     6.18ms    36.14ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
           
           
          QUERY LOOP: '{"query": { "query": "+balcony" }, "size": 10}'
          IP 10.0.1.28
          Statistics        Avg      Stdev        Max
            Reqs/sec      8593.26    3694.52   13017.76
            Latency       14.39ms     7.69ms    43.75ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
          QUERY LOOP: '{"query": { "query": "+balcony" }, "size": 10}'
          IP 10.0.1.251
          Statistics        Avg      Stdev        Max
            Reqs/sec      9595.83    4991.08   20786.81
            Latency       14.89ms     7.82ms    48.27ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
           
           
          QUERY LOOP: '{"query": { "query": "+ritz +whales +balcony"}, "size": 1000}'
          IP 10.0.1.28
          Statistics        Avg      Stdev        Max
            Reqs/sec     12060.51    2690.84   17036.66
            Latency        9.89ms     4.48ms    31.55ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
          QUERY LOOP: '{"query": { "query": "+ritz +whales +balcony"}, "size": 1000}'
          IP 10.0.1.251
          Statistics        Avg      Stdev        Max
            Reqs/sec     12395.65    3089.53   17751.41
            Latency       10.47ms     5.96ms    40.51ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
           
           
          QUERY LOOP: '{"query": { "query": "+ritz +balcony" }, "size": 1000}'
          IP 10.0.1.28
          Statistics        Avg      Stdev        Max
            Reqs/sec      7088.36    1741.21    9535.20
            Latency       17.48ms     7.39ms    40.76ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
          QUERY LOOP: '{"query": { "query": "+ritz +balcony" }, "size": 1000}'
          IP 10.0.1.251
          Statistics        Avg      Stdev        Max
            Reqs/sec      6922.21    1946.19    8902.36
            Latency       17.62ms    10.01ms    50.73ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
           
           
          QUERY LOOP: '{"query": { "query": "+balcony" }, "size": 1000}'
          IP 10.0.1.28
          Statistics        Avg      Stdev        Max
            Reqs/sec      2265.05     517.67    2960.90
            Latency       52.99ms    40.70ms   239.65ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0
          QUERY LOOP: '{"query": { "query": "+balcony" }, "size": 1000}'
          IP 10.0.1.251
          Statistics        Avg      Stdev        Max
            Reqs/sec      2192.85     524.55    2766.06
            Latency       53.86ms    46.48ms   275.13ms
              1xx - 0, 2xx - 1000, 3xx - 0, 4xx - 0, 5xx - 0

           


          Abhi Dangeti added a comment:

          Jon Strabala, marking this as resolved; please re-test in more recent builds.

          People

            Assignee: Jon Strabala
            Reporter: Jon Strabala
            Votes: 0
            Watchers: 6
