Couchbase Server
MB-48066

[TLS] query hangs in transaction


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 7.0.2
    • Fix Version/s: 7.0.2
    • Component/s: query, secondary-index
    • Labels:
      None
    • Environment:
      7.0.2-6544
    • Triage:
      Untriaged
    • Story Points:
      1
    • Is this a Regression?:
      Unknown

      Description

      To repro:

      • set up a 2-node cluster (kv:n1ql / index-kv)
      • load the travel-sample bucket
      • enable N2N encryption: /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
      • set TLS strict mode: /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
      • try a simple SELECT in a transaction ... the query hangs and eventually times out.

       cbq> begin work;
      {
          "requestID": "185daa00-dbd7-4c58-8eae-7d4d051eb2bf",
          "signature": "json",
          "results": [
          {
              "txid": "0756a4ca-d02f-4c02-8b6e-9f08b0a6fb20"
          }
          ],
          "status": "success",
          "metrics": {
              "elapsedTime": "1.067112ms",
              "executionTime": "901.84µs",
              "resultCount": 1,
              "resultSize": 62,
              "serviceLoad": 6,
              "transactionElapsedTime": "84.141µs",
              "transactionRemainingTime": "1m59.99989191s"
          }
      }
      cbq> select city,airportname from `travel-sample`.inventory.airport where lower(city) = 'lyon';
      {
          "requestID": "b976d0bd-cbb7-4b36-bc7b-fb1492675f8c",
          "signature": {
              "airportname": "json",
              "city": "json"
          },
          "results": [
          ],
          "errors": [
              {
                  "code": 1080,
                  "msg": "Timeout 1m55.648058014s exceeded",
                  "retry": true
              }
          ],
          "status": "timeout",
          "metrics": {
              "elapsedTime": "1m55.650817285s",
              "executionTime": "1m55.650571598s",
              "resultCount": 0,
              "resultSize": 0,
              "serviceLoad": 1,
              "transactionElapsedTime": "2m0.000363072s",
              "errorCount": 1
          }
      }

      Outside of a transaction, the same operations work fine.
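      For reference, the same repro can presumably be driven from the Go SDK instead of cbq. The sketch below uses gocb v2's transactions API; the host, credentials, and exact API details are assumptions for illustration, not taken from this ticket, and the snippet is untested:

      package main

      import (
          "log"

          "github.com/couchbase/gocb/v2"
      )

      func main() {
          // With cluster encryption level "strict", clients must connect over TLS,
          // hence the couchbases:// scheme. Host and credentials are placeholders.
          cluster, err := gocb.Connect("couchbases://172.23.104.92", gocb.ClusterOptions{
              Authenticator: gocb.PasswordAuthenticator{
                  Username: "Administrator",
                  Password: "password",
              },
          })
          if err != nil {
              log.Fatal(err)
          }

          // Run the same SELECT inside a transaction. On the affected build the
          // underlying GSI scan stalls and the attempt times out, mirroring the
          // code-1080 timeout in the cbq transcript above.
          _, err = cluster.Transactions().Run(func(ctx *gocb.TransactionAttemptContext) error {
              res, err := ctx.Query(
                  "SELECT city, airportname FROM `travel-sample`.inventory.airport WHERE LOWER(city) = 'lyon'",
                  nil)
              if err != nil {
                  return err
              }
              var row interface{}
              for res.Next() { // drain the result set
                  if err := res.Row(&row); err != nil {
                      return err
                  }
              }
              return res.Err()
          }, nil)
          if err != nil {
              log.Printf("transaction failed (times out on affected builds): %v", err)
          }
      }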


            Activity

            Sitaram Vemulapalli added a comment -

            Not sure why the GSI client is using GSI Indexer port 9101; the Indexer team needs to take a look.

            2021-08-19T15:41:57.028-07:00 [Info] average scan response {226 ms}
            2021-08-19T15:42:03.787-07:00 [Error] receiving packet: read tcp 172.23.104.92:54800->172.23.104.92:9101: i/o timeout
            2021-08-19T15:42:03.787-07:00 [Error] [GsiScanClient:"172.23.104.92:9101"] req(765a8843-4cc6-43d1-9760-012c01438a54) connection "172.23.104.92:54800" response transport failed `read tcp 172.23.104.92:54800->172.23.104.92:9101: i/o timeout`
            2021-08-19T15:42:03.787-07:00 [Error] [GsiScanClient:"172.23.104.92:9101"] Scan3(765a8843-4cc6-43d1-9760-012c01438a54) response failed `read tcp 172.23.104.92:54800->172.23.104.92:9101: i/o timeout`
            2021-08-19T15:42:03.787-07:00 [Warn] scan failed: requestId 765a8843-4cc6-43d1-9760-012c01438a54 queryport 172.23.104.92:9101 inst 8909368539933541760 partition [0]
            2021-08-19T15:42:03.787-07:00 [Warn] Scan failed with error for index 16304947368467762895.  Trying scan again with replica, reqId:765a8843-4cc6-43d1-9760-012c01438a54 :  read tcp 172.23.104.92:54800->172.23.104.92:9101: i/o timeout from [172.23.104.92:9101] ...
            2021-08-19T15:42:03.787-07:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 1.  Partition with instances 0 
            2021-08-19T15:42:03.787-07:00 [Warn] Fail to find indexers to satisfy query request.  Terminate scan for index 16304947368467762895,  reqId:765a8843-4cc6-43d1-9760-012c01438a54 :  read tcp 172.23.104.92:54800->172.23.104.92:9101: i/o timeout from [172.23.104.92:9101]
            2021-08-19T15:42:03.787-07:00 [Info] [Queryport-connpool:172.23.104.92:9101] closing unhealthy connection "172.23.104.92:54800"
            2021-08-19T15:42:12.891-07:00 [Info] [Queryport-connpool:172.23.104.92:9101] active conns 0, free conns 1
            2021-08-19T15:42:13.409-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] c7ea858c-1ad2-4e0c-98c6-3c30df081646 new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results4414218759065
            2021-08-19T15:42:13.647-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] request(c7ea858c-1ad2-4e0c-98c6-3c30df081646) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results4414218759065 ...
            2021-08-19T15:42:16.014-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] 97b23a4f-e6e9-4077-a104-b35d0d762763 new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results4414619218468
            2021-08-19T15:42:16.207-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] request(97b23a4f-e6e9-4077-a104-b35d0d762763) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results4414619218468 ...
            2021-08-19T15:42:34.985-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] 34b5ae69-b14e-4c5e-81a6-aa5725cff119 new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results4414591192371
            2021-08-19T15:42:35.150-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] request(34b5ae69-b14e-4c5e-81a6-aa5725cff119) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results4414591192371 ...
            2021-08-19T15:42:37.510-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] 88ef8b31-6342-49e0-b703-5479d6765138 new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results4414178318070
            2021-08-19T15:42:37.714-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] request(88ef8b31-6342-49e0-b703-5479d6765138) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results4414178318070 ...
            2021-08-19T15:42:51.132-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] logstats "travel-sample" {"gsi_scan_count":25,"gsi_scan_duration":364971639789,"gsi_throttle_duration":31824508,"gsi_prime_duration":189093119,"gsi_blocked_duration":4633471685,"gsi_total_temp_files":23}
            2021-08-19T15:42:56.876-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] 16e65ab6-d3ce-4a14-9b0d-208ece4a253d new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results4414133979613
            2021-08-19T15:42:57.021-07:00 [Info] connected with 1 indexers
            2021-08-19T15:42:57.021-07:00 [Info] client stats current counts: current: 25, not current: 0
            2021-08-19T15:42:57.026-07:00 [Info] num concurrent scans {1}
            2021-08-19T15:42:57.026-07:00 [Info] average scan response {200 ms}
            2021-08-19T15:42:57.074-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] request(16e65ab6-d3ce-4a14-9b0d-208ece4a253d) removing temp file /opt/couchbase/var/lib/couchbase/tmp/scan-results4414133979613 ...
            2021-08-19T15:43:12.909-07:00 [Info] [Queryport-connpool:172.23.104.92:9101] active conns 1, free conns 0
            2021-08-19T15:43:51.132-07:00 [Info] GSIC[default/travel-sample-inventory-airport-1629410571128275540] logstats "travel-sample" {"gsi_scan_count":26,"gsi_scan_duration":365174599236,"gsi_throttle_duration":32847618,"gsi_prime_duration":195100513,"gsi_blocked_duration":4826624641,"gsi_total_temp_files":24}
            
            

            Kamini Jagtiani added a comment -

            Sai Krishna Teja

            Have the TLS changes for secondary-index been merged to the 7.0.2 branch? Thanks.

            Sai Krishna Teja added a comment -

            Yes, the changes are backported to 7.0.2 too. Build couchbase-server-7.0.2-6530 should have all 3 patches.
            The admin port and the scan port use the same port number with or without encryption; when encryption is enabled, the indexer closes the existing connection and creates a secure listener.
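            For illustration, a minimal Go sketch of that listener switch (names are hypothetical; this is not the actual indexer code):

            package sketch

            import (
                "crypto/tls"
                "net"
            )

            // switchToSecure sketches the behaviour described above: the port
            // number stays the same, but once encryption is enabled the plain
            // listener is closed and replaced by a TLS listener on the same
            // address. Hypothetical helper; the real indexer code differs.
            func switchToSecure(plain net.Listener, cfg *tls.Config) (net.Listener, error) {
                addr := plain.Addr().String() // e.g. "172.23.104.92:9101"

                // Closing the plain listener drops existing connections, which
                // is why clients may observe resets when encryption is enabled.
                if err := plain.Close(); err != nil {
                    return nil, err
                }

                // Re-listen on the identical host:port, now secured with TLS.
                return tls.Listen("tcp", addr, cfg)
            }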

            Sai Krishna Teja added a comment -

            The same issue is being tracked in MB-48030.

            Mihir Kamdar added a comment -

            Bulk closing not-fixed bugs

            Amit Kulkarni added a comment -

            Pierre Regazzoni, was it a session-consistent scan?

            Varun Velamuri added a comment -

            Below are the correct steps to reproduce this issue:

            a. Create a 1-node cluster with kv+n1ql+index services. Use 127.0.0.1 as the hostname when initialising the cluster.

            b. Create a bucket named default and add 1 document to the bucket.

            c. Create an index on the bucket. Scan the index.

            d. Now add a new node with kv+n1ql services to the cluster.

            e. Enable node-to-node encryption and set TLS strict mode.

            f. Add 10,000 more documents to the bucket and initiate a session-consistent scan; the scan will time out.

            The issue is caused by a bug in the projector's stream-management code, which surfaced due to enforced TLS. It will be fixed under MB-48030.

            Varun Velamuri added a comment -

            To be precise, the bug is because of the following:

            a. When there is only one node in the cluster, ns_server uses 127.0.0.1 as the hostname. When the indexer sends a mutation topic request to the projector, the indexer uses "127.0.0.1" as the endpoint address, and the projector uses the same endpoint for its book-keeping.

            b. When another node is added to the cluster, ns_server replaces the hostname with an IP address, say 172.16.12.41.

            c. When node-to-node encryption is enabled, all the HTTP connections are reset.

            d. When the connections are reset, the projector closes the 127.0.0.1 endpoint connection and deletes it from its book-keeping. But at this point both endpoints exist in the projector's book-keeping, i.e. 127.0.0.1 and 172.16.12.41 -> this is bug 1 in the code. Because of this bug, the projector does not shut down the topic.

            e. When the indexer sends a shutdownVbuckets request as part of repair, the projector does not process it, because the projector thinks the vbucket is on node 127.0.0.1 while the indexer asks to delete the vbucket on node 172.16.12.41 -> this is bug 2 in the code.

            The bugs have nothing to do with node-to-node encryption as such; any connection reset without restarting the process would have triggered them. They surface with the node-to-node encryption changes because enabling encryption resets the connections.
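            A minimal Go sketch of that book-keeping mismatch (all names hypothetical; the real projector code is more involved):

            package sketch

            // topicBook is a hypothetical stand-in for the projector's per-topic
            // book-keeping described above.
            type topicBook struct {
                endpoints map[string]bool // active endpoint addresses for the topic
                vbOwner   map[int]string  // vbucket -> endpoint address it was opened on
            }

            // onConnectionReset mirrors bug 1: the dropped endpoint ("127.0.0.1:...")
            // is deleted, but the same node still sits in the map under its renamed
            // address ("172.16.12.41:..."), so the topic is never shut down.
            func (b *topicBook) onConnectionReset(oldAddr string) {
                delete(b.endpoints, oldAddr)
                // Bug: nothing detects that the surviving entry is the same node;
                // len(b.endpoints) > 0 keeps the topic alive.
            }

            // shutdownVbuckets mirrors bug 2: the indexer's repair request names the
            // node by its new address, which never matches the stale owner the
            // projector recorded, so the request is silently dropped and the
            // session-consistent scan eventually times out.
            func (b *topicBook) shutdownVbuckets(addr string, vb int) bool {
                if b.vbOwner[vb] != addr { // "127.0.0.1:..." != "172.16.12.41:..."
                    return false
                }
                delete(b.vbOwner, vb)
                return true
            }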


              People

              Assignee:
              Jeelan Poola
              Reporter:
              Pierre Regazzoni
              Votes:
              0
              Watchers:
              9

