Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-36995

First select query fails when N2N Encryption is ON

    XMLWordPrintable

Details

    • Untriaged
    • Unknown

    Description

      Steps:
      1. Create a three node setup all nodes with KV + Index + Query. For correlating with logs from my repro, I have:
      node1: sb3904.sc.couchbase.com - KV+Index+Query
      node2: sb3905.sc.couchbase.com - KV+Index+Query
      node3: sb3906.sc.couchbase.com - KV+Index+Query

      2. Enable N2N encryption on, set level to All

      3. Issue below create index to query node1 targeting index creation to node2:

          CREATE INDEX idx1 ON default(name) WITH { "nodes": ["sb3905.sc.couchbase.com:8091"]}
      

      Sometimes the create index passes but mostly the first create fails with:
      msg":"GSI CreateIndex() - cause: Indexer node (sb3905.sc.couchbase.com:8091) not found. The node may be failed or under rebalance or network partitioned from query process."}

      4. Retry the query operation in the same query node after a few seconds. This time it would pass.

      5. Issue a select query for above created index in query node2. It is important to issue select statement to query node different than the one where create statement went. In my case, I issued select to query service in sb3905.

          SELECT name, meta().id FROM default WHERE name IS NOT NULL
      

      This fails with error:
      "msg":"No index available on keyspace default that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online."}],

      6. Try issuing the same select query to the third query node, it fails with the same error.

      Subsequent attempts to run select query from node2 and node3 succeeds.

      The root cause of this issue is explained in comment another N2N bug: https://issues.couchbase.com/browse/MB-35743?focusedCommentId=369617&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-369617

      At a high level, the issue occurs because:

      • GSI client is created upon first query made in the cbq-engine.
      • After GSI client is created, N1QL calls the datastore API SetConnectionSecurityConfig() to refresh TLS config in GSI Client.
      • Post which, CreateIndex or Scan APIs are invoked.
      • However, as part of creating GSI Client, watchers to indexers are created which keep failing as indexer is already on TLS port but client side security refresh happens asynchronously. (This is the reason for "tls: first record does not look like a TLS handshake. Kill Pipe." error logging on indexer side as indexer is using TLS but client is not yet)
      • WatchMetadata running and retrying is asynchronous to SetConnectionSecurityConfig(). It is possible that when CreateIndex or Scan is invoked, the meta client is not successfully connected to all indexer nodes in the cluster leading to create index failure or scan failing.

      For instance, digging into the query log of third node sb3906:

      GSI Client is created upon select query issued to the node at 04:49:29.092. WatchMetadata fails as indexer is already using TLS while client's security settings are not refreshed yet. (Pls note that index is created on node sb3905).

      2019-11-20T04:49:29.092-08:00 [Info] creating GsiClient for 127.0.0.1:8091
      2019-11-20T04:49:30.137-08:00 [Info] WatchMetadata(): unable to reach indexer at sb3904.sc.couchbase.com:9100. Retry in background.
      2019-11-20T04:49:31.137-08:00 [Info] WatchMetadata(): unable to reach indexer at sb3905.sc.couchbase.com:9100. Retry in background.
      2019-11-20T04:49:32.137-08:00 [Info] WatchMetadata(): unable to reach indexer at sb3906.sc.couchbase.com:9100. Retry in background.
      

      Security settings are received through the dataport API SetConnectionSecurityConfig() at 04:49:32.139

      2019-11-20T04:49:32.139-08:00 [Info] Recieve security change notification. encryption=true
      2019-11-20T04:49:32.139-08:00 [Info] Certificate refreshed successfully
      

      Scan query failed at 04:49:32 PST:

       curl -v http://localhost:8093/query/service -d 'statement=SELECT name, meta().id FROM default WHERE name IS NOT NULL&scan_consistency= request_plus&creds=[{"user":"Administrator", "pass":"password"}]'
      < Date: Wed, 20 Nov 2019 12:49:32 GMT
      "errors": [{"code":4000,"msg":"No index available on keyspace default that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online."}],
      "status": "fatal",
      

      However, watcher to node sb3905 got successfully created only at 04:49:33.

      2019-11-20T04:49:32.174-08:00 [Info] WatchMetadata(): Successfully connected to indexer at sb3904.sc.couchbase.com:9100 after retry.
      2019-11-20T04:49:32.176-08:00 [Info] WatchMetadata(): Successfully connected to indexer at sb3906.sc.couchbase.com:9100 after retry.
      2019-11-20T04:49:33.175-08:00 [Info] WatchMetadata(): Successfully connected to indexer at sb3905.sc.couchbase.com:9100 after retry.
      

      Subsequent scans after 04:49:33 work.

      Logs attached.

       

      Impact: Every time GSI Client is being created, ie when there is first query on a new cluster OR first query after query service restart can fail. Hence this bug fix would be a good-to-have in MH. 

      Attachments

        For Gerrit Dashboard: MB-36995
        # Subject Branch Project Status CR V

        Activity

          People

            ritam.sharma Ritam Sharma
            prathibha Prathibha Bisarahalli (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              PagerDuty