1. Create a three node setup all nodes with KV + Index + Query. For correlating with logs from my repro, I have:
node1: sb3904.sc.couchbase.com - KV+Index+Query
node2: sb3905.sc.couchbase.com - KV+Index+Query
node3: sb3906.sc.couchbase.com - KV+Index+Query
2. Enable N2N encryption on, set level to All
3. Issue below create index to query node1 targeting index creation to node2:
Sometimes the create index passes but mostly the first create fails with:
msg":"GSI CreateIndex() - cause: Indexer node (sb3905.sc.couchbase.com:8091) not found. The node may be failed or under rebalance or network partitioned from query process."}
4. Retry the query operation in the same query node after a few seconds. This time it would pass.
5. Issue a select query for above created index in query node2. It is important to issue select statement to query node different than the one where create statement went. In my case, I issued select to query service in sb3905.
This fails with error:
"msg":"No index available on keyspace default that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online."}],
6. Try issuing the same select query to the third query node, it fails with the same error.
Subsequent attempts to run select query from node2 and node3 succeeds.
The root cause of this issue is explained in comment another N2N bug: https://issues.couchbase.com/browse/MB-35743?focusedCommentId=369617&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-369617
At a high level, the issue occurs because:
- GSI client is created upon first query made in the cbq-engine.
- After GSI client is created, N1QL calls the datastore API SetConnectionSecurityConfig() to refresh TLS config in GSI Client.
- Post which, CreateIndex or Scan APIs are invoked.
- However, as part of creating GSI Client, watchers to indexers are created which keep failing as indexer is already on TLS port but client side security refresh happens asynchronously. (This is the reason for "tls: first record does not look like a TLS handshake. Kill Pipe." error logging on indexer side as indexer is using TLS but client is not yet)
- WatchMetadata running and retrying is asynchronous to SetConnectionSecurityConfig(). It is possible that when CreateIndex or Scan is invoked, the meta client is not successfully connected to all indexer nodes in the cluster leading to create index failure or scan failing.
For instance, digging into the query log of third node sb3906:
GSI Client is created upon select query issued to the node at 04:49:29.092. WatchMetadata fails as indexer is already using TLS while client's security settings are not refreshed yet. (Pls note that index is created on node sb3905).
Security settings are received through the dataport API SetConnectionSecurityConfig() at 04:49:32.139
Scan query failed at 04:49:32 PST:
However, watcher to node sb3905 got successfully created only at 04:49:33.
Subsequent scans after 04:49:33 work.
Impact: Every time GSI Client is being created, ie when there is first query on a new cluster OR first query after query service restart can fail. Hence this bug fix would be a good-to-have in MH.
|For Gerrit Dashboard: MB-36995|
|118285,7||MB-36995: Refresh security settings before creating new GsiClient||unstable||indexing||Status: MERGED||-2||+1|
|118354,4||MB-36995: Pass ConnectionSecurityConfig while invoking NewGSIIndexer||master||query||Status: MERGED||-2||+1|
|118553,2||MB-36995: Refresh security settings before creating new GsiClient||mad-hatter||indexing||Status: MERGED||+2||+1|
|118554,2||MB-36995: Pass ConnectionSecurityConfig while invoking NewGSIIndexer||mad-hatter||query||Status: MERGED||+2||+1|
|119038,2||MB-36995: Tests - Fix calls to NewGSIIndexer in test framework||unstable||indexing||Status: MERGED||+2||+1|
|119049,1||MB-36995: Tests - Fix calls to NewGSIIndexer in test framework||mad-hatter||indexing||Status: MERGED||+2||+1|