A few weeks ago we hit an issue with the query service in one of our Test clusters. While patching the cluster, both query nodes were mistakenly stopped at the same time.
After restarting the nodes, we noticed that our backends were still unable to run queries on the cluster. We had to recycle the backends to restore the service.
After investigation, we discovered that Libcouchbase doesn't refresh the topology map once all query nodes have left the cluster. Even after the nodes are added back, Libcouchbase still behaves as if there were no query service on the cluster.
This problem can be reproduced easily with n1qlback and the travel-sample bucket loaded on a local cluster (running on Docker for example).
Here are the steps to reproduce:
- Build a local dev cluster with data nodes and two query/index nodes.
As in this picture, for example:
- Load the travel-sample bucket.
- Create a text file with your sample query:
You should notice that n1qlback keeps returning errors even after the query nodes are back. There are actually several problems at play, but here we will focus on the tcpdump.
Stop n1qlback and your tcpdump.
If you open your tcpdump file in Wireshark, for example, you will notice that n1qlback/Libcouchbase stopped asking the cluster for the topology map after the last query node was removed. There is a hole in the dump, with no traffic at all between the process and the cluster. During this period, Libcouchbase actually returns an error immediately.
In the screenshot below, you can see the last TCP packets exchanged with 172.17.0.7 at the end of the rebalance. The last "Get Cluster Config Response" message is also highlighted. After that, no packet was sent for 59 seconds, until I stopped the n1qlback process. So Libcouchbase never detected that the query nodes were back.
I cannot attach my tcpdump to this ticket because it is far too big, but the problem is very easy to reproduce.
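If you prefer not to scroll through the capture by hand, the hole can also be located from the packet timestamps. The snippet below is only a rough sketch, not part of my original test: the function name `largest_gap` and the sample timestamps are made up for illustration, and it assumes you have exported the packet times as seconds (for example with tshark's `-T fields -e frame.time_epoch`) and know when the capture was stopped.

```python
from typing import List, Optional, Tuple


def largest_gap(
    timestamps: List[float], capture_end: Optional[float] = None
) -> Optional[Tuple[float, float, float]]:
    """Return (start, end, duration) of the longest silence in a capture.

    timestamps  -- packet times in seconds, in any order
    capture_end -- optional time at which the capture was stopped, so that
                   a silence at the very end of the capture is also counted
    """
    points = sorted(timestamps)
    if capture_end is not None:
        points.append(capture_end)
    if len(points) < 2:
        return None  # not enough data to measure a gap

    best = None
    # Compare each packet time with the next one and keep the widest interval.
    for prev, cur in zip(points, points[1:]):
        duration = cur - prev
        if best is None or duration > best[2]:
            best = (prev, cur, duration)
    return best


# Hypothetical values: four packets, then silence until the capture stops.
sample_times = [0.0, 0.5, 1.0, 1.5]
gap = largest_gap(sample_times, capture_end=60.5)
print(gap)
```

A long interval ending at `capture_end` would match what I observed: traffic stops after the last config response and never resumes.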
Please note: I tested with Libcouchbase 3.3.1.
Of course we should never stop all query nodes at the same time; this really was an operational mistake. But we would at least like to be able to recover from it without having to restart our whole application in order to refresh the topology map.
Could you please ensure that Libcouchbase keeps refreshing the cluster topology map from time to time, so that it can detect when query nodes are introduced into the cluster?