Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 3.3.2
Affects Version/s: 3.3.0
Component/s: library
Labels: None
Story Points: 1

Description

Hello,

We faced an issue with our query service a few weeks ago in one of our clusters in Test. By mistake, when patching the cluster, the two query nodes were stopped at the same time.

After restarting the nodes, we noticed that our backends were still unable to run queries on the cluster. We had to restart the backends in order to restore the service.

After investigation, we discovered that Libcouchbase doesn't refresh the topology map once all query nodes are out of the cluster. Even if we add them back, Libcouchbase keeps behaving as if there were no query service on the cluster.

This problem can be reproduced easily with n1qlback and the travel-sample bucket loaded on a local cluster (running on Docker for example).

Here are the steps to reproduce:

* Build a local dev cluster with data nodes and two query/index nodes (see the attached picture for an example).
* Load the travel-sample bucket.
* Create a text file with your sample query:

echo "{\"statement\":\"SELECT * FROM \`travel-sample\` WHERE icao=\\\"AFR\\\"\"}" > query.txt

* In another shell, start a tcpdump to capture the traffic between n1qlback and the cluster:

sudo tcpdump -X -w test_n1qlback_rebalance_travel-sample.pcap -s 20000 -i docker0 \( tcp port 11210 or tcp port 8093 \)

* Start n1qlback:

./bin/cbc-n1qlback -v -u admin -P password -U couchbase://172.17.0.3/travel-sample -f query.txt

* In the Couchbase UI, fail over the query nodes and rebalance the cluster. Then add the nodes again and rebalance.

You should notice that n1qlback keeps returning errors even after the query nodes are back. There are actually several problems in parallel, but here we will focus on the tcpdump.

Stop n1qlback and your tcpdump.
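
For the query-file step above, the nested escaping in the echo command is easy to get wrong. As a minimal sketch, assuming a POSIX shell, a quoted here-document should produce the same file contents without any shell-level escaping (only the JSON escapes around AFR remain):

```shell
# Write query.txt with a quoted here-document: because the delimiter is
# quoted ('EOF'), the shell leaves backticks and backslashes untouched.
cat > query.txt <<'EOF'
{"statement":"SELECT * FROM `travel-sample` WHERE icao=\"AFR\""}
EOF

# Show the file that will be passed to cbc-n1qlback with -f.
cat query.txt
```

The resulting file can be passed to cbc-n1qlback with -f query.txt exactly as in the steps above.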

If you open your tcpdump file in Wireshark, for example, you will notice that n1qlback/libcouchbase stopped asking the cluster for the topology map after the removal of the last query node. There is a hole in the dump, with no traffic at all between the process and the cluster. During this period, Libcouchbase returns an error immediately instead of contacting the cluster.

In the screenshot below, you can see the last TCP packets exchanged with 172.17.0.7 at the end of the rebalance. The last "Get Cluster Config Response" message is also highlighted. After that, no packet was sent for 59 seconds, until I stopped the n1qlback process. So Libcouchbase never detected that the query nodes were back.
I cannot attach my tcpdump to this ticket because it is too big, but the problem is very easy to reproduce.
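
The hole in the capture can also be located without scrolling through Wireshark. The sketch below assumes the packet timestamps have already been exported as one relative time in seconds per line, in ascending order (for example with something like tshark -T fields -e frame.time_relative; that export step is an assumption and is not shown). The helper name largest_gap and the sample values are hypothetical and are not taken from the real capture.

```shell
# largest_gap FILE: FILE holds one packet timestamp (seconds) per line,
# in ascending order. Prints the longest silence between two packets.
largest_gap() {
    awk 'NR > 1 { gap = $1 - prev; if (gap > max) { max = gap; start = prev } }
         { prev = $1 }
         END { printf "largest gap: %.2f s, starting at t=%.2f s\n", max, start }' "$1"
}

# Made-up sample timestamps, for illustration only.
printf '%s\n' 0.10 0.35 0.90 60.20 60.25 > times.txt
largest_gap times.txt
```

With a real export, a gap that starts right after the last "Get Cluster Config Response" would match the behaviour described above.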

Please note: I tested with Libcouchbase 3.3.1.

Of course, we should never stop all query nodes at the same time; this was an operational mistake. But we would at least like to be able to recover from the issue without having to restart our whole application in order to refresh the topology map.

Could you please ensure that Libcouchbase still refreshes the cluster topology map from time to time, so that it detects when query nodes are added back to the cluster?

Thanks!

Attachments

image-2022-06-14-18-25-46-051.png (80 kB, 14/Jun/22 9:25 AM)
image-2022-06-14-18-46-28-141.png (119 kB, 14/Jun/22 9:46 AM)

People

Assignee: Sergey Avseyev

Reporter: Guillaume Molléda

Votes: 0

Watchers: 7

Dates

Created: 14/Jun/22 9:53 AM

Updated: 18/Dec/23 5:39 PM

Resolved: 12/Aug/22 12:53 PM

Gerrit Reviews

There are no open Gerrit changes.

There is 1 closed Gerrit change:

CCBC-1559: cbc-n1qlback: give time to IO loop in case of failure

Libcouchbase doesn't refresh topology map after the loss of query nodes
