Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.0.0
Affects Version/s: Cheshire-Cat
Component/s: secondary-index
Labels:
- 130nodetest
- rebalance-failed
Environment:
7.0.0-4678

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
Unknown

Description

Description:

There was a rebalance failure during a full recovery from a graceful failover in the index service.

Cluster environment (130 nodes with a single service per node):

The cluster environment consists of 126 data nodes, 2 index nodes and 2 query nodes.

(The following was deduced/guessed with lots of help from Daniel Owen)

The master node: ns_1@ec2-3-85-225-46.compute-1.amazonaws.com

The error message that was observed in the logs:

Reading the ns_server.debug.log of the master node:

ns_server.debug.log
=========================CRASH REPORT=========================
crasher:
initial call: misc:'-spawn_monitor/1-fun-0-'/0
pid: <0.12685.507>
registered_name: 'service_rebalancer-index'
exception exit: {agent_died,<30122.11555.3>,
{linked_process_died,<30122.26068.3>,
{timeout,
{gen_server,call,
[<30122.21969.3>,
{call,"ServiceAPI.GetCurrentTopology",
#Fun<json_rpc_connection.0.44122352>},
60000]}}}}
in function service_rebalancer:run_rebalance/1 (src/service_rebalancer.erl, line 79)
ancestors: [<0.14018.497>]
message_queue_len: 0
messages: []
links: []
dictionary: []
trap_exit: false
status: running
heap_size: 4185
stack_size: 27
reductions: 8511
neighbours:

It looks like ns_server attempts an RPC to ServiceAPI.GetCurrentTopology in the secondary indexing service (see: Secondary indexer source code).

Guess: The error message returned by ns_server shows a 60000 ms timeout on the call, and the screenshot shows the rebalance failing after exactly one minute, so perhaps this is a case of the cluster being too big and the timeout being too small.
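To make the link between the crash report and the one-minute failure explicit, here is a minimal sketch that pulls the RPC name and the timeout out of the exit reason quoted above. The regular expression and the function name `extract_rpc_timeout` are illustrative assumptions, not part of ns_server or any Couchbase tooling.

```python
import re

# Excerpt of the crash report quoted above, copied from ns_server.debug.log.
CRASH_EXCERPT = """
exception exit: {agent_died,<30122.11555.3>,
{linked_process_died,<30122.26068.3>,
{timeout,
{gen_server,call,
[<30122.21969.3>,
{call,"ServiceAPI.GetCurrentTopology",
#Fun<json_rpc_connection.0.44122352>},
60000]}}}}
"""

# Hypothetical pattern: the quoted RPC name, then the last element of the
# gen_server:call argument list, which is the timeout in milliseconds.
RPC_TIMEOUT_RE = re.compile(
    r'\{call,"(?P<rpc>[^"]+)".*?\},\s*(?P<timeout_ms>\d+)\]',
    re.DOTALL,
)


def extract_rpc_timeout(report):
    """Return (rpc_name, timeout_ms) from a crash report, or None if absent."""
    match = RPC_TIMEOUT_RE.search(report)
    if match is None:
        return None
    return match.group("rpc"), int(match.group("timeout_ms"))


if __name__ == "__main__":
    rpc, timeout_ms = extract_rpc_timeout(CRASH_EXCERPT)
    print(f"{rpc} timed out after {timeout_ms / 1000:.0f} s")
```

The extracted value is 60000 ms, which matches the rebalance failing one minute in.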

What I expected to happen:

I expected the rebalance to succeed.

What was happening on the cluster before the rebalance failure:

There were 10 buckets with 100 collections in each bucket and roughly 350M documents per bucket (loaded by cbc-pillowfight).

There were several indexes of the form:

CREATE INDEX my_index ON default{bucket_no}._default.collection{collection_no}(Field_1, Field_2, Field_3);

(Roughly 20 to 30 indexes for bucket1 and 5 indexes for each bucket1 to bucket9)
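For anyone trying to reproduce the index layout, the template above can be expanded with a short script. This is a sketch only: the bucket and collection numbers passed in are placeholders, since the report does not say which collections were indexed, and `index_statements` is a hypothetical helper name.

```python
# Statement template taken verbatim from the report.
INDEX_TEMPLATE = (
    "CREATE INDEX my_index ON "
    "default{bucket_no}._default.collection{collection_no}"
    "(Field_1, Field_2, Field_3);"
)


def index_statements(bucket_no, collection_nos):
    """Expand the template for one bucket and a set of collection numbers."""
    return [
        INDEX_TEMPLATE.format(bucket_no=bucket_no, collection_no=collection_no)
        for collection_no in collection_nos
    ]


if __name__ == "__main__":
    # Assumed numbering: five collections (0 to 4) in bucket 1.
    for statement in index_statements(1, range(5)):
        print(statement)
```

The generated statements can then be submitted through whichever query client is in use.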

There were lots of queries being executed using cbc-n1qlback.

Logs:

Master [ns_server Coordinator, not GSI master] node (services: kv):

https://cb-engineering.s3.amazonaws.com/CBQE-6754/130_node_testing/collectinfo-2021-04-08T151016-ns_1%40ec2-3-85-225-46.compute-1.amazonaws.com.zip

Index nodes (services: index):

https://cb-engineering.s3.amazonaws.com/CBQE-6754/130_node_testing/collectinfo-2021-04-08T151016-ns_1%40ec2-35-153-199-86.compute-1.amazonaws.com.zip

https://cb-engineering.s3.amazonaws.com/CBQE-6754/130_node_testing/collectinfo-2021-04-08T151016-ns_1%40ec2-54-167-252-55.compute-1.amazonaws.com.zip

Query nodes (services: query):

https://cb-engineering.s3.amazonaws.com/CBQE-6754/130_node_testing/collectinfo-2021-04-08T151016-ns_1%40ec2-107-23-222-63.compute-1.amazonaws.com.zip

https://cb-engineering.s3.amazonaws.com/CBQE-6754/130_node_testing/collectinfo-2021-04-08T151016-ns_1%40ec2-52-90-226-181.compute-1.amazonaws.com.zip

Rest of the logs (The complete set of logs for all 130 nodes)

https://issues.couchbase.com/browse/CBQE-6754

Supportal

https://supportal.couchbase.com/snapshot/f73d1c329f907432419f026362f25f09::0

130 node testing report

https://docs.google.com/spreadsheets/d/1JMlyknLVDny4kmKKPJI9xrg4Xl9WcuhEioRy637zO8A/edit#gid=2066485831

Final notes:

Following this failure, I attempted to remove 10 nodes from the cluster, one of which was an index node, and the rebalance timed out in a similar manner after 1 minute (it seems to be timing out on a different RPC this time).

Attachments

- image-2021-04-08-18-45-53-654.png (87 kB, 08/Apr/21 10:45 AM)
- rebalanceReport (1).json (380 kB, 08/Apr/21 10:29 AM)
- Screenshot 2021-04-08 at 15.41.40.png (115 kB, 08/Apr/21 10:08 AM)
- Screenshot 2021-04-08 at 15.58.32.png (87 kB, 08/Apr/21 10:46 AM)

Issue Links

backports to

MB-46524 [BP MB-45553] [Rebalance failure] Rebalance fails while performing a full recovery from a graceful failover.

Resolved

People

Assignee: Girish Benakappa

Reporter: Asad Zaidi (Inactive)

Votes: 0

Watchers: 12

Dates

Due: 06/May/21

Created: 08/Apr/21 10:42 AM

Updated: 11/Aug/21 8:07 PM

Resolved: 06/May/21 11:47 AM
