Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 7.1.0
Affects Version/s: 7.1.0
Component/s: tools
Labels: None

Triage: Untriaged
Story Points: 1
Is this a Regression?: Yes
Sprint: Tools 2021 Dec

Description

What's the issue?
I currently can't set up a cluster_run cluster with two nodes that both run the backup service; the rebalance fails.

Steps to reproduce
1) cluster_run --nodes 2 --dont-rename
2) cluster_connect -n 2 -s 1024 -M plasma -T n0:kv+backup,n1:kv+backup

Observations
1) This works as expected with 1, 3 and 4 nodes, just not 2
2) The second node hasn't yet created a backup service log file
3) This does appear to work in 7.0.0
4) This does appear to work with other services
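
To check observation 1 across node counts, the two reproduction commands can be generated for any number of nodes. The following is a minimal sketch that only builds the command lines, using the flags shown in the steps above; the helper names `service_topology` and `repro_commands` are hypothetical and not part of cluster_run or cluster_connect.

```python
import shlex


def service_topology(num_nodes, services=("kv", "backup")):
    """Build the value passed to -T, e.g. 'n0:kv+backup,n1:kv+backup'."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    per_node = "+".join(services)
    return ",".join(f"n{i}:{per_node}" for i in range(num_nodes))


def repro_commands(num_nodes):
    """Return the two command lines from the reproduction steps for num_nodes nodes."""
    # Flags are copied from the steps in this report; nothing is executed here.
    run = ["cluster_run", "--nodes", str(num_nodes), "--dont-rename"]
    connect = [
        "cluster_connect", "-n", str(num_nodes), "-s", "1024",
        "-M", "plasma", "-T", service_topology(num_nodes),
    ]
    return [shlex.join(run), shlex.join(connect)]


if __name__ == "__main__":
    # Node counts mentioned in the observations: 1, 3 and 4 work, 2 fails.
    for n in (1, 2, 3, 4):
        print(f"# {n} node(s)")
        for cmd in repro_commands(n):
            print(cmd)
```

For two nodes this yields the same commands as the steps to reproduce; the other counts give the variants that were reported as working.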

Heartbeat Failure
[ns_server:error,2021-11-23T18:51:34.323Z,n_0@127.0.0.1:ns_heart_slow_status_updater<0.499.0>:ns_heart:grab_one_service_status:409]Failed to grab service backup status: {exit,
{timeout,
{gen_server,call,
['service_agent-backup',get_status,
2000]}},
[{gen_server,call,3,
[{file,"gen_server.erl"},{line,247}]},
{ns_heart,grab_one_service_status,1,
[{file,"src/ns_heart.erl"},
{line,406}]},
{ns_heart,
'-grab_service_statuses/0-lc$^1/1-1-',
1,
[{file,"src/ns_heart.erl"},
{line,402}]},
{ns_heart,current_status_slow_inner,
0,
[{file,"src/ns_heart.erl"},
{line,276}]},
{ns_heart,current_status_slow,1,
[{file,"src/ns_heart.erl"},
{line,235}]},
{ns_heart,slow_updater_loop,0,
[{file,"src/ns_heart.erl"},
{line,229}]}]}

CBAuth 500 status codes
2021/11/23 18:51:09 revrpc: Got error (Need 200 status!. Got {500 Internal Server Error 500 HTTP/1.1 1 1 map[Cache-Control:[no-cache,no-store,must-revalidate] Content-Length:[44] Content-Type:[application/json] Date:[Tue, 23 Nov 2021 18:51:08 GMT] Expires:[Thu, 01 Jan 1970 00:00:00 GMT] Pragma:[no-cache] Server:[Couchbase Server] X-Content-Type-Options:[nosniff] X-Frame-Options:[DENY] X-Permitted-Cross-Domain-Policies:[none] X-Xss-Protection:[1; mode=block]] 0xc000159540 44 [] true false map[] 0xc00018e000 <nil>}) and will retry in 1s
...
2021-11-23T18:51:09.477Z INFO (Main) Running node version backup-0.0.0-0000-bd0ebcd with options: -http-port=7100 -grpc-port=7200 -https-port=17100 -cert-path=/home/couchbase/Projects/couchbase-build/ns_server/data/n_0/config/certs/chain.pem -key-path=/home/couchbase/Projects/couchbase-build/ns_server/data/n_0/config/certs/pkey.pem -ca-path=/home/couchbase/Projects/couchbase-build/ns_server/data/n_0/config/certs/ca.pem -ipv4=required -ipv6=optional -cbm=/home/couchbase/Projects/couchbase-build/install/bin/cbbackupmgr -node-uuid=85ce9cbd1ede710f468d8ad026c12e62 -public-address=127.0.0.1 -admin-port=9000 -log-file=none -log-level=debug -integrated-mode -integrated-mode-host=http://127.0.0.1:9000 -secure-integrated-mode-host=https://127.0.0.1:19000 -integrated-mode-user=@backup -tmp-dir=/home/couchbase/Projects/couchbase-build/ns_server/tmp -cbauth-host=127.0.0.1:9000

ns_server crash report
[error_logger:error,2021-11-23T18:53:17.790Z,n_0@127.0.0.1:service_rebalancer-backup<0.5126.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
crasher:
initial call: misc:'-spawn_monitor/1-fun-0-'/0
pid: <0.5126.0>
registered_name: 'service_rebalancer-backup'
exception exit: {agent_died,<0.5049.0>,
{linked_process_died,<0.5050.0>,
{'n_0@127.0.0.1',
{no_connection,"backup-service_api"}}}}
in function service_rebalancer:run_rebalance/1 (src/service_rebalancer.erl, line 73)
ancestors: [cleanup_process,ns_janitor_server,ns_orchestrator_child_sup,
ns_orchestrator_sup,mb_master_sup,mb_master,
leader_registry_sup,leader_services_sup,<0.678.0>,
ns_server_sup,ns_server_nodes_sup,<0.273.0>,
ns_server_cluster_sup,root_sup,<0.146.0>]
message_queue_len: 0
messages: []
links: []
dictionary: []
trap_exit: false
status: running
heap_size: 2586
stack_size: 29
reductions: 7423
neighbours:

[ns_server:error,2021-11-23T18:53:17.791Z,n_0@127.0.0.1:cleanup_process<0.5125.0>:service_janitor:init_topology_aware_service:91]Initial rebalance for `backup` failed: {error,
{initial_rebalance_failed,backup,
{agent_died,<0.5049.0>,
{linked_process_died,<0.5050.0>,
{'n_0@127.0.0.1',
{no_connection,
"backup-service_api"}}}}}}

Rebalance Cancelled

2021-11-23T18:51:17.784Z INFO (Rebalance) Cancelling rebalance

2021-11-23T18:51:17.784Z ERROR (Rebalance) Couldn't confirm node was added {"nodeID": "85ce9cbd1ede710f468d8ad026c12e62", "err": "could not add self: retries aborted after 1 attempts: context canceled"}
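
When triaging, it can help to pull the structured fields (such as `nodeID` and `err`) out of backup service log lines like the ones above. The following is a rough sketch that assumes the layout seen in this report: a timestamp, a level, a component in parentheses, a message, and an optional trailing JSON object. The function name `parse_backup_log_line` is hypothetical, and the real log format may have variations this does not handle.

```python
import json
import re

# Assumed layout: "<timestamp> <LEVEL> (<Component>) <message> [<json object>]"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+\((?P<component>[^)]+)\)\s+(?P<rest>.*)$"
)


def parse_backup_log_line(line):
    """Split a log line into timestamp, level, component, message and fields.

    Returns None when the line does not follow the assumed layout.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    rest = record.pop("rest")
    message, fields = rest, {}
    brace = rest.find("{")
    if brace != -1:
        try:
            parsed = json.loads(rest[brace:])
        except json.JSONDecodeError:
            parsed = None  # Not valid JSON: keep the whole text as the message.
        if isinstance(parsed, dict):
            message, fields = rest[:brace].rstrip(), parsed
    record["message"] = message
    record["fields"] = fields
    return record


SAMPLE = (
    "2021-11-23T18:51:17.784Z ERROR (Rebalance) Couldn't confirm node was added "
    '{"nodeID": "85ce9cbd1ede710f468d8ad026c12e62", '
    '"err": "could not add self: retries aborted after 1 attempts: context canceled"}'
)

if __name__ == "__main__":
    record = parse_backup_log_line(SAMPLE)
    print(record["level"], record["component"], record["fields"]["err"])
```

Lines in other formats, such as the revrpc and ns_server entries earlier in this report, do not match the assumed layout and return None.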

Issue Links

relates to: MB-49946 [CBBS] Add unit testing for the top-level rebalance functions (Open)

People

Assignee: James Lee

Reporter: James Lee

Votes: 0

Watchers: 2

Dates

Created: 23/Nov/21 9:40 AM

Updated: 20/Jan/22 9:38 AM

Resolved: 07/Dec/21 6:46 AM

Gerrit Reviews

There are no open Gerrit changes. There is 1 closed Gerrit change:

MB-49732 Fix logic error when checking for retries aborted

[CBBS] cluster_run/cluster_connect does not work with 2 nodes (both using cbbs)
