Details
- Bug
- Resolution: Duplicate
- Major
- None
- 7.1.0
- Enterprise Edition 7.1.0 build 1430 | 32 kv nodes | centos7 | 4 gb ram | 4 cpu(s) | virtualised | ~4.8GB total disk space per node
- Triaged
- 1
- Unknown
Description
Disclaimer: The hardware specifications may be insufficient given this particular cluster sizing.
Description:
A rebalance fails after a graceful-failover and delta-node recovery with the following error message:
172.23.100.11:ns_server.debug.log (lines 2257931 to 2257941):
[user:error,2021-10-07T09:05:09.281-07:00,ns_1@172.23.100.11:<0.737.0>:ns_orchestrator:log_rebalance_completion:1412]Rebalance exited with reason {{badmatch,
{error,
{failed_nodes,['ns_1@172.23.104.217']}}},
[{ns_janitor,cleanup_apply_config_body,4,
[{file,"src/ns_janitor.erl"},{line,294}]},
{ns_janitor,'-cleanup_apply_config/4-fun-0-',
4,
[{file,"src/ns_janitor.erl"},{line,214}]},
{async,'-async_init/4-fun-1-',3,
[{file,"src/async.erl"},{line,191}]}]}.
Rebalance Operation Id = b6238e9a319d976afa675a4360ffdc71
Cluster configuration:
A 32-node (kv-only) cluster with a single Magma bucket (full ejection, replicas=1) holding ~160 million randomised documents of 256 bytes each.
What does the test do before the rebalance failure?:
The test performs roughly 13 steps (listed in the appendix) and finally performs the following 3 steps while data is being loaded:
1. A graceful-failover of node 172.23.104.217 which succeeds.
2. Selects node 172.23.104.217 for delta-node-recovery.
3. A rebalance operation is performed.
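For anyone reproducing the three steps by hand, they correspond to three calls on the cluster manager REST API. The sketch below only builds the requests (path plus form body) and does not send them. The helper name `delta_recovery_requests` is hypothetical, the test framework's own client is not shown in this report, and the two-node `knownNodes` list is a shortened placeholder for the full 32-node list.

```python
from urllib.parse import urlencode


def delta_recovery_requests(node, known_nodes):
    """Return (path, form_body) pairs for the three steps, in order.

    Hypothetical helper: it only prepares POST bodies for the cluster
    manager (port 8091 by default); nothing is sent here.
    """
    return [
        # 1. Graceful failover of the node.
        ("/controller/startGracefulFailover", urlencode({"otpNode": node})),
        # 2. Mark the failed-over node for delta recovery.
        ("/controller/setRecoveryType",
         urlencode({"otpNode": node, "recoveryType": "delta"})),
        # 3. Rebalance, keeping every node and ejecting none.
        ("/controller/rebalance",
         urlencode({"knownNodes": ",".join(known_nodes), "ejectedNodes": ""})),
    ]


# Placeholder node list; the real cluster has 32 entries.
requests_to_send = delta_recovery_requests(
    "ns_1@172.23.104.217",
    ["ns_1@172.23.104.217", "ns_1@172.23.100.11"],
)
for path, body in requests_to_send:
    print("POST", path, body)
```

The graceful failover has to finish before the recovery type is set, so a real client would poll the running task between the first and second request.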
What happens?:
The rebalance operation fails, leaving the vbuckets unevenly distributed:
The (most recent) rebalance report of the rebalance failure (rebalance_report_20211007T160509.json) (Orchestrator: 172.23.100.11):
7 Oct 09:05 rebalance_report_20211007T160509.json:
{
  "stageInfo": {
    "data": {
      "startTime": "2021-10-07T09:02:54.508-07:00",
      "completedTime": false,
      "timeTaken": 134900,
      "subStages": {
        "deltaRecovery": {
          "startTime": "2021-10-07T09:02:54.721-07:00",
          "completedTime": false,
          "timeTaken": 134688
        }
      }
    }
  },
  "rebalanceId": "b6238e9a319d976afa675a4360ffdc71",
  "nodesInfo": {
    "active_nodes": [
      "ns_1@172.23.104.217",
      "ns_1@172.23.104.211",
      "ns_1@172.23.100.13",
      "ns_1@172.23.104.248",
      "ns_1@172.23.104.222",
      "ns_1@172.23.104.109",
      "ns_1@172.23.100.15",
      "ns_1@172.23.104.241",
      "ns_1@172.23.104.251",
      "ns_1@172.23.104.236",
      "ns_1@172.23.100.22",
      "ns_1@172.23.104.221",
      "ns_1@172.23.100.11",
      "ns_1@172.23.104.235",
      "ns_1@172.23.104.249",
      "ns_1@172.23.100.14",
      "ns_1@172.23.100.17",
      "ns_1@172.23.100.28",
      "ns_1@172.23.100.20",
      "ns_1@172.23.104.243",
      "ns_1@172.23.104.86",
      "ns_1@172.23.104.238",
      "ns_1@172.23.104.246",
      "ns_1@172.23.104.226",
      "ns_1@172.23.104.73",
      "ns_1@172.23.100.16",
      "ns_1@172.23.104.134",
      "ns_1@172.23.104.231",
      "ns_1@172.23.104.237",
      "ns_1@172.23.104.179",
      "ns_1@172.23.104.234",
      "ns_1@172.23.104.250"
    ],
    "keep_nodes": [
      "ns_1@172.23.104.217",
      "ns_1@172.23.104.211",
      "ns_1@172.23.100.13",
      "ns_1@172.23.104.248",
      "ns_1@172.23.104.222",
      "ns_1@172.23.104.109",
      "ns_1@172.23.100.15",
      "ns_1@172.23.104.241",
      "ns_1@172.23.104.251",
      "ns_1@172.23.104.236",
      "ns_1@172.23.100.22",
      "ns_1@172.23.104.221",
      "ns_1@172.23.100.11",
      "ns_1@172.23.104.235",
      "ns_1@172.23.104.249",
      "ns_1@172.23.100.14",
      "ns_1@172.23.100.17",
      "ns_1@172.23.100.28",
      "ns_1@172.23.100.20",
      "ns_1@172.23.104.243",
      "ns_1@172.23.104.86",
      "ns_1@172.23.104.238",
      "ns_1@172.23.104.246",
      "ns_1@172.23.104.226",
      "ns_1@172.23.104.73",
      "ns_1@172.23.100.16",
      "ns_1@172.23.104.134",
      "ns_1@172.23.104.231",
      "ns_1@172.23.104.237",
      "ns_1@172.23.104.179",
      "ns_1@172.23.104.234",
      "ns_1@172.23.104.250"
    ],
    "eject_nodes": [],
    "delta_nodes": [
      "ns_1@172.23.104.217"
    ],
    "failed_nodes": []
  },
  "masterNode": "ns_1@172.23.100.11",
  "completionMessage": "Rebalance exited with reason {{badmatch,\n {error,\n {failed_nodes,['ns_1@172.23.104.217']}}},\n [{ns_janitor,cleanup_apply_config_body,4,\n [{file,\"src/ns_janitor.erl\"},{line,294}]},\n {ns_janitor,'-cleanup_apply_config/4-fun-0-',\n 4,\n [{file,\"src/ns_janitor.erl\"},{line,214}]},\n {async,'-async_init/4-fun-1-',3,\n [{file,\"src/async.erl\"},{line,191}]}]}."
}
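The report above can be read mechanically to see which stages never completed and when they stopped. The following is a minimal sketch, assuming `timeTaken` is expressed in milliseconds since `startTime` (which is consistent with the "Rebalance exited" log entry at 09:05:09); the function name `incomplete_stages` is hypothetical and the embedded JSON is a trimmed copy of the `stageInfo` section only.

```python
import json
from datetime import datetime, timedelta


def incomplete_stages(stages, prefix=""):
    """Yield (path, stop_time) for each stage or sub-stage with completedTime == false."""
    for name, info in stages.items():
        path = prefix + name
        if info.get("completedTime") is False:
            # Assumption: timeTaken is milliseconds elapsed since startTime.
            stopped = (datetime.fromisoformat(info["startTime"])
                       + timedelta(milliseconds=info["timeTaken"]))
            yield path, stopped.isoformat(timespec="milliseconds")
        # Recurse into nested sub-stages, if any.
        yield from incomplete_stages(info.get("subStages", {}), path + ".")


# Trimmed copy of the stageInfo section of the rebalance report.
report = json.loads("""
{"stageInfo": {"data": {
    "startTime": "2021-10-07T09:02:54.508-07:00",
    "completedTime": false,
    "timeTaken": 134900,
    "subStages": {"deltaRecovery": {
        "startTime": "2021-10-07T09:02:54.721-07:00",
        "completedTime": false,
        "timeTaken": 134688}}}}}
""")

for path, stopped in incomplete_stages(report["stageInfo"]):
    print(path, "stopped at", stopped)
```

Under that assumption, both the `data` stage and its `deltaRecovery` sub-stage stop roughly 135 seconds after the rebalance started, so the failure happened during delta recovery and before any vbucket moves were reported as complete.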
Logs:
Logs from the orchestrator (172.23.100.11):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.11.zip
ns_server.debug.log (lines 2260596 to 2260636):
=========================CRASH REPORT=========================
crasher:
initial call: ns_janitor_server:run_cleanup/2
pid: <0.8635.122>
registered_name: cleanup_process
exception error: no match of right hand side value
{error,{failed_nodes,['ns_1@172.23.104.217']}}
in function ns_janitor:cleanup_apply_config_body/4 (src/ns_janitor.erl, line 294)
in call from ns_janitor:'-cleanup_apply_config/4-fun-0-'/4 (src/ns_janitor.erl, line 214)
in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
ancestors: [ns_janitor_server,ns_orchestrator_child_sup,
ns_orchestrator_sup,mb_master_sup,mb_master,
leader_registry_sup,leader_services_sup,<0.680.0>,
ns_server_sup,ns_server_nodes_sup,<0.264.0>,
ns_server_cluster_sup,root_sup,<0.140.0>]
message_queue_len: 0
messages: []
links: [<0.734.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 46422
stack_size: 27
reductions: 4107
neighbours:

[error_logger:error,2021-10-07T09:05:17.146-07:00,ns_1@172.23.100.11:logger_proxy<0.68.0>:ale_error_logger_handler:do_log:101]Error in process <0.8281.122> on node 'ns_1@172.23.100.11' with exit value:
{{badmatch,{error,{failed_nodes,['ns_1@172.23.104.217']}}},
[{ns_janitor,cleanup_apply_config_body,4,
[{file,"src/ns_janitor.erl"},{line,294}]},
{ns_janitor,'-cleanup_apply_config/4-fun-0-',4,
[{file,"src/ns_janitor.erl"},{line,214}]},
{async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]}

[error_logger:error,2021-10-07T09:05:17.146-07:00,ns_1@172.23.100.11:logger_proxy<0.68.0>:ale_error_logger_handler:do_log:101]Error in process <0.8652.122> on node 'ns_1@172.23.100.11' with exit value:
{{badmatch,{error,{failed_nodes,['ns_1@172.23.104.217']}}},
[{ns_janitor,cleanup_apply_config_body,4,
[{file,"src/ns_janitor.erl"},{line,294}]},
{ns_janitor,'-cleanup_apply_config/4-fun-0-',4,
[{file,"src/ns_janitor.erl"},{line,214}]},
{async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]}
(xref: http://src.couchbase.org/source/xref/trunk/ns_server/src/ns_janitor.erl#294)
There do not appear to be any hanging vbuckets (the combined count of move start and move end events is even for each vbucket):
Command output:
# cat ns_server.debug.log | grep -e "ns_rebalance_observer.*vbucket" | cut -d ']' -f 2 | awk '{ print $6}' | sort -n | uniq -c
8 14)
8 15)
8 16)
10 17)
10 18)
10 19)
4 20)
4 21)
4 22)
4 23)
4 24)
4 25)
6 26)
6 27)
8 28)
8 29)
6 30)
6 31)
2 35)
2 39)
2 40)
2 41)
2 42)
2 46)
2 47)
2 48)
4 49)
4 50)
4 51)
4 52)
4 53)
4 54)
10 55)
10 56)
10 57)
4 58)
4 59)
4 60)
4 61)
4 62)
4 63)
2 64)
2 65)
2 66)
2 69)
2 70)
8 74)
6 75)
6 76)
6 77)
4 78)
4 79)
4 80)
4 81)
6 82)
6 83)
10 84)
6 85)
10 86)
8 87)
4 88)
4 89)
12 90)
12 91)
6 92)
8 93)
6 94)
10 95)
2 100)
2 101)
4 102)
4 103)
4 104)
4 105)
4 106)
4 107)
4 108)
2 109)
2 110)
2 111)
4 112)
4 113)
4 114)
4 115)
4 116)
4 117)
4 118)
4 119)
4 120)
8 121)
6 122)
12 123)
8 124)
8 125)
8 126)
6 127)
2 128)
2 129)
2 130)
6 131)
6 135)
4 136)
4 137)
2 138)
4 139)
4 140)
8 141)
4 142)
4 143)
6 144)
4 145)
4 146)
4 147)
4 148)
4 149)
6 150)
10 151)
6 152)
10 153)
10 154)
6 155)
12 156)
6 157)
6 158)
8 159)
2 160)
2 161)
2 162)
6 163)
2 164)
2 165)
2 166)
2 167)
4 170)
2 171)
2 172)
2 173)
2 174)
2 175)
2 176)
2 177)
2 178)
2 179)
2 180)
4 181)
4 182)
10 183)
10 184)
10 185)
4 186)
6 187)
4 188)
4 189)
4 190)
4 191)
4 192)
4 193)
4 194)
4 195)
6 196)
6 197)
6 198)
6 199)
6 200)
4 201)
6 202)
4 203)
4 204)
2 205)
2 206)
4 207)
2 208)
2 209)
4 210)
2 211)
4 212)
4 213)
8 214)
10 215)
10 216)
6 217)
12 218)
10 219)
6 220)
18 221)
8 222)
6 223)
4 224)
4 225)
4 226)
4 227)
4 228)
4 229)
4 230)
4 231)
4 232)
2 233)
2 234)
4 235)
6 236)
6 237)
6 238)
2 239)
12 240)
4 241)
4 242)
4 243)
8 244)
4 245)
4 246)
6 247)
4 248)
4 249)
6 250)
6 251)
6 252)
8 253)
6 254)
6 255)
4 256)
6 257)
6 258)
6 259)
6 260)
6 261)
6 262)
6 263)
6 264)
4 265)
4 266)
6 267)
6 268)
6 269)
6 270)
6 271)
4 272)
8 273)
6 274)
6 275)
4 276)
6 277)
4 278)
4 279)
4 280)
4 281)
4 282)
4 283)
4 284)
6 285)
6 286)
6 287)
4 288)
4 289)
8 290)
6 291)
6 292)
6 293)
6 294)
6 295)
6 296)
6 297)
4 298)
4 299)
4 300)
2 301)
2 302)
4 303)
6 304)
2 307)
2 311)
2 312)
2 313)
4 314)
4 315)
4 316)
6 317)
6 318)
6 319)
4 320)
4 321)
4 322)
4 323)
4 324)
4 325)
6 326)
6 327)
10 328)
10 329)
10 330)
8 331)
6 332)
6 333)
6 334)
6 335)
2 336)
8 337)
6 338)
6 339)
8 340)
8 341)
10 342)
8 343)
8 344)
8 345)
8 346)
12 347)
8 348)
8 349)
12 350)
10 351)
6 352)
4 353)
4 354)
8 355)
8 356)
8 357)
12 358)
4 359)
4 360)
4 361)
6 362)
4 363)
4 364)
4 365)
4 366)
4 367)
2 368)
4 369)
6 370)
10 371)
4 372)
10 373)
8 374)
8 375)
8 376)
6 377)
6 378)
6 379)
12 380)
6 381)
8 382)
8 383)
4 384)
4 385)
4 386)
6 387)
6 388)
6 389)
12 390)
12 391)
8 392)
6 393)
6 394)
6 395)
6 396)
2 397)
2 402)
2 403)
2 406)
4 407)
4 408)
4 409)
10 410)
10 411)
10 412)
12 413)
12 414)
8 415)
6 416)
6 417)
4 418)
4 419)
4 420)
4 421)
4 422)
4 423)
4 424)
4 425)
6 426)
12 427)
12 428)
10 429)
4 431)
2 432)
2 435)
2 439)
2 440)
2 441)
8 442)
8 443)
8 444)
6 445)
6 446)
6 447)
10 448)
10 449)
10 450)
6 451)
10 452)
8 453)
6 454)
10 455)
10 456)
10 457)
10 458)
6 459)
6 460)
6 461)
6 462)
6 463)
4 464)
4 465)
4 466)
2 469)
2 470)
2 471)
2 472)
4 473)
6 474)
6 475)
6 476)
10 477)
8 478)
8 479)
4 480)
4 481)
4 482)
12 483)
6 484)
6 485)
10 486)
6 487)
6 488)
6 489)
6 490)
6 491)
4 492)
4 493)
6 494)
8 495)
6 496)
6 497)
6 498)
2 499)
2 500)
2 501)
4 502)
2 505)
2 508)
8 509)
10 510)
10 511)
4 512)
4 513)
6 514)
4 515)
6 516)
6 517)
6 518)
6 519)
6 520)
6 521)
6 522)
6 523)
10 524)
10 525)
4 526)
6 527)
2 531)
2 537)
6 538)
6 539)
12 540)
8 541)
10 542)
10 543)
6 544)
6 545)
4 546)
6 547)
4 548)
4 549)
6 550)
6 551)
6 552)
4 553)
4 554)
4 555)
6 556)
8 557)
8 558)
6 559)
4 560)
6 561)
2 562)
4 563)
2 564)
4 565)
6 566)
2 567)
2 568)
6 569)
8 570)
8 571)
4 572)
8 573)
8 574)
8 575)
6 576)
6 577)
8 578)
4 579)
4 580)
10 581)
4 582)
4 583)
4 584)
6 585)
4 586)
4 587)
4 588)
2 589)
2 590)
2 594)
6 595)
2 598)
8 605)
8 606)
8 607)
8 608)
8 609)
6 610)
4 611)
4 612)
8 613)
8 614)
4 615)
4 616)
4 617)
4 618)
8 619)
10 620)
2 621)
2 622)
2 623)
8 624)
2 625)
2 626)
2 627)
4 631)
4 632)
4 633)
6 634)
12 635)
6 636)
6 637)
6 638)
6 639)
4 640)
4 641)
6 642)
6 643)
4 644)
4 645)
6 646)
6 647)
6 648)
8 649)
8 650)
6 651)
4 652)
4 653)
4 654)
4 655)
2 656)
2 657)
2 658)
2 659)
2 660)
2 662)
2 664)
4 665)
2 666)
4 667)
6 668)
4 669)
6 670)
6 671)
6 672)
12 673)
12 674)
6 675)
4 676)
4 677)
10 678)
6 679)
4 680)
4 681)
4 682)
6 683)
10 684)
10 685)
10 686)
4 687)
4 688)
4 689)
6 690)
8 691)
4 692)
8 693)
8 694)
2 695)
2 696)
4 697)
6 698)
6 699)
6 700)
4 701)
4 702)
6 703)
4 704)
4 705)
6 706)
6 707)
6 708)
6 709)
4 710)
4 711)
6 712)
10 713)
10 714)
8 715)
4 716)
4 717)
6 718)
4 719)
2 720)
2 721)
6 722)
4 726)
4 729)
4 733)
4 734)
4 735)
2 736)
2 737)
4 738)
6 739)
6 740)
6 741)
4 742)
4 743)
10 744)
6 745)
6 746)
6 747)
10 748)
10 749)
10 750)
4 751)
4 752)
4 753)
6 754)
10 755)
10 756)
10 757)
4 758)
4 759)
6 760)
8 761)
6 762)
4 763)
8 764)
4 765)
4 766)
4 767)
4 771)
4 772)
4 773)
4 774)
4 775)
6 776)
10 777)
10 778)
10 779)
4 780)
4 781)
6 782)
4 783)
4 784)
8 785)
4 786)
4 787)
4 788)
2 789)
2 793)
4 794)
8 795)
4 796)
4 797)
4 798)
6 799)
2 800)
2 801)
2 802)
4 803)
4 804)
6 805)
6 806)
6 807)
6 808)
6 809)
6 810)
6 811)
4 812)
4 813)
4 814)
10 815)
10 816)
10 817)
8 818)
8 819)
6 820)
8 821)
6 822)
4 823)
8 824)
10 825)
4 826)
6 827)
4 828)
4 829)
4 830)
6 831)
2 838)
2 839)
2 840)
10 841)
10 842)
6 843)
4 844)
4 845)
4 846)
4 847)
8 848)
4 849)
8 850)
8 851)
8 852)
8 853)
8 854)
10 855)
6 856)
6 857)
6 858)
6 859)
10 860)
6 861)
4 862)
4 863)
10 864)
4 865)
4 866)
4 867)
4 868)
6 869)
4 870)
4 871)
4 872)
4 873)
4 874)
4 875)
4 876)
4 877)
4 878)
4 879)
4 880)
4 881)
10 882)
10 883)
6 884)
4 885)
10 886)
10 887)
10 888)
8 889)
8 890)
8 891)
10 892)
4 893)
4 894)
10 895)
6 896)
4 897)
6 898)
10 899)
4 900)
4 901)
6 902)
2 903)
6 904)
6 905)
6 906)
4 907)
4 908)
6 909)
4 910)
6 911)
10 912)
4 913)
4 914)
6 915)
4 916)
4 917)
4 918)
16 919)
10 920)
12 921)
8 922)
8 923)
8 924)
8 925)
6 926)
6 927)
6 928)
6 929)
8 930)
8 931)
8 932)
8 933)
6 934)
6 935)
8 936)
4 937)
4 938)
4 939)
4 940)
6 941)
10 942)
8 943)
8 944)
8 945)
10 946)
8 947)
8 948)
8 949)
8 950)
8 951)
8 952)
6 953)
6 954)
6 955)
4 956)
4 957)
6 958)
6 959)
4 960)
6 961)
6 962)
4 963)
4 964)
4 965)
4 966)
4 967)
4 968)
4 969)
4 970)
6 971)
4 972)
4 973)
8 974)
10 975)
12 976)
8 977)
8 978)
8 979)
10 980)
10 981)
16 982)
10 983)
8 984)
8 985)
8 986)
8 987)
8 988)
8 989)
8 990)
8 991)
4 992)
6 993)
4 994)
6 995)
4 996)
6 997)
10 998)
4 999)
4 1000)
4 1001)
4 1002)
2 1003)
2 1004)
6 1005)
4 1006)
8 1007)
12 1008)
8 1009)
8 1010)
8 1011)
10 1012)
10 1013)
12 1014)
12 1015)
8 1016)
8 1017)
8 1018)
8 1019)
8 1020)
8 1021)
8 1022)
10 1023)
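The parity check on the output above can be automated rather than read by eye. This is a small sketch, assuming each line has the `count vbucket)` shape produced by the pipeline above; the function name `odd_count_vbuckets` is hypothetical.

```python
def odd_count_vbuckets(uniq_output):
    """Return the vbucket ids whose move start/end event count is odd.

    Expects `uniq -c` style lines such as '8 14)'. Lines that do not
    split into exactly two fields are skipped.
    """
    odd = []
    for line in uniq_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank line or unexpected layout
        count = int(parts[0])
        vbucket = int(parts[1].rstrip(")"))
        if count % 2:
            odd.append(vbucket)
    return odd


# First few lines of the command output, used as a quick check.
sample = """
8 14)
8 15)
10 17)
4 20)
"""
print(odd_count_vbuckets(sample))
```

An empty list means every vbucket has a matching number of start and end events; any id that is returned would be a candidate for a hanging move.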
Logs from the node that was gracefully failed over (172.23.104.217):
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.217.zip
ns_server.debug.log (lines 365897 to 366008):
=========================CRASH REPORT=========================
crasher:
initial call: janitor_agent:init/1
pid: <0.16850.83>
registered_name: 'janitor_agent-GleamBookUsers0'
exception exit: {{{{case_clause,
{error,
{{{badmatch,
{error,
{{badmatch,{dcp_error,etmpfail,undefined}},
[{dcp_proxy,connect,5,
[{file,"src/dcp_proxy.erl"},{line,262}]},
{dcp_proxy,maybe_connect,2,
[{file,"src/dcp_proxy.erl"},{line,236}]},
{dcp_consumer_conn,init,2,
[{file,"src/dcp_consumer_conn.erl"},
{line,51}]},
{dcp_proxy,init,1,
[{file,"src/dcp_proxy.erl"},{line,52}]},
{gen_server,init_it,2,
[{file,"gen_server.erl"},{line,374}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},{line,342}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,249}]}]}}},
[{dcp_replicator,init,1,
[{file,"src/dcp_replicator.erl"},{line,48}]},
{gen_server,init_it,2,
[{file,"gen_server.erl"},{line,374}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},{line,342}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,249}]}]},
{child,undefined,
{'ns_1@172.23.100.20',
[collections,del_times,del_user_xattr,json,
set_consumer_name,snappy,xattr]},
{dcp_replicator,start_link,
['ns_1@172.23.100.20',"GleamBookUsers0",
[collections,del_times,del_user_xattr,json,
set_consumer_name,snappy,xattr]]},
temporary,60000,worker,
[dcp_replicator]}}}},
[{dcp_sup,start_replicator,2,
[{file,"src/dcp_sup.erl"},{line,51}]},
{dcp_sup,'-manage_replicators/2-lc$^3/1-3-',2,
[{file,"src/dcp_sup.erl"},{line,101}]},
{dcp_replication_manager,handle_call,3,
[{file,"src/dcp_replication_manager.erl"},{line,83}]},
{gen_server,try_handle_call,4,
[{file,"gen_server.erl"},{line,661}]},
{gen_server,handle_msg,6,
[{file,"gen_server.erl"},{line,690}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,249}]}]},
{gen_server,call,
['dcp_replication_manager-GleamBookUsers0',
{manage_replicators,
['ns_1@172.23.100.20','ns_1@172.23.100.22',
'ns_1@172.23.100.28','ns_1@172.23.104.109',
'ns_1@172.23.104.134','ns_1@172.23.104.179',
'ns_1@172.23.104.211','ns_1@172.23.104.221',
'ns_1@172.23.104.222','ns_1@172.23.104.226',
'ns_1@172.23.104.231','ns_1@172.23.104.234',
'ns_1@172.23.104.235','ns_1@172.23.104.236',
'ns_1@172.23.104.237','ns_1@172.23.104.238',
'ns_1@172.23.104.241','ns_1@172.23.104.250',
'ns_1@172.23.104.251','ns_1@172.23.104.86']},
infinity]}},
{gen_server,call,
['replication_manager-GleamBookUsers0',
{set_desired_replications,
[{'ns_1@172.23.100.20',"ñòó"},
{'ns_1@172.23.100.22',[745,746,747]},
{'ns_1@172.23.100.28',[527,559,560]},
{'ns_1@172.23.104.109',[362,934,935]},
{'ns_1@172.23.104.134',[496,497,498]},
{'ns_1@172.23.104.179',[82,83,337,565]},
{'ns_1@172.23.104.211',[135,136,137,585]},
{'ns_1@172.23.104.221',[17,18,19,55]},
{'ns_1@172.23.104.222',"89TÝ"},
{'ns_1@172.23.104.226',[569,570,571]},
{'ns_1@172.23.104.231',"V·¸"},
{'ns_1@172.23.104.234',[185,380,691]},
{'ns_1@172.23.104.235',[413,693,694]},
{'ns_1@172.23.104.236',[755,756,757]},
{'ns_1@172.23.104.237',[573,574,575]},
{'ns_1@172.23.104.238',[414,882,883]},
{'ns_1@172.23.104.241',[342,982,998]},
{'ns_1@172.23.104.250',[744,841,842]},
{'ns_1@172.23.104.251',[219,558,649]},
{'ns_1@172.23.104.86',[673,674,899]}]},
infinity]}}
in function gen_server:call/3 (gen_server.erl, line 223)
in call from janitor_agent:handle_apply_new_config_replicas_phase/3 (src/janitor_agent.erl, line 1285)
in call from janitor_agent:check_for_node_rename/4 (src/janitor_agent.erl, line 1307)
in call from gen_server:try_handle_call/4 (gen_server.erl, line 661)
in call from gen_server:handle_msg/6 (gen_server.erl, line 690)
ancestors: ['janitor_agent_sup-GleamBookUsers0',
'single_bucket_kv_sup-GleamBookUsers0',ns_bucket_sup,
ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
<0.18870.0>,ns_server_cluster_sup,root_sup,<0.139.0>]
message_queue_len: 0
messages: []
links: [<0.30475.82>]
dictionary: []
trap_exit: false
status: running
heap_size: 75113
stack_size: 27
reductions: 986223
neighbours:
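Three of the vbucket lists in the `set_desired_replications` call of the crash report are printed as strings ("ñòó", "89TÝ", "V·¸") rather than as integers. Erlang's pretty-printer shows a list of integers as a string when every element is a printable character, so these are ordinary vbucket id lists. The sketch below recovers the ids, assuming the characters survived copying as single Latin-1 code points; the function name is hypothetical.

```python
import unicodedata


def erlang_string_to_ints(text):
    """Return the integer list that Erlang printed as a string.

    NFC normalisation guards against accented characters having been
    decomposed into base letter plus combining mark during copy/paste.
    """
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


printed_lists = [
    ("ns_1@172.23.100.20", "ñòó"),
    ("ns_1@172.23.104.222", "89TÝ"),
    ("ns_1@172.23.104.231", "V·¸"),
]
for node, printed in printed_lists:
    print(node, erlang_string_to_ints(printed))
```

The decoded ids fall in the same 0 to 1023 range as the other lists, and, like them, are mostly runs of consecutive vbuckets.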
This may indicate that something timed out on the kv side.
A brief look at memcached.log shows some slow runtime warnings; however, these do not appear to be close to the start time of the rebalance (09:02:54).
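To make the "close to the start time" check repeatable, the warnings can be filtered by a time window around the rebalance start. This is only a sketch: it assumes each memcached.log line begins with an ISO-8601 timestamp followed by a space, and that the warnings contain the text "slow runtime"; neither the exact log layout nor the helper name `lines_near` is taken from this report.

```python
from datetime import datetime, timedelta


def lines_near(lines, reference, window_seconds=60, needle="slow runtime"):
    """Return lines containing `needle` whose timestamp is within the window.

    Assumption: the first space-separated field of each line is an
    ISO-8601 timestamp. Lines that do not parse are skipped.
    """
    ref = datetime.fromisoformat(reference)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        if needle.lower() not in line.lower():
            continue
        stamp = line.split(" ", 1)[0]
        try:
            delta = abs(datetime.fromisoformat(stamp) - ref)
        except (ValueError, TypeError):
            # Unparseable stamp, or a mix of offset-aware and naive times.
            continue
        if delta <= window:
            hits.append(line)
    return hits
```

It would be called with the lines of memcached.log and the rebalance start time `"2021-10-07T09:02:54.508-07:00"`; an empty result for a window of a few minutes would support the observation above.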
Appendix:
Remaining logs:
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.13.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.14.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.15.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.16.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.17.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.20.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.22.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.100.28.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.109.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.134.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.179.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.211.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.221.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.222.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.226.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.231.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.234.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.235.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.236.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.237.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.238.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.241.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.243.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.246.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.248.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.249.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.250.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.251.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.73.zip
https://cb-engineering.s3.amazonaws.com/MB-48814/qe/collectinfo-2021-10-08T082312-ns_1%40172.23.104.86.zip
The steps the test performed before the rebalance failure:
- Create a 32 node cluster
- Create required buckets and collections.
- Create 4000000 items sequentially.
- Update 4000000 random keys to create 50 percent fragmentation
- Create 4000000 items sequentially
- Update 4000000 random keys to create 50 percent fragmentation
- Multiple sequential auto-failover of 5 nodes followed by rebalance in.
- Rebalance in (a single node) with document loading.
- Rebalance out (a single node) with document loading.
- Rebalance in and out with document loading.
- Swap rebalance with document loading
- Failover a node and RebalanceOut that node with loading in parallel
- Failover a node and FullRecovery that node
- Failover a node and DeltaRecovery that node with loading in parallel followed by rebalance (note: test failed here)
Some (Promtimer) graphs from node 172.23.104.217 (the vertical red bar at 09:02:54 marks the start of the failed rebalance):
The graphs show that the node in question may not be as resource constrained as I initially thought.
Issue Links
- duplicates
- MB-47387 [Magma] - Opening bucket during recovery takes a long time
- Closed