Details
-
Bug
-
Resolution: Fixed
-
Critical
-
7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1
-
Untriaged
-
0
-
Yes
Description
Issue
XDCR could miss replicating data. This found via the XDCR dev’s collection test suite of test 4i: https://github.com/couchbase/goxdcr/blob/master/tools/testScripts/collectionTestcases/4i_explicit_mapping_change_ts.shlib
The test was part of the effort to validate MB-49463.
Running it seems to fail now, with the following scenario.
The test pauses and changes the explicit mapping at the same time, then issues a pipeline resume.
At this time, once the pipeline resumes, backfill should have happened, but there is a race condition in XDCR that could cause missed data.
In production, a pipeline could be paused or it could just be restarting due to any other error circumstances at the time a backfill is raised, and the same situation could apply.
The mechanics that causes missing data is the way backfill is to be raised, logic here: https://github.com/couchbase/goxdcr/blob/20158cb0da504b7a51cf915afdacd8e841259408/backfill_manager/backfill_request_handler.go#L1036-L1049
The issue is when pipeline is stopping and backfill is being raised, checkpoint manager is going to perform checkpoint as part of the pipeline stop.
However, backfill raise logic may require the need to read checkpoints. And both of these efforts are not coordinated. Thus, it’s possible to read an incomplete set of checkpoints, which is missing certain VBs as the checkpoint manager is performing checkpoint independently and haven’t gotten the chance to create the VBs yet.
In this scenario, the definition of “incomplete” could mean either 1) missing the VB document or 2) checkpoint seqno being written isn’t persisted yet, and backfill mgr reads an earlier seqno.
As an example:
Say VB0 is at throughSeqno 100. Checkpoint currently sits at 50.
When pipeline stops, the checkpoint needs to be updated to 100.
Before checkpoint can be written to 100, backfill raise got the previous checkpoint of 50, and performs a backfill from 0 to 50 only.
Once the main pipeline and backfill pipeline resumes, the main pipeline will start at 100, while backfill pipeline will perform backfill from 0 to 50 only.
XDCR will have missed replicating seqnos 50 to 100.
Through some debugging and analysis, the culprit is due to the checkpoint manager cache implemented as part of P2P of MB-49434.
Workaround
One way to workaround is to disable checkpoint cache, using a replication setting with the key of ckptSvcCacheEnabled. By default as of 7.1.0, it is true.
Disabling the cache will ensure the test passes and all data is backfilled.
Solution
From at least what I can see, things are raceful. Looking at the original need to implement MB-49434, I would argue that it may be safer to invalidate the checkpoint cache as soon as a single VB checkpoint is inserted.
I suspect right now the cache implementation is too complex, and having the proposed solution seemed to fix the problem.
I need to re-evaluate the cost and do some rudimentary testing to ensure that P2P won't have a performance hit due to the simplified cache proposal (i.e. reintroduce MB-49434). If it doesn't, then the proposed solution would probably be best.
Testing and debugging below
test output |
Waiting 15 seconds before checking item count for backfill
|
Checking item count 1 / 8
|
ERROR: Cluster C2 bucket B2 only has 28194 items. Items are increasing...
|
XDCR is missing a subset of backfill. Using the diff tool, we have the following counts with the specific VB information:
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: Source bucket item count including tombstones is 30000 (excluding 0 filtered mutations)
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: Target bucket item count including tombstones is 29854 (excluding 0 filtered mutations)
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: Here are the vbuckets with different item counts:
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:576 source count 9, target count 6
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:921 source count 9, target count 6
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:703 source count 39, target count 26
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:749 source count 42, target count 28
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:928 source count 21, target count 14
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:846 source count 15, target count 10
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:525 source count 45, target count 30
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:827 source count 39, target count 26
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:30 source count 51, target count 34
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:196 source count 6, target count 4
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:316 source count 36, target count 24
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:1000 source count 12, target count 8
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:713 source count 51, target count 34
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:671 source count 18, target count 12
|
2023-05-19T15:07:55.984-07:00 INFO GOXDCR.xdcrDiffTool: vb:757 source count 6, target count 4
|
2023-05-19T15:07:55.985-07:00 INFO GOXDCR.xdcrDiffTool: vb:565 source count 39, target count 26
|
2023-05-19T15:07:55.985-07:00 INFO GOXDCR.xdcrDiffTool: DiffDataFiles routine completed
|
2023-05-19T15:07:56.004-07:00 INFO GOXDCR.xdcrDiffTool: runMutationDiffer started with compareBody=false
|
2023-05-19T15:07:56.004-07:00 INFO GOXDCR.xdcrDiffTool: Mutation srcDiff to work on 146 srcPovFetchList with diffs.
|
2023-05-19T15:07:56.165-07:00 INFO GOXDCR.xdcrDiffTool: runMutationDiffer completed
|
Perusing through the logs, we see the following “optimization” message:
2023-05-19T15:06:14.750-07:00 INFO GOXDCR.GenericPipeline: Starting BackfillPipeline backfill_88782077c55b0b9c1d358b69729abeab/B1/B2-723272000
|
2023-05-19T15:06:14.940-07:00 INFO GOXDCR.DcpNozzle: VBTasksMap only has a subset of original VBs. Replaced [512 …1023] with [512 1023] by removing [525 565 576 671 703 713 749 757 827 846 921 928 1000]
|
2023-05-19T15:06:15.280-07:00 INFO GOXDCR.DcpNozzle: VBTasksMap only has a subset of original VBs. Replaced [0 … 511] with [0 … 511] by removing [30 196 316]
|
The backfill task that corresponds with this specific missing data set is supposed to be:
2023-05-19T15:06:09.933-07:00 INFO GOXDCR.BackfillMgr: Replication 88782077c55b0b9c1d358b69729abeab/B1/B2 - These collections need to backfill SOURCE ||S1.col3|| -> TARGET(s) |Scope: S1 Collection: col3|
|
for vb->seqnos map[0:37 1:21 2:57 3:65 4:65 5:61 6:25 7:33 8:13 9:37 10:77 11:65 12:73 13:53 14:29 15:33 16:61 17:65 18:25 19:29 20:13 21:37 22:65 23:69 24:73 25:69 26:17 27:33 28:21 29:17 31:53 32:69 33:57 34:9 35:29 36:29 37:25 38:69 39:53 40:61 41:57 42:41 43:21 44:21 45:29 46:57 47:57 48:13 49:33 50:65 51:57 52:69 53:57 54:21 55:21 56:29 57:37 58:61 59:57 60:57 61:61 62:13 63:37 64:25 65:33 66:61 67:61 68:57 69:57 70:29 71:21 72:21 73:25 74:65 75:53 76:77 77:61 78:9 79:33 80:61 81:61 82:21 83:33 84:33 85:29 86:57 87:57 88:65 89:53 90:29 91:17 92:17 93:25 94:69 95:57 96:73 97:61 98:21 99:33 100:9 101:37 102:77 103:57 104:57 105:65 106:29 107:37 108:33 109:29 110:65 111:61 112:29 113:21 114:73 115:61 116:73 117:57 118:13 119:25 120:13 121:29 122:57 123:69 124:65 125:61 126:33 127:41 128:57 129:65 130:13 131:29 132:25 133:17 134:57 135:69 136:61 137:61 138:33 139:21 140:25 141:33 142:57 143:57 144:21 145:21 146:53 147:65 148:61 149:77 150:33 151:9 152:37 153:25 154:61 155:61 156:57 157:57 158:17 159:29 160:25 161:13 162:69 163:57 164:61 165:65 166:37 167:33 168:17 169:29 170:65 171:73 172:57 173:73 174:21 175:17 176:69 177:57 178:33 179:29 180:25 181:33 182:61 183:65 184:61 185:73 186:29 187:21 188:33 189:9 190:57 191:77 192:69 193:73 194:29 195:17 197:25 198:57 199:73 200:65 201:57 202:25 203:25 204:33 205:13 206:69 207:65 208:33 209:13 210:65 211:77 212:53 213:73 214:29 215:29 216:17 217:37 218:65 219:57 220:65 221:65 222:29 223:25 224:37 225:29 226:57 227:61 228:61 229:57 230:33 231:13 232:29 233:13 234:57 235:65 236:57 237:69 238:21 239:21 240:57 241:61 242:25 243:41 244:25 245:21 246:61 247:57 248:57 249:69 250:25 251:9 252:25 253:29 254:53 255:69 256:33 257:17 258:69 259:73 260:53 261:73 262:17 263:21 264:29 265:25 266:65 267:57 268:69 269:65 270:37 271:13 272:65 273:77 274:37 275:13 276:33 277:29 278:53 279:73 280:65 281:57 282:21 283:37 284:33 285:25 286:61 287:65 288:57 289:61 290:33 291:29 292:37 293:13 294:61 295:57 296:57 297:65 298:33 299:13 300:21 301:21 302:57 303:69 304:25 305:41 306:57 307:61 308:57 309:57 310:29 311:21 312:29 313:9 314:57 315:69 317:69 318:25 319:29 320:17 321:29 322:53 323:65 324:57 325:69 326:25 327:17 328:29 329:21 330:61 331:61 332:57 333:57 334:29 335:33 336:53 337:65 338:25 339:21 340:33 341:9 342:61 343:77 344:61 345:61 346:37 347:25 348:21 349:29 350:57 351:57 352:69 353:57 354:29 355:13 356:41 357:33 358:61 359:65 360:61 361:73 362:21 363:29 364:25 365:13 366:57 367:73 368:37 369:29 370:65 371:57 372:61 373:65 374:29 375:33 376:33 377:21 378:61 379:73 380:57 381:77 382:37 383:9 384:61 385:61 386:25 387:37 388:29 389:17 390:57 391:57 392:65 393:53 394:21 395:21 396:9 397:33 398:77 399:61 400:21 401:33 402:61 403:61 404:57 405:57 406:33 407:25 408:29 409:13 410:65 411:57 412:69 413:57 414:17 415:25 416:21 417:29 418:73 419:61 420:77 421:57 422:13 423:33 424:29 425:33 426:57 427:69 428:65 429:61 430:33 431:25 432:73 433:65 434:29 435:17 436:13 437:21 438:73 439:57 440:57 441:69 442:13 443:25 444:33 445:37 446:65 447:61 448:57 449:65 450:37 451:17 452:25 453:29 454:65 455:65 456:77 457:65 458:13 459:33 460:33 461:29 462:73 463:53 464:25 465:25 466:57 467:65 468:65 469:69 470:13 471:33 472:17 473:29 474:73 475:69 476:73 477:57 478:21 479:13 480:9 481:25 482:69 483:57 484:69 485:53 486:29 487:25 488:41 489:25 490:61 491:57 492:57 493:61 494:21 495:25 496:65 497:57 498:13 499:29 500:21 501:21 502:69 503:53 504:61 505:57 506:29 507:37 508:13 509:33 510:57 511:61 512:21 513:17 514:73 515:53 516:73 517:69 518:17 519:33 520:13 521:37 522:65 523:69 524:57 526:25 527:29 528:73 529:53 530:29 531:33 532:13 533:37 534:77 535:65 536:65 537:61 538:25 539:33 540:37 541:21 542:57 543:65 544:57 545:61 546:13 547:37 548:29 549:33 550:61 551:57 552:69 553:57 554:21 555:21 556:13 557:33 558:65 559:57 560:21 561:29 562:57 563:57 564:61 566:41 567:25 568:29 569:25 570:69 571:53 572:69 573:57 574:9 575:29 577:25 578:69 579:57 580:65 581:53 582:29 583:17 584:33 585:29 586:57 587:57 588:61 589:61 590:21 591:29 592:77 593:61 594:9 595:33 596:21 597:25 598:65 599:53 600:57 601:57 602:29 603:21 604:25 605:37 606:61 607:61 608:65 609:61 610:33 611:41 612:13 613:29 614:57 615:69 616:73 617:57 618:13 619:25 620:29 621:21 622:73 623:61 624:33 625:29 626:65 627:61 628:57 629:65 630:29 631:37 632:9 633:37 634:77 635:57 636:73 637:61 638:21 639:33 640:57 641:57 642:17 643:29 644:37 645:25 646:61 647:61 648:61 649:77 650:33 651:9 652:21 653:21 654:53 655:65 656:25 657:33 658:57 659:57 660:61 661:61 662:33 663:21 664:25 665:17 666:57 667:69 668:57 669:65 670:13 672:33 673:13 674:57 675:77 676:61 677:73 678:29 679:21 680:25 681:33 682:61 683:65 684:69 685:57 686:33 687:29 688:57 689:73 690:21 691:13 692:17 693:29 694:65 695:73 696:61 697:65 698:37 699:33 700:25 701:13 702:69 704:65 705:65 706:29 707:25 708:17 709:37 710:65 711:57 712:53 714:29 715:33 716:33 717:13 718:65 719:77 720:33 721:13 722:69 723:65 724:65 725:57 726:25 727:25 728:13 729:21 730:57 731:73 732:69 733:73 734:29 735:17 736:25 737:29 738:53 739:69 740:57 741:69 742:25 743:9 744:25 745:21 746:61 747:57 748:57 750:25 751:41 752:53 753:69 754:21 755:21 756:29 758:57 759:65 760:61 761:57 762:33 763:13 764:37 765:29 766:57 767:61 768:33 769:25 770:61 771:65 772:65 773:57 774:21 775:37 776:33 777:29 778:53 779:73 780:65 781:77 782:37 783:13 784:69 785:65 786:37 787:13 788:29 789:25 790:65 791:61 792:53 793:73 794:17 795:21 796:33 797:17 798:69 799:73 800:53 801:69 802:25 803:29 804:29 805:9 806:57 807:69 808:57 809:57 810:29 811:21 812:21 813:41 814:57 815:61 816:21 817:21 818:57 819:69 820:57 821:65 822:33 823:13 824:37 825:13 826:61 828:57 829:61 830:37 831:29 832:21 833:29 834:57 835:57 836:61 837:61 838:33 839:25 840:33 841:9 842:61 843:77 844:53 845:65 847:21 848:57 849:57 850:29 851:33 852:33 853:21 854:61 855:61 856:57 857:69 858:25 859:17 860:17 861:29 862:53 863:65 864:57 865:77 866:37 867:9 868:33 869:21 870:61 871:73 872:61 873:65 874:29 875:33 876:37 877:29 878:65 879:57 880:25 881:13 882:57 883:73 884:61 885:73 886:21 887:29 888:41 889:33 890:61 891:65 892:69 893:57 894:29 895:13 896:69 897:57 898:17 899:25 900:29 901:13 902:65 903:57 904:57 905:57 906:33 907:25 908:21 909:33 910:61 911:61 912:9 913:33 914:77 915:61 916:65 917:53 918:21 919:21 920:29 922:57 923:57 924:61 925:61 926:25 927:37 929:37 930:65 931:61 932:57 933:69 934:13 935:25 936:17 937:21 938:73 939:57 940:73 941:65 942:29 943:17 944:65 945:61 946:33 947:25 948:29 949:33 950:57 951:69 952:77 953:57 954:9 955:33 956:21 957:29 958:73 959:61 960:73 961:57 962:25 963:13 964:17 965:29 966:73 967:69 968:65 969:69 970:13 971:33 972:25 973:25 974:57 975:65 976:29 977:29 978:73 979:53 980:77 981:65 982:13 983:33 984:25 985:29 986:65 987:65 988:57 989:65 990:37 991:17 992:13 993:33 994:57 995:61 996:61 997:57 998:29 999:37 1001:21 1002:69 1003:57 1004:65 1005:57 1006:13 1007:29 1008:57 1009:61 1010:21 1011:25 1012:41 1013:25 1014:61 1015:57 1016:69 1017:53 1018:29 1019:25 1020:9 1021:25 1022:69 1023:57]
|
Looking at the backfill tasks, we’re missing the VBs. The Backfill task isn’t being raised correctly.
Issue | Resolution |
XDCR Checkpoint Manager instances were not cleaned up under certain circumstances due to timing and networking issues when contacting target, or when an invalid backfill task was fed in as input. | Checkpoint Manager instances are now cleaned up. A flag has been added to check for invalid backfill tasks. |