  1. Couchbase Server
  2. MB-54667

Timekeeper vbstate update after stream begin during repair causes repair loop


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.6.0
    • 7.0.4, 7.1.3
    • secondary-index
    • Security Level: Public
    • Untriaged
    • 0
    • Unknown

    Description

      The repair loop is caused by an update to the stream state in the timekeeper while a repair is in progress.

      handlePoolChange triggered a stream repair because of the failover, and that repair was still in progress.

      2022-11-04T06:33:34.372+00:00 [Info] Timekeeper::handlePoolChange streamId: MAINT_STREAM, keyspaceId:PayDocker, vbList: [42 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 94 95 96 97 98 99 100 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 190 191 192 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 266 268 269 270 271 272 273 298 299 300 301 321 322 323 324 325 326 327 328 329 341 342 393 394 395 396 397 398 399 400 445 446 468 502 503 504 505 506 507 508 509 511 513 514 515 516 517 518 519 550 551 552 553 554 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 661 662 663 664 674 675 676 677 678 679 680 681 682 723 724 725 726 727 728 729 730 731 732 733 734 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 841 842 843 844 845 846 847 848 849 850 851 852 853 854 860 861 862 863 864 865 866 867 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 982 983 984 985 986 987 988 989 990 991 998 999 1000 1001 1002 1003 1004 1005 1006]
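      The vbList in the handlePoolChange line is long, so checking by eye whether a given vbucket is part of the repair is error-prone. The following is a minimal sketch, assuming the log line has the `vbList: [ ... ]` shape shown above; `parseVbList` and `containsVb` are hypothetical helper names, and the sample line in `main` is an abbreviated stand-in for the real log line.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVbList extracts the vbucket numbers from the "vbList: [ ... ]"
// part of a log line. The line shape is an assumption based on this ticket.
func parseVbList(line string) ([]int, error) {
	const key = "vbList: ["
	start := strings.Index(line, key)
	if start < 0 {
		return nil, fmt.Errorf("no vbList found")
	}
	rest := line[start+len(key):]
	end := strings.Index(rest, "]")
	if end < 0 {
		return nil, fmt.Errorf("unterminated vbList")
	}
	fields := strings.Fields(rest[:end])
	vbs := make([]int, 0, len(fields))
	for _, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("bad vbucket %q: %w", f, err)
		}
		vbs = append(vbs, n)
	}
	return vbs, nil
}

// containsVb reports whether vb appears in vbs.
func containsVb(vbs []int, vb int) bool {
	for _, v := range vbs {
		if v == vb {
			return true
		}
	}
	return false
}

func main() {
	// Abbreviated sample, not the full log line from the ticket.
	line := "2022-11-04T06:33:34.372+00:00 [Info] Timekeeper::handlePoolChange streamId: MAINT_STREAM, keyspaceId:PayDocker, vbList: [42 47 48 1005 1006]"
	vbs, err := parseVbList(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("vb 1006 in repair list:", containsVb(vbs, 1006))
}
```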

      Taking vb 1006 as an example, its current state is:
      vbState: VBS_CONN_ERROR repairState: REPAIR_RESTART_VB

      2022-11-04T06:33:34.376+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId PayDocker vb 1006

      No StreamEnd arrived because the replica had transitioned to active just before the shutdown/restart vbuckets call reached the projector. Because the vbucket was not active on the projector, the projector did not close the stream.

       

      2022-11-04T06:33:30.679800+00:00 INFO (PayDocker) VBucket::setState: transitioning vb:1006 with high seqno:295 from:replica to:active meta:{"topology":[["ns_1@bkaoaitpeewypmd.jcsmd2b3pesg05ey.cloud.couchbase.com",null]]}
      

       

      A StreamBegin is then seen and is assumed to be a duplicate because no StreamEnd was seen. The state becomes:
      vbState: VBS_STREAM_BEGIN repairState: REPAIR_NONE

      2022-11-04T06:33:36.563+00:00 [Error] DATP[->dataport ":9105"] duplicate vbucket 10.0.17.217:34536#PayDocker#1006
      2022-11-04T06:33:36.564+00:00 [Info] TK StreamBegin MAINT_STREAM PayDocker 1006 242666451476113 0 255 bkaoaitpeewypmd.jcsmd2b3pesg05ey.cloud.couchbase.com:8091. HWT [295-295,295].
      

      The dataport connection closes and triggers a new repair.

      2022-11-04T06:34:04.662+00:00 [Info] MutationStreamReader::handleStreamInfoMsg 
      Received ConnectionError from Client for Stream MAINT_STREAM map[
      ...
      PayDocker:[730 204 679 886 824 326 885 881 1006 731 99 59 127 681 860 959 142 971 131 553 78 973 136 820 121 1005 970 1000 73 957 584 146 225 819 843 1001 81 224 821 590 70 889 882 861 968 58 519 826 514 895 190 324 582 446 728 998 734 814 848 94 950 984 217 942 80 876 148 398 888 149 587 72 863 851 133 97 983 212 506 952 862 810 71 299 842 982 60 270 845 52 88 325 341 815 822 517 53 118 884 86 844 394 966 550 827 145 214 505 849 117 946 682 209 219 137 808 678 985 49 878 144 208 82 990 83 216 134 1003 955 153 892 969 809 329 576 675 206 226 879 130 880 954 395 513 327 271 832 210 951 511 578 813 397 50 574 891 552 207 554 949 890 575 967 68 399 122 322 87 151 119 74 831 508 85 893 75 585 867 807 676 960 802 887 396 95 231 269 723 662 581 1002 829 850 300 988 298 726 588 150 811 963 64 98 875 66 972 944 816 551 273 223 580 583 123 77 577 54 138 400 828 229 661 128 76 953 518 124 126 321 986 61 301 79 999 221 502 63 57 674 213 866 817 205 586 853 152 864 140 125 192 56 129 818 830 846 854 964 804 272 143 445 51 812 877 883 579 732 230 943 961 852 503 328 677 727 139 987 1004 805 468 841 956 147 42 589 218 664 220 504 96 393 945 227 55 724 991 509 803 958 323 132 116 948 67 865 989 222 663 947 268 228 591 215 69 342 516 48 847 806 729 725 962 120 733 65 680 100 62 515 47 191 825 965 141 84 823 894 266 135 507 211]
      ].

      The vbucket state is set to VBS_CONN_ERROR by the connection error handling above:
      vbState: VBS_CONN_ERROR repairState: REPAIR_RESTART_VB

      2022-11-04T06:34:04.666+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId PayDocker vb 1006

      The new repair ends immediately because a repair is already running.
      The old repair continues once the timekeeper gets the response from kv_sender:
      vbState: VBS_CONN_ERROR repairState: REPAIR_SHUTDOWN_VB

      2022-11-04T06:34:36.923+00:00 [Info] StreamState::set repair state to SHUTDOWN_VB for MAINT_STREAM keyspaceId PayDocker vb 1006
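      To follow a single vbucket through the indexer log, the StreamState lines quoted in this description can be matched with a regular expression. This is a minimal sketch that assumes the lines have exactly the shape shown in this ticket; `repairEvent` and `parseRepairLine` are hypothetical names and are not part of the indexer code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// repairEvent is a hypothetical record for one "set repair state" log line.
type repairEvent struct {
	Timestamp   string
	RepairState string
	Stream      string
	KeyspaceID  string
	VB          int
}

// repairLine assumes the line shape quoted in this ticket, with or
// without the "connection error - " prefix.
var repairLine = regexp.MustCompile(
	`^(\S+) \[Info\] StreamState::(?:connection error - )?set repair state to (\S+) for (\S+) keyspaceId (\S+) vb (\d+)$`)

// parseRepairLine returns the parsed event, or false if the line does not match.
func parseRepairLine(line string) (repairEvent, bool) {
	m := repairLine.FindStringSubmatch(strings.TrimSpace(line))
	if m == nil {
		return repairEvent{}, false
	}
	vb, err := strconv.Atoi(m[5])
	if err != nil {
		return repairEvent{}, false
	}
	return repairEvent{
		Timestamp:   m[1],
		RepairState: m[2],
		Stream:      m[3],
		KeyspaceID:  m[4],
		VB:          vb,
	}, true
}

func main() {
	// The three StreamState lines for vb 1006 quoted in this ticket.
	lines := []string{
		"2022-11-04T06:33:34.376+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId PayDocker vb 1006",
		"2022-11-04T06:34:04.666+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId PayDocker vb 1006",
		"2022-11-04T06:34:36.923+00:00 [Info] StreamState::set repair state to SHUTDOWN_VB for MAINT_STREAM keyspaceId PayDocker vb 1006",
	}
	for _, l := range lines {
		if ev, ok := parseRepairLine(l); ok && ev.VB == 1006 {
			fmt.Printf("%s vb %d -> %s\n", ev.Timestamp, ev.VB, ev.RepairState)
		}
	}
}
```

      Printing the transitions in timestamp order makes it easier to see that the last recorded repair state for vb 1006 is SHUTDOWN_VB.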

      Because the timekeeper has already shut down the vbuckets, it sets the repair state to REPAIR_SHUTDOWN_VB, but the vbucket state had been set to VBS_CONN_ERROR after the previous StreamBegin was seen.
      The next repair to fix the VBS_CONN_ERROR will not pick this vbucket for shutdown again, because its repair state is REPAIR_SHUTDOWN_VB.
      So no StreamEnd or StreamBegin will arrive again for this vbucket to reset its state. Even MTR will not restart this vbucket, as it is already active.
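      The sequence described in this ticket can be replayed with a small state model. This is only a sketch under stated assumptions, not the indexer's implementation: the constant names mirror the states quoted above, while `vbInfo`, its handler methods, and the selection rule in `pickedForShutdown` are hypothetical simplifications derived from this description.

```go
package main

import "fmt"

type vbState string
type repairState string

// Names mirror the states quoted in this ticket.
const (
	VBS_STREAM_BEGIN vbState = "VBS_STREAM_BEGIN"
	VBS_CONN_ERROR   vbState = "VBS_CONN_ERROR"

	REPAIR_NONE        repairState = "REPAIR_NONE"
	REPAIR_RESTART_VB  repairState = "REPAIR_RESTART_VB"
	REPAIR_SHUTDOWN_VB repairState = "REPAIR_SHUTDOWN_VB"
)

// vbInfo is a hypothetical per-vbucket record.
type vbInfo struct {
	state  vbState
	repair repairState
}

// onConnError models connection error handling.
func (v *vbInfo) onConnError() {
	v.state = VBS_CONN_ERROR
	v.repair = REPAIR_RESTART_VB
}

// onStreamBegin models a StreamBegin resetting the vbucket.
func (v *vbInfo) onStreamBegin() {
	v.state = VBS_STREAM_BEGIN
	v.repair = REPAIR_NONE
}

// onShutdownDone models the in-flight repair continuing after the
// kv_sender response. Assumption: it only touches the repair state.
func (v *vbInfo) onShutdownDone() {
	if v.state == VBS_CONN_ERROR {
		v.repair = REPAIR_SHUTDOWN_VB
	}
}

// pickedForShutdown models the next repair's selection rule as described
// above: vbuckets already in REPAIR_SHUTDOWN_VB are skipped.
func (v *vbInfo) pickedForShutdown() bool {
	return v.state == VBS_CONN_ERROR && v.repair != REPAIR_SHUTDOWN_VB
}

// replay applies the events for vb 1006 in the order given in this ticket.
func replay() vbInfo {
	vb := vbInfo{}
	vb.onConnError()    // 06:33:34 failover, first repair starts
	vb.onStreamBegin()  // 06:33:36 StreamBegin arrives without a StreamEnd
	vb.onConnError()    // 06:34:04 dataport connection closes
	vb.onShutdownDone() // 06:34:36 old repair continues
	return vb
}

func main() {
	vb := replay()
	fmt.Println(vb.state, vb.repair, "picked for shutdown:", vb.pickedForShutdown())
}
```

      In this model the vbucket ends in VBS_CONN_ERROR with REPAIR_SHUTDOWN_VB and is not picked for shutdown by a later repair, which matches the stuck state described above.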

            People

              pavan.pb Pavan PB
              sai.teja Sai Krishna Teja