
MB-46005: [Windows] Index rebalance stuck for > 5 hrs (MAINT_STREAM In Repair)


Details

    Description

      Build: 7.0.0-5060

      Scenario:

      Adding 2 nodes (1 kv and 1 n1ql+index) into the cluster.

      The cluster has FTS, 2i, eventing and cbas services running.

      Observed that the index rebalance step was stuck for more than 5 hrs.

      Rebalance Operation Id = cb4133936ca1cf441dfb7e7e80b08fd8

      +----------------+---------------+-----------------------+----------------+--------------+
      | Nodes          | Services      | Version               | CPU (%)        | Status       |
      +----------------+---------------+-----------------------+----------------+--------------+
      | 172.23.136.114 | index, n1ql   | 7.0.0-5017-enterprise | 16.2759689664  | Cluster node |
      | 172.23.136.106 | kv            | 7.0.0-5017-enterprise | 96.3590072138  | Cluster node |
      | 172.23.136.107 | kv            | 7.0.0-5017-enterprise | 95.904451103   | Cluster node |
      | 172.23.136.108 | index, n1ql   | 7.0.0-5017-enterprise | 6.25161456988  | Cluster node |
      | 172.23.136.115 | backup        | 7.0.0-5017-enterprise | 0.247504125069 | Cluster node |
      | 172.23.136.113 | eventing, fts | 7.0.0-5017-enterprise | 48.0959898431  | Cluster node |
      | 172.23.136.110 | kv            | 7.0.0-5017-enterprise | 82.6015216793  | Cluster node |
      | 172.23.136.105 | kv            | 7.0.0-5017-enterprise | 97.6586680867  | Cluster node |
      | 172.23.136.112 | kv            |                       |                | <--- IN ---  |
      | 172.23.138.127 | n1ql, index   |                       |                | <--- IN ---  |
      +----------------+---------------+-----------------------+----------------+--------------+
      

       


          Activity

            varun.velamuri Varun Velamuri added a comment

            The other issue is with regard to how indexer manages repair.

            a. For vb:929, Indexer got more than one StreamBegin. So, it marked this vb for connection error

            2021-04-29T00:36:15.922-07:00 [Info] Timekeeper::handleStreamBegin Owner count > 1. Treat as CONN_ERR. StreamId MAINT_STREAM MutationMeta KeyspaceId: travel-sample Vbucket: 929 Vbuuid: 42283421376440 Seqno: 32167 FirstSnap: false
            2021-04-29T00:36:15.922-07:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId travel-sample vb 929
            2021-04-29T00:36:15.922-07:00 [Info] Timekeeper::handleStreamConnError RepairStream due to ConnError. StreamId MAINT_STREAM KeyspaceId travel-sample VbList [929]

            b. For this, indexer sent a message with ConnErrVbs containing vb:929. This led to ShutdownVbuckets being called for vb:929.

            2021-04-29T00:36:16.925-07:00 [Info] KVSender::sendRestartVbuckets Projector 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample travel-sample
            2021-04-29T00:36:16.925-07:00 [Info] KVSender::sendRestartVbuckets ShutdownVbuckets 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample travel-sample ConnErrVbs [929]
            2021-04-29T00:36:16.925-07:00 [Info] KVSender::sendRestartVbuckets ShutdownVbuckets Projector 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample travel-sample
             ShutdownTs bucket: travel-sample, scope :, collectionIDs: [], vbuckets: 1 -
             {vbno, vbuuid, manifest, seqno, snapshot-start, snapshot-end}
             { 929 2674e04f9fb8 0 32203 32203 32203}
            2021-04-29T00:36:16.976-07:00 [Error] KVSender::sendRestartVbuckets Unexpected Error During ShutdownVbuckets Request for Projector 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample. Err feed.feeder.

            c. Shutdown failed due to the error mentioned in the above comment, i.e. two DCP streams started to use the same UUID.

            d. As the restart was successful, timekeeper updated the state of the vbucket to SHUTDOWN_VB so that it can clear its state when a StreamBegin arrives.

            2021-04-29T00:37:17.050-07:00 [Info] StreamState::set repair state to SHUTDOWN_VB for MAINT_STREAM keyspaceId travel-sample vb 929

            e. StreamBegin never comes, as the projector thinks nothing is wrong with the VB (since shutdown failed). As the state is updated to REPAIR_SHUTDOWN_VB, indexer does not attempt another shutdown.

            As no shutdown is attempted for this VB, the state remains corrupt and the timekeeper never comes out of the restart loop.
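
            The wedge can be modelled as a small state machine. The sketch below is illustrative only (the type and function names are assumptions, not the actual indexer code): the repair state advances from RESTART_VB to SHUTDOWN_VB as soon as the ShutdownVbuckets request is sent, even when that request fails, so the repair loop never retries the shutdown and waits indefinitely for a StreamBegin that the projector will not send.

            // Illustrative sketch only - not the actual indexer code.
            package main

            import "fmt"

            type repairState int

            const (
                repairNone       repairState = iota
                repairRestartVB              // set on connection error, shutdown still pending
                repairShutdownVB             // shutdown request sent, waiting for StreamBegin
            )

            type vbRepair struct {
                state repairState
            }

            // onConnError models Timekeeper::handleStreamConnError: mark the vb for repair.
            func (v *vbRepair) onConnError() { v.state = repairRestartVB }

            // onShutdownSent models the transition after KVSender::sendRestartVbuckets.
            // The state moves to SHUTDOWN_VB whether or not the shutdown request actually
            // succeeded on the projector - this is the problematic step.
            func (v *vbRepair) onShutdownSent(shutdownErr error) {
                v.state = repairShutdownVB
                if shutdownErr != nil {
                    fmt.Println("shutdown failed, but repair state is still SHUTDOWN_VB:", shutdownErr)
                }
            }

            // needsShutdown is the check the repair loop makes on every iteration:
            // once the vb is in SHUTDOWN_VB, no new shutdown is attempted.
            func (v *vbRepair) needsShutdown() bool { return v.state == repairRestartVB }

            func main() {
                vb929 := &vbRepair{}
                vb929.onConnError()                             // Owner count > 1 -> CONN_ERR
                vb929.onShutdownSent(fmt.Errorf("feed.feeder")) // ShutdownVbuckets failed
                fmt.Println("will retry shutdown?", vb929.needsShutdown()) // false -> stuck
            }

            Running this prints "will retry shutdown? false", which mirrors the stuck MAINT_STREAM repair seen in the logs: the vbucket sits in SHUTDOWN_VB with nothing left to move it forward.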

            varun.velamuri Varun Velamuri added a comment - edited

            a. Actually, indexer has the logic to retry MTR if it does not get a StreamBegin in 30 minutes while the repair state of the vbucket stays in REPAIR_SHUTDOWN_VB. But indexer was asked to repairWithMTR even before these 30 minutes elapsed.

            2021-04-29T00:57:43.835-07:00 [Info] timekeeper::repairStreamWithMTR - set repair state to REPAIR_MTR for MAINT_STREAM keyspaceId travel-sample vb 929
            2021-04-29T00:57:43.835-07:00 [Info] Indexer::handleKVStreamRepair Initiate Stream Repair MAINT_STREAM KeyspaceId travel-sample StreamId 1

            b. Indexer started the keyspaceId stream without shutting down the old stream.

            2021-04-29T00:57:43.835-07:00 [Info] Indexer::startKeyspaceIdStream Stream: MAINT_STREAM KeyspaceId: travel-sample SessionId 1 RestartTS bucket: travel-sample, scopeId: , collectionId: , vbuckets: 1024 Crc64: 1107319255133256699 snapType INMEM_SNAP -
            ...
            2021-04-29T00:57:43.836-07:00 [Info] Timekeeper::handleStreamOpen MAINT_STREAM travel-sample Status STREAM_ACTIVE. Nothing to do.
            2021-04-29T00:57:43.836-07:00 [Info] MutationMgr::handleOpenStream
             Message: MsgStreamUpdate
             Type: OPEN_STREAM
             Stream: MAINT_STREAM
             KeyspaceId: travel-sample
             BuildTS: []
            

            c. Ideally, the mutation manager would have first shut down the stream and re-opened it. When the projector receives this MTR, it does nothing as it thinks all vbuckets are active.

            2021-04-29T00:57:44.514-07:00 [Info] FEED[<=>MAINT_STREAM_TOPIC_6b10458358738de8de82f6fb90235715(127.0.0.1:8091)] ##35 start-timestamp bucket: travel-sample, scope :, collectionIDs: [], vbuckets: 0 -
             {vbno, vbuuid, manifest, seqno, snapshot-start, snapshot-end}
            2021-04-29T00:57:44.523-07:00 [Info] FEED[<=>MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b(127.0.0.1:8091)] ##36 endpoint "172.23.136.107:9105" active ...
            

             
            I think a proper fix for this would be for the indexer to close the existing stream and open a new stream on retry (sketched below).
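
            A minimal sketch of that suggestion (the feed type and function names below are hypothetical, not the indexer's or projector's actual API): on an MTR retry, close the existing feed first and only then issue the new open, so the projector rebuilds every vbucket from the supplied RestartTS instead of treating the request as a no-op because the stream is already ACTIVE.

            // Hypothetical sketch of the suggested fix - not the actual indexer API.
            package main

            import (
                "errors"
                "fmt"
            )

            // feed stands in for the projector-side topic; a real implementation would
            // carry the restart timestamp and per-vbucket state.
            type feed struct {
                topic  string
                active bool
            }

            func (f *feed) close() error {
                if !f.active {
                    return errors.New("feed not active")
                }
                f.active = false
                return nil
            }

            func (f *feed) open() error {
                f.active = true
                return nil
            }

            // repairWithMTR closes the stale feed before reopening it. A close failure is
            // tolerated: the goal is only to guarantee that the subsequent open is treated
            // as a fresh MTR rather than ignored.
            func repairWithMTR(f *feed) error {
                if err := f.close(); err != nil {
                    fmt.Println("close failed (continuing):", err)
                }
                return f.open()
            }

            func main() {
                f := &feed{topic: "MAINT_STREAM_TOPIC", active: true}
                if err := repairWithMTR(f); err != nil {
                    fmt.Println("repair failed:", err)
                }
                fmt.Println("feed active after repair:", f.active)
            }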


            varun.velamuri Varun Velamuri added a comment

            Filed MB-46251 for robust handling of errors in the shutdownVbuckets code path for 7.0.1. The patch http://review.couchbase.org/c/indexing/+/153091 should fix the current issue.


            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.0.0-5148 contains indexing commit 3303da3 with commit message:
            MB-46005 Use UUID when building feed name for getting failover logs
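
            For illustration, the idea the commit message describes is to make the DCP feed name used for failover-log requests unique per request by embedding a UUID, so that concurrent requests cannot collide on the same feed name. The helper below is a hypothetical sketch, not the actual patch.

            // Hypothetical sketch of a per-request feed name - not the actual patch.
            package main

            import (
                "crypto/rand"
                "encoding/hex"
                "fmt"
            )

            // newUUID returns a random 16-byte hex string; any unique token would do.
            func newUUID() string {
                b := make([]byte, 16)
                rand.Read(b)
                return hex.EncodeToString(b)
            }

            // failoverLogFeedName builds a feed name that is unique per call, instead of a
            // fixed name that two concurrent failover-log requests could both end up using.
            func failoverLogFeedName(topic, bucket string) string {
                return fmt.Sprintf("%s-%s-failoverlog-%s", topic, bucket, newUUID())
            }

            func main() {
                // Two calls for the same topic and bucket now yield different feed names.
                fmt.Println(failoverLogFeedName("MAINT_STREAM_TOPIC", "travel-sample"))
                fmt.Println(failoverLogFeedName("MAINT_STREAM_TOPIC", "travel-sample"))
            }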


            ashwin.govindarajulu Ashwin Govindarajulu added a comment

            Closing based on the run on Enterprise Edition 7.0.0 build 5238

