Couchbase Server / MB-47429

[BP 7.0.2 MB-43020]- Detect missing log file segment during initialization

Details

    • Untriaged
    • 3
    • No

    Description

      During initialization of multiFilelog, the start and end offsets are derived from the log segment file names. If any log file is missing, these offsets can be incorrect, so they must be verified. They must also be verified against the offsets recorded in the superblock.
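
      A minimal sketch of the kind of check described above, not the actual plasma change. It assumes, for illustration only, that the numeric field in a log.X.data name is a segment index, that all segments share one fixed size, and that the superblock records a head and a tail offset. The names segmentIndex, checkSegments, sbHead and sbTail are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// segmentIndex parses names such as "log.00000000000000.data".
// Assumption: the numeric field is a segment index, so segment i covers
// the byte range [i*segSize, (i+1)*segSize).
func segmentIndex(name string) (int64, bool) {
	if !strings.HasPrefix(name, "log.") || !strings.HasSuffix(name, ".data") {
		return 0, false
	}
	num := strings.TrimSuffix(strings.TrimPrefix(name, "log."), ".data")
	idx, err := strconv.ParseInt(num, 10, 64)
	if err != nil || idx < 0 {
		return 0, false
	}
	return idx, true
}

// checkSegments (hypothetical helper) derives the start and end offsets from
// the file names, rejects non-contiguous segments, and compares the derived
// range with the offsets recorded in the superblock (sbHead, sbTail).
func checkSegments(names []string, segSize, sbHead, sbTail int64) (start, end int64, err error) {
	var idxs []int64
	for _, n := range names {
		if i, ok := segmentIndex(n); ok {
			idxs = append(idxs, i)
		}
	}
	if len(idxs) == 0 {
		return 0, 0, fmt.Errorf("no log segment files found")
	}
	sort.Slice(idxs, func(a, b int) bool { return idxs[a] < idxs[b] })
	for k := 1; k < len(idxs); k++ {
		if idxs[k] != idxs[k-1]+1 {
			return 0, 0, fmt.Errorf("log segments are not contiguous: index %d is followed by %d",
				idxs[k-1], idxs[k])
		}
	}
	start = idxs[0] * segSize
	end = (idxs[len(idxs)-1] + 1) * segSize
	if sbHead > sbTail || sbHead < start || sbTail > end {
		return 0, 0, fmt.Errorf("superblock range [%d, %d) is not covered by segments [%d, %d)",
			sbHead, sbTail, start, end)
	}
	return start, end, nil
}

func main() {
	const segSize = 5 << 20 // 5 MB segments, an illustrative value
	// Segment 1 is missing, so the range implied by the names has a hole.
	names := []string{"header.data", "log.00000000000000.data", "log.00000000000002.data"}
	if _, _, err := checkSegments(names, segSize, 0, 12<<20); err != nil {
		fmt.Println("validation failed:", err)
	}
}
```

      With a check of this shape, a missing segment is reported as an error during initialization instead of producing offsets that are silently wrong.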

       

      If any of the log.X.data files are deleted, the indexer can enter a panic-restart loop. I can reproduce this issue locally with the following steps:

      • Set indexer.plasma.LSSSegmentFileSize to a small size (say, 5 MB).
      • Create an index and load enough documents to produce multiple log.X.data files.
      • Delete some of the log.X.data files and restart the indexer.

          Activity

            saptarshi.sen Saptarshi Sen added a comment - Wayne Siu, please approve this ticket for 7.0.1. There is a unit test that covers the fix.

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.1-5971 contains plasma commit 2ca0f18 with commit message: MB-47429 [BP] : Add check for validating log segments

            sanjit.chauhan Sanjit Chauhan added a comment - edited

            Srinath Duvuru, Saptarshi Sen:

            Tested the case below against couchbase-server-7.0.1-5971.

            Actions performed:

            a. Created a cluster with 2 index nodes, 1 KV + query node, and 1 KV node.

            b. Added documents.

            c. Added multiple GSIs.

            d. Deleted a log file for one of the indexers.

            e. Restarted the indexer. The indexer restarted smoothly. The index whose log file was deleted was also deleted. This was also discussed with Saptarshi Sen.

            f. Failed over one of the indexer nodes and performed a rebalance. The rebalance completed.

            Let me know if I need to cover any more workflows. Otherwise, I will close the JIRA.

            saptarshi.sen Saptarshi Sen added a comment -

            Hi Sanjit,

               1. For point e), it may be a good idea to capture the error message, similar to what we saw in the logs yesterday. We should have a FATAL error message from Plasma.

               2. If you have time, maybe we can try the following two cases:

               a) We can try deleting the log file for recovery as well. This should be under the recovery directory, as shown below. I would expect the behavior to be the same.

            │   │   └── shard5
            │   │       ├── config.json
            │   │       ├── data
            │   │       │   ├── header.data
            │   │       │   ├── log.00000000000000.data
            │   │       │   └── recovery <---------------------
            │   │       │       ├── header.data
            │   │       │       └── log.00000000000000.data <----------

               b) In case of fatal errors, if the index was created with 2 or more replicas, the indexer will try to rebuild the corrupted index. We can check whether the index gets rebuilt after the fatal error. (I am not sure whether the index is rebuilt on the same node or on a different node.)

            Thanks,

            Saptarshi

            sanjit.chauhan Sanjit Chauhan added a comment - edited

            Saptarshi Sen: Sure, I will test these workflows.

            I am closing this JIRA, as I have already validated the issue for which it was raised.

            If I find any issues while testing the workflows you suggested, I will raise a separate JIRA.

            People

              sanjit.chauhan Sanjit Chauhan
              srinath.duvuru Srinath Duvuru