Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-53162

RollbackAll to zero seen in the indexer logs

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 6.6.2
    • None
    • secondary-index
    • Untriaged
    • 1
    • Unknown

    Description

      Steps to reproduce:

      1. Create 4 buckets 
      2. Create indexes with replicas on each of the 4 buckets.
      3. Run pillowfight to continuously load data ((buckets have 1M, 1M , 1M and 3M items). The bucket RR needs to be under 10%. Load until then
      4. Run a shell script that runs the request_plus scans continuously.
      5. Run stress-ng with the params:

        stress-ng --vm 4 --vm-bytes 1G --metrics-brief --vm-keep --vm-locked -m 4 --aggressive --vm-populate
        

        (Adjust the --vm-bytes param depending upon the VM resources)

      6. Once you run enough stress-ng processes, OOM kill will kick in. This can be verified by checking the dmesg ( dmesg -T | egrep -i 'killed process' )
      7. There's a possibility that stress-ng gets spawned and killed since OOM kill is determined by a oom_score_adj factor. In order to make sure that memcached gets killed run this 

      echo 1000 > /proc/<memcached PID>/oom_score_adj  

      Rollbacks to zero seen

      72.23.100.19 : index
      2022-07-28T06:01:49.498-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test4
      2022-07-28T06:01:52.344-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test3 

      cbcollect logs -> 

       

          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659015082/collectinfo-2022-07-28T133123-ns_1%40172.23.100.15.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659015082/collectinfo-2022-07-28T133123-ns_1%40172.23.100.16.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659015082/collectinfo-2022-07-28T133123-ns_1%40172.23.100.17.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659015082/collectinfo-2022-07-28T133123-ns_1%40172.23.100.19.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659015082/collectinfo-2022-07-28T133123-ns_1%40172.23.100.22.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659015082/collectinfo-2022-07-28T133123-ns_1%40172.23.121.215.zip 

      The cluster wasn't in a healthy state as one of the nodes had very high memory consumption and a rebalance did not work. Please look around the timestamp 2022-07-28T06:01:49.498-07:00 (this is the only occurrence so shouldn't be confusing).

       

      Not clear if it helps but some of the older logs are here ->

       

           url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659012574/collectinfo-2022-07-28T124935-ns_1%40172.23.100.15.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659012574/collectinfo-2022-07-28T124935-ns_1%40172.23.100.16.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659012574/collectinfo-2022-07-28T124935-ns_1%40172.23.100.17.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659012574/collectinfo-2022-07-28T124935-ns_1%40172.23.100.19.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659012574/collectinfo-2022-07-28T124935-ns_1%40172.23.100.22.zip
               url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1659012574/collectinfo-2022-07-28T124935-ns_1%40172.23.121.215.zip 

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            varun.velamuri Varun Velamuri added a comment - - edited

            The rollbacks seem to be due to vbucket rolling back to zero i.e. branching at "0"

            2022-08-01T06:19:47.989-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket test2 vbucket 173. Rollback (0,     3e937bf01afb).
            2022-08-01T06:19:47.989-07:00 [Info] Timekeeper::handleStreamBegin start repairStream. StreamId MAINT_STREAM MutationMeta Bucket: test2 Vbucket: 173 Vbuuid: 68803160447739 Seqno: 0 FirstSnap: false

            For this vbucket, below is the failover table:

             vb_173:0:id:                         50377784113261
             vb_173:0:seq:                        0
             vb_173:1:id:                         68803160447739
             vb_173:1:seq:                        0
             vb_173:num_entries:                  2
             vb_173:num_erroneous_entries_erased: 0

            Pavan PB - You can grep for the string "Need older than last used snapshot" in indexer log. If any rollbacks hit this string, that might be related to this issue. Otherwise, you can open a new ticket - so that it becomes easy to track.

            varun.velamuri Varun Velamuri added a comment - - edited The rollbacks seem to be due to vbucket rolling back to zero i.e. branching at "0" 2022-08-01T06:19:47.989-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket test2 vbucket 173. Rollback (0,     3e937bf01afb). 2022-08-01T06:19:47.989-07:00 [Info] Timekeeper::handleStreamBegin start repairStream. StreamId MAINT_STREAM MutationMeta Bucket: test2 Vbucket: 173 Vbuuid: 68803160447739 Seqno: 0 FirstSnap: false For this vbucket, below is the failover table:  vb_173:0:id:                         50377784113261  vb_173:0:seq:                        0  vb_173:1:id:                         68803160447739  vb_173:1:seq:                        0  vb_173:num_entries:                  2  vb_173:num_erroneous_entries_erased: 0 Pavan PB - You can grep for the string " Need older than last used snapshot " in indexer log. If any rollbacks hit this string, that might be related to this issue. Otherwise, you can open a new ticket - so that it becomes easy to track.
            pavan.pb Pavan PB added a comment -

            Ticket filed for the "Need older than last used snapshot" case -> https://issues.couchbase.com/browse/MB-53221 cc Balakumaran Gopal Ritam Sharma Amit Kulkarni Varun Velamuri 

            pavan.pb Pavan PB added a comment - Ticket filed for the "Need older than last used snapshot" case -> https://issues.couchbase.com/browse/MB-53221 cc Balakumaran Gopal Ritam Sharma Amit Kulkarni Varun Velamuri  

            From the comment at: https://issues.couchbase.com/browse/MB-53162?focusedCommentId=617397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-617397, the issue is similar to MB-53183. Moreover, I can see the validateRestartTsVbuuid messages just before rollback.

            2022-07-28T06:01:07.852-07:00 [Info] StorageMgr::validateRestartTsVbuuid Updating Bucket test3 Vb 333 Seqno 14856 Vbuuid From 104201562900145 To 26768062637320. Flog [[156336626426308 15026] [26768062637320 14856] [238177069188272 8288] [164510306017143 0]]
            2022-07-28T06:01:07.852-07:00 [Info] StorageMgr::validateRestartTsVbuuid Updating Bucket test3 Vb 337 Seqno 14131 Vbuuid From 195645726253669 To 256422641823396. Flog [[26454714488855 14295] [256422641823396 14131] [59761043442211 7585] [46521075569218 0]]

             So, MB-53221 is also in effect. Closing this as a duplicate of MB-53183 & MB-53221.

            varun.velamuri Varun Velamuri added a comment - From the comment at: https://issues.couchbase.com/browse/MB-53162?focusedCommentId=617397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-617397 , the issue is similar to MB-53183 . Moreover, I can see the validateRestartTsVbuuid messages just before rollback. 2022-07-28T06:01:07.852-07:00 [Info] StorageMgr::validateRestartTsVbuuid Updating Bucket test3 Vb 333 Seqno 14856 Vbuuid From 104201562900145 To 26768062637320. Flog [[156336626426308 15026] [26768062637320 14856] [238177069188272 8288] [164510306017143 0]] 2022-07-28T06:01:07.852-07:00 [Info] StorageMgr::validateRestartTsVbuuid Updating Bucket test3 Vb 337 Seqno 14131 Vbuuid From 195645726253669 To 256422641823396. Flog [[26454714488855 14295] [256422641823396 14131] [59761043442211 7585] [46521075569218 0]]  So, MB-53221 is also in effect. Closing this as a duplicate of MB-53183 & MB-53221 .

            Duplicate of MB-53183 & MB-53221 

            varun.velamuri Varun Velamuri added a comment - Duplicate of MB-53183 & MB-53221  
            pavan.pb Pavan PB added a comment -

            This is a duplicate of MB-53221. Closing the ticket.

            pavan.pb Pavan PB added a comment - This is a duplicate of MB-53221 . Closing the ticket.

            People

              pavan.pb Pavan PB
              pavan.pb Pavan PB
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty