Couchbase Server - MB-53115

Index rollbackAllToZero messages seen


Details

    • Triage: Untriaged

    Description

      Steps to reproduce:

      1. Create 4 buckets.
      2. Create indexes with replicas on each of the 4 buckets.
      3. Run pillowfight to continuously load data (the buckets have 1M, 1M, 1M, and 3M items). Keep loading until the bucket resident ratio (RR) is under 10%.
      4. Run a shell script that issues request_plus scans continuously.
      5. Run stress-ng with these parameters: stress-ng --vm 4 --vm-bytes 1G --metrics-brief --vm-keep --vm-locked -m 4 --aggressive --vm-populate. (Adjust --vm-bytes to the VM's resources.)
      6. Once enough stress-ng processes are running, the OOM killer kicks in. This can be verified in dmesg (dmesg -T | egrep -i 'killed process').
      7. stress-ng itself may get spawned and killed, since the OOM killer picks its victim based on an oom_score_adj factor. To make sure memcached gets killed, run:

      echo 1000 > /proc/<memcached PID>/oom_score_adj

      8. Observe that the scans time out and that the index has rolled back to 0.
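
      The scan driver from step 4 is not attached; a minimal sketch of such a loop, issuing back-to-back request_plus scans through the query REST API, could look like the following. The query node, credentials, and statement are placeholders for this environment:

```shell
# Hypothetical request_plus scan loop (placeholder host/credentials).
# Written to a file and syntax-checked rather than executed, since it
# needs a live cluster and runs forever.
cat > scan_loop.sh <<'EOF'
#!/bin/sh
QUERY="http://<query-node>:8093/query/service"
while true; do
  curl -s -u Administrator:password "$QUERY" \
    --data-urlencode 'statement=SELECT META().id FROM `test8` LIMIT 10' \
    --data-urlencode 'scan_consistency=request_plus' > /dev/null
done
EOF
sh -n scan_loop.sh && echo "syntax ok"
```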

       

      Logs are attached. Two instances of rollback were observed; please use the timestamps below for analysis.

      Instance 1

       

      Index node1 (172.23.106.159)
      2022-07-26T03:26:54.738-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8

      Index node2 (172.23.106.163)
      2022-07-26T03:26:58.186-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8

      Instance 2

      Index node1 (172.23.106.159)
      2022-07-26T05:06:12.658-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8

      Index node2 (172.23.106.163)
      2022-07-26T05:06:10.805-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
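
      To locate these events in the indexer logs from the bundles, a simple grep suffices; the self-contained excerpt below reuses the Instance 1 lines:

```shell
# Filter rollback-to-zero events out of an indexer log.
# Sample lines copied from Instance 1 above.
cat > indexer_excerpt.log <<'EOF'
2022-07-26T03:26:54.738-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
2022-07-26T03:26:58.186-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
EOF
grep -c 'StorageMgr::rollbackAllToZero' indexer_excerpt.log   # prints 2
```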

      Log bundles:

      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.105.36.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.105.37.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.156.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.159.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.163.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.204.zip


      I have the packet capture files too and can attach them if necessary.

      Attachments


        Activity

          pavan.pb Pavan PB created issue -
          pavan.pb Pavan PB made changes -
          Field Original Value New Value
          pavan.pb Pavan PB made changes -
          drigby Dave Rigby made changes -
          Assignee Dave Rigby [ drigby ] Varun Velamuri [ varun.velamuri ]
          amit.kulkarni Amit Kulkarni made changes -
          Fix Version/s 7.1.2 [ 18414 ]
          dfinlay Dave Finlay made changes -
          Attachment screenshot-1.png [ 189411 ]
          drigby Dave Rigby made changes -
          drigby Dave Rigby made changes -
          drigby Dave Rigby made changes -
          varun.velamuri Varun Velamuri made changes -
          Fix Version/s 6.6.6 [ 18032 ]
          Fix Version/s 7.1.2 [ 18414 ]
          varun.velamuri Varun Velamuri made changes -
          Resolution Not a Bug [ 10200 ]
          Status Open [ 1 ] Resolved [ 5 ]
          hemant.rajput Hemant Rajput made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

          People

            varun.velamuri Varun Velamuri (assignee)
            pavan.pb Pavan PB (reporter)
            Votes: 0
            Watchers: 11


              Gerrit Reviews

                There are no open Gerrit changes
