  1. Couchbase Server
  2. MB-51332

XDCR is OOM-killed by the kernel during failover and full recovery of a node followed by rebalance.


Details

    Description

      Steps:

      1. Step 1: Create a 3 node cluster
        2022-03-04 16:41:58,838 | test | INFO | pool-3-thread-26 | [task:check:474] Rebalance completed with progress: 100% in 15.0710000992 sec
      2. Step 1*: Create a 3 node XDCR remote cluster
        2022-03-04 16:42:38,523 | test | INFO | pool-3-thread-28 | [task:check:474] Rebalance completed with progress: 100% in 25.0929999352 sec
      3. Step 2: Create required buckets and collections.
      4. Step 2*: Create required buckets and collections on XDCR remote.
      5. Step 1: Create 10000000 items sequentially
      6. Step 2: Update 10000000 RandomKey keys to create 50 percent fragmentation
      7. Step 3: Create 10000000 items sequentially
      8. Step 4: Update 10000000 RandomKey keys to create 50 percent fragmentation
      9. Step 5: Rebalance in with Loading of docs
        2022-03-05 20:01:13,065 | test | INFO | pool-3-thread-22 | [task:check:474] Rebalance completed with progress: 100% in 19683.95 sec
      10. Step 6: Rebalance Out with Loading of docs
        2022-03-06 03:16:48,865 | test | INFO | pool-3-thread-19 | [task:check:474] Rebalance completed with progress: 100% in 26089.494 sec
      11. Step 7: Rebalance In_Out with Loading of docs
        2022-03-06 06:41:04,989 | test | INFO | pool-3-thread-18 | [task:check:474] Rebalance completed with progress: 100% in 12209.365 sec
      12. Step 8: Swap with Loading of docs
        2022-03-06 09:39:43,861 | test | INFO | pool-3-thread-21 | [task:check:474] Rebalance completed with progress: 100% in 10676.6900001 sec
      13. Step 9: Failover 2 nodes and RebalanceOut those nodes with loading in parallel
      14. Step 10: Rebalance in with Loading of docs
        2022-03-06 16:07:05,046 | test | INFO | pool-3-thread-25 | [task:check:474] Rebalance completed with progress: 100% in 14182.8969998 sec
      15. Step 11: Failover a node and FullRecovery that node
        2022-03-06 23:56:31,661 | test | INFO | pool-3-thread-26 | [task:check:474] Rebalance completed with progress: 100% in 27306.309 sec
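      To see which rebalance was in flight when a crash occurred, the completion lines quoted in the steps above can be turned into start/end windows. This is a minimal sketch, not part of the test framework: the function name `rebalance_window` is hypothetical, and it assumes the test log timestamps and the crash time are in the same time zone.

```python
import re
from datetime import datetime, timedelta

# Matches the test framework's completion lines as quoted in the steps above.
REBALANCE_RE = re.compile(
    r"(?P<end>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \|.*"
    r"Rebalance completed with progress: 100% in (?P<secs>\d+(?:\.\d+)?) sec"
)

def rebalance_window(line):
    """Return (start, end, duration) for a completion line, or None if it does not match."""
    m = REBALANCE_RE.search(line)
    if m is None:
        return None
    end = datetime.strptime(m.group("end"), "%Y-%m-%d %H:%M:%S,%f")
    duration = timedelta(seconds=float(m.group("secs")))
    # The log line is written at completion, so the start is end minus duration.
    return end - duration, end, duration

# Step 11 completion line, copied from the ticket.
line = ("2022-03-06 23:56:31,661 | test | INFO | pool-3-thread-26 | "
        "[task:check:474] Rebalance completed with progress: 100% in 27306.309 sec")
start, end, duration = rebalance_window(line)

# Crash time reported for 172.23.121.74 (assumed to share the log's time zone).
crash = datetime(2022, 3, 6, 19, 35, 13)
print(f"rebalance window: {start} -> {end}")
print("crash inside window:", start <= crash <= end)
```

      Under that time zone assumption, the goxdcr exit falls inside the Step 11 rebalance window, which is consistent with the ticket title.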

      XDCR crash is seen at 6 Mar 7:35 PM:

      172.23.121.74 at 7:35:13 PM 6 Mar, 2022

      Service 'goxdcr' exited with status 137. Restarting. Messages:
      2022-03-06T19:34:56.819-08:00 INFO GOXDCR.PipelineMgr: Replication Status = map[4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0:name={4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}]
      2022-03-06T19:34:57.279-08:00 INFO GOXDCR.TopoChangeDet: TopologyChangeDetectorSvc for pipeline 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 handleTargetTopologyChange completed
      2022-03-06T19:35:00.105-08:00 INFO GOXDCR.StatsMgr: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 total_docs=464055518, docs_processed=237476905, changes_left=226578613
      2022-03-06T19:35:00.941-08:00 WARN GOXDCR.ThrSeqTrackSvc: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0_ThroughSeqnoTracker GetThroughSeqnos completed after 737.439322ms
      2022-03-06T19:35:02.445-08:00 INFO GOXDCR.TopoChangeDet: TopologyChangeDetectorSvc for pipeline 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 handleTargetTopologyChange completed
      2022-03-06T19:35:08.124-08:00 INFO GOXDCR.StatsMgr: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 total_docs=464055518, docs_processed=238603165, changes_left=225452353
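      The two StatsMgr lines above are enough for a rough sanity check of the backlog just before the exit. The sketch below parses them, confirms that `changes_left` equals `total_docs - docs_processed`, and estimates throughput between the two samples. The helper name `parse_stats` is hypothetical, the regex only targets the line shape quoted here, and a rate derived from two samples is only a coarse estimate.

```python
import re
from datetime import datetime

# Targets only the StatsMgr line shape quoted in this ticket.
STATS_RE = re.compile(
    r"(?P<ts>\S+) INFO GOXDCR\.StatsMgr: \S+ "
    r"total_docs=(?P<total>\d+), docs_processed=(?P<done>\d+), changes_left=(?P<left>\d+)"
)

def parse_stats(line):
    """Return (timestamp, total_docs, docs_processed, changes_left) or None."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f%z")
    return ts, int(m.group("total")), int(m.group("done")), int(m.group("left"))

repl = "4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0"
lines = [
    f"2022-03-06T19:35:00.105-08:00 INFO GOXDCR.StatsMgr: {repl} "
    "total_docs=464055518, docs_processed=237476905, changes_left=226578613",
    f"2022-03-06T19:35:08.124-08:00 INFO GOXDCR.StatsMgr: {repl} "
    "total_docs=464055518, docs_processed=238603165, changes_left=225452353",
]
first, second = (parse_stats(l) for l in lines)

# Each sample should be internally consistent.
for ts, total, done, left in (first, second):
    assert left == total - done

elapsed = (second[0] - first[0]).total_seconds()
rate = (second[2] - first[2]) / elapsed  # docs processed per second
print(f"elapsed={elapsed:.3f}s rate={rate:,.0f} docs/s")
print(f"backlog drain estimate: {second[3] / rate / 60:.1f} min at this rate")
```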
      

      Logs from the source cluster where the crash was seen, collected on 7 Mar at 00:02 AM, are attached.
      Logs as of the current time are linked in the ticket.
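      For readers unfamiliar with the `status 137` reported for goxdcr above: by the common shell convention, an exit status above 128 means the process was terminated by signal `status - 128`, and 137 therefore corresponds to signal 9 (SIGKILL), which is what the Linux OOM killer sends. The sketch below decodes such a status. It assumes the reported status follows that convention, and the function name is hypothetical. SIGKILL alone does not prove an OOM kill, so the kernel log on the node remains the place to confirm it.

```python
import signal

def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (128 + N means 'killed by signal N')."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"

print(describe_exit_status(137))
```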

      QE Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job1.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.ClusterOpsVolume,nodes_init=3,graceful=True,skip_cleanup=True,num_items=10000000,num_buckets=1,bucket_names=GleamBook,doc_size=1024,bucket_type=membase,eviction_policy=fullEviction,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=5,pc=25,index_nodes=0,xdcr_collections=50,xdcr_remote_nodes=3,cbas_nodes=0,fts_nodes=0,ops_rate=80000,ramQuota=10240,doc_ops=create:update:delete:read,rebl_ops_rate=20000,key_type=RandomKey,vbuckets=1024,mutation_perc=30,replicas=2 -m rest'
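      The `-t` argument above packs the test name and all parameters into one comma-separated string, which is hard to read and contains a repeated key (`skip_cleanup`). A small sketch for splitting such a string into a dictionary and flagging repeats follows. The helper `parse_test_spec` is hypothetical and is not part of testrunner, the sample string is an abbreviated subset of the command above, and how the framework itself resolves repeated keys is not something this sketch claims to know.

```python
def parse_test_spec(spec):
    """Split 'test.name,k1=v1,k2=v2,...' into (name, params, duplicate_keys)."""
    name, *pairs = spec.split(",")
    params, duplicates = {}, []
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key in params:
            duplicates.append(key)
        params[key] = value  # in this sketch the last occurrence wins
    return name, params, duplicates

# Abbreviated subset of the -t argument from the command above.
spec = ("aGoodDoctor.Hospital.Murphy.ClusterOpsVolume,nodes_init=3,graceful=True,"
        "skip_cleanup=True,num_items=10000000,doc_size=1024,rerun=False,"
        "skip_cleanup=True,doc_ops=create:update:delete:read,replicas=2")

name, params, duplicates = parse_test_spec(spec)
print(name)
for key in sorted(params):
    print(f"  {key} = {params[key]}")
print("repeated keys:", duplicates)
```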
      

      Attachments

        1. 172.23.106.233_0315_pdf.pdf
          326 kB
        2. 175Mem.log
          310 kB
        3. goxdcr_rss.png
          78 kB
        4. MoreTargetThanSource_1G.png
          97 kB
        5. MoreTargetThanSource_6G.png
          77 kB
        6. Node175_1G.pdf
          305 kB
        7. Node175_1G-6G_growth.pdf
          287 kB
        8. Node175_6G.pdf
          293 kB
        9. Node175_CPU_utilization.png
          216 kB
        10. Node175_Proc_utilization.png
          277 kB
        11. node175_rss.png
          166 kB
        12. profile_233_grow.pdf
          369 kB
        13. profile_236_growth.pdf
          307 kB
        14. Screen Shot 2022-03-07 at 11.24.27 AM.png
          186 kB
        15. Screen Shot 2022-03-07 at 11.24.37 AM.png
          200 kB
        16. Screen Shot 2022-03-22 at 9.45.27 PM.png
          215 kB
        17. wrappedMCRequest.pdf
          332 kB


            People

              ritesh.agarwal Ritesh Agarwal
