Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.1.0
Affects Version/s: 7.1.0
Component/s: XDCR
Labels:
Environment: 7.1.0-2438
Triage: Untriaged
Link to Log File, atop/blg, CBCollectInfo, Core dump:

s3://cb-customers-secure/xdcr_dstn/2022-03-07/collectinfo-2022-03-07t180208-ns_1@172.23.106.238.zip
s3://cb-customers-secure/xdcr_dstn/2022-03-07/collectinfo-2022-03-07t180208-ns_1@172.23.106.250.zip
s3://cb-customers-secure/xdcr_dstn/2022-03-07/collectinfo-2022-03-07t180208-ns_1@172.23.106.251.zip
s3://cb-customers-secure/xdcr_src/2022-03-07/collectinfo-2022-03-07t180159-ns_1@172.23.105.175.zip
s3://cb-customers-secure/xdcr_src/2022-03-07/collectinfo-2022-03-07t180159-ns_1@172.23.106.236.zip
s3://cb-customers-secure/xdcr_src/2022-03-07/collectinfo-2022-03-07t180159-ns_1@172.23.121.74.zip

New Snapshot for xdcr_src → http://supportal.couchbase.com/snapshot/80d5ed2bdb12c30bdb16cc18d92f603f::0
New Snapshot for xdcr_dstn → http://supportal.couchbase.com/snapshot/4f89306bc90199f5b722458fb4c62d2b::0

Story Points: 1
Is this a Regression?: Unknown

Description

Steps:

Step 1: Create a 3 node cluster
2022-03-04 16:41:58,838 | test | INFO | pool-3-thread-26 | [task:check:474] Rebalance completed with progress: 100% in 15.0710000992 sec
Step 1*: Create a 3 node XDCR remote cluster
2022-03-04 16:42:38,523 | test | INFO | pool-3-thread-28 | [task:check:474] Rebalance completed with progress: 100% in 25.0929999352 sec
Step 2: Create required buckets and collections.
Step 2*: Create required buckets and collections on XDCR remote.
Step 1: Create 10000000 items sequentially
Step 2: Update 10000000 RandomKey keys to create 50 percent fragmentation
Step 3: Create 10000000 items sequentially
Step 4: Update 10000000 RandomKey keys to create 50 percent fragmentation
Step 5: Rebalance in with Loading of docs
2022-03-05 20:01:13,065 | test | INFO | pool-3-thread-22 | [task:check:474] Rebalance completed with progress: 100% in 19683.95 sec
Step 6: Rebalance Out with Loading of docs
2022-03-06 03:16:48,865 | test | INFO | pool-3-thread-19 | [task:check:474] Rebalance completed with progress: 100% in 26089.494 sec
Step 7: Rebalance In_Out with Loading of docs
2022-03-06 06:41:04,989 | test | INFO | pool-3-thread-18 | [task:check:474] Rebalance completed with progress: 100% in 12209.365 sec
Step 8: Swap with Loading of docs
2022-03-06 09:39:43,861 | test | INFO | pool-3-thread-21 | [task:check:474] Rebalance completed with progress: 100% in 10676.6900001 sec
Step 9: Failover 2 node and RebalanceOut that node with loading in parallel
Step 10: Rebalance in with Loading of docs
2022-03-06 16:07:05,046 | test | INFO | pool-3-thread-25 | [task:check:474] Rebalance completed with progress: 100% in 14182.8969998 sec
Step 11: Failover a node and FullRecovery that node
2022-03-06 23:56:31,661 | test | INFO | pool-3-thread-26 | [task:check:474] Rebalance completed with progress: 100% in 27306.309 sec
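
To line the rebalance timings up against the goxdcr restart, the completion lines above can be turned into start/end windows. This is a minimal sketch, not part of the test framework: `rebalance_windows` and `CRASH` are names made up for illustration, and it assumes the test-runner timestamps and the node's restart time (7:35:13 PM on 6 Mar) are in the same timezone.

```python
import re
from datetime import datetime, timedelta

# Rebalance completion lines copied from Steps 5 to 11 above.
REBALANCE_LINES = [
    "2022-03-05 20:01:13,065 | test | INFO | pool-3-thread-22 | [task:check:474] Rebalance completed with progress: 100% in 19683.95 sec",
    "2022-03-06 03:16:48,865 | test | INFO | pool-3-thread-19 | [task:check:474] Rebalance completed with progress: 100% in 26089.494 sec",
    "2022-03-06 06:41:04,989 | test | INFO | pool-3-thread-18 | [task:check:474] Rebalance completed with progress: 100% in 12209.365 sec",
    "2022-03-06 09:39:43,861 | test | INFO | pool-3-thread-21 | [task:check:474] Rebalance completed with progress: 100% in 10676.6900001 sec",
    "2022-03-06 16:07:05,046 | test | INFO | pool-3-thread-25 | [task:check:474] Rebalance completed with progress: 100% in 14182.8969998 sec",
    "2022-03-06 23:56:31,661 | test | INFO | pool-3-thread-26 | [task:check:474] Rebalance completed with progress: 100% in 27306.309 sec",
]

LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \|.*"
    r"Rebalance completed with progress: 100% in ([\d.]+) sec"
)


def rebalance_windows(lines):
    """Return (start, end, seconds) for each rebalance completion line."""
    windows = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        end = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        seconds = float(match.group(2))
        # The log line is written at completion, so start = end - duration.
        windows.append((end - timedelta(seconds=seconds), end, seconds))
    return windows


# Restart time reported for goxdcr on 172.23.121.74.
# Assumption: same timezone as the test-runner timestamps.
CRASH = datetime(2022, 3, 6, 19, 35, 13)

for start, end, seconds in rebalance_windows(REBALANCE_LINES):
    marker = "  <- goxdcr restart" if start <= CRASH <= end else ""
    print(f"{start:%d %b %H:%M} -> {end:%d %b %H:%M}  {seconds / 3600:5.2f} h{marker}")
```

Under that timezone assumption, only the Step 11 window (the rebalance after full recovery) contains the restart time.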

XDCR crash is seen at 6 Mar 7:35 PM:

172.23.121.74 at 7:35:13 PM 6 Mar, 2022
Service 'goxdcr' exited with status 137. Restarting. Messages:
2022-03-06T19:34:56.819-08:00 INFO GOXDCR.PipelineMgr: Replication Status = map[4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0:name={4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}]
2022-03-06T19:34:57.279-08:00 INFO GOXDCR.TopoChangeDet: TopologyChangeDetectorSvc for pipeline 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 handleTargetTopologyChange completed
2022-03-06T19:35:00.105-08:00 INFO GOXDCR.StatsMgr: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 total_docs=464055518, docs_processed=237476905, changes_left=226578613
2022-03-06T19:35:00.941-08:00 WARN GOXDCR.ThrSeqTrackSvc: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0_ThroughSeqnoTracker GetThroughSeqnos completed after 737.439322ms
2022-03-06T19:35:02.445-08:00 INFO GOXDCR.TopoChangeDet: TopologyChangeDetectorSvc for pipeline 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 handleTargetTopologyChange completed
2022-03-06T19:35:08.124-08:00 INFO GOXDCR.StatsMgr: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 total_docs=464055518, docs_processed=238603165, changes_left=225452353
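
The two StatsMgr lines above are enough for a rough estimate of how fast the pipeline was processing documents just before the restart. A minimal sketch follows; `parse_stats` and `docs_per_second` are hypothetical helpers written for this ticket, not goxdcr tooling, and the regular expression only covers the line layout shown in the excerpt.

```python
import re
from datetime import datetime

# The two StatsMgr samples from the log excerpt above.
STATS_LINES = [
    "2022-03-06T19:35:00.105-08:00 INFO GOXDCR.StatsMgr: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 total_docs=464055518, docs_processed=237476905, changes_left=226578613",
    "2022-03-06T19:35:08.124-08:00 INFO GOXDCR.StatsMgr: 4f89306bc90199f5b722458fb4c62d2b/GleamBookUsers0/GleamBookUsers0 total_docs=464055518, docs_processed=238603165, changes_left=225452353",
]

STATS_RE = re.compile(
    r"^(\S+) INFO GOXDCR\.StatsMgr: \S+ "
    r"total_docs=(\d+), docs_processed=(\d+), changes_left=(\d+)"
)


def parse_stats(line):
    """Return (timestamp, total_docs, docs_processed, changes_left) or None."""
    match = STATS_RE.match(line)
    if match is None:
        return None
    ts, total, processed, left = match.groups()
    return datetime.fromisoformat(ts), int(total), int(processed), int(left)


def docs_per_second(first, second):
    """Average docs_processed rate between two parsed samples."""
    elapsed = (second[0] - first[0]).total_seconds()
    return (second[2] - first[2]) / elapsed


samples = [parse_stats(line) for line in STATS_LINES]
rate = docs_per_second(samples[0], samples[1])
print(f"docs_processed rate: {rate:,.0f} docs/s")
print(f"changes_left at that rate: {samples[1][3] / rate / 60:.1f} min")
```

Two samples about eight seconds apart give only a snapshot, so the figure is indicative at best.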

Logs from the source cluster, where the crash was seen, were collected at 00:02 AM on 7 Mar and are attached.
Logs from the current time are linked in the ticket.

QE Test

guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job1.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.ClusterOpsVolume,nodes_init=3,graceful=True,skip_cleanup=True,num_items=10000000,num_buckets=1,bucket_names=GleamBook,doc_size=1024,bucket_type=membase,eviction_policy=fullEviction,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=5,pc=25,index_nodes=0,xdcr_collections=50,xdcr_remote_nodes=3,cbas_nodes=0,fts_nodes=0,ops_rate=80000,ramQuota=10240,doc_ops=create:update:delete:read,rebl_ops_rate=20000,key_type=RandomKey,vbuckets=1024,mutation_perc=30,replicas=2 -m rest'
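
The `-t` argument above packs the test name and every parameter into one comma-separated string, which is hard to read in this form. The sketch below splits it into a dictionary and reports keys that occur more than once. `parse_test_arg` is a hypothetical helper, and how the test runner itself resolves a repeated key is not known from this ticket.

```python
# The -t argument copied from the command above.
T_ARG = (
    "aGoodDoctor.Hospital.Murphy.ClusterOpsVolume,nodes_init=3,graceful=True,"
    "skip_cleanup=True,num_items=10000000,num_buckets=1,bucket_names=GleamBook,"
    "doc_size=1024,bucket_type=membase,eviction_policy=fullEviction,iterations=1,"
    "batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,"
    "rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,"
    "randomize_value=True,assert_crashes_on_load=True,num_collections=50,"
    "maxttl=10,num_indexes=5,pc=25,index_nodes=0,xdcr_collections=50,"
    "xdcr_remote_nodes=3,cbas_nodes=0,fts_nodes=0,ops_rate=80000,ramQuota=10240,"
    "doc_ops=create:update:delete:read,rebl_ops_rate=20000,key_type=RandomKey,"
    "vbuckets=1024,mutation_perc=30,replicas=2"
)


def parse_test_arg(arg):
    """Split a '-t' value into (test_name, params, repeated_keys)."""
    test_name, *pairs = arg.split(",")
    params = {}
    repeated = []
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key in params:
            repeated.append(key)
        params[key] = value  # last occurrence wins in this sketch
    return test_name, params, repeated


name, params, repeated = parse_test_arg(T_ARG)
print(name)
for key in ("num_items", "doc_size", "xdcr_collections", "xdcr_remote_nodes", "replicas"):
    print(f"  {key} = {params[key]}")
print("repeated keys:", repeated)
```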

Attachments

Name	Size	Date
172.23.106.233_0315_pdf.pdf	326 kB	15/Mar/22 8:04 PM
175Mem.log	310 kB	09/Mar/22 3:39 PM
goxdcr_rss.png	78 kB	22/Mar/22 3:25 PM
MoreTargetThanSource_1G.png	97 kB	21/Mar/22 3:58 PM
MoreTargetThanSource_6G.png	77 kB	21/Mar/22 3:58 PM
Node175_1G.pdf	305 kB	21/Mar/22 3:57 PM
Node175_1G-6G_growth.pdf	287 kB	21/Mar/22 3:57 PM
Node175_6G.pdf	293 kB	21/Mar/22 3:57 PM
Node175_CPU_utilization.png	216 kB	22/Mar/22 9:39 PM
Node175_Proc_utilization.png	277 kB	22/Mar/22 9:39 PM
node175_rss.png	166 kB	22/Mar/22 9:47 PM
profile_233_grow.pdf	369 kB	16/Mar/22 4:28 PM
profile_236_growth.pdf	307 kB	15/Mar/22 9:21 PM
Screen Shot 2022-03-07 at 11.24.27 AM.png	186 kB	07/Mar/22 11:35 AM
Screen Shot 2022-03-07 at 11.24.37 AM.png	200 kB	07/Mar/22 11:35 AM
Screen Shot 2022-03-22 at 9.45.27 PM.png	215 kB	22/Mar/22 9:47 PM
wrappedMCRequest.pdf	332 kB	11/Mar/22 11:01 AM

Issue Links

is duplicated by

MB-51384 Rebalance in of a node failed due to wait_seqno_persisted_failed.

Closed

Sub-Tasks

XDCR - memory usage investigation continuation

Closed

Neil Huang

Gerrit Reviews

For Gerrit Dashboard: MB-51332
#	Subject	Branch	Project	Status	CR	V
171929,3	MB-51332: persist backfill spec mapping regardless of VB change	master	goxdcr	Status: MERGED	+2	+1
172143,4	MB-51332: Xmem recycle objects when it is stopped or when target VB topology is changed	master	goxdcr	Status: MERGED	+2	+1
172467,2	MB-51332: add stats for getMeta counter	master	goxdcr	Status: MERGED	+2	+1

Activity

People

Assignee: Ritesh Agarwal

Reporter: Ritesh Agarwal

Votes: 0

Watchers: 4

Dates

Due: 25/Mar/22

Created: 07/Mar/22 10:44 AM

Updated: 24/Mar/22 1:51 PM

Resolved: 24/Mar/22 9:35 AM

XDCR is OOM-killed by the kernel during failover and full recovery of a node followed by rebalance.
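
The exit status 137 reported for goxdcr in the description fits this summary: by the common shell convention, a status above 128 means the process was terminated by signal number (status - 128), and 137 - 128 = 9 is SIGKILL, the signal the Linux OOM killer sends. The sketch below decodes such a status. `describe_exit_status` is a hypothetical helper, it assumes a POSIX host, and it assumes the service supervisor reports status using the 128 + signal convention.

```python
import signal


def describe_exit_status(status):
    """Describe an exit status, assuming the 128 + signal convention."""
    if status > 128:
        # Raises ValueError if the number is not a known signal on this host.
        sig = signal.Signals(status - 128)
        return f"terminated by {sig.name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

SIGKILL alone does not prove an OOM kill, since any process with sufficient privileges can send it; the kernel log on the node is the place to confirm it.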
