Description
A few MBs have been raised where XDCR appears to be stuck.
In these cases the XDCR logs cannot be opened successfully in vim because the logger is stuck printing a huge byte slice.
The signature is the following stack trace in the mprof file:
heap profile: 26846: 11892151624 [7378024: 3579106433776] @ heap/1048576
1: 3658186752 [1: 3658186752] @ 0x50542e 0x50557e 0x509e4e 0xa0b81d 0xa0b714 0x933cd6 0x933a8e 0x9da5d2 0x9da145 0x9c065f 0x9c0274 0xabb518 0xab84e5 0x471981
# 0x50542d log.(*Logger).Output+0x38d /home/couchbase/.cbdepscache/exploded/x86_64/go-1.15.8/go/src/log/log.go:177
# 0x50557d log.(*Logger).Printf+0x7d /home/couchbase/.cbdepscache/exploded/x86_64/go-1.15.8/go/src/log/log.go:188
# 0x509e4d github.com/couchbase/goxdcr/log.(*CommonLogger).logMsgf+0x12d /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/log/logger.go:170
# 0xa0b81c github.com/couchbase/goxdcr/log.(*CommonLogger).Warnf+0x27c /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/log/logger.go:189
# 0xa0b713 github.com/couchbase/goxdcr/metadata_svc.(*MetaKVMetadataSvc).set.func2+0x173 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/metakv_metadata_service.go:210
# 0x933cd5 github.com/couchbase/goxdcr/utils.(*Utilities).ExponentialBackoffExecutorWithOriginalError+0x75 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/utils/utils.go:2481
# 0x933a8d github.com/couchbase/goxdcr/utils.(*Utilities).ExponentialBackoffExecutor+0x8d /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/utils/utils.go:2469
# 0x9da5d1 github.com/couchbase/goxdcr/metadata_svc.(*MetaKVMetadataSvc).set+0x371 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/metakv_metadata_service.go:215
# 0x9da144 github.com/couchbase/goxdcr/metadata_svc.(*MetaKVMetadataSvc).Set+0xa4 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/metakv_metadata_service.go:165
# 0x9c065e github.com/couchbase/goxdcr/metadata_svc.(*BackfillReplicationService).setBackfillSpecUsingMarshalledData+0x17e /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/backfill_repl_service.go:505
# 0x9c0273 github.com/couchbase/goxdcr/metadata_svc.(*BackfillReplicationService).SetBackfillReplSpec+0x153 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/backfill_repl_service.go:490
# 0xabb517 github.com/couchbase/goxdcr/backfill_manager.(*BackfillRequestHandler).metaKvOp+0x57 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/backfill_manager/backfill_request_handler.go:682
# 0xab84e4 github.com/couchbase/goxdcr/backfill_manager.(*BackfillRequestHandler).run+0x1064 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/backfill_manager/backfill_request_handler.go:308
The head of the log message is just a bunch of byte data:
==============================================================================
couchbase logs (goxdcr.log)
cbbrowse_logs goxdcr.log
==============================================================================
108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 123 34 84 105 109 101 115 116 97 109 112 115 34 58 123 34 83 116 97 114 116 105 110 103 84 105 109 101 115 116 97 109 112 34 58 123 34 86 98 110
111 34 58 56 54 49 44 34 86 98 117 117 105 100 34 58 48 44 34 83 101 113 110 111 34 58 51 50 55 56 49 55 49 51 44 34 83 110 97 112 115 104 111 116 83 116 97 114 116 34 58 48 44 34 83 110 97 112 115 104 111 116 69 110 100 34 58 48 44 34 77 97 110 105 102 101 115 116 73 6
8 115 34 58 123 34 83 111 117 114 99 101 77 97 110 105 102 101 115 116 73 100 34 58 48 44 34 84 97 114 103 101 116 77 97 110 105 102
...
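Those numbers are the decimal values of the bytes of the marshalled backfill spec: the fragment above decodes to JSON such as {"Timestamps":{"StartingTimestamp":{"Vbno":861,"Vbuuid":0,"Seqno":32781713,... A minimal sketch for decoding such a dump offline (the truncated dump string below is hypothetical):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// dump is a (truncated, hypothetical) copy-paste of the decimal byte values from goxdcr.log.
	dump := "108 108 44 110 117 108 108 44 123 34 84 105 109 101 115 116 97 109 112 115 34 58"

	var sb strings.Builder
	for _, field := range strings.Fields(dump) {
		b, err := strconv.Atoi(field)
		if err != nil {
			panic(err)
		}
		sb.WriteByte(byte(b))
	}
	// Prints the readable JSON fragment, e.g. `ll,null,{"Timestamps":`
	fmt.Println(sb.String())
}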
The current theory is that in a very long-running test, where backfill replications keep getting passed around without being finished, the specs can merge and grow. The spec shouldn't have been able to grow unbounded, though, so some investigation is needed there.
In the meantime, the backfill spec grows unbounded and metakv.Set fails. In XDCR we then try to print out the value, which is probably not a great idea because it's not readable anyway... and this printing is what causes XDCR to appear "frozen".
metakvOpSetFunc := func() error {
	if sensitive {
		err = metakv.SetSensitive(getPathFromKey(key), value, rev)
	} else {
		err = metakv.Set(getPathFromKey(key), value, rev)
	}
	if err == metakv.ErrRevMismatch {
		err = service_def.ErrorRevisionMismatch
		return nil
	} else if err == nil {
		return nil
	} else {
		redactOnce()
		// On failure, the entire value (huge in this case) is printed to the log.
		meta_svc.logger.Warnf("metakv.Set failed. key=%v, value=%v, err=%v\n", key, valueToPrint, err)
		return err
	}
}
- We should see why the backfill merge is growing unbounded when the Accomodate function should have deduplicated the tasks. If that logic is sound, then it means there are simply too many fragmented segments.
- If the backfill task gets too large, we should give up on all the fragmented segments and just keep one gigantic backfill segment (seqno 0 to throughSeqno) to avoid this situation... we should probably check the size of the backfill replication spec as a metric to judge whether or not it's too big.
- MetakvSvc should not allow a Set call on data that's too big for a REST call, since it won't even reach metakv.
- The logger.Warn should not print the value but rather the size of the value (see the sketch after this list).
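A minimal sketch of the last two points, reusing the names from the excerpt above; maxMetakvValueSize is an assumed limit for illustration, not an actual metakv constant:

// Hypothetical guard: refuse oversized values before they reach metakv, and never log the raw value.
// Assumes "fmt" is imported; the 1MB cap is an assumption, not the real REST payload limit.
const maxMetakvValueSize = 1 * 1024 * 1024

metakvOpSetFunc := func() error {
	if len(value) > maxMetakvValueSize {
		// Fail fast: the REST call would not succeed anyway.
		return fmt.Errorf("refusing metakv.Set for key=%v: value size %v exceeds limit %v",
			key, len(value), maxMetakvValueSize)
	}
	if sensitive {
		err = metakv.SetSensitive(getPathFromKey(key), value, rev)
	} else {
		err = metakv.Set(getPathFromKey(key), value, rev)
	}
	if err == metakv.ErrRevMismatch {
		err = service_def.ErrorRevisionMismatch
		return nil
	} else if err == nil {
		return nil
	}
	// Log only the size of the value, not its contents.
	meta_svc.logger.Warnf("metakv.Set failed. key=%v, valueSize=%v, err=%v\n", key, len(value), err)
	return err
}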
Actually... as I'm typing this, I realize the .Accomodate function can indeed cause many fragments, and there's no "defragmentation" algorithm present.
When a backfill spec is already composed of two fragments:
1-5 and 7-10
and a new one comes in as 0-20,
we will get the existing:
1-5
7-10
plus the new ones:
0-1
5-7
10-20
So as backfill requests keep coming in (and in error scenarios where the backfill jobs aren't being serviced), there will be more and more fragments, leading to an ever-increasing backfill spec size... there should be some sort of defrag if it gets too big.
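A minimal sketch of such a defrag pass, treating each backfill task as a plain (start, end) seqno range and merging overlapping or touching fragments; this is a standalone illustration, not goxdcr's actual task types:

package main

import (
	"fmt"
	"sort"
)

// seqnoRange is a stand-in for a backfill task's timestamp range.
type seqnoRange struct {
	start, end uint64
}

// defrag merges overlapping or adjacent ranges so the task list stays small.
func defrag(tasks []seqnoRange) []seqnoRange {
	if len(tasks) == 0 {
		return tasks
	}
	sort.Slice(tasks, func(i, j int) bool { return tasks[i].start < tasks[j].start })
	merged := []seqnoRange{tasks[0]}
	for _, t := range tasks[1:] {
		last := &merged[len(merged)-1]
		if t.start <= last.end { // overlapping or touching: extend the last range
			if t.end > last.end {
				last.end = t.end
			}
		} else {
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	// The fragments from the example above: 1-5, 7-10 plus 0-1, 5-7, 10-20.
	fragments := []seqnoRange{{1, 5}, {7, 10}, {0, 1}, {5, 7}, {10, 20}}
	fmt.Println(defrag(fragments)) // [{0 20}]
}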
Attachments
Issue Links
causes:
- MB-49095 [magma, 10TB, 1%] XDCR replication is not moving while the src cluster is in idle state. Items in dstn are more than src. (Closed)
- MB-49196 XDCR - panic in toy build component test for combineTask (Closed)
- MB-49050 [System Test][XDCR] ~12M mutations remaining for almost a day - xdcr seems to be catching up slowly (Closed)
For Gerrit Dashboard: MB-49101

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
164179,2 | MB-49101: When merge an incoming backfill task, if it cannot merge with the first on the list, continue on to the rest of the list. | master | goxdcr | MERGED | +2 | +1
164185,3 | MB-49101: Add new task to the end of existing tasks if possible to ruduce task list growth. | master | goxdcr | MERGED | +2 | +1