Couchbase Server / MB-49101

XDCR - backfillSpec grows leading to set failure and stuck printing errMsg


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.1.0
    • 7.0.0, 7.0.1, 7.0.2, 7.1.0
    • XDCR
    • None
    • Untriaged
    • 1
    • Yes

    Description

      A few MBs have been raised in which XDCR appears to be stuck.

      In these cases the XDCR logs cannot be opened successfully in vim because XDCR is stuck printing a huge byte slice to them.

      The signature is the following stack trace in the mprof file:

      heap profile: 26846: 11892151624 [7378024: 3579106433776] @ heap/1048576
      1: 3658186752 [1: 3658186752] @ 0x50542e 0x50557e 0x509e4e 0xa0b81d 0xa0b714 0x933cd6 0x933a8e 0x9da5d2 0x9da145 0x9c065f 0x9c0274 0xabb518 0xab84e5 0x471981
      #       0x50542d        log.(*Logger).Output+0x38d                                                                                      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.15.8/go/src/log/log.go:177
      #       0x50557d        log.(*Logger).Printf+0x7d                                                                                       /home/couchbase/.cbdepscache/exploded/x86_64/go-1.15.8/go/src/log/log.go:188
      #       0x509e4d        github.com/couchbase/goxdcr/log.(*CommonLogger).logMsgf+0x12d                                                   /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/log/logger.go:170
      #       0xa0b81c        github.com/couchbase/goxdcr/log.(*CommonLogger).Warnf+0x27c                                                     /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/log/logger.go:189
      #       0xa0b713        github.com/couchbase/goxdcr/metadata_svc.(*MetaKVMetadataSvc).set.func2+0x173                                   /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/metakv_metadata_service.go:210
      #       0x933cd5        github.com/couchbase/goxdcr/utils.(*Utilities).ExponentialBackoffExecutorWithOriginalError+0x75                 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/utils/utils.go:2481
      #       0x933a8d        github.com/couchbase/goxdcr/utils.(*Utilities).ExponentialBackoffExecutor+0x8d                                  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/utils/utils.go:2469
      #       0x9da5d1        github.com/couchbase/goxdcr/metadata_svc.(*MetaKVMetadataSvc).set+0x371                                         /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/metakv_metadata_service.go:215
      #       0x9da144        github.com/couchbase/goxdcr/metadata_svc.(*MetaKVMetadataSvc).Set+0xa4                                          /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/metakv_metadata_service.go:165
      #       0x9c065e        github.com/couchbase/goxdcr/metadata_svc.(*BackfillReplicationService).setBackfillSpecUsingMarshalledData+0x17e /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/backfill_repl_service.go:505
      #       0x9c0273        github.com/couchbase/goxdcr/metadata_svc.(*BackfillReplicationService).SetBackfillReplSpec+0x153                /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/backfill_repl_service.go:490
      #       0xabb517        github.com/couchbase/goxdcr/backfill_manager.(*BackfillRequestHandler).metaKvOp+0x57                            /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/backfill_manager/backfill_request_handler.go:682
      #       0xab84e4        github.com/couchbase/goxdcr/backfill_manager.(*BackfillRequestHandler).run+0x1064                               /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/backfill_manager/backfill_request_handler.go:308
      

      The head of the log will contain nothing but byte data:

      ==============================================================================
      couchbase logs (goxdcr.log)
      cbbrowse_logs goxdcr.log
      ==============================================================================
      108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 110 117 108 108 44 123 34 84 105 109 101 115 116 97 109 112 115 34 58 123 34 83 116 97 114 116 105 110 103 84 105 109 101 115 116 97 109 112 34 58 123 34 86 98 110
      111 34 58 56 54 49 44 34 86 98 117 117 105 100 34 58 48 44 34 83 101 113 110 111 34 58 51 50 55 56 49 55 49 51 44 34 83 110 97 112 115 104 111 116 83 116 97 114 116 34 58 48 44 34 83 110 97 112 115 104 111 116 69 110 100 34 58 48 44 34 77 97 110 105 102 101 115 116 73 6
      8 115 34 58 123 34 83 111 117 114 99 101 77 97 110 105 102 101 115 116 73 100 34 58 48 44 34 84 97 114 103 101 116 77 97 110 105 102
      ...
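
      The values in this excerpt look like a []byte formatted as decimal numbers, so they can be turned back into text to confirm what is being printed. The sketch below is only an illustration: decodeByteDump is a hypothetical helper, not part of goxdcr. Note that the excerpt appears to be wrapped mid-value (one line ends in 6 and the next starts with 8), so wrapped lines would need to be rejoined before decoding.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeByteDump is a hypothetical helper. It converts a whitespace-separated
// list of decimal byte values back into the string those bytes represent.
func decodeByteDump(dump string) (string, error) {
	fields := strings.Fields(dump)
	out := make([]byte, 0, len(fields))
	for _, f := range fields {
		// Each field must fit in 8 bits, otherwise it is not a byte value.
		n, err := strconv.ParseUint(f, 10, 8)
		if err != nil {
			return "", fmt.Errorf("invalid byte value %q: %w", f, err)
		}
		out = append(out, byte(n))
	}
	return string(out), nil
}

func main() {
	// A short run of values copied from the first line of the excerpt.
	s, err := decodeByteDump("123 34 84 105 109 101 115 116 97 109 112 115 34 58")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(s) // prints {"Timestamps":
}
```

      Decoding a longer run shows JSON field names such as StartingTimestamp, Vbno and Seqno, which is consistent with the value being a marshalled backfill spec.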
      

      The current theory is that in a very long-running test, where backfill replications keep getting passed around without being finished, the specs can merge and grow. The spec should not have been able to grow unbounded, though, so that needs some investigation.

      In the meantime, the backfill spec grows unbounded and metakv.Set fails. On failure, XDCR tries to print the value, which is probably not a great idea because it is not readable anyway, and this printing is what causes XDCR to appear "frozen".

      	metakvOpSetFunc := func() error {
      		if sensitive {
      			err = metakv.SetSensitive(getPathFromKey(key), value, rev)
      		} else {
      			err = metakv.Set(getPathFromKey(key), value, rev)
      		}
      		if err == metakv.ErrRevMismatch {
      			err = service_def.ErrorRevisionMismatch
      			return nil
      		} else if err == nil {
      			return nil
      		} else {
      			redactOnce()
      			meta_svc.logger.Warnf("metakv.Set failed. key=%v, value=%v, err=%v\n", key, valueToPrint, err)
      			return err
      		}
      	}
      

      1. We should find out why the backfill merge grows unbounded when the Accomodate function should have deduplicated the ranges. If the logic is sound, then there are simply too many fragmented segments.
      2. If the backfill task gets too large, we should give up on all the fragmented segments and use one gigantic backfill segment (seqno 0 to throughSeqno) to avoid this situation. We should probably use the size of the backfill replication spec as the metric for judging whether it is too big.
      3. MetakvSvc should not allow a Set call on data that is too big for the REST call, as it won't even reach metakv.
      4. The logger.Warnf call should not print the value but rather the size of the value.

      Actually... as I'm typing this I realized that the Accomodate function can indeed cause many fragments, and there is no "defragmentation" algorithm present.
      When a backfill spec is already composed of two fragments:
      1-5 + 7-10
      and a new one comes in as 0-20
      We will get:
      1-5
      7-10
      + new ones:
      0-1
      5-7
      10-20

      So in a way, as backfill requests keep coming in (and in error scenarios where backfill jobs aren't being serviced), there will be more and more fragments, which increases the size of the backfill spec. There should be some sort of defrag if it gets too big.
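
      A defrag pass of the kind suggested here could be sketched as an interval merge, with the single-segment fallback from item 2 applied when too many fragments remain. Everything below is hypothetical and for illustration only: seqnoRange, coalesce and collapseIfTooMany are not goxdcr types or functions, and the sketch assumes that fragments sharing an endpoint (such as 0-1 and 1-5) may be joined.

```go
package main

import (
	"fmt"
	"sort"
)

// seqnoRange is a hypothetical stand-in for one backfill fragment.
type seqnoRange struct {
	start, end uint64
}

// coalesce sorts fragments by start and merges any that overlap or touch.
// The input slice is left unmodified.
func coalesce(frags []seqnoRange) []seqnoRange {
	if len(frags) == 0 {
		return nil
	}
	sorted := append([]seqnoRange(nil), frags...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })

	merged := []seqnoRange{sorted[0]}
	for _, f := range sorted[1:] {
		last := &merged[len(merged)-1]
		if f.start <= last.end {
			// Overlapping or touching: extend the current range if needed.
			if f.end > last.end {
				last.end = f.end
			}
			continue
		}
		merged = append(merged, f)
	}
	return merged
}

// collapseIfTooMany coalesces the fragments and, if more than maxFrags remain,
// falls back to one range from seqno 0 to the highest end seqno.
func collapseIfTooMany(frags []seqnoRange, maxFrags int) []seqnoRange {
	merged := coalesce(frags)
	if len(merged) <= maxFrags {
		return merged
	}
	return []seqnoRange{{start: 0, end: merged[len(merged)-1].end}}
}

func main() {
	// The five fragments from the example above.
	frags := []seqnoRange{{1, 5}, {7, 10}, {0, 1}, {5, 7}, {10, 20}}
	fmt.Println(coalesce(frags)) // prints [{0 20}]
}
```

      The fallback trades extra re-streamed data for a bounded spec size, which matches the intent of giving up on the fragments once the spec becomes too big.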

            People

              lilei.chen Lilei Chen (Inactive)
              neil.huang Neil Huang