Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-6662

XDC: worse performance with expired items

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
    • Environment:
      VMs, CentOS 6.2, 4-to-4 nodes, build 1723

      Description

      Unidir symptoms (expiration ratio: 3%, expiration time: 5 minutes):
      – extremely high XDC ops/sec and XDC gets/sec rate on destination side. At the same time actual replication rate (new items/sec) is very low.
      – XDC queue grows all the time on source side (unlike case w/o expired items)

      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        pavelpaulau Pavel Paulau created issue -
        Hide
        pavelpaulau Pavel Paulau added a comment -

        Ketaki,
        Please update/close your bug and assign this issue to Junyi.

        http://www.couchbase.com/issues/browse/MB-6563

        Show
        pavelpaulau Pavel Paulau added a comment - Ketaki, Please update/close your bug and assign this issue to Junyi. http://www.couchbase.com/issues/browse/MB-6563
        steve Steve Yen made changes -
        Field Original Value New Value
        Priority Major [ 3 ] Blocker [ 1 ]
        Hide
        pavelpaulau Pavel Paulau added a comment -

        jic: reports.

        Show
        pavelpaulau Pavel Paulau added a comment - jic: reports.
        Hide
        junyi Junyi Xie (Inactive) added a comment -

        Known issue. Duplicate of MB-6563.

        Show
        junyi Junyi Xie (Inactive) added a comment - Known issue. Duplicate of MB-6563 .
        junyi Junyi Xie (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Duplicate [ 3 ]
        Hide
        ketaki Ketaki Gangal added a comment -

        Closed MB-6563, using this bug to track expired items.

        Show
        ketaki Ketaki Gangal added a comment - Closed MB-6563 , using this bug to track expired items.
        ketaki Ketaki Gangal made changes -
        Resolution Duplicate [ 3 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Assignee Ketaki Gangal [ ketaki ] Junyi Xie [ junyi ]
        Hide
        pavelpaulau Pavel Paulau added a comment -

        Tried build 1723 w/o expired items and everything looks fine (at least as it was before).

        Show
        pavelpaulau Pavel Paulau added a comment - Tried build 1723 w/o expired items and everything looks fine (at least as it was before).
        pavelpaulau Pavel Paulau made changes -
        Description Unidir symptoms (3% of items expire every 5 minutes):
        -- extremely high XDC ops/sec and XDC gets/sec rate on destination side. At the same time actual replication rate (new items/sec) is very low.
        -- XDC queue grows all the time on source side (unlike case w/o expired items)
        Unidir symptoms (expiration ratio: 3%, expiration time: 5 minutes):
        -- extremely high XDC ops/sec and XDC gets/sec rate on destination side. At the same time actual replication rate (new items/sec) is very low.
        -- XDC queue grows all the time on source side (unlike case w/o expired items)
        junyi Junyi Xie (Inactive) made changes -
        Priority Blocker [ 1 ] Critical [ 2 ]
        junyi Junyi Xie (Inactive) made changes -
        Assignee Junyi Xie [ junyi ] Ketaki Gangal [ ketaki ]
        pavelpaulau Pavel Paulau made changes -
        Attachment debug.tar.gz [ 15055 ]
        Hide
        junyi Junyi Xie (Inactive) added a comment -

        OK, Pavel reproduced the issue. Now it looks like it is a bug in ep_engine incorrectly account the getMeta ops.

        I deleted replication on the source cluster at 19:29PM and XDCR activity is stopped. Look at logs at 10.2.2.190, there is no capi_replication trace after 19:29PM. However, on the UI it keeps showing very high rate of getMeta in XDC section.

        Please look at screen shot: Junyi-Pavel-test Screen Shot 2012-09-14 at 7.39.21 PM.png from one of destination 10.2.2.190

        By cbstats, you see the get_meta ops keep increasing (while del/set ops have no change) after I stopped the XDCR completely.

        Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta
        ep_num_ops_del_meta: 34329
        ep_num_ops_get_meta: 41910986
        ep_num_ops_set_meta: 14257849
        Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta
        ep_num_ops_del_meta: 34329
        ep_num_ops_get_meta: 44140162
        ep_num_ops_set_meta: 14257849
        Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta
        ep_num_ops_del_meta: 34329
        ep_num_ops_get_meta: 44226426
        ep_num_ops_set_meta: 14257849
        Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta
        ep_num_ops_del_meta: 34329
        ep_num_ops_get_meta: 44254887
        ep_num_ops_set_meta: 14257849
        Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta
        ep_num_ops_del_meta: 34329
        ep_num_ops_get_meta: 44273589
        ep_num_ops_set_meta: 14257849

        Since no activity after 19:29PM, these getMeta should come from CAPI/XDCR, looks more likely the ep_engine incorrectly account the getMeta operations. It is a blocker IMHO, and hand over to Chiyoung for investigation.

        Thanks.

        Show
        junyi Junyi Xie (Inactive) added a comment - OK, Pavel reproduced the issue. Now it looks like it is a bug in ep_engine incorrectly account the getMeta ops. I deleted replication on the source cluster at 19:29PM and XDCR activity is stopped. Look at logs at 10.2.2.190, there is no capi_replication trace after 19:29PM. However, on the UI it keeps showing very high rate of getMeta in XDC section. Please look at screen shot: Junyi-Pavel-test Screen Shot 2012-09-14 at 7.39.21 PM.png from one of destination 10.2.2.190 By cbstats, you see the get_meta ops keep increasing (while del/set ops have no change) after I stopped the XDCR completely. Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta ep_num_ops_del_meta: 34329 ep_num_ops_get_meta: 41910986 ep_num_ops_set_meta: 14257849 Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta ep_num_ops_del_meta: 34329 ep_num_ops_get_meta: 44140162 ep_num_ops_set_meta: 14257849 Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta ep_num_ops_del_meta: 34329 ep_num_ops_get_meta: 44226426 ep_num_ops_set_meta: 14257849 Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta ep_num_ops_del_meta: 34329 ep_num_ops_get_meta: 44254887 ep_num_ops_set_meta: 14257849 Junyis-MacBook-Pro:management junyi$ cbstats 10.2.2.190:11211 all | grep meta ep_num_ops_del_meta: 34329 ep_num_ops_get_meta: 44273589 ep_num_ops_set_meta: 14257849 Since no activity after 19:29PM, these getMeta should come from CAPI/XDCR, looks more likely the ep_engine incorrectly account the getMeta operations. It is a blocker IMHO, and hand over to Chiyoung for investigation. Thanks.
        junyi Junyi Xie (Inactive) made changes -
        junyi Junyi Xie (Inactive) made changes -
        Priority Critical [ 2 ] Blocker [ 1 ]
        junyi Junyi Xie (Inactive) made changes -
        Assignee Ketaki Gangal [ ketaki ] Chiyoung Seo [ chiyoung ]
        junyi Junyi Xie (Inactive) made changes -
        Comment [ on VMs, set up a 4->4 unidirectional XDCR with about 8M items, on source side, loading it with 2-3K ops/sec using the same workload ( 3% of items expiration time: 5 minutes) in Pavel's test.

        Concurrent mcsoda data loading:

        lib/perf_engines/mcsoda.py 10.3.3.61:11211 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.03 expiration=300 ratio-deletes=0.04 min-value-size=2,3 max-items=20000000 exit-after-creates=0 prefix=XDCR_10
        lib/perf_engines/mcsoda.py 10.3.3.61:11211 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.03 expiration=300 ratio-deletes=0.04 min-value-size=2,3 max-items=20000000 exit-after-creates=0 prefix=XDCR_9


        Observation:
        1. on destination side, the replication rate is stable, around 1-3k which is similar to the testcase w/o expired items. See the screenshot attached.
        Need to figure out how to reproduce it.

        2. on the source side, the XDC queue slowly increase to > 1M items, apparently the replication speed is hard to always catch up with front-end ops.

        3. when compactor is running at dest, stats become choppy, not surprised, we know that.


        ]
        junyi Junyi Xie (Inactive) made changes -
        Comment [ Downgrade the priority until we can reproduce it. ]
        junyi Junyi Xie (Inactive) made changes -
        Comment [ Ketaki, please reproduce it on VM if possible, and point me to the clusters. Thanks. ]
        Show
        chiyoung Chiyoung Seo added a comment - http://review.couchbase.org/#/c/20927/1
        chiyoung Chiyoung Seo made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        pavelpaulau Pavel Paulau added a comment -

        Verified.

        Show
        pavelpaulau Pavel Paulau added a comment - Verified.
        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #431 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/431/)
        MB-6662 Ignore the item's expiration time for SET_WITH_META (Revision 5a9b619ae5c9c65c3a1e2902d57f60d0e4915c59)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • src/stored-value.hh
        Show
        thuan Thuan Nguyen added a comment - Integrated in github-ep-engine-2-0 #431 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/431/ ) MB-6662 Ignore the item's expiration time for SET_WITH_META (Revision 5a9b619ae5c9c65c3a1e2902d57f60d0e4915c59) Result = SUCCESS Chiyoung Seo : Files : src/stored-value.hh
        pavelpaulau Pavel Paulau made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            pavelpaulau Pavel Paulau
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes