Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50087

Multi-node XDCR test doesn't report replication_changes_left correctly on build 7.1.0-1885

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 7.1.0
    • 7.1.0
    • XDCR
    • Untriaged
    • 1
    • Unknown

    Description

      The XDCR test will monitor replication_changes_left after XDCR starts. In this run, xdcr reported replication_changes_left = 0, and it resulted in wrong performance number reported. 

      Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 1G x 1KB, DGM, s=1, c=1000

      build: 7.1.0-1885

      Job: http://perf.jenkins.couchbase.com/job/titan/12877/

      2021-12-13T23:33:11 [INFO] Adding a remote cluster: 172.23.96.100

      2021-12-13T23:33:11 [INFO] Starting replication with parameters {'replicationType': 'continuous', 'fromBucket': 'bucket-1', 'toBucket': 'bucket-1', 'toCluster': 'perf'}

      2021-12-13T23:33:11 [INFO] Getting remote clusters

      2021-12-13T23:33:11 [INFO] Monitoring XDCR queues: bucket-1

      2021-12-13T23:33:33 [INFO] replication_changes_left reached 0

       

      In a good run, replication_changes_left will decrease gradually. 

      http://perf.jenkins.couchbase.com/job/titan/12397/

      2021-11-01T17:41:31 [INFO] Adding a remote cluster: 172.23.96.100

      2021-11-01T17:41:31 [INFO] Starting replication with parameters {'replicationType': 'continuous', 'fromBucket': 'bucket-1', 'toBucket': 'bucket-1', 'toCluster': 'perf'}

      2021-11-01T17:41:31 [INFO] Getting remote clusters

      2021-11-01T17:41:31 [INFO] Monitoring XDCR queues: bucket-1

      2021-11-01T17:41:53 [INFO] replication_changes_left = 991,768,354

      2021-11-01T17:41:55 [INFO] replication_changes_left = 990,671,216

      2021-11-01T17:41:58 [INFO] replication_changes_left = 988,192,619

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          bo-chun.wang Bo-Chun Wang added a comment -

          The issue happened intermittently. I re-ran the run and got a good run this time.

          http://perf.jenkins.couchbase.com/job/titan/12879/

           

          bo-chun.wang Bo-Chun Wang added a comment - The issue happened intermittently. I re-ran the run and got a good run this time. http://perf.jenkins.couchbase.com/job/titan/12879/  
          lilei.chen Lilei Chen added a comment - - edited

          The test is checking for changes_left at 2021-12-13T23:33:33. In Neo, XDCR takes longer to start because of BucketTopologySvc and PeerToPeer communication. Node 105 pipeline takes ~4 minutes to start:

          2021-12-13T23:33:11.866-08:00 INFO GOXDCR.PipelineMgr: Try to start/restart Pipeline 446e948e576963e457ebcf751354fb62/bucket-1/bucket-1.
          ...
          2021-12-13T23:37:12.388-08:00 INFO GOXDCR.GenericPipeline: ----------- MainPipeline 446e948e576963e457ebcf751354fb62/bucket-1/bucket-1-923917077 has been started----------

          Live pipeline stats update started at

          2021-12-13T23:37:12.313-08:00 INFO GOXDCR.StatsMgr: 446e948e576963e457ebcf751354fb62/bucket-1/bucket-1-923917077 updateStats started

          So there is no changes_left update from the pipeline itself when test check for it.

          However, it did construct stats for this pipeline at 2021-12-13T23:34:10.575-08:00 as a paused pipeline since it has not started and no stats is available: 

          2021-12-13T23:34:10.575-08:00 INFO GOXDCR.ReplMgr: Calculating stats for never run replication 446e948e576963e457ebcf751354fb62/bucket-1/bucket-1. kv_vb_map=..., total_docs=200422410, docs_processed=0, changes_left=200422410

          For the changes_left stats to be available, either the pipeline has to be started, or the paused pipeline update has been run. Since the paused pipeline update is done every 60s, it is timing dependent. In this run, the paused pipeline stats update was run 59s after pipeline starting so the test has to wait longer.

          In the good run, paused pipeline update was run 3s after pipeline start. So the test could check almost immediately.

          If the test sleep for 60s before checking, it should be good. Also after XDCR improvement on pipeline start time, this should go away.

          I also tested creating replication as paused and was able to get correct changes_left value.

          lilei.chen Lilei Chen added a comment - - edited The test is checking for changes_left at 2021-12-13T23:33:33. In Neo, XDCR takes longer to start because of BucketTopologySvc and PeerToPeer communication. Node 105 pipeline takes ~4 minutes to start: 2021 - 12 -13T23: 33 : 11.866 - 08 : 00 INFO GOXDCR.PipelineMgr: Try to start/restart Pipeline 446e948e576963e457ebcf751354fb62/bucket- 1 /bucket- 1 . ... 2021 - 12 -13T23: 37 : 12.388 - 08 : 00 INFO GOXDCR.GenericPipeline: ----------- MainPipeline 446e948e576963e457ebcf751354fb62/bucket- 1 /bucket- 1 - 923917077 has been started---------- Live pipeline stats update started at 2021 - 12 -13T23: 37 : 12.313 - 08 : 00 INFO GOXDCR.StatsMgr: 446e948e576963e457ebcf751354fb62/bucket- 1 /bucket- 1 - 923917077 updateStats started So there is no changes_left update from the pipeline itself when test check for it. However, it did construct stats for this pipeline at 2021-12-13T23:34:10.575-08:00 as a paused pipeline since it has not started and no stats is available:  2021 - 12 -13T23: 34 : 10.575 - 08 : 00 INFO GOXDCR.ReplMgr: Calculating stats for never run replication 446e948e576963e457ebcf751354fb62/bucket- 1 /bucket- 1 . kv_vb_map=..., total_docs= 200422410 , docs_processed= 0 , changes_left= 200422410 For the changes_left stats to be available, either the pipeline has to be started, or the paused pipeline update has been run. Since the paused pipeline update is done every 60s, it is timing dependent. In this run, the paused pipeline stats update was run 59s after pipeline starting so the test has to wait longer. In the good run, paused pipeline update was run 3s after pipeline start. So the test could check almost immediately. If the test sleep for 60s before checking, it should be good. Also after XDCR improvement on pipeline start time, this should go away. I also tested creating replication as paused and was able to get correct changes_left value.
          lilei.chen Lilei Chen added a comment -

          Pipeline took 4 minutes to start because PeerToPeer communication timed out. This was fixed in MB-50095. I am rerunning the test with build 1970 to see if it is fixed with MB-50095. Also re-assign to Neil to verify it is the same root cause.

          lilei.chen Lilei Chen added a comment - Pipeline took 4 minutes to start because PeerToPeer communication timed out. This was fixed in MB-50095 . I am rerunning the test with build 1970 to see if it is fixed with MB-50095 . Also re-assign to Neil to verify it is the same root cause.
          lilei.chen Lilei Chen added a comment -

          Neil Huang 

          With build 1970, P2P still timed out after 4 minutes. Here is the test that I ran: http://perf.jenkins.couchbase.com/job/titan/13135/

          lilei.chen Lilei Chen added a comment - Neil Huang   With build 1970, P2P still timed out after 4 minutes. Here is the test that I ran: http://perf.jenkins.couchbase.com/job/titan/13135/
          neil.huang Neil Huang added a comment - - edited

          The problem is XDCR failed to parse the following payload when received manifests from peers:

              "SrcManifests": {
                "1": {
                  "uid": "1",
                  "scopes": [
                    {
                      "collections": null,
                      "name": "_default",
                      "uid": "0"
                    },
                    {
                      "collections": [
                        {
                          "name": "collection-1",
                          "uid": "8"
                        },
                        {
                          "name": "collection-10",
                          "uid": "9"
                        },
                        {
                          "name": "collection-100",
                          "uid": "a"
                        },
                        {
                          "name": "collection-1000",
                          "uid": "b"
          

          Specifically, we’re looking at the default scope without any default collections.

          Looking at the test set up, we see that scope-1 was set up with an empty collection set.
          I think this is being loaded via bulk collection loader… So I think this means that there is no default collection in the default scope for bucket-1.

          00:03:01     "collections": {
          00:03:01         "bucket-1": {
          00:03:01             "_default": {},
          00:03:01             "scope-1": {
          00:03:01                 "collection-1": {
          00:03:01                     "access": 1,
          00:03:01                     "load": 1
          00:03:01                 },
          00:03:01                 "collection-10": {
          00:03:01                     "access": 1,
          00:03:01                     "load": 1
          00:03:01                 },
          00:03:01                 "collection-100": {
          00:03:01                     "access": 1,
          00:03:01                     "load": 1
          00:03:01                 },
          00:03:01                 "collection-1000": {
          00:03:01                     "access": 1,
          00:03:01                     "load": 1
          00:03:01                 },
          00:03:01                 "collection-101": {
          00:03:01                     "access": 1,
          00:03:01                     "load": 1
          00:03:01                 },
          00:03:01                 "collection-102": {
          00:03:01                     "access": 1,
          00:03:01                     "load": 1
          00:03:01                 },
          

          Normally, this seems to work fine when I try to serialize / deserialize manifests directly.
          However, when manifests are serialized and deserialized as part of the payload, the collection key will have a <nil> value instead of an empty array.
          This seems to be causing error from being parsed.

          neil.huang Neil Huang added a comment - - edited The problem is XDCR failed to parse the following payload when received manifests from peers: … "SrcManifests": { "1": { "uid": "1", "scopes": [ { "collections": null, "name": "_default", "uid": "0" }, { "collections": [ { "name": "collection-1", "uid": "8" }, { "name": "collection-10", "uid": "9" }, { "name": "collection-100", "uid": "a" }, { "name": "collection-1000", "uid": "b" Specifically, we’re looking at the default scope without any default collections. Looking at the test set up, we see that scope-1 was set up with an empty collection set. I think this is being loaded via bulk collection loader… So I think this means that there is no default collection in the default scope for bucket-1. 00:03:01 "collections": { 00:03:01 "bucket-1": { 00:03:01 "_default": {}, 00:03:01 "scope-1": { 00:03:01 "collection-1": { 00:03:01 "access": 1, 00:03:01 "load": 1 00:03:01 }, 00:03:01 "collection-10": { 00:03:01 "access": 1, 00:03:01 "load": 1 00:03:01 }, 00:03:01 "collection-100": { 00:03:01 "access": 1, 00:03:01 "load": 1 00:03:01 }, 00:03:01 "collection-1000": { 00:03:01 "access": 1, 00:03:01 "load": 1 00:03:01 }, 00:03:01 "collection-101": { 00:03:01 "access": 1, 00:03:01 "load": 1 00:03:01 }, 00:03:01 "collection-102": { 00:03:01 "access": 1, 00:03:01 "load": 1 00:03:01 }, Normally, this seems to work fine when I try to serialize / deserialize manifests directly. However, when manifests are serialized and deserialized as part of the payload, the collection key will have a <nil> value instead of an empty array. This seems to be causing error from being parsed.
          neil.huang Neil Huang added a comment -

          I was able to reproduce and validate the fix locally. Please see the next build and verify.

          neil.huang Neil Huang added a comment - I was able to reproduce and validate the fix locally. Please see the next build and verify.

          Build couchbase-server-7.1.0-2061 contains goxdcr commit b1bb9cd with commit message:
          MB-50087: manifest unmarshalling after being serialize by p2p vbpayload

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2061 contains goxdcr commit b1bb9cd with commit message: MB-50087 : manifest unmarshalling after being serialize by p2p vbpayload

          Build couchbase-server-7.1.0-2069 contains goxdcr commit 95073a5 with commit message:
          MB-50087: fix unit test to use the right data file

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2069 contains goxdcr commit 95073a5 with commit message: MB-50087 : fix unit test to use the right data file
          bo-chun.wang Bo-Chun Wang added a comment -

          I have a good run with build 7.1.0-2061. I close this issue.

          http://perf.jenkins.couchbase.com/job/titan/13220/

           

          bo-chun.wang Bo-Chun Wang added a comment - I have a good run with build 7.1.0-2061. I close this issue. http://perf.jenkins.couchbase.com/job/titan/13220/  

          People

            bo-chun.wang Bo-Chun Wang
            bo-chun.wang Bo-Chun Wang
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty