Couchbase Server / MB-15873

GoXDCR: Replications go missing when target cluster is online upgraded


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: 4.0.0
    • Fix Version: 4.0.0
    • Component: XDCR
    • Security Level: Public

    Description

      Build
      -----
      4.0.0-3494

      Testcase
      --------
      ./testrunner -i INI_FILE.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=True,get-coredumps=True,fail_on_errors=1,GROUP=ONLINE,upgrade_version=4.0.0-3528-rel,initial_vbuckets=1024 -t xdcr.upgradeXDCR.UpgradeTests.online_cluster_upgrade,initial_version=2.5.0-1059-rel,bucket_topology=default:1>2;standard_bucket_1:1<2;sasl_bucket_1:1><2,expires=500,GROUP=ONLINE

      http://qa.hq.northscale.net/job/cen006-p1-xxdcr-vset05-02-goxdcr-backward-compatibility-and-upgrade/161/consoleFull

      Steps
      =====
      1. Online upgrade C1[.11,.16] from 2.5.0 to 4.0 using extra node .21. All replications are intact.
      2. Online upgrade C2[.19,.20] from 2.5.0 to 4.0 using .21.

      At the end of this test (when C2 is upgraded), replications from C1 are missing.

      Live cluster: http://10.1.2.11:8091/index.html#sec=replications

      On .11 -
       
      StatisticsManager 2015-07-25T20:11:37.811-07:00 [INFO] Rounter Router_dcp_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.11:11210_0 = map[xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_15:16 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_2:16 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_0:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_21:14 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_4:14 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_12:16 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_28:18 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_30:17 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_26:19 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_25:14 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_17:17 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_1:13 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_22:11 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_6:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_3:13 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_24:16 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_27:14 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_29:14 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_20:18 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_31:18 
xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_10:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_14:13 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_7:18 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_9:14 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_18:16 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_19:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_23:18 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_8:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_11:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_5:15 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_13:16 xmem_dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1_10.1.2.21:11210_16:15]
      :
      ReplicationSpecService 2015-07-25T20:11:41.328-07:00 [ERROR] spec dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1 refers to non-existent target bucket "sasl_bucket_1"
      ReplicationSpecService 2015-07-25T20:11:41.328-07:00 [ERROR] Replication specification dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1 is no longer valid, garbage collect it. error=spec dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1 refers to non-existent target bucket "sasl_bucket_1"
       
      ReplicationSpecChangeListener 2015-07-25T20:11:41.335-07:00 [INFO] metakvCallback called on listener ReplicationSpecChangeListener with path = /replicationSpec/dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1
      ReplicationSpecService 2015-07-25T20:11:41.335-07:00 [INFO] ReplicationSpecServiceCallback called on path = /replicationSpec/dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1
      ReplicationSpecChangeListener 2015-07-25T20:11:41.335-07:00 [INFO] specChangedCallback called on id = dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1, oldSpec=&{dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1 sasl_bucket_1  dcbe280157d4856d3c2b15315d6683b5 sasl_bucket_1  0xc20816e1b0 [131 108 0 0 0 1 104 2 109 0 0 0 32 50 48 99 51 100 49 52 53 54 53 50 54 101 98 57 50 100 100 99 55 48 55 102 53 55 50 100 51 54 100 57 100 104 2 97 1 110 5 0 150 197 40 207 14 106]}, newSpec=<nil>
      ReplicationSpecChangeListener 2015-07-25T20:11:41.335-07:00 [INFO] old spec settings=&{xmem  true 1800 500 2048 10 256 2 64 1000 50 Info 1000 <nil>}
      PipelineManager 2015-07-25T20:11:41.335-07:00 [ERROR] Invalid replication status dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1, failed to retrieve spec. err=Requested resource not found
      PipelineManager 2015-07-25T20:11:41.335-07:00 [INFO] Stopping pipeline dcbe280157d4856d3c2b15315d6683b5/sasl_bucket_1/sasl_bucket_1 since the replication spec has been deleted
      

      At this time, according to the test log, we had removed the extra node .21, to which .11 was replicating; this caused the replications to be deleted.

      Please note that we added .19 and .20 to C2 just minutes before removing .21. goxdcr should be able to detect the new nodes and look for the buckets on them.

      Please also note that C1 does point to the new C2 node (.19) as the remote cluster reference IP. While goxdcr is able to update the remote cluster reference to .19, for replications it still looks at .21 and deletes them because .21 is no longer part of C2.

      (Timings on the Jenkins slave and the nodes are slightly off.)

      2015-07-25 20:11:37,657 - root - INFO - adding remote node @10.1.2.19:8091 to this cluster @10.1.2.21:8091
      2015-07-25 20:11:40,082 - root - INFO - adding node 10.1.2.20:8091 to cluster
      2015-07-25 20:11:40,083 - root - INFO - adding remote node @10.1.2.20:8091 to this cluster @10.1.2.21:8091
      2015-07-25 20:11:43,251 - root - INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.1.2.20%2Cns_1%4010.1.2.19%2Cns_1%4010.1.2.21
      2015-07-25 20:11:43,262 - root - INFO - rebalance operation started
      2015-07-25 20:11:43,277 - root - INFO - rebalance percentage : 0.00 %
      2015-07-25 20:11:53,297 - root - INFO - rebalance percentage : 6.83 %
      2015-07-25 20:12:03,317 - root - INFO - rebalance percentage : 14.19 %
      2015-07-25 20:12:13,343 - root - INFO - rebalance percentage : 21.96 %
      2015-07-25 20:12:23,363 - root - INFO - rebalance percentage : 28.91 %
      2015-07-25 20:12:33,383 - root - INFO - rebalance percentage : 32.99 %
      2015-07-25 20:12:43,403 - root - INFO - rebalance percentage : 39.15 %
      2015-07-25 20:12:53,432 - root - INFO - rebalance percentage : 46.31 %
      2015-07-25 20:13:03,451 - root - INFO - rebalance percentage : 53.43 %
      2015-07-25 20:13:13,481 - root - INFO - rebalance percentage : 60.67 %
      2015-07-25 20:13:23,499 - root - INFO - rebalance percentage : 64.76 %
      2015-07-25 20:13:33,518 - root - INFO - rebalance percentage : 69.43 %
      2015-07-25 20:13:43,547 - root - INFO - rebalance percentage : 76.63 %
      2015-07-25 20:13:53,567 - root - INFO - rebalance percentage : 83.95 %
      2015-07-25 20:14:03,586 - root - INFO - rebalance percentage : 91.32 %
      2015-07-25 20:14:13,605 - root - INFO - rebalance percentage : 96.73 %
      2015-07-25 20:14:23,644 - root - INFO - rebalancing was completed with progress: 100% in 160.381052971 sec
      2015-07-25 20:14:23,645 - root - INFO - Rebalance in all 4.0.0-3528-enterprise nodes completed
      2015-07-25 20:14:23,745 - root - INFO - Node versions in cluster [u'4.0.0-3528-enterprise', u'4.0.0-3528-enterprise', u'4.0.0-3528-enterprise']
      2015-07-25 20:14:23,745 - root - INFO - sleep for 15 secs.  ...
      2015-07-25 20:14:38,777 - root - INFO - /diag/eval status on 10.1.2.21:8091: True content: 'ns_1@10.1.2.21' command: node(global:whereis_name(ns_orchestrator))
      2015-07-25 20:14:38,778 - root - INFO - after rebalance in the master is ns_1@10.1.2.21
      2015-07-25 20:14:38,778 - root - INFO - Rebalancing out all old version nodes
      2015-07-25 20:14:39,707 - root - INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.2.21&user=Administrator&knownNodes=ns_1%4010.1.2.20%2Cns_1%4010.1.2.19%2Cns_1%4010.1.2.21
      2015-07-25 20:14:39,877 - root - INFO - rebalance operation started
      2015-07-25 20:14:39,896 - root - INFO - rebalance percentage : 0.00 %
      2015-07-25 20:14:49,916 - root - INFO - rebalance percentage : 11.26 %
      2015-07-25 20:14:59,947 - root - INFO - rebalance percentage : 22.83 %
      2015-07-25 20:15:09,966 - root - INFO - rebalance percentage : 33.33 %
      2015-07-25 20:15:19,990 - root - INFO - rebalance percentage : 33.77 %
      2015-07-25 20:15:30,017 - root - INFO - rebalance percentage : 44.87 %
      2015-07-25 20:15:40,097 - root - INFO - rebalance percentage : 54.92 %
      2015-07-25 20:15:50,124 - root - INFO - rebalance percentage : 65.04 %
      2015-07-25 20:16:00,144 - root - INFO - rebalance percentage : 66.67 %
      2015-07-25 20:16:10,175 - root - INFO - rebalance percentage : 75.06 %
      2015-07-25 20:16:20,195 - root - INFO - rebalance percentage : 85.72 %
      2015-07-25 20:16:30,216 - root - INFO - rebalance percentage : 97.01 %
      2015-07-25 20:16:40,238 - root - INFO - rebalance percentage : 100.00 %
      2015-07-25 20:16:43,734 - root - ERROR - socket error while connecting to http
      

          People

            Assignee: apiravi Aruna Piravi (Inactive)
            Reporter: apiravi Aruna Piravi (Inactive)
