When performing rolling upgrades, especially when a cluster is running in mixed mode where some nodes are < 7.0 and some are 7.0, XDCR does not prevent users from activating collection-related modes and toggles.
In the interest of safety, XDCR should enforce implicit mapping and not allow users to edit anything collection-related until it is confirmed that the source cluster is fully upgraded.
It may also make sense to check target-cluster compatibility before allowing collection-related features to be set in the replication settings.
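As a rough illustration of the proposed guard, here is a minimal Go sketch, assuming a hypothetical validateCollectionSettings helper and a numeric cluster-compatibility encoding; the setting keys are the ones that appear in the goxdcr logs below, everything else is illustrative and not the actual goxdcr code:

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed encoding: major*10000 + minor*100 + patch (illustrative only).
const collectionsCompatVersion = 70000 // 7.0.0

var errMixedMode = errors.New("collection settings are not allowed until all nodes are upgraded to 7.0")

// Collection-related setting keys, as seen in the goxdcr logs below.
var collectionSettingKeys = map[string]bool{
	"colMappingRules":            true,
	"collectionsExplicitMapping": true,
	"collectionsMigrationMode":   true,
	"collectionsMirroringMode":   true,
	"collectionsOSOMode":         true,
}

// validateCollectionSettings enforces implicit mapping while the cluster is
// in mixed mode: any collection-related key is rejected unless the whole
// cluster is at 7.0 or above.
func validateCollectionSettings(settings map[string]interface{}, clusterCompat int) error {
	if clusterCompat >= collectionsCompatVersion {
		return nil // fully upgraded: collection settings may be edited
	}
	for key := range settings {
		if collectionSettingKeys[key] {
			return fmt.Errorf("setting %q: %w", key, errMixedMode)
		}
	}
	return nil
}

func main() {
	settings := map[string]interface{}{"collectionsMigrationMode": true}
	// A mixed-mode cluster whose lowest node is 6.6.2 -> rejected.
	fmt.Println(validateCollectionSettings(settings, 60602))
}
```

The same check would run against both the source cluster's compatibility and, as suggested above, the target cluster's before any collection-related setting is accepted.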
Neil Huang added a comment:
If this MB is not committed for Cheshire-Cat, it should be included in the release notes to ensure that users follow the upgrade procedure and do not change collection settings in the middle of a rolling upgrade.
John Liang added a comment (edited):
ns-server and the console are not supposed to turn on 7.0 features until the cluster is fully upgraded. Did you find a bug that says otherwise?
Neil Huang added a comment:
While trying to reproduce MB-45892 locally, I ran a 2-node mixed-mode cluster (one node on 6.6.2 and one node upgraded from 6.6.2 to 7.0.0); each node serves its own respective UI.
This means that on the 6.6.2 node at :8091 it showed the 6.6.2 UI, while the 7.0.0 node at :8091 showed the 7.0.0 UI with full-blown collection UI support.
Neil Huang added a comment (edited):
This is the UI on the mixed-mode 6.6.2 node:
This is the UI on the mixed-mode 7.0.0 node:
The silver lining here is that because the source bucket has no collections other than the default one, the UI is "safe" from users choosing an explicit mapping rule. Note that users are still able to create manual rules via the REST API, which this MB's safeguard will help address.
I could still create a migration rule though:
And the replication shows up as a migration rule on the UI... note the warning about compatibility mode:
Looking at the logs, we can see that the rule was created on the 7.0 node and saved as a replication spec:
2021-04-26T16:44:31.728Z INFO GOXDCR.CollectionsManifestSvc: CollectionsManifestAgent: Updated source manifest from old version 0 to new version 0
2021-04-26T16:44:31.728Z WARN GOXDCR.CheckpointSvc: Did not find spec 1ed372852f77c38a01ff41adce6c3dba/B1/B2 from internal Cache
2021-04-26T16:44:31.728Z WARN GOXDCR.CheckpointSvc: Did not find spec 1ed372852f77c38a01ff41adce6c3dba/B1/B2 from internal Cache
2021-04-26T16:44:31.728Z INFO GOXDCR.BackfillMgr: Starting backfill request handler for spec 1ed372852f77c38a01ff41adce6c3dba/B1/B2 internalId e6yBxv6KlO9zepS5jMAptg==
2021-04-26T16:44:31.728Z INFO GOXDCR.ReplicationSpecChangeListener: specChangedCallback called on id = 1ed372852f77c38a01ff41adce6c3dba/B1/B2, oldSpec=, newSpec=Id: 1ed372852f77c38a01ff41adce6c3dba/B1/B2 InternalId: e6yBxv6KlO9zepS5jMAptg== SourceBucketName: B1 SourceBucketUUID: b35cc3442c5c1ee57e21ab1f2f60b3e4 TargetClusterUUID: 1ed372852f77c38a01ff41adce6c3dba TargetBucketName: B2 TargetBucketUUID: d37aa03dfdba54462e61d6322ecc8806 Settings: map[CollectionsMgtMulti:ExplicitMapping: false Mirroring: false Migration: true OSO: true active:true backlogThreshold:50 bandwidth_limit:0 checkpoint_interval:600 colMappingRules:map[type="airport":scope1.collection1] collectionsExplicitMapping:false collectionsMigrationMode:true collectionsMirroringMode:false collectionsOSOMode:true collectionsSkipSrcValidation:false compression_type:3 delAllBackfills:false delSpecificBackfillForVb:-1 dismissEvent:-1 doc_batch_size_kb:2048 failure_restart_interval:10 filterBypassExpiry:false filterDeletion:false filterExpiration:false filter_exp_del:0 filter_expression: filter_expression_version:0 filter_skip_restream:false hlvPruningWindowSec:259200 log_level:Info manualBackfill: mergeFunctionMapping:map[] optimistic_replication_threshold:256 priority:High replication_type:xmem retryOnRemoteAuthErr:true retryOnRemoteAuthErrMaxWaitSec:3600 source_nozzle_per_node:2 stats_interval:1000 target_nozzle_per_node:2 worker_batch_size:500]
2021-04-26T16:44:31.728Z INFO GOXDCR.ReplicationSpecChangeListener: new spec settings=retryOnRemoteAuthErr:true, filter_expression:, mergeFunctionMapping:map[], retryOnRemoteAuthErrMaxWaitSec:3600, backlogThreshold:50, CollectionsMgtMulti:ExplicitMapping: false Mirroring: false Migration: true OSO: true, filter_expression_version:0, doc_batch_size_kb:2048, collectionsSkipSrcValidation:false, stats_interval:1000, delAllBackfills:false, bandwidth_limit:0, optimistic_replication_threshold:256, source_nozzle_per_node:2, replication_type:xmem, active:true, log_level:Info, dismissEvent:-1, failure_restart_interval:10, priority:High, checkpoint_interval:600, worker_batch_size:500, target_nozzle_per_node:2, hlvPruningWindowSec:259200, filter_skip_restream:false, manualBackfill:, filter_exp_del:0, delSpecificBackfillForVb:-1, colMappingRules:map[type="airport":scope1.collection1], compression_type:3
2021-04-26T16:44:31.728Z INFO GOXDCR.ReplicationSpecChangeListener: Starting pipeline 1ed372852f77c38a01ff41adce6c3dba/B1/B2 since the replication spec has been changed to active
2021-04-26T16:44:31.728Z INFO GOXDCR.ReplMgr: Success adding replication specification 1ed372852f77c38a01ff41adce6c3dba/B1/B2
2021-04-26T16:44:31.728Z INFO GOXDCR.ReplMgr: Replication specification 1ed372852f77c38a01ff41adce6c3dba/B1/B2 is created
2021-04-26T16:44:31.728Z INFO GOXDCR.AdminPort: Finished doCreateReplicationRequest call
On the 6.6 node, however, it receives the spec and logs warnings because it does not know how to parse the collection rules; the pipeline still started anyway, just without those rules:
2021-04-26T16:44:31.729Z INFO GOXDCR.ReplicationSpecChangeListener: metakvCallback called on listener ReplicationSpecChangeListener with path = /replicationSpec/1ed372852f77c38a01ff41adce6c3dba/B1/B2
2021-04-26T16:44:31.729Z INFO GOXDCR.ReplSpecSvc: ReplicationSpecServiceCallback called on path = /replicationSpec/1ed372852f77c38a01ff41adce6c3dba/B1/B2
2021-04-26T16:44:31.729Z WARN GOXDCR.SettingsCommon: Settings unmarshalled from metakv has the following issues: map[dismissEvent:not a valid setting CollectionsMgtMulti:not a valid setting mergeFunctionMapping:not a valid setting hlvPruningWindowSec:not a valid setting retryOnRemoteAuthErrMaxWaitSec:not a valid setting collectionsSkipSrcValidation:not a valid setting delSpecificBackfillForVb:not a valid setting retryOnRemoteAuthErr:not a valid setting delAllBackfills:not a valid setting manualBackfill:not a valid setting colMappingRules:not a valid setting]
settings=checkpoint_interval:600, stats_interval:1000, hlvPruningWindowSec:259200, priority:0, backlogThreshold:50, delAllBackfills:false, manualBackfill:, colMappingRules:map[type="airport":scope1.collection1], doc_batch_size_kb:2048, filter_expression:, filter_skip_restream:false, replication_type:xmem, retryOnRemoteAuthErrMaxWaitSec:3600, collectionsSkipSrcValidation:false, filter_exp_del:0, log_level:13, optimistic_replication_threshold:256, target_nozzle_per_node:2, worker_batch_size:500, compression_type:3, delSpecificBackfillForVb:-1, dismissEvent:-1, retryOnRemoteAuthErr:true, source_nozzle_per_node:2, CollectionsMgtMulti:12, failure_restart_interval:10, active:true, bandwidth_limit:0, filter_expression_version:0, mergeFunctionMapping:map[]
2021-04-26T16:44:31.729Z INFO GOXDCR.ReplicationSpecChangeListener: specChangedCallback called on id = 1ed372852f77c38a01ff41adce6c3dba/B1/B2, oldSpec=, newSpec=Id: 1ed372852f77c38a01ff41adce6c3dba/B1/B2 InternalId: e6yBxv6KlO9zepS5jMAptg== SourceBucketName: B1 SourceBucketUUID: b35cc3442c5c1ee57e21ab1f2f60b3e4 TargetClusterUUID: 1ed372852f77c38a01ff41adce6c3dba TargetBucketName: B2 TargetBucketUUID: d37aa03dfdba54462e61d6322ecc8806 Settings: map[log_level:Info optimistic_replication_threshold:256 filterExpiration:false filterDeletion:false source_nozzle_per_node:2 compression_type:3 filter_exp_del:0 filter_expression: replication_type:xmem worker_batch_size:500 target_nozzle_per_node:2 active:true priority:High stats_interval:1000 checkpoint_interval:600 failure_restart_interval:10 doc_batch_size_kb:2048 backlogThreshold:50 bandwidth_limit:0 filter_expression_version:0 filter_skip_restream:false filterBypassExpiry:false]
2021-04-26T16:44:31.729Z INFO GOXDCR.ReplicationSpecChangeListener: new spec settings=optimistic_replication_threshold:256, priority:High, log_level:Info, compression_type:3, checkpoint_interval:600, replication_type:xmem, filter_exp_del:0, target_nozzle_per_node:2, source_nozzle_per_node:2, failure_restart_interval:10, filter_skip_restream:false, worker_batch_size:500, filter_expression_version:0, stats_interval:1000, backlogThreshold:50, doc_batch_size_kb:2048, filter_expression:, active:true, bandwidth_limit:0
2021-04-26T16:44:31.729Z INFO GOXDCR.ReplicationSpecChangeListener: Starting pipeline 1ed372852f77c38a01ff41adce6c3dba/B1/B2 since the replication spec has been changed to active
2021-04-26T16:44:31.729Z INFO GOXDCR.PipelineMgr: PipelineOpSerializer 1ed372852f77c38a01ff41adce6c3dba/B1/B2 handling job: {1ed372852f77c38a01ff41adce6c3dba/B1/B2 2 <nil> <nil>}
2021-04-26T16:44:31.729Z INFO GOXDCR.PipelineMgr: ReplicationStatus is created and set with 1ed372852f77c38a01ff41adce6c3dba/B1/B2
2021-04-26T16:44:31.729Z INFO GOXDCR.PipelineMgr: Pipeline updater 1ed372852f77c38a01ff41adce6c3dba/B1/B2 is launched with retry_interval=10
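The SettingsCommon warning above suggests that the older node simply keeps the setting keys it recognizes and drops the rest before starting the pipeline. A simplified sketch of that kind of filtering, with assumed names (this is not the actual 6.6 code path):

```go
package main

import "fmt"

// A few of the setting keys a 6.6 node understands (illustrative subset).
var knownSettings66 = map[string]bool{
	"active":              true,
	"replication_type":    true,
	"checkpoint_interval": true,
	"compression_type":    true,
	"log_level":           true,
	"priority":            true,
}

// filterKnown keeps recognized settings and reports the rest as issues,
// mirroring the "not a valid setting" warning in the log above.
func filterKnown(raw map[string]interface{}) (valid map[string]interface{}, issues map[string]string) {
	valid = make(map[string]interface{})
	issues = make(map[string]string)
	for key, val := range raw {
		if knownSettings66[key] {
			valid[key] = val
		} else {
			issues[key] = "not a valid setting"
		}
	}
	return valid, issues
}

func main() {
	raw := map[string]interface{}{
		"active":          true,
		"colMappingRules": `{"type=\"airport\"":"scope1.collection1"}`,
	}
	valid, issues := filterKnown(raw)
	fmt.Println(valid)  // map[active:true] -> the migration rule is silently gone
	fmt.Println(issues) // map[colMappingRules:not a valid setting]
}
```

This matches what the specChangedCallback on the 6.6 node shows: the spec it ends up with has none of the collection settings, yet the pipeline starts anyway.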
So even though the UI prevents explicit mapping mode, it does not prevent migration mode. Moreover, if a user uses the REST API and/or the CLI (not sure, need to verify) to create explicit mapping mode and rules, the replication can be created successfully without these safeguards in place, and XDCR replication behavior will then be inconsistent between the 6.6 node and the 7.0 node.
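For reference, a replication like this can be created straight against the REST API on the 7.0 node, bypassing any UI safeguards. A hedged sketch (host, credentials, and remote-cluster name are placeholders; the collection parameters are assumed to match the setting keys seen in the logs above):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Form parameters for the XDCR createReplication endpoint; the
	// colMappingRules value mirrors the migration rule from the logs above.
	form := url.Values{
		"fromBucket":               {"B1"},
		"toCluster":                {"remote"},
		"toBucket":                 {"B2"},
		"replicationType":          {"continuous"},
		"collectionsMigrationMode": {"true"},
		"colMappingRules":          {`{"type=\"airport\"":"scope1.collection1"}`},
	}
	req, err := http.NewRequest(http.MethodPost,
		"http://127.0.0.1:8091/controller/createReplication",
		strings.NewReader(form.Encode()))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.SetBasicAuth("Administrator", "password")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// In mixed mode, a 7.0 node accepts this request today; this MB's
	// safeguard would make it return an error instead.
	fmt.Println("status:", resp.Status)
}
```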
Hyun-Ju Vega added a comment:
Expected upgrade behavior: new features are not generally available in a cluster that is being upgraded until all nodes are upgraded.
https://docs.couchbase.com/server/current/install/upgrade.html#upgrade-faq
Upgrade FAQ
At which point in the upgrade process will the new features of the upgrade be available?
Once every node in the cluster is upgraded to the target release, the new features of that release are available for use. Even if 90% of all nodes are upgraded, the cluster is still considered to be on the older revision, and newer features are unavailable.
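In other words, effective cluster compatibility is pinned to the lowest node version, so a single remaining 6.6.2 node keeps the whole cluster on 6.6 semantics. A small illustrative sketch of that rule (the version encoding is assumed):

```go
package main

import "fmt"

type version struct{ major, minor int }

// clusterCompat returns the lowest node version: the cluster only "speaks"
// the feature set that every node understands.
func clusterCompat(nodes []version) version {
	lowest := nodes[0]
	for _, v := range nodes[1:] {
		if v.major < lowest.major ||
			(v.major == lowest.major && v.minor < lowest.minor) {
			lowest = v
		}
	}
	return lowest
}

func main() {
	nodes := []version{{7, 0}, {6, 6}} // the mixed-mode repro above
	fmt.Println(clusterCompat(nodes))  // {6 6}: 7.0 features must stay off
}
```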
John Liang added a comment:
Rob Ashcom, I see that Pavel updated the ETA. I am assuming he meant this is a UI bug, so I am assigning it to the UI team for now. Please assign it back to us if this is not a UI bug. Thanks.