  1. Couchbase .NET client library
  2. NCBC-3177

NRE when rebalancing and cluster map is missing an alternate address


Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 3.3.0
    • library
    • None
    • 1

    Description

      Scenario: Mixed node upgrade (Enterprise Edition 6.6.3 build 9808 and Enterprise Edition 7.0.3 build 7032). First bootstrap to 6.6.3 and run a load. Then add 7.0.3 and hit rebalance; the server will not return the AlternateAddress for the 7.0.3 node.

      During a rebalance of a cluster that uses alternate addresses, it's possible for the server to not return an alternate address for a node (here, "cb2.lan"):

      {
      	"rev": 1844,
      	"nodesExt": [{
      		"services": {
      			"mgmt": 8091,
      			"mgmtSSL": 18091,
      			"kv": 11210,
      			"kvSSL": 11207,
      			"capi": 8092,
      			"capiSSL": 18092,
      			"projector": 9999,
      			"projector": 9999
      		},
      		"thisNode": true,
      		"hostname": "cb1.lan",
      		"alternateAddresses": {
      			"external": {
      				"hostname": "mbp.local",
      				"ports": {
      					"mgmt": 9091,
      					"kv": 11211
      				}
      			}
      		}
      	}, {
      		"services": {
      			"mgmt": 8091,
      			"mgmtSSL": 18091,
      			"kv": 11210,
      			"kvSSL": 11207,
      			"capi": 8092,
      			"capiSSL": 18092,
      			"projector": 9999,
      			"projector": 9999
      		},
      		"hostname": "cb2.lan"
      	}],
      	"clusterCapabilitiesVer": [1, 0],
      	"clusterCapabilities": {
      		"n1ql": ["enhancedPreparedStatements"]
      	}
      }
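
      As a rough illustration of the condition above, the following Python sketch scans a cluster map for nodes that lack an alternate address on a given network. It is not SDK code: the function name nodes_missing_alternate and the trimmed SAMPLE config are hypothetical, and the only assumption about the format is the "nodesExt" / "alternateAddresses" layout shown in the config above.

```python
import json


def nodes_missing_alternate(cluster_map, network="external"):
    """Return hostnames of nodesExt entries with no alternate address for `network`.

    Hypothetical helper for illustration; not part of any Couchbase SDK.
    """
    missing = []
    for node in cluster_map.get("nodesExt", []):
        # The key may be absent entirely, as it is for "cb2.lan" above.
        alternates = node.get("alternateAddresses") or {}
        if network not in alternates:
            missing.append(node.get("hostname", "<unknown>"))
    return missing


# Trimmed copy of the config above (only the fields this check needs).
SAMPLE = json.loads("""
{
  "rev": 1844,
  "nodesExt": [
    {"services": {"mgmt": 8091, "kv": 11210},
     "thisNode": true,
     "hostname": "cb1.lan",
     "alternateAddresses": {
       "external": {"hostname": "mbp.local", "ports": {"mgmt": 9091, "kv": 11211}}
     }},
    {"services": {"mgmt": 8091, "kv": 11210},
     "hostname": "cb2.lan"}
  ]
}
""")

print(nodes_missing_alternate(SAMPLE))  # ['cb2.lan']
```

      A check like this could be run against a captured config before restarting an application, to spot a node whose alternate addresses were lost.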
      

      If the NRE is handled in the SDK (by checking for a null AlternateAddress), this in turn leads to NMVB (NotMyVBucket) being returned after the rebalance completes:

      2022-03-31T16:15:00.3782234-07:00  [DBG] Op failed: "Couchbase.Core.IO.Operations.Set`1[<>f__AnonymousType2`1[System.String]]" (01625f1d)
      Couchbase.KeyValue.NotMyVBucketException: Exception of type 'Couchbase.KeyValue.NotMyVBucketException' was thrown.
         at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\ClusterNode.cs:line 550
      -----------------------Context Info---------------------------
      {"DispatchedFrom":"JeffMorris-0503","DispatchedTo":"mbp.local:11211","DocumentKey":"mykey3","ClientContextId":"24606","Cas":0,"Status":"vBucketBelongsToAnotherServer","BucketName":"default","CollectionName":"_default","ScopeName":"_default","Message":"KV Error: {Name=\u0022NOT_MY_VBUCKET\u0022, Description=\u0022Server does not know about this vBucket\u0022, Attributes=\u0022fetch-config,invalid-input\u0022}","OpCode":"set"}
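
      The null check mentioned above can be sketched in Python as follows. This is only an assumption about what such a guard might look like, not the SDK's actual fix: resolve_kv_endpoint is a hypothetical name, and falling back to the node's internal hostname and port is one possible policy. Note that the fallback only avoids the crash; a client outside the cluster's network may still be unable to reach the internal address, which would be consistent with the failures described in this issue.

```python
def resolve_kv_endpoint(node, network="external"):
    """Pick a (host, kv_port, used_alternate) tuple for one nodesExt entry.

    Hypothetical helper for illustration; not part of any Couchbase SDK.
    """
    # Guard against a missing or null "alternateAddresses" entry instead of
    # dereferencing it unconditionally.
    alternates = node.get("alternateAddresses") or {}
    alt = alternates.get(network)
    if alt is None:
        # No alternate address for this network: fall back to the internal
        # hostname and KV port (an assumed policy, see the note above).
        return node["hostname"], node["services"]["kv"], False
    ports = alt.get("ports") or {}
    # Use the alternate KV port when one is given, else the default KV port.
    return alt["hostname"], ports.get("kv", node["services"]["kv"]), True


cb1 = {
    "services": {"kv": 11210},
    "hostname": "cb1.lan",
    "alternateAddresses": {
        "external": {"hostname": "mbp.local", "ports": {"kv": 11211}}
    },
}
cb2 = {"services": {"kv": 11210}, "hostname": "cb2.lan"}

print(resolve_kv_endpoint(cb1))  # ('mbp.local', 11211, True)
print(resolve_kv_endpoint(cb2))  # ('cb2.lan', 11210, False)
```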
      
      

      If the application is then stopped and restarted, then bootstrapping fails with the following socket exception (because there is no alternate address to use for the second node "cb2.lan"):

      2022-03-31T16:23:19.5538156-07:00  [DBG] Attempted bootstrapping on endpoint "mbp.local:11211" has failed. (e80489ed)
      System.Net.Sockets.SocketException (11001): No such host is known.
         at System.Net.NameResolutionPal.ProcessResult(SocketError errorCode, GetAddrInfoExContext* context)
         at System.Net.NameResolutionPal.GetAddressInfoExCallback(Int32 error, Int32 bytes, NativeOverlapped* overlapped)
      --- End of stack trace from previous location ---
         at Couchbase.DnsClientDnsResolver.GetIpAddressAsync(String hostName, CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\DnsClientDnsResolver.cs:line 45
         at Couchbase.Utils.IpEndPointService.GetIpEndPointAsync(String hostNameOrIpAddress, Int32 port, CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Utils\IpEndPointService.cs:line 39
         at Couchbase.Core.IO.Connections.ConnectionFactory.CreateAndConnectAsync(HostEndpointWithPort hostEndpoint, CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\IO\Connections\ConnectionFactory.cs:line 41
         at Couchbase.Core.IO.Connections.ConnectionPoolBase.CreateConnectionAsync(CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\IO\Connections\ConnectionPoolBase.cs:line 84
         at Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.<>c__DisplayClass30_0.<<AddConnectionsAsync>g__StartConnection|0>d.MoveNext() in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\IO\Connections\DataFlow\DataFlowConnectionPool.cs:line 294
      --- End of stack trace from previous location ---
         at Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.AddConnectionsAsync(Int32 count, CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\IO\Connections\DataFlow\DataFlowConnectionPool.cs:line 337
         at Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.InitializeAsync(CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\IO\Connections\DataFlow\DataFlowConnectionPool.cs:line 87
         at Couchbase.Core.ClusterNode.InitializeAsync() in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\ClusterNode.cs:line 217
         at Couchbase.Core.DI.ClusterNodeFactory.CreateAndConnectAsync(HostEndpointWithPort endPoint, BucketType bucketType, NodeAdapter nodeAdapter, CancellationToken cancellationToken) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\DI\ClusterNodeFactory.cs:line 65
         at Couchbase.Core.ClusterContext.BootstrapGlobalAsync() in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\ClusterContext.cs:line 379
      


        Activity

          jmorris Jeff Morris added a comment -

          To recreate:

          • Provision 2 nodes 6.6.3 and 7.0.3: enable alternate addresses on each (couchbase-cli setting-alternate-address...)
          • Have an app put a load on the 6.6.3 node and then add 7.0.3 node, then rebalance.
          • Remove the 7.0.3 node and rebalance.
          • Add the same 7.0.3 and rebalance...the server will return NMVB and not recover.
          • Go back to the 7.0.3 node and re-enable alternate addresses: couchbase-cli setting-alternate-address -c localhost:8091 --username Administrator --password password --set --node cb2.lan --hostname mbp.local --ports mgmt=9092,kv=11212;
          • Restart the app and it will then work correctly.
          jmorris Jeff Morris added a comment -

          Closing as I have been convinced this is expected behavior: when a node is removed from a cluster it resets itself back to a "fresh state".

          jmorris Jeff Morris added a comment -

          Reopening to fix the NRE that is thrown within the SDK. However, handling it will result in NMVBs, because the cluster state is bad at this point. The resolution is to set the alternate addresses again on the node that was swapped out, as its state will be "refreshed". This is really a server configuration issue at this point. Once the alternate addresses have been set on the server, the SDK will resolve itself when an updated cluster map is returned.


          People

            jmorris Jeff Morris