Couchbase Server / MB-39842

[IPv6] Rebalance fails when adding another IPv6 node to a cluster whose nodes have dual-stack networking


Details

    Description

      Build: 7.0.0-2278

      Steps to reproduce:
      1. Initialize a 1-node cluster with the kv, n1ql, and index services, with IPv6 enabled.
      2. Add another node with the kv, n1ql, and index services to this cluster.
      3. Start rebalance.

      Rebalance fails with the following error:

      [ns_server:error,2020-06-09T10:07:10.054-07:00,ns_1@s10501-ip6.qe.couchbase.com:service_rebalancer-index-worker<0.3064.4>:service_agent:process_bad_results:862]Service call get_node_infos (service index) failed on some nodes:
      [{'ns_1@s10501-ip6.qe.couchbase.com',
           {exit,
               {{linked_process_died,<0.3002.4>,{no_connection,"index-service_api"}},
                {gen_server,call,
                    [{'service_agent-index','ns_1@s10501-ip6.qe.couchbase.com'},
                     {if_rebalance,<0.3054.4>,get_node_info},
                     infinity]}}}}]
      [ns_server:error,2020-06-09T10:07:10.054-07:00,ns_1@s10501-ip6.qe.couchbase.com:cleanup_process<0.3053.4>:service_janitor:maybe_init_topology_aware_service:87]Initial rebalance for `index` failed: {error,
                                             {initial_rebalance_failed,index,
                                              {agent_died,<0.3001.4>,
                                               {linked_process_died,<0.3002.4>,
                                                {no_connection,
                                                 "index-service_api"}}}}}
      

      The indexer log is full of errors like the following:

      2020-06-09T10:10:44.592-07:00 [Error] KVSender::closeMutationStream, MAINT_STREAM  Error from Projector Post http://s10501-ip6.qe.couchbase.com:9999/adminport/shutdownTopicRequest: dial tcp 172.23.211.58:9999: connect: connection refused
      2020-06-09T10:10:44.592-07:00 [Fatal] Indexer::closeAllStreams Stream MAINT_STREAM Projector health check needed, indexer can not proceed, Error received Post http://s10501-ip6.qe.couchbase.com:9999/adminport/shutdownTopicRequest: dial tcp 172.23.211.58:9999: connect: connection refused. Retrying (526).
      2020-06-09T10:10:49.592-07:00 [Info] KVSender::sendShutdownTopic Projector s10501-ip6.qe.couchbase.com:9999 Topic MAINT_STREAM_TOPIC_fa659551e9c83565b61c9b506e3c98fe
      2020-06-09T10:10:49.593-07:00 [Error] KVSender::sendShutdownTopic Unexpected Error During Shutdown Projector s10501-ip6.qe.couchbase.com:9999 Topic MAINT_STREAM_TOPIC_fa659551e9c83565b61c9b506e3c98fe. Err Post http://s10501-ip6.qe.couchbase.com:9999/adminport/shutdownTopicRequest: dial tcp 172.23.211.58:9999: connect: connection refused
      2020-06-09T10:10:49.593-07:00 [Error] KVSender::closeMutationStream MAINT_STREAM  Error Received Post http://s10501-ip6.qe.couchbase.com:9999/adminport/shutdownTopicRequest: dial tcp 172.23.211.58:9999: connect: connection refused from s10501-ip6.qe.couchbase.com:9999
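      In the errors above, the projector hostname is dialed at an IPv4 address (172.23.211.58) even though the cluster has IPv6 enabled. The following is a minimal sketch, not part of the product, for pulling the dialed addresses out of such log lines and reporting their IP version. The function name `dialed_addresses` and the regular expression are assumptions made for illustration.

```python
import ipaddress
import re

# Matches the address Go reports in errors such as
# "dial tcp 172.23.211.58:9999: connect: connection refused".
# Bracketed IPv6 literals ("dial tcp [fd63::1]:9999") are handled too.
DIAL_RE = re.compile(
    r"dial tcp (?:\[(?P<v6>[^\]]+)\]|(?P<v4>[^:\s]+)):(?P<port>\d+)"
)


def dialed_addresses(log_text):
    """Return a sorted list of (ip_version, address, port) found in log_text."""
    found = set()
    for match in DIAL_RE.finditer(log_text):
        raw = match.group("v6") or match.group("v4")
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            continue  # not a literal IP address; skip it
        found.add((ip.version, str(ip), int(match.group("port"))))
    return sorted(found)


# One of the log lines from this report.
sample = (
    "2020-06-09T10:10:44.592-07:00 [Error] KVSender::closeMutationStream, "
    "MAINT_STREAM  Error from Projector Post "
    "http://s10501-ip6.qe.couchbase.com:9999/adminport/shutdownTopicRequest: "
    "dial tcp 172.23.211.58:9999: connect: connection refused"
)
print(dialed_addresses(sample))  # [(4, '172.23.211.58', 9999)]
```

      Any entry with version 4 in the result means the hostname was resolved to an IPv4 address for that connection attempt.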
      

      This issue is seen on VMs that have both IPv4 and IPv6 addresses. The two machines above show the following ifconfig output:

      [root@s10501-ip6 logs]# ifconfig
      eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 172.23.211.58  netmask 255.255.255.0  broadcast 172.23.211.255
              inet6 fd63:6f75:6368:20d3:ac3c:257e:9c5:6619  prefixlen 64  scopeid 0x0<global>
              inet6 fe80::dc11:da68:e01f:5368  prefixlen 64  scopeid 0x20<link>
              ether 02:57:a0:50:0a:cb  txqueuelen 1000  (Ethernet)
              RX packets 208184507  bytes 91874002860 (85.5 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 284737800  bytes 230012848725 (214.2 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
       
      lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
              inet 127.0.0.1  netmask 255.0.0.0
              inet6 ::1  prefixlen 128  scopeid 0x10<host>
              loop  txqueuelen 1  (Local Loopback)
              RX packets 222097621  bytes 499733590235 (465.4 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 222097621  bytes 499733590235 (465.4 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
       
      [root@s10502-ip6 ~]# ifconfig
      eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 172.23.211.43  netmask 255.255.255.0  broadcast 172.23.211.255
              inet6 fd63:6f75:6368:20d3:d97a:7875:2e48:ea25  prefixlen 64  scopeid 0x0<global>
              inet6 fe80::8b3f:d5df:572d:bbc4  prefixlen 64  scopeid 0x20<link>
              ether fe:c0:e3:98:2b:d3  txqueuelen 1000  (Ethernet)
              RX packets 10688808  bytes 10507260695 (9.7 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 5667521  bytes 3245147569 (3.0 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
       
      lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
              inet 127.0.0.1  netmask 255.0.0.0
              inet6 ::1  prefixlen 128  scopeid 0x10<host>
              loop  txqueuelen 1  (Local Loopback)
              RX packets 5453035  bytes 8576274848 (7.9 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 5453035  bytes 8576274848 (7.9 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      

      On machines whose eth0 interface has only IPv6 addresses, this issue is not seen:

      [root@s10510-ip6 tmp]# ifconfig
      eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet6 fd63:6f75:6368:20d4:d7f8:96be:407a:be32  prefixlen 64  scopeid 0x0<global>
              inet6 2600:2109:1:d4:8654:af10:e32a:1e4f  prefixlen 64  scopeid 0x0<global>
              inet6 fd63:6f75:6368:20d4:1234::1  prefixlen 64  scopeid 0x0<global>
              inet6 fe80::392:e5e9:4473:3f46  prefixlen 64  scopeid 0x20<link>
              ether 3e:12:72:c2:7a:79  txqueuelen 1000  (Ethernet)
              RX packets 10172173  bytes 10242758492 (9.5 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 6246975  bytes 4362772469 (4.0 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
       
      lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
              inet 127.0.0.1  netmask 255.0.0.0
              inet6 ::1  prefixlen 128  scopeid 0x10<host>
              loop  txqueuelen 1  (Local Loopback)
              RX packets 8721361  bytes 7492122535 (6.9 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 8721361  bytes 7492122535 (6.9 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
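      The difference between the two kinds of machine comes down to which address families a node's hostname resolves to. A rough sketch of that check follows, using the standard `socket.getaddrinfo` call. The helper names `address_families` and `is_dual_stack` are hypothetical, and the stub resolver only mimics the addresses shown in this report so that the example runs without DNS access.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def address_families(host, port, resolver=socket.getaddrinfo):
    """Return the set of families ("IPv4", "IPv6") that host resolves to."""
    families = set()
    for family, _type, _proto, _canon, _addr in resolver(
        host, port, type=socket.SOCK_STREAM
    ):
        if family in FAMILY_NAMES:
            families.add(FAMILY_NAMES[family])
    return families


def is_dual_stack(host, port, resolver=socket.getaddrinfo):
    """True when host resolves to both an IPv4 and an IPv6 address."""
    return address_families(host, port, resolver) == {"IPv4", "IPv6"}


def stub_resolver(host, port, type=0):
    # Stand-in for getaddrinfo (an assumption for this demo): returns the
    # eth0 addresses that ifconfig reports for s10501-ip6.
    return [
        (socket.AF_INET, type, 6, "", ("172.23.211.58", port)),
        (socket.AF_INET6, type, 6, "",
         ("fd63:6f75:6368:20d3:ac3c:257e:9c5:6619", port, 0, 0)),
    ]


print(sorted(address_families("s10501-ip6.qe.couchbase.com", 9999, stub_resolver)))
print(is_dual_stack("s10501-ip6.qe.couchbase.com", 9999, stub_resolver))
```

      Dropping the `resolver` argument makes the helpers query the system resolver, which also consults /etc/hosts on typical Linux setups.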
      

      The following commit may have caused this issue:

      Commit: 8b5354fd0de9be3bf162c6e24cf4f5794bb20f05  in build: couchbase-server-7.0.0-2265
      MB-31109: Make ip:port binding on projector and indexer more lenient
      If the cluster is configured to use ipv4, allow GSI processes
      to come up successfully even if they cannot bind to ipv6:port.
       
      Similarly, if the cluster is configured to use ipv6, allow GSI
      processes to come up successfully even if they cannot bind to
      ipv4:port.
       
      Note that the GSI clients will use the node names from cluster
      info cache. And the cluster configuration change with respect
      to ipv4/ipv6 protocol stack is not supported if the cluster
      node names are based on the ip addresses.
       
      Change-Id: Iadb546e60ef32a3edd8ce3e5e41be8d1e721b443
      Author: Amit Kulkarni <amit.kulkarni@couchbase.com>
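      The lenient binding that the commit message describes can be illustrated roughly as follows. This is a sketch under stated assumptions, not the GSI implementation: the names `bind_lenient` and `tcp_binder` are hypothetical, and the only behavior modeled is "a bind failure is fatal only for the address family the cluster is configured to use".

```python
import socket


def tcp_binder(family, host, port):
    """Return a zero-argument callable that opens a listening TCP socket."""
    def bind():
        sock = socket.socket(family, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
        except OSError:
            sock.close()
            raise
        return sock
    return bind


def bind_lenient(required, binders):
    """Try every binder; re-raise only if the binder for `required` fails.

    required: "ipv4" or "ipv6", the family the cluster is configured to use.
    binders:  dict of family name -> callable returning a listener or
              raising OSError.
    Returns (listeners, skipped), both dicts keyed by family name.
    """
    listeners, skipped = {}, {}
    for family, bind in binders.items():
        try:
            listeners[family] = bind()
        except OSError as err:
            if family == required:
                # Release whatever was already opened before giving up.
                for listener in listeners.values():
                    close = getattr(listener, "close", None)
                    if close is not None:
                        close()
                raise
            skipped[family] = err  # tolerated: the other family may be absent
    return listeners, skipped
```

      A caller on an IPv6-configured node might pass `{"ipv4": tcp_binder(socket.AF_INET, "127.0.0.1", 0), "ipv6": tcp_binder(socket.AF_INET6, "::1", 0)}` with `required="ipv6"`. Note that leniency on the listening side does not change which address a client dials, since clients use the node names from the cluster info cache.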
      


        Activity

          mihir.kamdar Mihir Kamdar (Inactive) added a comment - This issue got resolved after updating the /etc/hosts file to remove the IPv4 binding from it. Thanks Amit Kulkarni for pointing that out. Is there anything to look at from the product side, or can we close this issue? I am lowering the severity since we have a resolution.

          amit.kulkarni Amit Kulkarni added a comment - Nothing to be looked at from product side. I think we can close the issue.
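          The resolution above involved removing an IPv4 mapping from /etc/hosts. A small sketch of how such a mapping could be spotted is shown below. The function name `ipv4_entries_for` is hypothetical, and the sample hosts content is an assumption built from the addresses in this report, since the actual file was not attached.

```python
import ipaddress


def ipv4_entries_for(hosts_text, hostname):
    """Return the IPv4 addresses that hosts_text maps hostname to."""
    wanted = hostname.lower()
    matches = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # not a parseable IP address; ignore the line
        if ip.version == 4 and wanted in (name.lower() for name in names):
            matches.append(str(ip))
    return matches


# Hypothetical hosts file content; the real file is not part of this report.
sample_hosts = """\
127.0.0.1   localhost
::1         localhost
172.23.211.58                           s10501-ip6.qe.couchbase.com  # IPv4 mapping
fd63:6f75:6368:20d3:ac3c:257e:9c5:6619  s10501-ip6.qe.couchbase.com
"""
print(ipv4_entries_for(sample_hosts, "s10501-ip6.qe.couchbase.com"))
# ['172.23.211.58']
```

          On a real node the text would come from reading /etc/hosts. An empty result for the node's hostname is what one would expect after the fix described in this thread.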

          People

            amit.kulkarni Amit Kulkarni
            mihir.kamdar Mihir Kamdar (Inactive)