Couchbase Server / MB-37811

[CX] Analytics cluster down on IP reassignment

Details

    • Untriaged
    • Unknown
    • CX Sprint 185

    Description

      As observed in a customer Kubernetes environment, the CC (cluster controller) node was recreated with the existing storage but was assigned a new IP address. Due to network address caching, one of the existing NCs (node controllers) was perpetually unable to connect to the CC, because it did not pick up the IP address update until the driver could be restarted.

      Original suggestion (struck through in the ticket): we should set networkaddress.cache.ttl to a reasonable value, e.g. something <= 5 minutes, so that we are protected from this situation.

      That suggestion is insufficient, because caching above the name service is also a factor.
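To illustrate what "caching above the name service" can look like in Java, here is a minimal sketch. It assumes (the ticket does not say this) that the stale address is held by application code that resolved the CC host name once and kept the result. CcEndpoint and its methods are hypothetical names, not taken from the Analytics code base. The idea is to store only the host name and build a fresh InetSocketAddress on every reconnect attempt, so that the JVM-level TTL can take effect.

```java
import java.net.InetSocketAddress;

/** Hypothetical helper; not part of the Analytics (cbas) code base. */
final class CcEndpoint {
    private final String host;
    private final int port;

    CcEndpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Keeps only the host name and port; no name lookup is performed here. */
    InetSocketAddress unresolved() {
        return InetSocketAddress.createUnresolved(host, port);
    }

    /**
     * Attempts a name lookup on every call (still subject to the JVM's own
     * InetAddress cache), so a reconnect attempt can observe a changed IP.
     * An InetSocketAddress that is resolved once and then reused would keep
     * pointing at the old IP no matter how the cache TTL is configured.
     */
    InetSocketAddress resolveNow() {
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        CcEndpoint cc = new CcEndpoint("localhost", 9999);
        System.out.println("unresolved: " + cc.unresolved().isUnresolved());
        System.out.println("resolved now: " + cc.resolveNow());
    }
}
```

A reconnect loop would call resolveNow() before each connection attempt instead of reusing an address object created at startup.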

          Activity

            michael.blow Michael Blow created issue -
            michael.blow Michael Blow made changes -
            Field: Original Value -> New Value
            Link: (none) -> This issue causes CBSE-7909 [ CBSE-7909 ]
            michael.blow Michael Blow made changes -
            Labels: 6.5.1-candidate 6.x-candidate -> 6.5.1-candidate 6.x-candidate kubernetes
            till Till Westmann made changes -
            Rank: Ranked higher
            michael.blow Michael Blow made changes -
            Rank: Ranked higher
            michael.blow Michael Blow made changes -
            Sprint: (none) -> CX Sprint 185 [ 983 ]
            michael.blow Michael Blow made changes -
            Rank: Ranked lower
            till Till Westmann made changes -
            Labels: 6.5.1-candidate 6.x-candidate kubernetes -> 6.5.1-candidate 6.x-candidate kubernetes releasenote
            michael.blow Michael Blow made changes -
            Status: Open [ 1 ] -> In Progress [ 3 ]
            michael.blow Michael Blow made changes -
            Labels: 6.5.1-candidate 6.x-candidate kubernetes releasenote -> 6.0.5-candidate 6.5.1-candidate 6.x-candidate kubernetes releasenote
            till Till Westmann made changes -
            Labels: 6.0.5-candidate 6.5.1-candidate 6.x-candidate kubernetes releasenote -> 6.0.5-candidate 6.5.1-candidate 6.x-candidate kubernetes releasenote triaged

            matt.carabine Matt Carabine added a comment -

            Workaround for the benefit of anyone who encounters this issue:

            curl -v -u <admin>:<password> -X PUT -d jvmArgs="-Dnetworkaddress.cache.ttl=120" http://localhost:8095/analytics/config/service

            This sets the cache TTL value mentioned by Michael Blow at the Analytics Service level. Note that this assumes you previously had no custom JVM args set.

            For any Analytics Service nodes that are currently running, you will also have to restart the Analytics Service:

            curl -v -u <admin>:<password> -X POST http://localhost:8095/analytics/node/restart

            michael.blow Michael Blow added a comment -

            Thanks Matt Carabine, it looks like this workaround is insufficient to avoid the issue, at least in my preliminary testing w/ 6.0.3. Am debugging through the issues now to ensure a complete fix for 6.5.1.

            matt.carabine Matt Carabine added a comment -

            Michael Blow brought this up with the DBaaS team, and this could be a blocker for supporting Analytics on DBaaS (as it's using the operator under the hood where IP changes are likely), so if you do find any workaround that would work it would be great if you could put it in the MB as soon as possible.

            Otherwise, they may need to wait for 6.5.1 to support Analytics.

            michael.blow Michael Blow added a comment -

            Matt Carabine am working on this issue actively and will advise when I have a confirmed workaround.

            matt.carabine Matt Carabine added a comment -

            Thanks a lot Michael Blow!!

            michael.blow Michael Blow made changes -
            Labels: 6.0.5-candidate 6.5.1-candidate 6.x-candidate kubernetes releasenote triaged -> 6.0.5-candidate 6.5.1-candidate 6.x-candidate DBaaS kubernetes releasenote triaged
            michael.blow Michael Blow added a comment - - edited

            I've updated CBSE-7909 with my findings, repeating them here:

            Unfortunately, the workaround proposed above does not solve the issue. The only known solution after an IP address update on the CC is to restart each NC, either by using the Node Restart API on the remaining Analytics nodes or by otherwise restarting the service on those nodes (rebooting, killall -9 java, etc.).
            wayne Wayne Siu made changes -
            Affects Version/s Mad-Hatter [ 15037 ]
            Affects Version/s 6.5.0 [ 16624 ]
            michael.blow Michael Blow made changes -
            Summary: [CX] Infinite DNS lookup address caching breaks clusters on IP reassignment -> [CX] Analytics cluster down on IP reassignment
            michael.blow Michael Blow made changes -
            Description (original value):
            As observed in a customer Kubernetes environment, the CC node was recreated with the existing storage, but assigned a new IP address. Due to network address caching, one of the existing NCs was perpetually unable to connect to the CC as it did not realize the IP address update until the driver could be restarted.

            We should set {{networkaddress.cache.ttl}} to a reasonable value, so that we are protected from this situation. e.g. something <= 5 minutes
            Description (new value):
            As observed in a customer Kubernetes environment, the CC node was recreated with the existing storage, but assigned a new IP address. Due to network address caching, one of the existing NCs was perpetually unable to connect to the CC as it did not realize the IP address update until the driver could be restarted.

            -We should set {{networkaddress.cache.ttl}} to a reasonable value, so that we are protected from this situation. e.g. something <= 5 minutes-

            The above suggestion is insufficient, due to caching above the name service which is also a factor.
            michael.blow Michael Blow made changes -
            Link: (none) -> This issue is parent task of MB-37834 [ MB-37834 ]
            wayne Wayne Siu made changes -
            Link: (none) -> This issue blocks MB-37192 [ MB-37192 ]
            michael.blow Michael Blow made changes -
            Remote Link: (none) -> This issue links to "AsterixDB Gerrit Review (Web Link)" [ 19202 ]
            michael.blow Michael Blow made changes -
            Resolution: (none) -> Fixed [ 1 ]
            Status: In Progress [ 3 ] -> Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-6.5.1-6139 contains cbas commit c5f4aa7 with commit message:
            MB-37811: configure InetAddress cache ttl to 15s
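For readers unfamiliar with the setting named in the commit message, the following is a minimal sketch of how a 15-second InetAddress cache TTL can be configured programmatically in Java. It is an illustration only: the ticket does not show the contents of commit c5f4aa7, and InetAddressCacheConfig and apply are hypothetical names. In standard JDKs networkaddress.cache.ttl is documented as a security property (normally set in the java.security file), whereas -D on the command line sets a system property, so the sketch uses java.security.Security.setProperty.

```java
import java.security.Security;

/** Hypothetical sketch; class and method names are not from the cbas commit. */
final class InetAddressCacheConfig {
    /** JDK security property that controls caching of successful name lookups. */
    static final String POSITIVE_TTL = "networkaddress.cache.ttl";

    /**
     * Sets the TTL in seconds for successful lookups. The JDK typically reads
     * this value once, so call this before the first name lookup in the JVM.
     */
    static void apply(int ttlSeconds) {
        if (ttlSeconds < 0) {
            // The JDK documents -1 as "cache forever", which is the behavior to avoid here.
            throw new IllegalArgumentException("negative TTL would cache forever");
        }
        Security.setProperty(POSITIVE_TTL, Integer.toString(ttlSeconds));
    }

    public static void main(String[] args) {
        apply(15); // 15s, matching the value named in the commit message
        System.out.println(POSITIVE_TTL + "=" + Security.getProperty(POSITIVE_TTL));
    }
}
```

Calling apply(15) early in process startup is the important part of the design: once the JVM has initialized its address cache policy, later changes to the property may be ignored.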

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-6.5.1-6139 contains cbas-core commit 9cb70ef with commit message:
            MB-37811: remove restart node workarounds

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.0.0-1296 contains cbas commit c5f4aa7 with commit message:
            MB-37811: configure InetAddress cache ttl to 15s

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.0.0-1297 contains cbas-core commit 9cb70ef with commit message:
            MB-37811: remove restart node workarounds

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-1006.5.1-1052 contains cbas commit c5f4aa7 with commit message:
            MB-37811: configure InetAddress cache ttl to 15s

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-1006.5.1-1052 contains cbas-core commit 9cb70ef with commit message:
            MB-37811: remove restart node workarounds
            wayne Wayne Siu made changes -
            Labels: 6.0.5-candidate 6.5.1-candidate 6.x-candidate DBaaS kubernetes releasenote triaged -> 6.0.5-candidate DBaaS approved-for-6.5.1 kubernetes releasenote triaged
            michael.blow Michael Blow made changes -
            Remote Link: (none) -> This issue links to "AsterixDB Gerrit Review (Web Link)" [ 19223 ]
            mihir.kamdar Mihir Kamdar (Inactive) made changes -
            Assignee: Michael Blow [ michael.blow ] -> Arunkumar Senthilnathan [ arunkumar ]

            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Arunkumar Senthilnathan can you pls take a look at this - it is K8S related.

            arunkumar Arunkumar Senthilnathan added a comment -

            Covered by regression tests written by devs

            arunkumar Arunkumar Senthilnathan made changes -
            Status: Resolved [ 5 ] -> Closed [ 6 ]

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-1006.5.1-1125 contains cbas commit c5f4aa7 with commit message:
            MB-37811: configure InetAddress cache ttl to 15s

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-1006.5.1-1125 contains cbas-core commit 9cb70ef with commit message:
            MB-37811: remove restart node workarounds

            till Till Westmann made changes -
            Labels: 6.0.5-candidate DBaaS approved-for-6.5.1 kubernetes releasenote triaged -> 6.0.6-candidate DBaaS approved-for-6.5.1 kubernetes releasenote triaged

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-6.6.2-9599 contains cbas-core commit 9cb70ef with commit message:
            MB-37811: remove restart node workarounds

            People

              Assignee: arunkumar Arunkumar Senthilnathan
              Reporter: michael.blow Michael Blow
              Votes: 0
              Watchers: 6