Couchbase Server / MB-52313

[Windows] - cbas keeps crashing consistently when upgrading from 7.0.4-7274 to 7.1.0 on windows due to java.exe missing

Details

    • Untriaged
    • Windows 64-bit
    • 1
    • Yes

    Description

      Found this issue while verifying toy build from MB-52249

      1. Install 7.0.4-7274 on a single Windows 2019 node
      2. Set up the cluster and install travel-sample
      3. Upgrade the cluster to the 7.1.0 release version

      Observe that cbas crashes continuously with the error:

      Service 'cbas' exited with status 1. Restarting. Messages:
      2022-05-25T17:05:20.024-07:00 INFO CBAS.cbas uuid: 88f1c109b55511914b3dc6b24fd61eeb
      2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas IPv4 addresses: [172.23.136.107 127.0.0.1]
      2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas IPv6 addresses: [fe80::f56b:fdf5:1076:31ed%Ethernet 5 ::1%Loopback Pseudo-Interface 1]
      2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas setting ipv4 required, ipv6 optional
      2022-05-25T17:05:20.034-07:00 INFO CBAS.cbas using previously configured io devices (1)
      2022-05-25T17:05:20.034-07:00 INFO CBAS.cbas setting java.home to bundled jre (c:/Program Files/Couchbase/Server/lib/cbas\runtime)
      2022-05-25T17:05:20.034-07:00 ERRO CBAS.cbas cbas process aborting with exit code 1 due to error validating java home: unable to determine java version: exec: "c:/Program Files/Couchbase/Server/lib/cbas\\runtime\\bin
      java.exe": file does not exist
      2022-05-25T17:05:20.034-07:00 FATA CBAS.cbas error validating java home: unable to determine java version: exec: "c:/Program Files/Couchbase/Server/lib/cbas\\runtime\\bin
      java.exe": file does not exist

      java.exe is missing from the specified location.

      The issue is not observed with 7.0.4-7265 (RC1).

      The issue is not observed with Linux and deb packages (verified with 7.0.4-7278).
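
The failure in the log comes down to a single file being absent under the bundled runtime directory. The following is a minimal Python sketch of a post-upgrade check, not something Couchbase Server ships: the install root and the relative path are taken from the log above, while the function name `missing_files` and the `EXPECTED_FILES` list are assumptions made for illustration.

```python
from pathlib import Path

# Relative path taken from the cbas log above; the list itself is hypothetical
# and can be extended with any other file the node needs at startup.
EXPECTED_FILES = ["lib/cbas/runtime/bin/java.exe"]


def missing_files(install_root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under install_root."""
    root = Path(install_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Install root as it appears in the log; adjust for a non-default install.
    for rel in missing_files("c:/Program Files/Couchbase/Server"):
        print(f"missing after upgrade: {rel}")
```

An empty result means every listed file is present; it says nothing about files that are not in the list.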

          Activity

            tony.hillman Tony Hillman added a comment - - edited DOC-10124

            Let me explain the underlying bug, and I think that will answer your first questions.

            In a normal Server upgrade, Windows Installer (the Windows subsystem that handles software installs, uninstalls, and upgrades) removes the files for the older version of Server and then replaces them with the files from the newer version of Server. In the 7.0.4->7.1.0 upgrade in particular, though, Windows Installer gets confused because a few files are actually newer (have a higher version number) in the "older" Server version 7.0.4. It ends up removing those files but then not replacing them at all, so when 7.1.0 tries to start, it finds some files missing. This causes (at least) memcached and cbas to fail, making the node unusable.
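
The sequence described above can be illustrated with a toy model in Python. This is a simplified sketch of the behaviour as the comment describes it, not the actual Windows Installer algorithm: the function names, the file `example.dll`, and every version number are hypothetical, and the handling of equal versions is a model assumption.

```python
def parse_version(text):
    """Turn a dotted version such as '2.0.0.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def simulate_upgrade(old_files, new_files):
    """Toy model of the upgrade described above.

    old_files / new_files map a file name to its version string.
    Returns (files on disk afterwards, files the new version expects but lacks).
    """
    # Planning: a file already on disk with a higher version is not scheduled
    # for copying (model assumption: equal versions are still copied).
    to_copy = {
        name
        for name, version in new_files.items()
        if name not in old_files
        or parse_version(version) >= parse_version(old_files[name])
    }
    # Removal: every file of the old version is deleted, so nothing from
    # old_files survives. Copy: only the scheduled files are written.
    on_disk = {name: new_files[name] for name in to_copy}
    missing = sorted(set(new_files) - set(on_disk))
    return on_disk, missing


# Hypothetical version numbers, chosen only to show the effect.
old = {"java.exe": "2.0.0.0", "example.dll": "1.0.0.0"}
new = {"java.exe": "1.5.0.0", "example.dll": "1.1.0.0"}
on_disk, missing = simulate_upgrade(old, new)
print(missing)  # ['java.exe']
```

In this model the file whose "old" version is higher is skipped during planning and then deleted with the rest of the old version, which is how it ends up missing.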

            When the customer invokes the Repair operation on this broken 7.1.0 installation, Windows Installer verifies all Server files on the drive, and replaces any that don't match what the 7.1.0 installer says they should be, including any files that are outright missing. So after the Repair operation completes, the Server installation now looks exactly like it should have looked after the initial upgrade. Therefore Server can start and all functionality should be normal.
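
The Repair behaviour described above amounts to reconciling the files on disk against what the installer expects. A minimal Python sketch of that idea follows; it models files as a name-to-version mapping, and the function name `repair`, the file names, and the version numbers are all hypothetical. It is not how Windows Installer is implemented.

```python
def repair(on_disk, manifest):
    """Make on_disk match manifest in place; return the names that were restored."""
    restored = []
    for name, version in manifest.items():
        # Covers both a mismatching file and one that is missing outright.
        if on_disk.get(name) != version:
            on_disk[name] = version  # stands in for re-copying from the installer
            restored.append(name)
    return sorted(restored)


# Hypothetical state of a broken installation and the expected manifest.
broken = {"example.dll": "1.1.0.0"}
manifest = {"java.exe": "1.5.0.0", "example.dll": "1.1.0.0"}
restored = repair(broken, manifest)
print(restored)  # ['java.exe']
```

After the call, `broken` equals `manifest`, mirroring the statement that the repaired installation looks as it should have after the initial upgrade.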

            Hopefully that gives you the detail you need to understand the bug. FYI, earlier when I said that "7.0.4->7.1.0 is broken and can't be fixed", what I meant was that we cannot fix the bug, not that there's no way to fix a customer's installation. Any customer doing any upgrade from 7.0.4->7.1.0 will trigger this bug and wind up with a broken Server installation, and there is nothing we can do about that. However, once they've hit the bug and have a broken Server, using Repair should fix their installation.

            To reply to your other questions:

            Will this have any side effects if the node is running just CBAS vs. CBAS plus other services (KV, GSI, etc.)?

            To the best of my knowledge, the services that are on the node won't matter. The node will be unusable until the Repair operation is performed.

            Also, do customers need to eject the node completely from the cluster to apply this change, or is failover sufficient?

            If they're doing an Upgrade on Windows, they are implicitly shutting the node down as part of an Offline Cluster Upgrade, is that right? So now the individual node Upgrade process has an additional step, which is the Repair operation. The node will not re-enter the cluster prior to the Repair being complete. I don't believe this additional step changes the logic of whether an Offline Cluster Upgrade should be performed, or the process by which the Offline Cluster Upgrade will be done.

            I do not currently believe that this bug causes any additional risk of data loss during the upgrade procedure. However I certainly cannot swear to that. If the customer is following the Offline Cluster Upgrade procedure properly, they should have a full data backup prior to starting.

            Also, at a later time, should customers follow a different approach to upgrade from 7.1.0 to 7.1.1?

            The bug here only affects Upgrade from exactly 7.0.4 to exactly 7.1.0. No other pair of Server versions is known to be affected. In particular, 7.0.4->7.1.1 is known to work normally, and so far as I know 7.1.0->7.1.1 works normally as well. And if a customer upgrades from 7.0.4 to 7.1.0, hits this bug, and Repairs their installation, then there should be no reason they couldn't later follow the normal upgrade process from 7.1.0 to any future Server version.

            It is possible there will be other pairs of Server versions affected by variants of this same bug in future. This is why I'm strongly considering disallowing node upgrade on Windows entirely. That's far outside the scope of this ticket though.
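
The version rule stated above can be written down as a small lookup. This is a sketch in Python under stated assumptions: `AFFECTED_PAIRS`, `release`, and `needs_repair_after_upgrade` are hypothetical names, and the check compares release versions only. It ignores build numbers, even though the description reports one 7.0.4 build (7265) as unaffected.

```python
# Only the pair named in this ticket; further pairs could be added if
# variants of the bug are found later.
AFFECTED_PAIRS = {("7.0.4", "7.1.0")}


def release(version):
    """Drop a build suffix: '7.0.4-7274' -> '7.0.4'."""
    return version.split("-", 1)[0]


def needs_repair_after_upgrade(from_version, to_version):
    """True if this Windows upgrade path is known to need a Repair afterwards."""
    return (release(from_version), release(to_version)) in AFFECTED_PAIRS


print(needs_repair_after_upgrade("7.0.4-7274", "7.1.0"))  # True
print(needs_repair_after_upgrade("7.0.4", "7.1.1"))       # False
```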

            Do we have QE sign-off on this interim solution, or is it yet to be validated?

            You would need to talk to QE about that. My local testing has only been with trivial one-node configurations.

            ceej Chris Hillery added the comment above

            We will manually validate this workaround and update this ticket.

            arunkumar Arunkumar Senthilnathan (Inactive) added the comment above

            FYI I have filed MB-52563 to track entirely disabling Upgrade on Windows.

            ceej Chris Hillery added the comment above

            Verified that the 7.0.4 -> 7.1.0 Windows upgrade workaround is good, i.e. after the upgrade is done, if the user does a repair on the installation, Couchbase Server comes up with all the data and services intact.

            arunkumar Arunkumar Senthilnathan (Inactive) added the comment above

            People

              tony.hillman Tony Hillman
              arunkumar Arunkumar Senthilnathan (Inactive)