[Windows] - cbas keeps crashing consistently when upgrading from 7.0.4-7274 to 7.1.0 on windows due to java.exe missing

Description

Found this issue while verifying toy build from

1. Install 7.0.4-7274 on a single windows 2019 node
2. Setup the cluster and install travel-sample
3. Upgrade the cluster to 7.1.0 release version

Observe that cbas keeps crashing continuously with error:

Service 'cbas' exited with status 1. Restarting. Messages:
2022-05-25T17:05:20.024-07:00 INFO CBAS.cbas uuid: 88f1c109b55511914b3dc6b24fd61eeb
2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas IPv4 addresses: [172.23.136.107 127.0.0.1]
2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas IPv6 addresses: [fe80::f56b:fdf5:1076:31ed%Ethernet 5 ::1%Loopback Pseudo-Interface 1]
2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas setting ipv4 required, ipv6 optional
2022-05-25T17:05:20.034-07:00 INFO CBAS.cbas using previously configured io devices (1)
2022-05-25T17:05:20.034-07:00 INFO CBAS.cbas setting java.home to bundled jre (c:/Program Files/Couchbase/Server/lib/cbas\runtime)
2022-05-25T17:05:20.034-07:00 ERRO CBAS.cbas cbas process aborting with exit code 1 due to error validating java home: unable to determine java version: exec: "c:/Program Files/Couchbase/Server/lib/cbas\\runtime\\bin\\java.exe": file does not exist
2022-05-25T17:05:20.034-07:00 FATA CBAS.cbas error validating java home: unable to determine java version: exec: "c:/Program Files/Couchbase/Server/lib/cbas\\runtime\\bin\\java.exe": file does not exist

We can see that java.exe is missing in the specified location
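
A quick way to confirm the same condition on another node is to check for the bundled Java executable directly. The sketch below is a hypothetical helper, not part of Couchbase Server: the install root is copied from the log above, and the function names `bundled_java_path` and `java_is_missing` are made up for illustration.

```python
from pathlib import Path

# Install root as it appears in the log above; adjust for non-default installs.
DEFAULT_ROOT = "c:/Program Files/Couchbase/Server"


def bundled_java_path(install_root=DEFAULT_ROOT):
    """Return the path where cbas expects its bundled java.exe."""
    return Path(install_root) / "lib" / "cbas" / "runtime" / "bin" / "java.exe"


def java_is_missing(install_root=DEFAULT_ROOT):
    """True if the bundled java.exe is absent, the condition cbas reports."""
    return not bundled_java_path(install_root).is_file()


if __name__ == "__main__":
    path = bundled_java_path()
    state = "MISSING" if java_is_missing() else "present"
    print(f"{path}: {state}")
```

Running this on the upgraded node before and after any recovery step shows whether the file cbas complains about is actually on disk.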

Issue is not observed in 7.0.4-7265 (RC1)

Issue not observed in linux and deb packages (verified with 7.0.4-7278)

Activity

Gary Gray September 3, 2024 at 4:03 PM

Closing this because it appears it was resolved by a separate DOC ticket closed by Ray several years ago. This appears to be the original dev ticket that Tony assigned to himself for some reason.

Arunkumar Senthilnathan July 1, 2022 at 7:04 PM

Verified that the 7.0.4 -> 7.1.0 Windows upgrade workaround is good, i.e. after the upgrade is done, if the user runs a Repair on the installation, Couchbase Server comes up with all data and services intact.

Chris Hillery June 15, 2022 at 11:01 AM

FYI I have filed to track entirely disabling Upgrade on Windows.

Arunkumar Senthilnathan June 10, 2022 at 8:49 PM

We will manually validate this workaround and update this ticket.

Chris Hillery June 8, 2022 at 7:35 AM

Let me explain the underlying bug, and I think that will answer your first questions.

In a normal Server upgrade, Windows Installer (the Windows subsystem that handles software installs, uninstalls, and upgrades) removes the files for the older version of Server and then replaces them with the files from the newer version of Server. In the 7.0.4->7.1.0 upgrade in particular, though, Windows Installer gets confused because a few files are actually newer (have a higher version number) in the "older" Server version 7.0.4. It ends up removing those files but then not replacing them at all, so when 7.1.0 tries to start, it finds some files missing. This causes (at least) memcached and cbas to fail, making the node unusable.
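
The failure mode described above can be illustrated with a toy model. This is only a sketch of the behavior as described in this comment, not Windows Installer's actual algorithm; the function names, file names, and version numbers are all hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.0.0.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def simulate_upgrade(old_files, new_files):
    """Toy model of the upgrade behavior described above.

    old_files and new_files map a file name to its version string.
    Returns the set of file names left on disk after the upgrade.
    """
    on_disk = set()
    for name, new_version in new_files.items():
        old_version = old_files.get(name)
        if old_version is not None and parse_version(old_version) > parse_version(new_version):
            # The "older" release ships a higher-versioned copy of this file:
            # it is removed with the old release and never replaced.
            continue
        on_disk.add(name)
    return on_disk


# Hypothetical file names and versions, for illustration only.
old = {"runtime_file.exe": "2.0.0.0", "other_file.dll": "1.0.0.0"}
new = {"runtime_file.exe": "1.5.0.0", "other_file.dll": "1.1.0.0"}
print(sorted(simulate_upgrade(old, new)))  # ['other_file.dll']
```

In this model the file whose version went backwards between releases is the one that ends up missing, which matches the symptom reported for java.exe.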

When the customer invokes the Repair operation on this broken 7.1.0 installation, Windows Installer verifies all Server files on the drive, and replaces any that don't match what the 7.1.0 installer says they should be, including any files that are outright missing. So after the Repair operation completes, the Server installation now looks exactly like it should have looked after the initial upgrade. Therefore Server can start and all functionality should be normal.
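
As a rough illustration of what "verify and replace" means here, the sketch below lists the files a Repair-style pass would need to restore. It assumes a manifest of expected SHA-256 digests, which is a stand-in for whatever Windows Installer really compares against; the function names and manifest format are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def files_needing_repair(install_root, manifest):
    """List files that are missing or differ from the expected manifest.

    manifest maps a relative path to its expected SHA-256 hex digest
    (an assumed format, not the real installer database).
    """
    root = Path(install_root)
    broken = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        # A file needs repair if it is absent or its contents do not match.
        if not target.is_file() or sha256_of(target) != expected:
            broken.append(rel_path)
    return sorted(broken)
```

An empty result would correspond to the state after a successful Repair: every file the installer expects is present and matches.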

Hopefully that gives you the detail you need to understand the bug. FYI, earlier when I said that "7.0.4->7.1.0 is broken and can't be fixed", what I meant was that we cannot fix the bug, not that there's no way to fix a customer's installation. Any customer doing any upgrade from 7.0.4->7.1.0 will trigger this bug and wind up with a broken Server installation, and there is nothing we can do about that. However, once they've hit the bug and have a broken Server, using Repair should fix their installation.

To reply to your other questions:

Will this have any side effects if the node is running just CBAS vs. CBAS plus other services (KV, GSI, etc.)?

To the best of my knowledge, the services that are on the node won't matter. The node will be unusable until the Repair operation is performed.

Also, do customers need to eject the node completely from the cluster to apply this change, or is a failover sufficient?

If they're doing an Upgrade on Windows, they are implicitly shutting the node down as part of an Offline Cluster Upgrade, is that right? So now the individual node Upgrade process has an additional step, which is the Repair operation. The node will not re-enter the cluster prior to the Repair being complete. I don't believe this additional step changes the logic of whether an Offline Cluster Upgrade should be performed, or the process by which the Offline Cluster Upgrade will be done.

I do not currently believe that this bug causes any additional risk of data loss during the upgrade procedure. However I certainly cannot swear to that. If the customer is following the Offline Cluster Upgrade procedure properly, they should have a full data backup prior to starting.

Also, at a later time, will customers need to follow a different approach to upgrade from 7.1.0 to 7.1.1?

The bug here only affects Upgrade from exactly 7.0.4 to exactly 7.1.0. No other pair of Server versions is known to be affected. In particular, 7.0.4->7.1.1 is known to work normally, and so far as I know 7.1.0->7.1.1 works normally as well. And if a customer upgrades from 7.0.4 to 7.1.0, hits this bug, and Repairs their installation, then there should be no reason they couldn't later follow the normal upgrade process from 7.1.0 to any future Server version.

It is possible there will be other pairs of Server versions affected by variants of this same bug in future. This is why I'm strongly considering disallowing node upgrade on Windows entirely. That's far outside the scope of this ticket though.

Do we have QE sign-off on this interim solution, or is it yet to be validated?

You would need to talk to QE about that. My local testing has only been with trivial one-node configurations.

Done

Details

Is this a Regression?

Yes

Triage

Untriaged

Operating System

Windows 64-bit

Created May 26, 2022 at 12:26 AM
Updated September 3, 2024 at 4:03 PM
Resolved September 3, 2024 at 4:03 PM