[Windows] - cbas keeps crashing consistently when upgrading from 7.0.4-7274 to 7.1.0 on windows due to java.exe missing
Activity
Gary Gray September 3, 2024 at 4:03 PM
Closing this because it appears it was resolved by a separate DOC ticket closed by Ray several years ago. This appears to be the original dev ticket that Tony assigned to himself for some reason.
Arunkumar Senthilnathan July 1, 2022 at 7:04 PM
Verified that the 7.0.4 -> 7.1.0 Windows upgrade workaround is good, i.e. after the upgrade is done, if the user does a repair on the installation, Couchbase Server comes up with all the data and services intact.
Chris Hillery June 15, 2022 at 11:01 AM
FYI I have filed to track entirely disabling Upgrade on Windows.
Arunkumar Senthilnathan June 10, 2022 at 8:49 PM
We will manually validate this workaround and update this ticket.
Chris Hillery June 8, 2022 at 7:35 AM
Let me explain the underlying bug, and I think that will answer your first questions.
In a normal Server upgrade, Windows Installer (which is the Windows subsystem which handles software installs, uninstall, and upgrades) removes the files for the older version of Server and then replaces them with the files from the newer version of Server. In the 7.0.4->7.1.0 upgrade in particular, though, Windows Installer gets confused because a few files are actually newer (have a higher version number) in the "older" Server version 7.0.4. It ends up removing those files but then not replacing them at all, so when 7.1.0 tries to start, it finds some files missing. This causes (at least) memcached and cbas to fail, making the node unusable.
When the customer invokes the Repair operation on this broken 7.1.0 installation, Windows Installer verifies all Server files on the drive, and replaces any that don't match what the 7.1.0 installer says they should be, including any files that are outright missing. So after the Repair operation completes, the Server installation now looks exactly like it should have looked after the initial upgrade. Therefore Server can start and all functionality should be normal.
Hopefully that gives you the detail you need to understand the bug. FYI, earlier when I said that "7.0.4->7.1.0 is broken and can't be fixed", what I meant was that we cannot fix the bug, not that there's no way to fix a customer's installation. Any customer doing any upgrade from 7.0.4->7.1.0 will trigger this bug and wind up with a broken Server installation, and there is nothing we can do about that. However, once they've hit the bug and have a broken Server, using Repair should fix their installation.
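To make the sequence above concrete, here is a minimal sketch in Python of the behaviour as described in this comment. It is a simplified model, not Windows Installer's actual logic: it assumes the installer plans which files to write by comparing versions against what is already on disk, then removes the old product's files, then writes only the planned files. The file names and version numbers are invented for illustration.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, ...]
Files = Dict[str, Version]  # file name -> file version


def simulate_upgrade(old: Files, new: Files) -> Files:
    """Simplified model of the broken upgrade (an assumption, not MSI internals)."""
    # Plan: skip any new file whose on-disk (old) copy has a higher version.
    planned = {
        name: ver
        for name, ver in new.items()
        if name not in old or old[name] <= ver
    }
    # Remove everything the old product installed, then write only the plan.
    disk: Files = {}
    disk.update(planned)
    return disk


def simulate_repair(disk: Files, new: Files) -> Files:
    """Model of Repair: replace any file that is missing or does not match."""
    repaired = dict(disk)
    for name, ver in new.items():
        if repaired.get(name) != ver:
            repaired[name] = ver
    return repaired


def missing(disk: Files, new: Files) -> List[str]:
    """Files the new version expects but which are absent from disk."""
    return sorted(name for name in new if name not in disk)


# Hypothetical data: java.exe carries a higher version in the "older" release.
old_files: Files = {"java.exe": (11, 0, 2), "other.dll": (1, 0)}
new_files: Files = {"java.exe": (11, 0, 1), "other.dll": (2, 0)}

after_upgrade = simulate_upgrade(old_files, new_files)
after_repair = simulate_repair(after_upgrade, new_files)

print("missing after upgrade:", missing(after_upgrade, new_files))
print("missing after repair:", missing(after_repair, new_files))
```

In this model the upgrade leaves `java.exe` missing, and the repair pass restores the installation to exactly the file set the new version declares.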
To reply to your other questions:
Will this have any side effects if the node is running just CBAS vs. CBAS plus other services (KV, GSI, etc.)?
To the best of my knowledge, the services that are on the node won't matter. The node will be unusable until the Repair operation is performed.
Also, do customers need to eject the node completely from the cluster to apply this change, or is a failover enough?
If they're doing an Upgrade on Windows, they are implicitly shutting the node down as part of an Offline Cluster Upgrade, is that right? So now the individual node Upgrade process has an additional step, which is the Repair operation. The node will not re-enter the cluster prior to the Repair being complete. I don't believe this additional step changes the logic of whether an Offline Cluster Upgrade should be performed, or the process by which the Offline Cluster Upgrade will be done.
I do not currently believe that this bug causes any additional risk of data loss during the upgrade procedure. However I certainly cannot swear to that. If the customer is following the Offline Cluster Upgrade procedure properly, they should have a full data backup prior to starting.
Also, at a later time, do customers need to follow a different approach to upgrade from 7.1.0 to 7.1.1?
The bug here only affects Upgrade from exactly 7.0.4 to exactly 7.1.0. No other pair of Server versions is known to be affected. In particular, 7.0.4->7.1.1 is known to work normally, and so far as I know 7.1.0->7.1.1 works normally as well. And if a customer upgrades from 7.0.4 to 7.1.0, hits this bug, and Repairs their installation, then there should be no reason they couldn't later follow the normal upgrade process from 7.1.0 to any future Server version.
It is possible there will be other pairs of Server versions affected by variants of this same bug in future. This is why I'm strongly considering disallowing node upgrade on Windows entirely. That's far outside the scope of this ticket though.
Do we have QE sign-off on this interim solution, or is it yet to be validated?
You would need to talk to QE about that. My local testing has only been with trivial one-node configurations.
Details
Assignee: Gary Gray
Reporter: Arunkumar Senthilnathan (Deactivated)
Is this a Regression?: Yes
Triage: Untriaged
Operating System: Windows 64-bit
Story Points: 1
Priority: Critical

Found this issue while verifying toy build from
1. Install 7.0.4-7274 on a single Windows 2019 node
2. Set up the cluster and install travel-sample
3. Upgrade the cluster to 7.1.0 release version
Observe that cbas keeps crashing continuously with error:
Service 'cbas' exited with status 1. Restarting. Messages:
2022-05-25T17:05:20.024-07:00 INFO CBAS.cbas uuid: 88f1c109b55511914b3dc6b24fd61eeb
2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas IPv4 addresses: [172.23.136.107 127.0.0.1]
2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas IPv6 addresses: [fe80::f56b:fdf5:1076:31ed%Ethernet 5 ::1%Loopback Pseudo-Interface 1]
2022-05-25T17:05:20.032-07:00 INFO CBAS.cbas setting ipv4 required, ipv6 optional
2022-05-25T17:05:20.034-07:00 INFO CBAS.cbas using previously configured io devices (1)
2022-05-25T17:05:20.034-07:00 INFO CBAS.cbas setting java.home to bundled jre (c:/Program Files/Couchbase/Server/lib/cbas\runtime)
2022-05-25T17:05:20.034-07:00 ERRO CBAS.cbas cbas process aborting with exit code 1 due to error validating java home: unable to determine java version: exec: "c:/Program Files/Couchbase/Server/lib/cbas\\runtime\\bin\\java.exe": file does not exist
2022-05-25T17:05:20.034-07:00 FATA CBAS.cbas error validating java home: unable to determine java version: exec: "c:/Program Files/Couchbase/Server/lib/cbas\\runtime\\bin\\java.exe": file does not exist
We can see that java.exe is missing in the specified location
Issue is not observed in 7.0.4-7265 (RC1)
Issue is not observed in Linux and deb packages (verified with 7.0.4-7278)
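A quick way to confirm the symptom on an upgraded node is to check whether the bundled `java.exe` is present. The following is a minimal sketch, assuming the install root and relative path shown in the cbas log above; the `EXPECTED_FILES` list is hypothetical and contains only the one file this ticket names, although other files may also be missing.

```python
from pathlib import Path
from typing import Iterable, List

# Install root and java.exe location as reported in the cbas log;
# adjust if Server was installed elsewhere (assumption).
DEFAULT_ROOT = Path("C:/Program Files/Couchbase/Server")

# Hypothetical checklist: only java.exe is confirmed missing in this ticket.
EXPECTED_FILES = ["lib/cbas/runtime/bin/java.exe"]


def missing_files(root: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that do not exist under root."""
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    absent = missing_files(DEFAULT_ROOT, EXPECTED_FILES)
    if absent:
        print("Missing files (installation may need a Repair):")
        for rel in absent:
            print("  " + rel)
    else:
        print("All checked files are present.")
```

A non-empty result after a 7.0.4 -> 7.1.0 upgrade is consistent with the failure described here; an empty result only means the listed files exist, not that the installation is otherwise complete.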