MB-48367 (Couchbase Server)

Uninstall failure on Windows


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • 7.0.2
    • 7.1.0
    • installer
    • Untriaged
    • Windows 64-bit
    • 1
    • Unknown

    Description

      Tested on 7.0.2 build 6645.

      I am not sure of the exact steps to reproduce it.

      I performed some testing on this build with Enforce TLS enabled.

      I then stopped Couchbase Server and uninstalled it from Control Panel.

      The uninstall started rolling back, and Couchbase Server was not uninstalled.

      I tried to repair the installation, but even after the repair I was not able to uninstall it.

      To recreate the issue, I installed a fresh copy of build 6653 with Enforce TLS enabled and then uninstalled it; it uninstalled without issues.
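The report does not say which step made the uninstall roll back. Windows Installer can record that in a verbose log, for example by running the uninstall as `msiexec /x <package.msi or ProductCode> /l*v uninstall.log`; in such a log the failing action is the one reported as ending with "Return value 3" (1 means success, 2 means the user cancelled). The following is a minimal, hypothetical helper (the script and function names are illustrative, not part of any Couchbase tooling) that lists those actions. It assumes the usual "Action ended <time>: <Action>. Return value <n>." line format. Because parent actions such as INSTALL also end with 3 after a failure, the first match is usually the most informative.

```python
import re
import sys
from typing import Iterable, List

# Windows Installer verbose logs record the end of each action like:
#   "Action ended 12:34:56: InstallFinalize. Return value 3."
# Return value 1 means success, 2 means the user cancelled, 3 means failure.
# The exact line format is an assumption; adjust the pattern if yours differs.
ACTION_ENDED = re.compile(
    r"Action ended \d{1,2}:\d{2}:\d{2}: (?P<action>[^.]+)\. Return value (?P<code>\d+)\."
)


def failed_actions(lines: Iterable[str]) -> List[str]:
    """Return the names of actions that ended with Return value 3, in log order."""
    failures = []
    for line in lines:
        match = ACTION_ENDED.search(line)
        if match and match.group("code") == "3":
            failures.append(match.group("action"))
    return failures


def read_msi_log(path: str) -> List[str]:
    """Read a verbose MSI log, decoding UTF-16 when a byte-order mark is present."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        text = raw.decode("utf-16")  # the BOM tells the codec the byte order
    else:
        text = raw.decode("utf-8", errors="replace")
    return text.splitlines()


if __name__ == "__main__":
    # Hypothetical usage: python msi_failures.py uninstall.log
    for action in failed_actions(read_msi_log(sys.argv[1])):
        print("Failed action:", action)
```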

       

       


          Activity

            Chris Hillery added a comment:

            > You have been able to reproduce the issue in earlier comments

            No. I was able to reproduce the most recent symptom once the damage had already been done. I don't know how to reproduce any actual issue; that is, I couldn't start with a clean VM (or even a dirty one), execute a series of steps, and get it into the state I found Lilei's VM in. I did try a number of variations, both on my own VM and on Lilei's, but never saw anything like it.

            > and Deepika too has been able to reproduce the issue multiple times

            Also not true. The situation Deepika originally opened this ticket for could not have been the same as the one Lilei found, at least not the same as the symptom I was able to see. The symptom I saw required a code change that hadn't even been written when Deepika originally opened this ticket.

            You seem to be assuming that every situation that could cause the installer or uninstaller to roll back is the same issue. That's nowhere near true. A rollback is like a core dump: it's the outwardly visible final result of a bug, not the bug itself. Every core dump or rollback you see could easily have a completely different cause.

            > and the VM is available for you to review the state of the installation

            FYI, no, it's not. As I said, after reproducing the symptom (not the issue), I took steps to verify that the problem was what I thought it was, which involved successfully uninstalling Server. Since we don't have reproduction instructions, there's no way to restore the bad state. There's no relevant information left on that VM; you may as well start using it for more testing again (or, better yet, wipe it and create a fresh one).

            > Since QE is able to get into this state, there are 100% chances that customers will be able to get to this state.

            Again, not really. As I said, QE frequently installs software with known bugs. QE also uninstalls and reinstalls many different versions far more often than any customer would, and hits issues that could only exist between two different unreleased builds. They're inadvertently testing situations that would never occur on a customer deployment.

            There's no value, in fact there's negative value, in QE testing scenarios that customers would not and could not experience. And it's not just about me wasting time chasing phantom bugs. Consider this: whatever Lilei did led to a garbage install with a mix of binaries from different Server versions. Clearly all the testing done on that install is meaningless and must be discarded. But if the uninstall hadn't happened to fail, nobody would even have known they were testing garbage. Because QE frequently reuses distressed VMs for many tests, it is extremely likely that QE is sometimes testing things it doesn't expect. That can lead to spurious bug reports, but it could also lead to tests passing that should have failed.

            That's why I recommend making it policy to do Windows testing on freshly created VMs. (This kind of thing can happen on any OS, but history clearly shows it is much more prevalent on Windows.) Without that, I assert that you cannot have real confidence in the testing you do, even in the tests that pass.

            > If the installation is fragile, then it needs to be better and have better resiliency.

            I agree. Unfortunately, a significant fraction of the time the bugs are in Windows Installer, not in anything we can control. If Microsoft made a more robust and predictable framework, we could have more resilience. That's not what we have, though. In fact, in my experience almost every change we make to the installer has a high chance of fixing one issue and breaking something else, which we may or may not discover right away.

            Are there bugs in our MSIs? Without a doubt. Are they ones we can control or work around at all? In my experience, that's about 50/50. Is it worth the effort it would take to dig into every flaky installation experience, attempt to reproduce it using strange combinations of internal builds, figure out the best way to avoid the problem, and risk destabilizing other parts of the installer to put in the change? Categorically no. That's why my first response is always to ask QE to try again on a fresh VM. If QE finds a bug in a customer-appropriate situation that can be reproduced starting from a clean VM, THEN it's worth the time, energy, and inherent risk of fixing.

            ---------

            One possible bit of hope: the vast majority of the Windows install/upgrade/uninstall bugs I've seen and been able to fix have involved Python in one way or another, either the installation of the Python interpreter itself or some of the utility functions we wrote in Python that the installer calls out to. Starting with 7.1.0-1318 that situation is much improved, because we no longer "install" Python as part of our installer; we simply unpack the files onto disk, the same as Java, Erlang, and everything else. That greatly simplified our MSI and removed several entire classes of potential problems. That still leaves the installer utility functions. In Microsoft's ideal world, those would be written in C# and actually linked to the installer, but I have no idea how to do that. If you know of any developers who understand C# and would like to at least explore what that would mean, I'd be happy to work with them on some experiments.

            Lilei Chen added a comment:

            I think my uninstall/install failed because I hadn't closed some of the log files I was reading. If I start from a clean install and try to uninstall, the uninstall works as long as I close everything and stop the service completely. It would be nice if there were an error message explaining the reason for the failure.

            If I have some of the files open and try to install a newer version, the system can get into a state where it can neither uninstall nor upgrade.
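Lilei's finding suggests a pre-flight check before uninstalling or upgrading: after stopping the service, confirm that nothing under the install directory is still held open. This is a hypothetical sketch (the function and script names are illustrative), not something the Couchbase installer does. It relies on a Windows behavior: renaming a file requires delete access, which is refused while another process has the file open without FILE_SHARE_DELETE sharing. Not every open handle blocks a rename (running executables, for instance, can typically still be renamed), so an empty result is not a guarantee, and on Linux or macOS the probe never reports anything.

```python
import os
import sys
import uuid
from typing import List


def probe_locked_files(root: str) -> List[str]:
    """Return paths under root that could not be renamed, i.e. are probably held open.

    Each file is renamed to a unique temporary sibling name and immediately
    renamed back. On Windows the first rename raises PermissionError when
    another process holds the file open without FILE_SHARE_DELETE, and also
    when the caller simply lacks permission, so run it elevated for
    Program Files.
    """
    locked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            probe = f"{path}.{uuid.uuid4().hex}.probe"  # unique, so nothing is overwritten
            try:
                os.rename(path, probe)
            except PermissionError:
                locked.append(path)
                continue
            os.rename(probe, path)  # restore the original name right away
    return locked


if __name__ == "__main__":
    # Hypothetical usage: python probe_locks.py "C:\Program Files\Couchbase\Server"
    held = probe_locked_files(sys.argv[1])
    for path in held:
        print("Possibly in use:", path)
    sys.exit(1 if held else 0)
```

A non-zero exit status makes the check easy to drop into a test harness as a gate before the uninstall step.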


            Ritam Sharma added a comment:

            Lilei Chen, thanks. You are right that the uninstall would fail in this case. The second case is very interesting, though: an upgrade attempted with files open, leaving the system in the state you describe, is concerning.

            Lilei Chen added a comment:

            Ritam Sharma: In the second case, the upgrade will fail. But I think it might leave something behind, so that C:\Program Files\Couchbase\Server ends up with a mishmash of files from different Server versions, and I would then be unable to upgrade or uninstall.
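One way to confirm the kind of mishmash described here, under stated assumptions: compare the files actually present under the install directory with a manifest of the files that a clean install of the same build produces (for example, the `snapshot()` output from a fresh VM, saved one path per line). The manifest, its format, the function names, and the assumption that runtime data lives under `var/` are all hypothetical. Files reported as "unexpected" are only candidates for leftovers from another Server version, not proof.

```python
import os
import sys
from typing import Dict, Iterable, Set


def snapshot(root: str) -> Set[str]:
    """Return every file under root as a '/'-separated path relative to root."""
    files = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            files.add(rel.replace(os.sep, "/"))
    return files


def load_manifest(path: str) -> Set[str]:
    """Read a manifest with one relative path per line (hypothetical format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def compare_to_manifest(
    root: str, manifest: Set[str], ignore_prefixes: Iterable[str] = ()
) -> Dict[str, Set[str]]:
    """Compare an install tree with the manifest of a clean install of the same build."""
    prefixes = tuple(ignore_prefixes)  # str.startswith(()) is always False
    present = {p for p in snapshot(root) if not p.startswith(prefixes)}
    expected = {p for p in manifest if not p.startswith(prefixes)}
    return {
        "missing": expected - present,     # in the clean install, absent here
        "unexpected": present - expected,  # here, but not in the clean install
    }


if __name__ == "__main__":
    # Hypothetical usage: python check_install.py <install_dir> <manifest.txt>
    # "var/" is assumed to hold runtime data (logs, data files) and is skipped.
    result = compare_to_manifest(
        sys.argv[1], load_manifest(sys.argv[2]), ignore_prefixes=["var/"]
    )
    for kind in ("missing", "unexpected"):
        for rel in sorted(result[kind]):
            print(f"{kind}: {rel}")
```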

            Ritam Sharma added a comment:

            Closing: not able to reproduce the failure.


            People

              Assignee: Ritam Sharma
              Reporter: Deepika Verma (Inactive)
              Votes: 0
              Watchers: 7


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h

                  Gerrit Reviews

                    There are no open Gerrit changes
