You have been able to reproduce the issue in earlier comments
No. I was able to reproduce the most recent symptom once the damage had already been done. I don't know how to reproduce any actual issue - i.e., I couldn't start with a clean VM (or even a dirty one), execute a series of steps, and get it into the state I found Lilei's VM in. I did try a number of variations, both on my own VM and on Lilei's, but never saw anything like it.
and Deepika too has been able to reproduce the issue multiple time
Also not true. The situation Deepika originally opened this ticket for could not have been the same as the one Lilei found - at least, not the same as the symptom I was able to see. The symptom I saw required a code change which hadn't even been written at the time Deepika originally opened this ticket.
You seem to be assuming that every situation which could cause the installer or uninstaller to roll back is the same issue. That's nowhere near true. A rollback is like a core dump - it's the outwardly-visible final result of a bug. It is not itself the bug. Every core dump or rollback you see could very easily have completely different causes.
and the VM is available for you to review the state of the installation
FYI, no it's not. As I said, after reproducing the symptom (not issue), I took steps to verify that the problem was what I thought it was, which involved successfully uninstalling Server. Since we don't have reproduction instructions, there's no way to restore the bad state. There's no relevant information left on that VM; you may as well start using it for more testing again (or, better yet, wipe it and create a fresh one).
Since QE is able to get into this state, there are 100% chances that customers will be able to get to this state.
Again, not really. As I said, QE is frequently installing software with known bugs. They're also un- and re-installing lots of different versions far more frequently than any customer would. And they're hitting issues which could only exist between two different non-released builds. They're inadvertently testing situations that would never occur on a customer deployment.
There's no value - there's negative value - in QE testing scenarios that customers would not and could not experience. And it's not just about me wasting time chasing phantom bugs. Consider this: whatever Lilei did led to a garbage install with a mix of binaries from different Server versions. Clearly all the testing they did on that install is meaningless and must be discarded. But if the uninstall process hadn't happened to fail, they wouldn't have even known that they were testing garbage. Because QE frequently re-uses distressed VMs for numerous tests, it is extremely likely QE is sometimes testing things they don't expect. That can lead to spurious bug reports, but it could also lead to tests succeeding that should have failed.
That's why my recommendation is to make it policy to do Windows testing on freshly-created VMs. (This kind of thing can happen on any OS, but history clearly shows it's much more prevalent on Windows.) Without that, I assert that you cannot have real confidence in the testing you do, even the tests that pass.
If the installation is fragile, then it needs to be better and have better resiliency.
I agree. Unfortunately, a significant fraction of the time the bugs are in Windows Installer itself, not in anything that we can control. If Microsoft made a more robust and predictable framework, we could have more resilience. That's not what we have, though. In fact, my experience is that almost every change we make to the installer brings a high chance of fixing one issue and breaking something else that we may or may not discover right away.
Are there bugs in our MSIs? Without a doubt. Are they ones we can in any way control or work around? In my experience, that's about 50/50. Is it worth the effort it would take to dig into every flaky installation experience; attempt to reproduce it using strange combinations of internal builds; figure out the best way to avoid the problem; and risk destabilizing other parts of the installer to put in the change? Categorically no. That's why my first response is always to ask QE to try again on a fresh VM. If QE finds a bug in a customer-appropriate situation that can be reproduced starting from a clean VM, THEN it's worth the time, energy, and inherent risk of fixing.
One possible bit of hope: the vast majority of the Windows install/upgrade/uninstall bugs I've seen and been able to fix have had to do with Python in one way or another - either with the installation of the Python interpreter itself, or with some of the utility functions we wrote in Python that the installer calls out to. Starting with 7.1.0-1318, that situation is much improved, because we no longer "install" Python as part of our installer; we simply unpack the files onto disk, the same as Java, Erlang, and everything else. That greatly simplified our MSI and removed several entire classes of potential problems. It does still leave the installer utility functions. In Microsoft's ideal world, those would be written in C# and actually linked to the installer, but I have no idea how to do that. If you know of any developers who understand C# and would like to at least explore what that would mean, I'd be happy to work with them on some experiments.