Details
-
Bug
-
Status: Closed
-
Test Blocker
-
Resolution: Fixed
-
7.1.0
-
Triaged
-
1
-
Yes
-
Build Team 2022 Sprint 1
Description
All windows performance runs failed on 7.1.0-2021. The runs were trying to configure RAM quota, and they failed after 20 attempts.
2022-01-10T10:41:26 [INFO] Configuring data RAM quota: 20480 MB
2022-01-10T10:41:31 [WARNING] ["Unexpected server error, request logged."]
2022-01-10T10:41:31 [WARNING] Retrying http://zeus-srv-01.perf.couchbase.com:8091/pools/default
...
2022-01-10T10:46:16 [WARNING] Retrying http://zeus-srv-01.perf.couchbase.com:8091/pools/default
2022-01-10T10:46:26 [ERROR] Request http://zeus-srv-01.perf.couchbase.com:8091/pools/default failed after 20 attempts
Build step 'Execute shell' marked build as failure
Attachments
- 172.23.96.25.zip
- 1.40 MB
Issue Links
- is duplicated by
-
MB-50354 [Windows] - Cluster init fails on windows
-
- Closed
-
Activity
From the ns_server.http_access.log we see the requests that are getting the server error. They are likely timeouts as they're all ~5000 msecs in processing time.
172.23.100.212 - - [10/Jan/2022:10:41:18 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:41:33 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:41:48 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:42:03 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:42:18 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:42:33 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:42:48 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:43:03 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:43:18 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5015
|
172.23.100.212 - - [10/Jan/2022:10:43:33 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:43:48 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5016
|
172.23.100.212 - - [10/Jan/2022:10:44:03 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5015
|
172.23.100.212 - - [10/Jan/2022:10:44:19 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:44:34 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:44:49 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:45:04 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:45:19 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:45:34 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:45:49 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
172.23.100.212 - - [10/Jan/2022:10:46:04 -0800] "POST /pools/default HTTP/1.1" 500 44 - "python-requests/2.24.0" 5000
|
During this time the ns_server.debug.log shows we're trying to connect to memcached.
[ns_server:warn,2022-01-10T10:41:16.040-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:ns_memcached:connect:1221]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
|
[ns_server:debug,2022-01-10T10:41:16.040-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:memcached_refresh:handle_info:93]Refresh of [isasl,rbac] failed. Retry in 1000 ms.
|
[ns_server:warn,2022-01-10T10:41:16.197-08:00,ns_1@cb.local:<0.8304.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
|
[ns_server:warn,2022-01-10T10:41:16.535-08:00,ns_1@cb.local:<0.8313.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
|
[user:info,2022-01-10T10:41:16.893-08:00,ns_1@cb.local:<0.4680.0>:ns_log:consume_log:76]Service 'memcached' exited with status -1073741515. Restarting. Messages:
|
|
[ns_server:info,2022-01-10T10:41:16.893-08:00,ns_1@cb.local:ns_log<0.4673.0>:ns_log:is_duplicate_log:156]suppressing duplicate log ns_log:0([<<"Service 'memcached' exited with status -1073741515. Restarting. Messages:\n">>]) because it's been seen 4 times in the past 36.118 secs (last seen 9.046 secs ago
|
[ns_server:debug,2022-01-10T10:41:16.894-08:00,ns_1@cb.local:<0.8310.0>:remote_monitors:monitor_loop:122]Monitored remote process <15888.406.0> went down with: {abnormal,-1073741515}
|
[ns_server:warn,2022-01-10T10:41:17.044-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:ns_memcached:connect:1221]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
|
[ns_server:debug,2022-01-10T10:41:17.044-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:memcached_refresh:handle_info:93]Refresh of [isasl,rbac] failed. Retry in 1000 ms.
|
[ns_server:warn,2022-01-10T10:41:17.990-08:00,ns_1@cb.local:<0.8300.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
|
[ns_server:warn,2022-01-10T10:41:17.991-08:00,ns_1@cb.local:<0.8296.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
|
[ns_server:warn,2022-01-10T10:41:18.046-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:ns_memcached:connect:1221]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
|
[ns_server:debug,2022-01-10T10:41:18.046-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:memcached_refresh:handle_info:93]Refresh of [isasl,rbac] failed. Retry in 1000 ms.
|
[ns_server:warn,2022-01-10T10:41:18.194-08:00,ns_1@cb.local:<0.8304.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
|
[ns_server:warn,2022-01-10T10:41:18.540-08:00,ns_1@cb.local:<0.8313.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
|
There's nothing in the memcached.log
$ cat memcached.log
|
==============================================================================
|
Memcached logs
|
cd "C:\Program Files\Couchbase\Server\var\lib\couchbase\logs" && for /f %a IN ('dir memcached.log.* /od /tw /b') do type %a
|
==============================================================================
|
File Not Found
|
Looking at the couchbase.log there's a large number (8602) of faults like this
EventType=1^M^M
|
Logfile=Application^M^M
|
Message=Faulting application name: memcached.exe, version: 0.0.0.0, time stamp: 0x61d87cd9^M^M
|
Faulting module name: VCRUNTIME140D.dll, version: 6.3.9600.18696, time stamp: 0x59153753^M^M
|
Exception code: 0xc0000135^M^M
|
Fault offset: 0x00000000000ece60^M^M
|
Faulting process id: 0x9c4^M^M
|
Faulting application start time: 0x01d806549a97ab42^M^M
|
Faulting application path: c:\Program Files\Couchbase\Server\bin\memcached.exe^M^M
|
Faulting module path: VCRUNTIME140D.dll^M^M
|
Report Id: d84839ce-7247-11ec-80d1-001e678989dc^M^M
|
Faulting package full name: ^M^M
|
Faulting package-relative application ID: ^M^M
|
SourceName=Application Error^M^M
|
TimeGenerated=20220110190224.000000-000^M^M
|
So it appears memcached is faulting each time it starts up. Assigning the ticket to them for triage.
Looking at the Exception code - 0xc0000135 - Its means "The application failed to initialize properly"
The faulting module is VCRUNTIME140D.dll - https://ugetfix.com/ask/how-to-fix-vcruntime140_1-dll-missing-or-not-found-error-in-windows/
VCRUNTIME140_1.dll is a Dynamic Link Library[1] file that is used by several applications to launch specific commands or run a particular code. The file is not a part of the Windows operating system but is generally needed by third-party applications to run smoothly.
Found this - https://stackoverflow.com/questions/39904402/c-windows-service-and-msvcp140-dll
0xc0000135 is a simple "DLL not found" error. Since it is MSVCP140.dll that keeled over, very high odds that the machine is missing the Universal CRT. It is deployed by Windows Update.
Looks like solution is
install Visual C++ redistributable libraries. - https://docs.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170
Hi Bo-Chun Wang, Could you check you have VCRUNTIME140D.dll installed? (and if so what version you have)
The machine is running
Microsoft Windows Server 2012 R2 Standard
|
OS Version: 6.3.9600 N/A Build 9600
|
On speaking to Chris Hillery
That hasn't been supported for Server since... 5.x?
Wayne mentioned that the perf clusters (possibly specifically the Windows boxes) were updated / recreated recently due to IT initiatives to apply OS updates.
If they were re-created fresh, then they won't have the Microsoft Universal C Runtime installed since that wasn't included in Windows Server 2012. VCRUNTIME140D.dll almost certainly depends on the UCRT being installed. (FYI the error message in the bug implies that VCRUNTIME140D.dll itself isn't missing, but it's failing to load, probably because one of its dependencies is missing.)
Reassigning to the issue to performance.
I am able to do a run with an old build (7.1.0-1885). Therefore, it's not clear to me why we hit the issue on newer builds. I assume these runs are using the same VCRUNTIME140D.dll.
http://perf.jenkins.couchbase.com/job/zeus/8167/
I checked the history and found the issue started between 7.1.0-1885 and 7.1.0-1960. Therefore, I think it may not be related to the cluster.
Do you have any ideas? I am going to do bisection to find the root cause.
Thanks for performing the bisection Bo-Chun Wang - I don't know why it would suddenly appear in a build (perhaps we are making use of new API call...)
Will add couchbase-bucket component back-in until we determine the cause of the issue.
Bo-Chun Wang Did you actually re-run all those performance runs today, or are you just looking at the historical run data?
I re-ran some of them today, including 7.1.0-1885, 7.1.0-1920, and 7.1.0-1940. Based on the results, the issue starts between 1920 and 1940. I am going to do run with 1925 and 1926 as Daniel suggested.
I finished the bisection. The root cause is build 7.1.0-1935. There are 4 eventing changes and 1 magma change in 1935. I can't figure out why the changes can cause the issue.
Daniel Owen Could you please take a look?
+Sarath Lakshman who made the magma change. Since the problem is with the KV executable and looks to be a problem associated with a missing DLL and the magma change looks to be adding a new dependency on zstandard into the KV binary, this seems to be almost certainly where the problem is.
Argh. Ok, I did think this was a problem last night, but Googling didn't bear me out. But now I've managed to find configuration: vcruntime140d.dll is the debug library. So it seems like we built the zstd-cpp Windows package in debug mode, and hence have a dependency on the debug version of the library, which isn't part of the MSVC redistributable package. We'll need to rebuild it. Cc Blair Watt
Bo-Chun Wang Daniel did ask if you could confirm that vcruntime140d.dll was on the VM in question - did you do so? I'm guessing that it's NOT actually there.
--------
Regardless of all of that, you really do need to start running the perf tests on a supported platform. For Neo that's Windows Server 2019.
There is no vcruntime140d.dll on the VM.
Administrator@zeus01-s526 /cygdrive/c/Windows/System32
$ ls vcruntime140
vcruntime140.dll vcruntime140_1.dll
We use the same cluster for 6.x and 7.x builds. Does Windows Server 2019 support both? The cluster is running Windows Server 2012.
OS Name: Microsoft Windows Server 2012 R2 Standard
OS Version: 6.3.9600 N/A Build 9600
Ok, I think that confirms that we need to rebuild zstd-cpp, so I'm moving this over to Build.
As for 6.x/7.x platform support, 7.1.0 is only officially supported on Windows Server 2019. 6.6.0 and later are also supported Windows Server 2019. I'm not sure about 6.5.0 - the Platform Support Matrix says Windows Server 2019 is "New - To Be Supported" for 6.5.0, which to me doesn't parse cleanly either way. If you still need to run perf on 6.0.x, though, Windows Server 2019 isn't supported, so ideally at least that should be a different cluster.
Build cbdeps::zstd-cpp-1.5.0-2 contains build-tools commit c2df20b with commit message:
MB-50325: Produce release builds of zstd-cpp
Build couchbase-server-7.1.0-2055 contains tlm commit 8ec6b8a with commit message:
MB-50325: Zstd-cpp build 2 (non-debug/release build)
Thanks for the heads-up on this, build 2055 contains the updated package which imports the correct vcruntime dll:
C:\>"c:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\dumpbin.exe" /imports c:\temp\zstd-check\1.5.0-1\bin\zstd.dll | findstr /C:VCRUNTIME
|
VCRUNTIME140D.dll
|
C:\>"c:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\dumpbin.exe" /imports c:\temp\zstd-check\1.5.0-2\bin\zstd.dll | findstr /C:VCRUNTIME
|
VCRUNTIME140.dll
|
Build cbdeps::erlang-neo-5 contains build-tools commit c2df20b with commit message:
MB-50325: Produce release builds of zstd-cpp
Build cbdeps::curl-7.78.0-6 contains build-tools commit c2df20b with commit message:
MB-50325: Produce release builds of zstd-cpp
Build cbdeps::prometheus-2.23-4 contains build-tools commit c2df20b with commit message:
MB-50325: Produce release builds of zstd-cpp
Build couchbase-lite-core-3.1.0-205 contains tlm commit 8ec6b8a with commit message:
MB-50325: Zstd-cpp build 2 (non-debug/release build)
Steve - would you mind looking into this one?