From the ns_server.http_access.log we see the requests that are getting the server error. They are likely timeouts as they're all ~5000 msecs in processing time.
During this time the ns_server.debug.log shows we're trying to connect to memcached.
[ns_server:warn,2022-01-10T10:41:16.040-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:ns_memcached:connect:1221]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
[ns_server:debug,2022-01-10T10:41:16.040-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:memcached_refresh:handle_info:93]Refresh of [isasl,rbac] failed. Retry in 1000 ms.
[ns_server:warn,2022-01-10T10:41:16.197-08:00,ns_1@cb.local:<0.8304.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
[ns_server:warn,2022-01-10T10:41:16.535-08:00,ns_1@cb.local:<0.8313.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
[user:info,2022-01-10T10:41:16.893-08:00,ns_1@cb.local:<0.4680.0>:ns_log:consume_log:76]Service 'memcached' exited with status -1073741515. Restarting. Messages:
[ns_server:info,2022-01-10T10:41:16.893-08:00,ns_1@cb.local:ns_log<0.4673.0>:ns_log:is_duplicate_log:156]suppressing duplicate log ns_log:0([<<"Service 'memcached' exited with status -1073741515. Restarting. Messages:\n">>]) because it's been seen 4 times in the past 36.118 secs (last seen 9.046 secs ago
[ns_server:debug,2022-01-10T10:41:16.894-08:00,ns_1@cb.local:<0.8310.0>:remote_monitors:monitor_loop:122]Monitored remote process <15888.406.0> went down with: {abnormal,-1073741515}
[ns_server:warn,2022-01-10T10:41:17.044-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:ns_memcached:connect:1221]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
[ns_server:debug,2022-01-10T10:41:17.044-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:memcached_refresh:handle_info:93]Refresh of [isasl,rbac] failed. Retry in 1000 ms.
[ns_server:warn,2022-01-10T10:41:17.990-08:00,ns_1@cb.local:<0.8300.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
[ns_server:warn,2022-01-10T10:41:17.991-08:00,ns_1@cb.local:<0.8296.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
[ns_server:warn,2022-01-10T10:41:18.046-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:ns_memcached:connect:1221]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
[ns_server:debug,2022-01-10T10:41:18.046-08:00,ns_1@cb.local:memcached_refresh<0.4561.0>:memcached_refresh:handle_info:93]Refresh of [isasl,rbac] failed. Retry in 1000 ms.
[ns_server:warn,2022-01-10T10:41:18.194-08:00,ns_1@cb.local:<0.8304.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
[ns_server:warn,2022-01-10T10:41:18.540-08:00,ns_1@cb.local:<0.8313.0>:ns_memcached:connect:1224]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}, retrying.
VCRUNTIME140_1.dll is a Dynamic Link Library[1] file that is used by several applications to launch specific commands or run a particular code. The file is not a part of the Windows operating system but is generally needed by third-party applications to run smoothly.
0xc0000135 is a simple "DLL not found" error. Since it is MSVCP140.dll that keeled over, very high odds that the machine is missing the Universal CRT. It is deployed by Windows Update.
Hi Bo-Chun Wang, Could you check you have VCRUNTIME140D.dll installed? (and if so what version you have)
Daniel Owen
added a comment - Looking at the Exception code - 0xc0000135 - Its means "The application failed to initialize properly"
The faulting module is VCRUNTIME140D.dll - https://ugetfix.com/ask/how-to-fix-vcruntime140_1-dll-missing-or-not-found-error-in-windows/
VCRUNTIME140_1.dll is a Dynamic Link Library [1] file that is used by several applications to launch specific commands or run a particular code. The file is not a part of the Windows operating system but is generally needed by third-party applications to run smoothly.
Found this - https://stackoverflow.com/questions/39904402/c-windows-service-and-msvcp140-dll
0xc0000135 is a simple "DLL not found" error. Since it is MSVCP140.dll that keeled over, very high odds that the machine is missing the Universal CRT. It is deployed by Windows Update.
Looks like solution is
install Visual C++ redistributable libraries. - https://docs.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170
Hi Bo-Chun Wang , Could you check you have VCRUNTIME140D.dll installed? (and if so what version you have)
That hasn't been supported for Server since... 5.x?
Wayne mentioned that the perf clusters (possibly specifically the Windows boxes) were updated / recreated recently due to IT initiatives to apply OS updates.
If they were re-created fresh, then they won't have the Microsoft Universal C Runtime installed since that wasn't included in Windows Server 2012. VCRUNTIME140D.dll almost certainly depends on the UCRT being installed. (FYI the error message in the bug implies that VCRUNTIME140D.dll itself isn't missing, but it's failing to load, probably because one of its dependencies is missing.)
Reassigning to the issue to performance.
Daniel Owen
added a comment - The machine is running
Microsoft Windows Server 2012 R2 Standard
OS Version: 6.3.9600 N/A Build 9600
On speaking to Chris Hillery
That hasn't been supported for Server since... 5.x?
Wayne mentioned that the perf clusters (possibly specifically the Windows boxes) were updated / recreated recently due to IT initiatives to apply OS updates.
If they were re-created fresh, then they won't have the Microsoft Universal C Runtime installed since that wasn't included in Windows Server 2012. VCRUNTIME140D.dll almost certainly depends on the UCRT being installed. (FYI the error message in the bug implies that VCRUNTIME140D.dll itself isn't missing, but it's failing to load, probably because one of its dependencies is missing.)
Reassigning to the issue to performance.
I am able to do a run with an old build (7.1.0-1885). Therefore, it's not clear to me why we hit the issue on newer builds. I assume these runs are using the same VCRUNTIME140D.dll.
Do you have any ideas? I am going to do bisection to find the root cause.
Bo-Chun Wang
added a comment - - edited I am able to do a run with an old build (7.1.0-1885). Therefore, it's not clear to me why we hit the issue on newer builds. I assume these runs are using the same VCRUNTIME140D.dll.
http://perf.jenkins.couchbase.com/job/zeus/8167/
I checked the history and found the issue started between 7.1.0-1885 and 7.1.0-1960. Therefore, I think it may not be related to the cluster.
Build
Job
Success
6.6.4-9961
http://perf.jenkins.couchbase.com/job/zeus/8039/
Y
6.6.5-10068
http://perf.jenkins.couchbase.com/job/zeus/8055/
Y
7.0.3-7031
http://perf.jenkins.couchbase.com/job/zeus/8030/
Y
7.1.0-1885
http://perf.jenkins.couchbase.com/job/zeus/7902/
Y
7.1.0-1920
http://perf.jenkins.couchbase.com/job/zeus/8168/
Y
7.1.0-1925
http://perf.jenkins.couchbase.com/job/zeus/8170/
Y
7.1.0-1926
http://perf.jenkins.couchbase.com/job/zeus/8171/
Y
7.1.0-1931
http://perf.jenkins.couchbase.com/job/zeus/8172/
Y
7.1.0-1932
http://perf.jenkins.couchbase.com/job/zeus/8174/
Y
7.1.0-1934
http://perf.jenkins.couchbase.com/job/zeus/8175/
Y
7.1.0-1935
http://perf.jenkins.couchbase.com/job/zeus/8176/
N
7.1.0-1936
http://perf.jenkins.couchbase.com/job/zeus/8173/
N
7.1.0-1940
http://perf.jenkins.couchbase.com/job/zeus/8169/
N
7.1.0-1960
http://perf.jenkins.couchbase.com/job/zeus/8052/
N
7.1.0-1985
http://perf.jenkins.couchbase.com/job/zeus/8109/
N
7.1.0-2021
http://perf.jenkins.couchbase.com/job/zeus/8157/
N
Daniel Owen Chris Hillery
Do you have any ideas? I am going to do bisection to find the root cause.
Thanks for performing the bisection Bo-Chun Wang - I don't know why it would suddenly appear in a build (perhaps we are making use of new API call...)
Will add couchbase-bucket component back-in until we determine the cause of the issue.
Daniel Owen
added a comment - Thanks for performing the bisection Bo-Chun Wang - I don't know why it would suddenly appear in a build (perhaps we are making use of new API call...)
Will add couchbase-bucket component back-in until we determine the cause of the issue.
Daniel Owen
added a comment - I'm wondering if it is a platform change - http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1920&toVersion=7.1.0&toBuild=1960&f_asterixdb=off&f_backup=off&f_build=off&f_cbas-core=off&f_cbas-ui=off&f_cbauth=off&f_cbbs=off&f_cbft=off&f_chronicle=off&f_eventing=off&f_eventing-ee=off&f_goxdcr=off&f_indexing=off&f_kv_engine=off&f_magma=off&f_ns_server=off&f_plasma=off&f_platform=on&f_product-metadata=off&f_query=off&f_query-ee=off&f_query-ui=off&f_sigar=off&f_testrunner=off&f_tlm=off&f_cbftx=off&f_cbgt=off&f_couchbase-cli=off&f_couchstore=off&f_gometa=off&f_n1fty=off
Therefore might be good to try 1925 and 1926
Bo-Chun Wang Did you actually re-run all those performance runs today, or are you just looking at the historical run data?
Chris Hillery
added a comment - Bo-Chun Wang Did you actually re-run all those performance runs today, or are you just looking at the historical run data?
I re-ran some of them today, including 7.1.0-1885, 7.1.0-1920, and 7.1.0-1940. Based on the results, the issue starts between 1920 and 1940. I am going to do run with 1925 and 1926 as Daniel suggested.
Bo-Chun Wang
added a comment - I re-ran some of them today, including 7.1.0-1885, 7.1.0-1920, and 7.1.0-1940. Based on the results, the issue starts between 1920 and 1940. I am going to do run with 1925 and 1926 as Daniel suggested.
I finished the bisection. The root cause is build 7.1.0-1935. There are 4 eventing changes and 1 magma change in 1935. I can't figure out why the changes can cause the issue.
Bo-Chun Wang
added a comment - I finished the bisection. The root cause is build 7.1.0-1935. There are 4 eventing changes and 1 magma change in 1935. I can't figure out why the changes can cause the issue.
http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1934&toVersion=7.1.0&toBuild=1935&f_eventing=on&f_magma=on&f_product-metadata=on
Daniel Owen Could you please take a look?
+Sarath Lakshman who made the magma change. Since the problem is with the KV executable and looks to be a problem associated with a missing DLL and the magma change looks to be adding a new dependency on zstandard into the KV binary, this seems to be almost certainly where the problem is.
Dave Finlay
added a comment - - edited + Sarath Lakshman who made the magma change. Since the problem is with the KV executable and looks to be a problem associated with a missing DLL and the magma change looks to be adding a new dependency on zstandard into the KV binary, this seems to be almost certainly where the problem is.
Argh. Ok, I did think this was a problem last night, but Googling didn't bear me out. But now I've managed to find configuration: vcruntime140d.dll is the debug library. So it seems like we built the zstd-cpp Windows package in debug mode, and hence have a dependency on the debug version of the library, which isn't part of the MSVC redistributable package. We'll need to rebuild it. Cc Blair Watt
Bo-Chun Wang Daniel did ask if you could confirm that vcruntime140d.dll was on the VM in question - did you do so? I'm guessing that it's NOT actually there.
--------
Regardless of all of that, you really do need to start running the perf tests on a supported platform. For Neo that's Windows Server 2019.
Chris Hillery
added a comment - Argh. Ok, I did think this was a problem last night, but Googling didn't bear me out. But now I've managed to find configuration: vcruntime140 d .dll is the debug library. So it seems like we built the zstd-cpp Windows package in debug mode, and hence have a dependency on the debug version of the library, which isn't part of the MSVC redistributable package. We'll need to rebuild it. Cc Blair Watt
Bo-Chun Wang Daniel did ask if you could confirm that vcruntime140d.dll was on the VM in question - did you do so? I'm guessing that it's NOT actually there.
--------
Regardless of all of that, you really do need to start running the perf tests on a supported platform. For Neo that's Windows Server 2019.
We use the same cluster for 6.x and 7.x builds. Does Windows Server 2019 support both? The cluster is running Windows Server 2012.
OS Name: Microsoft Windows Server 2012 R2 Standard
OS Version: 6.3.9600 N/A Build 9600
Bo-Chun Wang
added a comment - There is no vcruntime140d.dll on the VM.
Administrator@zeus01-s526 /cygdrive/c/Windows/System32
$ ls vcruntime140
vcruntime140.dll vcruntime140_1.dll
We use the same cluster for 6.x and 7.x builds. Does Windows Server 2019 support both? The cluster is running Windows Server 2012.
OS Name: Microsoft Windows Server 2012 R2 Standard
OS Version: 6.3.9600 N/A Build 9600
Ok, I think that confirms that we need to rebuild zstd-cpp, so I'm moving this over to Build.
As for 6.x/7.x platform support, 7.1.0 is only officially supported on Windows Server 2019. 6.6.0 and later are also supported Windows Server 2019. I'm not sure about 6.5.0 - the Platform Support Matrix says Windows Server 2019 is "New - To Be Supported" for 6.5.0, which to me doesn't parse cleanly either way. If you still need to run perf on 6.0.x, though, Windows Server 2019 isn't supported, so ideally at least that should be a different cluster.
Chris Hillery
added a comment - Ok, I think that confirms that we need to rebuild zstd-cpp, so I'm moving this over to Build.
As for 6.x/7.x platform support, 7.1.0 is only officially supported on Windows Server 2019. 6.6.0 and later are also supported Windows Server 2019. I'm not sure about 6.5.0 - the Platform Support Matrix says Windows Server 2019 is "New - To Be Supported" for 6.5.0, which to me doesn't parse cleanly either way. If you still need to run perf on 6.0.x, though, Windows Server 2019 isn't supported, so ideally at least that should be a different cluster.
Build cbdeps::zstd-cpp-1.5.0-2 contains build-tools commit c2df20b with commit message: MB-50325: Produce release builds of zstd-cpp
Couchbase Build Team
added a comment - Build cbdeps::zstd-cpp-1.5.0-2 contains build-tools commit c2df20b with commit message:
MB-50325 : Produce release builds of zstd-cpp
Build cbdeps::erlang-neo-5 contains build-tools commit c2df20b with commit message: MB-50325: Produce release builds of zstd-cpp
Couchbase Build Team
added a comment - Build cbdeps::erlang-neo-5 contains build-tools commit c2df20b with commit message:
MB-50325 : Produce release builds of zstd-cpp
Build cbdeps::curl-7.78.0-6 contains build-tools commit c2df20b with commit message: MB-50325: Produce release builds of zstd-cpp
Couchbase Build Team
added a comment - Build cbdeps::curl-7.78.0-6 contains build-tools commit c2df20b with commit message:
MB-50325 : Produce release builds of zstd-cpp
Build cbdeps::prometheus-2.23-4 contains build-tools commit c2df20b with commit message: MB-50325: Produce release builds of zstd-cpp
Couchbase Build Team
added a comment - Build cbdeps::prometheus-2.23-4 contains build-tools commit c2df20b with commit message:
MB-50325 : Produce release builds of zstd-cpp
Steve - would you mind looking into this one?