Description
Summary
When using console-based logging alongside file-based logging at a low level (such as DEBUG) in a Couchbase Lite Java application on Windows, the replication eventually gets stuck due to waiting for a logging call that never returns.
Steps to reproduce
1. Create a simple Java app (source code attached as Main.java ) designed to synchronize a non-trivial number of documents (my tests reproduce after approximately 50000 documents) from a remote Sync Gateway, ideally with a good network to ensure network isn't the bottleneck.
Ideally add some way to keep track of replication progress (via the replicator status).
2. Enable console logging at debug-level and file-based logging:
final File path = new File("C:\\Users\\Administrator\\IdeaProjects\\logging-hang\\logs");
|
var logConfig = new LogFileConfiguration(path.toString());
|
logConfig.setMaxRotateCount(100);
|
logConfig.setMaxSize(5 * 1024);
|
Database.log.getFile().setConfig(logConfig);
|
Database.log.getFile().setLevel(LogLevel.DEBUG);
|
Database.log.getConsole().setLevel(LogLevel.DEBUG);
|
Database.log.getConsole().setDomains(EnumSet.of(LogDomain.ALL));
|
3. Run the app on Windows.
Expected Result
The app synchronizes all of the documents as expected.
Actual Result
The app eventually stops logging from Couchbase Lite and the replication makes no further progress.
Analysis
I grabbed the thread dump via Jstack when the app got into this 'stuck' state (thread_dump.txt) and can see the following:
"OkHttp http://172.31.8.21:4984/..." #19 prio=5 os_prio=0 cpu=4890.63ms elapsed=271.34s tid=0x0000018697f46800 nid=0x15e4 runnable [0x00000074e14fe000]
|
java.lang.Thread.State: RUNNABLE
|
at com.couchbase.lite.internal.core.C4Log.log(Native Method)
|
at com.couchbase.lite.FileLogger.log(FileLogger.java:55)
|
at com.couchbase.lite.internal.support.Log.sendToLoggers(Log.java:415)
|
at com.couchbase.lite.internal.support.Log.log(Log.java:394)
|
at com.couchbase.lite.internal.support.Log.v(Log.java:146)
|
at com.couchbase.lite.internal.replicator.AbstractCBLWebSocket$CBLWebSocketListener.onMessage(AbstractCBLWebSocket.java:167)
|
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:314)
|
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:224)
|
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
|
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
|
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
|
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
|
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.1/ThreadPoolExecutor.java:1130)
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.1/ThreadPoolExecutor.java:630)
|
at java.lang.Thread.run(java.base@14.0.1/Thread.java:832)
|
As you can see in this, the reason that replication is making no further progress is because the thread responsible from consuming new messages from the websocket is stuck in Lite Core's logging method.
I tried to get more detail about what Lite Core is doing at this stage but I guess it wasn't compiled with debug symbols as the stacktrace in windbg.exe was just addresses.
Trying to understand the problem a bit better, I made a few adjustments and found that this does not occur in the following situations:
- When you use VERBOSE level instead of DEBUG-level, implying an issue specifically with DEBUG level or just the volume of logging generated.
- When you use only console-based or only file-based logging, implying there is a potential deadlock caused by having 2 outputs.
- When the application is running on Linux, implying this is Windows-specific.
Workaround
Do not enable debug logging for console logging.