Uploaded image for project: 'Couchbase Lite'
  1. Couchbase Lite
  2. CBL-911

Couchbase Lite Java replication hangs when using DEBUG console + file logging on Windows

    XMLWordPrintable

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 2.7.1
    • 3.0
    • LiteCore
    • Security Level: Public
    • None
    • Jim 21
    • 5

    Description

      Summary

      When using console-based logging alongside file-based logging at a low level (such as DEBUG) in a Couchbase Lite Java application on Windows, the replication eventually gets stuck due to waiting for a logging call that never returns.

      Steps to reproduce

      1. Create a simple Java app (source code attached as Main.java ) designed to synchronize a non-trivial number of documents (my tests reproduce after approximately 50000 documents) from a remote Sync Gateway, ideally with a good network to ensure network isn't the bottleneck.
      Ideally add some way to keep track of replication progress (via the replicator status).
      2. Enable console logging at debug-level and file-based logging:

              final File path = new File("C:\\Users\\Administrator\\IdeaProjects\\logging-hang\\logs");
              var logConfig = new LogFileConfiguration(path.toString());
              logConfig.setMaxRotateCount(100);
              logConfig.setMaxSize(5 * 1024);
              Database.log.getFile().setConfig(logConfig);
              Database.log.getFile().setLevel(LogLevel.DEBUG);
              Database.log.getConsole().setLevel(LogLevel.DEBUG);
              Database.log.getConsole().setDomains(EnumSet.of(LogDomain.ALL));
      


      3. Run the app on Windows.

      Expected Result

      The app synchronizes all of the documents as expected.

      Actual Result

      The app eventually stops logging from Couchbase Lite and the replication makes no further progress.

      Analysis

      I grabbed the thread dump via Jstack when the app got into this 'stuck' state (thread_dump.txt) and can see the following:

      "OkHttp http://172.31.8.21:4984/..." #19 prio=5 os_prio=0 cpu=4890.63ms elapsed=271.34s tid=0x0000018697f46800 nid=0x15e4 runnable  [0x00000074e14fe000]
         java.lang.Thread.State: RUNNABLE
              at com.couchbase.lite.internal.core.C4Log.log(Native Method)
              at com.couchbase.lite.FileLogger.log(FileLogger.java:55)
              at com.couchbase.lite.internal.support.Log.sendToLoggers(Log.java:415)
              at com.couchbase.lite.internal.support.Log.log(Log.java:394)
              at com.couchbase.lite.internal.support.Log.v(Log.java:146)
              at com.couchbase.lite.internal.replicator.AbstractCBLWebSocket$CBLWebSocketListener.onMessage(AbstractCBLWebSocket.java:167)
              at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:314)
              at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:224)
              at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
              at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
              at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
              at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
              at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@14.0.1/ThreadPoolExecutor.java:1130)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@14.0.1/ThreadPoolExecutor.java:630)
              at java.lang.Thread.run(java.base@14.0.1/Thread.java:832)
      

      As you can see in this, the reason that replication is making no further progress is because the thread responsible from consuming new messages from the websocket is stuck in Lite Core's logging method.
      I tried to get more detail about what Lite Core is doing at this stage but I guess it wasn't compiled with debug symbols as the stacktrace in windbg.exe was just addresses.

      Trying to understand the problem a bit better, I made a few adjustments and found that this does not occur in the following situations:

      • When you use VERBOSE level instead of DEBUG-level, implying an issue specifically with DEBUG level or just the volume of logging generated.
      • When you use only console-based or only file-based logging, implying there is a potential deadlock caused by having 2 outputs.
      • When the application is running on Linux, implying this is Windows-specific.

      Workaround

      Do not enable debug logging for console logging.

      Attachments

        1. LiteCore.pdb
          43.50 MB
        2. Main.java
          3 kB
        3. thread_dump.txt
          14 kB
        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          I'd have to go and setup all of the backend again, I nuked it because it seemed to just be related to dataset size and was reliably reproducible every time.

          It will probably take some time as I had it all setup in AWS, I'll get back to you when I get time to set it all up again.

          matt.carabine Matt Carabine added a comment - I'd have to go and setup all of the backend again, I nuked it because it seemed to just be related to dataset size and was reliably reproducible every time. It will probably take some time as I had it all setup in AWS, I'll get back to you when I get time to set it all up again.

          Did you definitely have both console logging and file-based logging enabled and set to debug level? From my testing it only triggered when both were enabled and set to debug level, possibly why there are two threads logging?

          matt.carabine Matt Carabine added a comment - Did you definitely have both console logging and file-based logging enabled and set to debug level? From my testing it only triggered when both were enabled and set to debug level, possibly why there are two threads logging?
          blake.meike Blake Meike added a comment -

          That really shouldn't be necessary.  If you can demonstrate that, please report it as a bug.  You shouldn't need to touch file-based logging at all, to get full console logging.  Just set

          Database.log.getConsole().setLevel(LogLevel.DEBUG)

          and, perhaps

          Database.log.getConsole().setDomains(LogDomain.ALL_DOMAINS)

          The CBL system uses a lot of threads.  Each may log, from time to time.  That is why there are two threads logging.

          blake.meike Blake Meike added a comment - That really shouldn't be necessary.  If you can demonstrate that, please report it as a bug.  You shouldn't need to touch file-based logging at all, to get full console logging.  Just set Database.log.getConsole().setLevel(LogLevel.DEBUG) and, perhaps Database.log.getConsole(). setDomains( LogDomain.ALL_DOMAINS) The CBL system uses a lot of threads.  Each may log, from time to time.  That is why there are two threads logging.
          blake.meike Blake Meike added a comment -

          Hey Matt Carabine, where are we on this?

          I have seen hangs like this but I haven't seen one in the 2.8 code, for quite a while. I'm hoping that the problem has been fixed. If you can reproduce it or, better yet, explain how we can reproduce it, we could determine whether or not the bug still exists. The thread dump from this ticket doesn't really show any deadlock. It is certainly possible that there is something going on in Core but, if it is there, we'll need to see it to fix it.

          Re-reading, I think I probably misunderstood your comment about verifying that both console and file logging were enabled. Even doing that, however, I have not seen a problem like this in a while.

          blake.meike Blake Meike added a comment - Hey Matt Carabine , where are we on this? I have seen hangs like this but I haven't seen one in the 2.8 code, for quite a while. I'm hoping that the problem has been fixed. If you can reproduce it or, better yet, explain how we can reproduce it, we could determine whether or not the bug still exists. The thread dump from this ticket doesn't really show any deadlock. It is certainly possible that there is something going on in Core but, if it is there, we'll need to see it to fix it. Re-reading, I think I probably misunderstood your comment about verifying that both console and file logging were enabled. Even doing that, however, I have not seen a problem like this in a while.
          blake.meike Blake Meike added a comment -

          I believe this was fixed in 2.8

          blake.meike Blake Meike added a comment - I believe this was fixed in 2.8

          People

            matt.carabine Matt Carabine
            matt.carabine Matt Carabine
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty