Details

    • Type: Bug
    • Status: Resolved
    • Priority: Test Blocker
    • Resolution: Fixed
    • Affects Version: 3.0.4
    • Fix Version: 3.0.6
    • Component: None
    • 1
    • Sprint: SDK36: SearchFeat/Txns/Docs, SDK38: Txnβ-ready/Docs, SDK 44: Txns, SDK 3.1, Docs

    Description

      Running the new collections N1QL tests requires the Python SDK 3; all N1QL collections tests run with 3.0.4. I was seeing some odd behavior: after the test had run for about 10 minutes, the client side would freeze up. Using top, I could see that the client processes' memory usage grew continually, and by 10 minutes the client machine was out of memory. I ran tracemalloc to see where the memory might be leaking, and it seems to be coming from here: https://github.com/couchbase/couchbase-python-client/blob/master/couchbase_core/n1ql.py#L529

      This is a sample output of tracemalloc after 100 queries were submitted:

      [2020-08-31 13:20:16,966: WARNING/Process-1:2] [ Top 20 differences ]
      [2020-08-31 13:20:16,966: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/n1ql.py:529: size=15.3 MiB (+193 KiB), count=153918 (+1900), average=104 B
      [2020-08-31 13:20:16,966: WARNING/Process-1:2] /usr/lib/python3.5/json/decoder.py:355: size=5703 KiB (+70.4 KiB), count=72901 (+900), average=80 B
      [2020-08-31 13:20:16,966: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/n1ql.py:417: size=1076 KiB (+13.3 KiB), count=8101 (+100), average=136 B
      [2020-08-31 13:20:16,966: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/__init__.py:262: size=1068 KiB (+13.3 KiB), count=16080 (+200), average=68 B
      [2020-08-31 13:20:16,966: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase/cluster.py:595: size=506 KiB (+6400 B), count=8100 (+100), average=64 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/__init__.py:292: size=505 KiB (+6400 B), count=8070 (+100), average=64 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase/n1ql.py:140: size=505 KiB (+6400 B), count=8066 (+100), average=64 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/__init__.py:258: size=502 KiB (+6400 B), count=8040 (+100), average=64 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase/cluster.py:178: size=419 KiB (+6208 B), count=6704 (+97), average=64 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/spring/reservoir.py:28: size=285 KiB (+3600 B), count=8099 (+100), average=36 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/spring/cbgen_helpers.py:92: size=188 KiB (+2376 B), count=8028 (+99), average=24 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/__init__.py:267: size=86.8 KiB (+192 B), count=1381 (+3), average=64 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /root/korry/perfrunner/spring/reservoir.py:31: size=67.5 KiB (+0 B), count=1 (+0), average=67.5 KiB
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /usr/lib/python3.5/tracemalloc.py:113: size=14.2 KiB (+0 B), count=165 (+0), average=88 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /usr/lib/python3.5/tracemalloc.py:123: size=13.2 KiB (+0 B), count=170 (+0), average=80 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /usr/lib/python3.5/tracemalloc.py:425: size=8408 B (+0 B), count=167 (+0), average=50 B
      [2020-08-31 13:20:16,967: WARNING/Process-1:2] /usr/lib/python3.5/multiprocessing/reduction.py:50: size=3556 B (+0 B), count=52 (+0), average=68 B
      [2020-08-31 13:20:16,968: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase/options.py:73: size=3344 B (+0 B), count=46 (+0), average=73 B
      [2020-08-31 13:20:16,968: WARNING/Process-1:2] /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase/options.py:38: size=2384 B (+0 B), count=31 (+0), average=77 B
      [2020-08-31 13:20:16,968: WARNING/Process-1:2] /usr/lib/python3.5/tracemalloc.py:349: size=1944 B (+0 B), count=26 (+0), average=75 B
      

      The top one is: 

      /root/korry/perfrunner/env/lib/python3.5/site-packages/couchbase_core/n1ql.py:529: size=15.3 MiB (+193 KiB), count=153918 (+1900), average=104 B

      You can see that it is adding around 193 KiB every 100 queries, or about 1.93 KiB per query. Our tests do upwards of 10k queries/sec, which works out to roughly 19 MB per second, or around 1 GB per minute. This causes our tests to quickly run out of memory on the client machines.
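      For reference, the report above is the kind of output produced by diffing tracemalloc snapshots around a batch of queries. The exact perfrunner instrumentation isn't shown in this ticket; the following is only a minimal sketch, with run_queries standing in for the real workload:

      import tracemalloc

      def run_queries(n):
          # Placeholder for the real workload: issue n N1QL queries through the
          # SDK and fully iterate each result.
          pass

      tracemalloc.start()
      baseline = tracemalloc.take_snapshot()
      run_queries(100)
      snapshot = tracemalloc.take_snapshot()

      print("[ Top 20 differences ]")
      for stat in snapshot.compare_to(baseline, 'lineno')[:20]:
          # Each entry prints as "<file>:<line>: size=... (+...), count=... (+...), average=..."
          print(stat)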

      Attachments

        1. 100run.txt
          326 kB
        2. 10krun.txt
          326 kB
        3. 304tracemalloc.txt
          22 kB
        4. fixtracemalloc.txt
          22 kB
        5. image-2020-09-09-21-39-02-834.png
          326 kB
        6. repro.py
          1 kB
        7. screenshot-1.png
          209 kB

        Activity

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          100run.txt

          10krun.txt

          I have uploaded the output of valgrind on a small repro script, repro.py. The two runs are with 100 queries and 10,000 queries. Here is the leak summary for both:

          10k

          ==16406== LEAK SUMMARY:
          ==16406== definitely lost: 48 bytes in 1 blocks
          ==16406== indirectly lost: 510 bytes in 12 blocks
          ==16406== possibly lost: 12,296 bytes in 22 blocks
          ==16406== still reachable: 21,087,700 bytes in 145,720 blocks
          ==16406== of which reachable via heuristic:
          ==16406== stdstring : 409,959 bytes in 9,999 blocks
          ==16406== suppressed: 0 bytes in 0 blocks

          100

          ==16872== LEAK SUMMARY:
          ==16872== definitely lost: 48 bytes in 1 blocks
          ==16872== indirectly lost: 510 bytes in 12 blocks
          ==16872== possibly lost: 19,712 bytes in 34 blocks
          ==16872== still reachable: 2,701,820 bytes in 7,910 blocks
          ==16872== of which reachable via heuristic:
          ==16872== stdstring : 4,059 bytes in 99 blocks
          ==16872== suppressed: 0 bytes in 0 blocks

          The interesting part is the "still reachable" figure. You can see that the number of blocks still reachable corresponds to the number of queries run. Although these blocks are "still reachable", they are a problem because memory usage continues to grow with the number of queries. As far as I can tell, "still reachable" means the memory could have been freed but wasn't prior to program exit. So technically it's not a "memory leak", but the memory doesn't get freed until program exit. And since the memory footprint grows with the number of queries, this suggests that the query results are not being discarded properly after the SDK hands them to my program; not sure though, just an idea.
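          repro.py is attached rather than inlined here, so as a rough illustration only, a loop along the following lines is enough to observe the pattern described above (connection string, credentials and statement are placeholders, and the authenticator import path varies slightly across 3.0.x releases):

          from couchbase.cluster import Cluster, ClusterOptions
          from couchbase_core.cluster import PasswordAuthenticator  # lives under couchbase.auth in later releases

          cluster = Cluster('couchbase://localhost',
                            ClusterOptions(PasswordAuthenticator('Administrator', 'password')))

          for _ in range(10000):
              result = cluster.query('SELECT * FROM `bucket-1` LIMIT 1')  # placeholder statement
              rows = [row for row in result]  # fully drain the result
              # rows and result go out of scope here, yet on 3.0.4 the process RSS keeps climbing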

          Ellis.Breen Ellis Breen added a comment -

          Thanks Korry, trying to reproduce this leak now.

          Are you sure you meant to set 'go' to False in the example code?

          If I remove that part of the code, I can see from tracemalloc the 'count' value monotonically increasing. I thought perhaps this was due to garbage collection churn and that this would eventually settle, but not sure how long this might take, or what thresholds are required. There isn't anything obviously global retaining/caching data that I can see.

          I had suspected some buffering code in couchbase.n1ql.QueryResult but commenting that out doesn't appear to have stopped this rise.

          Continuing to investigate/profile at the moment - will try the 'memory_profile' module shortly.
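          For reference, that module is published on PyPI as memory_profiler; a minimal way to point it at the repro loop looks roughly like this (reusing the hypothetical cluster object and statement from the repro sketch above):

          from memory_profiler import profile

          @profile  # prints a line-by-line memory usage report when the function returns
          def run_batch(n=100):
              for _ in range(n):
                  result = cluster.query('SELECT * FROM `bucket-1` LIMIT 1')  # placeholder statement
                  list(result)

          run_batch()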

          Ellis.Breen Ellis Breen added a comment -

          I think this might be due to a self-reference, 'self._parent', in QueryResult. I'm not sure where this is coming from, because even after I remove the self-assignment to self._parent, this parameter persists (see the attached screenshot).

          Ellis.Breen Ellis Breen added a comment -

          See current patch set and experiments here:

          http://review.couchbase.org/c/couchbase-python-client/+/135796

          Ellis.Breen Ellis Breen added a comment - - edited

          Ah, it turns out it was pointing to the QueryMetadata object. If I 'del metadata', that reference goes away and there are no references left to 'result'.

          As the memory allocated by 'fetch' is reachable, and the only allocation ViewResult_fetch makes (it actually does the fetching in the C code, a historical quirk) is a new list of rows if one is not already present, this data should be reachable via the Python object graph.

          The graphs I've produced with objgraph so far don't give me much of a steer as to what is retaining these lists, but I'll continue tomorrow.
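          For reference, the kind of objgraph probing mentioned above looks roughly like this; it needs the objgraph package plus graphviz to render the images, and picking the largest live lists is just a heuristic for finding the suspect row buffers:

          import gc
          import objgraph

          gc.collect()  # drop anything that is only garbage, not genuinely retained

          # Take the few largest plain lists still alive after the queries were drained...
          suspects = sorted(objgraph.by_type('list'), key=len, reverse=True)[:3]

          # ...and draw what is keeping each of them reachable
          for i, obj in enumerate(suspects):
              objgraph.show_backrefs(obj, max_depth=5, filename='backrefs-%d.png' % i)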

          Ellis.Breen Ellis Breen added a comment - - edited

          I ran this on SDK 2.5 to confirm there was a difference. I can confirm that two excess 'dict' and 'list' objects are retained each time we iterate through the result (but not on creating the N1QLRequest/QueryResult). I've stripped the code back to relying more or less on the original N1QLQuery code in couchbase_core, and I'm still seeing this.

          I'm starting to wonder whether there are different semantics between libcouchbase 2 and 3, meaning that some data gets stuck in _N1QLRequest.raw.rows, since the couchbase_core code is pretty much the same (as is the ViewResult_fetch code). The 'raw' attribute of a _N1QLRequest is a ViewResult (this is how N1QL was implemented in SDK2).
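          The per-iteration object counts described above can be gathered with objgraph.show_growth(), which prints the net new objects per type since its previous call (again treating cluster and the statement as placeholders from the repro sketch):

          import objgraph

          objgraph.show_growth(limit=10)  # establish a baseline

          for _ in range(5):
              result = cluster.query('SELECT * FROM `bucket-1` LIMIT 1')  # placeholder statement
              list(result)
              # A leak shows up as the dict/list counts growing by a fixed amount on every pass
              objgraph.show_growth(limit=10)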

          Ellis.Breen Ellis Breen added a comment -

          So, it turns out the ErrorContext code does a Py_INCREF on a dictionary unnecessarily, meaning that there is an excess reference to said dictionary. I'm surprised that this dictionary is still reachable once I delete the ViewResult the error context is attached to, but it is.

          This seems to have resolved the leaking dictionaries although there may still be a leaking list.

          Ellis.Breen Ellis Breen added a comment -

          Korrigan Clark - please try the following:

          pip install git+http://review.couchbase.org/couchbase-python-client@refs/changes/81/135981/2
          

          This has the fix for the dictionary leak, though there may also be a list leak. Let me know if this helps and whether you still see probable leakage.

          Thanks,

           

          Ellis

          Ellis.Breen Ellis Breen added a comment - - edited

          Update: please try:

           

          pip install git+http://review.couchbase.org/couchbase-python-client@refs/changes/91/135991/2 --no-cache-dir
          

           

          This removes the 'list' leak I see too.

          P.S. the command for valgrind should have 'repro.py' at the end as parameters to the executable are taken after all Valgrind options:

          valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --xml=yes --xml-file=leak.xml env/bin/python3.5 repro.py

          This is probably why there wasn't much detail in the Valgrind log...

          Ellis.Breen Ellis Breen added a comment - - edited

          Any kind of soak/overnight test would also be very useful. But it sounds like we should get fairly quick feedback as to whether it's improved, as it was previously leaking quite quickly.

          Ellis.Breen Ellis Breen added a comment -

          If this now only shows up in the perf test rather than repro.py, it should be possible to run the Couchbase jobs with Clang Sanitizer enabled as described before, as long as you can somehow launch these jobs with the right Clang Sanitizer environment variables (or ensure the Clang Sanitizer libraries are in the global library search path).

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Could be related: https://www.python.org/dev/peps/pep-0442/ and https://bugs.python.org/issue17468

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          I ran tracemalloc against 3.0.4 and the patch set using a perf test. The diffs were gathered before ops started and after ops finished, 14k ops in total. You can see in the patch-set run (fixtracemalloc.txt) that memory usage was significantly lower: 6 MB versus 27 MB for 3.0.4 (304tracemalloc.txt). However, it still shows "return self.raw.fetch(self._mres)" at line 539 as the highest memory diff. So the issue has at least been partially fixed, but I don't think the fetch method should show any memory growth, given that the result of each query is only 1 KB in size.
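          A coarse cross-check that doesn't rely on tracemalloc is to sample the peak RSS of the worker process around the workload (Unix only; run_queries is again a placeholder for the 14k-op workload, as in the earlier tracemalloc sketch):

          import resource

          def peak_rss_kib():
              # ru_maxrss is the peak resident set size: KiB on Linux, bytes on macOS
              return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

          before = peak_rss_kib()
          run_queries(14000)  # placeholder for the perf workload
          print('peak RSS grew by %d KiB' % (peak_rss_kib() - before))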

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Throughput has improved from 17k to 27k in this test using the patch set: http://perf.jenkins.couchbase.com/job/iris-collections/192/console

          Prior to the fix, the machine would seize up halfway through the test.
          Ellis.Breen Ellis Breen added a comment - - edited

          Good news that this has at least partially fixed the issue - I did see some residual leaking of 1 list per iteration - but this seemed to happen with SDK2, too.

          In my research I also learnt that to get accurate results from Valgrind you may need to disable PyMalloc (the Python memory allocator, which uses memory pools instead of calling malloc directly), and it seems that to do this you need to compile Python in debug mode. I did this using "pyenv install <version> -g" on both macOS and Ubuntu 16 (note that you will need quite a few development prerequisites beyond pyenv to do this, but these are listed on the pyenv GitHub site).

          However, Valgrind for macOS runs incredibly slowly on my MacBook Pro, and on my Ubuntu 16 Vagrant box it complains of a rogue x64 opcode, so I wasn't able to glean any extra info. The various reference-counting debug options didn't lead to much more, although tracemalloc might offer some more backtrace info, going by the output when enabled.

          The generator bug sounds interesting, and along the lines I was thinking: the 'context' of the generator might be kept long after the QueryResult/N1QLRequest object that produced it has otherwise been forgotten, and with it the underlying ViewResult.rows. Will take a look at that bug on Monday.

          wayne Wayne Siu added a comment -

          Ellis Breen

          Thanks for looking at the issue. Do you have an ETA for when 3.0.6 will be available? Thanks.

          Ellis.Breen Ellis Breen added a comment -

          Hi Wayne - 3.0.6 is due for release on 6th October. The fix has been approved; I'm just running some integration tests on it and then plan to merge. I plan to do some extra checks to see if there are any other leaks (preexisting in SDK2 or otherwise) before closing this issue.


          ingenthr Matt Ingenthron added a comment -

          Minimum is review and merge. Already reported as working for QE. If possible, run Valgrind on Python to see if there are other easy-to-fix related leaks.
          Ellis.Breen Ellis Breen added a comment - - edited

          Hi, I believe I have produced a genuine fix for this leak (one that doesn't sometimes skip metadata rows). I should run some more soak tests, but all the tests I've run (mock and server) indicate this is working. Valgrind with PYTHONMALLOC=malloc_debug doesn't show up anything suspicious, either.

          To test, run:

          pip install git+http://review.couchbase.org/couchbase-python-client@refs/changes/81/135981/13
          
          

          Thanks,

          Ellis


          People

            Assignee: brett19 Brett Lawson
            Reporter: korrigan.clark Korrigan Clark (Inactive)
            Votes: 0
            Watchers: 4

