Uploaded image for project: 'Couchbase Python Client Library'
  1. Couchbase Python Client Library
  2. PYCBC-714

Python SDK throwing SystemError

    XMLWordPrintable

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.5.9
    • 2.5.12, 3.0.1
    • None
    • None
    • 1
    • SDK17: FLE/DOC/ServerlessPF

    Attachments

      1. debug.log
        75 kB
      2. debug-good.log
        78 kB
      3. system_error.log
        38 kB
      For Gerrit Dashboard: PYCBC-714
      # Subject Branch Project Status CR V

      Activity

        Ellis.Breen Ellis Breen added a comment -

        A customer is periodically seeing this error:

         

        SystemError: <method 'get' of 'Bucket' objects> returned NULL without setting an error
        

        This error has very occasionally been seen in testing, and appears to be related to initialisation problems as when seen it always effects the first operation performed. So far we have been unable to reproduce it while attached to a debugger. I would theorise that there is some of race condition occurring that means that operations are called and are able to return NULL without setting an error with PyErr_Set/PyErr_Restore, but without a GDB/LLDB backtrace it is hard to prove.

         

         

        Ellis.Breen Ellis Breen added a comment - A customer is periodically seeing this error:   SystemError: <method 'get' of 'Bucket' objects> returned NULL without setting an error This error has very occasionally been seen in testing, and appears to be related to initialisation problems as when seen it always effects the first operation performed. So far we have been unable to reproduce it while attached to a debugger. I would theorise that there is some of race condition occurring that means that operations are called and are able to return NULL without setting an error with PyErr_Set/PyErr_Restore, but without a GDB/LLDB backtrace it is hard to prove.    
        Dundee Daniel Milde added a comment - - edited

        We are seeing this error as well for 2.5.10. We are also unable to reproduce.

        # dpkg -l | grep couchbase
        ii libcouchbase-dev:amd64 2.10.5-1 amd64 library for the Couchbase protocol, development files
        ii libcouchbase2-core:amd64 2.10.5-1 amd64 library for the Couchbase protocol, core files
         
        # pip list | grep couchbase
        couchbase 2.5.10
         
        # python3
        Python 3.7.3 (default, Apr 3 2019, 05:39:12) 
        [GCC 8.3.0] on linux
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import couchbase
        >>> couchbase.__version__
        '2.5.10'
        >>> couchbase._libcouchbase
        <module 'couchbase._libcouchbase' from '/usr/local/lib/python3.7/dist-packages/couchbase/_libcouchbase.cpython-37m-x86_64-linux-gnu.so'>
         
        # file /usr/local/lib/python3.7/dist-packages/couchbase/_libcouchbase.cpython-37m-x86_64-linux-gnu.so
        /usr/local/lib/python3.7/dist-packages/couchbase/_libcouchbase.cpython-37m-x86_64-linux-gnu.so: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=a30562cc3a9a0de5b2d38787b556091ae9de750d, with debug_info, not stripped
         
        # uname -a
        Linux 5.3.0-24-generic #26~18.04.2-Ubuntu SMP Tue Nov 26 12:34:22 UTC 2019 x86_64 GNU/Linux
        

        Dundee Daniel Milde added a comment - - edited We are seeing this error as well for 2.5.10. We are also unable to reproduce. # dpkg -l | grep couchbase ii libcouchbase-dev:amd64 2.10.5-1 amd64 library for the Couchbase protocol, development files ii libcouchbase2-core:amd64 2.10.5-1 amd64 library for the Couchbase protocol, core files   # pip list | grep couchbase couchbase 2.5.10   # python3 Python 3.7.3 (default, Apr 3 2019, 05:39:12) [GCC 8.3.0] on linux Type "help" , "copyright" , "credits" or "license" for more information. >>> import couchbase >>> couchbase.__version__ '2.5.10' >>> couchbase._libcouchbase <module 'couchbase._libcouchbase' from '/usr/local/lib/python3.7/dist-packages/couchbase/_libcouchbase.cpython-37m-x86_64-linux-gnu.so' >   # file /usr/local/lib/python3.7/dist-packages/couchbase/_libcouchbase.cpython-37m-x86_64-linux-gnu.so /usr/local/lib/python3 .7 /dist-packages/couchbase/_libcouchbase .cpython-37m-x86_64-linux-gnu.so: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=a30562cc3a9a0de5b2d38787b556091ae9de750d, with debug_info, not stripped   # uname -a Linux 5.3.0-24-generic #26~18.04.2-Ubuntu SMP Tue Nov 26 12:34:22 UTC 2019 x86_64 GNU/Linux
        Dundee Daniel Milde added a comment -

        Affects couchbase 2.5.8 with libcouchbase 2.10.5-1 as well.

        Dundee Daniel Milde added a comment - Affects couchbase 2.5.8 with libcouchbase 2.10.5-1 as well.
        Dundee Daniel Milde added a comment -

        Couchbase 2.5.1 with libcouchbase 2.10.5-1 on Buster is broken too.

        We haven't seen these errors on Stretch with couchbase 2.5.1, so it looks like the libcouchbase is the cause.

        Dundee Daniel Milde added a comment - Couchbase 2.5.1 with libcouchbase 2.10.5-1 on Buster is broken too. We haven't seen these errors on Stretch with couchbase 2.5.1, so it looks like the libcouchbase is the cause.
        Ellis.Breen Ellis Breen added a comment -

        Interesting, I wonder what the latest version of libcouchbase that doesn't seem to trigger failures in this way is.

        If we can determine that, it should be considerably simpler to bisect any root issue from there on.

        If you can reproduce with Valgrind or Clang sanitizer (build the client using PYCBC_SAN_OPT_address=TRUE and PYCBC_SAN_OPT_undefined in the environment, and run with any runtime sanitizer library it requests) then that may also help us to diagnose this.

        Thanks,

        Ellis

        Ellis.Breen Ellis Breen added a comment - Interesting, I wonder what the latest version of libcouchbase that doesn't seem to trigger failures in this way is. If we can determine that, it should be considerably simpler to bisect any root issue from there on. If you can reproduce with Valgrind or Clang sanitizer (build the client using PYCBC_SAN_OPT_address=TRUE and PYCBC_SAN_OPT_undefined in the environment, and run with any runtime sanitizer library it requests) then that may also help us to diagnose this. Thanks, Ellis

        Last working configuration was:

        # pip list | grep couchbase
        couchbase                2.5.1
         
        # dpkg -l | grep couchbase
        ii  libcouchbase-dev:amd64           2.10.2-2                          amd64        library for the Couchbase protocol, development files
        ii  libcouchbase2-core:amd64         2.10.2-2                          amd64        library for the Couchbase protocol, core files
        

        but it was on Debian Stretch with Python 3.5.

        We are not able to reproduce it yet. It only happens in production a few times a day.

        Dundee Daniel Milde added a comment - Last working configuration was: # pip list | grep couchbase couchbase 2.5.1   # dpkg -l | grep couchbase ii libcouchbase-dev:amd64 2.10.2-2 amd64 library for the Couchbase protocol, development files ii libcouchbase2-core:amd64 2.10.2-2 amd64 library for the Couchbase protocol, core files but it was on Debian Stretch with Python 3.5. We are not able to reproduce it yet. It only happens in production a few times a day.

        I advised the user who reported this to our team to try setting quiet=False as a kwarg to the get() call and he reported that he no longer was seeing these errors and we've had no further reports of problems from anyone so far.

        Not sure if that's helpful for debugging but wanted to mention it.

        thendricks Todd Hendricks added a comment - I advised the user who reported this to our team to try setting quiet=False as a kwarg to the get() call and he reported that he no longer was seeing these errors and we've had no further reports of problems from anyone so far. Not sure if that's helpful for debugging but wanted to mention it.
        Dundee Daniel Milde added a comment -

        I was finally able to reproduce it. It just needs many repetitions to appear

        import logging
         
        import couchbase
        from couchbase.bucket import Bucket
        from couchbase.cluster import Cluster, PasswordAuthenticator
        from couchbase.exceptions import NotFoundError
         
        logging.basicConfig(level=logging.DEBUG)
        log = logging.root
         
        couchbase.enable_logging()
         
        couchbase_address = "couchbase://{}?operation_timeout={},config_total_timeout={}".format(
            ','.join((
                 ....
            )),
            10,
            20,
        )
        log.debug('Connecting to couchbase @ {}'.format(couchbase_address))
        cluster = Cluster(couchbase_address)
        cluster.authenticate(PasswordAuthenticator(
             ....
        ))
        b = cluster.open_bucket('xxx')
         
        try:
            b.get('xxx')
        except NotFoundError:
            print('ok')
        

        for i in $(seq 1 1000); do echo $i; LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libasan.so.5 python3 test.py || exit 1; done
        

        Output: system_error.log

        Dundee Daniel Milde added a comment - I was finally able to reproduce it. It just needs many repetitions to appear import logging   import couchbase from couchbase.bucket import Bucket from couchbase.cluster import Cluster, PasswordAuthenticator from couchbase.exceptions import NotFoundError   logging.basicConfig(level = logging.DEBUG) log = logging.root   couchbase.enable_logging()   couchbase_address = "couchbase://{}?operation_timeout={},config_total_timeout={}" . format ( ',' .join(( .... )), 10 , 20 , ) log.debug( 'Connecting to couchbase @ {}' . format (couchbase_address)) cluster = Cluster(couchbase_address) cluster.authenticate(PasswordAuthenticator( .... )) b = cluster.open_bucket( 'xxx' )   try : b.get( 'xxx' ) except NotFoundError: print ( 'ok' ) for i in $( seq 1 1000); do echo $i; LD_PRELOAD= /usr/lib/x86_64-linux-gnu/libasan .so.5 python3 test .py || exit 1; done Output: system_error.log
        Dundee Daniel Milde added a comment -

        Todd Hendricks:

        quiet=False does not seem to resolve it for me (at least on localhost, not tested in production yet)

        Traceback (most recent call last):
          File "test.py", line 32, in <module>
            b.get('xxx', quiet=False)
          File "/usr/local/lib/python3.7/dist-packages/couchbase/bucket.py", line 558, in get
            replica=replica, no_format=no_format)
        SystemError: <method 'get' of 'Bucket' objects> returned NULL without setting an error
        

        Dundee Daniel Milde added a comment - Todd Hendricks: quiet=False does not seem to resolve it for me (at least on localhost, not tested in production yet) Traceback (most recent call last): File "test.py" , line 32 , in <module> b.get( 'xxx' , quiet = False ) File "/usr/local/lib/python3.7/dist-packages/couchbase/bucket.py" , line 558 , in get replica = replica, no_format = no_format) SystemError: <method 'get' of 'Bucket' objects> returned NULL without setting an error
        Dundee Daniel Milde added a comment -

        It always happens in the same moment: bucket is selected and new context is being created.

        ...
        DEBUG:couchbase.negotiation:[4130742689] <10.64.120.148:11210> (CTX=0x1ea4b50,sasl,SASLREQ=0x1cff990) Server supports features: 0x03 (TCP nodelay), 0x06 (XATTR), 0x07 (XERROR), 0x08 (Select bucket), 0x0a (Snappy), 0x0b (JSON) (L:473)
        DEBUG:couchbase.negotiation:[4130742689] <10.64.120.148:11210> (CTX=0x1ea4b50,sasl,SASLREQ=0x1cff990) Sending SELECT_BUCKET (L:524)
        DEBUG:couchbase.ioctx:[4130742689] <10.64.120.148:11210> (CTX=0x1ea4b50,sasl) Destroying context. Pending Writes=0, Entered=true, Socket Refcount=1 (L:151)
        DEBUG:couchbase.ioctx:[4130742689] <10.64.120.148:11210> (CTX=0x1f95d60,unknown) Pairing with SOCK=14ae7f39f1c518c7 (L:101)
        Traceback (most recent call last):
          File "test.py", line 32, in <module>
            b.get('xxx')
          File "/usr/local/lib/python3.7/dist-packages/couchbase/bucket.py", line 554, in get
            replica=replica, no_format=no_format)
        SystemError: <method 'get' of 'Bucket' objects> returned NULL without setting an error
        

        Dundee Daniel Milde added a comment - It always happens in the same moment: bucket is selected and new context is being created. ... DEBUG:couchbase.negotiation:[4130742689] <10.64.120.148:11210> (CTX=0x1ea4b50,sasl,SASLREQ=0x1cff990) Server supports features: 0x03 (TCP nodelay), 0x06 (XATTR), 0x07 (XERROR), 0x08 (Select bucket), 0x0a (Snappy), 0x0b (JSON) (L:473) DEBUG:couchbase.negotiation:[4130742689] <10.64.120.148:11210> (CTX=0x1ea4b50,sasl,SASLREQ=0x1cff990) Sending SELECT_BUCKET (L:524) DEBUG:couchbase.ioctx:[4130742689] <10.64.120.148:11210> (CTX=0x1ea4b50,sasl) Destroying context. Pending Writes=0, Entered= true , Socket Refcount=1 (L:151) DEBUG:couchbase.ioctx:[4130742689] <10.64.120.148:11210> (CTX=0x1f95d60,unknown) Pairing with SOCK=14ae7f39f1c518c7 (L:101) Traceback (most recent call last): File "test.py" , line 32, in <module> b.get( 'xxx' ) File "/usr/local/lib/python3.7/dist-packages/couchbase/bucket.py" , line 554, in get replica=replica, no_format=no_format) SystemError: <method 'get' of 'Bucket' objects> returned NULL without setting an error

        With the recent findings on this David Kelly, can you have a look. Knowing how to repro should make it easier to narrow this down.

        ingenthr Matt Ingenthron added a comment - With the recent findings on this David Kelly , can you have a look. Knowing how to repro should make it easier to narrow this down.
        david.kelly David Kelly added a comment -

        It makes it easy to reproduce - I did it pretty quickly with

        import logging
        import couchbase
        from couchbase.cluster import Cluster, PasswordAuthenticator
        from couchbase.exceptions import NotFoundError
         
        logging.basicConfig(level=logging.DEBUG)
        log = logging.root
         
        couchbase.enable_logging()
         
        cluster = Cluster('couchbase://10.143.193.101')
        cluster.authenticate(PasswordAuthenticator("Administrator", "password"))
         
        b = cluster.open_bucket("default")
        try:
            b.get("xxx")
        except NotFoundError:
            print('ok')
        

        executed in a loop like this:
        for i in $(seq 1 1000); do echo $1; python3 test.py || exit 1; done

        Like, exactly that repro above but as vanilla as I could make it. I've seen this same error (though maybe not always a get) come up occasionally during tests, recently. But I couldn't repro it. Now I can! Yay!

        Now, as for why? IDK. I'll start poking at it a bit harder now. My first thing I notice is I think this happens only when the document doesn't exist as I ran it for a doc that did, and it finished without incident. Or it is just much less frequent in that case. Or I was lucky. I'll run under lldb and see what I can come up with...

        david.kelly David Kelly added a comment - It makes it easy to reproduce - I did it pretty quickly with import logging import couchbase from couchbase.cluster import Cluster, PasswordAuthenticator from couchbase.exceptions import NotFoundError   logging.basicConfig(level = logging.DEBUG) log = logging.root   couchbase.enable_logging()   cluster = Cluster( 'couchbase://10.143.193.101' ) cluster.authenticate(PasswordAuthenticator( "Administrator" , "password" ))   b = cluster.open_bucket( "default" ) try : b.get( "xxx" ) except NotFoundError: print ( 'ok' ) executed in a loop like this: for i in $(seq 1 1000); do echo $1; python3 test.py || exit 1; done Like, exactly that repro above but as vanilla as I could make it. I've seen this same error (though maybe not always a get) come up occasionally during tests, recently. But I couldn't repro it. Now I can! Yay! Now, as for why? IDK. I'll start poking at it a bit harder now. My first thing I notice is I think this happens only when the document doesn't exist as I ran it for a doc that did, and it finished without incident. Or it is just much less frequent in that case. Or I was lucky. I'll run under lldb and see what I can come up with...
        david.kelly David Kelly added a comment -

        Also – the equivalent code in the current 3.0.0.beta3 seems not to do this. That seems like a clue, so I'll look a bit more at that and see if there are differences in the bindings.

        david.kelly David Kelly added a comment - Also – the equivalent code in the current 3.0.0.beta3 seems not to do this. That seems like a clue, so I'll look a bit more at that and see if there are differences in the bindings.
        Dundee Daniel Milde added a comment - - edited

        Log with PYCBC_DEBUG enabled of failed: debug.log and correct attempt: debug-good.log

        Looks like the exception is lost for some reason.

        Dundee Daniel Milde added a comment - - edited Log with PYCBC_DEBUG enabled of failed: debug.log and correct attempt: debug-good.log Looks like the exception is lost for some reason.
        david.kelly David Kelly added a comment -

        Thx Daniel Milde - that is helpful! I was looking at the wrappers yesterday - but so far I didn't find the issue. I'll be looking a bit harder this afternoon...

        david.kelly David Kelly added a comment - Thx Daniel Milde - that is helpful! I was looking at the wrappers yesterday - but so far I didn't find the issue. I'll be looking a bit harder this afternoon...
        Dundee Daniel Milde added a comment -

        I have added some debugs checking if the exception is still set - PYCBC_DEBUG_PYFORMAT("XXX %R", PyErr_Occurred());

        And have found out that the exception is unset after this line:
        https://github.com/couchbase/couchbase-python-client/blob/release25/src/multiresult.c#L332

        But it is beyond my C knowledge to tell why

        Dundee Daniel Milde added a comment - I have added some debugs checking if the exception is still set - PYCBC_DEBUG_PYFORMAT("XXX %R", PyErr_Occurred()); And have found out that the exception is unset after this line: https://github.com/couchbase/couchbase-python-client/blob/release25/src/multiresult.c#L332 But it is beyond my C knowledge to tell why
        david.kelly David Kelly added a comment -

        Thats also super helpful – I'd planned a bunch of similar PYCBC_DEBUG_LOG statements - this narrows stuff down. I'll see if I can find it this afternoon.

        david.kelly David Kelly added a comment - Thats also super helpful – I'd planned a bunch of similar PYCBC_DEBUG_LOG statements - this narrows stuff down. I'll see if I can find it this afternoon.
        david.kelly David Kelly added a comment -

        Sadly I didn't get the root cause, but I'm hoping to get back to this in a bit. I have some other work I need to get done by the end of the week.

        Daniel Milde Todd Hendricks Thx for all your work on this. Last thing I noted was that when the error occurs, the exception is gone in the pycbc_wrap_teardown call. So - I think I can narrow it down further – my first guess is an issue with ref counts on those error objects, somehow.

        david.kelly David Kelly added a comment - Sadly I didn't get the root cause, but I'm hoping to get back to this in a bit. I have some other work I need to get done by the end of the week. Daniel Milde Todd Hendricks Thx for all your work on this. Last thing I noted was that when the error occurs, the exception is gone in the pycbc_wrap_teardown call. So - I think I can narrow it down further – my first guess is an issue with ref counts on those error objects, somehow.
        Dundee Daniel Milde added a comment -

        FYI: Adding parameter quiet=True didn't solve it entirely but it decreased the error rate a lot!

        Dundee Daniel Milde added a comment - FYI: Adding parameter quiet=True didn't solve it entirely but it decreased the error rate a lot!
        david.kelly David Kelly added a comment -

        Interestingly - this showed up once in our integration tests, and it was not right at startup (say, right after/while connecting to a cluster).  So - this does exist in 3.0.0 (albeit difficult to reproduce).  Will put it in the 3.0.1 milestone - hoping we can find it.   Could be a different repro case will be needed?  Anyways, will look shortly.

        david.kelly David Kelly added a comment - Interestingly - this showed up once in our integration tests, and it was not right at startup (say, right after/while connecting to a cluster).  So - this does exist in 3.0.0 (albeit difficult to reproduce).  Will put it in the 3.0.1 milestone - hoping we can find it.   Could be a different repro case will be needed?  Anyways, will look shortly.
        david.kelly David Kelly added a comment -

        In 3.0.0 now, will cherry-pick into 2.x shortly.   Should then be in 2.5.12 when it comes out shortly.

        david.kelly David Kelly added a comment - In 3.0.0 now, will cherry-pick into 2.x shortly.   Should then be in 2.5.12 when it comes out shortly.
        david.kelly David Kelly added a comment -

        We should be releasing this fix in 2.5.12 tomorrow.

        david.kelly David Kelly added a comment - We should be releasing this fix in 2.5.12 tomorrow.
        Dundee Daniel Milde added a comment -

        Thank you!

        Dundee Daniel Milde added a comment - Thank you!
        Ellis.Breen Ellis Breen added a comment -

        You're very welcome - we were very glad to fix this one!

        Ellis.Breen Ellis Breen added a comment - You're very welcome - we were very glad to fix this one!

        People

          david.kelly David Kelly
          Ellis.Breen Ellis Breen
          Votes:
          1 Vote for this issue
          Watchers:
          5 Start watching this issue

          Dates

            Created:
            Updated:
            Resolved:

            Gerrit Reviews

              There are no open Gerrit changes

              PagerDuty