Support for Serverless Execution Environments


Attachments

4


Activity


Jared Casey October 20, 2022 at 9:59 PM

Serverless Execution Environment supported.

Jared Casey October 20, 2022 at 12:35 AM
Edited

Update 10/19/2022
Added logs to test various scenarios (see attachments).

Corefile:

.:53 {
    forward . 8.8.8.8 9.9.9.9
    log
    errors
}
example.org:53 {
    file /root/cb.example.org
    log
}

cb.example.org

$ORIGIN example.org.
$TTL 3600
@       IN SOA ns.example.org. user.example.org. (
                2020080302 ;Serial
                7200       ;Refresh
                3600       ;Retry
                1209600    ;Expire
                3600 )     ;Negative response caching TTL
@       IN NS    ns.example.org.
ns      IN CNAME @
_couchbase._tcp 10800 IN SRV 10 10 11210 ec2-34-212-171-20.us-west-2.compute.amazonaws.com.

CoreDNS docker command

docker run --rm --name coredns -v ~/path/to/setup/files/:/root -p 53:53/udp coredns/coredns -conf /root/Corefile

Local CoreDNS steps:

  1. Set up EC2 instance(s) running Couchbase (a local Docker image could work, but EC2 works just as well and is easier with multiple nodes, as Vagrant + Mac M1 is not an easy combo IMHO).

  2. Start CoreDNS (see Docker command above)

  3. Start Python script (see the simple program later in this ticket)

  4. Run for a while

  5. Stop CoreDNS

  6. Edit cb.example.org zone file SRV record to use google.com (can use another domain...); a quick SRV sanity check is sketched after this list

  7. Start CoreDNS (see Docker command above)

  8. ssh into EC2 instance and run sudo systemctl stop couchbase-server

  9. Should start to see timeouts from KV ops

  10. Stop CoreDNS

  11. Edit cb.example.org zone file SRV record back to the EC2 IP

  12. ssh into EC2 instance and run sudo systemctl start couchbase-server

  13. Start CoreDNS (see Docker command above)

  14. Should see client recover and docs printed out
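To confirm what the CoreDNS container is actually serving between zone-file edits, the SRV record can be queried directly (a minimal sketch, assuming the dnspython 2.x package is installed and CoreDNS is listening on 127.0.0.1:53):

import dns.resolver  # dnspython 2.x

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['127.0.0.1']  # the local CoreDNS container

# query the same record defined in the cb.example.org zone file
answers = resolver.resolve('_couchbase._tcp.example.org', 'SRV')
for rdata in answers:
    print(f'priority={rdata.priority} weight={rdata.weight} '
          f'port={rdata.port} target={rdata.target}')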

Local CoreDNS freeze/thaw steps:

  1. Set up EC2 instances running Couchbase

  2. Edit cb.example.org zone file SRV record to point to one of the EC2 instance IPs

  3. Start CoreDNS (see Docker command above)

  4. Start Python script (see the simple program later in this ticket)

  5. Run for a while

  6. run kill -s STOP <pid> (a small freeze/thaw driver is sketched after this list)

  7. Stop CoreDNS

  8. Edit cb.example.org zone file SRV record to use the other EC2 instance IP

  9. ssh into the EC2 instance the SRV record originally pointed to (step 2) and run sudo systemctl stop couchbase-server

  10. Start CoreDNS (see Docker command above)

  11. run kill -s CONT <pid>

  12. Should see client recover and docs printed out
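The STOP/CONT steps can also be driven from a small script to emulate a serverless freeze/thaw of a fixed length (a minimal standard-library sketch; pid is whatever the looping client prints at startup, and the duration is arbitrary):

import os
import signal
import sys
import time

pid = int(sys.argv[1])                    # pid printed by the looping client
freeze_secs = float(sys.argv[2]) if len(sys.argv) > 2 else 60.0

os.kill(pid, signal.SIGSTOP)              # freeze: same as kill -s STOP <pid>
print(f'froze {pid} for {freeze_secs}s')
time.sleep(freeze_secs)                   # window to stop CoreDNS / edit the zone file
os.kill(pid, signal.SIGCONT)              # thaw: same as kill -s CONT <pid>
print(f'thawed {pid}')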

Rebalance testing steps are listed earlier in the ticket.

David Kelly July 19, 2022 at 2:46 PM

I was wrong – the one I saw was slightly different. "Too many open files" suggests we may just be trying to (re)connect in a tight loop (in the client), creating a new socket each time, and ending up with a large number of sockets (all the connection attempts) in TIME_WAIT state. Looking at what netstat or similar says when this happens would verify that. If so, the issue may be in the client's reconnection logic creating lots of sockets.
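One way to check this while the repro runs would be to poll netstat and count sockets in TIME_WAIT (a rough sketch; netstat's column layout varies by platform, so this just greps for the state name):

import subprocess
import time

while True:
    out = subprocess.run(['netstat', '-an'], capture_output=True, text=True).stdout
    n = sum(1 for line in out.splitlines() if 'TIME_WAIT' in line)
    print(f'{time.strftime("%H:%M:%S")} TIME_WAIT sockets: {n}')
    time.sleep(5)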

What I saw was random_device failing to open /dev/urandom ("Too many open files"). Each time we generate a random value in the client, we create a new random_device on the stack (which opens /dev/urandom and then closes it). But since it closes the file (not a socket), we don't build up file descriptors in TIME_WAIT, so it is probably not the same issue.
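Watching the client process's open-descriptor count would show whether descriptors of any kind are actually leaking before the random_device failure (a sketch that reads /proc, so Linux-only; on macOS something like lsof -p <pid> would be the equivalent):

import os
import sys
import time

pid = int(sys.argv[1])  # pid of the looping client process

while True:
    n_open = len(os.listdir(f'/proc/{pid}/fd'))  # one entry per open descriptor
    print(f'{time.strftime("%H:%M:%S")} pid {pid}: {n_open} open descriptors')
    time.sleep(5)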

Jared Casey July 1, 2022 at 10:03 PM

I should note that if I do the following, I do not see any issues (the sequence is easy to script; see the sketch after these steps). Also, if I simply freeze/thaw (even for multiple minutes) I do not see any issues.

  1. run simple looping program (either KV loop or query loop, results are the same)

  2. issue kill -s STOP <pid> during sleep

  3. go to node1 and execute rebalance that adds server

  4. once rebalance is complete, issue kill -s CONT <pid>
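A sketch of scripting that sequence, reusing the couchbase-cli commands from the comment below (shown with the server-remove variant; the add variant would run the server-add + rebalance commands instead, and it assumes the CLI blocks until the rebalance completes, otherwise poll before thawing):

import os
import signal
import subprocess
import sys

pid = int(sys.argv[1])        # pid printed by the looping client

os.kill(pid, signal.SIGSTOP)  # step 2: freeze the client during its sleep
subprocess.run([              # step 3: rebalance a server out of the cluster
    '/opt/couchbase/bin/couchbase-cli', 'rebalance',
    '-c', '192.168.33.101', '-u', 'Administrator', '-p', 'password',
    '--server-remove', '192.168.33.102',
], check=True)
os.kill(pid, signal.SIGCONT)  # step 4: thaw once the rebalance is complete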

Jared Casey July 1, 2022 at 10:00 PM
Edited

Test environment:

  • 2 node vagrant setup w/ CBS 7.1.0

  • beer-sample bucket

  • Python SDK 4.0.2

Test steps:

  1. run simple looping program (either KV loop or query loop, results are the same)

  2. issue kill -s STOP <pid> during sleep

  3. go to node1 and execute rebalance that removes server

  4. once rebalance is complete, issue kill -s CONT <pid>

Result:

  • Sometimes I see UnAmbiguousTimeoutException repeatedly, and eventually the program seems to stall, but there is no crash

  • Other times I see the crash below – I think this might be similar to a crash you had seen before?

Simple program

import time
import sys
import os

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.exceptions import AmbiguousTimeoutException, UnAmbiguousTimeoutException


def loop_kv(collection, num=50, delay=3):
    exception_count = 0
    for i in range(num):
        try:
            r = collection.get('21st_amendment_brewery_cafe')
            print(f'loop count {i}, content: {r.content_as[dict]}')
        except (AmbiguousTimeoutException, UnAmbiguousTimeoutException):
            exc_info = sys.exc_info()
            exception_count += 1
            print(f'loop count: {i}, exception count: {exception_count} Exception info: {exc_info[1]}')
        print(f'Sleeping for {delay} seconds...')
        time.sleep(delay)


def loop_query(cluster, num=50, delay=3):
    q_str = "SELECT * FROM `beer-sample` WHERE type='brewery' LIMIT 10"
    exception_count = 0
    for i in range(num):
        try:
            r = cluster.query(q_str).execute()
            print(f'loop count: {i}, Found {len(r)} records')
        except (AmbiguousTimeoutException, UnAmbiguousTimeoutException):
            exc_info = sys.exc_info()
            exception_count += 1
            print(f'loop count: {i}, exception count: {exception_count} Exception info: {exc_info[1]}')
        print(f'Sleeping for {delay} seconds...')
        time.sleep(delay)


if __name__ == "__main__":
    print(f'Process id: {os.getpid()}')
    cluster = Cluster.connect("couchbase://192.168.33.101",
                              ClusterOptions(PasswordAuthenticator("Administrator", "password")))
    bucket = cluster.bucket('beer-sample')
    collection = bucket.default_collection()
    loop_kv(collection)
    # loop_query(cluster)

Crash output:

libc++abi: terminating with uncaught exception of type std::__1::system_error: random_device failed to open /dev/urandom: Too many open files
Process 87397 stopped
* thread #3, stop reason = signal SIGABRT
    frame #0: 0x00007fff2031492e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff2031492e <+10>: jae    0x7fff20314938 ; <+20>
    0x7fff20314930 <+12>: movq   %rax, %rdi
    0x7fff20314933 <+15>: jmp    0x7fff2030ead9 ; cerror_nocancel
    0x7fff20314938 <+20>: retq
Target 0: (Python) stopped.
(lldb) bt
* thread #3, stop reason = signal SIGABRT
  * frame #0: 0x00007fff2031492e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff203435bd libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007fff20298406 libsystem_c.dylib`abort + 125
    frame #3: 0x00007fff20306ef2 libc++abi.dylib`abort_message + 241
    frame #4: 0x00007fff202f85e5 libc++abi.dylib`demangling_terminate_handler() + 242
    frame #5: 0x00007fff201f1595 libobjc.A.dylib`_objc_terminate() + 104
    frame #6: 0x00007fff20306307 libc++abi.dylib`std::__terminate(void (*)()) + 8
    frame #7: 0x00007fff203062a9 libc++abi.dylib`std::terminate() + 41
    frame #8: 0x00007fff202ad0c0 libc++.1.dylib`std::rethrow_exception(std::exception_ptr) + 17
    frame #9: 0x0000000150026214 pycbc_core.so`asio::detail::thread_info_base::rethrow_pending_exception(this=0x0000700008d7fe20) at thread_info_base.hpp:233:7
    frame #10: 0x0000000150025a99 pycbc_core.so`asio::detail::scheduler::do_run_one(this=0x00000001011631a0, lock=0x0000700008d7fde8, this_thread=0x0000700008d7fe20, ec=0x0000700008d7fee8) at scheduler.ipp:492:21
    frame #11: 0x000000015002567f pycbc_core.so`asio::detail::scheduler::run(this=0x00000001011631a0, ec=0x0000700008d7fee8) at scheduler.ipp:209:10
    frame #12: 0x000000015018fd2e pycbc_core.so`asio::io_context::run(this=0x0000000101432fa0) at io_context.ipp:62:24
    frame #13: 0x000000015018fcf8 pycbc_core.so`connection::connection(this=0x0000000101704198)::'lambda'()::operator()() const at client.hxx:243:48
    frame #14: 0x000000015018fc9d pycbc_core.so`decltype(__f=0x0000000101704198)::'lambda'()>(fp)()) std::__1::__invoke<connection::connection(int)::'lambda'()>(connection::connection(int)::'lambda'()&&) at type_traits:3747:1
    frame #15: 0x000000015018fc05 pycbc_core.so`void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, connection::connection(int)::'lambda'()>(__t=size=2, (null)=__tuple_indices<> @ 0x0000700008d7ff58)::'lambda'()>&, std::__1::__tuple_indices<>) at thread:280:5
    frame #16: 0x000000015018f48f pycbc_core.so`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, connection::connection(int)::'lambda'()> >(__vp=0x0000000101704190) at thread:291:5
    frame #17: 0x00007fff203438fc libsystem_pthread.dylib`_pthread_start + 224
    frame #18: 0x00007fff2033f443 libsystem_pthread.dylib`thread_start + 15

Rebalance server out of cluster:

/opt/couchbase/bin/couchbase-cli rebalance -c 192.168.33.101 -u Administrator -p password --server-remove 192.168.33.102

Rebalance server into cluster:

$ /opt/couchbase/bin/couchbase-cli server-add -c https://192.168.33.101 --username Administrator --password password --server-add https://192.168.33.102 --server-add-username Administrator --server-add-password password --services data,query,fts --no-ssl-verify
$ /opt/couchbase/bin/couchbase-cli rebalance -c 192.168.33.101:8091 --username Administrator --password password
Fixed
Created August 19, 2021 at 8:54 PM
Updated November 3, 2022 at 5:39 PM
Resolved October 20, 2022 at 9:59 PM