Support for Serverless Execution Environments


Attachments

4


Activity


Jared Casey October 20, 2022 at 9:59 PM

Serverless Execution Environment supported.

Jared Casey October 20, 2022 at 12:35 AM
Edited

Update 10/19/2022
Added logs to test various scenarios (see attachments).

Corefile:

.:53 {
    forward . 8.8.8.8 9.9.9.9
    log
    errors
}
example.org:53 {
    file /root/cb.example.org
    log
}

cb.example.org

$ORIGIN example.org.
$TTL 3600
@       IN SOA ns.example.org. user.example.org. (
                2020080302 ;Serial
                7200       ;Refresh
                3600       ;Retry
                1209600    ;Expire
                3600 )     ;Negative response caching TTL
@       IN NS    ns.example.org.
ns      IN CNAME @
_couchbase._tcp 10800 IN SRV 10 10 11210 ec2-34-212-171-20.us-west-2.compute.amazonaws.com.

CoreDNS docker command

docker run --rm --name coredns -v ~/path/to/setup/files/:/root -p 53:53/udp coredns/coredns -conf /root/Corefile

Local CoreDNS steps:

  1. Set up EC2 instance(s) running Couchbase (a local Docker image could work, but EC2 works just as well and is easier with multiple nodes, as Vagrant + Mac M1 is not an easy combo IMHO).

  2. Start CoreDNS (see Docker command above)

  3. Start Python script (see the simple program later in this ticket)

  4. Run for a while

  5. Stop CoreDNS

  6. Edit cb.example.org zone file SRV record to use google.com (can use another domain...); a quick SRV sanity check is sketched after this list

  7. Start CoreDNS (see Docker command above)

  8. ssh into EC2 instance and run sudo systemctl stop couchbase-server

  9. Should start to see timeouts from KV ops

  10. Stop CoreDNS

  11. Edit cb.example.org zone file SRV record back to the EC2 IP

  12. ssh into EC2 instance and run sudo systemctl start couchbase-server

  13. Start CoreDNS (see Docker command above)

  14. Should see client recover and docs printed out
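To confirm what the CoreDNS container is actually serving between zone-file edits, the SRV record can be queried directly (a minimal sketch, assuming the dnspython 2.x package is installed and CoreDNS is listening on 127.0.0.1:53):

import dns.resolver  # dnspython 2.x

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['127.0.0.1']  # the local CoreDNS container

# query the same record defined in the cb.example.org zone file
answers = resolver.resolve('_couchbase._tcp.example.org', 'SRV')
for rdata in answers:
    print(f'priority={rdata.priority} weight={rdata.weight} '
          f'port={rdata.port} target={rdata.target}')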

Local CoreDNS freeze/thaw steps:

  1. Set up EC2 instances running Couchbase

  2. Edit cb.example.org zone file SRV record to point to one of the EC2 instance IPs

  3. Start CoreDNS (see Docker command above)

  4. Start Python script (see the simple program later in this ticket)

  5. Run for a while

  6. run kill -s STOP <pid> (a small freeze/thaw driver is sketched after this list)

  7. Stop CoreDNS

  8. Edit cb.example.org zone file SRV record to use the other EC2 instance IP

  9. ssh into the EC2 instance the SRV record originally pointed to (step 2) and run sudo systemctl stop couchbase-server

  10. Start CoreDNS (see Docker command above)

  11. run kill -s CONT <pid>

  12. Should see client recover and docs printed out
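The STOP/CONT steps can also be driven from a small script to emulate a serverless freeze/thaw of a fixed length (a minimal standard-library sketch; pid is whatever the looping client prints at startup, and the duration is arbitrary):

import os
import signal
import sys
import time

pid = int(sys.argv[1])                    # pid printed by the looping client
freeze_secs = float(sys.argv[2]) if len(sys.argv) > 2 else 60.0

os.kill(pid, signal.SIGSTOP)              # freeze: same as kill -s STOP <pid>
print(f'froze {pid} for {freeze_secs}s')
time.sleep(freeze_secs)                   # window to stop CoreDNS / edit the zone file
os.kill(pid, signal.SIGCONT)              # thaw: same as kill -s CONT <pid>
print(f'thawed {pid}')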

Rebalance testing steps are listed earlier in the ticket.

David Kelly July 19, 2022 at 2:46 PM

I was wrong – the one I saw was slightly different. "Too many open files" suggests we may just be trying to (re)connect in a tight loop (in the client), creating a new socket each time, and ending up with a large number of sockets (all the connection attempts) in TIME_WAIT state. Looking at what netstat or similar says when this happens would verify that. If so, the issue may be in the client's reconnection logic creating lots of sockets.
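One way to check this while the repro runs would be to poll netstat and count sockets in TIME_WAIT (a rough sketch; netstat's column layout varies by platform, so this just greps for the state name):

import subprocess
import time

while True:
    out = subprocess.run(['netstat', '-an'], capture_output=True, text=True).stdout
    n = sum(1 for line in out.splitlines() if 'TIME_WAIT' in line)
    print(f'{time.strftime("%H:%M:%S")} TIME_WAIT sockets: {n}')
    time.sleep(5)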

What I saw was random_device failing to open /dev/urandom ("Too many open files"). Each time we generate a random value in the client, we create a new random_device on the stack (which opens /dev/urandom and then closes it). But since it closes the file (not a socket), we don't build up file descriptors in TIME_WAIT, so it is probably not the same issue.
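Watching the client process's open-descriptor count would show whether descriptors of any kind are actually leaking before the random_device failure (a sketch that reads /proc, so Linux-only; on macOS something like lsof -p <pid> would be the equivalent):

import os
import sys
import time

pid = int(sys.argv[1])  # pid of the looping client process

while True:
    n_open = len(os.listdir(f'/proc/{pid}/fd'))  # one entry per open descriptor
    print(f'{time.strftime("%H:%M:%S")} pid {pid}: {n_open} open descriptors')
    time.sleep(5)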

Jared Casey July 1, 2022 at 10:03 PM

I should note that if I do the following, I do not see any issues (the sequence is easy to script; see the sketch after these steps). Also, if I simply freeze/thaw (even for multiple minutes) I do not see any issues.

  1. run simple looping program (either KV loop or query loop, results are the same)

  2. issue kill -s STOP <pid> during sleep

  3. go to node1 and execute rebalance that adds server

  4. once rebalance is complete, issue kill -s CONT <pid>
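A sketch of scripting that sequence, reusing the couchbase-cli commands from the comment below (shown with the server-remove variant; the add variant would run the server-add + rebalance commands instead, and it assumes the CLI blocks until the rebalance completes, otherwise poll before thawing):

import os
import signal
import subprocess
import sys

pid = int(sys.argv[1])        # pid printed by the looping client

os.kill(pid, signal.SIGSTOP)  # step 2: freeze the client during its sleep
subprocess.run([              # step 3: rebalance a server out of the cluster
    '/opt/couchbase/bin/couchbase-cli', 'rebalance',
    '-c', '192.168.33.101', '-u', 'Administrator', '-p', 'password',
    '--server-remove', '192.168.33.102',
], check=True)
os.kill(pid, signal.SIGCONT)  # step 4: thaw once the rebalance is complete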

Jared Casey July 1, 2022 at 10:00 PM
Edited

Test environment:

  • 2 node vagrant setup w/ CBS 7.1.0

  • beer-sample bucket

  • Python SDK 4.0.2

Test steps:

  1. run simple looping program (either KV loop or query loop, results are the same)

  2. issue kill -s STOP <pid> during sleep

  3. go to node1 and execute rebalance that removes server

  4. once rebalance is complete, issue kill -s CONT <pid>

Result:

  • Sometimes I see UnAmbiguousTimeoutException repeatedly, and eventually the program seems to stall, but there is no crash

  • Other times I see the crash below – I think this might be similar to a crash you had seen before?

Simple program

import time
import sys
import os

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.exceptions import AmbiguousTimeoutException, UnAmbiguousTimeoutException


def loop_kv(collection, num=50, delay=3):
    exception_count = 0
    for i in range(num):
        try:
            r = collection.get('21st_amendment_brewery_cafe')
            print(f'loop count {i}, content: {r.content_as[dict]}')
        except (AmbiguousTimeoutException, UnAmbiguousTimeoutException):
            exc_info = sys.exc_info()
            exception_count += 1
            print(f'loop count: {i}, exception count: {exception_count} Exception info: {exc_info[1]}')
        print(f'Sleeping for {delay} seconds...')
        time.sleep(delay)


def loop_query(cluster, num=50, delay=3):
    q_str = "SELECT * FROM `beer-sample` WHERE type='brewery' LIMIT 10"
    exception_count = 0
    for i in range(num):
        try:
            r = cluster.query(q_str).execute()
            print(f'loop count: {i}, Found {len(r)} records')
        except (AmbiguousTimeoutException, UnAmbiguousTimeoutException):
            exc_info = sys.exc_info()
            exception_count += 1
            print(f'loop count: {i}, exception count: {exception_count} Exception info: {exc_info[1]}')
        print(f'Sleeping for {delay} seconds...')
        time.sleep(delay)


if __name__ == "__main__":
    print(f'Process id: {os.getpid()}')
    cluster = Cluster.connect("couchbase://192.168.33.101",
                              ClusterOptions(PasswordAuthenticator("Administrator", "password")))
    bucket = cluster.bucket('beer-sample')
    collection = bucket.default_collection()
    loop_kv(collection)
    # loop_query(cluster)

Crash output:

libc++abi: terminating with uncaught exception of type std::__1::system_error: random_device failed to open /dev/urandom: Too many open files
Process 87397 stopped
* thread #3, stop reason = signal SIGABRT
    frame #0: 0x00007fff2031492e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff2031492e <+10>: jae    0x7fff20314938 ; <+20>
    0x7fff20314930 <+12>: movq   %rax, %rdi
    0x7fff20314933 <+15>: jmp    0x7fff2030ead9 ; cerror_nocancel
    0x7fff20314938 <+20>: retq
Target 0: (Python) stopped.
(lldb) bt
* thread #3, stop reason = signal SIGABRT
  * frame #0: 0x00007fff2031492e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff203435bd libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007fff20298406 libsystem_c.dylib`abort + 125
    frame #3: 0x00007fff20306ef2 libc++abi.dylib`abort_message + 241
    frame #4: 0x00007fff202f85e5 libc++abi.dylib`demangling_terminate_handler() + 242
    frame #5: 0x00007fff201f1595 libobjc.A.dylib`_objc_terminate() + 104
    frame #6: 0x00007fff20306307 libc++abi.dylib`std::__terminate(void (*)()) + 8
    frame #7: 0x00007fff203062a9 libc++abi.dylib`std::terminate() + 41
    frame #8: 0x00007fff202ad0c0 libc++.1.dylib`std::rethrow_exception(std::exception_ptr) + 17
    frame #9: 0x0000000150026214 pycbc_core.so`asio::detail::thread_info_base::rethrow_pending_exception(this=0x0000700008d7fe20) at thread_info_base.hpp:233:7
    frame #10: 0x0000000150025a99 pycbc_core.so`asio::detail::scheduler::do_run_one(this=0x00000001011631a0, lock=0x0000700008d7fde8, this_thread=0x0000700008d7fe20, ec=0x0000700008d7fee8) at scheduler.ipp:492:21
    frame #11: 0x000000015002567f pycbc_core.so`asio::detail::scheduler::run(this=0x00000001011631a0, ec=0x0000700008d7fee8) at scheduler.ipp:209:10
    frame #12: 0x000000015018fd2e pycbc_core.so`asio::io_context::run(this=0x0000000101432fa0) at io_context.ipp:62:24
    frame #13: 0x000000015018fcf8 pycbc_core.so`connection::connection(this=0x0000000101704198)::'lambda'()::operator()() const at client.hxx:243:48
    frame #14: 0x000000015018fc9d pycbc_core.so`decltype(__f=0x0000000101704198)::'lambda'()>(fp)()) std::__1::__invoke<connection::connection(int)::'lambda'()>(connection::connection(int)::'lambda'()&&) at type_traits:3747:1
    frame #15: 0x000000015018fc05 pycbc_core.so`void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, connection::connection(int)::'lambda'()>(__t=size=2, (null)=__tuple_indices<> @ 0x0000700008d7ff58)::'lambda'()>&, std::__1::__tuple_indices<>) at thread:280:5
    frame #16: 0x000000015018f48f pycbc_core.so`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, connection::connection(int)::'lambda'()> >(__vp=0x0000000101704190) at thread:291:5
    frame #17: 0x00007fff203438fc libsystem_pthread.dylib`_pthread_start + 224
    frame #18: 0x00007fff2033f443 libsystem_pthread.dylib`thread_start + 15

Rebalance server out of cluster:

/opt/couchbase/bin/couchbase-cli rebalance -c 192.168.33.101 -u Administrator -p password --server-remove 192.168.33.102

Rebalance server into cluster:

$ /opt/couchbase/bin/couchbase-cli server-add -c https://192.168.33.101 --username Administrator --password password --server-add https://192.168.33.102 --server-add-username Administrator --server-add-password password --services data,query,fts --no-ssl-verify
$ /opt/couchbase/bin/couchbase-cli rebalance -c 192.168.33.101:8091 --username Administrator --password password
Fixed
Created August 19, 2021 at 8:54 PM
Updated November 3, 2022 at 5:39 PM
Resolved October 20, 2022 at 9:59 PM