Support for Serverless Execution Environments
Activity
Jared Casey October 20, 2022 at 9:59 PM
Serverless execution environments are supported.
Jared Casey October 20, 2022 at 12:35 AM (edited)
Update 10/19/2022
Added logs to test various scenarios:
Corefile:
.:53 {
    forward . 8.8.8.8 9.9.9.9
    log
    errors
}

example.org:53 {
    file /root/cb.example.org
    log
}
cb.example.org zone file:
$ORIGIN example.org.
$TTL 3600
@ IN SOA ns.example.org. user.example.org. (
2020080302 ;Serial
7200 ;Refresh
3600 ;Retry
1209600 ;Expire
3600) ;Negative response caching TTL
@ IN NS ns.example.org.
ns IN CNAME @
_couchbase._tcp 10800 IN SRV 10 10 11210 ec2-34-212-171-20.us-west-2.compute.amazonaws.com.
CoreDNS Docker command:
docker run --rm --name coredns -v ~/path/to/setup/files/:/root -p 53:53/udp coredns/coredns -conf /root/Corefile
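Before pointing the SDK at it, a quick way to confirm that CoreDNS is actually serving the SRV record is to query it directly. A minimal sketch, assuming dnspython is installed (it is not part of the test script) and the container is listening on 127.0.0.1:53:

import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['127.0.0.1']  # the local CoreDNS container

# _couchbase._tcp.example.org is the record published by the zone file above
answer = resolver.resolve('_couchbase._tcp.example.org', 'SRV')
for record in answer:
    print(f'priority={record.priority} weight={record.weight} '
          f'port={record.port} target={record.target}')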
Local CoreDNS steps:
Set up EC2 instance(s) running Couchbase (a local Docker image could be used, but EC2 works as well and is easier for multi-node setups, since Vagrant + Mac M1 is not an easy combo IMHO).
Start CoreDNS (see Docker command above)
Start Python script (see the connection sketch after this list)
Run for a while
Stop CoreDNS
Edit cb.example.org zone file SRV record to use google.com (can use another domain...)
Start CoreDNS (see Docker command above)
ssh into EC2 instance and run sudo systemctl stop couchbase-server
Should start to see timeouts from KV ops
Stop CoreDNS
Edit cb.example.org zone file SRV record back to the EC2 instance IP
ssh into the EC2 instance and run sudo systemctl start couchbase-server
Start CoreDNS (see Docker command above)
Should see client recover and docs printed out
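For reference, the scenario above depends on the client bootstrapping via DNS SRV, so the test script's connection string uses the zone's domain rather than a node address. A minimal sketch, assuming the machine's resolver points at the local CoreDNS and the usual Administrator/password credentials (bucket and document names are placeholders borrowed from the beer-sample example later in this ticket):

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# couchbase://example.org triggers an SRV lookup of _couchbase._tcp.example.org,
# which the local CoreDNS answers with the EC2 node from the zone file above.
cluster = Cluster.connect("couchbase://example.org",
                          ClusterOptions(PasswordAuthenticator("Administrator", "password")))
collection = cluster.bucket("beer-sample").default_collection()
print(collection.get("21st_amendment_brewery_cafe").content_as[dict])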
Local CoreDNS freeze/thaw steps:
Set up EC2 instances running Couchbase
Edit cb.example.org zone file SRV record to point to one of the EC2 instance IPs
Start CoreDNS (see Docker command above)
Start Python script
Run for a while
run kill -s STOP <pid> (a Python freeze/thaw helper is sketched after this list)
Stop CoreDNS
Edit cb.example.org zone file SRV record to use the other EC2 instance IP
ssh into the EC2 instance (from step 3) and run sudo systemctl stop couchbase-server
Start CoreDNS (see Docker command above)
run kill -s CONT <pid>
Should see client recover and docs printed out
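The STOP/CONT steps can also be driven from Python instead of the shell; a minimal helper sketch (the pid is whatever the test script prints at startup, and the pause length is arbitrary):

import os
import signal
import time

def freeze_thaw(pid, pause_seconds=120):
    # Equivalent of `kill -s STOP <pid>`: suspend the looping test script.
    os.kill(pid, signal.SIGSTOP)
    # Window in which CoreDNS is stopped, the SRV record edited, Couchbase stopped, etc.
    time.sleep(pause_seconds)
    # Equivalent of `kill -s CONT <pid>`: resume and watch for recovery.
    os.kill(pid, signal.SIGCONT)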
Rebalance testing steps are listed earlier in the ticket.
David Kelly July 19, 2022 at 2:46 PM
@Jared Casey - I was wrong – the one I saw was slightly different – Too many open files.
It seems like perhaps we are just trying to (re)connect in a tight loop (in the client), creating a new socket each time, and end up in a situation where there are a large number of sockets (all the connection attempts) in time_wait state. Perhaps looking at what netstat or similar says when this happens will verify? If so, then maybe the issue is in the client's logic around reconnection and making lots of sockets.
What I saw was that /dev/urandom ran out of random bytes. Each time we generate a random value in the client, we create a new device on the stack (which opens /dev/urandom and then closes it). But since it closes the file (not a socket), we don't build up file descriptors in time_wait, so it is probably not the same issue.
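If it helps, a rough Python equivalent of that netstat check is sketched below; psutil is an extra dependency (not used by the test script) and may need elevated privileges on some platforms to see all sockets:

from collections import Counter

import psutil

def summarize_kv_sockets(port=11210):
    # System-wide view, since TIME_WAIT sockets are no longer attributed to a pid;
    # filtered to the KV port to approximate `netstat | grep 11210`.
    conns = [c for c in psutil.net_connections(kind='tcp')
             if c.raddr and c.raddr.port == port]
    print(Counter(c.status for c in conns))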
Jared Casey July 1, 2022 at 10:03 PM
I should note that if I do the following, I do not see any issues. Also, if I simply freeze/thaw (even for multiple minutes) I do not see any issues.
run simple looping program (either KV loop or query loop, results are the same)
issue kill -s STOP <pid> during sleep
go to node1 and execute rebalance that adds server
once rebalance is complete, issue kill -s CONT <pid>
Jared Casey July 1, 2022 at 10:00 PM (edited)
Test environment:
2 node vagrant setup w/ CBS 7.1.0
beer-sample bucket
Python SDK 4.0.2
Test steps:
run simple looping program (either KV loop or query loop, results are the same)
issue kill -s STOP <pid> during sleep
go to node1 and execute rebalance that removes server
once rebalance is complete, issue kill -s CONT <pid>
Result:
Sometimes I see UnAmbiguousTimeoutException raised repeatedly, and then eventually the program seems to stall, but with no crash
Other times I see the crash below. @David Kelly - I think this might be similar to a crash you had seen before?
Simple program:
import time
import sys
import os

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.exceptions import AmbiguousTimeoutException, UnAmbiguousTimeoutException


def loop_kv(collection, num=50, delay=3):
    exception_count = 0
    for i in range(num):
        try:
            r = collection.get('21st_amendment_brewery_cafe')
            print(f'loop count {i}, content: {r.content_as[dict]}')
        except (AmbiguousTimeoutException, UnAmbiguousTimeoutException):
            exc_info = sys.exc_info()
            exception_count += 1
            print(f'loop count: {i}, exception count: {exception_count} Exception info: {exc_info[1]}')
        print(f'Sleeping for {delay} seconds...')
        time.sleep(delay)


def loop_query(cluster, num=50, delay=3):
    q_str = "SELECT * FROM `beer-sample` WHERE type='brewery' LIMIT 10"
    exception_count = 0
    for i in range(num):
        try:
            r = cluster.query(q_str).execute()
            print(f'loop count: {i}, Found {len(r)} records')
        except (AmbiguousTimeoutException, UnAmbiguousTimeoutException):
            exc_info = sys.exc_info()
            exception_count += 1
            print(f'loop count: {i}, exception count: {exception_count} Exception info: {exc_info[1]}')
        print(f'Sleeping for {delay} seconds...')
        time.sleep(delay)


if __name__ == "__main__":
    print(f'Process id: {os.getpid()}')
    cluster = Cluster.connect("couchbase://192.168.33.101",
                              ClusterOptions(PasswordAuthenticator("Administrator", "password")))
    bucket = cluster.bucket('beer-sample')
    collection = bucket.default_collection()
    loop_kv(collection)
    # loop_query(cluster)
Crash output:
libc++abi: terminating with uncaught exception of type std::__1::system_error: random_device failed to open /dev/urandom: Too many open files
Process 87397 stopped
* thread #3, stop reason = signal SIGABRT
frame #0: 0x00007fff2031492e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
-> 0x7fff2031492e <+10>: jae 0x7fff20314938 ; <+20>
0x7fff20314930 <+12>: movq %rax, %rdi
0x7fff20314933 <+15>: jmp 0x7fff2030ead9 ; cerror_nocancel
0x7fff20314938 <+20>: retq
Target 0: (Python) stopped.
(lldb) bt
* thread #3, stop reason = signal SIGABRT
* frame #0: 0x00007fff2031492e libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff203435bd libsystem_pthread.dylib`pthread_kill + 263
frame #2: 0x00007fff20298406 libsystem_c.dylib`abort + 125
frame #3: 0x00007fff20306ef2 libc++abi.dylib`abort_message + 241
frame #4: 0x00007fff202f85e5 libc++abi.dylib`demangling_terminate_handler() + 242
frame #5: 0x00007fff201f1595 libobjc.A.dylib`_objc_terminate() + 104
frame #6: 0x00007fff20306307 libc++abi.dylib`std::__terminate(void (*)()) + 8
frame #7: 0x00007fff203062a9 libc++abi.dylib`std::terminate() + 41
frame #8: 0x00007fff202ad0c0 libc++.1.dylib`std::rethrow_exception(std::exception_ptr) + 17
frame #9: 0x0000000150026214 pycbc_core.so`asio::detail::thread_info_base::rethrow_pending_exception(this=0x0000700008d7fe20) at thread_info_base.hpp:233:7
frame #10: 0x0000000150025a99 pycbc_core.so`asio::detail::scheduler::do_run_one(this=0x00000001011631a0, lock=0x0000700008d7fde8, this_thread=0x0000700008d7fe20, ec=0x0000700008d7fee8) at scheduler.ipp:492:21
frame #11: 0x000000015002567f pycbc_core.so`asio::detail::scheduler::run(this=0x00000001011631a0, ec=0x0000700008d7fee8) at scheduler.ipp:209:10
frame #12: 0x000000015018fd2e pycbc_core.so`asio::io_context::run(this=0x0000000101432fa0) at io_context.ipp:62:24
frame #13: 0x000000015018fcf8 pycbc_core.so`connection::connection(this=0x0000000101704198)::'lambda'()::operator()() const at client.hxx:243:48
frame #14: 0x000000015018fc9d pycbc_core.so`decltype(__f=0x0000000101704198)::'lambda'()>(fp)()) std::__1::__invoke<connection::connection(int)::'lambda'()>(connection::connection(int)::'lambda'()&&) at type_traits:3747:1
frame #15: 0x000000015018fc05 pycbc_core.so`void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, connection::connection(int)::'lambda'()>(__t=size=2, (null)=__tuple_indices<> @ 0x0000700008d7ff58)::'lambda'()>&, std::__1::__tuple_indices<>) at thread:280:5
frame #16: 0x000000015018f48f pycbc_core.so`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, connection::connection(int)::'lambda'()> >(__vp=0x0000000101704190) at thread:291:5
frame #17: 0x00007fff203438fc libsystem_pthread.dylib`_pthread_start + 224
frame #18: 0x00007fff2033f443 libsystem_pthread.dylib`thread_start + 15
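Since the abort is ultimately random_device failing with "Too many open files", it may also be worth watching the descriptor count against the process limit while the loop runs. A minimal sketch that could be dropped into the test script (psutil is an assumption, not part of the original program):

import os
import resource

import psutil

def report_fd_usage():
    # Open-descriptor count for this process vs. its soft/hard RLIMIT_NOFILE.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    num_fds = psutil.Process(os.getpid()).num_fds()  # POSIX only
    print(f'{num_fds} open fds (soft limit {soft}, hard limit {hard})')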
Rebalance server out of cluster:
$ /opt/couchbase/bin/couchbase-cli rebalance -c 192.168.33.101 -u Administrator -p password --server-remove 192.168.33.102
Rebalance server into cluster:
$ /opt/couchbase/bin/couchbase-cli server-add -c https://192.168.33.101 --username Administrator --password password --server-add https://192.168.33.102 --server-add-username Administrator --server-add-password password --services data,query,fts --no-ssl-verify
$ /opt/couchbase/bin/couchbase-cli rebalance -c 192.168.33.101:8091 --username Administrator --password password
Refer to https://couchbasecloud.atlassian.net/browse/CBD-3502