Details
-
Bug
-
Resolution: Unresolved
-
Critical
-
7.6.0
-
7.6.0-2164-enterprise
-
Untriaged
-
Linux x86_64
-
0
-
No
Description
QE Test |
./testrunner -i /data/workspace/debian-p0-eventing-vset00-00-rebalance_bucket_op_with_cron_timers/testexec.124747.ini -p get-cbcollect-info=True,GROUP=bucket_op_with_cron_timers,skip_log_scan=True,default_bucket=False,sirius_url=http://172.23.120.103:4000 -t eventing.eventing_rebalance.EventingRebalance.test_killing_eventing_processes_during_eventing_rebalance,doc-per-day=20,dataset=default,nodes_init=5,services_init=kv-kv-eventing-eventing-index:n1ql,groups=simple,reset_services=True,handler_code=bucket_op_with_cron_timers,GROUP=bucket_op_with_cron_timers,get-cbcollect-info=True,skip_log_scan=True,default_bucket=False,sirius_url=http://172.23.120.103:4000
|
Test Steps
- Create 5 node having following cluster configuration -
Service Number of nodes KV 2 Eventing 2 Index:N1QL 1 - Create source, metadata and destination bucket.
- Create Eventing function having following definition -
Function_949901511_test_killing_eventing_processes_during_eventing_rebalance with
bindings: {'buckets': [{'alias': 'dst_bucket', 'bucket_name': 'dst_bucket', 'access': 'rw'}], 'metadata_bucket': 'metadata', 'source_bucket': 'src_bucket', 'curl': []} and
settings: {'checkpoint_interval': 20000, 'cleanup_timers': False, 'dcp_stream_boundary': 'everything', 'deployment_status': False, 'description': 'Sample Description', 'log_level': 'INFO', 'skip_timer_threshold': 86400, 'sock_batch_size': 1, 'tick_duration': 5000, 'timer_processing_tick_interval': 500, 'timer_worker_pool_size': 3, 'worker_count': 3, 'processing_status': False, 'cpp_worker_thread_count': 1, 'execution_timeout': 20, 'data_chan_size': 10000, 'worker_queue_cap': 100000, 'use_memory_manager': True, 'deadline_timeout': 21, 'timer_storage_chan_size': 10000, 'dcp_gen_chan_size': 10000}
2024-02-27 04:52:09 | INFO | MainProcess | test_thread | [eventing_base.deploy_function]
function OnUpdate(doc, meta) {
log("Doc created/updated", meta.id);
var expiry = new Date();
expiry.setSeconds(expiry.getSeconds() + 3);
var context = {docID : meta.id, random_text : "e6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh07Aumoe6cZZGHuh07Aumoe6cZZGHuh07Aumoe6"};
createTimer(timerCallback, expiry, meta.id, context);
}
function timerCallback(context) {
var result= couchbase.insert(dst_bucket,{"id":context.docID},context.random_text);
log(result);
}
function OnDelete(meta) {
var expiry = new Date();
expiry.setSeconds(expiry.getSeconds() + 30);
var context = {docID : meta.id };
createTimer(NDtimerCallback, expiry, meta.id, context);
}
function NDtimerCallback(context) {
var meta={"id":context.docID}
var result = couchbase.delete(dst_bucket,meta);
log(result);
}
- Deploy Eventing function created in previous step.
- Add single Eventing node to the cluster.
- While Eventing rebalance operation is in progress, kill Eventing processes on existing Eventing nodes in the cluster.
Observation
Eventing consumer process crashed at this point in the test on Eventing node 172.23.217.255.
2024-02-26T20:53:10.851-08:00 [Warn] client::Serve [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Failed to read from stdout pipe, err: EOF |
2024-02-26T20:53:10.851-08:00 [Error] Consumer::feedbackReadMessageLoop [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Read from client socket failed, err: EOF |
2024-02-26T20:53:10.851-08:00 [Info] eventing-consumer [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] |
2024-02-26T20:53:10.851-08:00 [Info] eventing-consumer [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] == Minidump location: /opt/couchbase/var/lib/couchbase/crash/8da0e2c9-10ca-4f7e-ebe7f183-aaeda173.dmp Status: 1 == |
2024-02-26T20:53:10.851-08:00 [Warn] client::Serve [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Failed to read from stderr pipe, err: EOF |
2024-02-26T20:53:10.851-08:00 [Error] Consumer::readMessageLoop [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Read from client socket failed, err: EOF |
2024-02-26T20:53:10.851-08:00 [Warn] client::Serve [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Exiting c++ worker with error: signal: floating point exception |
Rebalance operation eventually fails with 20 minutes timeout error.
2024-02-26T21:13:29.825-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.218.15) - Rebalance exited with reason {service_rebalance_failed,eventing, |
{worker_died,
|
{'EXIT',<0.18294.0>, |
{task_failed,rebalance,
|
{service_error,
|
<<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}. |
Rebalance Operation Id = dee3527526f8e925ffd7d3c229fb666a
|
BT of the crash
(gdb) bt
|
#0 0x000000000043315f in AppWorker::RouteMessageWithResponse (this=0x562b40 <AppWorker::GetAppWorker()::worker>, worker_msg=...) |
at /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/client.cc:870 |
#1 0x0000000000434240 in AppWorker::ParseValidChunk (this=<optimized out>, stream=<optimized out>, nread=<optimized out>, buf=<optimized out>) |
at /opt/gcc-13.2.0/include/c++/13.2.0/bits/unique_ptr.h:197 |
#2 0x00007f09bb6feff9 in uv__read (stream=stream@entry=0x6ea160 <AppWorker::GetAppWorker()::worker+1603104>) at src/unix/stream.c:1249 |
#3 0x00007f09bb6ffbd0 in uv__stream_io (loop=<optimized out>, w=0x6ea1e8 <AppWorker::GetAppWorker()::worker+1603240>, events=1) at src/unix/stream.c:1316 |
#4 0x00007f09bb7048a0 in uv__io_poll (loop=loop@entry=0x6e9c78 <AppWorker::GetAppWorker()::worker+1601848>, timeout=-1) at src/unix/linux-core.c:400 |
#5 0x00007f09bb6f67f4 in uv_run (loop=0x6e9c78 <AppWorker::GetAppWorker()::worker+1601848>, mode=UV_RUN_DEFAULT) at src/unix/core.c:368 |
#6 0x0000000000427c6a in AppWorker::StartMainUVLoop (this=0x562b40 <AppWorker::GetAppWorker()::worker>) |
at /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/client.cc:995 |
#7 AppWorker::StartMainUVLoop (this=0x562b40 <AppWorker::GetAppWorker()::worker>) |
at /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/client.cc:993 |
#8 0x00007f09baf99aa3 in std::execute_native_thread_routine (__p=0xf73900) at /tmp/deploy/objdir/../gcc-13.2.0/libstdc++-v3/src/c++11/thread.cc:104 |
#9 0x00007f09bacf1fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486 |
#10 0x00007f09bac2306f in ?? () from /lib/x86_64-linux-gnu/libc.so.6 |
#11 0x0000000000000000 in ?? () |
Note
Reran the test multiple times on 2164 build but saw this crash only once.
Same test was passing consistently without any issues for last 2 months.
Similar to MB-60934, seems like day 0 issue to me that we have hit now given that no related changes have been made recently in this area https://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.6.0&fromBuild=2100&toVersion=7.6.0&toBuild=2164&f_analytics-dcp-client=off&f_asterixdb=off&f_backup=off&f_cbas=off&f_cbas-core=off&f_cbft=off&f_cbftx=off&f_cbgt=off&f_couchbase-cli=off&f_couchdb=off&f_eventing=on&f_faiss=off&f_indexing=off&f_kv_engine=off&f_n1fty=off&f_ns_server=off&f_plasma=off&f_platform=off&f_product-metadata=off&f_query=off&f_query-ee=off&f_testrunner=off&f_tlm=off&f_voltron=off.
I have attached logs from all 3 Eventing nodes involved in the test.