Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-60952

Eventing consumer crashed in AppWorker::RouteMessageWithResponse during rebalance operation

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • 7.6.2
    • 7.6.0
    • eventing
    • 7.6.0-2164-enterprise
    • Untriaged
    • Linux x86_64
    • 0
    • No

    Description

      QE Test

      ./testrunner -i /data/workspace/debian-p0-eventing-vset00-00-rebalance_bucket_op_with_cron_timers/testexec.124747.ini -p get-cbcollect-info=True,GROUP=bucket_op_with_cron_timers,skip_log_scan=True,default_bucket=False,sirius_url=http://172.23.120.103:4000 -t eventing.eventing_rebalance.EventingRebalance.test_killing_eventing_processes_during_eventing_rebalance,doc-per-day=20,dataset=default,nodes_init=5,services_init=kv-kv-eventing-eventing-index:n1ql,groups=simple,reset_services=True,handler_code=bucket_op_with_cron_timers,GROUP=bucket_op_with_cron_timers,get-cbcollect-info=True,skip_log_scan=True,default_bucket=False,sirius_url=http://172.23.120.103:4000
      

      Test Steps

      1. Create 5 node having following cluster configuration -
        Service Number of nodes
        KV 2
        Eventing 2
        Index:N1QL 1
      2. Create source, metadata and destination bucket.
      3. Create Eventing function having following definition -

        Function_949901511_test_killing_eventing_processes_during_eventing_rebalance with 
        bindings: {'buckets': [{'alias': 'dst_bucket', 'bucket_name': 'dst_bucket', 'access': 'rw'}], 'metadata_bucket': 'metadata', 'source_bucket': 'src_bucket', 'curl': []} and 
        settings: {'checkpoint_interval': 20000, 'cleanup_timers': False, 'dcp_stream_boundary': 'everything', 'deployment_status': False, 'description': 'Sample Description', 'log_level': 'INFO', 'skip_timer_threshold': 86400, 'sock_batch_size': 1, 'tick_duration': 5000, 'timer_processing_tick_interval': 500, 'timer_worker_pool_size': 3, 'worker_count': 3, 'processing_status': False, 'cpp_worker_thread_count': 1, 'execution_timeout': 20, 'data_chan_size': 10000, 'worker_queue_cap': 100000, 'use_memory_manager': True, 'deadline_timeout': 21, 'timer_storage_chan_size': 10000, 'dcp_gen_chan_size': 10000}
        2024-02-27 04:52:09 | INFO | MainProcess | test_thread | [eventing_base.deploy_function] 
        function OnUpdate(doc, meta) {
            log("Doc created/updated", meta.id);
            var expiry = new Date();
            expiry.setSeconds(expiry.getSeconds() + 3);
         
            var context = {docID : meta.id, random_text : "e6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh0R7Aumoe6cZZGHuh07Aumoe6cZZGHuh07Aumoe6cZZGHuh07Aumoe6"};
            createTimer(timerCallback,  expiry, meta.id, context);
        }
         
        function timerCallback(context) {
            var result= couchbase.insert(dst_bucket,{"id":context.docID},context.random_text);
            log(result);
        }
         
        function OnDelete(meta) {
            var expiry = new Date();
            expiry.setSeconds(expiry.getSeconds() + 30);
         
            var context = {docID : meta.id };
            createTimer(NDtimerCallback,  expiry, meta.id, context);
        }
         
        function NDtimerCallback(context) {
            var meta={"id":context.docID}
            var result = couchbase.delete(dst_bucket,meta);
            log(result);
        }
        

      4. Deploy Eventing function created in previous step.
      5. Add single Eventing node to the cluster.
      6. While Eventing rebalance operation is in progress, kill Eventing processes on existing Eventing nodes in the cluster.

      Observation

      Eventing consumer process crashed at this point in the test on Eventing node 172.23.217.255.

      2024-02-26T20:53:10.851-08:00 [Warn] client::Serve [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Failed to read from stdout pipe, err: EOF
      2024-02-26T20:53:10.851-08:00 [Error] Consumer::feedbackReadMessageLoop [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Read from client socket failed, err: EOF
      2024-02-26T20:53:10.851-08:00 [Info] eventing-consumer [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] 
      2024-02-26T20:53:10.851-08:00 [Info] eventing-consumer [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] == Minidump location: /opt/couchbase/var/lib/couchbase/crash/8da0e2c9-10ca-4f7e-ebe7f183-aaeda173.dmp Status: 1 ==
      2024-02-26T20:53:10.851-08:00 [Warn] client::Serve [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Failed to read from stderr pipe, err: EOF
      2024-02-26T20:53:10.851-08:00 [Error] Consumer::readMessageLoop [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Read from client socket failed, err: EOF
      2024-02-26T20:53:10.851-08:00 [Warn] client::Serve [worker_src_bucket/_default/Function_949901511_test_killing_eventing_processes_during_eventing_rebalance_1:/tmp/127.0.0.1:8091_1_1785536404.sock:22441] Exiting c++ worker with error: signal: floating point exception
      

      Rebalance operation eventually fails with 20 minutes timeout error.

      2024-02-26T21:13:29.825-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.218.15) - Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.18294.0>,
                                      {task_failed,rebalance,
                                       {service_error,
                                        <<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
      Rebalance Operation Id = dee3527526f8e925ffd7d3c229fb666a
      

      BT of the crash

      (gdb) bt
      #0  0x000000000043315f in AppWorker::RouteMessageWithResponse (this=0x562b40 <AppWorker::GetAppWorker()::worker>, worker_msg=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/client.cc:870
      #1  0x0000000000434240 in AppWorker::ParseValidChunk (this=<optimized out>, stream=<optimized out>, nread=<optimized out>, buf=<optimized out>)
          at /opt/gcc-13.2.0/include/c++/13.2.0/bits/unique_ptr.h:197
      #2  0x00007f09bb6feff9 in uv__read (stream=stream@entry=0x6ea160 <AppWorker::GetAppWorker()::worker+1603104>) at src/unix/stream.c:1249
      #3  0x00007f09bb6ffbd0 in uv__stream_io (loop=<optimized out>, w=0x6ea1e8 <AppWorker::GetAppWorker()::worker+1603240>, events=1) at src/unix/stream.c:1316
      #4  0x00007f09bb7048a0 in uv__io_poll (loop=loop@entry=0x6e9c78 <AppWorker::GetAppWorker()::worker+1601848>, timeout=-1) at src/unix/linux-core.c:400
      #5  0x00007f09bb6f67f4 in uv_run (loop=0x6e9c78 <AppWorker::GetAppWorker()::worker+1601848>, mode=UV_RUN_DEFAULT) at src/unix/core.c:368
      #6  0x0000000000427c6a in AppWorker::StartMainUVLoop (this=0x562b40 <AppWorker::GetAppWorker()::worker>)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/client.cc:995
      #7  AppWorker::StartMainUVLoop (this=0x562b40 <AppWorker::GetAppWorker()::worker>)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/client.cc:993
      #8  0x00007f09baf99aa3 in std::execute_native_thread_routine (__p=0xf73900) at /tmp/deploy/objdir/../gcc-13.2.0/libstdc++-v3/src/c++11/thread.cc:104
      #9  0x00007f09bacf1fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
      #10 0x00007f09bac2306f in ?? () from /lib/x86_64-linux-gnu/libc.so.6
      #11 0x0000000000000000 in ?? ()
      

      Note

      Reran the test multiple times on 2164 build but saw this crash only once.
      Same test was passing consistently without any issues for last 2 months.
      Similar to MB-60934, seems like day 0 issue to me that we have hit now given that no related changes have been made recently in this area https://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.6.0&fromBuild=2100&toVersion=7.6.0&toBuild=2164&f_analytics-dcp-client=off&f_asterixdb=off&f_backup=off&f_cbas=off&f_cbas-core=off&f_cbft=off&f_cbftx=off&f_cbgt=off&f_couchbase-cli=off&f_couchdb=off&f_eventing=on&f_faiss=off&f_indexing=off&f_kv_engine=off&f_n1fty=off&f_ns_server=off&f_plasma=off&f_platform=off&f_product-metadata=off&f_query=off&f_query-ee=off&f_testrunner=off&f_tlm=off&f_voltron=off.

      I have attached logs from all 3 Eventing nodes involved in the test.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            ankit.prabhu Ankit Prabhu
            sujay.gad Sujay Gad
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty