  1. Couchbase Server
  2. MB-59479

Indexer becomes unresponsive after some time with the node running in a Docker container

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 7.2.2
    • storage-engine
    • OS: Arch Linux x86_64
      Host: 21CB007CUK ThinkPad X1 Carbon Gen 10
      Kernel: 6.5.9-arch2-1
      CPU: 12th Gen Intel i7-1260P (16) @ 4.700GHz
      Memory: 31798MiB
    • Untriaged
    • Linux x86_64
    • 0
    • Unknown

    Description

      What is the issue?

      When developing Couchbase Capella locally, we run the single-node control plane database and services in Docker containers. I'm one of the few people on the control plane team who run Linux, and I had no issues with this local development setup until very recently, when I switched to a new Linux laptop with better specs. At some point after we configure the control plane database, the indexer becomes unresponsive, which makes the development setup unusable because queries stop working.

      For additional context, we use a CLI utility called cbclocal to start and manage the local development setup. Starting the setup runs the following steps (terminal output from a sample successful run):

      ✅  Verifying Environment ...
      ✅  Starting ngrok (ngrok http 8081 --scheme http)
      ✅  Starting build couchbase:local (docker build -t couchbase:local  -f ./cmd/cbclocal/Dockerfile .) ...
      ✅  Starting db ...
      ✅  Finished db readiness check (8s)
      ✅  Starting couchbase_config.sh (docker exec cbclocal_db /bin/bash -c './couchbase_config.sh') ...
      ✅  Finished Generate env file (2.31s)
      ✅  Starting building index-manager (docker build -t index-manager -f ./docker/index-manager/Dockerfile ./docker/index-manager) ...
      ✅  Starting index-manager ...
      ✅  Finished waiting for indexes (38.01s)
      ✅  Starting building cbclocalrunner (docker build -t cbclocalrunner ./docker/cbclocalrunner) ...
      ✅  Finished pulling localstack image (1.43s)
      ✅  Finished performing localstack readiness check (5.22s)
      ✅  Finished performing SQS readiness check (12.18s)
      ✅  Starting cp-api ...
      ✅  Finished cp-api readiness check (6s)
      ✅  Starting cp-internal-api ...
      ✅  Finished cp-internal-api readiness check (4s)
      ✅  Starting cp-jobs ...
      ✅  Finished cp-jobs readiness check (8s)
      ✅  Finished pull node image (990ms)
      ✅  Starting cp-ui-v2 ...
      ✅  Finished ui readiness check (8.55s) 

      As one of the first steps, it starts the database container, which is based on the official Couchbase Docker image. We are not doing anything special with it apart from adding the server setup script. See the Dockerfile below.

      FROM couchbase/server:enterprise-7.2.2
       
      ARG VBUCKETS=1024
      ENV COUCHBASE_NUM_VBUCKETS=$VBUCKETS
       
      COPY cmd/cbclocal/scripts/couchbase_config.sh /
       
      ENTRYPOINT ["/entrypoint.sh"]
      CMD ["couchbase-server"]
       
      EXPOSE 8091 8092 8093 8094 8095 8096 11207 11210 11211 18091 18092 18093 18094 18095 18096
      VOLUME /opt/couchbase/var

      Then we configure the database using the couchbase_config.sh script, which is also very basic:

      #!/bin/bash -x
       
      couchbase-cli cluster-init -c localhost \
        --cluster-username Administrator \
        --cluster-password password \
        --services data,index,query \
        --cluster-ramsize 512 \
        --cluster-index-ramsize 256
       
      couchbase-cli user-manage -c localhost:8091 \
        -u Administrator \
        -p password \
        --set --rbac-username cpapi --rbac-password password --roles admin --auth-domain local
       
      bucket_create() {
        couchbase-cli bucket-create -c localhost:8091 \
          -u Administrator \
          -p password \
          --bucket $1 --bucket-type couchbase --bucket-ramsize 128 --bucket-replica 0 --wait
      }
       
      primary_index() {
        bucketName=$1
        url="http://localhost:8093/query/service?statement=create%20primary%20index%20on%20$bucketName"
       
        # Sometimes the bucket might not be ready yet and we'll fail to connect. So re-try if we see a failure.
        while true; do
          curl -X POST -v -u Administrator:password $url
          if [[ $? -eq 0 ]]; then
            break
          fi
        done
      }
       
      bucket_create auditevents
      bucket_create cpapi
      bucket_create jobs
      bucket_create notifications
       
      primary_index cpapi
      primary_index notifications
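
The retry loop in primary_index only checks curl's exit code and retries without a pause. A more defensive variant could look like the sketch below. It is not part of the actual couchbase_config.sh: the names QUERY_URL, MAX_ATTEMPTS, RETRY_DELAY, run_statement and primary_index_with_retry are hypothetical, and treating every status other than 200 as retryable is an assumption.

```shell
#!/bin/bash
# Hypothetical variant of primary_index: bounded retries, a pause between
# attempts, and a check of the HTTP status instead of only curl's exit code.
QUERY_URL="${QUERY_URL:-http://localhost:8093/query/service}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-30}"
RETRY_DELAY="${RETRY_DELAY:-2}"

# Send one statement to the query service and print only the HTTP status.
# curl prints 000 when it cannot connect at all.
run_statement() {
  curl -s -o /dev/null -w '%{http_code}' \
    -u Administrator:password \
    --data-urlencode "statement=$1" \
    "$QUERY_URL"
}

# Retry the CREATE PRIMARY INDEX statement until it returns HTTP 200
# or MAX_ATTEMPTS is reached. Returns non-zero if it never succeeds.
primary_index_with_retry() {
  local bucket="$1" attempt=1 status
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    status="$(run_statement "create primary index on $bucket")"
    if [ "$status" = "200" ]; then
      return 0
    fi
    echo "attempt $attempt/$MAX_ATTEMPTS for $bucket: HTTP $status" >&2
    sleep "$RETRY_DELAY"
    attempt=$((attempt + 1))
  done
  return 1
}
```

With this in place, `primary_index_with_retry cpapi` would give up after a bounded number of attempts, so a stuck query service shows up as a failed setup step rather than a loop that never ends.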
      

      Finally, we create and build the required control plane indexes using couchbase-index-manager from a separate Docker container. This is the step at which the indexer usually becomes unresponsive, typically after the first 20+ indexes are created. This is not always the case: sometimes all of the indexes are built and the local development setup starts successfully, but the indexer becomes unresponsive shortly afterwards anyway.

      Observations

      In the server UI we can see that the indexer no longer seems to use memory on the node:

      And if we navigate to the "Indexes" page, we see a pop-up saying that ns_server cannot communicate with the indexer.

      From the indexer.log server logs, I've noticed that the indexer almost always hangs after logging the following lines (specifically, it is always [0 0 0 0 0 0 0 5]):

      2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
      2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 5] Nodes:[c16767276915a76980cf6ae54d5cee84] IsBalanced:true Messages:[]}
      2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetTaskList []
      2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 5] Tasks:[]}
      2023-11-06T12:52:10.348+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 5]
      2023-11-06T12:52:10.348+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 5]
      2023-11-06T12:52:20.316+00:00 [Info] Indexer::ReadMemstats Time Taken 103.598µs
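
One rough way to spot the hang from the log alone is to check how long ago the last line was written. The sketch below assumes that a healthy indexer keeps writing to indexer.log and that GNU date is available to parse the timestamps; the function names are hypothetical.

```shell
#!/bin/bash
# Print the timestamp (first field) of the last non-empty line of a log file.
last_log_timestamp() {
  awk 'NF { ts = $1 } END { print ts }' "$1"
}

# Print the number of seconds elapsed since the last log line.
# Assumes GNU date, which can parse the ISO 8601 timestamps used in the log.
seconds_since_last_log() {
  local ts last_epoch now_epoch
  ts="$(last_log_timestamp "$1")"
  last_epoch="$(date -d "$ts" +%s)" || return 1
  now_epoch="$(date +%s)"
  echo $((now_epoch - last_epoch))
}
```

For example, `seconds_since_last_log /opt/couchbase/var/lib/couchbase/logs/indexer.log` run inside the container prints the age of the last entry in seconds; what counts as "too old" is left to the caller.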
      

      I collected strace output from the indexer process, and it is not consistent across clean ("from nothing") runs of cbclocal. The trace ends either with

      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
      rt_sigreturn({mask=[]})                 = 211128991416320
      

      printed indefinitely or with

      futex(0x1fa6f08, FUTEX_WAIT_PRIVATE, 0, NULL
      

      being printed, with nothing happening afterwards. I will attach sample outputs that end in both of these ways.
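
To sort several strace logs by how they end, a small helper along the following lines may be enough. It is only a sketch: the function name, the labels it prints, and the threshold of five SIGURG lines in the last 20 are arbitrary choices. As background, the Go runtime uses SIGURG for goroutine preemption, so a stream of SIGURG deliveries is not necessarily an error by itself.

```shell
#!/bin/bash
# Classify how an strace log ends:
#   futex-wait   last non-empty line is a futex(..., FUTEX_WAIT...) call
#   sigurg-loop  the last 20 lines contain at least 5 SIGURG deliveries
#   other        anything else
classify_strace_tail() {
  local file="$1" last sigurg
  # Last non-empty line; awk also copes with a missing trailing newline.
  last="$(awk 'NF { line = $0 } END { print line }' "$file")"
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  sigurg="$(tail -n 20 "$file" | grep -c -- '--- SIGURG' || true)"
  case "$last" in
    *futex*FUTEX_WAIT*)
      echo "futex-wait"
      ;;
    *)
      if [ "$sigurg" -ge 5 ]; then
        echo "sigurg-loop"
      else
        echo "other"
      fi
      ;;
  esac
}
```

Running `classify_strace_tail indexer_strace_0.log` on each attached trace would then print one label per file.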

      Unfortunately, I couldn't collect server logs from the UI, so I will attach all log files from the var/lib/couchbase/logs directory instead. Let me know if another set of logs is required; I should be able to collect them, as the issue is consistently reproducible on my machine.
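
For anyone reproducing this, the log directory can be copied out of the container roughly as follows. This is a sketch: the container name cbclocal_db is taken from the cbclocal output above, the full path assumes the logs live under the /opt/couchbase/var volume declared in the Dockerfile, and collect_logs is a hypothetical helper name.

```shell
#!/bin/bash
# Copy the Couchbase log directory out of a running container.
# Arguments: container name (default cbclocal_db), destination directory.
collect_logs() {
  local container="${1:-cbclocal_db}" dest="${2:-./cb_logs}"
  mkdir -p "$dest"
  # The trailing "/." makes docker cp copy the directory contents into $dest.
  docker cp "$container:/opt/couchbase/var/lib/couchbase/logs/." "$dest"
}
```

Calling `collect_logs cbclocal_db ./cb_logs` leaves indexer.log and the other log files in ./cb_logs on the host.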

      Things I have tried

      • Restarting the indexer process from inside the container. This doesn't seem to fix the issue.
      • Reinstalling the Docker engine.
      • Upgrading the system.
      • Using a completely different Linux distribution (information below).

      Additional information

      I'm currently running Arch Linux on the laptop that is experiencing the issue, but I had tried Ubuntu 22.04 first. On Ubuntu the indexer would always become unresponsive much earlier, just after the first 3-5 couchbase-index-manager "create index" requests, which are made right after the database is configured. I thought that this might be an issue with the way I had configured the distribution, so I decided to try a more lightweight one like Arch to ensure that I only have the required packages and nothing else.

      Is There a Workaround?
      Yes, use the memory optimized index storage mode.
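
As a sketch of how the workaround could be applied in the setup script above, the cluster-init call can request the memory optimized storage mode through the --index-storage-setting option of couchbase-cli. The COUCHBASE_CLI variable and the cluster_init_memopt function name are hypothetical; the other arguments are copied from couchbase_config.sh.

```shell
#!/bin/bash
# Path or name of the CLI binary; overridable for dry runs (hypothetical knob).
COUCHBASE_CLI="${COUCHBASE_CLI:-couchbase-cli}"

# Same cluster-init call as in couchbase_config.sh, with the index storage
# mode set to memory optimized ("memopt") instead of the default.
cluster_init_memopt() {
  "$COUCHBASE_CLI" cluster-init -c localhost \
    --cluster-username Administrator \
    --cluster-password password \
    --services data,index,query \
    --cluster-ramsize 512 \
    --cluster-index-ramsize 256 \
    --index-storage-setting memopt
}
```

Setting `COUCHBASE_CLI=echo` before calling `cluster_init_memopt` prints the arguments instead of running the command, which is a cheap way to check the call before rebuilding the container.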

      Attachments

        1. cb_logs_0.zip
          2.82 MB
        2. dump.tar.gz
          37.78 MB
        3. image-2023-11-06-15-05-06-206.png
          60 kB
        4. image-2023-11-06-15-06-56-331.png
          138 kB
        5. indexer_strace_0.log
          366 kB
        6. indexer_strace_1.log
          485 kB
        7. indexer_strace_2.log
          225 kB
        8. indexer_strace_3.log
          544 kB

            People

              saptarshi.sen Saptarshi Sen
              maks.januska Maksimiljans Januska