Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: 7.2.2
Component/s: storage-engine
Labels:
- deferred-from-trinity
- plasma
Environment:
OS: Arch Linux x86_64
Host: 21CB007CUK ThinkPad X1 Carbon Gen 10
Kernel: 6.5.9-arch2-1
CPU: 12th Gen Intel i7-1260P (16) @ 4.700GHz
Memory: 31798MiB

Triage: Untriaged
Operating System: Linux x86_64
Story Points: 0
Is this a Regression?: Unknown

Description

What is the issue?

When developing Couchbase Capella locally, we run the one-node control plane database and services in Docker containers. I'm one of the few people on the control plane team running Linux, and I had no issues with this local development setup until I recently switched to a new Linux laptop with better specs. The issue seems to be that, at some point after we configure the control plane database, the indexer becomes unresponsive. Queries then stop working, which makes the development setup unusable.

For additional context, we use a CLI utility called cbclocal to start and manage the local development setup. On startup it runs the following steps (terminal output from a sample successful run):

✅  Verifying Environment ...
✅  Starting ngrok (ngrok http 8081 --scheme http)
✅  Starting build couchbase:local (docker build -t couchbase:local  -f ./cmd/cbclocal/Dockerfile .) ...
✅  Starting db ...
✅  Finished db readiness check (8s)
✅  Starting couchbase_config.sh (docker exec cbclocal_db /bin/bash -c './couchbase_config.sh') ...
✅  Finished Generate env file (2.31s)
✅  Starting building index-manager (docker build -t index-manager -f ./docker/index-manager/Dockerfile ./docker/index-manager) ...
✅  Starting index-manager ...
✅  Finished waiting for indexes (38.01s)
✅  Starting building cbclocalrunner (docker build -t cbclocalrunner ./docker/cbclocalrunner) ...
✅  Finished pulling localstack image (1.43s)
✅  Finished performing localstack readiness check (5.22s)
✅  Finished performing SQS readiness check (12.18s)
✅  Starting cp-api ...
✅  Finished cp-api readiness check (6s)
✅  Starting cp-internal-api ...
✅  Finished cp-internal-api readiness check (4s)
✅  Starting cp-jobs ...
✅  Finished cp-jobs readiness check (8s)
✅  Finished pull node image (990ms)
✅  Starting cp-ui-v2 ...
✅  Finished ui readiness check (8.55s)

One of the first steps starts the database container, which is based on the official Couchbase Docker image. We are not doing anything special with it apart from adding the server setup script. The Dockerfile is shown below.

FROM couchbase/server:enterprise-7.2.2
ARG VBUCKETS=1024
ENV COUCHBASE_NUM_VBUCKETS=$VBUCKETS
COPY cmd/cbclocal/scripts/couchbase_config.sh /
ENTRYPOINT ["/entrypoint.sh"]
CMD ["couchbase-server"]
EXPOSE 8091 8092 8093 8094 8095 8096 11207 11210 11211 18091 18092 18093 18094 18095 18096
VOLUME /opt/couchbase/var

Then we configure the database using the couchbase_config.sh script, which is also very basic:

#!/bin/bash -x

couchbase-cli cluster-init -c localhost \
  --cluster-username Administrator \
  --cluster-password password \
  --services data,index,query \
  --cluster-ramsize 512 \
  --cluster-index-ramsize 256

couchbase-cli user-manage -c localhost:8091 \
  -u Administrator \
  -p password \
  --set --rbac-username cpapi --rbac-password password --roles admin --auth-domain local

bucket_create() {
  couchbase-cli bucket-create -c localhost:8091 \
    -u Administrator \
    -p password \
    --bucket "$1" --bucket-type couchbase --bucket-ramsize 128 --bucket-replica 0 --wait
}

primary_index() {
  bucketName=$1
  url="http://localhost:8093/query/service?statement=create%20primary%20index%20on%20$bucketName"
  # Sometimes the bucket might not be ready yet and we'll fail to connect. So re-try if we see a failure.
  while true; do
    curl -X POST -v -u Administrator:password "$url"
    if [[ $? -eq 0 ]]; then
      break
    fi
  done
}

bucket_create auditevents
bucket_create cpapi
bucket_create jobs
bucket_create notifications

primary_index cpapi
primary_index notifications
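
As a side note, the retry loop in primary_index retries forever with no delay between attempts, so a permanently unreachable query service would make the script spin. Below is a minimal sketch of a bounded alternative. The retry helper is hypothetical and not part of cbclocal; the attempt count and delay are placeholder values.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Hypothetical helper (not part of cbclocal): runs CMD until it succeeds
# or MAX_ATTEMPTS is reached, sleeping DELAY_SECONDS between attempts.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Possible use inside primary_index (same request as in the script above):
# retry 30 2 curl -X POST -v -u Administrator:password "$url"
```

With this variant the script would fail with a non-zero status instead of hanging, which may make it easier to tell a slow bucket apart from a dead service.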

Finally, we try to create and build the required control plane indexes using couchbase-index-manager from a different Docker container. This is the step at which the indexer sometimes becomes unresponsive, usually after the first 20+ indexes are created. This is not always the case: sometimes it manages to build all of the indexes and the local development setup starts successfully, but the indexer still becomes unresponsive shortly afterwards.

Observations

In the server UI we can see that the indexer no longer seems to use memory on the node, and if we navigate to the "Indexes" page we see a pop-up saying that ns_server cannot communicate with the indexer (see the attached screenshots).

In indexer.log, I've noticed that the indexer almost always hangs after logging the following lines (specifically, Rev is always [0 0 0 0 0 0 0 5]):

2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 5] Nodes:[c16767276915a76980cf6ae54d5cee84] IsBalanced:true Messages:[]}
2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetTaskList []
2023-11-06T12:52:10.346+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 5] Tasks:[]}
2023-11-06T12:52:10.348+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 5]
2023-11-06T12:52:10.348+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 5]
2023-11-06T12:52:20.316+00:00 [Info] Indexer::ReadMemstats Time Taken 103.598µs

I've tried to collect a system call trace (strace) from the indexer process, and the result is not consistent across "from nothing" runs of cbclocal. The trace ended either with

--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=579, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = 211128991416320

printed indefinitely or with

futex(0x1fa6f08, FUTEX_WAIT_PRIVATE, 0, NULL

being printed and then nothing happening after. I will attach sample outputs that end in both of these ways.
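
To compare many runs quickly, the two endings described above can be told apart mechanically. The following is a rough sketch only: classify_strace_tail is a hypothetical helper, and it looks at nothing but the last non-empty line of a log. Note also that Go runtimes since 1.14 use SIGURG for goroutine preemption, so a stream of SIGURG deliveries on its own may simply mean the process is still scheduling goroutines rather than being stuck in the kernel.

```shell
#!/bin/bash
# classify_strace_tail FILE
# Hypothetical helper: reports how an strace log ends, using only the two
# endings described in this report.
classify_strace_tail() {
  local file=$1 last
  # Take the last non-empty line of the log.
  last=$(awk 'NF { l = $0 } END { print l }' "$file")
  case "$last" in
    *FUTEX_WAIT*)            echo "futex-wait" ;;
    *SIGURG*|*rt_sigreturn*) echo "sigurg-loop" ;;
    *)                       echo "other" ;;
  esac
}

# Possible use against the attached logs:
# for f in indexer_strace_*.log; do echo "$f: $(classify_strace_tail "$f")"; done
```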

Unfortunately I couldn't collect server logs from the UI, so I will attach all log files from the var/lib/couchbase/logs directory. Let me know if another set of logs is required. I should be able to collect them, as the issue is consistently reproducible on my machine.

Things I have tried

- Restarting the indexer process from inside the container (doesn't seem to fix the issue).
- Reinstalling the Docker engine.
- Upgrading the system.
- Using a completely different Linux distribution (information below).

Additional information

I'm currently running Arch Linux on the laptop that is experiencing the issue, but I had tried Ubuntu 22.04 first. When running Ubuntu, the indexer would always become unresponsive much earlier, just after the first 3-5 couchbase-index-manager "create index" requests, which are made right after the database is configured. I thought that this might be an issue with the way I had configured the distribution, so I decided to try a more lightweight one like Arch to ensure that I only have the required packages and nothing else.

Is There a Workaround?
Yes, use the memory optimized index storage mode.
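
For the local setup described above, one way to apply this workaround might be to select the storage mode at cluster initialisation. This is a sketch under assumptions: it reuses the cluster-init arguments from couchbase_config.sh and adds couchbase-cli's --index-storage-setting option with the value memopt. The CB_CLI variable and the cluster_init_memopt function are hypothetical names, added only so the command can be dry-run.

```shell
#!/bin/bash
# CB_CLI is a hypothetical override so the command can be dry-run,
# e.g. with CB_CLI=echo; by default the real couchbase-cli is used.
CB_CLI="${CB_CLI:-couchbase-cli}"

# Same cluster-init call as in couchbase_config.sh, with the index storage
# mode set to memory optimized ("memopt") instead of the default.
cluster_init_memopt() {
  "$CB_CLI" cluster-init -c localhost \
    --cluster-username Administrator \
    --cluster-password password \
    --services data,index,query \
    --cluster-ramsize 512 \
    --cluster-index-ramsize 256 \
    --index-storage-setting memopt
}
```

The function is only defined here; calling cluster_init_memopt in place of the original cluster-init command would apply the setting. Whether 256 MiB of index RAM is enough for all control plane indexes in this mode has not been verified.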

Attachments

cb_logs_0.zip
2.82 MB
06/Nov/23 8:00 AM
dump.tar.gz
37.78 MB
10/Nov/23 11:04 AM
image-2023-11-06-15-05-06-206.png
60 kB
06/Nov/23 7:05 AM
image-2023-11-06-15-06-56-331.png
138 kB
06/Nov/23 7:06 AM
indexer_strace_0.log
366 kB
06/Nov/23 7:59 AM
indexer_strace_1.log
485 kB
06/Nov/23 7:59 AM
indexer_strace_2.log
225 kB
06/Nov/23 7:59 AM
indexer_strace_3.log
544 kB
06/Nov/23 7:59 AM

Issue Links

relates to: AV-67324

Indexer becomes unresponsive after some time with the node running in a Docker container
