Details
- Bug
- Status: Reopened
- Critical
- Resolution: Unresolved
- 6.0.0
- Untriaged
- Unknown
Description
Build 6.0.0-1693
Observed that, in a high data density test with 30 buckets, the swap rebalance of an index+query node failed with the reason "cannot assign requested address".
The setup has 30 buckets, 2 indexes per bucket, and ongoing queries on each bucket.
Note that a swap rebalance attempted after this failed rebalance was successful.
Logs-
Index+query node which is rebalanced in- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-249/172.23.97.15.zip
Index+query node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-249/172.23.97.19.zip
Index+query node which is rebalanced out- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-249/172.23.97.20.zip
KV node- https://s3.amazonaws.com/bugdb/jira/index_reb_multibucket/collectinfo-2019-01-08T151840-ns_1%40172.23.97.12.zip
KV node- https://s3.amazonaws.com/bugdb/jira/index_reb_multibucket/collectinfo-2019-01-08T151840-ns_1%40172.23.97.13.zip
KV node- https://s3.amazonaws.com/bugdb/jira/index_reb_multibucket/collectinfo-2019-01-08T151840-ns_1%40172.23.97.14.zip
FTS node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-249/172.23.96.20.zip
Analytics node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-249/172.23.96.23.zip
Eventing node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-249/172.23.97.177.zip
Attachments
Activity
It is already assigned to the indexer team; please take a look.
The query service is transient; whatever ns_server brings up, we will start.
Observed this issue again in the same test.
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/266
Indexer logs-
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-266/172.23.97.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-266/172.23.97.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-266/172.23.97.20.zip
Varun Velamuri, can we have an update on this?
We cannot certify 30 buckets until we know more about this issue and whether retrying is an acceptable workaround.
Keshav Murthy, please help with prioritizing this.
RMS (reliability-manageability-serviceability) work is the highest priority for the GSI team.
Jeelan Poola is aware of this and will prioritize accordingly.
The nodes need to be configured with higher limits for open fds.
couchbase user limits (su couchbase -s /bin/sh -c "ulimit -a"):
==============================================================================
open files (-n) 1024

When this limit gets exhausted, the TCP error below is seen:
dial tcp 127.0.0.1:9102: connect: cannot assign requested address
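For reference, one way to raise this limit for the couchbase user is a systemd drop-in for the service unit. This is only a sketch under assumptions not stated in this ticket (systemd-managed nodes, unit name couchbase-server, target value 81920); on non-systemd setups /etc/security/limits.conf is the analogous place:
# create a drop-in override raising the open-files limit for the service
sudo mkdir -p /etc/systemd/system/couchbase-server.service.d
printf '[Service]\nLimitNOFILE=81920\n' | sudo tee /etc/systemd/system/couchbase-server.service.d/limits.conf
# reload systemd and restart so the new limit takes effect
sudo systemctl daemon-reload
sudo systemctl restart couchbase-server
# verify the effective limit of a running Couchbase process (process name "indexer" assumed here)
grep 'open files' /proc/$(pgrep -x indexer | head -1)/limits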
Build 6.5.0-2433
Increased the open files limit to 4096 and ran the same test. This time it failed in the rebalance-out scenario, before the swap rebalance, with the same reason.
Rebalance exited with reason {service_rebalance_failed,index,
    {rebalance_failed,
        {service_error,
            <<"Post http://127.0.0.1:9102/createIndexRebalance: dial tcp 127.0.0.1:9102: connect: cannot assign requested address">>}}}
Logs-
rebalance out node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-290/172.23.97.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-290/172.23.97.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-290/172.23.97.20.zip
Based on the lsof output in couchbase.log for 172.23.97.19, here are the major consumers of file handles (couchbase processes only):
1140 cbq-engine
440 indexer
325 memcached
225 beam.smp
I don't know what workload is being run on the system. All of these may be legitimate usage.
Based on our documentation, our installer tries to set the file handle limit to 40960. So I think you can go higher.
https://docs.couchbase.com/server/5.5/troubleshooting/troubleshooting-common-errors.html
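For reference, the per-process totals above can be reproduced from a saved lsof capture with standard tools; a small sketch, assuming the capture is in lsof.txt as in the later comments:
# count open descriptors per process name and show the top consumers
awk 'NR>1 {print $1}' lsof.txt | sort | uniq -c | sort -rn | head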
Build 6.5.0-2640
Ran the test with the open files limit set to 81920, and the index+n1ql swap rebalance completed successfully.
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/293
Deepkaran Salooja Shivani Gupta,
do we need to create a doc issue to capture this behaviour?
Mahesh Mandhare, my question is why the installer didn't set it to 40960 in this case. Or did you bypass the installer?
I don't think we have to document it if it is built into the installer.
Also, btw, the above documentation page seems to have disappeared from the 6.0 documentation; that might have to be filed as a bug.
Build 6.5.0-2640
Ran the same test again with the open files limit as 81920, and the index+n1ql swap rebalance failed with the same reason.
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/294
Logs (nodes have the same roles as mentioned in the description)-
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.12.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.13.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.14.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.15.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.96.20.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.96.23.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.177.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.19.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.20.zip
Build 6.5.0-2802
Here are logs from the latest run, which hit the same issue-
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.96.20.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.96.23.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.12.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.13.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.14.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.15.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.177.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.19.zip
https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.20.zip
The failure is coming from node 172.23.97.20.
In couchbase.log, I see the following open file counts for couchbase processes:
memcached - 330
cbq-engine - 1416
indexer - 425
I cannot determine the file handle limit because the error below is reported while collecting this information:
cbcollect_info.log:[2019-04-10T18:28:14.991981+08:00] couchbase user limits (su couchbase -s /bin/sh -c "ulimit -a") - skipped (needs root privs)
Is this a non-root/non-sudo installation?
I think we need to do the following:
1. Run the command (su couchbase -s /bin/sh -c "ulimit -a") to verify that the couchbase user limits are set correctly on each node.
2. Manually run lsof on the indexer nodes after the rebalance failure to see where the file handles are being used. couchbase.log only collects this for couchbase processes.
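A sketch of how those two checks could be captured on an indexer node right after the failure (run as root; the output file names are only illustrative):
# 1. effective limits for the couchbase user
su couchbase -s /bin/sh -c "ulimit -a" > ulimit_$(hostname).txt
# 2. all open descriptors on the node, not just the couchbase processes
lsof -n -P > lsof_$(hostname).txt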
Here is the output for the first to-do on node 172.23.97.20:
[root@ark08-sa11 ~]# su couchbase -s /bin/sh -c "ulimit -a"
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1288640
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 81920
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Will run the lsof command after the failure next time I run the test.
Build 6.5.0-3667
We have seen the swap rebalance failure again in the high bucket density test.
Attaching the lsof output file.
Here is the ulimit -a output:
[root@ark08-sa11 ~]# su couchbase -s /bin/sh -c "ulimit -a"
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1288640
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 81920
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-312/172.23.97.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-312/172.23.97.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-312/172.23.97.20.zip
From the attached lsof.txt, cbq-engine has 67731 open connections to memcached.
mulb > grep cbq-engin lsof.txt > cbq.txt
mulb > grep 11210 cbq.txt | grep ESTABLISHED | wc -l
67731
mulb > grep 11210 cbq.txt | grep ESTABLISHED | head
cbq-engin 29713 couchbase 25u IPv4 16831690 0t0 TCP k8s-master-1:44072->k8s-master-1:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 26u IPv4 16834667 0t0 TCP k8s-master-1:44074->k8s-master-1:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 27u IPv4 16834668 0t0 TCP k8s-master-1:44076->k8s-master-1:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 28u IPv4 16809758 0t0 TCP k8s-master-1:44078->k8s-master-1:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 29u IPv4 16834669 0t0 TCP k8s-master-1:44080->k8s-master-1:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 39u IPv4 22578368 0t0 TCP k8s-master-1:50386->k8s-worker-2:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 40u IPv4 26647300 0t0 TCP k8s-master-1:35222->k8s-worker-3:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 43u IPv4 22529880 0t0 TCP k8s-master-1:50384->k8s-worker-2:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 45u IPv4 22565502 0t0 TCP k8s-master-1:50385->k8s-worker-2:11210 (ESTABLISHED)
cbq-engin 29713 couchbase 52u IPv4 22580104 0t0 TCP k8s-master-1:50388->k8s-worker-2:11210 (ESTABLISHED)
The node has an open file handle limit of 81920.
[root@ark08-sa11 ~]# su couchbase -s /bin/sh -c "ulimit -a"
open files (-n) 81920
The open files limit got exhausted, with cbq-engine using 67731 of those handles, and this led to the rebalance failure:
{log_entry,{1562,234031,246086},
 'ns_1@172.23.96.20',ns_orchestrator,0,
 [<<"Rebalance exited with reason {service_rebalance_failed,index,\n {rebalance_failed,\n {service_error,\n <<\"Post http://127.0.0.1:9102/dropIndex: dial tcp 127.0.0.1:9102: connect: cannot assign requested address\">>}}}.\nRebalance Operation Id = d7d3fa5796a8e533c227863e12681b5f">>],
Query team, please check why cbq-engine has so many open connections to memcached.
MB-33170 should have taken care of it. It has been over 2 months; can you rerun the test?
Collect how many cbq-engine connections are open before the rebalance, and during/after the rebalance.
If the problem occurs, collect a goroutine dump of cbq-engine and provide the cluster details; we will take a look.
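A sketch of how that connection count could be tracked on the query node (assumes ss is available and is run as root so -p can resolve process names; repeat before, during, and after the rebalance):
# count established cbq-engine connections to memcached (data port 11210)
ss -tnp | grep ':11210 ' | grep -c 'cbq-engine'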
Build 6.5.0-4059
Observed that the swap rebalance of the index+n1ql node failed with error-
Rebalance exited with reason {service_rebalance_failed,index,
    {rebalance_failed,
        {service_error,
            <<"Post http://127.0.0.1:9102/buildIndex: dial tcp 127.0.0.1:9102: connect: cannot assign requested address">>}}}.
Rebalance Operation Id = c92552ca70db6277c2402bb98db8a912
Attaching the lsof command output for one of the index+n1ql nodes, and the cbq-engine goroutine dumps: lsof_6.5.0-4059_172-23-97-20.txt, cbq-engine_go_dump_172-23-97-15.txt, cbq-engine_go_dump_172-23-97-19.txt, cbq-engine_go_dump_172-23-97-20.txt
Logs-
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-318/172.23.97.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-318/172.23.97.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-318/172.23.97.20.zip
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/318/
Could you please tell me what queries you run and what steps you perform (how to set up the 30 buckets and all the queries)?
I need the cluster for a long time so I can run the steps and reproduce; I may have to add instrumentation.
Here are the details-
Queries:
We have partitioned indexes; each bucket has 2 indexes. The index definitions look like-
CREATE INDEX idx11 ON `bucket-1`(alt_email, name) PARTITION BY HASH(META().id) WITH {"defer_build": true};
CREATE INDEX idx12 ON `bucket-1`(alt_email, city) PARTITION BY HASH(META().id) WITH {"defer_build": true};
Queries run on all buckets and they look like-
SELECT email, alt_email FROM `bucket-1` WHERE alt_email > 'a0' limit 1;
Test file to get definition of all indexes and queries - https://github.com/couchbase/perfrunner/blob/master/tests/multi_bucket/KS_P0_30bucket.test
Here is the REST API call used to increase the maximum number of buckets on the CB cluster - https://github.com/couchbase/perfrunner/blob/master/perfrunner/helpers/rest.py#L245
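For reference, that helper boils down to a cluster-manager REST call along these lines (a sketch; the host, credentials, and bucket count shown are illustrative):
# raise the maximum bucket count via the cluster-manager REST API
curl -s -u Administrator:password -X POST http://172.23.97.12:8091/internalSettings -d 'maxBucketCount=30'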
At the moment we merge 2 clusters to create a big cluster to run these tests, so we do not have the setup/repro available for a long time.
Looked at this issue extensively. Nothing unusual found; no connections are leaking. As far as file descriptors are concerned, N1QL does not use any files; all of the file descriptors are TCP connections:
- Connections to KV
- Incoming REST HTTP requests
- HTTP connections to ns_server for metadata
- Remote query service HTTP connections
- GSI client to indexer
- GSI client to ns_server
- FTS client to ns_server
- FTS client to FTS indexer
- and more
Underneath, HTTP does connection pooling for performance, and as the number of buckets increases, more file descriptors are needed. Please increase the open files limit or set it to unlimited.
Mahesh Mandhare Please let us know how the test went after increasing the max number of file descriptors.
Bulk closing all invalid, duplicate, and won't-fix bugs. Please feel free to reopen them.
Build 6.5.0-4744
Observed a similar rebalance failure even after increasing the open files limit to 1000000.
Here is the ulimit -a output on one of the index+n1ql nodes:
[root@arke07-sa10 ~]# su couchbase -s /bin/sh -c "ulimit -a"
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1288637
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/330/
Logs-
Index+query nodes-
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.20.zip
Other nodes-
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.16.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.17.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.20.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.23.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.12.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.13.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.14.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.177.zip
Note that in the last run, on 6.5.0-4558, this issue did not reproduce - http://perf.jenkins.couchbase.com/job/arke-multi-bucket/325/
Not much has changed since 6.5.0-4558. As mentioned earlier, the query service never uses any files other than connections, and nothing can be done about the connections at present.
You should not reopen old defects repeatedly if they pass and then reappear.
If you can consistently reproduce this, narrow down the build that caused the issue. First check with 6.5.0-4578.
Mahesh- Can you please help narrow down the build where you see it happen?
Kamini Jagtiani [5:58 AM]
Hi Mahesh - any update on this bug (MB-32545)?
Mahesh Mandhare [6:24 AM]
Tried running the test; we are hitting some other issue in the setup itself on high bucket density with the latest build.
When I tried to run this test on the latest build, 6.5.0-4841, I started hitting MB-36973.
So I reran on 6.5.0-4744; the first swap rebalance was successful and the second swap rebalance failed - http://perf.jenkins.couchbase.com/job/arke-multi-bucket/337/
I collected lsof, goroutine, and netstat output during the first rebalance (info.zip), in case it is of any use; next time I will try to collect this information during the second rebalance.
Will take a look later. Did you remove all the unwanted steps and reduce the documents in the bucket to 1000 as described in the mail?
Did you try with 6.5.0-4578?
Sitaram Vemulapalli, I started this test yesterday, so it has all the steps.
Will try on 6.5.0-4578 when I get the clusters again.
I have looked at 172.23.97.20, which had the highest number of open files. I cannot find anything unusual from the info.
wc -l lsof.txt
256690 lsof.txt
cbq-engine used 50% of them:
grep cbq-engin lsof.txt | wc -l
133792
25% are established connections to memcached (these descriptors are not under our control; the Go runtime uses them underneath):
grep cbq-engin lsof.txt | grep "11210" | wc -l
74793
6% used by port 8091 (shown by lsof as jamlink; https://www.speedguide.net/port.php?port=8091):
grep cbq-engin lsof.txt | grep "jamlink" | wc -l
15171
6% used by bacula-dir (I think firewalls)
grep cbq-engin lsof.txt | grep "bacula-dir" | wc -l
14659
According to netstat only 794 connections are established.
grep 11210 neststat.txt | wc -l
794
The goroutine dump does not seem right; I am not sure how it was collected, as a lot of mandatory goroutines are missing.
Use the following command:
curl -u Administrator:password http://localhost:8093/debug/pprof/goroutine?debug=2
These must be collected at the time of the failure, not later, because that gives the correct picture to correlate. The goroutine dump will show which goroutines are running.
As mentioned, the whole Couchbase architecture is based on REST/HTTP, which uses connection pooling to make it faster and so may need more file descriptors. Also, instead of limiting files, set the limit to unlimited and try.
Build 6.5.0-4959
I was not able to find a way to set the open files limit to unlimited; I found that it can be set to a maximum of 1048576 (https://stackoverflow.com/a/1213069/2320823). Let me know if there is any way to set it to unlimited.
Tried setting the open files limit to 1048575 on the index+query nodes and ran the test; the swap rebalance failed with a similar error.
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/343
I was not able to collect a goroutine dump while the swap rebalance was happening; will try next time.
The previously collected goroutine dump was taken with- http://localhost:8093/debug/pprof/goroutine?debug=1
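Regarding going past 1048576: on Linux that ceiling comes from the kernel parameter fs.nr_open (default 1048576), and the nofile limit cannot exceed it, which is why "unlimited" is not accepted. A sketch of raising the ceiling, with an illustrative value, in case a higher limit is worth trying:
# check and raise the kernel cap on per-process open files, then nofile can be raised up to it
sysctl fs.nr_open
sudo sysctl -w fs.nr_open=2097152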
Mahesh Mandhare, this is not something that can be fixed easily - it requires rearchitecting the layer that connects n1ql and memcached to switch from using a connection pool per node+bucket combination to having connection pools per node only.
To make it more complicated, this layer is used by other components as well.
Also, the GSI client would have to be rewritten in a similar fashion, and FTS.
Currently the story is - the more buckets you have, the more file descriptors you are going to use.
I'm sure we will revisit this at some stage, for collections and TCO, but the amount of work required is substantial, so it's not something that we would be able to do for mad-hatter.
Even cheshire-cat is not a given.
This has nothing to do with query. You should check which log has the error and transfer this to that component.