Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version: 6.0.0
Triage: Untriaged
Is this a Regression?: Unknown
Description
Build 6.0.0-1693
Observed that swap rebalance of the FTS node in the high bucket density test failed with the following error-
Rebalance exited with reason {service_rebalance_failed,fts, {rebalance_failed, {service_error, <<"nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:\"http://172.23.96.20:8094\", UUID:\"050b9271854fba4a40721562e160e9be\"}, kind: /api/stats, err: Get http://%40fts-cbauth:51f01258286dc0e326539b393b07c7c1@172.23.96.20:8094/api/stats: EOF">>}}}
The setup has 30 buckets, 1 FTS index per bucket, and no FTS queries.
Tried swap rebalance again on the same setup and got the same error.
Logs-
FTS node which is rebalanced out- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.96.20.zip
FTS node which is rebalanced in- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.15.zip
Analytics node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.96.23.zip
Eventing node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.177.zip
Index+query node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.19.zip
Index+query node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.20.zip
KV node- https://s3.amazonaws.com/bugdb/jira/30_bucket_reb_logs/collectinfo-2019-01-09T041824-ns_1%40172.23.97.12.zip
KV node- https://s3.amazonaws.com/bugdb/jira/30_bucket_reb_logs/collectinfo-2019-01-09T041824-ns_1%40172.23.97.13.zip
KV node- https://s3.amazonaws.com/bugdb/jira/30_bucket_reb_logs/collectinfo-2019-01-09T041824-ns_1%40172.23.97.14.zip
Attachments
Activity
Keshav Murthy, is this related to having 30 buckets? Do you know if swap rebalance of an FTS node works with a single bucket?
This should be fixed for Mad Hatter.
That's a new one – I've never seen this kind of stack trace before.
And, FTS swap rebalance ought to work (like, for fewer buckets), so this feels like a legitimate bug.
Quick summary - please try a retest with the latest build, as there was a bbolt fix that should be helpful.
----------------------------------
More behind-the-scenes info...
I see this ticket was from 2019/Jan/08. And, I see from the stack trace that it's still using boltdb/bolt.
From looking at the mad-hatter manifest, looks like boltdb/bolt is pretty old (revision 51f99c)...
63c7c1020 (Chris Hillery 2018-06-12 16:54:14 -0700 106) <project name="bolt" remote="couchbasedeps" revision="51f99c862475898df9773747d3accd05a7ca33c1" path="godeps/src/github.com/boltdb/bolt"/>
Looking into boltdb/bolt, there are some issues that were fixed a while back in boltdb/bolt that have the same panic source-code line numbers – see: https://github.com/boltdb/bolt/issues/596 and https://github.com/golang/go/issues/16772
And, the boltdb/bolt version 51f99c that's listed in mad-hatter's manifest is from before those fixes...
Author: Ben Johnson <benbjohnson@yahoo.com>
Date: Mon Sep 21 2015 13:14:00 GMT-0700 (PDT)
SHA: 51f99c862475898df9773747d3accd05a7ca33c1
Luckily, recently, Abhinav Dangeti had upgraded mad-hatter to use the latest etcd-io/bbolt project instead, which is the fork of boltdb/bolt that will be actively maintained going forwards (as the boltdb/bolt project has now been deprecated). From the mad-hatter manifest...
6605d8efa (abhinavdangeti 2019-02-19 13:53:41 -0800 107) <project name="etcd-bbolt" remote="couchbasedeps" revision="7ee3ded59d4835e10f3e7d0f7603c42aa5e83820" path="godeps/src/github.com/etcd-io/bbolt"/>
Abhinav did that on 2019/02/19, so worth trying a retest with the latest build.
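(Aside, for anyone cross-checking that dependency change: etcd-io/bbolt is meant to be a drop-in replacement for boltdb/bolt, so consuming code typically only swaps the import path. A tiny sketch of that is below – this isn't FTS/scorch code, and the file name is made up.)

// Tiny sketch of what the dependency swap means for consuming code: the
// etcd-io/bbolt fork keeps the boltdb/bolt API, so typically only the
// import path changes. Not FTS code; the file name is hypothetical.
package main

import (
    "log"

    bolt "github.com/etcd-io/bbolt" // previously: "github.com/boltdb/bolt"
)

func main() {
    // Open/Update/View and the rest of the transactional API are unchanged
    // relative to the boltdb/bolt revision 51f99c pinned in the old manifest.
    db, err := bolt.Open("example.bolt", 0600, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}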
Whoops – since I think this ought to be fixed at this point, I'm marking this resolved. cheers, steve
Build 6.5.0-3633
Observed that in the high bucket density test, swap rebalance of the FTS node failed again with the same error-
Rebalance exited with reason {service_rebalance_failed,fts, {rebalance_failed, {service_error, <<"nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:\"http://172.23.96.20:8094\", UUID:\"57a702e0b3ebec5ca47f209844098537\"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF">>}}}.
Rebalance Operation Id = fb7f73c7f76bd2f87888891e0aa3a299
Logs-
FTS node coming in- https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.14.zip
FTS node going out- https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.96.20.zip
Other nodes-
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.96.23.zip
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.12.zip
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.13.zip
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.15.zip
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.177.zip
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.19.zip
https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.20.zip
From the latest *.log files, I'm not seeing any more panic()'s, so the original boltdb issue from a while back looks fixed. So, this might deserve spawning off a new MB ticket to make tracking these issues easier.
From the recent ns_server.fts.log from node *.14, here are some log message lines from around the 2019-07-04T10:19:33 timeframe of the error "/api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF". The log file for node *.20, btw, doesn't have any useful fts log.
2019-07-04T10:19:32.631-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4718, changed, rv: &{Rev:[52 55 50 48] Tasks:[{Rev:[52 55 49 57] ID:rebalance:c7f7247e99cc3d24c3b373256000b310 Type:task-rebalance Status:task-running IsCancelable:true Progress:1.4555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:c7f7247e99cc3d24c3b373256000b310 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:e982f22ef1a2276d471ce79e1491c68a Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:57a702e0b3ebec5ca47f209844098537 Priority:0 Opaque:<nil>}]}]}]}
2019-07-04T10:19:32.660-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-9-fts_less_words_23477b50fc00abbd_6ddbfb54-39d3da9c, worker, looping beg, vbucketState: "running" (has 169 vbuckets), 855-1023
2019-07-04T10:19:32.740-07:00 [INFO] janitor: pindexes to remove: 0
2019-07-04T10:19:32.740-07:00 [INFO] janitor: pindexes to add: 0
2019-07-04T10:19:32.740-07:00 [INFO] janitor: pindexes to restart: 0
2019-07-04T10:19:33.033-07:00 [INFO] ctl/manager: revNum: 4721, progress: 1.455556
2019-07-04T10:19:33.033-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4720, changed, rv: &{Rev:[52 55 50 50] Tasks:[{Rev:[52 55 50 49] ID:rebalance:c7f7247e99cc3d24c3b373256000b310 Type:task-rebalance Status:task-running IsCancelable:true Progress:1.4555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:c7f7247e99cc3d24c3b373256000b310 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:e982f22ef1a2276d471ce79e1491c68a Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:57a702e0b3ebec5ca47f209844098537 Priority:0 Opaque:<nil>}]}]}]}
2019-07-04T10:19:33.041-07:00 [WARN] ctl: ReportProgress, err: nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:"http://172.23.96.20:8094", UUID:"57a702e0b3ebec5ca47f209844098537"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF -- ctl.(*Ctl).startCtlLOCKED.func2.2() at ctl.go:807
2019-07-04T10:19:33.181-07:00 [INFO] janitor: feeds to remove: 0
2019-07-04T10:19:33.181-07:00 [INFO] janitor: feeds to add: 0
2019-07-04T10:19:33.181-07:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: planPIndexes
2019-07-04T10:19:33.305-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-13-fts_less_words_6cd3c04d7bcc992e_13aa53f3-62873057, worker, looping beg, vbucketState: "running" (has 87 vbuckets), 0-86
2019-07-04T10:19:33.314-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-13-fts_less_words_6cd3c04d7bcc992e_13aa53f3-62873057, worker, looping beg, vbucketState: "running" (has 84 vbuckets), 87-170
2019-07-04T10:19:33.508-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-13-fts_less_words_6cd3c04d7bcc992e_6ddbfb54-55928503, worker, looping beg, vbucketState: "running" (has 169 vbuckets), 855-1023
2019-07-04T10:19:33.606-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-5-fts_less_words_7d34a0ae348705c4_f4e0a48a-7ddfe2f3, worker, looping beg, vbucketState: "running" (has 86 vbuckets), 256-341
2019-07-04T10:19:33.606-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-5-fts_less_words_7d34a0ae348705c4_f4e0a48a-7ddfe2f3, worker, looping beg, vbucketState: "running" (has 85 vbuckets), 171-255
2019-07-04T10:19:33.896-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-14-fts_more_words_7c1051b3fbc3ea31_13aa53f3-373bc0bc, worker, looping beg, vbucketState: "running" (has 84 vbuckets), 87-170
2019-07-04T10:19:33.896-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-14-fts_more_words_7c1051b3fbc3ea31_13aa53f3-373bc0bc, worker, looping beg, vbucketState: "running" (has 87 vbuckets), 0-86
2019-07-04T10:19:34.047-07:00 [INFO] ctl/manager: revNum: 4723, progress: 0.000000
2019-07-04T10:19:34.047-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4722, changed, rv: &{Rev:[52 55 50 52] Tasks:[{Rev:[52 55 50 51] ID:rebalance:c7f7247e99cc3d24c3b373256000b310 Type:task-rebalance Status:task-failed IsCancelable:true Progress:0 DetailedProgress:map[] Description:topology change ErrorMessage:nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:"http://172.23.96.20:8094", UUID:"57a702e0b3ebec5ca47f209844098537"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF Extra:map[topologyChange:{ID:c7f7247e99cc3d24c3b373256000b310 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:e982f22ef1a2276d471ce79e1491c68a Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:57a702e0b3ebec5ca47f209844098537 Priority:0 Opaque:<nil>}]}]}]}
2019-07-04T10:19:34.047-07:00 [INFO] ctl/manager: GetCurrentTopology, haveTopologyRev: 6, changed, rv: &{Rev:[55] Nodes:[57a702e0b3ebec5ca47f209844098537 e982f22ef1a2276d471ce79e1491c68a] IsBalanced:false Messages:[error: nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:"http://172.23.96.20:8094", UUID:"57a702e0b3ebec5ca47f209844098537"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF]}
2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: CancelTask, taskId: rebalance:c7f7247e99cc3d24c3b373256000b310, taskRev:
2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: stop taskHandle, ctlTopology.Rev: 6
2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: CancelTask, taskId: rebalance:c7f7247e99cc3d24c3b373256000b310, taskRev: [], done
2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4724, changed, rv: &{Rev:[52 55 50 53] Tasks:[]}
2019-07-04T10:19:34.222-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-14-fts_more_words_7c1051b3fbc3ea31_18572d87-5cd49993, worker, looping beg\
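For context on that ReportProgress error: the rebalance monitor's stats sampling amounts to an authenticated HTTP GET of /api/stats on each FTS node, and an EOF there means the remote node closed the connection before returning a response (e.g. the fts process on *.20 going away mid-poll). A rough sketch of that kind of poll – not the actual cbgt/ctl code, with placeholder credentials and the node URL from the error above:

// Rough sketch (not the actual cbgt/ctl monitor code) of polling an FTS
// node's /api/stats endpoint; credentials are placeholders.
package main

import (
    "fmt"
    "io"
    "net/http"
)

func pollStats(nodeURL, user, pass string) error {
    req, err := http.NewRequest("GET", nodeURL+"/api/stats", nil)
    if err != nil {
        return err
    }
    req.SetBasicAuth(user, pass) // cbauth-style basic auth in the real system

    res, err := http.DefaultClient.Do(req)
    if err != nil {
        // A connection dropped by the remote node surfaces here as
        // "Get http://.../api/stats: EOF", which is what ctl reports above.
        return fmt.Errorf("kind: /api/stats, err: %v", err)
    }
    defer res.Body.Close()

    _, err = io.Copy(io.Discard, res.Body) // drain the stats JSON
    return err
}

func main() {
    if err := pollStats("http://172.23.96.20:8094", "user", "pass"); err != nil {
        fmt.Println("stats poll failed:", err)
    }
}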
I see this MB-32547 ticket is against 6.0.0? Which is Alice?
One related thing: I recall Sreekanth Sivasankaran had added (I think after Alice shipped) some more resilience codepaths for when FTS receives errors while polling REST /api/stats across the cluster... Related: MB-31258 / http://review.couchbase.org/99672
If so, this might have been fixed already in Mad-Hatter?
Assigning to Sreekanth Sivasankaran to get his sanity check on that.
> I see this MB-32547 ticket is against 6.0.0? Which is Alice?
Whoops, that's wrong. The latest cbcollect-info's from Mahesh Mandhare are using 6.5.0, so the REST API resilience checks should already be in the 6.5.0 codepaths.
I've spawned off a separate ticket – MB-34931 – to reduce confusion (at least my confusion!).
Closing this issue as we did not see a panic in the FTS logs in the last rebalance failure comment.
To track the REST API issue, a ticket is logged at MB-34931.
Will reopen this if we see the boltdb panic again.
In one more run we saw a similar swap rebalance failure; in this run we also observed a panic in the FTS logs.
Here are details-
Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/261
FTS node logs-
Node added to cluster, which has the panic- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-261/172.23.97.15.zip
Node ejected from cluster- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-261/172.23.96.20.zip
Panic-
2019-01-10T02:54:33.898-08:00 [INFO] janitor: pindexes to restart: 0
unexpected fault address 0xc457470000
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc457470000 pc=0x5d0412]
goroutine 58791 [running]:
runtime.throw(0xddad27, 0x5)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/panic.go:605 +0x95 fp=0xc43565d168 sp=0xc43565d148 pc=0x42d885
runtime.sigpanic()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc43565d1b8 sp=0xc43565d168 pc=0x4443d7
github.com/boltdb/bolt.(*node).write(0xc42e2630a0, 0xc45746fff0)
godeps/src/github.com/boltdb/bolt/node.go:205 +0x72 fp=0xc43565d2f8 sp=0xc43565d1b8 pc=0x5d0412
github.com/boltdb/bolt.(*Bucket).write(0xc43565d410, 0xc42e2630a0, 0x8, 0x0)
godeps/src/github.com/boltdb/bolt/bucket.go:594 +0xb1 fp=0xc43565d358 sp=0xc43565d2f8 pc=0x5c8251
github.com/boltdb/bolt.(*Bucket).CreateBucket(0xc42f5d3100, 0xc438c90828, 0x3, 0x8, 0x0, 0x0, 0x3)
godeps/src/github.com/boltdb/bolt/bucket.go:179 +0x27e fp=0xc43565d458 sp=0xc43565d358 pc=0x5c5d5e
github.com/boltdb/bolt.(*Bucket).CreateBucketIfNotExists(0xc42f5d3100, 0xc438c90828, 0x3, 0x8, 0xc438c90828, 0x3, 0x8)
godeps/src/github.com/boltdb/bolt/bucket.go:196 +0x4d fp=0xc43565d4b8 sp=0xc43565d458 pc=0x5c602d
github.com/blevesearch/bleve/index/scorch.(*Scorch).persistSnapshotDirect(0xc42a4a8000, 0xc43565db50, 0x0, 0x0)
godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:426 +0x70d fp=0xc43565d9e0 sp=0xc43565d4b8 pc=0x5fe65d
github.com/blevesearch/bleve/index/scorch.(*Scorch).persistSnapshotMaybeMerge(0xc42a4a8000, 0xc422752580, 0x5c372400, 0x0, 0x0)
godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:360 +0x8e6 fp=0xc43565dbe0 sp=0xc43565d9e0 pc=0x5fdd76
github.com/blevesearch/bleve/index/scorch.(*Scorch).persistSnapshot(0xc42a4a8000, 0xc422752580, 0x1453f40, 0xc43db46fd0)
godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:274 +0x87 fp=0xc43565dc18 sp=0xc43565dbe0 pc=0x5fd437
github.com/blevesearch/bleve/index/scorch.(*Scorch).persisterLoop(0xc42a4a8000)
godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:106 +0x619 fp=0xc43565dfd8 sp=0xc43565dc18 pc=0x5fc089
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc43565dfe0 sp=0xc43565dfd8 pc=0x45e8b1
created by github.com/blevesearch/bleve/index/scorch.(*Scorch).Open
godeps/src/github.com/blevesearch/bleve/index/scorch/scorch.go:169 +0xf6
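To make the stack a bit easier to read: the scorch persister loop is persisting an index snapshot into its boltdb file, and the fault fires while bolt serializes pages during bucket creation inside that update transaction. A heavily simplified sketch of the shape of that call path – not the actual bleve/scorch source, with made-up bucket keys:

// Heavily simplified sketch of the call path in the stack trace above
// (persisterLoop -> persistSnapshot -> bolt CreateBucketIfNotExists);
// not the actual bleve/scorch source, and the bucket keys are made up.
package main

import (
    "log"

    bolt "github.com/boltdb/bolt" // the pre-fix revision pinned in the old manifest
)

// persistSnapshot writes one snapshot epoch into the bolt file, mirroring
// what persistSnapshotDirect does in the trace. The SIGSEGV fires deeper
// inside bolt (node.go:205, node.write) while the new bucket's pages are
// being serialized.
func persistSnapshot(db *bolt.DB, epoch []byte) error {
    return db.Update(func(tx *bolt.Tx) error {
        snapshots, err := tx.CreateBucketIfNotExists([]byte("s"))
        if err != nil {
            return err
        }
        _, err = snapshots.CreateBucketIfNotExists(epoch) // bucket.go:196 in the trace
        return err
    })
}

// persisterLoop stands in for the goroutine the trace shows was created by
// scorch.(*Scorch).Open and that crashed while persisting.
func persisterLoop(db *bolt.DB, epochs <-chan []byte) {
    for epoch := range epochs {
        if err := persistSnapshot(db, epoch); err != nil {
            log.Printf("persist failed: %v", err)
        }
    }
}

func main() {
    db, err := bolt.Open("store.bolt", 0600, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    epochs := make(chan []byte, 1)
    epochs <- []byte{0, 0, 3}
    close(epochs)
    persisterLoop(db, epochs)
}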