Couchbase Server / MB-32547

Swap FTS node rebalance failed (boltdb panic)
Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 6.0.0
    • Fix Version: 6.5.0
    • Component: fts
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      Build 6.0.0-1693

      Observed that swap rebalance of an FTS node in the high bucket density test failed with the following error-

      Rebalance exited with reason {service_rebalance_failed,fts, {rebalance_failed, {service_error, <<"nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:\"http://172.23.96.20:8094\", UUID:\"050b9271854fba4a40721562e160e9be\"}, kind: /api/stats, err: Get http://%40fts-cbauth:51f01258286dc0e326539b393b07c7c1@172.23.96.20:8094/api/stats: EOF">>}}}
      

      The setup has 30 buckets and 1 FTS index per bucket, with no FTS queries being run.

      Tried the swap rebalance again on the same setup and got the same error.

      Logs-
      FTS node which is rebalanced out- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.96.20.zip
      FTS node which is rebalanced in- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.15.zip
      Analytics node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.96.23.zip
      Eventing node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.177.zip
      Index+query node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.19.zip
      Index+query node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-250/172.23.97.20.zip
      KV node- https://s3.amazonaws.com/bugdb/jira/30_bucket_reb_logs/collectinfo-2019-01-09T041824-ns_1%40172.23.97.12.zip
      KV node- https://s3.amazonaws.com/bugdb/jira/30_bucket_reb_logs/collectinfo-2019-01-09T041824-ns_1%40172.23.97.13.zip
      KV node- https://s3.amazonaws.com/bugdb/jira/30_bucket_reb_logs/collectinfo-2019-01-09T041824-ns_1%40172.23.97.14.zip
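
      For context on the error above: the "kind: /api/stats, err: ... EOF" text is the rebalance-side monitor's HTTP poll of the outgoing FTS node's /api/stats endpoint ending in a dropped connection. Below is a minimal sketch of that kind of poll; it is not the actual cbgt/ctl monitor code, and the URL and timeout are placeholders.

      package main

      import (
          "fmt"
          "net/http"
          "time"
      )

      // pollStats sketches a stats poll against an FTS node. An abruptly closed
      // connection surfaces from client.Get as a bare "EOF", which is what the
      // rebalance failure message above wraps.
      func pollStats(url string) error {
          client := &http.Client{Timeout: 10 * time.Second}
          resp, err := client.Get(url)
          if err != nil {
              return fmt.Errorf("kind: /api/stats, err: %v", err)
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("kind: /api/stats, status: %d", resp.StatusCode)
          }
          return nil
      }

      func main() {
          // Placeholder address; in the failing run this was the node being rebalanced out.
          if err := pollStats("http://172.23.96.20:8094/api/stats"); err != nil {
              fmt.Println("monitor error:", err)
          }
      }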


        Activity

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          In one more run we saw a similar swap rebalance failure; in this run we also observed a panic in the FTS logs.

          Here are the details-

          Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/261

          FTS node logs-

          Node added to the cluster, which has the panic- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-261/172.23.97.15.zip

          Node ejected from the cluster- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-261/172.23.96.20.zip

           

          Panic-

           

          2019-01-10T02:54:33.898-08:00 [INFO] janitor: pindexes to restart: 0
          unexpected fault address 0xc457470000
          fatal error: fault
          [signal SIGSEGV: segmentation violation code=0x1 addr=0xc457470000 pc=0x5d0412]
           
          goroutine 58791 [running]:
          runtime.throw(0xddad27, 0x5)
                  /home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/panic.go:605 +0x95 fp=0xc43565d168 sp=0xc43565d148 pc=0x42d885
          runtime.sigpanic()
                  /home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc43565d1b8 sp=0xc43565d168 pc=0x4443d7
          github.com/boltdb/bolt.(*node).write(0xc42e2630a0, 0xc45746fff0)
                  godeps/src/github.com/boltdb/bolt/node.go:205 +0x72 fp=0xc43565d2f8 sp=0xc43565d1b8 pc=0x5d0412
          github.com/boltdb/bolt.(*Bucket).write(0xc43565d410, 0xc42e2630a0, 0x8, 0x0)
                  godeps/src/github.com/boltdb/bolt/bucket.go:594 +0xb1 fp=0xc43565d358 sp=0xc43565d2f8 pc=0x5c8251
          github.com/boltdb/bolt.(*Bucket).CreateBucket(0xc42f5d3100, 0xc438c90828, 0x3, 0x8, 0x0, 0x0, 0x3)
                  godeps/src/github.com/boltdb/bolt/bucket.go:179 +0x27e fp=0xc43565d458 sp=0xc43565d358 pc=0x5c5d5e
          github.com/boltdb/bolt.(*Bucket).CreateBucketIfNotExists(0xc42f5d3100, 0xc438c90828, 0x3, 0x8, 0xc438c90828, 0x3, 0x8)
                  godeps/src/github.com/boltdb/bolt/bucket.go:196 +0x4d fp=0xc43565d4b8 sp=0xc43565d458 pc=0x5c602d
          github.com/blevesearch/bleve/index/scorch.(*Scorch).persistSnapshotDirect(0xc42a4a8000, 0xc43565db50, 0x0, 0x0)
                  godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:426 +0x70d fp=0xc43565d9e0 sp=0xc43565d4b8 pc=0x5fe65d
          github.com/blevesearch/bleve/index/scorch.(*Scorch).persistSnapshotMaybeMerge(0xc42a4a8000, 0xc422752580, 0x5c372400, 0x0, 0x0)
                  godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:360 +0x8e6 fp=0xc43565dbe0 sp=0xc43565d9e0 pc=0x5fdd76
          github.com/blevesearch/bleve/index/scorch.(*Scorch).persistSnapshot(0xc42a4a8000, 0xc422752580, 0x1453f40, 0xc43db46fd0)
                  godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:274 +0x87 fp=0xc43565dc18 sp=0xc43565dbe0 pc=0x5fd437
          github.com/blevesearch/bleve/index/scorch.(*Scorch).persisterLoop(0xc42a4a8000)
                  godeps/src/github.com/blevesearch/bleve/index/scorch/persister.go:106 +0x619 fp=0xc43565dfd8 sp=0xc43565dc18 pc=0x5fc089
          runtime.goexit()
                  /home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc43565dfe0 sp=0xc43565dfd8 pc=0x45e8b1
          created by github.com/blevesearch/bleve/index/scorch.(*Scorch).Open
                  godeps/src/github.com/blevesearch/bleve/index/scorch/scorch.go:169 +0xf6
           
          

           

           


          shivani.gupta Shivani Gupta added a comment -

          Keshav Murthy, is this related to having 30 buckets? Do you know if swap rebalance of an FTS node works with a single bucket?

          This should be fixed for Mad Hatter.

          steve Steve Yen added a comment -

          That's a new one – I have never seen this kind of stack trace before.

          And FTS swap rebalance ought to work (at least for fewer buckets), so this feels like a legitimate bug.

          steve Steve Yen added a comment - - edited

          Quick summary: please retest with the latest build, as there was a bbolt fix that should be helpful.

          ----------------------------------
          More behind-the-scenes info...

          I see this ticket is from 2019/Jan/08, and I can see from the stack trace that it's still using boltdb/bolt.

          Looking at the mad-hatter manifest, boltdb/bolt is pretty old (revision 51f99c)...

          63c7c1020 (Chris Hillery 2018-06-12 16:54:14 -0700 106)   <project name="bolt"           remote="couchbasedeps" revision="51f99c862475898df9773747d3accd05a7ca33c1" path="godeps/src/github.com/boltdb/bolt"/>
          

          Looking into boltdb/bolt, there are some issues that were fixed a while back with these same panic source-code line numbers – see https://github.com/boltdb/bolt/issues/596 and
          https://github.com/golang/go/issues/16772

          And, the boltdb/bolt version 51f99c that's listed in mad-hatter's manifest is from before those fixes...

          Author:		Ben Johnson <benbjohnson@yahoo.com>
          Date:	Mon Sep 21 2015 13:14:00 GMT-0700 (PDT)
          SHA:	51f99c862475898df9773747d3accd05a7ca33c1
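
          For illustration only (this is not the actual boltdb source): the class of bug those upstream issues describe comes from slicing into mmap'd memory through an unsafe pointer cast. If a write then runs past the end of the mapped region, the Go runtime reports an "unexpected fault address" SIGSEGV, as in the stack trace above, rather than an ordinary out-of-range panic. A minimal Go sketch of that idiom, with hypothetical names (maxArraySize, unsafeSlice):

          package main

          import (
              "fmt"
              "unsafe"
          )

          // maxArraySize and unsafeSlice exist only for this sketch; they mirror the
          // "cast a pointer to a huge fixed-size array, then slice it" idiom that older
          // boltdb-style code used when writing nodes into mmap'd pages.
          const maxArraySize = 0x7FFFFFFF

          func unsafeSlice(base unsafe.Pointer, length int) []byte {
              // The resulting slice header claims `length` bytes starting at base;
              // nothing verifies that those bytes are actually mapped.
              return (*[maxArraySize]byte)(base)[:length:length]
          }

          func main() {
              backing := make([]byte, 16) // stand-in for an mmap'd page
              s := unsafeSlice(unsafe.Pointer(&backing[0]), len(backing))
              copy(s, []byte("ok")) // safe: stays within the backing memory
              fmt.Println(string(s[:2]))
              // If the claimed length exceeded what is really mapped, a copy into the
              // tail of the slice could touch unmapped memory and fault, which is the
              // shape of the "unexpected fault address" crash above.
          }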
          

          Luckily, Abhinav Dangeti recently upgraded mad-hatter to use the latest etcd-io/bbolt project instead, which is the fork of boltdb/bolt that will be actively maintained going forward (the boltdb/bolt project has been deprecated). From the mad-hatter manifest...

          6605d8efa (abhinavdangeti 2019-02-19 13:53:41 -0800 107)   <project name="etcd-bbolt"     remote="couchbasedeps" revision="7ee3ded59d4835e10f3e7d0f7603c42aa5e83820" path="godeps/src/github.com/etcd-io/bbolt"/>
          

          Abhinav did that on 2019/02/19, so it's worth retesting with the latest build.

          steve Steve Yen added a comment -

          Whoops – since I think this ought to be fixed at this point, I'm marking this resolved. cheers, steve


          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          Build 6.5.0-3633

          Observed that in the high bucket density test, swap rebalance of the FTS node failed again with the same error-

          Rebalance exited with reason {service_rebalance_failed,fts,
          {rebalance_failed,
          {service_error,
          <<"nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:\"http://172.23.96.20:8094\", UUID:\"57a702e0b3ebec5ca47f209844098537\"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF">>}}}.
          Rebalance Operation Id = fb7f73c7f76bd2f87888891e0aa3a299

          Logs-

          FTS node coming in- https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.14.zip

          FTS node going out- https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.96.20.zip

          Other nodes-
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.13.zip
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.177.zip
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/fts_swap_failure/collectinfo-2019-07-05T030032-ns_1%40172.23.97.20.zip
           

          steve Steve Yen added a comment - - edited

          From the latest *.log files I'm not seeing any more panic()'s, so the original boltdb issue from a while back looks fixed. This might deserve spawning off a new MB ticket to make these issues easier to track.

          From the recent ns_server.fts.log on node *.14, here are some log lines from around the 2019-07-04T10:19:33 timeframe, around the error "/api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF". The log file for node *.20, by the way, doesn't have any useful fts log.

          2019-07-04T10:19:32.631-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4718, changed, rv: &{Rev:[52 55 50 48] Tasks:[{Rev:[52 55 49 57] ID:rebalance:c7f7247e99cc3d24c3b37\
          3256000b310 Type:task-rebalance Status:task-running IsCancelable:true Progress:1.4555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[top\
          ologyChange:{ID:c7f7247e99cc3d24c3b373256000b310 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:e982f22ef1a2276d471ce79e1491c68a Priority:0 O\
          paque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:57a702e0b3ebec5ca47f209844098537 Priority:0 Opaque:<nil>}]}]}]}
          2019-07-04T10:19:32.660-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-9-fts_less_words_23477b50fc00abbd_6ddbfb54-39d3da9c, worker, looping beg,\
           vbucketState: "running" (has 169 vbuckets), 855-1023
          2019-07-04T10:19:32.740-07:00 [INFO] janitor: pindexes to remove: 0
          2019-07-04T10:19:32.740-07:00 [INFO] janitor: pindexes to add: 0
          2019-07-04T10:19:32.740-07:00 [INFO] janitor: pindexes to restart: 0
          2019-07-04T10:19:33.033-07:00 [INFO] ctl/manager: revNum: 4721, progress: 1.455556
          2019-07-04T10:19:33.033-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4720, changed, rv: &{Rev:[52 55 50 50] Tasks:[{Rev:[52 55 50 49] ID:rebalance:c7f7247e99cc3d24c3b37\
          3256000b310 Type:task-rebalance Status:task-running IsCancelable:true Progress:1.4555555555555555 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[top\
          ologyChange:{ID:c7f7247e99cc3d24c3b373256000b310 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:e982f22ef1a2276d471ce79e1491c68a Priority:0 O\
          paque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:57a702e0b3ebec5ca47f209844098537 Priority:0 Opaque:<nil>}]}]}]}
          2019-07-04T10:19:33.041-07:00 [WARN] ctl: ReportProgress, err: nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:"http://172.23.96.20:8094", UUID:"57a702e\
          0b3ebec5ca47f209844098537"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF -- ctl.(*Ctl).startCtlLOCKED.func2.2() at ctl.go:807
          2019-07-04T10:19:33.181-07:00 [INFO] janitor: feeds to remove: 0
          2019-07-04T10:19:33.181-07:00 [INFO] janitor: feeds to add: 0
          2019-07-04T10:19:33.181-07:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: planPIndexes
          2019-07-04T10:19:33.305-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-13-fts_less_words_6cd3c04d7bcc992e_13aa53f3-62873057, worker, looping beg\
          , vbucketState: "running" (has 87 vbuckets), 0-86
          2019-07-04T10:19:33.314-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-13-fts_less_words_6cd3c04d7bcc992e_13aa53f3-62873057, worker, looping beg\
          , vbucketState: "running" (has 84 vbuckets), 87-170
          2019-07-04T10:19:33.508-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-13-fts_less_words_6cd3c04d7bcc992e_6ddbfb54-55928503, worker, looping beg\
          , vbucketState: "running" (has 169 vbuckets), 855-1023
          2019-07-04T10:19:33.606-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-5-fts_less_words_7d34a0ae348705c4_f4e0a48a-7ddfe2f3, worker, looping beg,\
           vbucketState: "running" (has 86 vbuckets), 256-341
          2019-07-04T10:19:33.606-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-5-fts_less_words_7d34a0ae348705c4_f4e0a48a-7ddfe2f3, worker, looping beg,\
           vbucketState: "running" (has 85 vbuckets), 171-255
          2019-07-04T10:19:33.896-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-14-fts_more_words_7c1051b3fbc3ea31_13aa53f3-373bc0bc, worker, looping beg\
          , vbucketState: "running" (has 84 vbuckets), 87-170
          2019-07-04T10:19:33.896-07:00 [INFO] cbdatasource: server: 172.23.97.12:11210, uprOpenName: fts:bucket-14-fts_more_words_7c1051b3fbc3ea31_13aa53f3-373bc0bc, worker, looping beg\
          , vbucketState: "running" (has 87 vbuckets), 0-86
          2019-07-04T10:19:34.047-07:00 [INFO] ctl/manager: revNum: 4723, progress: 0.000000
          2019-07-04T10:19:34.047-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4722, changed, rv: &{Rev:[52 55 50 52] Tasks:[{Rev:[52 55 50 51] ID:rebalance:c7f7247e99cc3d24c3b37\
          3256000b310 Type:task-rebalance Status:task-failed IsCancelable:true Progress:0 DetailedProgress:map[] Description:topology change ErrorMessage:nodes: sample, res: (*http.Respo\
          nse)(nil), urlUUID: monitor.UrlUUID{Url:"http://172.23.96.20:8094", UUID:"57a702e0b3ebec5ca47f209844098537"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8\
          094/api/stats: EOF Extra:map[topologyChange:{ID:c7f7247e99cc3d24c3b373256000b310 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:e982f22ef1a22\
          76d471ce79e1491c68a Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:57a702e0b3ebec5ca47f209844098537 Priority:0 Opaque:<nil>}]}]}]}
          2019-07-04T10:19:34.047-07:00 [INFO] ctl/manager: GetCurrentTopology, haveTopologyRev: 6, changed, rv: &{Rev:[55] Nodes:[57a702e0b3ebec5ca47f209844098537 e982f22ef1a2276d471ce7\
          9e1491c68a] IsBalanced:false Messages:[error: nodes: sample, res: (*http.Response)(nil), urlUUID: monitor.UrlUUID{Url:"http://172.23.96.20:8094", UUID:"57a702e0b3ebec5ca47f2098\
          44098537"}, kind: /api/stats, err: Get http://%40fts-cbauth:***@172.23.96.20:8094/api/stats: EOF]}
          2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: CancelTask, taskId: rebalance:c7f7247e99cc3d24c3b373256000b310, taskRev: 
          2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: stop taskHandle, ctlTopology.Rev: 6
          2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: CancelTask, taskId: rebalance:c7f7247e99cc3d24c3b373256000b310, taskRev: [], done
          2019-07-04T10:19:34.054-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 4724, changed, rv: &{Rev:[52 55 50 53] Tasks:[]}
          2019-07-04T10:19:34.222-07:00 [INFO] cbdatasource: server: 172.23.97.13:11210, uprOpenName: fts:bucket-14-fts_more_words_7c1051b3fbc3ea31_18572d87-5cd49993, worker, looping beg\
          

          steve Steve Yen added a comment -

          I see this MB-32547 ticket is against 6.0.0? Which is Alice?

          One related thing: I recall Sreekanth Sivasankaran added (I think after Alice shipped) some more resilient codepaths for when FTS receives errors while trying to poll REST /api/stats across the cluster... Related: MB-31258 / http://review.couchbase.org/99672

          If so, this might have been fixed already in Mad-Hatter?

          Assigning to Sreekanth Sivasankaran to get his sanity check on that.
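
          As a side note, here is a hedged sketch of the general shape of that kind of resilience. It is not the MB-31258 change itself, just an illustration: retry a transient stats-poll error with backoff instead of failing the surrounding operation on the first EOF. pollWithRetry is a hypothetical helper.

          package main

          import (
              "fmt"
              "time"
          )

          // pollWithRetry retries a poll function a few times with exponential
          // backoff, returning the last error only after all attempts fail.
          func pollWithRetry(poll func() error, attempts int, backoff time.Duration) error {
              var err error
              for i := 0; i < attempts; i++ {
                  if err = poll(); err == nil {
                      return nil
                  }
                  time.Sleep(backoff)
                  backoff *= 2 // back off a little longer after each failure
              }
              return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
          }

          func main() {
              calls := 0
              err := pollWithRetry(func() error {
                  calls++
                  if calls < 3 {
                      return fmt.Errorf("EOF") // simulate a connection dropped mid-rebalance
                  }
                  return nil
              }, 5, 50*time.Millisecond)
              fmt.Println("calls:", calls, "err:", err)
          }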

          steve Steve Yen added a comment -

          > I see this MB-32547 ticket is against 6.0.0? Which is Alice?

          Whoops, that's wrong. The latest cbcollect-info's from Mahesh Mandhare are from 6.5.0, so the REST API resilience checks should already be in the 6.5.0 codepaths.

          I've spawned off a separate ticket – MB-34931 – to reduce confusion (at least my confusion!).


          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - - edited

          Closing this issue as we did not see the panic in the FTS logs in the latest rebalance failure (see comment above).

          The REST API issue is tracked in MB-34931.

          Will reopen this if we see the boltdb panic again.

           


          People

            Assignee: Sreekanth Sivasankaran
            Reporter: mahesh.mandhare Mahesh Mandhare (Inactive)
            Votes: 0
            Watchers: 3
