Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-53574

BP-6.6.x- MB-51491 Indexer/Plasma crash lead to kv rebalance failure.

    XMLWordPrintable

Details

    Description

      1. Load 1 billion 5k(body field contains the major chunk of data) items in a default magma bucket.
      2. Create 50% fragmentation in KV
      3. Create an secondary index idx0 and wait for it to build.
      4. Start a data load as below:

      Read Start: 0
      Read End: 500000000
      Update Start: 0
      Update End: 500000000
      Expiry Start: 0
      Expiry End: 0
      Delete Start: 500000000
      Delete End: 1000000000
      Create Start: 1000000000
      Create End: 1500000000
      

      5. Rebalance IN 1 KV node. Rebalance finished in 11019.869 sec
      6. Create another index idx1 on body field while the above rebalance was at 35%. Index was building while KV rebalances were running. Total plasma disk size was at ~2TB.
      7. Rebalance OUT 1 KV node. Rebalance finished in 10870.6289999 sec
      8. Rebalance IN 2 and out 1 KV node. Rebalance failed at 80%.

      {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.110.64', u'tstamp': 1647477583397L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:39:43.397Z', u'text': u"Rebalance exited with reason {service_rebalance_failed,index,\n                              {agent_died,<26112.18949.96>,\n                               {lost_connection,\n                                {'ns_1@172.23.110.66',shutdown}}}}.\nRebalance Operation Id = 6e737e537f7cc57e8962bec1a189e45e"}
      2022-03-16 17:39:52,405 | test  | ERROR   | pool-3-thread-6 | [rest_client:print_UI_logs:2708] {u'code': 0, u'module': u'ns_log', u'type': u'info', u'node': u'ns_1@172.23.110.66', u'tstamp': 1647477583388L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:39:43.388Z', u'text': u"Service 'index' exited with status 2. Restarting. Messages:\nruntime.gopark(0x14603b8, 0x0, 0x1809, 0x1)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/proc.go:336 +0xe5 fp=0xc018ea99e0 sp=0xc018ea99c0 pc=0x441ae5\nruntime.selectgo(0xc018ea9c50, 0xc018ea9ba4, 0x0, 0x0, 0x3, 0x1459601, 0x0, 0x1)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/select.go:327 +0xef7 fp=0xc018ea9b18 sp=0xc018ea99e0 pc=0x452a97\ngithub.com/couchbase/indexing/secondary/indexer.(*Rebalancer).processDropIndexQueue(0xc018e96000)\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/rebalancer.go:1383 +0x1b2 fp=0xc018ea9fd8 sp=0xc018ea9b18 pc=0xfe9cb2\nruntime.goexit()\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc018ea9fe0 sp=0xc018ea9fd8 pc=0x4784c1\ncreated by github.com/couchbase/indexing/secondary/indexer.NewRebalancer\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/rebalancer.go:171 +0x545\n"}
      2022-03-16 17:39:52,407 | test  | ERROR   | pool-3-thread-6 | [rest_client:print_UI_logs:2708] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.110.65', u'tstamp': 1647477562188L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:39:22.188Z', u'text': u'Shutting down bucket "GleamBookUsers0" on \'ns_1@172.23.110.65\' for deletion'}
      2022-03-16 17:39:52,407 | test  | ERROR   | pool-3-thread-6 | [rest_client:print_UI_logs:2708] {u'code': 0, u'module': u'ns_log', u'type': u'info', u'node': u'ns_1@172.23.110.66', u'tstamp': 1647476292642L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:18:12.642Z', u'text': u"Service 'index' exited with status 2. Restarting. Messages:\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/mvcc.go:824 +0x3bd fp=0xc05021feb8 sp=0xc05021fc70 pc=0xe2c33d\ngithub.com/couchbase/plasma.(*Plasma).VisitPartition(0xc005ae2000, 0x46, 0xc014640000, 0xc014641300, 0xc0136419c0, 0x0, 0x0)\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/page_visitor.go:87 +0x24a fp=0xc05021ff38 sp=0xc05021feb8 pc=0xddbdca\ngithub.com/couchbase/plasma.(*Plasma).PageVisitor.func1(0xc048f98530, 0xc005ae2000, 0xc0136419c0, 0xc0007c4900, 0x48, 0x48, 0x46, 0xc014640000, 0xc014641300)\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/page_visitor.go:41 +0x88 fp=0xc05021ff98 sp=0xc05021ff38 pc=0xe2d1c8\nruntime.goexit()\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc05021ffa0 sp=0xc05021ff98 pc=0x4784c1\ncreated by github.com/couchbase/plasma.(*Plasma).PageVisitor\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/page_visitor.go:39 +0x18b\n"}
      

      Note: This is found while trying out a recently reported customer use case(CBSE-11583). The env is much better in this case. These are physical machines with 256GB RAM and 72 cores.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Wayne Siu Lynn Straus Prachi Sikchi . Please approve for 6.6.6.

            srinath.duvuru Srinath Duvuru added a comment - Wayne Siu Lynn Straus Prachi Sikchi . Please approve for 6.6.6.

            Build couchbase-server-6.6.6-10565 contains plasma commit 353ffbe with commit message:
            MB-53574 [BP]: Avoid copying next for all new delta allocs

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.6-10565 contains plasma commit 353ffbe with commit message: MB-53574 [BP] : Avoid copying next for all new delta allocs

            Build couchbase-server-6.6.6-10565 contains plasma commit 04eb7d1 with commit message:
            MB-53574 [BP]: Avoid copying pageDelta.next during rollback

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.6-10565 contains plasma commit 04eb7d1 with commit message: MB-53574 [BP] : Avoid copying pageDelta.next during rollback

            Closing the ticket based on regression, system and perf testing results

            chanabasappa.ghali Chanabasappa Ghali added a comment - Closing the ticket based on regression, system and perf testing results

            People

              ritesh.agarwal Ritesh Agarwal
              srinath.duvuru Srinath Duvuru
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty