Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51491

Indexer/Plasma crash lead to kv rebalance failure.

    XMLWordPrintable

Details

    Description

      1. Load 1 billion 5k(body field contains the major chunk of data) items in a default magma bucket.
      2. Create 50% fragmentation in KV
      3. Create an secondary index idx0 and wait for it to build.
      4. Start a data load as below:

      Read Start: 0
      Read End: 500000000
      Update Start: 0
      Update End: 500000000
      Expiry Start: 0
      Expiry End: 0
      Delete Start: 500000000
      Delete End: 1000000000
      Create Start: 1000000000
      Create End: 1500000000
      

      5. Rebalance IN 1 KV node. Rebalance finished in 11019.869 sec
      6. Create another index idx1 on body field while the above rebalance was at 35%. Index was building while KV rebalances were running. Total plasma disk size was at ~2TB.
      7. Rebalance OUT 1 KV node. Rebalance finished in 10870.6289999 sec
      8. Rebalance IN 2 and out 1 KV node. Rebalance failed at 80%.

      {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.110.64', u'tstamp': 1647477583397L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:39:43.397Z', u'text': u"Rebalance exited with reason {service_rebalance_failed,index,\n                              {agent_died,<26112.18949.96>,\n                               {lost_connection,\n                                {'ns_1@172.23.110.66',shutdown}}}}.\nRebalance Operation Id = 6e737e537f7cc57e8962bec1a189e45e"}
      2022-03-16 17:39:52,405 | test  | ERROR   | pool-3-thread-6 | [rest_client:print_UI_logs:2708] {u'code': 0, u'module': u'ns_log', u'type': u'info', u'node': u'ns_1@172.23.110.66', u'tstamp': 1647477583388L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:39:43.388Z', u'text': u"Service 'index' exited with status 2. Restarting. Messages:\nruntime.gopark(0x14603b8, 0x0, 0x1809, 0x1)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/proc.go:336 +0xe5 fp=0xc018ea99e0 sp=0xc018ea99c0 pc=0x441ae5\nruntime.selectgo(0xc018ea9c50, 0xc018ea9ba4, 0x0, 0x0, 0x3, 0x1459601, 0x0, 0x1)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/select.go:327 +0xef7 fp=0xc018ea9b18 sp=0xc018ea99e0 pc=0x452a97\ngithub.com/couchbase/indexing/secondary/indexer.(*Rebalancer).processDropIndexQueue(0xc018e96000)\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/rebalancer.go:1383 +0x1b2 fp=0xc018ea9fd8 sp=0xc018ea9b18 pc=0xfe9cb2\nruntime.goexit()\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc018ea9fe0 sp=0xc018ea9fd8 pc=0x4784c1\ncreated by github.com/couchbase/indexing/secondary/indexer.NewRebalancer\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/rebalancer.go:171 +0x545\n"}
      2022-03-16 17:39:52,407 | test  | ERROR   | pool-3-thread-6 | [rest_client:print_UI_logs:2708] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.110.65', u'tstamp': 1647477562188L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:39:22.188Z', u'text': u'Shutting down bucket "GleamBookUsers0" on \'ns_1@172.23.110.65\' for deletion'}
      2022-03-16 17:39:52,407 | test  | ERROR   | pool-3-thread-6 | [rest_client:print_UI_logs:2708] {u'code': 0, u'module': u'ns_log', u'type': u'info', u'node': u'ns_1@172.23.110.66', u'tstamp': 1647476292642L, u'shortText': u'message', u'serverTime': u'2022-03-16T17:18:12.642Z', u'text': u"Service 'index' exited with status 2. Restarting. Messages:\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/mvcc.go:824 +0x3bd fp=0xc05021feb8 sp=0xc05021fc70 pc=0xe2c33d\ngithub.com/couchbase/plasma.(*Plasma).VisitPartition(0xc005ae2000, 0x46, 0xc014640000, 0xc014641300, 0xc0136419c0, 0x0, 0x0)\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/page_visitor.go:87 +0x24a fp=0xc05021ff38 sp=0xc05021feb8 pc=0xddbdca\ngithub.com/couchbase/plasma.(*Plasma).PageVisitor.func1(0xc048f98530, 0xc005ae2000, 0xc0136419c0, 0xc0007c4900, 0x48, 0x48, 0x46, 0xc014640000, 0xc014641300)\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/page_visitor.go:41 +0x88 fp=0xc05021ff98 sp=0xc05021ff38 pc=0xe2d1c8\nruntime.goexit()\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc05021ffa0 sp=0xc05021ff98 pc=0x4784c1\ncreated by github.com/couchbase/plasma.(*Plasma).PageVisitor\n\t/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/page_visitor.go:39 +0x18b\n"}
      

      Note: This is found while trying out a recently reported customer use case(CBSE-11583). The env is much better in this case. These are physical machines with 256GB RAM and 72 cores.

      Attachments

        1. Index.png
          Index.png
          292 kB
        2. screenshot-1.png
          screenshot-1.png
          505 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            ritesh.agarwal Ritesh Agarwal added a comment - - edited

            All these rebalance passed(but the test got aborted due to jython process kill. Planning to re-run on RC3):

            • Rebalance in with Loading of docs. Rebalance completed with progress: 100% in 11127.8659999 sec
            • Rebalance Out with Loading of docs. Rebalance completed with progress: 100% in 20551.9489999 sec
            • Rebalance In_Out with Loading of docs. Rebalance completed with progress: 100% in 13125.1660001 sec
            • Swap with Loading of docs. Rebalance completed with progress: 100% in 17427.4619999 sec
            • Failover 1 node and RebalanceOut that node with loading in parallel. Rebalance completed with progress: 100% in 17610.3427634 sec

            In the last run where this defect was found the test failed in step 3 above.

            Here are the logs from the latest run on 7.1.0 build 2521:
            s3://cb-customers-secure/index_7.1.0-2521/2022-03-25/collectinfo-2022-03-25t165930-ns_1@172.23.110.64.zip
            s3://cb-customers-secure/index_7.1.0-2521/2022-03-25/collectinfo-2022-03-25t165930-ns_1@172.23.110.65.zip
            s3://cb-customers-secure/index_7.1.0-2521/2022-03-25/collectinfo-2022-03-25t165930-ns_1@172.23.110.67.zip

            ritesh.agarwal Ritesh Agarwal added a comment - - edited All these rebalance passed(but the test got aborted due to jython process kill. Planning to re-run on RC3): Rebalance in with Loading of docs. Rebalance completed with progress: 100% in 11127.8659999 sec Rebalance Out with Loading of docs. Rebalance completed with progress: 100% in 20551.9489999 sec Rebalance In_Out with Loading of docs. Rebalance completed with progress: 100% in 13125.1660001 sec Swap with Loading of docs. Rebalance completed with progress: 100% in 17427.4619999 sec Failover 1 node and RebalanceOut that node with loading in parallel. Rebalance completed with progress: 100% in 17610.3427634 sec In the last run where this defect was found the test failed in step 3 above. Here are the logs from the latest run on 7.1.0 build 2521: s3://cb-customers-secure/index_7.1.0-2521/2022-03-25/collectinfo-2022-03-25t165930-ns_1@172.23.110.64.zip s3://cb-customers-secure/index_7.1.0-2521/2022-03-25/collectinfo-2022-03-25t165930-ns_1@172.23.110.65.zip s3://cb-customers-secure/index_7.1.0-2521/2022-03-25/collectinfo-2022-03-25t165930-ns_1@172.23.110.67.zip

            The test passed on RC3(2534) build as well. Closing this defect.

            ritesh.agarwal Ritesh Agarwal added a comment - The test passed on RC3(2534) build as well. Closing this defect.

            Build couchbase-server-7.2.0-5000 contains plasma commit a5f4130 with commit message:
            MB-51491 [BP]: Avoid copying next for all new delta allocs

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-5000 contains plasma commit a5f4130 with commit message: MB-51491 [BP] : Avoid copying next for all new delta allocs

            Build couchbase-server-7.2.0-5000 contains plasma commit ac69d8f with commit message:
            MB-51491 [BP]: Avoid copying pageDelta.next during rollback

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-5000 contains plasma commit ac69d8f with commit message: MB-51491 [BP] : Avoid copying pageDelta.next during rollback

            Build couchbase-server-7.2.0-5000 contains plasma commit e9005ec with commit message:
            MB-51491 [BP]: Reclaim memory during recovery log replay

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-5000 contains plasma commit e9005ec with commit message: MB-51491 [BP] : Reclaim memory during recovery log replay

            People

              ritesh.agarwal Ritesh Agarwal
              ritesh.agarwal Ritesh Agarwal
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                PagerDuty