Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-30500

[System test] Rebalance out of an analytics node failed

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Test Blocker
    • 6.0.0
    • 6.0.0
    • analytics
    • component cluster
    • Untriaged
    • Unknown
    • CX Sprint 111, CX Sprint 112

    Description

      Build : 6.0.0-1318
      Test : -test tests/analytics/test_analytics_rebalance.yml -scope tests/analytics/scope_analytics_rebalance.yml
      Scale : 5
      Iteration : 3

      All rebalance operations in this test are failing. We could capture the logs for only the latest one. For others, the logs have rolled over. Following error is logged in the debug.log for 172.23.104.16 :

      ns_server:debug,2018-07-16T20:09:41.872-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_rebalancer:run_rebalance:71]Worker terminated: {'EXIT',<0.27542.127>,
                             {rebalance_failed,
                                 {service_error,
                                     <<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}}
      [error_logger:error,2018-07-16T20:09:41.873-07:00,ns_1@172.23.104.16:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
      =========================CRASH REPORT=========================
        crasher:
          initial call: service_rebalancer:-run_rebalance/7-fun-1-/0
          pid: <0.27542.127>
          registered_name: 'service_rebalancer-cbas-worker'
          exception exit: {rebalance_failed,
                              {service_error,
                                  <<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}
            in function  service_rebalancer:wait_for_rebalance_completion/3 (src/service_rebalancer.erl, line 138)
          ancestors: ['service_rebalancer-cbas',<0.26736.127>]
          messages: []
          links: [<0.27272.127>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 4185
          stack_size: 27
          reductions: 69853
        neighbours:
       
      [ns_server:error,2018-07-16T20:09:41.930-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
      [{'ns_1@172.23.104.94',nack}]
      [ns_server:warn,2018-07-16T20:09:41.930-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:
      {error,{bad_nodes,cbas,unset_rebalancer,[{'ns_1@172.23.104.94',nack}]}}
      [user:error,2018-07-16T20:09:41.931-07:00,ns_1@172.23.104.16:<0.15034.0>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {rebalance_failed,
                                     {service_error,
                                      <<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}}
      
      

      On 172.23.104.94, there is no error around the timestamp of rebalance failure, but can see the following errors a couple of mins before the rebalance failure. Not sure if they are related.

      2018-07-16T20:07:37.362-07:00 WARN CBAS.buffercache.BufferCache [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:36:0:0] Failure while trying to read a page from disk
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
              at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]
      

      2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Metadata into a different component
      2018-07-16T20:07:37.363-07:00 ERRO CBAS.impls.LSMHarness [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:36:0:0] Failed to process frame
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
              at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]
              at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:6.0.0-1318]
      

      2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Component_Id_Min : 8 bytes
      2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Component_Id_Max : 8 bytes
      2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying ShadowState : 16384 bytes
      2018-07-16T20:07:37.363-07:00 ERRO CBAS.impls.LSMHarness [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:37:0:0] Failed to process frame
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
              at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]
              at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:6.0.0-1318]
      

      2018-07-16T20:07:37.389-07:00 WARN CBAS.runtime.BucketOperatorNodePushable [Executor-23716:de40bcd69de638bbdb6151d8417e32c3] (Default.Local.other-3(CouchbaseMetadataExtension))[39]:BucketOperatorDescriptor was aborted, adapter.stop succeeded but the thread is still running. re-interrupting the thread...
      java.lang.Throwable: Executor-22626:de40bcd69de638bbdb6151d8417e32c3:BucketOperatorNodePushable:(Default.Local.other-3(CouchbaseMetadataExtension))[39]:BucketOperatorDescriptor Stack trace
              at com.couchbase.analytics.runtime.BucketOperatorNodePushable.start(BucketOperatorNodePushable.java:66) ~[cbas-connector.jar:6.0.0-1318]
              at org.apache.asterix.active.ActiveSourceOperatorNodePushable.initialize(ActiveSourceOperatorNodePushable.java:98) ~[asterix-active.jar:6.0.0-1318]
      

      2018-07-16T20:07:37.391-07:00 INFO CBAS.adapter.CouchbaseConnector [HttpExecutor(port:9110)-3] Getting stats and the client is disconnected, skipping adding progress info
      2018-07-16T20:07:37.389-07:00 WARN CBAS.runtime.BucketOperatorNodePushable [Executor-23715:de40bcd69de638bbdb6151d8417e32c3] (Default.Local.other-3(CouchbaseMetadataExtension))[37]:BucketOperatorDescriptor was aborted, adapter.stop succeeded but the thread is still running. re-interrupting the thread...
      java.lang.Throwable: Executor-22645:de40bcd69de638bbdb6151d8417e32c3:BucketOperatorNodePushable:(Default.Local.other-3(CouchbaseMetadataExtension))[37]:BucketOperatorDescriptor Stack trace
              at java.lang.Object.wait(Native Method) ~[?:1.8.0_162]
              at java.lang.Object.wait(Object.java:502) ~[?:1.8.0_162]
      

      This bug doesn't look like MB-30194 and MB-30472.

      Marking this as a blocker since no rebalances are working on this test where we have increased the scale for ops rate and number of mutations.

      Supportal Link : http://supportal.couchbase.com/snapshot/dc6762750b73f8f07a9e9797b484d75f::0

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            Abdullah.Alamoudi Abdullah Alamoudi [X] (Inactive)
            mihir.kamdar Mihir Kamdar (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty