Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Test Blocker
Fix Version/s: 6.0.0
Affects Version/s: 6.0.0
Component/s: analytics
Labels:
- system-test
- triaged
Environment:
component cluster

Triage:
Untriaged
Is this a Regression?:
Unknown
Sprint:
CX Sprint 111, CX Sprint 112

Description

Build : 6.0.0-1318
Test : -test tests/analytics/test_analytics_rebalance.yml -scope tests/analytics/scope_analytics_rebalance.yml
Scale : 5
Iteration : 3

All rebalance operations in this test are failing. We could capture the logs for only the latest one. For others, the logs have rolled over. Following error is logged in the debug.log for 172.23.104.16 :

ns_server:debug,2018-07-16T20:09:41.872-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_rebalancer:run_rebalance:71]Worker terminated: {'EXIT',<0.27542.127>,

                       {rebalance_failed,

                           {service_error,

                               <<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}}

[error_logger:error,2018-07-16T20:09:41.873-07:00,ns_1@172.23.104.16:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]

=========================CRASH REPORT=========================

  crasher:

    initial call: service_rebalancer:-run_rebalance/7-fun-1-/0

    pid: <0.27542.127>

    registered_name: 'service_rebalancer-cbas-worker'

    exception exit: {rebalance_failed,

                        {service_error,

                            <<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}

      in function  service_rebalancer:wait_for_rebalance_completion/3 (src/service_rebalancer.erl, line 138)

    ancestors: ['service_rebalancer-cbas',<0.26736.127>]

    messages: []

    links: [<0.27272.127>]

    dictionary: []

    trap_exit: false

    status: running

    heap_size: 4185

    stack_size: 27

    reductions: 69853

  neighbours:

[ns_server:error,2018-07-16T20:09:41.930-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:

[{'ns_1@172.23.104.94',nack}]

[ns_server:warn,2018-07-16T20:09:41.930-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:

{error,{bad_nodes,cbas,unset_rebalancer,[{'ns_1@172.23.104.94',nack}]}}

[user:error,2018-07-16T20:09:41.931-07:00,ns_1@172.23.104.16:<0.15034.0>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,

                              {rebalance_failed,

                               {service_error,

                                <<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}}

On 172.23.104.94, there is no error around the timestamp of rebalance failure, but can see the following errors a couple of mins before the rebalance failure. Not sure if they are related.

2018-07-16T20:07:37.362-07:00 WARN CBAS.buffercache.BufferCache [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:36:0:0] Failure while trying to read a page from disk

org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException

        at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]

2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Metadata into a different component

2018-07-16T20:07:37.363-07:00 ERRO CBAS.impls.LSMHarness [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:36:0:0] Failed to process frame

org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException

        at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]

        at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:6.0.0-1318]

2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Component_Id_Min : 8 bytes

2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Component_Id_Max : 8 bytes

2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying ShadowState : 16384 bytes

2018-07-16T20:07:37.363-07:00 ERRO CBAS.impls.LSMHarness [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:37:0:0] Failed to process frame

org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException

        at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]

        at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:6.0.0-1318]

2018-07-16T20:07:37.389-07:00 WARN CBAS.runtime.BucketOperatorNodePushable [Executor-23716:de40bcd69de638bbdb6151d8417e32c3] (Default.Local.other-3(CouchbaseMetadataExtension))[39]:BucketOperatorDescriptor was aborted, adapter.stop succeeded but the thread is still running. re-interrupting the thread...

java.lang.Throwable: Executor-22626:de40bcd69de638bbdb6151d8417e32c3:BucketOperatorNodePushable:(Default.Local.other-3(CouchbaseMetadataExtension))[39]:BucketOperatorDescriptor Stack trace

        at com.couchbase.analytics.runtime.BucketOperatorNodePushable.start(BucketOperatorNodePushable.java:66) ~[cbas-connector.jar:6.0.0-1318]

        at org.apache.asterix.active.ActiveSourceOperatorNodePushable.initialize(ActiveSourceOperatorNodePushable.java:98) ~[asterix-active.jar:6.0.0-1318]

2018-07-16T20:07:37.391-07:00 INFO CBAS.adapter.CouchbaseConnector [HttpExecutor(port:9110)-3] Getting stats and the client is disconnected, skipping adding progress info

2018-07-16T20:07:37.389-07:00 WARN CBAS.runtime.BucketOperatorNodePushable [Executor-23715:de40bcd69de638bbdb6151d8417e32c3] (Default.Local.other-3(CouchbaseMetadataExtension))[37]:BucketOperatorDescriptor was aborted, adapter.stop succeeded but the thread is still running. re-interrupting the thread...

java.lang.Throwable: Executor-22645:de40bcd69de638bbdb6151d8417e32c3:BucketOperatorNodePushable:(Default.Local.other-3(CouchbaseMetadataExtension))[37]:BucketOperatorDescriptor Stack trace

        at java.lang.Object.wait(Native Method) ~[?:1.8.0_162]

        at java.lang.Object.wait(Object.java:502) ~[?:1.8.0_162]

This bug doesn't look like ~~MB-30194~~ and ~~MB-30472~~.

Marking this as a blocker since no rebalances are working on this test where we have increased the scale for ops rate and number of mutations.

Supportal Link : http://supportal.couchbase.com/snapshot/dc6762750b73f8f07a9e9797b484d75f::0

Attachments

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Activity

People

Assignee:: Abdullah Alamoudi [X] (Inactive)

Reporter:: Mihir Kamdar (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 16/Jul/18 9:38 PM

Updated:: 06/Aug/18 12:23 AM

Resolved:: 24/Jul/18 6:21 PM

Gerrit Reviews

There are no open Gerrit changes

[System test] Rebalance out of an analytics node failed

Details

Description

Attachments

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty