Details
-
Bug
-
Resolution: Fixed
-
Test Blocker
-
6.0.0
-
component cluster
-
Untriaged
-
Unknown
-
CX Sprint 111, CX Sprint 112
Description
Build : 6.0.0-1318
Test : -test tests/analytics/test_analytics_rebalance.yml -scope tests/analytics/scope_analytics_rebalance.yml
Scale : 5
Iteration : 3
All rebalance operations in this test are failing. We could capture the logs for only the latest one. For others, the logs have rolled over. Following error is logged in the debug.log for 172.23.104.16 :
ns_server:debug,2018-07-16T20:09:41.872-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_rebalancer:run_rebalance:71]Worker terminated: {'EXIT',<0.27542.127>,
|
{rebalance_failed,
|
{service_error,
|
<<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}}
|
[error_logger:error,2018-07-16T20:09:41.873-07:00,ns_1@172.23.104.16:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
|
=========================CRASH REPORT=========================
|
crasher:
|
initial call: service_rebalancer:-run_rebalance/7-fun-1-/0
|
pid: <0.27542.127>
|
registered_name: 'service_rebalancer-cbas-worker'
|
exception exit: {rebalance_failed,
|
{service_error,
|
<<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}
|
in function service_rebalancer:wait_for_rebalance_completion/3 (src/service_rebalancer.erl, line 138)
|
ancestors: ['service_rebalancer-cbas',<0.26736.127>]
|
messages: []
|
links: [<0.27272.127>]
|
dictionary: []
|
trap_exit: false
|
status: running
|
heap_size: 4185
|
stack_size: 27
|
reductions: 69853
|
neighbours:
|
|
[ns_server:error,2018-07-16T20:09:41.930-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
|
[{'ns_1@172.23.104.94',nack}]
|
[ns_server:warn,2018-07-16T20:09:41.930-07:00,ns_1@172.23.104.16:service_rebalancer-cbas<0.27272.127>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:
|
{error,{bad_nodes,cbas,unset_rebalancer,[{'ns_1@172.23.104.94',nack}]}}
|
[user:error,2018-07-16T20:09:41.931-07:00,ns_1@172.23.104.16:<0.15034.0>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
|
{rebalance_failed,
|
{service_error,
|
<<"Rebalance 402b4d3847e109a655e098fc19af0f32 failed, see analytics log for details">>}}}
|
|
On 172.23.104.94, there is no error around the timestamp of rebalance failure, but can see the following errors a couple of mins before the rebalance failure. Not sure if they are related.
2018-07-16T20:07:37.362-07:00 WARN CBAS.buffercache.BufferCache [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:36:0:0] Failure while trying to read a page from disk
|
org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
|
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]
|
2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Metadata into a different component
|
2018-07-16T20:07:37.363-07:00 ERRO CBAS.impls.LSMHarness [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:36:0:0] Failed to process frame
|
org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
|
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]
|
at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:6.0.0-1318]
|
2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Component_Id_Min : 8 bytes
|
2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying Component_Id_Max : 8 bytes
|
2018-07-16T20:07:37.363-07:00 INFO CBAS.impls.MemoryComponentMetadata [Executor-23482:de40bcd69de638bbdb6151d8417e32c3] Copying ShadowState : 16384 bytes
|
2018-07-16T20:07:37.363-07:00 ERRO CBAS.impls.LSMHarness [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:2:0:37:0:0] Failed to process frame
|
org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
|
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1318]
|
at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:6.0.0-1318]
|
2018-07-16T20:07:37.389-07:00 WARN CBAS.runtime.BucketOperatorNodePushable [Executor-23716:de40bcd69de638bbdb6151d8417e32c3] (Default.Local.other-3(CouchbaseMetadataExtension))[39]:BucketOperatorDescriptor was aborted, adapter.stop succeeded but the thread is still running. re-interrupting the thread...
|
java.lang.Throwable: Executor-22626:de40bcd69de638bbdb6151d8417e32c3:BucketOperatorNodePushable:(Default.Local.other-3(CouchbaseMetadataExtension))[39]:BucketOperatorDescriptor Stack trace
|
at com.couchbase.analytics.runtime.BucketOperatorNodePushable.start(BucketOperatorNodePushable.java:66) ~[cbas-connector.jar:6.0.0-1318]
|
at org.apache.asterix.active.ActiveSourceOperatorNodePushable.initialize(ActiveSourceOperatorNodePushable.java:98) ~[asterix-active.jar:6.0.0-1318]
|
2018-07-16T20:07:37.391-07:00 INFO CBAS.adapter.CouchbaseConnector [HttpExecutor(port:9110)-3] Getting stats and the client is disconnected, skipping adding progress info
|
2018-07-16T20:07:37.389-07:00 WARN CBAS.runtime.BucketOperatorNodePushable [Executor-23715:de40bcd69de638bbdb6151d8417e32c3] (Default.Local.other-3(CouchbaseMetadataExtension))[37]:BucketOperatorDescriptor was aborted, adapter.stop succeeded but the thread is still running. re-interrupting the thread...
|
java.lang.Throwable: Executor-22645:de40bcd69de638bbdb6151d8417e32c3:BucketOperatorNodePushable:(Default.Local.other-3(CouchbaseMetadataExtension))[37]:BucketOperatorDescriptor Stack trace
|
at java.lang.Object.wait(Native Method) ~[?:1.8.0_162]
|
at java.lang.Object.wait(Object.java:502) ~[?:1.8.0_162]
|
This bug doesn't look like MB-30194 and MB-30472.
Marking this as a blocker since no rebalances are working on this test where we have increased the scale for ops rate and number of mutations.
Supportal Link : http://supportal.couchbase.com/snapshot/dc6762750b73f8f07a9e9797b484d75f::0