Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.6.0
1
Description
[Split out of MB-47873 per 2021-10-20 GSI scrum. Originally opened by Varun Velamuri from CBSE-10499.
MB-47874 was also created for the same problem from the same CBSE, aiming to find and fix the root cause, while the current MB aims to keep Rebalance from failing even if this bug is hit.]
If merging the replica counter fails, the indexer can skip replica repair for the definition on which the merge failed, rather than failing the rebalance.
From CBSE-10499 description, which is the source of the original error:
Background and Analysis:
Analytics Nodes were added on `2021-07-27`, and Rebalance has been failing since then:
2021-07-27T10:08:18.443000+0100 10.114.141.4 added node 10.114.141.16
2021-07-27T10:14:59.570000+0100 10.114.141.4 added node 10.114.141.17
Rebalance is failing with the error below:
[ns_server:error,2021-07-27T13:19:53.907+01:00,ns_1@10.114.141.10:service_status_keeper-index<0.725.0>:service_status_keeper:handle_cast:119]Service service_index returned incorrect status
[ns_server:error,2021-07-27T13:22:33.398+01:00,ns_1@10.114.141.10:service_agent-index<0.669.0>:service_agent:handle_info:287]Rebalancer <13523.17251.1> died unexpectedly: {worker_died, {'EXIT',<13523.17280.1>, {rebalance_failed, {service_error, <<"Unable to read index layout from cluster 127.0.0.1:8091. err = Cannot merge counter with different base values">>}}}} ..
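The proposed behaviour (skip replica repair for the affected definition instead of failing the rebalance) could look roughly like the sketch below. This is only an illustration: `ErrMergeCounter`, `mergeReplicaCounter` and `planReplicaRepair` are hypothetical names, not functions from the indexer code base, and the merge rule is reduced to "all base values must match".

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMergeCounter is a hypothetical sentinel standing in for the
// "Cannot merge counter with different base values" failure.
var ErrMergeCounter = errors.New("Cannot merge counter with different base values")

// mergeReplicaCounter is a hypothetical stand-in for the per-definition
// counter merge. It fails whenever the reported base values differ.
func mergeReplicaCounter(bases []uint32) (uint32, error) {
	if len(bases) == 0 {
		return 0, nil
	}
	for _, b := range bases[1:] {
		if b != bases[0] {
			return 0, ErrMergeCounter
		}
	}
	return bases[0], nil
}

// planReplicaRepair splits definitions into those eligible for replica
// repair and those skipped because their counters could not be merged.
// A merge failure on one definition does not abort the whole operation.
func planReplicaRepair(defnBases map[string][]uint32) (repair, skipped []string) {
	for defn, bases := range defnBases {
		if _, err := mergeReplicaCounter(bases); err != nil {
			skipped = append(skipped, defn)
			continue
		}
		repair = append(repair, defn)
	}
	return repair, skipped
}

func main() {
	repair, skipped := planReplicaRepair(map[string][]uint32{
		"idx_a": {2, 2}, // consistent base values: repair may proceed
		"idx_b": {1, 2}, // mismatched base values: repair is skipped
	})
	fmt.Println("repair:", repair, "skipped:", skipped)
}
```

The skipped definitions would still need to be logged so the underlying counter mismatch (tracked in MB-47874) remains visible.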
Activity
Link | This issue relates to CBSE-10499 [ CBSE-10499 ] |
Description |
There are multiple phases in index rebalance, each of which can fail with a variety of errors. The current behaviour is to fail the rebalance on any error encountered in any phase. Because a failed index service rebalance can block system-wide progress, failing the rebalance for every error is not a good idea.
This is a blanket ticket with the following goals:
a. Investigate all possible errors that can be encountered during rebalance.
b. Identify the error cases where the index service can continue the rebalance without failing it, possibly by taking a work-around path. E.g., if merging the replica counter fails, the indexer can avoid replica repair for the definition on which the merge failed, rather than failing the rebalance. |
[Split out of MB-47873 per 2021-10-20 GSI scrum. Originally opened by [~varun.velamuri]]
If merging replica counter fails, indexer can avoid replica repair for the definition on which merge failed, rather than failing rebalance. |
Summary | Do not fail Rebalance for merge replica counter failure | CBSE: Do not fail Rebalance for merge replica counter failure |
Description |
[Split out of MB-47873 per 2021-10-20 GSI scrum. Originally opened by [~varun.velamuri]]
If merging replica counter fails, indexer can avoid replica repair for the definition on which merge failed, rather than failing rebalance. |
[Split out of MB-47873 per 2021-10-20 GSI scrum. Originally opened by [~varun.velamuri] from CBSE-10499.]
If merging replica counter fails, indexer can avoid replica repair for the definition on which merge failed, rather than failing rebalance. |
Status | Open [ 1 ] | In Progress [ 3 ] |
Status | In Progress [ 3 ] | Open [ 1 ] |
Status | Open [ 1 ] | In Progress [ 3 ] |
Affects Version/s | 7.0.0 [ 17233 ] | |
Affects Version/s | 6.6.0 [ 16787 ] |
Resolution | Fixed [ 1 ] | |
Status | In Progress [ 3 ] | Resolved [ 5 ] |
Labels | request-dev-verify |
Status | Resolved [ 5 ] | Closed [ 6 ] |
This is a Planner problem, not a Rebalance problem. The error "Unable to read index layout from cluster" is only logged by planner/executor.go, in 6 different places that are indistinguishable from the message itself:
1. ExecuteRebalanceInternal()
2. ExecutePlan()
3. FindIndexReplicaNodes()
4. ExecuteReplicaRepair()
5. ExecuteReplicaDrop()
6. ExecuteRetrieve()
(Also logged by cmd/cbindexplan/main.go main() but that is probably not relevant here.)
The relevant code path for Rebalance is:
rebalancer.go – NewRebalancer()
rebalancer.go – initRebalAsync()
executor.go – ExecuteRebalance()
executor.go – ExecuteRebalanceInternal() – wraps original error in message starting "Unable to read index layout from cluster"
proxy.go – RetrievePlanFromCluster() – reports error "Cannot merge counter with different base values"
proxy.go – getIndexNumReplica()
counter.go – MergeWith() – generates "Cannot merge..." error message
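To illustrate how the message in the ns_server log is assembled along this path, here is a reduced sketch. The `Counter` type, its fields, the merge rule, and the lower-case function names are simplified assumptions made for illustration. They mirror the call chain above but are not the actual indexer code.

```go
package main

import (
	"errors"
	"fmt"
)

// Counter is a simplified, hypothetical replica counter: a base value plus
// increments and decrements applied on top of it.
type Counter struct {
	Base, Incr, Decr uint32
}

func maxU32(a, b uint32) uint32 {
	if a > b {
		return a
	}
	return b
}

// MergeWith combines two counters. In this sketch, counters are only
// mergeable when they share a base value.
func (c Counter) MergeWith(o Counter) (Counter, error) {
	if c.Base != o.Base {
		return c, errors.New("Cannot merge counter with different base values")
	}
	return Counter{c.Base, maxU32(c.Incr, o.Incr), maxU32(c.Decr, o.Decr)}, nil
}

// getIndexNumReplica merges the counters reported by each node.
func getIndexNumReplica(counters []Counter) (Counter, error) {
	var merged Counter
	for i, c := range counters {
		if i == 0 {
			merged = c
			continue
		}
		var err error
		if merged, err = merged.MergeWith(c); err != nil {
			return Counter{}, err
		}
	}
	return merged, nil
}

// retrievePlanFromCluster passes the merge error up unchanged.
func retrievePlanFromCluster(counters []Counter) error {
	_, err := getIndexNumReplica(counters)
	return err
}

// executeRebalanceInternal wraps the original error in the
// "Unable to read index layout from cluster" message.
func executeRebalanceInternal(clusterURL string, counters []Counter) error {
	if err := retrievePlanFromCluster(counters); err != nil {
		return fmt.Errorf("Unable to read index layout from cluster %s. err = %w", clusterURL, err)
	}
	return nil
}

func main() {
	err := executeRebalanceInternal("127.0.0.1:8091", []Counter{{Base: 1}, {Base: 2}})
	fmt.Println(err)
}
```

Because the error is only wrapped at the top of the chain, any handling that wants to skip replica repair for one definition would have to catch it lower down, near the per-definition merge.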