Couchbase Server / MB-30766

[System Test] Rebalance operation for any service fails because of analytics nodes rebalance error - Datasets in different partitions have different DCP states

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.0.0
    • Fix Version/s: 6.0.0
    • Component/s: analytics
    • Environment:
      centos cluster (longevity)

      Description

      Build : 6.0.0-1432
      Test : -test tests/integration/test_allFeatures_alice.yml -scope tests/integration/scope_Xattrs_Alice.yml
      Scale : 3

      This issue has been seen several times during system test runs on the latest build. The most recent occurrence is Rebalance ID d958ccd453cca075ef38f72b4c8915ea.

      A failed Analytics rebalance also causes rebalance operations involving other services to fail. Like GSI, the Analytics service should refrain from rebalancing Analytics nodes when the rebalance operation is initiated for nodes of other services.

      Also, when disconnecting the link, it would be good to ensure that the DCP states on all partitions are balanced, even if that delays the disconnect operation, so that issues like this one can be avoided.

      The following appears in the analytics_error.log file on 172.23.96.145:

      2018-08-05T12:45:56.027-07:00 ERRO CBAS.metadata.BucketEventsListener [Executor-571:ClusterController] Failed to connect bucket Default.Local.CUSTOMER(CouchbaseMetadataExtension)
      java.lang.NullPointerException: null
      2018-08-05T12:46:24.561-07:00 ERRO CBAS.metadata.BucketEventsListener [Executor-657:ClusterController] Failed to connect bucket Default.Local.CUSTOMER(CouchbaseMetadataExtension)
      java.lang.NullPointerException: null
      2018-08-05T12:47:09.721-07:00 ERRO CBAS.rebalance.Rebalance [Executor-586:ClusterController] rebalance failed
      com.couchbase.analytics.common.exceptions.AnalyticsHyracksException: CBAS0001: Datasets in different partitions have different DCP states. Mutations needed to catch up = 234581. User action: Connect the bucket: { "class" : "Bucket", "dataverse" : "Default", "link" : "Local", "bucket" : "default", "uuid" : "0e91fbf6d20c5b4a6456222cc2c45ab4", "running" : false } or drop the dataset: Default.ds1
              at com.couchbase.analytics.control.rebalance.ShadowStateWriteCallback.beforeRebalance(ShadowStateWriteCallback.java:89) ~[cbas-server.jar:6.0.0-1435]
              at org.apache.asterix.utils.RebalanceUtil.rebalance(RebalanceUtil.java:220) ~[asterix-app.jar:6.0.0-1435]
              at org.apache.asterix.utils.RebalanceUtil.rebalance(RebalanceUtil.java:131) ~[asterix-app.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.rebalanceDataset(Rebalance.java:403) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.rebalanceDatasets(Rebalance.java:237) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.lambda$doRebalance$3(Rebalance.java:170) ~[cbas-server.jar:6.0.0-1435]
              at org.apache.hyracks.api.util.InvokeUtil.tryWithCleanups(InvokeUtil.java:191) ~[hyracks-api.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:166) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:130) [cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:70) [cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:21) [cbas-connector.jar:6.0.0-1435]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
      2018-08-05T12:47:10.426-07:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-2] Rebalance d958ccd453cca075ef38f72b4c8915ea failed
      com.couchbase.analytics.common.exceptions.AnalyticsHyracksException: CBAS0001: Datasets in different partitions have different DCP states. Mutations needed to catch up = 234581. User action: Connect the bucket: { "class" : "Bucket", "dataverse" : "Default", "link" : "Local", "bucket" : "default", "uuid" : "0e91fbf6d20c5b4a6456222cc2c45ab4", "running" : false } or drop the dataset: Default.ds1
              at com.couchbase.analytics.control.rebalance.ShadowStateWriteCallback.beforeRebalance(ShadowStateWriteCallback.java:89) ~[cbas-server.jar:6.0.0-1435]
              at org.apache.asterix.utils.RebalanceUtil.rebalance(RebalanceUtil.java:220) ~[asterix-app.jar:6.0.0-1435]
              at org.apache.asterix.utils.RebalanceUtil.rebalance(RebalanceUtil.java:131) ~[asterix-app.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.rebalanceDataset(Rebalance.java:403) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.rebalanceDatasets(Rebalance.java:237) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.lambda$doRebalance$3(Rebalance.java:170) ~[cbas-server.jar:6.0.0-1435]
              at org.apache.hyracks.api.util.InvokeUtil.tryWithCleanups(InvokeUtil.java:191) ~[hyracks-api.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:166) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:130) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:70) ~[cbas-server.jar:6.0.0-1435]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:21) ~[cbas-connector.jar:6.0.0-1435]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
      2018-08-05T12:47:28.610-07:00 ERRO CBAS.metadata.BucketEventsListener [Executor-345:ClusterController] Failed to connect bucket Default.Local.CUSTOMER(CouchbaseMetadataExtension)
      java.lang.NullPointerException: null
      2018-08-05T12:48:18.334-07:00 ERRO CBAS.metadata.BucketEventsListener [Executor-659:ClusterController] Failed to connect bucket Default.Local.CUSTOMER(CouchbaseMetadataExtension)
      java.lang.NullPointerException: null
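
For triage across many nodes, it can help to pull the catch-up count out of the CBAS0001 lines quoted above. The following is a minimal sketch, assuming the message wording stays as it appears in this ticket; the class and method names are hypothetical and are not part of Couchbase Analytics.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper; not part of the Couchbase Analytics code base. */
final class DcpStateErrorParser {
    // Assumes the CBAS0001 wording quoted in this ticket.
    private static final Pattern CATCH_UP =
            Pattern.compile("CBAS0001: .*?Mutations needed to catch up = (\\d+)");

    private DcpStateErrorParser() {}

    /** Returns the reported mutation backlog, or empty if the line is not a CBAS0001 DCP-state error. */
    static OptionalLong mutationsToCatchUp(String logLine) {
        Matcher m = CATCH_UP.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }
}
```

Calling `DcpStateErrorParser.mutationsToCatchUp(line)` on each line of analytics_error.log and keeping the non-empty results gives the backlog reported by each failed rebalance attempt.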
      
      

            Activity

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-1173 contains asterix-opt commit dc8fd91 with commit message:
            MB-30766: log rebalance failures to UI log

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-1173 contains cbas commit d1c8951 with commit message:
            MB-30766: ignore rebalance failures when analytics topology is not changing
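
The behaviour named in this commit message can be illustrated roughly as follows. This is only a sketch of the idea, not the contents of commit d1c8951: the class, the method names, and the way the current node set is obtained are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; names are not taken from the cbas code base. */
final class TopologyChangeSketch {
    private TopologyChangeSketch() {}

    /** True when the requested node set differs from the nodes currently hosting Analytics. */
    static boolean isTopologyChanging(Set<String> currentNodeIds,
                                      Set<String> keepNodeIds,
                                      Set<String> ejectNodeIds) {
        // An ejected node only matters if it is currently an Analytics node.
        Set<String> ejectedMembers = new HashSet<>(ejectNodeIds);
        ejectedMembers.retainAll(currentNodeIds);
        return !ejectedMembers.isEmpty() || !currentNodeIds.equals(keepNodeIds);
    }

    /**
     * Runs the dataset rebalance step. A failure is rethrown only when the
     * Analytics topology is changing; otherwise it is reported and swallowed.
     * Returns true if the step completed without an exception.
     */
    static boolean rebalance(Set<String> current, Set<String> keep,
                             Set<String> eject, Runnable datasetRebalance) {
        try {
            datasetRebalance.run();
            return true;
        } catch (RuntimeException e) {
            if (isTopologyChanging(current, keep, eject)) {
                throw e;
            }
            System.err.println("Ignoring rebalance failure (topology unchanged): " + e.getMessage());
            return false;
        }
    }
}
```

With this shape, a rebalance started for another service (same Analytics nodes before and after) no longer fails on a CBAS0001 error, while a rebalance that adds or removes an Analytics node still does.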

            vikas.chaudhary Vikas Chaudhary added a comment -

            Still seeing on 6.0.0-1458

            Rebalance exited with reason {service_rebalance_failed,cbas, {rebalance_failed, {service_error, <<"Rebalance b246946c9e1d3ec0f634a73b33574701 failed: CBAS0001: Datasets in different partitions have different DCP states. Mutations needed to catch up = 1266531. User action: Try again later">>}}}
            ns_orchestrator 000   ns_1@172.23.108.103   2:14:51 AM   Wed Aug 8, 2018

            Analytics unable to successfully rebalance b246946c9e1d3ec0f634a73b33574701 due to 'CBAS0001: Datasets in different partitions have different DCP states. Mutations needed to catch up = 1266531. User action: Try again later'; see analytics log for details
            analytics 000   ns_1@172.23.96.145   2:14:50 AM   Wed Aug 8, 2018

            logs :
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.164.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.61.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.67.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.69.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.70.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.87.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.104.88.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.106.188.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.108.103.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.108.104.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.96.145.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.96.148.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.96.168.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.96.56.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.97.238.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.97.239.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.97.242.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.98.135.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.99.11.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.99.20.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.99.21.zip
            https://s3.amazonaws.com/bugdb/jira/longevity/collectinfo-2018-08-08T092848-ns_1%40172.23.99.25.zip

            michael.blow Michael Blow added a comment -

            The issue in the reopened logs is not the same: this is a rebalance-out of an Analytics node (172.23.99.25), which is still expected to fail if the DCP states do not match. To rebalance an Analytics node, the DCP states must match; they can be brought into agreement by connecting the link.

             

            2018-08-08T02:11:20.471-07:00 INFO CBAS.rebalance.Rebalance [Executor-66:ClusterController] %%%%%%%%%%%%% enter rebalance {"id":"b246946c9e1d3ec0f634a73b33574701","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"118386a343c5366834b04db5895f5f61","priority":0,"opaque":{"cc-http-port":"9111","controller-id":"2","host":"172.23.106.188","num-iodevices":"8","starting-partition-id":"16"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"48bcb863227a27ba33e10f24c3d81416","priority":0,"opaque":{"cc-http-port":"9111","controller-id":"0","host":"172.23.96.145","num-iodevices":"8","starting-partition-id":"0"}},"recoveryType":"recovery-full"}],"ejectNodes":[{"nodeId":"66b03e8c74ece2ce9f787af73da841ec","priority":0,"opaque":{"cc-http-port":"9111","controller-id":"1","host":"172.23.99.25","num-iodevices":"8","starting-partition-id":"8"}}]}
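
As a rough illustration of what "DCP states must match" could mean, the sketch below compares per-partition states and totals the backlog. It rests on assumptions not confirmed anywhere in this ticket: that each partition records one sequence number per vBucket, and that the catch-up figure is the sum of how far each partition trails the highest recorded value. It is not the logic in ShadowStateWriteCallback, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; the state model here is an assumption, not the cbas implementation. */
final class DcpStateComparison {
    private DcpStateComparison() {}

    /**
     * Each map is one partition's state: vBucket id -> last sequence number recorded.
     * Returns the total number of mutations the lagging partitions are behind.
     */
    static long mutationsToCatchUp(List<Map<Integer, Long>> partitionStates) {
        // Highest sequence number recorded for each vBucket across all partitions.
        Map<Integer, Long> highest = new HashMap<>();
        for (Map<Integer, Long> state : partitionStates) {
            for (Map.Entry<Integer, Long> e : state.entrySet()) {
                highest.merge(e.getKey(), e.getValue(), Math::max);
            }
        }
        // Sum each partition's distance from that highest value (missing vBucket counts as 0).
        long lag = 0;
        for (Map<Integer, Long> state : partitionStates) {
            for (Map.Entry<Integer, Long> e : highest.entrySet()) {
                lag += e.getValue() - state.getOrDefault(e.getKey(), 0L);
            }
        }
        return lag;
    }

    /** Under the assumed model, a rebalance would be allowed only when this returns true. */
    static boolean statesMatch(List<Map<Integer, Long>> partitionStates) {
        return mutationsToCatchUp(partitionStates) == 0;
    }
}
```

Under this model, reconnecting the link lets the lagging partitions ingest the missing mutations, after which the states agree and the rebalance-out can proceed.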
            

             

            vikas.chaudhary Vikas Chaudhary added a comment -

            Not seen on 6.0.0-1606


              People

              Assignee:
              michael.blow Michael Blow
              Reporter:
              mihir.kamdar Mihir Kamdar