Couchbase Server - MB-37098

[System Test]: Rebalance failed with badmatch


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 6.5.0
    • Fix Version/s: 6.5.0
    • Component/s: ns_server

    Description

      Build: 6.5.0-4890; not seen on 6.5.0-4821

      Test: MH longevity

      Day: 2nd

      Cycle: 4th

      Test step: Failover -> full recovery

      [2019-11-28T08:47:40-08:00, sequoiatools/couchbase-cli:6.5:23ed4b] failover -c 172.23.97.74:8091 --server-failover 172.23.104.78:8091 -u Administrator -p password --force
      [2019-11-28T08:47:54-08:00, sequoiatools/couchbase-cli:6.5:da704d] recovery -c 172.23.97.74:8091 --server-recovery 172.23.104.78:8091 --recovery-type full -u Administrator -p password
      [2019-11-28T08:48:04-08:00, sequoiatools/couchbase-cli:6.5:493b8a] rebalance -c 172.23.97.74:8091 -u Administrator -p password
      → Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.97.74:8091 -u Administrator -p password]

      docker logs 493b8a
      docker start 493b8a

      *Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-11-28T08:48:59-08:00, sequoiatools/cmd:78a8bc] 60

      Rebalance failed 

      [user:error,2019-11-28T08:48:49.237-08:00,ns_1@172.23.97.74:<0.15848.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {{badmatch,
                                        {error,
                                            {failed_nodes,['ns_1@172.23.96.212']}}},
                                    [{ns_janitor,cleanup_apply_config_body,4,
                                         [{file,"src/ns_janitor.erl"},{line,294}]},
                                     {ns_janitor,'-cleanup_apply_config/4-fun-0-',
                                         4,
                                         [{file,"src/ns_janitor.erl"},{line,217}]},
                                     {async,'-async_init/4-fun-1-',3,
                                         [{file,"src/async.erl"},{line,197}]}]}.
      Rebalance Operation Id = 57197aa228c4043bd66043c97e5f6cb8 
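
      For anyone re-triaging from the CLI: a minimal sketch, assuming the same cluster address and credentials as above, for pulling the current rebalance state and the recent cluster log entries. rebalance-status is a documented couchbase-cli subcommand and /logs is the standard ns_server REST endpoint; jq is only an assumed convenience on the workstation.

      # Ask the orchestrator for the current rebalance state.
      couchbase-cli rebalance-status -c 172.23.97.74:8091 -u Administrator -p password
      # Fetch the recent cluster log entries (the same ones shown in the UI's Logs tab).
      curl -s -u Administrator:password http://172.23.97.74:8091/logs | jq .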


        Activity

          Aliaksey Artamonau added a comment -

          No logs from 172.23.96.212 from the time when the failure occurred.

          Separately, it appears that diagnostics were being collected when the rebalance was started. We don't recommend that.

          Vikas Chaudhary added a comment -

          Another set of logs after 4 hours; see if you have enough info.
          cbcollect logs:
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.113.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.114.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.148.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.151.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.152.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.154.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.68.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.104.78.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.106.134.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.122.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.14.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.18.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.183.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.191.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.207.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.209.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.210.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.212.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.214.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.216.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.220.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.221.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.254.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.96.48.zip
          url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1574979708/collectinfo-2019-11-28T222149-ns_1%40172.23.97.74.zip
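
          For convenience, a small sketch for fetching all of the archives above in one go; it assumes the "url : ..." lines are saved to a local file named urls.txt (an arbitrary name) and that curl is available on the workstation:

          # Extract the zip URLs and download each one, keeping the original file names.
          grep -o 'https://[^ ]*\.zip' urls.txt | while read -r u; do
              curl -f -O "$u"
          done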
           


          Aliaksey Artamonau added a comment -

          Around 2019-11-28T08:48:00, node 172.23.96.212 suddenly starts experiencing significant scheduling delays (as witnessed by both ns_server and memcached):

          2019-11-28T08:48:04.001741-08:00 WARNING (ITEM) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_0. Schedule overhead: 12 s
          2019-11-28T08:48:04.001794-08:00 WARNING (NEW_ORDER) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_0. Schedule overhead: 12 s
          2019-11-28T08:48:04.001832-08:00 WARNING (ORDERS) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_0. Schedule overhead: 12 s
          2019-11-28T08:48:04.001875-08:00 WARNING (ORDER_LINE) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_0. Schedule overhead: 12 s
          2019-11-28T08:48:04.001928-08:00 WARNING (STOCK) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_0. Schedule overhead: 12 s
          2019-11-28T08:48:04.001969-08:00 WARNING (WAREHOUSE) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_0. Schedule overhead: 12 s
          

          In ns_server this manifests as internal timeouts, which lead to the rebalance failure.

          Stats archives on node .212 were reset since the node was ejected from the cluster after the failure of interest, so there's no way to see what the bottleneck could have been.
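
          (As an aside, these warnings are easy to scan for across all of the collected archives. A minimal sketch, assuming the cbcollect zips above have been downloaded to the current directory; the layout inside an archive can vary, hence the recursive grep.)

          # Unpack each archive into its own directory, then search every node's logs
          # for the slow-scheduling warnings.
          for z in collectinfo-*.zip; do unzip -q -o "$z" -d "${z%.zip}"; done
          grep -rH "Slow scheduling" collectinfo-*/ | sort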

          Aliaksey Artamonau added a comment -

          Ritam Sharma, some time ago we requested that atop be run on all tests with a granularity of at least 10 seconds. This is one of the tickets where it would have helped a lot. Yet I still see that atop is configured to collect one sample every 10 minutes.
          Ritam Sharma added a comment -

          Aliaksey Artamonau - We enabled atop on cluster1 for longevity. Will enable it on this cluster as well.
          Dave Finlay added a comment -

          Thanks Ritam. The other important issue is to make sure the interval is no more than 10s. (Not 10 mins.)
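
          For reference, a minimal sketch of what a 10-second atop interval looks like on a typical Linux install. The config file path and variable name are distro-dependent and assumptions here (Debian/Ubuntu ships /etc/default/atop, RHEL/CentOS /etc/sysconfig/atop):

          # Ad hoc: write one raw sample every 10 seconds; read it back later with atop -r.
          atop -w /var/log/atop/atop_test.raw 10 &
          # Packaged daemon: lower the sampling interval (LOGINTERVAL, 600s by default)
          # to 10s, then restart the service.
          sed -i 's/^LOGINTERVAL=.*/LOGINTERVAL=10/' /etc/default/atop
          systemctl restart atop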

          Vikas Chaudhary added a comment -

          Aliaksey Artamonau - we added a few extra nodes that didn't have atop installed or were running it with default settings. The cluster has been fixed. The test is still going on; we will update the issue if we hit it again.

          People

            Assignee: Vikas Chaudhary
            Reporter: Vikas Chaudhary
            Votes: 0
            Watchers: 3
