Couchbase Server

MB-18934: [FTS] MCP: Rebalance fails due to cbft getting killed by OOM killer


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: 4.5.0
    • Fix Version: 4.5.0
    • Component: cbft
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      Build
      4.5.0-1960

      Testcase
      ./testrunner -i INI_FILE.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,index_retry=10,GROUP=P0 -t fts.moving_topology_fts.MovingTopFTS.rebalance_out_during_index_building,items=30000,cluster=D,F,F,index_replicas=1,standard_buckets=2,sasl_buckets=2,GROUP=P0
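
      For context, the index_replicas=1 parameter in the testcase corresponds to planParams.numReplicas in the FTS index definition. The sketch below is a hedged, direct equivalent of the index creation against the FTS REST port (8094); the node, credentials, bucket, and index name are assumptions based on this run, not the harness's actual code path:

      import requests

      # Hypothetical direct equivalent of the test's "create default index"
      # with index_replicas=1: PUT an index definition whose planParams
      # requests one replica.
      index_def = {
          "type": "fulltext-index",
          "sourceType": "couchbase",
          "sourceName": "default",           # bucket to index
          "planParams": {"numReplicas": 1},  # index_replicas=1 in the testcase
      }
      resp = requests.put(
          "http://172.23.106.175:8094/api/index/default_index_1",
          auth=("Administrator", "password"),
          json=index_def,
      )
      resp.raise_for_status()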

      All nodes have 4 cores, 4 GB RAM, and SSDs.

      Steps
      1. Create a D,F,F cluster with 5 buckets of 30K docs each.
      2. With index_replicas=1, create a default index on each bucket.
      3. While the indexes are building, rebalance out an FTS node.
      4. The rebalance fails with a message indicating cbft was killed. The "messages" file in the cbcollect syslog shows cbft was killed twice by the OOM killer:

      2016-03-28 15:45:17 | INFO | MainProcess | test_thread | [moving_topology_fts.rebalance_out_during_index_building] Index building has begun...
      2016-03-28 15:45:20 | INFO | MainProcess | test_thread | [moving_topology_fts.rebalance_out_during_index_building] Index count for default_index_1: 6335
      2016-03-28 15:45:22 | INFO | MainProcess | test_thread | [moving_topology_fts.rebalance_out_during_index_building] Index count for sasl_bucket_1_index_1: 5529
      2016-03-28 15:45:25 | INFO | MainProcess | test_thread | [moving_topology_fts.rebalance_out_during_index_building] Index count for sasl_bucket_2_index_1: 3183
      2016-03-28 15:45:27 | INFO | MainProcess | test_thread | [moving_topology_fts.rebalance_out_during_index_building] Index count for standard_bucket_1_index_1: 1608
      2016-03-28 15:45:30 | INFO | MainProcess | test_thread | [moving_topology_fts.rebalance_out_during_index_building] Index count for standard_bucket_2_index_1: 787
      2016-03-28 15:45:30 | INFO | MainProcess | test_thread | [fts_base.__async_rebalance_out] Starting rebalance-out nodes:[ip:172.23.106.176 port:8091 ssh_username:root] at C1 cluster 172.23.106.139
      2016-03-28 15:45:30 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance params : password=password&ejectedNodes=ns_1%40172.23.106.176&user=Administrator&knownNodes=ns_1%40172.23.106.176%2Cns_1%40172.23.106.175%2Cns_1%40172.23.106.139
      2016-03-28 15:45:30 | INFO | MainProcess | Cluster_Thread | [rest_client.rebalance] rebalance operation started
      2016-03-28 15:45:30 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 0.00 %
      2016-03-28 15:45:41 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 16.67 %
      2016-03-28 15:45:51 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 16.67 %
      2016-03-28 15:46:01 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 18.27 %
      2016-03-28 15:46:11 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 23.08 %
      2016-03-28 15:46:21 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 23.08 %
      2016-03-28 15:46:31 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 26.28 %
      2016-03-28 15:46:41 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 26.28 %
      2016-03-28 15:46:51 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 29.49 %
      2016-03-28 15:47:01 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 29.49 %
      2016-03-28 15:47:11 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 31.09 %
      2016-03-28 15:47:22 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 32.69 %
      2016-03-28 15:47:32 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 32.69 %
      2016-03-28 15:47:42 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 39.10 %
      2016-03-28 15:47:52 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 45.51 %
      2016-03-28 15:48:02 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 51.92 %
      2016-03-28 15:48:12 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 61.54 %
      2016-03-28 15:48:22 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 61.54 %
      2016-03-28 15:48:32 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 61.54 %
      2016-03-28 15:48:42 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 61.54 %
      2016-03-28 15:48:52 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 63.14 %
      2016-03-28 15:49:02 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 66.35 %
      2016-03-28 15:49:12 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 66.35 %
      2016-03-28 15:49:22 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 66.35 %
      2016-03-28 15:49:33 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 71.15 %
      2016-03-28 15:49:43 | INFO | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] rebalance percentage : 71.15 %
      2016-03-28 15:49:53 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_progress] {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'} - rebalance failed
      2016-03-28 15:49:56 | INFO | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.106.139:
      2016-03-28 15:49:56 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.106.139', u'code': 2, u'text': u'Rebalance exited with reason {service_rebalance_failed,fts,\n                                 {lost_connection,shutdown}}\n', u'shortText': u'message', u'serverTime': u'2016-03-28T15:49:52.944Z', u'module': u'ns_orchestrator', u'tstamp': 1459205392944, u'type': u'info'}
      2016-03-28 15:49:56 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.106.175', u'code': 0, u'text': u"Service 'fts' exited with status 1. Restarting. Messages: 2016-03-28T15:49:34.116-07:00 [INFO] janitor: feeds to remove: 0\n2016-03-28T15:49:34.116-07:00 [INFO] janitor: feeds to add: 0\n2016-03-28T15:49:37.621-07:00 [INFO] moss_herder: persistence progess, waiting: 27\n2016-03-28T15:49:48.875-07:00 [INFO] moss_herder: persistence progess, waiting: 7\n[goport] 2016/03/28 15:49:52 /opt/couchbase/bin/cbft terminated: signal: killed", u'shortText': u'message', u'serverTime': u'2016-03-28T15:49:52.608Z', u'module': u'ns_log', u'tstamp': 1459205392608, u'type': u'info'}
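
      For reference, the rebalance the harness starts at 15:45:30 is a plain ns_server REST call. The sketch below reconstructs it in Python from the "rebalance params" log line above; the host, credentials, and node names are the ones from this run:

      import requests

      # POST /controller/rebalance with the same knownNodes/ejectedNodes the
      # test logged; .176 is the FTS node being rebalanced out.
      resp = requests.post(
          "http://172.23.106.139:8091/controller/rebalance",
          auth=("Administrator", "password"),
          data={
              "knownNodes": "ns_1@172.23.106.176,ns_1@172.23.106.175,ns_1@172.23.106.139",
              "ejectedNodes": "ns_1@172.23.106.176",
          },
      )
      resp.raise_for_status()
      # The harness then polls GET /pools/default/rebalanceProgress for the
      # percentages shown in the log.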
      

      On .175:

      Mar 28 15:49:51 localhost kernel: Out of memory: Kill process 17132 (cbft) score 426 or sacrifice child
      :
      :
      Mar 28 15:54:32 localhost kernel: Out of memory: Kill process 17894 (cbft) score 425 or sacrifice child
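
      To confirm the kills independently, the collected syslog can be scanned for OOM-killer entries. A minimal sketch, assuming the syslog has been extracted from the cbcollect as a local file named "messages":

      import re

      # Match kernel OOM-killer lines like the two above; keep only cbft kills.
      oom = re.compile(r"Out of memory: Kill process (\d+) \((\S+)\) score (\d+)")
      with open("messages") as f:
          for line in f:
              m = oom.search(line)
              if m and m.group(2) == "cbft":
                  print(line.rstrip())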
      

      Attaching logs from:
      .139 --> kv
      .175 --> fts
      .176 --> fts (the node that was rebalanced out)
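
      On a repro run, cbft's memory growth could be correlated with the rebalance timeline using a small monitor left running on the FTS nodes. This is a hypothetical diagnostic helper, assuming pgrep and /proc are available on the test VMs (Linux):

      import subprocess
      import time

      def cbft_rss_kb():
          # Sum VmRSS across all cbft processes (goport may restart cbft
          # mid-run, as the "Restarting" message above shows).
          pids = subprocess.run(["pgrep", "-x", "cbft"],
                                capture_output=True, text=True).stdout.split()
          total_kb = 0
          for pid in pids:
              try:
                  with open("/proc/%s/status" % pid) as f:
                      for line in f:
                          if line.startswith("VmRSS:"):
                              total_kb += int(line.split()[1])  # value is in kB
              except FileNotFoundError:
                  pass  # process exited between pgrep and the read
          return total_kb

      for _ in range(600):  # sample once a second for ~10 minutes
          print(time.strftime("%Y-%m-%d %H:%M:%S"), cbft_rss_kb(), "kB")
          time.sleep(1)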


            People

      Assignee: apiravi Aruna Piravi (Inactive)
      Reporter: apiravi Aruna Piravi (Inactive)
      Votes: 0
      Watchers: 2


                Gerrit Reviews

                  There are no open Gerrit changes
