Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-25354

[FTS] Rebalance fails from OOM killer, when many pindexes (> 256)

    XMLWordPrintable

Details

    • Triaged
    • No

    Description

      Build

      5.0.0-3317

      Testcase

      Pls see MB-25272, rebalance now failed in the same test.

      "This test has never failed in previous runs but I do not think it's a regression. Probably timing-related or a race in index deletion.

      1. Cluster: D, D+F, D+F
      2. Test indexes 200k docs, in the middle of indexing, fails over one fts node and updates the index definition to use 512 pindexes(oops! note to self: change to 8 or 16)
      3. Indexing completes.
      4. In the teardown phase, we start by deleting the index, then the buckets and then try to de-cluster the nodes when rebalance gets stuck at 75% "..and then fails with the following error.

       
      ./testrunner -i INI_FILE.ini get-cbcollect-info=True,get-logs=False,stop-on-failure=False,level_compaction=False,GROUP=P1 -t fts.moving_topology_fts.MovingTopFTS.update_index_during_failover,items=200000,cluster=D+F+F,GROUP=P1

      [2017-07-16 00:21:42,820] - [fts_base:1709] INFO - removing nodes from cluster ...
      [2017-07-16 00:21:42,835] - [fts_base:1711] INFO - cleanup [ip:172.23.105.96 port:8091 ssh_username:root, ip:172.23.105.190 port:8091 ssh_username:root, ip:172.23.106.66 port:8091 ssh_username:root]
      [2017-07-16 00:21:42,878] - [bucket_helper:142] INFO - deleting existing buckets [u'default'] on 172.23.105.96
      [2017-07-16 00:21:42,878] - [bucket_helper:144] INFO - remove bucket default ...
      [2017-07-16 00:21:45,725] - [bucket_helper:158] INFO - deleted bucket : default from 172.23.105.96
      [2017-07-16 00:21:45,726] - [bucket_helper:234] INFO - waiting for bucket deletion to complete....
      [2017-07-16 00:21:45,730] - [rest_client:133] INFO - node 172.23.105.96 existing buckets : []
      [2017-07-16 00:21:45,767] - [cluster_helper:254] INFO - rebalancing all nodes in order to remove nodes
      [2017-07-16 00:21:45,773] - [rest_client:1395] INFO - rebalance params : {'password': 'password', 'ejectedNodes': u'ns_1@172.23.106.66,ns_1@172.23.105.190', 'user': 'Administrator', 'knownNodes': u'ns_1@172.23.106.66,ns_1@172.23.105.96,ns_1@172.23.105.190'}
      [2017-07-16 00:21:45,786] - [rest_client:1400] INFO - rebalance operation started
      [2017-07-16 00:21:45,793] - [rest_client:1548] INFO - rebalance percentage : 0.00 %
      [2017-07-16 00:21:55,810] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:22:05,826] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:22:15,842] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:22:25,859] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:22:35,869] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:22:45,885] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:22:55,902] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:23:05,915] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:23:15,930] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:23:25,950] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:23:35,968] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:23:45,983] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:23:55,999] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:24:06,018] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:24:16,038] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:24:26,053] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:24:36,059] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:24:46,067] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:24:56,081] - [rest_client:1548] INFO - rebalance percentage : 75.00 %
      [2017-07-16 00:25:06,096] - [rest_client:1531] ERROR - {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2017-07-16 00:25:06,122] - [rest_client:2955] INFO - Latest logs from UI on 172.23.105.96:
      [2017-07-16 00:25:06,122] - [rest_client:2956] ERROR - {u'node': u'ns_1@172.23.105.190', u'code': 0, u'text': u"Service 'fts' exited with status 137. Restarting. Messages:\n2017-07-16T00:25:00.892-07:00 [INFO] feed_dcp: OnError, name: default_index_1_3e2f9755ca7458f5_fb9bdf16: bucketName: default, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.105.190:11210, err: MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: \n2017-07-16T00:25:00.892-07:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.105.190:11210, err: MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: \n2017-07-16T00:25:00.940-07:00 [INFO] Using plain authentication for user @fts \n2017-07-16T00:25:00.941-07:00 [INFO] Using plain authentication for user @fts \n2017-07-16T00:25:01.004-07:00 [INFO] feed_dcp: OnError, name: default_index_1_3e2f9755ca7458f5_d75fa3fd: bucketName: default, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.105.190:11210, err: MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: \n2017-07-16T00:25:01.008-07:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.105.190:11210, err: MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: \n[goport(/opt/couchbase/bin/cbft)] 2017/07/16 00:25:03 child process exited with status 137\n", u'shortText': u'message', u'serverTime': u'2017-07-16T00:25:03.445Z', u'module': u'ns_log', u'tstamp': 1500189903445, u'type': u'info'}
      [2017-07-16 00:25:06,123] - [rest_client:2956] ERROR - {u'node': u'ns_1@172.23.105.96', u'code': 0, u'text': u'Rebalance exited with reason {service_rebalance_failed,fts,\n {lost_connection,shutdown}}', u'shortText': u'message', u'serverTime': u'2017-07-16T00:25:03.293Z', u'module': u'ns_orchestrator', u'tstamp': 1500189903293, u'type': u'critical'}

      Pls note: subsequent attempts at rebalance pass, so marking it major.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            evgeny.makarenko Evgeny Makarenko (Inactive)
            apiravi Aruna Piravi (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty