Couchbase Server: MB-20397

[FTS] cbft gets killed by OOM killer (kvstore=moss, moss_merge_threshold=20%)

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 5.0.0
    • 5.0.0
    • cbft

    Description

      Build
      4.7.0-937

      Testcase
      ./testrunner -i INI_FILE.ini -p cluster=D:F:F -t fts.stable_topology_fts.StableTopFTS.create_simple_default_index,cluster=D,F,D+F,dgm_run=1,active_resident_ratio=10,eviction_policy=fullEviction,moss_compact_threshold=20,GROUP=DGM

      Node .120 ran kv + fts; nodes .224 and .216 ran fts only.

      The test uses mossStore, with the moss compaction/merge threshold set to 20%. The failure is consistently reproducible. The symptoms match MB-20209, except that there is no compactor/memcached crash. I do see nodes going down one by one, but I'm not sure whether the network_tick_timeout is caused by fts putting pressure on the system.

      testrunner log - https://gist.github.com/anonymous/da47085c908d250237c0c08da5374794

      The test fails during index building, when .224 suddenly becomes unreachable.

      [2016-08-01 13:24:50,260] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084547
      [2016-08-01 13:25:01,377] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084943
      [2016-08-01 13:25:12,468] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084943
      [2016-08-01 13:27:18,591] - [rest_client:781] ERROR - socket error while connecting to http://172.23.105.224:8091/nodes/self error timed out 
      ERROR
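
      The stall visible in the progress lines above (the indexed count stops at 3084943 while the bucket holds 3391000 docs) can be checked mechanically. The following is only a sketch: the regular expression is written against the three log lines quoted here, and the function names `indexing_progress` and `stalled` are hypothetical helpers, not part of testrunner.

```python
import re

# Matches the testrunner progress lines quoted in this report.
# The pattern is an assumption derived from those three lines only.
PROGRESS = re.compile(
    r"\[(?P<ts>[\d\-: ,]+)\] - \[fts_base:\d+\] INFO - "
    r"Docs in bucket = (?P<bucket>\d+), "
    r"docs in FTS index '(?P<index>[^']+)': (?P<indexed>\d+)"
)


def indexing_progress(lines):
    """Return (timestamp, docs_in_bucket, docs_in_index) for each progress line."""
    samples = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            samples.append(
                (match["ts"], int(match["bucket"]), int(match["indexed"]))
            )
    return samples


def stalled(samples):
    """True if the last two samples show no progress while docs are still missing."""
    if len(samples) < 2:
        return False
    previous_indexed = samples[-2][2]
    _, bucket, last_indexed = samples[-1]
    return last_indexed == previous_indexed and last_indexed < bucket


LOG = [
    "[2016-08-01 13:24:50,260] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084547",
    "[2016-08-01 13:25:01,377] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084943",
    "[2016-08-01 13:25:12,468] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084943",
]

samples = indexing_progress(LOG)
remaining = samples[-1][1] - samples[-1][2]  # docs not yet indexed
print(stalled(samples), remaining)
```

      Comparing only the last two samples is a deliberately crude check; a longer window would be needed to tell a real stall from a slow batch.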
      

      Later, on .224, it looks like some operation was being retried, which led to an OOM kill:

      Aug  1 23:44:26 localhost kernel: Out of memory: Kill process 7125 (cbft) score 564 or sacrifice child
      Aug  1 23:44:26 localhost kernel: Killed process 7125 (cbft) total-vm:3921079564kB, anon-rss:186228kB, file-rss:0kB
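
      To make the sizes in the kernel message easier to read, the "Killed process" line can be parsed and converted from kB. This is a minimal sketch: the regular expression is written against the single line quoted above, and `parse_oom_kill` and `kb_to_gib` are hypothetical helper names.

```python
import re

# Matches the "Killed process" line quoted in this report.
# The field layout is assumed from that one line.
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return the pid, process name and memory figures (in kB), or None."""
    match = KILLED.search(line)
    if match is None:
        return None
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_kb": int(match["total_vm"]),
        "anon_rss_kb": int(match["anon_rss"]),
        "file_rss_kb": int(match["file_rss"]),
    }


def kb_to_gib(kb):
    """Convert kB (as printed by the kernel, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


line = (
    "Aug  1 23:44:26 localhost kernel: Killed process 7125 (cbft) "
    "total-vm:3921079564kB, anon-rss:186228kB, file-rss:0kB"
)
info = parse_oom_kill(line)
print(info["name"], info["pid"])
print("total-vm GiB:", round(kb_to_gib(info["total_vm_kb"]), 1))
print("anon-rss GiB:", round(kb_to_gib(info["anon_rss_kb"]), 2))
```

      Read this way, the reported virtual size is in the TiB range while the resident anonymous memory at kill time is under 1 GiB. The sketch only restates the logged numbers and does not explain the gap.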
      

      .216 has also gone down and is not reachable. A CBIT has been filed, and I will collect its logs once the node is back up. Attaching logs from the other two nodes.

      Live cluster - http://172.23.105.224:8091/ui/index.html

          People

            apiravi Aruna Piravi (Inactive)
            apiravi Aruna Piravi (Inactive)