Details
Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version: 5.0.0
Triage: Untriaged
Is this a Regression?: Unknown
Description
Build: 4.7.0-937
Testcase:
./testrunner -i INI_FILE.ini -p cluster=D:F:F -t fts.stable_topology_fts.StableTopFTS.create_simple_default_index,cluster=D,F,D+F,dgm_run=1,active_resident_ratio=10,eviction_policy=fullEviction,moss_compact_threshold=20,GROUP=DGM
.120 had kv + fts; .224 and .216 had only fts.
The test uses mossStore, with the moss compaction/merge threshold set to 20%. This is consistently reproducible. I'm seeing the same symptoms as MB-20209, except for the compactor/memcached crash. I do see nodes going down one by one; however, I'm not sure whether the network_tick_timeout is caused by fts putting pressure on the system.
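For context, a minimal sketch of how an index definition with those store settings could be created over the cbft REST API on any fts node (port 8094). The store-option key names (kvStoreName, mossStoreOptions, CompactionPercentage), the bucket name, and the credentials are assumptions for illustration, not taken from the test:

import base64
import json
import urllib.request

FTS_NODE = "172.23.105.120:8094"  # any node running the fts service
# Placeholder credentials, for illustration only.
AUTH = "Basic " + base64.b64encode(b"Administrator:password").decode()

# Assumed index definition: mossStore backend with the compaction
# threshold at 20%, mirroring moss_compact_threshold=20 in the testcase.
index_def = {
    "type": "fulltext-index",
    "name": "default_index_1",
    "sourceType": "couchbase",
    "sourceName": "default",
    "params": {
        "store": {
            "kvStoreName": "mossStore",
            "mossStoreOptions": {"CompactionPercentage": 20},
        }
    },
}

req = urllib.request.Request(
    "http://%s/api/index/default_index_1" % FTS_NODE,
    data=json.dumps(index_def).encode(),
    headers={"Content-Type": "application/json", "Authorization": AUTH},
    method="PUT",
)
print(urllib.request.urlopen(req).read().decode())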
testrunner log - https://gist.github.com/anonymous/da47085c908d250237c0c08da5374794
The test fails during index building, when .224 suddenly becomes unreachable:
[2016-08-01 13:24:50,260] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084547
[2016-08-01 13:25:01,377] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084943
[2016-08-01 13:25:12,468] - [fts_base:2503] INFO - Docs in bucket = 3391000, docs in FTS index 'default_index_1': 3084943
[2016-08-01 13:27:18,591] - [rest_client:781] ERROR - socket error while connecting to http://172.23.105.224:8091/nodes/self error timed out
ERROR
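For scale: the index had ingested roughly 91% of the bucket when the count stopped moving (identical at 13:25:01 and 13:25:12), about two minutes before the timeout on .224:

# Progress at the stall, computed from the log lines above.
docs_in_bucket = 3391000
docs_in_index = 3084943
print("indexed: %.1f%%" % (100.0 * docs_in_index / docs_in_bucket))  # 91.0%
print("docs remaining: %d" % (docs_in_bucket - docs_in_index))       # 306057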
Later, on .224, it looks like some operation was being retried repeatedly, leading to an OOM:
Aug 1 23:44:26 localhost kernel: Out of memory: Kill process 7125 (cbft) score 564 or sacrifice child
Aug 1 23:44:26 localhost kernel: Killed process 7125 (cbft) total-vm:3921079564kB, anon-rss:186228kB, file-rss:0kB
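The units in that report are telling: cbft's virtual size was enormous while its resident set was tiny, which points at runaway address-space growth (my guess: repeatedly retried mmap-backed allocations) rather than ordinary heap pressure:

# Converting the kernel's OOM figures (reported in kB) to readable units.
total_vm_kb = 3921079564
anon_rss_kb = 186228
print("total-vm: %.2f TiB" % (total_vm_kb / 1024.0**3))  # ~3.65 TiB virtual
print("anon-rss: %.1f MiB" % (anon_rss_kb / 1024.0))     # ~181.9 MiB resident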
.216 has gone down and is not reachable. CBIT filed. Will get logs once the node is up. Attaching logs from the other two nodes.
Live cluster - http://172.23.105.224:8091/ui/index.html