Details
- Improvement
- Resolution: Fixed
- Major
- None
- Magma: Jan 20 - Feb 2
Description
Magma creates a large number of files to minimize write amplification and to avoid large compactions. For a very large dataset such as 20TB, we may end up creating 500k - 5M files (a tunable option is planned that reduces the number of files, at the cost of extra write amplification, by running a file-reduction compaction).
For a 1% resident dataset, 99% of the data is inactive and the number of files doesn't add significant overhead.
The godu process spawned by ns_server to estimate disk usage forces all the dentries and in-core file inodes to be present in memory. Since godu runs every second (or at a similarly frequent interval), the OS is forced to keep the file metadata in memory. This puts significant pressure on virtual memory, and the system runs into swap unless a large amount of free memory is reserved for the OS outside of the bucket quota. Many of the files have already been deleted, so the dentry cache is also left holding entries for inactive files.
For magma, we are moving to direct I/O and trying to avoid scalability issues related to the filesystem / page cache.
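To make the cost concrete, here is a minimal sketch of a du-style scan in Python. It is not godu's actual implementation, which this ticket does not show; `disk_usage` is a hypothetical helper, and it sums `st_size` rather than allocated blocks to keep the example simple. The point is only that any such scan issues one stat call per file, so the kernel has to bring each file's dentry and inode into its caches.

```python
import os


def disk_usage(root):
    """Return (total_bytes, files_statted) for the tree under root.

    Illustrative only: one lstat() per file, which is what makes the
    kernel load (and then cache) a dentry and an inode for each one.
    """
    total = 0
    statted = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except FileNotFoundError:
                # The file was deleted between the directory listing and the stat.
                continue
            total += st.st_size
            statted += 1
    return total, statted
```

With 500k - 5M files and a scan every second or so, the second element of the returned tuple is the number of inodes the kernel is asked about on every pass.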
The following slabtop output shows how much memory the filesystem metadata consumes. The dentry, xfs_inode and xfs_ili caches alone account for 5102368K + 8501184K + 797752K = 14401304K (about 13.7 GiB):
Active / Total Objects (% used)    : 41794189 / 60179910 (69.4%)
Active / Total Slabs (% used)      : 1378377 / 1378377 (100.0%)
Active / Total Caches (% used)     : 68 / 100 (68.0%)
Active / Total Size (% used)       : 10199466.90K / 13702587.58K (74.4%)
Minimum / Average / Maximum Object : 0.01K / 0.23K / 16.69K

    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
26787432  9544323  35%    0.19K 637796       42   5102368K dentry
13761856 13759786  99%    0.06K 215029       64    860116K kmalloc-64
 5630655  5630276  99%    0.08K 110405       51    441620K Acpi-State
 5610222  5609944  99%    1.06K 265662       30   8501184K xfs_inode
 5285107  5284903  99%    0.15K  99719       53    797752K xfs_ili
  374912   289450  77%    0.03K   2929      128     11716K kmalloc-32
  373248   211264  56%    0.02K   1458      256      5832K kmalloc-16
  318045   106955  33%    0.10K   8155       39     32620K buffer_head
  283458   145845  51%    0.38K   6749       42    107984K mnt_cache
  266000    39069  14%    0.50K   4165       64    133280K kmalloc-512
  250560    77845  31%    0.25K   3915       64     62640K kmalloc-256
  205800   205577  99%    0.07K   3675       56     14700K Acpi-Operand
  145887   142911  97%    0.57K   3656       56    116992K radix_tree_node
  145350   145350 100%    0.02K    855      170      3420K scsi_data_buffer
  114176   114176 100%    0.01K    223      512       892K kmalloc-8
  110292    32043  29%    0.09K   2626       42     10504K kmalloc-96
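As a cross-check on the figures quoted from this output, the sketch below parses three of the cache rows and adds up their CACHE SIZE column. The rows are copied from the output above; `cache_sizes_kib` is a hypothetical helper written for this illustration and assumes the eight-field row layout seen here.

```python
# Three rows copied from the slabtop output in this ticket.
SLABTOP_ROWS = """\
26787432 9544323 35% 0.19K 637796 42 5102368K dentry
5610222 5609944 99% 1.06K 265662 30 8501184K xfs_inode
5285107 5284903 99% 0.15K 99719 53 797752K xfs_ili
"""


def cache_sizes_kib(text):
    """Map slab cache name -> CACHE SIZE in KiB for each eight-field data row."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a data row.
        if len(fields) != 8 or not fields[6].endswith("K"):
            continue
        sizes[fields[7]] = int(fields[6][:-1])
    return sizes


sizes = cache_sizes_kib(SLABTOP_ROWS)
fs_meta_kib = sum(sizes[name] for name in ("dentry", "xfs_inode", "xfs_ili"))
print(fs_meta_kib)                       # 14401304
print(round(fs_meta_kib / 1024**2, 1))   # 13.7 (GiB)
```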
Magma will report accurate per-vbucket disk usage, aggregated over all the files it owns, through kv-engine. Can we depend on these stats and avoid polling a large number of file inodes from disk?
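A rough sketch of what the stats-based alternative could look like on the consumer side follows. The shape of the data (a mapping from vbucket id to bytes on disk) and the name `bucket_disk_usage` are assumptions made for illustration; the ticket does not specify how kv-engine exposes the numbers. The work is proportional to the number of vbuckets rather than the number of files, and it touches no inodes.

```python
def bucket_disk_usage(per_vbucket_bytes):
    """Sum engine-reported disk usage across vbuckets; no filesystem access."""
    return sum(per_vbucket_bytes.values())


# Hypothetical sample: 1024 vbuckets, each reporting 20 GB on disk.
reported = {vb: 20 * 10**9 for vb in range(1024)}
print(bucket_disk_usage(reported))  # 20480000000000
```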
Issue Links
- relates to MB-38012: 1% DGM Test: Write ops/s dropped from 230k/s to 400/s due to swapping (Closed)