Magma creates a large number of files to minimize write amplification and to avoid large compactions. For a very large dataset, say 20TB, we may end up creating 500K - 5M files. (There will be a tunable option to reduce the file count, at the cost of write amplification, by running a file-reduction compaction.)
For a 1% resident dataset, 99% of the data is inactive, so the number of files alone does not add significant overhead.
The godu process spawned by ns_server to estimate disk usage forces all the dentries and file inodes to be resident in memory. Since godu runs every second (or at a similarly frequent interval), the OS is forced to keep this file metadata cached. This puts significant pressure on virtual memory, and the system runs into swap unless a large amount of free memory is reserved for the OS outside the bucket quota. Many of the files have already been deleted, yet the dentry cache is still left holding their inactive entries.
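To illustrate why the scan is expensive, here is a minimal sketch (not godu's actual code) of a du-style walk. Every directory listing and stat call forces the kernel to populate the dentry and inode caches for that entry, so repeating this every second over millions of files pins all of that metadata in memory:

```python
import os

def disk_usage(root: str) -> int:
    """Sum apparent file sizes under root, the way a du-like tool would.

    Each os.walk directory read touches a dentry, and each os.lstat
    touches an inode; on a tree with millions of files this keeps the
    kernel's dentry/inode slab caches fully populated.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except FileNotFoundError:
                # File was deleted between the listing and the stat,
                # which is common while compaction is running.
                pass
    return total
```

Note the FileNotFoundError handling: with Magma constantly creating and deleting files, a periodic scanner must tolerate entries vanishing mid-walk.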
For Magma, we are moving to direct I/O to avoid scalability issues related to the filesystem page cache.
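For reference, a minimal sketch of a direct-I/O write on Linux (an illustration of the mechanism, not Magma's implementation): O_DIRECT bypasses the page cache but requires the buffer, length, and file offset to be block-aligned, which the sketch handles by padding into a page-aligned mmap buffer. The 4096-byte block size is an assumption; the real requirement is the device's logical block size.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; O_DIRECT needs the device's logical block size

def write_direct(path: str, data: bytes) -> None:
    """Write data bypassing the page cache via O_DIRECT (Linux only).

    O_DIRECT requires block-aligned buffers and lengths, so the payload
    is padded up to a multiple of BLOCK inside an anonymous mmap region,
    which is guaranteed page-aligned.
    """
    padded = len(data) + (-len(data)) % BLOCK
    buf = mmap.mmap(-1, padded)  # anonymous mapping, page-aligned
    buf[: len(data)] = data
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
```

Note that not every filesystem supports O_DIRECT (tmpfs, for one, rejects it), and the file on disk ends up padded to the block boundary, so the storage engine must track logical lengths itself.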
The following slabtop output shows the severity of memory consumed by the filesystem metadata:
It consumes 5102368K (dentry) + 8501184K (inode) + 797752K (xfs metadata), roughly 13.7GB in total.
Magma will report, through kv-engine, the accurate aggregate disk usage of its files per vbucket. Can we depend on these stats and avoid polling a large number of file inodes from disk?
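The alternative would look something like the following sketch, which aggregates per-vbucket figures instead of walking the filesystem. The stat shape here (a per-vbucket "disk_size" value in bytes) is a hypothetical placeholder, not a confirmed kv-engine stat name:

```python
def bucket_disk_usage(vbucket_stats: dict) -> int:
    """Aggregate Magma-reported per-vbucket disk usage in bytes.

    vbucket_stats maps vbucket id -> stats dict; "disk_size" is a
    hypothetical key standing in for whatever kv-engine exposes.
    Summing reported values replaces the per-file inode scan entirely.
    """
    return sum(stats["disk_size"] for stats in vbucket_stats.values())

# Example: two vbuckets reporting 4MB and 6MB of on-disk data.
stats = {0: {"disk_size": 4 << 20}, 1: {"disk_size": 6 << 20}}
total = bucket_disk_usage(stats)
```

The trade-off is trust in the engine's own accounting: the scan-based number reflects what is actually on disk (including leaked or orphaned files), while the reported number reflects what the engine believes it owns.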