2b13331d62
Unfortunately, this change is a cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try to describe the problem.

Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode, and this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to is that "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space gets filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will remain. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this situation won't occur. This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyway). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [      16384 bytes]
    - N * sizeof(dnode_t) -------------- [  N *   928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is independent of the size of the files, and depends only on the number of pinned dnodes sharing a physical block (i.e. N). For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block.
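For concreteness, here is a small standalone C sketch (not part of the patch, and not ZFS code) that simply evaluates the arithmetic above. The structure sizes are the ones quoted in this message and will vary between builds and ZFS versions; the macro names are local to this example.

```c
#include <stdio.h>

/*
 * Structure sizes as quoted above; actual sizeof() values depend on
 * the architecture, build options, and ZFS version.
 */
#define	DNODE_OBJ_BLOCK_SIZE	16384	/* dnode object's backing block */
#define	DNODE_T_SIZE		928	/* sizeof(dnode_t) */
#define	ARC_BUF_T_SIZE		72	/* sizeof(arc_buf_t) */
#define	ARC_BUF_HDR_T_SIZE	264	/* sizeof(arc_buf_hdr_t) */
#define	DMU_BUF_IMPL_T_SIZE	280	/* sizeof(dmu_buf_impl_t) */

/*
 * Unevictable "arc_meta_used" bytes pinned by N small files whose
 * dnodes all share one physical dnode object block.
 */
static unsigned long
pinned_meta_bytes(unsigned long n)
{
	return (DNODE_OBJ_BLOCK_SIZE + n * DNODE_T_SIZE +
	    (n + 1) * (ARC_BUF_T_SIZE + ARC_BUF_HDR_T_SIZE +
	    DMU_BUF_IMPL_T_SIZE));
}

int
main(void)
{
	unsigned long n;

	/* Matches the simplified form: 17000 + N * 1544 */
	for (n = 1; n <= 32; n *= 2)
		printf("N = %2lu: %lu bytes pinned\n", n,
		    pinned_meta_bytes(n));
	return (0);
}
```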
Now, given a file size of S, we can determine the total amount of space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio of pinned metadata to total arc usage (meta + data) for different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file):

                                   File Sizes (S)
          |    512   |   1024   |   2048   |   4096   |   8192   |  16384   |
       ---+----------+----------+----------+----------+----------+----------+
        1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
        2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
     N  4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
        8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
       16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
       32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
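As a sanity check on the numbers above, the following standalone C sketch (again, not part of the patch) regenerates the ratio table directly from the two formulas, Pinned = 17000 + N * 1544 and Total = Pinned + S * N.

```c
#include <stdio.h>

/* Pinned "arc_meta_used" bytes for N dnodes sharing one dnode block. */
static double
pinned_meta(double n)
{
	return (17000.0 + n * 1544.0);
}

int
main(void)
{
	unsigned long files[] = { 1, 2, 4, 8, 16, 32 };
	unsigned long sizes[] = { 512, 1024, 2048, 4096, 8192, 16384 };
	unsigned int i, j;

	/* Header row: file sizes (S). */
	printf("      ");
	for (j = 0; j < 6; j++)
		printf(" %8lu |", sizes[j]);
	printf("\n");

	/* One row per N, each cell is pinned metadata / total arc usage. */
	for (i = 0; i < 6; i++) {
		double meta = pinned_meta((double)files[i]);

		printf("%4lu |", files[i]);
		for (j = 0; j < 6; j++) {
			double data = (double)sizes[j] * (double)files[i];
			printf(" %.6f |", meta / (meta + data));
		}
		printf("\n");
	}
	return (0);
}
```

Running this reproduces the ratios above and makes it easy to re-evaluate the 75% default if the structure sizes change; workloads that still need a different ceiling can presumably be tuned via the existing "zfs_arc_meta_limit" module option.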
Top-level repository contents:

    cmd
    config
    dracut
    etc
    include
    lib
    man
    module
    patches
    rpm
    scripts
    udev
    .gitignore
    .gitmodules
    AUTHORS
    COPYRIGHT
    DISCLAIMER
    META
    Makefile.am
    OPENSOLARIS.LICENSE
    README.markdown
    autogen.sh
    configure.ac
    copy-builtin
    zfs-script-config.sh.in
    zfs.release.in
README.markdown
Native ZFS for Linux!
ZFS is an advanced file system and volume manager which was originally developed for Solaris and is now maintained by the Illumos community.
ZFS on Linux, which is also known as ZoL, is currently feature complete. It includes fully functional and stable SPA, DMU, ZVOL, and ZPL layers.
Full documentation for installing ZoL on your favorite Linux distribution can be found at: http://zfsonlinux.org