Archive-Team/zfs

Commit Graph

Author	SHA1	Message	Date
Prakash Surya	77765b540b	Remove "arc_meta_used" from arc_adjust calculation Using "arc_meta_used" to determine if the arc's mru list is over its target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	94520ca462	Prune metadata from ghost lists in arc_adjust_meta To maintain a strict limit on the metadata contained in the arc, while preventing the arc buffer headers from completely consuming the "arc_meta_used" space, we need to evict metadata buffers from the arc's ghost lists along with the regular lists. This change modifies arc_adjust_meta such that it more closely models the adjustments made in arc_adjust. "arc_meta_used" is used similarly to "arc_size", and "arc_meta_limit" is used similarly to "arc_c". Testing metadata intensive workloads (e.g. creating, copying, and removing millions of small files and/or directories) has shown this change to make a dramatic improvement to the hit rate maintained in the arc. While I think there is still room for improvement, this is a big step in the right direction. In addition, zpl_free_cached_objects was made into a no-op as I'm not yet sure how to properly implement that function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	1e3cb67b53	Revert "Return -1 from arc_shrinker_func()" This reverts commit `c11a12bc3b`. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	624227854e	Disable arc_p adapt dampener by default It's unclear why adjustments to arc_p need to be dampened as they are in arc_adjust. With that said, its removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm disabling it by default until its usefulness is better understood. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	f521ce1b9c	Allow "arc_p" to drop to zero or grow to "arc_c" Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:27 -08:00
Prakash Surya	89c8cac493	Disable aggressive arc_p growth by default For specific workloads consisting mainly of mfu data and new anon data buffers, the aggressive growth of arc_p found in the arc_get_data_buf() function can have detrimental effects on the mfu list size and ghost list hit rate. Running a workload consisting of two processes: * Process 1 is creating many small files * Process 2 is tar'ing a directory consisting of many small files I've seen arc_p and the mru grow to their maximum size, while the mru ghost list receives 100K times fewer hits than the mfu ghost list. Ideally, as the mfu ghost list receives hits, arc_p should be driven down and the size of the mfu should increase. Given the specific workload I was testing with, the mfu list size should grow to a point where almost no mfu ghost list hits would occur. Unfortunately, this does not happen because the newly dirtied anon buffers constantly drive arc_p to its maximum value and keep it there (effectively prioritizing the mru list and starving the mfu list down to a negligible size). The logic to increment arc_p from within the arc_get_data_buf() function was introduced many years ago in this upstream commit: commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc Author: maybee <none@none> Date: Wed Dec 20 15:46:12 2006 -0800 6505658 target MRU size (arc.p) needs to be adjusted more aggressively and since I don't fully understand the motivation for the change, I am reluctant to completely remove it. As a way to test out how its removal might affect performance, I've disabled that code by default, but left it tunable via a module option. Thus, if its removal is found to be grossly detrimental for certain workloads, it can be re-enabled on the fly, without a code change. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 14:53:28 -08:00
Prakash Surya	39e055c44b	Adjust arc_p based on "bytes" in arc_shrink In an attempt to prevent arc_c from collapsing "too fast", the arc_shrink() function was updated to take a "bytes" parameter by this change: commit `302f753f16` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Tue Mar 13 14:29:16 2012 -0700 Integrate ARC more tightly with Linux Unfortunately, that change failed to make a similar change to the way that arc_p was updated. So, there still exists the possibility for arc_p to collapse to near 0 when the kernel starts calling the arc's shrinkers. This change attempts to fix this, by decrementing arc_p by the "bytes" parameter in the same way that arc_c is updated. In addition, an attempt is made to maintain a minimum value of arc_p, similar to the way a minimum arc_p value is maintained in arc_adapt(). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 14:53:08 -08:00
Brian Behlendorf	9141582592	Set zfs_arc_min to 4MB Decrease the minimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB. 1) Large systems with over 1TB of memory are being deployed and reserving 1/32 of this memory (32GB) as the minimum requirement is overkill. 2) Tiny systems like the raspberry pi may only have 256MB of memory in which case 64MB is far too large. The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure you may still set a larger value with zfs_arc_min. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Issue #2110	2014-02-21 14:52:02 -08:00
Richard Yao	4f2dcb3eee	Add erratum for issue #2094 ZoL commit `1421c89` unintentionally changed the disk format in a forward- compatible, but not backward compatible way. This was accomplished by adding an entry to zbookmark_t, which is included in a couple of on-disk structures. That led to the creation of pools with incorrect dsl_scan_phys_t objects that could only be imported by versions of ZoL containing that commit. Such pools cannot be imported by other versions of ZFS or past versions of ZoL. The additional field has been removed by the previous commit. However, affected pools must be imported and scrubbed using a version of ZoL with this commit applied. This will return the pools to a state in which they may be imported by other implementations. The 'zpool import' or 'zpool status' command can be used to determine if a pool is impacted. A message similar to one of the following means your pool must be scrubbed to restore compatibility. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #1 detected. action: The pool can be imported using its name or numeric identifier, however there is a compatibility issue which should be corrected by running 'zpool scrub' see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... $ zpool status pool: zol-0.6.2-173 state: ONLINE scan: pool compatibility issue detected. see: https://github.com/zfsonlinux/zfs/issues/2094 action: To correct the issue run 'zpool scrub'. config: ... If there was an async destroy in progress 'zpool import' will prevent the pool from being imported. Further advice on how to proceed will be provided by the error message as follows. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #2 detected. action: The pool can not be imported with this version of ZFS due to an active asynchronous destroy. Revert to an earlier version and allow the destroy to complete before updating. see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... 
Pools affected by the damaged dsl_scan_phys_t can be detected prior to an upgrade by running the following command as root: zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w Note that `poolname` must be replaced with the name of the pool you wish to check. A value of 25 indicates the dsl_scan_phys_t has been damaged. A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0 indicates that there has never been a scrub run on the pool. The regression caused by the change to zbookmark_t never made it into a tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL stable repositories. Only those using the HEAD version directly from Github after the 0.6.2 but before the 0.6.3 tag are affected. This patch does have one limitation that should be mentioned. It will not detect errata #2 on a pool unless errata #1 is also present. It is expected this will not be a significant problem because pools impacted by errata #2 have a high probability of being impacted by errata #1. End users can ensure they do not hit this unlikely case by waiting for all asynchronous destroy operations to complete before updating ZoL. The presence of any background destroys on any imported pools can be checked by running `zpool get freeing` as root. This will display a non-zero value for any pool with an active asynchronous destroy. Lastly, it is expected that no user data has been lost as a result of this erratum. Original-patch-by: Tim Chase <tim@chase2k.com> Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2094	2014-02-21 12:10:40 -08:00
Brian Behlendorf	ffe9d38275	Add generic errata infrastructure From time to time it may be necessary to inform the pool administrator about an errata which impacts their pool. These errata will be shown to the administrator through the 'zpool status' and 'zpool import' output as appropriate. The errata must clearly describe the issue detected, how the pool is impacted, and what action should be taken to resolve the situation. Additional information for each errata will be provided at http://zfsonlinux.org/msg/ZFS-8000-ER. To accomplish the above this patch adds the required infrastructure to allow the kernel modules to notify the utilities that an errata has been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t which has been added to the pool configuration nvlist. To add a new errata the following changes must be made: * A new errata identifier must be assigned by adding a new enum value to the zpool_errata_t type. New enums must be added to the end to preserve the existing ordering. * Code must be added to detect the issue. This does not strictly need to be done at pool import time but doing so will make the errata visible in 'zpool import' as well as 'zpool status'. Once detected the spa->spa_errata member should be set to the new enum. * If possible code should be added to clear the spa->spa_errata member once the errata has been resolved. * The show_import() and status_callback() functions must be updated to include an informational message describing the errata. This should include an action message describing what an administrator should do to address the errata. * The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be updated to describe the errata. This space can be used to provide as much additional information as needed to fully describe the errata. A link to this documentation will be automatically generated in the output of 'zpool import' and 'zpool status'. 
Original-idea-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.or Issue #2094	2014-02-21 12:10:40 -08:00
Richard Yao	ed9e8368d3	Revert changes to zbookmark_t Commit `1421c89142` added a field to zbookmark_t that unintentionally caused a disk format change. This negatively affected backward compatibility and platform portability. Therefore, this field is being removed. The function that field permitted is left unimplemented until a later patch that will reimplement the field in a way that does not affect the disk format. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2094	2014-02-21 12:10:39 -08:00
Tim Chase	98fad86293	Propagate errors when registering "relatime" property callback. Various errors can occur when registering property callbacks. As the author's comments indicate, the code is very paranoid about preserving the first-seen error when registering callbacks. This patch causes an error seen while registering the "relatime" callback to not clobber a previously-seen error. Reported-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2117	2014-02-12 09:38:28 -08:00
Brian Behlendorf	c5cb66addc	Fix corrupted l2_asize in arcstats Commit `e0b0ca9` accidentally corrupted the l2_asize displayed in arcstats. This was caused by changing the l2arc_buf_hdr.b_asize member from an int to uint32_t type. There are places in the code where this field is cast to a uint64_t resulting in the b_hits member being treated as part of b_asize. To resolve the issue the type has been changed to a uint64_t, and the b_hits member is placed after the enum to prevent the size of the structure from increasing. This is a good example of exactly why it's a bad idea to use ambiguous types (int) in these structures. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1990	2014-02-05 12:24:53 -08:00
Matthew Ahrens	2e7b7657cd	4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4188 illumos/illumos-gate@bb411a08b0 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2091	2014-01-31 10:49:34 -08:00
Matthew Ahrens	8b4646494c	Illumos 4504 traverse_visitbp: visit group before user 4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> References: https://illumos.org/issues/4504 http://code.delphix.com/illumos-4504 http://svnweb.freebsd.org/base?view=revision&revision=260812 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2079	2014-01-29 15:50:49 -08:00
Tim Chase	6d111134c0	Implement relatime. Add the "relatime" property. When set to "on", a file's atime will only be updated if the existing atime is at least a day old or if the existing ctime or mtime has been updated since the last access. This behavior is compatible with the Linux "relatime" mount option. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2064 Closes #1917	2014-01-29 15:50:44 -08:00
Cyril Plisko	01b738f457	Call gethrtime() only once per new txg creation When transitioning current open TXG into QUIESCE state and opening a new one txg_quiesce() calls gethrtime(): - to mark the birth time of the new TXG - to record the SPA txg history kstat - implicitly inside spa_txg_history_add() These timestamps are practically the same, so that the first one can be used instead of the other two. The only visible difference is that inside spa_txg_history_add() the time spent in kmem_zalloc() will be counted towards the opened TXG. Since at this point the new TXG already exists (tx->tx_open_txg has been already incremented) it is actually a correct accounting. In any case this extra work is only happening when spa_txg_history kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect the normal processing in any way. Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com> Issue #2075	2014-01-23 13:31:51 -08:00
Igor Lvovsky	478d64fdae	Add additional state TXG_STATE_WAIT_FOR_SYNC for txg. In several cases when digging into kstats we can find two txgs in SYNC state, e.g. txg birth state nreserved nread nwritten ... 985452 258127184872561 C 0 373948416 2376272384 ... 985453 258129016180616 C 0 378173440 28793344 ... 985454 258129016271523 S 0 0 0 ... 985455 258130864245986 S 0 0 0 ... 985456 258130867458851 O 0 0 0 ... However, only the first txg (985454) is really syncing at this moment. The other one (985455) marked as SYNCED is actually in a post-QUIESCED state and waiting to start sync. So, the new TXG_STATE_WAIT_FOR_SYNC state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to reveal this situation. txg birth state nreserved nread nwritten ... 1086896 235261068743969 C 0 163577856 8437248 ... 1086897 235262870830801 C 0 280625152 822594048 ... 1086898 235264172219064 S 0 0 0 ... 1086899 235264936134407 W 0 0 0 ... 1086900 235264936296156 O 0 0 0 ... Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2075	2014-01-23 13:31:51 -08:00
Shen Yan	93292b3081	Use enum type(zfetch_dirn_t) instead Fix code with zfetch_dirn_t, which is more readable and clear. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2068	2014-01-23 12:56:33 -08:00
Tim Chase	4461aa6118	Allow chown/chgrp when no ACL SAs exist. From the comment in the commit: Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES SA in which case ENOENT is returned from zfs_acl_node_read() when the SA can't be located. Allow chown/chgrp to succeed in these cases rather than returning an error that makes no sense in the context of the caller. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfs-osx/zfs#86 Closes #1911 Closes #2029	2014-01-23 11:07:29 -08:00
Ned Bass	04aa2de8f7	vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE) The vdev_file_io_start() function may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1928	2014-01-23 09:58:07 -08:00
Brian Behlendorf	35d3e32274	Use long holds in zvol_set_volsize() Under Linux the zvol_set_volsize() function was originally written to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the dmu_objset_own()/dmu_objset_disown() interfaces were added but the ZVOL code wasn't updated to take advantage of them. This was never an issue but after the dsl_pool_config changes the code now takes the config lock twice. The cleanest solution is to shift to using dmu_objset_own() which takes a long hold on the dataset and does not hold the dsl pool lock. This patch also slightly restructures the existing code such that it more closely resembles the upstream Illumos code. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2039	2014-01-14 14:46:12 -08:00
Brian Behlendorf	fd23720ae1	Drain iput taskq outside z_teardown_lock It's unsafe to drain the iput taskq while holding the z_teardown_lock as a writer. This is because when the last reference on an inode is dropped it may still have pages which need to be written to disk. This will be done through zpl_writepages which will acquire the z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're holding the lock as a writer in zfs_sb_teardown the unmount will deadlock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #1988	2014-01-09 15:54:08 -08:00
Brian Behlendorf	4fcc43790c	Force LZ4_FORCE_SW_BITCOUNT for Sparc This change was proposed for Sparc but it's not clear to me why it's required. Proper support exists in the lz4 code to detect the endianness and the required builtins are available for gcc. Still I'm including the patch because it will only impact Sparc and it may resolve a case which hasn't occurred to me. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:54:03 -08:00
Brian Behlendorf	b585bc4afa	Fix zfs_getattr_fast types On Sparc sp->blksize will be a 64-bit value which is then cast incorrectly to a 32-bit value. For big endian systems this results in an incorrect value for sp->blksize. To resolve the problem local variables of the correct size are used and then assigned to sp->blksize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:50:23 -08:00
Brian Behlendorf	aa0218d6a1	Fix nvlist 'Bus Error' for Sparc The mis-aligned memory accesses in nvpair_native_embedded() and nvpair_native_embedded_array() will cause a 'Bus Error' for architectures such as Sparc which are not fully byte addressable. To avoid this issue care is taken to avoid dereferencing the potentially mis-aligned packed nvlist_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:50:15 -08:00
Brian Behlendorf	7f89ae6ba0	Use local variable to read zp->z_mode When accessing the zp->z_mode through the SA bulk interface we expect that 64-bits are available to hold the result. However, on 32-bit platforms mode_t will only be 32-bits so we cannot pass it to SA_ADD_BULK_ATTR(). Instead a local uint64_t variable must be used and the result assigned to zp->z_mode. This went unnoticed on 32-bit little endian platforms because the bytes happen to end up in the correct 32-bits. But on big endian platforms like Sparc the zp->z_mode will always end up set to zero. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:50:11 -08:00
John Layman	ecf3d9b8e6	Add ddt, ddt_entry, and l2arc_hdr caches Back the allocations for ddt tables+entries and l2arc headers with kmem caches. This will reduce the cost of allocating these commonly used structures and allow for greater visibility of them through the /proc/spl/kmem/slab interface. Signed-off-by: John Layman <jlayman@sagecloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1893	2014-01-07 10:33:11 -08:00
Tim Chase	fb8e608d9d	Fix the creation of ZPOOL_HIST_CMD pool history entries. Move the libzfs_fini() after the zpool_log_history() call so the ZPOOL_HIST_CMD entry can get written. Fix the handling of saved_poolname in zfsdev_ioctl() which was broken as part of the stack-reduction work in `a168788053`. Since ZoL destroys the TSD data in which the previously successful ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY ioctl has a very important restriction: it can only successfully write a log entry following a successful ioctl() if no intervening vops have been performed. Some zfs subcommands do perform intervening vops and must do the logging themselves. At the moment, the "create" and "clone" subcommands have been modified appropriately. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1998	2014-01-07 09:00:26 -08:00
Tim Chase	5d862cb0d9	Properly handle updates of variably-sized SA entries. During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths. Previously, a size-changing update of a variably-sized SA that occurred when there were other variably-sized SAs in the bonus buffer would cause the subsequent SAs to be corrupted. The most common case in which this would occur is when a mode change caused the ZPL_DACL_ACES entry to change size when a ZPL_DXATTR (SA xattr) entry already existed. The following sequence would have caused a failure when xattr=sa was in force and would corrupt the bonus buffer: open(filename, O_WRONLY | O_CREAT, 0600); ... lsetxattr(filename, ...); /* create xattr SA */ chmod(filename, 0650); /* enlarges the ACL */ Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1978	2013-12-20 13:52:33 -08:00
Brian Behlendorf	ac0340970c	Register correct handlers for nvlist_{dup,pack,unpack} This change is related to commit `81eaf15` which ensured the correct allocation handlers were installed for nvlist_alloc(). The nvlist functions nvlist_dup(), nvlist_pack(), and nvlist_unpack() suffer from the same issue and have been updated accordingly. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1937	2013-12-20 13:52:28 -08:00
Matthew Thode	11b9ec23b9	Add full SELinux support Four new dataset properties have been added to support SELinux. They are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map directly to the context options described in mount(8). When one of these properties is set to something other than 'none', that string will be passed verbatim as a mount option for the given context when the filesystem is mounted. For example, if you wanted the rootcontext for a filesystem to be set to 'system_u:object_r:fs_t' you would set the property as follows: $ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media This will ensure the filesystem is automatically mounted with that rootcontext. It is equivalent to manually specifying the rootcontext with the -o option like this: $ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media By default all four contexts are set to 'none'. Further information on SELinux contexts is detailed in the mount(8) and selinux(8) man pages. Signed-off-by: Matthew Thode <prometheanfire@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #1504	2013-12-19 10:37:31 -08:00
Michael Kjorling	d1d7e2689d	cstyle: Resolve C style issues The vast majority of these changes are in Linux specific code. They are the result of not having an automated style checker to validate the code when it was originally written. Others were caused when the common code was slightly adjusted for Linux. This patch contains no functional changes. It only refreshes the code to conform to the style guide. Everyone submitting patches for inclusion upstream should now run 'make checkstyle' and resolve any warnings prior to opening a pull request. The automated builders have been updated to fail a build when 'make checkstyle' detects an issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1821	2013-12-18 16:46:35 -08:00
Turbo Fredriksson	fd8febbd1e	Add zfs_send_corrupt_data module option Tuning setting to ignore read/checksum errors when sending data. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1982 Issue #1897	2013-12-18 16:46:35 -08:00
Chunwei Chen	7dc71949f2	Fix z_sync_cnt decrement in zfs_close The comment in zfs_close states that "Under Linux the zfs_close() hook is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close is associated with every successful struct file creation/deletion, which should always be balanced. Here is an example of what's wrong: Process A B open(O_SYNC) z_sync_cnt = 1 open(O_SYNC) z_sync_cnt = 2 close() z_sync_cnt = 0 So z_sync_cnt is 0 even if B still has the file with O_SYNC. Also moves the generic_file_open call before zfs_open to ensure that in the case generic_file_open fails z_sync_cnt is not incremented. This is safe because generic_file_open has no side effects. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1962	2013-12-17 10:28:27 -08:00
Brian Behlendorf	ce37ebd2eb	cstyle: zvol.c Update zvol.c to conform to the style guidelines, verified by running cstyle.pl on the source file. This patch contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #1821	2013-12-16 09:41:45 -08:00
Brian Behlendorf	2e0358cbca	Sync /dev/zfs ioctl ordering In order to minimize any future disruption caused by the addition and removal of /dev/zfs ioctls this patch makes the following changes. 1) Sync ZoL's ioctl ordering such that it matches Illumos. For historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID ioctls were out of order. 2) Move Linux and FreeBSD specific ioctls into their own reserved ranges. This allows us to preserve the existing ordering when new ioctls are added by either Illumos or FreeBSD. When an ioctl is no longer needed it should be retired in place. This change alters the ZFS user/kernel ABI so make sure you rebuild both your user and kernel modules. However, it should allow for a much more stable interface going forward. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1973	2013-12-16 09:41:39 -08:00
Brian Behlendorf	ba6a24026c	Remove ZFS_IOC_*_MINOR ioctl()s Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed leaving everything up to the kernel. This significantly simplified the code. However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing the ZFS_IOC_*_MINOR ioctl()s and moving all the functionality down into the kernel. Since this cleanup will change the kernel/user ABI it's being done in the same tag as the previous libzfs_core ABI changes. This will minimize, but not eliminate, the disruption to end users. Once merged ZoL, Illumos, and FreeBSD will basically be back in sync in regards to handling ZVOLs in the common code. While each platform must have its own custom zvol.c implementation the interfaces provided are consistent. NOTES: 1) This patch introduces one subtle change in behavior which could not be easily avoided. Prior to this change callers of 'zfs create -V ...' were guaranteed that upon exit the /dev/zvol/ block device link would be created or an error returned. That's no longer the case. The utilities will no longer block waiting for the symlink to be created. Callers are now responsible for blocking, this is why a 'udev_wait' call was added to the 'label' function in scripts/common.sh. 2) The read-only behavior of a ZVOL now solely depends on if the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant policy setting in the gendisk structure was removed. This both simplifies the code and allows us to safely leverage set_disk_ro() to issue a KOBJ_CHANGE uevent. See the comment in the code for further details on this. 3) Because __zvol_create_minor() and zvol_alloc() may now be called in a sync task they must use KM_PUSHPAGE. 
References: illumos/illumos-gate@681d9761e8 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #1969	2013-12-16 09:15:57 -08:00
George Wilson	dda12da9f1	Illumos #4121 vdev_label_init read only 4121 vdev_label_init should treat request as succeeded when pool is read only Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/4121 illumos/illumos-gate@973c78e94b Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1863	2013-12-12 10:24:01 -08:00
Tim Chase	84b0aac5fd	Fix atime handling. Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP() followed by zfs_inode_update() to update the atime. However, since atimes are cached in the znode for delayed writing, the zfs_inode_update() function would effectively ignore the cached atime by reading it from the SA. This commit moves the updating of the atime in the inode into zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP() macro and eliminates the call to zfs_inode_update() in the atime-modifying vnops. It's possible the same thing could have been done directly in zfs_inode_update() but I wasn't sure that it was safe in all cases where it is called. The effect is that atime handling is as if "strictatime" were selected; even if the filesystem is mounted with "relatime". Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1949	2013-12-12 10:23:58 -08:00
david.chen	be5db977ea	Remove MAX when initializing arc_c_max The MAX when initializing arc_c_max doesn't make any sense because it hasn't been set anywhere before. Though, arc_c_max should be implicitly set to zero when initializing arc_stats, so the MAX doesn't make any difference. The MAX was mistakenly left in place when the Illumos default values were changed for Linux. Signed-off-by: david.chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1941	2013-12-10 10:05:40 -08:00
Ned Bass	b6e335bfc4	Revert "Use directory xattrs for symlinks" This reverts commit `6a7c0ccca4`. A proper fix for Issue #1648 was landed under Issue #1890, so this is no longer needed. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1648	2013-12-10 09:48:30 -08:00
James Pan	472e7c6085	sa_find_sizes() may compute wrong SA header size Under the right conditions sa_find_sizes() will compute an incorrect size of the system attribute (SA) header. This causes a failed assertion when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead to corruption of SA data. The bug presents itself when there are more than two variable-length SAs of just the right size to fit in the bonus buffer of a dnode. The existing logic fails to account for the SA header space needed to store the sizes of all the variable-length SAs. A reproducer was possible on Linux by setting the xattr=sa dataset property and storing xattrs on symbolic links (Issue #1648). Note the corrupt link target name: $ zfs set xattr=sa tank/fish $ cd /tank/fish $ ln -fs 12345678901234567 link $ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link $ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link $ ls -l link lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567 Commit `6a7c0ccca4` worked around this bug by forcing xattr's on symlinks to be stored in directory format. This change implements a proper fix, so the workaround can now be reverted. The reference link below contains a reproducer for FreeBSD. References: http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1890	2013-12-10 09:48:15 -08:00
Brian Behlendorf	90ee9ed32f	Fix 'zfs diff' shares error When creating a dataset with ZoL a zsb->z_shares_dir ZAP object will not be created because shares are unimplemented. Instead ZoL just sets zsb->z_shares_dir to zero to indicate there are no shares. However, if you import a pool which was created with a different ZFS implementation then the shares ZAP object may exist. Code was added to handle this case but it clearly wasn't sufficiently tested with other ZFS pools. There was a bug in the zpl_shares_getattr() function which passed the wrong inode to zfs_getattr_fast() for the case where a shares ZAP object does exist. This causes an EIO to be returned to stat64() which in turn causes 'zfs diff' to fail. The fix is to pass the correct inode after a successful zfs_zget(). Additionally, only put away the references if we were able to get one. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Graham Booker <https://github.com/gbooker> Signed-off-by: timemaster67 <https://github.com/timemaster67> Closes #1426 Closes #481	2013-12-06 09:42:39 -08:00
Brian Behlendorf	99e349db92	Add module versioning Use the standard Linux MODULE_VERSION macro to expose the installed zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions. This will also automatically add a checksum of the .c files and headers in "srcversion". See: /sys/module/zavl/version /sys/module/zavl/srcversion /sys/module/znvpair/version /sys/module/znvpair/srcversion /sys/module/zunicode/version /sys/module/zunicode/srcversion /sys/module/zcommon/version /sys/module/zcommon/srcversion /sys/module/zfs/version /sys/module/zfs/srcversion /sys/module/zpios/version /sys/module/zpios/srcversion Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1923	2013-12-06 09:34:41 -08:00
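
The version files listed in the commit above can be read from userspace. The following is a minimal sketch in Python, assuming the modules are loaded and sysfs is mounted at /sys; the helper name `read_module_versions` and its `sysfs_root` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Module names taken from the commit message above.
ZFS_MODULES = ("zavl", "znvpair", "zunicode", "zcommon", "zfs", "zpios")


def read_module_versions(modules=ZFS_MODULES, sysfs_root="/sys/module"):
    """Return {module: {"version": ..., "srcversion": ...}} for loaded modules.

    Hypothetical helper: modules without a directory under sysfs_root are
    treated as not loaded and skipped; a missing attribute file yields None.
    """
    result = {}
    for name in modules:
        module_dir = Path(sysfs_root) / name
        if not module_dir.is_dir():
            continue  # module not loaded
        info = {}
        for attr in ("version", "srcversion"):
            attr_file = module_dir / attr
            info[attr] = attr_file.read_text().strip() if attr_file.is_file() else None
        result[name] = info
    return result


if __name__ == "__main__":
    for module, info in read_module_versions().items():
        print(f"{module}: version={info['version']} srcversion={info['srcversion']}")
```

Comparing "srcversion" across hosts is one way to check whether two installations were built from the same sources, since the commit describes it as a checksum of the .c files and headers.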
Matthew Ahrens	e8b96c6007	Illumos #4045 write throttle & i/o scheduler performance work 4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithm (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead. 
* the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact on performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPUs; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. 
Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. 
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. 
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brendan Gregg <brendan.gregg@joyent.com> Approved by: Robert Mustacchi <rm@joyent.com> References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5647 Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1913	2013-12-06 09:32:43 -08:00
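
The 32-bit overflow concern in the porting notes above comes down to simple arithmetic. The sketch below models it in Python under stated assumptions: `store_u32` stands in for assignment to a 32-bit `unsigned long`, and `dirty_data_max_max` is a hypothetical helper that mirrors the described default of 25% of physical RAM. Neither name comes from the ZFS sources, and the RAM sizes are illustrative.

```python
U32_MAX = 2**32 - 1  # largest value a 32-bit unsigned long can hold
GIB = 2**30


def store_u32(value):
    """Model assigning to a 32-bit unsigned long: the value wraps modulo 2**32."""
    return value & U32_MAX


def fits_u32(value):
    """True if the value survives a 32-bit unsigned store unchanged."""
    return store_u32(value) == value


def dirty_data_max_max(phys_ram_bytes, percent=25):
    """Hypothetical sketch of the port's default: a percentage of RAM, in bytes."""
    return phys_ram_bytes * percent // 100


if __name__ == "__main__":
    # The Illumos initial value of 2^32 does not fit in 32 bits.
    print("2^32 fits in 32 bits:", fits_u32(2**32))
    # Illustrative RAM sizes only; real 32-bit systems vary.
    for ram_gib in (2, 8, 24):
        limit = dirty_data_max_max(ram_gib * GIB)
        print(f"{ram_gib} GiB RAM -> limit {limit} bytes, fits: {fits_u32(limit)}")
```

This also shows why the commit says the approach does not completely avoid overflow: with enough physical RAM, 25% of it can still exceed what a 32-bit `unsigned long` holds, which is when overriding zfs_dirty_data_max_max_percent becomes necessary.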
Matthew Ahrens	384f8a09f8	Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT) Fix a lock contention issue by allowing threads not holding ZPL locks to block when waiting to assign a transaction. Porting Notes: zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This case may be a contention point just like zfs_write(), however it is not safe to block here since it may be called during memory reclaim. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4347 illumos/illumos-gate@e722410c49 Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-12-06 09:30:51 -08:00
Brian Behlendorf	2e40f09410	Remove incorrect ASSERT in zfs_sb_teardown() As part of zfs_sb_teardown() there is an assertion that all inodes which are part of the zsb->z_all_znodes list have at least one reference on them. This is always true for the standard unmount case but there are two other cases where it is not strictly true. * zfs_ioc_rollback() - This is the most common case and it results from the fact that we aren't unmounting the filesystem. During a normal unmount the MS_ACTIVE flag will be cleared on the super block causing iput_final() to evict the inode when its reference count drops to zero. However, during a rollback MS_ACTIVE remains set since we're rolling back a live filesystem and need to preserve the existing super block. This allows inodes with a zero reference count to stay in the cache thereby violating the assertion. * destroy_inode() / zfs_sb_teardown() - There exists a small race between dropping the last reference on an inode and removing it from the zsb->z_all_znodes list. This is unlikely to occur but could also trigger the assertion which is incorrect. The inode may safely have a zero reference count in this case. Since allowing a zero reference count on the inode is expected and safe for both of these cases the simplest thing to do is remove the ASSERT. This code is only enabled for default builds so removing this entirely is a very safe change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #1417 Closes #1536	2013-12-02 15:58:58 -08:00
Tim Chase	f707635fa5	Some nvlist allocations in hold processing need to use KM_PUSHPAGE. This should hopefully catch the rest of the allocations in the user hold/release processing that were missed by commit `65c67ea86e`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1852 Closes #1855	2013-12-02 14:02:46 -08:00
Etienne Dechamps	119a394ab0	Only commit the ZIL once in zpl_writepages() (msync() case). Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. 
zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereby it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to #907 whereby calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1849 Closes #907	2013-11-23 15:08:29 -08:00
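
The difference between committing the ZIL once per page and once per batch, as described in the commit above, can be illustrated with a toy model. This Python sketch is not kernel code: `ToyZil`, `writepages_per_page`, and `writepages_batched` are hypothetical names, and the model only counts commits and fires a completion callback per logged write, loosely mirroring the callback idea in the commit message.

```python
class ToyZil:
    """Stand-in for the ZIL that only counts commits and fires callbacks."""

    def __init__(self):
        self.commits = 0
        self.pending = []  # (page, callback) pairs not yet on stable storage

    def log_write(self, page, callback):
        """Queue a write; callback runs once the write is committed."""
        self.pending.append((page, callback))

    def commit(self):
        """Flush all queued writes in a single commit and notify callers."""
        self.commits += 1
        pending, self.pending = self.pending, []
        for page, callback in pending:
            callback(page)


def writepages_per_page(zil, pages, done):
    """Old behaviour in this model: one commit after every page."""
    for page in pages:
        zil.log_write(page, done.append)
        zil.commit()


def writepages_batched(zil, pages, done):
    """New behaviour in this model: queue every page, then commit once."""
    for page in pages:
        zil.log_write(page, done.append)
    zil.commit()


if __name__ == "__main__":
    pages = list(range(1000))
    for strategy in (writepages_per_page, writepages_batched):
        zil, done = ToyZil(), []
        strategy(zil, pages, done)
        print(f"{strategy.__name__}: {zil.commits} commits, {len(done)} pages done")
```

Both strategies notify every page exactly once; only the number of commits differs, which is the cost the patch targets.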

670 Commits