Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	1c2555ef92	Restructure mount option handling Restructure the handling of mount options to be consistent with upstream OpenZFS. This required making the following changes. - The zfs_mntopts_t was renamed vfs_t and adjusted to provide the minimal needed functionality. This includes a pointer back to the associated zfsvfs_t. Plus it made it possible to revert zfs_register_callbacks() and zfsvfs_create() back to their original prototypes. - A zfs_mnt_t structure was added for the sole purpose of providing a structure to pass the osname and raw mount pointer to zfs_domount() without having to copy them. - Mount option parsing was moved down from the zpl_* wrapper functions in to the zfs_* functions. This allowed for the code to be simplied and it's where similar functionality appears on other platforms. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:41 -08:00
Brian Behlendorf	f298b24ddf	Rename zfs_* functions Several functions were renamed when ZFS was originally ported to Linux. Revert the code to the original names to minimize the delta with upstream OpenZFS. zfs_sb_teardown -> zfsvfs_teardown zfs_sb_create -> zfsvfs_create zfs_sb_setup -> zfsvfs_setup zfs_sb_free -> zfsvfs_free get_zfs_sb -> getzfsvfs zfs_sb_hold -> zfsvfs_hold zfs_sb_rele -> zfsvfs_rele zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only) zfs_sb_prune -> zfs_prune (Linux only) Align the zfs_vnops.h and zfs_vfsops.h with upstream as much as possible. Several prototypes were removed and those that remain were reordered. Move the EXPORT_SYMBOL lines to the end of the source files for consistency with the other source files. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:35 -08:00
Brian Behlendorf	0037b49e83	Rename zfs_sb_t -> zfsvfs_t The use of zfs_sb_t instead of zfsvfs_t results in unnecessary conflicts with the upstream source. Change all instances of zfs_sb_t to zfsvfs_t including updating the variables names. Whenever possible the code was updated to be consistent with hope it appears in the upstream OpenZFS source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:33 -08:00
Chunwei Chen	9b77d1c958	Fix nfs snapdir automount The current implementation for allowing nfs to access snapdir is very buggy. It uses a special fh for snapdirs, such that the next time nfsd does fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd never knows it cross a mountpoint. The problem is that nfsd will not hold a reference on the vfsmount of the snapshot. This cause auto unmounter to unmount the snapshot even though nfs is still holding dentries in it. To fix this, we return the inode for the snapdirs themselves. However, we also trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate and see the mountpoint and do crossmnt. Because nfsd will now be aware that these are different filesystems users must add crossmnt to their export options to access snapshot directories. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3794 Closes #4716 Closes #5810 Closes #5833	2017-03-08 09:26:33 -08:00
Tony Hutter	463009865f	Fix harmless "BARRIER is deprecated" kernel warning on Centos 6.8 A one time warning after module load that "BARRIER is deprecated" was seen on the heavily patched 2.6.32-642.13.1.el6.x86_64 Centos 6.8 kernel. It seems that kernel had both the old BARRIER and the newer FLUSH/FUA interfaces defined. This fixes the warning by prefering the newer FLUSH/FUA interface if it's available. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #5739 Closes #5828	2017-03-08 09:20:21 -08:00
bunder2015	5fc73c46f9	Fix multi-line error messages in blkdev_compat.h Fix multi-line error messages in blkdev_compat.h by changing error-generating multi-line error messages to single line errors. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #5860	2017-03-07 09:54:55 -08:00
Brian Behlendorf	3ec3bc2167	OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_ accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This logic attempted to predict what will be on disk when this txg syncs, to know if the overwritten block will be freed (i.e. exists, and has no snapshots). OpenZFS-issue: https://www.illumos.org/issues/7793 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a Upstream bugs: DLPX-32883a Closes #5804 Porting notes: - DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(), Using the default dnode size would be slightly better. - DEBUG_DMU_TX wrappers and configure option removed. - Resolved _by_dnode() conflicts these changes have not yet been applied to OpenZFS.	2017-03-07 09:51:59 -08:00
Olaf Faaland	4859fe796c	Linux 4.11 compat: avoid refcount_t name conflict Linux 4.11 introduces a new type, refcount_t, which conflicts with the type of the same name defined within ZFS. Rename the ZFS type zfs_refcount_t. Within the ZFS code, use a macro to cause references to refcount_t to be changed to zfs_refcount_t at compile time. This reduces conflicts when later landing OpenZFS patches. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5823 Closes #5842	2017-02-28 16:10:18 -08:00
Matthew Ahrens	66eead53c9	Clean up by-dnode code in dmu_tx.c `0eef1bde31` introduced some changes which we slightly improved the style of when porting to illumos. There is also one minor error-handling fix, in zap_add() the "zap" may become NULL in case of an error re-opening the ZAP. Originally suggested at: https://github.com/openzfs/openzfs/pull/276 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5805	2017-02-24 13:34:26 -08:00
Andriy Gapon	0efd97912a	OpenZFS 7199 - dsl_dataset_rollback_sync may try to free already free blocks 7200 no blocks must be born in a txg after a snaphot is created Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7199 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bfaed0b Closes #5817	2017-02-24 11:05:33 -08:00
Matthew Ahrens	c30e58c462	zfs_arc_num_sublists_per_state should be common to all multilists The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists. This tuning may be overridden on a per-multilist basis. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5764	2017-02-15 15:49:33 -08:00
Matthew Ahrens	d7958b4cda	OpenZFS 7104 - increase indirect block size Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7104 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4b5c8e9 Closes #5679	2017-02-09 10:27:02 -08:00
Giuseppe Di Natale	d21d5b8248	OpenZFS 4521 - zfstest is trying to execute evil "zfs unmount -a" Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: - Correctly set __ZFS_POOL_RESTRICT in inherit_001_pos OpenZFS-issue: https://www.illumos.org/issues/4521 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8808ac5 Closes #5674	2017-02-03 13:24:44 -08:00
George Melikov	9b7b9cd370	OpenZFS 1300 - filename normalization doesn't work for removes Authored by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/1300 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f1750d Closes #5725 Porting notes: - zap_micro.c: all `MT_EXACT` are replaced by `0`	2017-02-02 14:13:41 -08:00
David Quigley	2fe36b0bfb	Use fletcher_4 routines natively with `abd_iterate_func()` This patch adds the necessary infrastructure for ABD to make use of the vectorized fletcher 4 routines. - export ABD compatible interface from fletcher_4 - add ABD fletcher_4 tests for data and metadata ABD types. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: David Quigley <david.quigley@intel.com> Closes #5589	2017-02-01 09:34:22 -08:00
George Melikov	539d33c791	OpenZFS 6569 - large file delete can starve out write ops Authored by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> Tested-by: kernelOfTruth <kerneloftruth@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/6569 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1bf4b6f2 Closes #5706	2017-01-31 14:44:03 -08:00
George Melikov	ed828c0c37	OpenZFS 7280 - Allow changing global libzpool variables in zdb and ztest through command line Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7280 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e60744 Closes #5676	2017-01-31 10:13:10 -08:00
George Melikov	28b40c8a6e	OpenZFS 7541 - zpool import/tryimport ioctl returns ENOMEM Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> The refresh_config() calls into the kernel with ZFS_IOC_POOL_TRYIMPORT. This ioctl returns the config of the pool in a buffer pre-allocated in userland. The original estimate for the size is too conservative since it doesn't account for the large size of vdev stats that are added to the config before returning. This fix simply increases the size of the buffer passed. This results in a speed up of the zpool import process, and less spam in zfs_dbgmsg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7541 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c7690 Closes #5704	2017-01-30 13:20:54 -08:00
George Melikov	fa603f8233	OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's Porting notes: - 'zfs_dbgmsg_print()' reintroduced to userspace. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7277 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2f Closes #5684	2017-01-28 12:16:43 -08:00
George Melikov	a08abc1bb3	OpenZFS 7301 - zpool export -f should be able to interrupt file freeing Authored by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7301 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb72182 Closes #5680	2017-01-27 11:46:39 -08:00
George Melikov	cc9bb3e58e	OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7254 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c166b69 Closes #5670	2017-01-27 11:43:42 -08:00
Tim Chase	258553d3d7	OpenZFS 7613 - ms_freetree[4] is only used in syncing context metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg). Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7613 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8698da2 Closes #5598	2017-01-26 15:27:19 -08:00
George Melikov	9c9531cb6f	OpenZFS 7500 - Simplify dbuf_free_range by removing dn_unlisted_l0_blkid Authored by: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7500 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/653af1b Closes #5639	2017-01-26 15:15:48 -08:00
George Melikov	39efbde7c5	OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS fsid change Authored by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Dan Vatca <dan.vatca@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6676 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40510e8 Closes #5667	2017-01-26 14:43:28 -08:00
George Melikov	aeacdefedc	OpenZFS 7386 - zfs get does not work properly with bookmarks Authored by: Marcel Telka <marcel@telka.sk> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7386 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edb901a Closes #5666	2017-01-26 14:42:15 -08:00
George Melikov	774ee3c7ce	OpenZFS 7336 - vfork and O_CLOEXEC causes zfs_mount EBUSY Porting notes: - statvfs64 is replaced by statfs64. - ZFS_SUPER_MAGIC definition moved in include/sys/fs/zfs.h to share it between user and kernel space. Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7336 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dd862f6d Closes #5651	2017-01-26 12:28:29 -08:00
George Melikov	e2a65adbb8	OpenZFS 6871 - libzpool implementation of thread_create should enforce length is 0 Porting notes: - Several direct callers of zk_thread_create() are passing TS_RUN for the length. The `len` and `state` were inverted,this commit fixes them. Authored by: Eli Rosenthal <eli.rosenthal@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov mail@gmelikov.ru OpenZFS-issue: https://www.illumos.org/issues/6871 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8fc9228 Closes #5621	2017-01-24 09:13:49 -08:00
Brian Behlendorf	e82dbae1ee	Fix build-it compilation regression Accidentally introduced by `4ea3f86`. The BEGIN CSTYLE block cannot appear half way through a continued #define. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5643 Closes #5644	2017-01-24 08:50:15 -08:00
George Melikov	ec923db25c	OpenZFS 7180 - potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7180 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/690041b Closes #5627	2017-01-23 10:53:46 -08:00
George Melikov	e67a7ffb5d	OpenZFS 6052 - decouple lzc_create() from the implementation details Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov mail@gmelikov.ru OpenZFS-issue: https://www.illumos.org/issues/6052 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/26455f9 Closes #5622	2017-01-23 09:49:57 -08:00
George Melikov	f85c06bedf	OpenZFS 7054 - dmu_tx_hold_t should use refcount_t to track space Authored by: Igor Kozhukhov ikozhukhov@gmail.com Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov mail@gmelikov.ru OpenZFS-issue: https://www.illumos.org/issues/7054 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c779ad Closes #5600	2017-01-23 09:36:24 -08:00
George Melikov	4ea3f86426	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
Chunwei Chen	040dab9939	Suspend/resume zvol for recv and rollback When doing recv and rollback, dsl_dataset_clone_swap_sync_impl will be called to swap out the ds_objset and do dmu_objset_evict on the old one. However, currently zv->zv_objset will not be swapped out accordingly, so if anyone currently holds a fd on the zvol, we risk hitting a use-after-free. We fix this by introducing the suspend and resume mechanism of zsb to zv. Before recv or rollback, we use zvol_suspend to block all access to zv_objset and shut it down. After the recv or rollback, we use zvol_resume to swap in zv_objset with the new ds_objset and unblock the access. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4866 Closes #5609	2017-01-19 13:56:36 -08:00
George Melikov	7330fc57b7	OpenZFS 7235 - remove unused func dsl_dataset_set_blkptr Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7235 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bd56f80 Closes #5604	2017-01-17 15:22:56 -08:00
Brian Behlendorf	648a09adc2	OpenZFS 6550 - cmd/zfs: cleanup gcc warnings Porting Notes: - Many of the fixes proposed by this patch were already applied. In the cases where a different but equivalent fix was made the code was updated with the OpenZFS version to minimize differences. Authored by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6550 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c16bcc4 Closes #5591	2017-01-17 14:45:02 -08:00
bzzz77	0eef1bde31	Add _by-dnode routines Add _by_dnode() routines for accessing objects given their dnode_t , this is more efficient than accessing the object by (objset_t , uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Closes #5534 Issue #4802	2017-01-13 14:58:41 -08:00
Don Brady	38640550f2	OpenZFS 7743 - per-vdev-zaps init path for upgrade Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Joe Stein <jas14@cs.brown.edu> Ported-by: Don Brady <don.brady@intel.com> When loading a pool that had been created before the existance of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs. OpenZFS-issue: https://www.illumos.org/issues/7743 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0 Closes #5582	2017-01-13 13:50:22 -08:00
Don Brady	4e21fd060a	OpenZFS 7303 - dynamic metaslab selection This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based). Porting Notes: The metaslab allocation tracing code is conditionally removed on linux (dependent on mdb debugger). Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Don Brady <don.brady@intel.com> OpenZFS-issue: https://www.illumos.org/issues/7303 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd Closes #5404	2017-01-12 11:52:56 -08:00
George Melikov	e9aa730c49	OpenZFS 6328 - Fix cstyle errors in zfs codebase Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6328 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9a686fb Closes #5579	2017-01-12 09:42:11 -08:00
George Melikov	24d42e2221	OpenZFS 7259 - DS_FIELD_LARGE_BLOCKS is unused Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of this patch: `241b541` Illumos 5959 - clean up per-dataset feature count code. This patch simply removes this macro from dsl_dataset.h. OpenZFS-issue: https://www.illumos.org/issues/7259 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faa8036 Closes #5544	2017-01-03 12:03:05 -06:00
ka7	4e33ba4c38	Fix spelling Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5547 Closes #5543	2017-01-03 11:31:18 -06:00
Tim Chase	a5e046eaac	4.10 compat - BIO flag changes and others [bio] The req_op enum was changed to req_opf. Update the "Linux 4.8 API" autotools checks to use an int to determine whether the various REQ_OP values are defined. This should work properly on kernels >= 4.8. [bio] bio_set_op_attrs() is now an inline function and can't be detected with #ifdef. Add a configure check to determine whether bio_set_op_attrs() is defined. Move the local definition of it from vdev_disk.c to blkdev_compat.h for consistency with other related compability shims. [bio] The read/write flags and their modifiers, including WRITE_FLUSH, WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h. Add the new bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA and set the flags appropriately for each supported kernel version. [vfs] The generic_readlink() function has been made static. If .readlink in inode_operations is NULL, generic_readlink() is used. [zol typo] Completely unrelated to 4.10 compat, fix a typo in the check for REQ_OP_SECURE_ERASE so that the proper macro is defined: s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5499	2016-12-30 16:03:59 -06:00
Chunwei Chen	da8f51e16a	Use a dedicated taskq for vdev_file The introduction of parallel zvol prefetch causes deadlock when using vdev_file. spa_async->(spa_namespace_lock)->txg_wait_synced->(wait for txg_sync) txg_sync->zio_wait->(wait for vdev_file_io_fsync on system_taskq) zvol_prefetch_minors_impl (on system_taskq)->spa_open_common->(wait for spa_namespace_lock) We fix this by using dedicated taskq for vdev_file. This same change was originally made in commit `bc25c93` but reverted in commit `aa9af22` when dynamic taskqs were added. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #5506 Closes #5495	2016-12-21 10:47:15 -08:00
Chunwei Chen	05100ec8f0	Fix wrong operator in xvattr.h Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-14 14:53:56 -08:00
Brian Behlendorf	02730c333c	Use cstyle -cpP in `make cstyle` check Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED / / END CSTYLED / blocks /* CSTYLED / markers change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5465	2016-12-12 10:46:26 -08:00
Brian Behlendorf	f95e647891	Speed up zvol import and export speed Speed up import and export speed by: * Add system delay taskq * Parallel prefetch zvol dnodes during zvol_create_minors * Parallel zvol_free during zvol_remove_minors * Reduce list linear search using ida and hash Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5433	2016-12-08 14:05:02 -07:00
Gvozden Neskovic	e8a2014436	Cache ddt_get_dedup_dspace() value if there was no ddt changes Save and reuse ddt dspace calculation when there have been no ddt changes. This avoids unnecessary traversal of 168KiB of ddt histograms. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5425	2016-12-02 16:59:35 -07:00
Brian Behlendorf	baf67d15a5	Refactor txg history kstat It was observed that even when the txg history is disabled by setting `zfs_txg_history=0` the txg_sync thread still fetches the vdev stats unnecessarily. This patch refactors the code such that vdev_get_stats() is no longer called when `zfs_txg_history=0`. And it further reduces the differences between upstream and the ZoL txg_sync_thread() function. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5412	2016-12-02 16:57:49 -07:00
cao	e2c7d3785a	Remove unused sa_update_from_cb() It looks like this was functionality which was added in the original SA implementation and then never needed. It can be safely removed now and easily added back if we find a use for it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5440	2016-12-01 16:39:06 -07:00
cao	6a8fd57fa7	Compile zio.h and zio_impl.h mutual include zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the header files to contain each other. Get rid of the zio_impl.h include in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5439	2016-12-01 16:36:25 -07:00
Chunwei Chen	57ddcda164	Use system_delay_taskq for long delay tasks Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire. This free system_taskq from the above long delay tasks, and allow us to do taskq_wait_outstanding on system_taskq without being blocked forever, making system_taskq more generic and useful. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-01 14:52:48 -08:00
Chunwei Chen	9829574834	ABD optimized page allocation code * Convert ABD to use the Linux Kernel scatterlist implementation instead of the hand rolled one from illumos. * Scatter ABDs are preferentially populated with higher order compound pages from a single zone. Allocation size is progressively decreased until it can be satisfied without performing reclaim or compaction. * An alternate page allocator is provided for kernels older than 3.6 and for CONFIG_HIGHMEM systems. This allocator is designed as a fallback for maximum compatibility. * Extended abdstats to provide visibility in the the allocator. * Add cached value for PAGESIZE in userspace. Contributions-by: Chunwei Chen <david.chen@osnexus.com> Gvozden Neskovic <neskovic@gmail.com> Jinshan Xiong <jinshan.xiong@intel.com> Isaac Huang <he.huang@intel.com> David Quigley <david.quigley@intel.com> Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	a206522c4f	ABD changes for vectorized RAIDZ * userspace: aligned buffers. Minimum of 32B alignment is needed for AVX2. Kernel buffers are aligned 512B or more. * add abd_get_offset_size() interface * abd_iter_map(): fix calculation of iter_mapsize * add abd_raidz_gen_iterate() and abd_raidz_rec_iterate() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Isaac Huang	b0be93e81a	ABD page support to vdev_disk.c Signed-off-by: Isaac Huang <he.huang@intel.com>	2016-11-29 14:34:32 -08:00
David Quigley	a6255b7fce	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
Tony Hutter	8720e9e748	Add -c to zpool iostat & status to run command This patch adds a command (-c) option to zpool status and zpool iostat. The -c option allows you to run an arbitrary command on each vdev and display the first line of output in zpool status/iostat. The environment vars VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path" before running the command. For device mapper, multipath, or partitioned vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk. This can be useful if the command you're running requires a /dev/sd* device. The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device instead of using libdevmapper. This not only removes the libdevmapper requirement at build time, but also allows you to resolve device mapper devices without being root. This means that UDEV_UPATH get set correctly when running zpool status/iostat as an unprivileged user. Example: $ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH' NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 mpatha ONLINE 0 0 0 I am /dev/mapper/mpatha, /dev/sdc sdb ONLINE 0 0 0 I am /dev/sdb1, /dev/sdb Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #5368	2016-11-29 14:45:38 -07:00
LOLi	2f71caf2d9	Allow zfs unshare <protocol> -a Allow `zfs unshare <protocol> -a` command to share or unshare all datasets of a given protocol, nfs or smb. Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases. To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some function wrappers were added around them. Finally, fix and issue in smb_is_share_active() that would leave SMB shares exported when invoking 'zfs unshare -a' Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #3238 Closes #5367	2016-11-29 12:22:38 -07:00
cao	3bfd95d589	Fix coverity defects: CID 147540, 147542 CID 147540: unsigned_compare - Cast nsec to a int32_t to properly detect the expected overflow. CID 147542: unsigned_compare - intval can never be less than ZIO_FAILURE_MODE_WAIT which is defined to be zero. Remove this useless check. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5379	2016-11-09 17:35:26 -08:00
jxiong	126ae9f4e9	Export symbol dmu_objset_userobjspace_upgradable It's used by Lustre to determine if the objset can be upgraded. The inline version doesn't work because dmu_objset_is_snapshot() is not exported. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Closes #5385	2016-11-09 13:51:12 -08:00
tuxoko	0420c126ce	Linux 3.14 compat: assign inode->set_acl Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come from setxattr, which will handle by the acl xattr_handler, and we already handles that well. However, nfsd will directly calls inode->set_acl or return error if it doesn't exists. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Massimo Maggi <me@massimo-maggi.eu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5371 Closes #5375	2016-11-09 10:37:17 -08:00
Chunwei Chen	3779913b35	Use set_cached_acl and forget_cached_acl when possible Originally, these two function are inline, so their usability is tied to posix_acl_release. However, since Linux 3.14, they became EXPORT_SYMBOL, so we can always use them. In this patch, we create an independent test for these two functions so we can use them when possible. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-07 11:04:44 -08:00
Chunwei Chen	8e71ab99dc	Batch free zpl_posix_acl_release Currently every calls to zpl_posix_acl_release will schedule a delayed task, and each delayed task will add a timer. This used to be fine except for possibly bad performance impact. However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In this new implementation, the larger the delay, the less accuracy the timer is. So when we have a flood of timer from zpl_posix_acl_release, they will expire at the same time. Couple with the fact that task_expire will do linear search with lock held. This causes an extreme amount of contention inside interrupt and would actually lockup the system. We fix this by doing batch free to prevent a flood of delayed task. Every call to zpl_posix_acl_release will put the posix_acl to be freed on a lockless list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free every posix_acl that passed the grace period on the list. This way, we only have one delayed task every second. [1] https://lwn.net/Articles/646950/ Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-07 11:04:44 -08:00
Brian Behlendorf	83bf769d50	Fix 'zpool import' detection issues This patch addresses multiple 'zpool import' block device indentification problems which are most likely to occur on a system configured to use blkid, by_vdev paths, multipath and failover. The symptom most commonly observed is the import uses different path names to import the pool than would normally be expected. * When using blkid to identify vdevs the listed devices may be added to the cache in any order. In order to apply the preferred search order heuristic a zfs_path_order() function was added to calculate the order given full path names. * Since it's possible to have multiple block devices with different vdev guids which refer to the same ZPOOL_CONFIG_PATH the slice cache must be indexed by guid and name. By avoiding collisions the preferred ordering can be maintaining even when multiple block devices claim the same ZPOOL_CONFIG_PATH. The preferred sorting by partition was never benefitial for a Linux system and was removed as part of this change. * When adding entries to the blkid cache avl_find/avl_insert are used instead of avl_add because collisions are possible and must be handled gracefully. * For pools using multipath devices there are, at a minimum, three devices where a vdev label may be read. They are the dm-* device and each underlying /dev/sd* device. Due to the way the block cache is implemented each of these devices may have a different cached copy of the vdev label. This can result in "ghost pools" which appear to persist even after a 'zpool labelclear' has been done to the dm-* device. In order to prevent this the vdev label is read with O_DIRECT in order to bypass any caching to get the on-disk version. * When opening a block device verify that vdev guid read from the disk matches the expected vdev guid. This allows for bad labels to be filtered out. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5359	2016-11-07 10:28:57 -08:00
Romain Dolbeau	7f3194932d	Add superscalar fletcher4 This is the Fletcher4 algorithm implemented in pure C, but using multiple counters using algorithms identical to those used for SSE/NEON and AVX2. This allows for faster execution on core with strong superscalar capabilities but weak SIMD capabilities. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #5317	2016-11-04 10:53:03 -07:00
Chunwei Chen	ace1eae84c	Add support for O_TMPFILE Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on supported filesystem. It's basically doing open(2) and unlink(2) atomically. The filesystem support is added through i_op->tmpfile. We basically copy the create operation except we get rid of the link and name related stuff and add the new node to unlinked set. We also add support for linkat(2) to link tmpfile. However, since all previous file operation will skip ZIL, we force a txg_wait_synced to make sure we are sync safe. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Chunwei Chen	987014903f	Fix unlinked file cannot do xattr operations Currently, doing things like fsetxattr(2) on an unlinked file will result in ENODATA. There's two places that cause this: zfs_dirent_lock and zfs_zget. The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we need it to not return error when the zp is unlinked. This is a pretty big change in behavior, but skimming through all the callers, I don't think this change would cause any problem. Also there's nothing preventing z_unlinked from being set after the z_lock mutex is dropped before but before zfs_zget returns anyway. The rest of the stuff is to make sure we don't log xattr stuff when owner is unlinked. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Romain Dolbeau	7f547f85fe	Add parity generation/rebuild using AVX-512 for x86-64 avx512f should work on all AVX512 hardware, since it only uses Foundation instructions. avx512bw should be faster on hardware supporting the AVW512BW extension. We can use full-width pshufb (instead of relying on the 256 bits AVX2 pshufb). As a side-effect, the code is also unrolled more. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name> Closes #5219	2016-11-02 12:40:23 -07:00
Brian Behlendorf	82ec9d41d8	Fix 32-bit maximum volume size A limit of 1TB exists for zvols on 32-bit systems. Update the code to correctly reflect this limitation in a similar manor as the OpenZFS implementation. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
Brian Behlendorf	48d3eb40c7	Add TASKQID_INVALID Add the TASKQID_INVALID macros and update callers to use the macro instead of testing against 0. There is no functional change even though the functions in zfs_ctldir.c incorrectly used -1 instead of 0. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
cao	2bac68145f	Fix coverity defects: CID 147548 CID 147548: Type:Dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5321	2016-10-31 16:56:10 -07:00
Hajo Möller	e02aaf17f1	Fix lookup_bdev() on Ubuntu Ubuntu added support for checking inode permissions to lookup_bdev() in kernel commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21). Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517 This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and calls the function in the correct way. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Hajo Möller <dasjoe@gmail.com> Closes #5336	2016-10-26 10:30:43 -07:00
jxiong	16fa68f07d	Do not upgrade userobj accounting for snapshot dataset 'zfs recv' could disown a living objset without calling dmu_objset_disown(). This will cause the problem that the objset would be released while the upgrading thread is still running. This patch avoids the problem by checking if a dataset is a snapshot before calling dmu_objset_userobjspace_upgrade(). Snapshots are immutable and therefore it doesn't make sense to update them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Closes #5295 Closes #5328	2016-10-25 13:21:05 -07:00
Tony Hutter	1bbd877049	Turn on/off enclosure slot fault LED even when disk isn't present Previously when a drive faulted, the statechange-led.sh script would lookup the drive's LED sysfs entry in /sys/block/sd/device/enclosure_device, and turn it on. During testing we noticed that if you pulled out a drive, or if the drive was so badly broken that it no longer appeared to Linux, that the /sys/block/sd path would be removed, and the script could not lookup the LED entry. To fix this, this patch looks up the disks's more persistent "/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then passes that path to the statechange-led script to use, rather than having the script look it up on the fly. This allows the script to turn on/off the slot LEDs even when the drive is missing. Closes #5309 Closes #2375	2016-10-24 10:45:59 -07:00
Romain Dolbeau	24cdeaf12e	Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bits This is not useful on micro-architecture with a weak NEON implementation (only 64 bits); the native version is slower & the byteswap barely faster than scalar. On A53 or A57, it's a small improvement on scalar but OK for byteswap. Results from an A53 system: 0 0 0x01 -1 0 1499068294333000 1499101101878000 implementation native byteswap scalar 1008227510 755880264 aarch64_neon 1198098720 1044818671 fastest aarch64_neon aarch64_neon Results from a A57 system: 0 0 0x01 -1 0 4407214734807033 4407233933777404 implementation native byteswap scalar 2302071241 1124873346 aarch64_neon 2542214946 2245570352 fastest aarch64_neon aarch64_neon Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #5248	2016-10-21 10:55:49 -07:00
cao	5a6765cf8c	Fix coverity defects: CID 147472 CID 147472: Type: 'Constant' variable guards dead code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5288	2016-10-20 11:24:01 -07:00
Brian Behlendorf	3b0ba3ba99	Linux 4.9 compat: inode_change_ok() renamed setattr_prepare() In torvalds/linux@31051c8 the inode_change_ok() function was renamed setattr_prepare() and updated to take a dentry ratheri than an inode. Update the code to call the setattr_prepare() and add a wrapper function which call inode_change_ok() for older kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Requires-spl: refs/pull/581/head	2016-10-20 09:39:09 -07:00
Tony Hutter	6078881aa1	Multipath autoreplace, control enclosure LEDs, event rate limiting 1. Enable multipath autoreplace support for FMA. This extends FMA autoreplace to work with multipath disks. This requires libdevmapper to be installed at build time. 2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must be supported by the Linux SES driver for this to work. The enclosure LED scripts work for multipath devices as well. The scripts will clear the LED when the fault is cleared. 3. Rate limit ZIO delay and checksum events so as not to flood ZED ZIO delay and checksum events are rate limited to 5/sec in the zfs module. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #2449 Closes #3017 Closes #5159	2016-10-19 12:55:59 -07:00
Don Brady	3dfb57a35e	OpenZFS 7090 - zfs should throttle allocations OpenZFS 7090 - zfs should throttle allocations Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Ported-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. OpenZFS-issue: https://www.illumos.org/issues/7090 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7 Closes #5258 Porting Notes: - Maintained minimal stack in zio_done - Preserve linux-specific io sizes in zio_write_compress - Added module params and documentation - Updated to use optimize AVL cmp macros	2016-10-13 17:59:18 -07:00
Brian Behlendorf	482cd9ee69	Fletcher4: Incremental updates and ctx calculation Fixes ABI issues with fletcher4 code, adds support for incremental updates, and adds ztest method for testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5164	2016-10-07 12:44:12 -07:00
Jinshan Xiong	1de321e626	Add support for user/group dnode accounting & quota This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting). A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnode as dirty. ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500 Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>	2016-10-07 09:45:13 -07:00
Gvozden Neskovic	5bf703b8f3	Fletcher4: save/reload implementation context Init, compute, and fini methods are changed to work on internal context object. This is necessary because ABI does not guarantee that SIMD registers will be preserved on function calls. This is technically the case in Linux kernel in between `kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that don't require disabling preemption for using SIMD (osx). Use scalar compute methods in-place for small buffers, and when the buffer size does not meet SIMD size alignment. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-10-05 16:41:46 +02:00
Tony Hutter	3c67d83a8a	OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> Ported by: Tony Hutter <hutter2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Porting Notes: This code is ported on top of the Illumos Crypto Framework code: `b5e030c8db` The list of porting changes includes: - Copied module/icp/include/sha2/sha2.h directly from illumos - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init) - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter. - Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c - Added skein & edonr to libicp/Makefile.am - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos. - Updated ztest.c with new fletcher_4_() args; used NULL for new CTX argument. - In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section to not #include the non-existant endian.h. - In skein_test.c, renane NULL to 0 in "no test vector" array entries to get around a compiler warning. - Fixup test files: - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>, - Remove <note.h> and define NOTE() as NOP. - Define u_longlong_t - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p" - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Remove "for isa in $($ISAINFO); do" stuff - Add/update Makefiles - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h. - EXPORT_SYMBOL _Init/_Update/_Final... routines in ICP modules. - Update scripts/zfs2zol-patch.sed - include <sys/sha2.h> in sha2_impl.h - Add sha2.h to include/sys/Makefile.am - Add skein and edonr dirs to icp Makefile - Add new checksums to zpool_get.cfg - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable() - Fix -Wuninitialized error in edonr_byteorder.h on PPC - Fix stack frame size errors on ARM32 - Don't unroll loops in Skein on 32-bit to save stack space - Add memory barriers in sha2.c on 32-bit to save stack space - Add filetest_001_pos.ksh checksum sanity test - Add option to write psudorandom data in file_write utility	2016-10-03 14:51:15 -07:00
Romain Dolbeau	62a65a654e	Add parity generation/rebuild using 128-bits NEON for Aarch64 This re-use the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitly disable vector registers, they have to be locally re-enabled explicitly. As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique. Even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fails in debug mode. This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro. It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge. The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unroll some functions some more. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #4801	2016-10-03 09:44:00 -07:00
Gvozden Neskovic	031d7c2fe6	fix: Shift exponent too large Undefined operation is reported by running ztest (or zloop) compiled with GCC UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection with large enough offset values. Logically, left shift operation would work, but bit shift semantics in C, and limitation of uint64_t, do not produce desired result. Issue #5059, #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
kernelOfTruth aka. kOT, Gentoo user	51907a31bc	OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records Authored by: Matt Krantz <matt.krantz@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/7230 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2 Closes #5112	2016-09-22 16:01:19 -07:00
cao	884385a0b2	Fix coverity defects Fix coverity defects: coverity scan CID:147623, Type: Resource leak. coverity scan CID:147622, Type: Resource leak. reason: zpool_open zhp, but not zpool_close zhp. so resource leak. coverity scan CID:147621, Type: Resource fd leak. coverity scan CID:147620, Type: Resource fd leak. reason: do_write do_read open file fd,but exception not close fd. delete unuse definition DMU_OS_IS_L2COMPRESSIBLE. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5137	2016-09-20 17:45:45 -07:00
Dan Kimmel	524b4217b8	DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone() Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:59:13 -07:00
Dan Kimmel	c4434877ae	Remove lint suppression from dmu.h and unnecessary dmu.h include in spa.h Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:59:09 -07:00
Dan Kimmel	2aa34383b9	DLPX-40252 integrate EP-476 compressed zfs send/receive Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:58:58 -07:00
George Wilson	d3c2ae1c08	OpenZFS 6950 - ARC should cache compressed data Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently access data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is inuse it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled that the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078	2016-09-13 09:58:33 -07:00
Don Brady	d02ca37979	Bring over illumos ZFS FMA logic -- phase 1 This first phase brings over the ZFS SLM module, zfs_mod.c, to handle auto operations in response to disk events. Disk event monitoring is provided from libudev and generates the expected payload schema for zfs_mod. This work leverages the recently added devid and phys_path strings in the vdev label. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4673	2016-09-01 11:39:45 -07:00
luozhengzheng	0b284702b7	Delete unreferenced function zfs_ereport_send_interim_checksum Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5055	2016-09-01 11:39:45 -07:00
Gvozden Neskovic	ee36c709c3	Performance optimization of AVL tree comparator functions perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033	2016-08-31 14:35:34 -07:00
cao	8f50bafb04	Delete unused zfsctl_snapdir_inactive declaration zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7 this is declaration remains even though the implementation was removed in commit `278bee93`. Removed fastreboot_disable_highpil which is also unused. Signed-off-by: caoxuewen cao.xuewen@zte.com.cn Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5042	2016-08-30 14:33:40 -07:00
Alexander Motin	755065f3dc	OpenZFS 6322 - ZFS indirect block predictive prefetch For quite some time I was thinking about possibility to prefetch ZFS indirection tables while doing sequential reads or writes. Recent changes in predictive prefetcher made that much easier to do. My tests on zvol with 16KB block size on 5x striped and 2x mirrored pool of 10 disks show almost double throughput on sequential read, and almost tripple on sequential rewrite. While for read alike effect can be received from increasing maximal prefetch distance (though at higher memory cost), for rewrite there is no other solution so far. Authored by: Alexander Motin <mav@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6322 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413 Closes #5040 Porting notes: - Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due to commit `5f6d0b6` 'Handle block pointers with a corrupt logical size' - Difference from upstream in module/zfs/dmu_zfetch.c, uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance - Variables have been initialized at the beginning of the function (void dmu_zfetch) to resemble the order of occurrence and account for C99, C11 mode errors.	2016-08-30 14:26:55 -07:00
Gvozden Neskovic	9cc1844a1d	Linux compat: Grsecurity kernel API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4997 Closes #5001	2016-08-22 10:05:45 -07:00
Matthew Ahrens	2bce8049c3	OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7004 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109 Closes #4641 Closes #4972	2016-08-19 12:48:03 -07:00
Matthew Ahrens	8bea981504	OpenZFS 7003 - zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7003 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108 Closes #4972	2016-08-19 12:35:23 -07:00
Paul Dagnelie	32d41fb73a	OpenZFS 7176 - Yet another hole birth issue This is another bug in the long line of hole-birth related issues. In this particular case, it was discovered that a previous hole-birth fix (illumos bug 6513, commit `bc77ba73`) did not cover as many cases as we thought it did. While the issue worked in the case of hole-punching (writing zeroes to a large part of a file), it did not deal with truncation, and then writing beyond the new end of the file. The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7176 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad Upstream-bugs: DLPX-46009 Porting notes: - Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ; keep previous position of the initialization	2016-08-18 09:26:44 -07:00
Gvozden Neskovic	fc897b24b2	Rework of fletcher_4 module - Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function. - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced. - Implementation mutex lock is replaced by atomic variable. - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'. - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B). - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer. - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running. Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`: implementation native byteswap scalar 4768120823 3426105750 sse2 7947841777 4318964249 ssse3 7951922722 6112191941 avx2 13269714358 11043200912 fastest avx2 avx2 Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`: implementation native byteswap scalar 1291115967 1031555336 sse2 2539571138 1280970926 ssse3 2537778746 1080016762 avx2 4950749767 1078493449 avx512f 9581379998 4010029046 fastest avx512f avx512f Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4952	2016-08-16 14:11:55 -07:00

1 2 3 4 5 ...

651 Commits