Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Michael Niewöhner	10b3c7f5e4	Add zstd support to zfs This PR adds two new compression types, based on ZStandard: - zstd: A basic ZStandard compression algorithm Available compression. Levels for zstd are zstd-1 through zstd-19, where the compression increases with every level, but speed decreases. - zstd-fast: A faster version of the ZStandard compression algorithm zstd-fast is basically a "negative" level of zstd. The compression decreases with every level, but speed increases. Available compression levels for zstd-fast: - zstd-fast-1 through zstd-fast-10 - zstd-fast-20 through zstd-fast-100 (in increments of 10) - zstd-fast-500 and zstd-fast-1000 For more information check the man page. Implementation details: Rather than treat each level of zstd as a different algorithm (as was done historically with gzip), the block pointer `enum zio_compress` value is simply zstd for all levels, including zstd-fast, since they all use the same decompression function. The compress= property (a 64bit unsigned integer) uses the lower 7 bits to store the compression algorithm (matching the number of bits used in a block pointer, as the 8th bit was borrowed for embedded block pointers). The upper bits are used to store the compression level. It is necessary to be able to determine what compression level was used when later reading a block back, so the concept used in LZ4, where the first 32bits of the on-disk value are the size of the compressed data (since the allocation is rounded up to the nearest ashift), was extended, and we store the version of ZSTD and the level as well as the compressed size. This value is returned when decompressing a block, so that if the block needs to be recompressed (L2ARC, nop-write, etc), that the same parameters will be used to result in the matching checksum. All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`, `zio_prop_t`, etc.) uses the separated _compress and _complevel variables. Only the properties ZAP contains the combined/bit-shifted value. The combined value is split when the compression_changed_cb() callback is called, and sets both objset members (os_compress and os_complevel). The userspace tools all use the combined/bit-shifted value. Additional notes: zdb can now also decode the ZSTD compression header (flag -Z) and inspect the size, version and compression level saved in that header. For each record, if it is ZSTD compressed, the parameters of the decoded compression header get printed. ZSTD is included with all current tests and new tests are added as-needed. Per-dataset feature flags now get activated when the property is set. If a compression algorithm requires a feature flag, zfs activates the feature when the property is set, rather than waiting for the first block to be born. This is currently only used by zstd but can be extended as needed. Portions-Sponsored-By: The FreeBSD Foundation Co-authored-by: Allan Jude <allanjude@freebsd.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Co-authored-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #6247 Closes #9024 Closes #10277 Closes #10278	2020-08-20 10:30:06 -07:00
Ryan Moeller	009cc8e884	Make zc_nvlist_src_size limit tunable We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication. Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Issue #6572 Closes #10706	2020-08-18 09:33:55 -07:00
Richard Laager	eaa25f1a8e	Remove GRUB restrictions The GRUB restrictions are based around the pool's bootfs property. Given the current situation where GRUB is not staying current with OpenZFS pool features, having either a non-ZFS /boot or a separate pool with limited features are pretty much the only long-term answers for GRUB support. Only the second case matters in this context. For the restrictions to be useful, the bootfs property would have to be set on the boot pool, because that is where we need the restrictions, as that is the pool that GRUB reads from. The documentation for bootfs describes it as pointing to the root pool. That's also how it's used in the initramfs. ZFS does not allow setting bootfs to point to a dataset in another pool. (If it did, it'd be difficult-to-impossible to enforce these restrictions cross-pool). Accordingly, bootfs is pretty much useless for GRUB scenarios moving forward. Even for users who have only one pool, the existing restrictions for GRUB are incomplete. They don't prevent you from enabling the unsupported checksums, for example. For that reason, I have ripped out all the GRUB restrictions. A little longer-term, I think extending the proposed features=portable system to define a features=grub is a much more useful approach. The user could set that on the boot pool at creation, and things would Just Work. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8627	2020-08-17 23:12:39 -07:00
Matthew Ahrens	994de7e4b7	Remove KMC_KMEM and KMC_VMEM `KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented caches are `KMC_KVMEM`. KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we don't use kmalloc to back the SPL caches, instead we use kvmalloc (KMC_KVMEM). The flag, module parameter, /proc entries, and associated code are removed. KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to vmalloc(). The flag, /proc entries, and associated code are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10673	2020-08-17 16:04:28 -07:00
Matthew Ahrens	3442c2a02d	Revise ARC shrinker algorithm The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths: 1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path. 2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback. In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, it's measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages). Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts: 1. ZFS doesn't use RAM to cache data effectively. 2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC. 3. Even with the improvements made in `67c0f0dedc` ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`. To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit: 1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page. 2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`). 3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC. 4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free` is increased. 5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10600	2020-07-31 21:10:52 -07:00
Ryan Moeller	8348fac30c	Limit dbuf cache sizes based only on ARC target size by default Set the initial max sizes to ULONG_MAX to allow the caches to grow with the ARC. Recalculate the metadata cache size on demand so it can adapt, too. Update descriptions in zfs-module-parameters(5). Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10563 Closes #10610	2020-07-24 20:38:48 -07:00
tony-zfs	02fced3067	Add support to decode a resume token Adding a new subcommand to zstream called token. This now allows users to decode a resume token to retrieve the toname field. This can be useful for tools that need this information. The syntax works as follows zstream token <resume_token>. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Tony Perkins <tperkins@datto.com> Closes #10558	2020-07-23 17:44:03 -07:00
Matthew Ahrens	6774931dfa	Extend zdb to print inconsistencies in livelists and metaslabs Livelists and spacemaps are data structures that are logs of allocations and frees. Livelists entries are block pointers (blkptr_t). Spacemaps entries are ranges of numbers, most often used as to track allocated/freed regions of metaslabs/vdevs. These data structures can become self-inconsistent, for example if a block or range can be "double allocated" (two allocation records without an intervening free) or "double freed" (two free records without an intervening allocation). ZDB (as well as zfs running in the kernel) can detect these inconsistencies when loading livelists and metaslab. However, it generally halts processing when the error is detected. When analyzing an on-disk problem, we often want to know the entire set of inconsistencies, which is not possible with the current behavior. This commit adds a new flag, `zdb -y`, which analyzes the livelist and metaslab data structures and displays all of their inconsistencies. Note that this is different from the leak detection performed by `zdb -b`, which checks for inconsistencies between the spacemaps and the tree of block pointers, but assumes the spacemaps are self-consistent. The specific checks added are: Verify livelists by iterating through each sublivelists and: - report leftover FREEs - report double ALLOCs and double FREEs - record leftover ALLOCs together with their TXG [see Cross Check] Verify spacemaps by iterating over each metaslab and: - iterate over spacemap and then the metaslab's entries in the spacemap log, then report any double FREEs and double ALLOCs Verify that livelists are consistenet with spacemaps. The space referenced by livelists (after using the FREE's to cancel out corresponding ALLOCs) should be allocated, according to the spacemaps. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-66031 Closes #10515	2020-07-14 17:51:05 -07:00
Arvind Sankar	38e2e9ce83	Centralize variable substitution A bunch of places need to edit files to incorporate the configured paths i.e. bindir, sbindir etc. Move this logic into a common file. Create arc_summary by copying arc_summary[23] as appropriate at build time instead of install time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10559	2020-07-14 17:33:44 -07:00
Brian Behlendorf	9a49d3f3d3	Add device rebuild feature The device_rebuild feature enables sequential reconstruction when resilvering. Mirror vdevs can be rebuilt in LBA order which may more quickly restore redundancy depending on the pools average block size, overall fragmentation and the performance characteristics of the devices. However, block checksums cannot be verified as part of the rebuild thus a scrub is automatically started after the sequential resilver completes. The new '-s' option has been added to the `zpool attach` and `zpool replace` command to request sequential reconstruction instead of healing reconstruction when resilvering. zpool attach -s <pool> <existing vdev> <new vdev> zpool replace -s <pool> <old vdev> <new vdev> The `zpool status` output has been updated to report the progress of sequential resilvering in the same way as healing resilvering. The one notable difference is that multiple sequential resilvers may be in progress as long as they're operating on different top-level vdevs. The `zpool wait -t resilver` command was extended to wait on sequential resilvers. From this perspective they are no different than healing resilvers. Sequential resilvers cannot be supported for RAIDZ, but are compatible with the dRAID feature being developed. As part of this change the resilver_restart_* tests were moved in to the functional/replacement directory. Additionally, the replacement tests were renamed and extended to verify both resilvering and rebuilding. Original-patch-by: Isaac Huang <he.huang@intel.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: John Poduska <jpoduska@datto.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10349	2020-07-03 11:05:50 -07:00
Matthew Ahrens	7b232e9354	arcstat: add 'avail', fix 'free' The meaning of the `free` field is currently `zfs_arc_sys_free`, which is the target amount of memory to leave free for the system, and is constant after booting. This commit changes the meaning of `free` to arc_free_memory(), the amount of memory that the ARC considers to be free. It also adds a new arcstat field `avail`, which tracks `arc_available_memory()`. Since `avail` can be negative, it also updates the arcstat script to pretty-print negative values. example output: $ arcstat -f time,miss,arcsz,c,grow,need,free,avail 1 time miss arcsz c grow need free avail 15:03:02 39K 114G 114G 0 0 2.4G 407M 15:03:03 42K 114G 114G 0 0 2.1G 120M 15:03:04 40K 114G 114G 0 0 1.8G -177M 15:03:05 24K 113G 112G 0 0 1.7G -269M 15:03:06 29K 111G 110G 0 0 1.6G -385M 15:03:07 27K 110G 108G 0 0 1.4G -535M 15:03:08 13K 108G 108G 0 0 2.2G 239M 15:03:09 33K 107G 107G 0 0 1.3G -639M 15:03:10 16K 105G 102G 0 0 2.6G 704M 15:03:11 7.2K 102G 102G 0 0 5.1G 3.1G 15:03:12 42K 103G 102G 0 0 4.8G 2.8G Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10494	2020-06-26 18:05:28 -07:00
Arvind Sankar	6b99fc0620	Fixes for make dist Reduce the usage of EXTRA_DIST. If files are conditionally included in _SOURCES, _HEADERS etc, automake is smart enough to dist all files that could possibly be included, but this does not apply to EXTRA_DIST, resulting in make dist depending on the configuration. Add some files that were missing altogether in various Makefile's. The changes to disted files in this commit (excluding deleted files): +./cmd/zed/agents/README.md +./etc/init.d/README.md +./lib/libspl/os/freebsd/getexecname.c +./lib/libspl/os/freebsd/gethostid.c +./lib/libspl/os/freebsd/getmntany.c +./lib/libspl/os/freebsd/mnttab.c -./lib/libzfs/libzfs_core.pc -./lib/libzfs/libzfs.pc +./lib/libzfs/os/freebsd/libzfs_compat.c +./lib/libzfs/os/freebsd/libzfs_fsshare.c +./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c +./lib/libzfs/os/freebsd/libzfs_zmount.c +./lib/libzutil/os/freebsd/zutil_compat.c +./lib/libzutil/os/freebsd/zutil_device_path_os.c +./lib/libzutil/os/freebsd/zutil_import_os.c +./module/lua/README.zfs +./module/os/linux/spl/README.md +./tests/README.md +./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh +./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh +./tests/zfs-tests/tests/functional/inheritance/README.config +./tests/zfs-tests/tests/functional/inheritance/README.state +./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh +./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10501	2020-06-26 14:20:02 -07:00
Jorgen Lundman	68301ba20e	zed additional features This commit adds two features to zed, that macOS desires. The first is that when you unload the kernel module, zed would enter into a cpubusy loop calling zfs_events_next() repeatedly. We now look for ENODEV, returned by kernel, so zed can exit gracefully. Second feature is -I (idle) (alas -P persist was taken) is for the deamon to; 1; if started without ZFS kernel module, stick around waiting for it. 2; if kernel module is unloaded, go back to 1. This is due to daemons in macOS is started by launchctl, and is expected to stick around. Currently, the busy loop only exists when errno is ENODEV. This is to ensure that functionality that upstream expects is not changed. It did not care about errors before, and it still does not. (with the exception of ENODEV). However, it is probably better that all errors (ERESTART notwithstanding) exits the loop, and the issues complaining about zed taking all CPU will go away. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10476	2020-06-22 09:53:34 -07:00
adilger	f734301d22	linux: add basic fallocate(mode=0/2) compatibility Implement semi-compatible functionality for mode=0 (preallocation) and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL. Since ZFS does COW and snapshots, preallocating blocks for a file cannot guarantee that writes to the file will not run out of space. Even if the first overwrite was guaranteed, it would not handle any later overwrite of blocks due to COW, so strict compliance is futile. Instead, make a best-effort check that at least enough free space is currently available in the pool (with a bit of margin), then create a sparse file of the requested size and continue on with life. This does not handle all cases (e.g. several fallocate() calls before writing into the files when the filesystem is nearly full), which would require a more complex mechanism to be implemented, probably based on a modified version of dmu_prealloc(), but is usable as-is. A new module option zfs_fallocate_reserve_percent is used to control the reserve margin for any single fallocate call. By default, this is 110% of the requested preallocation size, so an additional 10% of available space is reserved for overhead to allow the application a good chance of finishing the write when the fallocate() succeeds. If the heuristics of this basic fallocate implementation are not desirable, the old non-functional behavior of returning EOPNOTSUPP for calls can be restored by setting zfs_fallocate_reserve_percent=0. The parameter of zfs_statvfs() is changed to take an inode instead of a dentry, since no dentry is available in zfs_fallocate_common(). A few tests from @behlendorf cover basic fallocate functionality. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Arshad Hussain <arshad.super@gmail.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andreas Dilger <adilger@dilger.ca> Issue #326 Closes #10408	2020-06-18 11:22:11 -07:00
Grischa Zengel	059f7c20e3	man.8: Add bookmark to list of types While checking bash_completion I missed bookmark as type. ``` # zfs get type zpool2#b NAME PROPERTY VALUE SOURCE zpool2#b type bookmark - ``` Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Grischa Zengel <github.zfsonlinux@zengel.info> Closes #10419	2020-06-10 17:53:07 -07:00
Matthew Ahrens	f66434268c	Remove unnecessary references to slavery The horrible effects of human slavery continue to impact society. The casual use of the term "slave" in computer software is an unnecessary reference to a painful human experience. This commit removes all possible references to the term "slave". Implementation notes: The zpool.d/slaves script is renamed to dm-deps, which uses the same terminology as `dmsetup deps`. References to the `/sys/class/block/$dev/slaves` directory remain. This directory name is determined by the Linux kernel. Although `dmsetup deps` provides the same information, it unfortunately requires elevated privileges, whereas the `/sys/...` directory is world-readable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10435	2020-06-10 17:07:59 -07:00
Andrea Gelmini	dd4bc569b9	Fix typos Correct various typos in the comments and tests. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #10423	2020-06-09 21:24:09 -07:00
George Amanakis	b7654bd794	Trim L2ARC The l2arc_evict() function is responsible for evicting buffers which reference the next bytes of the L2ARC device to be overwritten. Teach this function to additionally TRIM that vdev space before it is overwritten if the device has been filled with data. This is done by vdev_trim_simple() which trims by issuing a new type of TRIM, TRIM_TYPE_SIMPLE. We also implement a "Trim Ahead" feature. It is a zfs module parameter, expressed in % of the current write size. This trims ahead of the current write size. A minimum of 64MB will be trimmed. The default is 0 which disables TRIM on L2ARC as it can put significant stress to underlying storage devices. To enable TRIM on L2ARC we set l2arc_trim_ahead > 0. We also implement TRIM of the whole cache device upon addition to a pool, pool creation or when the header of the device is invalid upon importing a pool or onlining a cache device. This is dependent on l2arc_trim_ahead > 0. TRIM of the whole device is done with TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t. We save the TRIM state for the whole device and the time of completion on-disk in the header, and restore these upon L2ARC rebuild so that zpool status -t can correctly report them. Whole device TRIM is done asynchronously so that the user can export of the pool or remove the cache device while it is trimming (ie if it is too slow). We do not TRIM the whole device if persistent L2ARC has been disabled by l2arc_rebuild_enabled = 0 because we may not want to lose all cached buffers (eg we may want to import the pool with l2arc_rebuild_enabled = 0 only once because of memory pressure). If persistent L2ARC has been disabled by setting the module parameter l2arc_rebuild_blocks_min_l2size to a value greater than the size of the cache device then the whole device is trimmed upon creation or import of a pool if l2arc_trim_ahead > 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam D. Moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9713 Closes #9789 Closes #10224	2020-06-09 10:15:08 -07:00
felixdoerre	501a1511ae	mount: use the mount syscall directly Allow zfs datasets to be mounted on Linux without relying on the invocation of an external processes. This is the same behavior which is implemented for FreeBSD. Use of the libmount library was originally considered because it provides functionality to properly lock and update the /etc/mtab file. However, these days /etc/mtab is typically a symlink to /proc/self/mounts so there's nothing to updated. Therefore, we call mount(2) directly and avoid any additional dependencies. If required the legacy behavior can be enabled by setting the ZFS_MOUNT_HELPER environment variable. This may be needed in environments where SELinux in enabled and the zfs binary does not have mount permission. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Felix Dörre <felix@dogcraft.de> #10294	2020-05-20 18:02:41 -07:00
Paul Dagnelie	de4f06c275	Small program that converts a dataset id and an object id to a path Small program that converts a dataset id and an object id to a path Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10204	2020-05-20 10:05:33 -07:00
AJ Jordan	5a04b17717	Fix up arcstat(1) to match our version Turns out the illumos manpage, which is what this originates from, was written for the original Perl version of the utility which is not the version in the OpenZFS tree. That version originates from a Python rewrite that was done for FreeNAS. So fix up the manpage to match what we actually ship (and fix a few typos in the process). Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: AJ Jordan <alex@strugee.net> Closes #10288	2020-05-11 16:22:48 -07:00
AJ Jordan	ac806a2559	Import the arcstat(1m) manpage from illumos And move it from section 1m to section 1 for consistency. Imported from illumos commit f34d737f. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: AJ Jordan <alex@strugee.net> Closes #10288	2020-05-11 16:22:18 -07:00
George Amanakis	657fd33bcf	Improvements on persistent L2ARC Functional changes: We implement refcounts of log blocks and their aligned size on the cache device along with two corresponding arcstats. The refcounts are reflected in the header of the device and provide valuable information as to whether log blocks are accounted for correctly. These are dynamically adjusted as log blocks are committed/evicted. zdb also uses this information in the device header and compares it to the corresponding values as reported by dump_l2arc_log_blocks() which emulates l2arc_rebuild(). If the refcounts saved in the device header report higher values, zdb exits with an error. For this feature to work correctly there should be no active writes on the device. This is also employed in the tests of persistent L2ARC. We extend the structure of the cache device header by adding the two new variables mirroring the refcounts after the existing variables to preserve backward compatibility in terms of persistent L2ARC. 1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which reflect the total aligned size of log blocks on the device. This is also reflected in the header of the cache device as "dh_lb_asize". 2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count" which reflect the total number of L2ARC log blocks present on cache devices. It is also reflected in the header of the cache device as "dh_lb_count". In l2arc_rebuild_vdev() if the amount of committed log entries in a log block is 0 and the device header is valid we update the device header. This will facilitate trimming of the whole device in this case when TRIM for L2ARC is implemented. Improve loop protection in l2arc_rebuild() by using the starting offset of the payload of each log block instead of the starting offset of the log block. If the zio in l2arc_write_buffers() fails, restore the lbps array in the header of the device to its previous state in l2arc_write_done(). If l2arc_rebuild() ends the rebuild process without restoring any L2ARC log blocks in ARC and without any other error, this means that the lbps array in the header is pointing to non-existent or invalid log blocks. Reset the device header in this case. In l2arc_rebuild() change the zfs_dbgmsg messages to spa_history_log_internal() making them user visible with zpool history command. Non-functional changes: Make the first test in persistent L2ARC use `zdb -lll` to increase coverage in `zdb.c`. Rename psize with asize when referring to log blocks, since L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also rename dh_log_blk_entries to dh_log_entries to make it clear that it is a mirror of l2ad_log_entries. Added comments for both changes. Fix inaccurate comments for example in l2arc_log_blk_restore(). Add asserts at the end in l2arc_evict() and l2arc_write_buffers(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10228	2020-05-07 16:34:03 -07:00
Paul B. Henson	7bf3e1fa0f	OpenZFS 3254 - add support in zfs for aclmode=restricted Authored-by: Paul B. Henson <henson@acm.org> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/3254 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/71dbfc287c Closes #10266	2020-04-30 11:23:59 -07:00
Paul B. Henson	a1af567bb6	OpenZFS 742 - Resurrect the ZFS "aclmode" property OpenZFS 664 - Umask masking "deny" ACL entries OpenZFS 279 - Bug in the new ACL (post-PSARC/2010/029) semantics Porting notes: * Updated zfs_acl_chmod to take 'boolean_t isdir' as first parameter rather than 'zfsvfs_t zfsvfs' zfs man pages changes mixed between zfs and new zfsprops man pages Reviewed by: Aram Hvrneanu <aram@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Robert Gordon <rbg@openrbg.com> Reviewed by: Mark.Maybee@oracle.com Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@nexenta.com> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/742 OpenZFS-issue: https://www.illumos.org/issues/664 OpenZFS-issue: https://www.illumos.org/issues/279 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c49ce110 Closes #10266	2020-04-30 11:22:45 -07:00
alex	47c9299fcc	zfs_create: round up volume size to multiple of bs Round up the volume size requested in `zfs create -V size` to the next higher multiple of the volblocksize. Updates the man page and adds a test to verify the new behavior. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: puffi <puffi@users.noreply.github.com> Signed-off-by: Alex John <alex@stty.io> Closes #8541 Closes #10196	2020-04-24 19:04:34 -07:00
Matthew Ahrens	196bee4cfd	Remove deduplicated send/receive code Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such streams) are deprecated. Deduplicated send streams can be received by first converting them to non-deduplicated with the `zstream redup` command. This commit removes the code for sending and receiving deduplicated send streams. `zfs send -D` will now print a warning, ignore the `-D` flag, and generate a regular (non-deduplicated) send stream. `zfs receive` of a deduplicated send stream will print an error message and fail. The resulting code simplification (especially in the kernel's support for receiving dedup streams) should help enable future performance enhancements. Several new tests are added which leverage `zstream redup`. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #7887 Issue #10117 Issue #10156 Closes #10212	2020-04-23 10:06:57 -07:00
Matthew Ahrens	c618f87cd2	Add `zstream redup` command to convert deduplicated send streams Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely. The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`. The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.) For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated). For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream can not be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record). This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup". Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10124 Closes #10156	2020-04-10 10:39:55 -07:00
George Amanakis	77f6826b83	Persistent L2ARC This commit makes the L2ARC persistent across reboots. We implement a light-weight persistent L2ARC metadata structure that allows L2ARC contents to be recovered after a reboot. This significantly eases the impact a reboot has on read performance on systems with large caches. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Saso Kiselkov <skiselkov@gmail.com> Co-authored-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: George Amanakis <gamanakis@gmail.com> Ported-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #925 Closes #1823 Closes #2672 Closes #3744 Closes #9582	2020-04-10 10:33:35 -07:00
Paul Dagnelie	5a42ef04fd	Add 'zfs wait' command Add a mechanism to wait for delete queue to drain. When doing redacted send/recv, many workflows involve deleting files that contain sensitive data. Because of the way zfs handles file deletions, snapshots taken quickly after a rm operation can sometimes still contain the file in question, especially if the file is very large. This can result in issues for redacted send/recv users who expect the deleted files to be redacted in the send streams, and not appear in their clones. This change duplicates much of the zpool wait related logic into a zfs wait command, which can be used to wait until the internal deleteq has been drained. Additional wait activities may be added in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9707	2020-04-01 10:02:06 -07:00
Ryan Moeller	9a51738b60	Let default arc_c_max be platform dependent Linux changed the default max ARC size to 1/2 of physical memory to deal with shortcomings of the Linux SLUB allocator. Other platforms do not require the same logic. Implement an arc_default_max() function to determine a default max ARC size in platform code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10155	2020-03-27 09:14:46 -07:00
Matthew Ahrens	652bdc9b0e	Deprecate deduplicated send streams Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it. Dedup send requires a nontrivial expenditure of memory and CPU to operate, especially if the dataset(s) being sent is (are) not already using a dedup-strength checksum. Dedup send adds developer burden. It expands the test matrix when developing new features, causing bugs in released code, and delaying development efforts by forcing more testing to be done. As a result, we are deprecating the use of `zfs send -D` and receiving of such streams. This change adds a warning to the man page, and also prints the warning whenever dedup send or receive are used. In a future release, we plan to: 1. remove the kernel code for generating deduplicated streams 2. make `zfs send -D` generate regular, non-deduplicated streams 3. remove the kernel code for receiving deduplicated streams 4. make `zfs receive` of deduplicated streams process them in userland to "re-duplicate" them, so that they can still be received. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7887 Closes #10117	2020-03-18 13:31:10 -07:00
Mariusz Zaborski	a57d3d45d6	Add option for forcible unmounting dataset while receiving snapshot. Currently when the dataset is in use we can't receive snapshots. zfs send test/1@asd \| zfs recv -FM test/2 cannot unmount '/test/2': Device busy This commits add option 'M' which attempts to forcibly unmount the dataset. Thanks to this we can enforce receiving snapshots in a single step. Note that this functionality is not supported on Linux because the VFS will prevent active mounted filesystems from being unmounted, even with the force option. This is the intended VFS behavior. Test cases were added to verify the expected behavior based on the platform. Discussed-with: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allanjude@freebsd.org> External-issue: https://reviews.freebsd.org/D22306 Closes #9904	2020-03-17 10:08:32 -07:00
Matthew Ahrens	7261fc2e81	Improve zfs receive performance by batching writes For each WRITE record in the stream, `zfs receive` creates a DMU transaction (`dmu_tx_create()`) and writes this block's data into the object. If per-block overheads (as opposed to per-byte overheads) dominate performance (as is often the case with small recordsize), the per-dmu-transaction overheads can be significant. For example, in some workloads the `receieve_writer` thread is 100% on CPU, and more than half of its CPU time is in these per-tx routines (e.g. dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit). To improve performance of `zfs receive`, this commit batches WRITE records which are to nearby offsets of the same object, and uses one DMU transaction to write them all. By default the batch size is 1MB, which for recordsize=8K reduces the number of DMU transactions by 128x for full send streams (incrementals will depend on how "clumpy" the changed blocks are). This commit improves the performance of `dd if=stream \| zfs recv` from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10099	2020-03-16 11:51:56 -07:00
Ryan Moeller	f5f6fb03b7	Change default to overlay=on Filesystems allow overlay mounts by default on FreeBSD and Linux. Respect the native convention by switching the default to overlay=on, while retaining the option to turn the property off for compatibility with other operating systems' conventions. Update documentation and tests accordingly. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10030	2020-03-06 09:28:19 -08:00
Brian Behlendorf	2288d41968	Add trim support to zpool wait Manual trims fall into the category of long-running pool activities which people might want to wait synchronously for. This change adds support to 'zpool wait' for waiting for manual trim operations to complete. It also adds a '-w' flag to 'zpool trim' which can be used to turn 'zpool trim' into a synchronous operation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: John Gallagher <john.gallagher@delphix.com> Closes #10071	2020-03-04 15:07:11 -08:00
Mariusz Zaborski	89adffb7bc	Add notice that forcefully unmount is not supported on Linux The Linux VFS will never allow a filesystem which is in use to be unmounted. This behavior differs from other platforms like FreeBSD which allow a filesystem to be force unmounted. This will result in errors being returned to applications actively using the filesystem. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org> Closes #10013	2020-02-18 13:36:23 -08:00
Richard Laager	f244846462	Prefer org.openzfs for features and properties Moving forward, we wish to use org.openzfs (no dash) rather than org.open-zfs or org.zfsonlinux for feature GUIDs and property names. The existing feature GUIDs cannot be changed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #10003	2020-02-18 09:36:50 -08:00
InsanePrawn	ecbbdac799	Systemd mount generator: Generate noauto units; add control properties This commit refactors the systemd mount generators and makes the following major changes: - The generator now generates units for datasets marked canmount=noauto, too. These units are NOT WantedBy local-fs.target. If there are multiple noauto datasets for a path, no noauto unit will be created. Datasets with canmount=on are prioritized. - Introduces handling of new user properties which are now included in the zfs-list.cache files: - org.openzfs.systemd:requires: List of units to require for this mount unit - org.openzfs.systemd:requires-mounts-for: List of mounts to require by this mount unit - org.openzfs.systemd:before: List of units to order after this mount unit - org.openzfs.systemd:after: List of units to order before this mount unit - org.openzfs.systemd:wanted-by: List of units to add a Wants dependency on this mount unit to - org.openzfs.systemd:required-by: List of units to add a Requires dependency on this mount unit to - org.openzfs.systemd:nofail: Toggles between a wants and a requires dependency. - org.openzfs.systemd:ignore: Do not generate a mount unit for this dataset. Consult the updated man page for detailed documentation. - Restructures and extends the zfs-mount-generator(8) man page with the above properties, information on unit ordering and a license header. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: InsanePrawn <insane.prawny@gmail.com> Closes #9649	2020-02-14 15:32:55 -08:00
Jason King	13b5a4d5c0	Support setting user properties in a channel program This adds support for setting user properties in a zfs channel program by adding 'zfs.sync.set_prop' and 'zfs.check.set_prop' to the ZFS LUA API. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Co-authored-by: Sara Hartse <sara.hartse@delphix.com> Contributions-by: Jason King <jason.king@joyent.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Jason King <jason.king@joyent.com> Closes #9950	2020-02-14 13:41:42 -08:00
Matthew Ahrens	4fe3a842bb	Remove limit on number of async zio_frees of non-dedup blocks The module parameter zfs_async_block_max_blocks limits the number of blocks that can be freed by the background freeing of filesystems and snapshots (from "zfs destroy"), in one TXG. This is useful when freeing dedup blocks, becuase each zio_free() of a dedup block can require an i/o to read the relevant part of the dedup table (DDT), and will also dirty that block. zfs_async_block_max_blocks is set to 100,000 by default. For the more typical case where dedup is not used, this can have a negative performance impact on the rate of background freeing (from "zfs destroy"). For example, with recordsize=8k, and TXG's syncing once every 5 seconds, we can free only 160MB of data per second, which may be much less than the rate we can write data. This change increases zfs_async_block_max_blocks to be unlimited by default. To address the dedup freeing issue, a new tunable is introduced, zfs_max_async_dedup_frees, which limits the number of zio_free()'s of dedup blocks done by background destroys, per txg. The default is 100,000 free's (same as the old zfs_async_block_max_blocks default). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10000	2020-02-14 08:39:46 -08:00
Christian Schwarz	948f0c4419	zcp: add zfs.sync.bookmark Add support for bookmark creation and cloning. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #9571	2020-02-11 13:19:17 -08:00
Christian Schwarz	a73f361fdb	Implement bookmark copying This feature allows copying existing bookmarks using zfs bookmark fs#target fs#newbookmark There are some niche use cases for such functionality, e.g. when using bookmarks as markers for replication progress. Copying redaction bookmarks produces a normal bookmark that cannot be used for redacted send (we are not duplicating the redaction object). ZCP support for bookmarking (both creation and copying) will be implemented in a separate patch based on this work. Overview: - Terminology: - source = existing snapshot or bookmark - new/bmark = new bookmark - Implement bookmark copying in `dsl_bookmark.c` - create new bookmark node - copy source's `zbn_phys` to new's `zbn_phys` - zero-out redaction object id in copy - Extend existing bookmark ioctl nvlist schema to accept bookmarks as sources - => `dsl_bookmark_create_nvl_validate` is authoritative - use `dsl_dataset_is_before` check for both snapshot and bookmark sources - Adjust CLI - refactor shortname expansion logic in `zfs_do_bookmark` - Update man pages - warn about redaction bookmark handling - Add test cases - CLI - pyyzfs libzfs_core bindings Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #9571	2020-02-11 13:19:12 -08:00
Paul Zuchowski	bc67cba7c0	Fix zdb -R with 'b' flag zdb -R :b fails due to the indirect block being compressed, and the 'b' and 'd' flag not working in tandem when specified. Fix the flag parsing code and create a zfs test for zdb -R block display. Also fix the zio flags where the dotted notation for the vdev portion of DVA (i.e. 0.0:offset:length) fails. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9640 Closes #9729	2020-02-10 14:00:05 -08:00
Attila Fülöp	31b160f0a6	ICP: Improve AES-GCM performance Currently SIMD accelerated AES-GCM performance is limited by two factors: a. The need to disable preemption and interrupts and save the FPU state before using it and to do the reverse when done. Due to the way the code is organized (see (b) below) we have to pay this price twice for each 16 byte GCM block processed. b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations. To solve (a) we do crypto processing in larger chunks while owning the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to 32x reduction of the time spend in the en/decryption routines, leading up to approximately 12x throughput increase for large (128 KiB) blocks. Lastly, this commit changes the default encryption algorithm from AES-CCM to AES-GCM when setting the `encryption=on` property. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Jason King <jason.king@joyent.com> Reviewed-By: Tom Caputi <tcaputi@datto.com> Reviewed-By: Richard Laager <rlaager@wiktel.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #9749	2020-02-10 12:59:50 -08:00
Ryan Moeller	8c4987c489	Restore aclmode and remove acltype on FreeBSD This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE, matching what is done in the NFSv4 ACLs PR (#9709). On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide ZFS_PROP_ACLMODE. The tests already assume this arrangement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #9913	2020-02-04 08:40:07 -08:00
Ned Bass	a3403164d7	zdb: add support for object ranges for zdb -d Allow a range of object identifiers to dump with -d. This may be useful when dumping a large dataset and you want to break it up into multiple phases, or to resume where a previous scan left off. Object type selection flags are supported to reduce the performance overhead of verbosely dumping unwanted objects, and to reduce the amount of post-processing work needed to filter out unwanted objects from zdb output. This change extends existing syntax in a backward-compatible way. That is, the base case of a range is to specify a single object identifier to dump. Ranges and object identifiers can be intermixed as command line parameters. Usage synopsis: Object ranges take the form <start>:<end>[:<flags>] start Starting object number end Ending object number, or -1 for no upper bound flags Optional flags to select object types: A All objects (this is the default) d ZFS directories f ZFS files m SPA space maps z ZAPs - Negate effect of next flag Examples: # Dump all file objects zdb -dd tank/fish 0👎f # Dump all file and directory objects zdb -dd tank/fish 0👎fd # Dump all types except file and directory objects zdb -dd tank/fish 0👎A-f-d # Dump object IDs in a specific range zdb -dd tank/fish 1000:2000 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #9832	2020-01-24 11:00:46 -08:00
Romain Dolbeau	35b07497c6	Add AltiVec RAID-Z Implements the RAID-Z function using AltiVec SIMD. This is basically the NEON code translated to AltiVec. Note that the 'fletcher' algorithm requires 64-bits operations, and the initial implementations of AltiVec (PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to 32-bits operations, so no 'fletcher'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu> Closes #9539	2020-01-23 11:01:24 -08:00
Jason King	e2ef1cbf04	Support inheriting properties in channel programs This adds support in channel programs to inherit properties analogous to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit` functions to the ZFS LUA API. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason King <jason.king@joyent.com> Closes #9738	2020-01-22 17:03:17 -08:00
Paul Zuchowski	f12e42cccf	zdb -d should accept the numeric objset id As an alternative to the dataset name, zdb now allows the decimal or hexadecimal objset ID to be specified. When permanent errors are reported as 2 hexadecimal numbers (objset ID : object ID) in zpool status; you can now use 'zdb <pool>[/objset ID] object' to determine the names of the objset and object which have the error. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9733	2020-01-16 09:22:49 -08:00
Richard Laager	f744f36ce5	Document zfs change-key caveats As discussed on the 2019-01-07 OpenZFS Leadership Meeting, we need to be clear about the limitations of `zfs change-key`. Changing the user key does not change the master key, nor does it currently overwrite the old wrapped master key on disk. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Garrett Fields <ghfields@gmail.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #9819	2020-01-14 10:11:07 -08:00
Brian Behlendorf	e458fcca75	Change http://zfsonlinux.org links to https://zfsonlinux.org Update the project website links contained in to repository to reference the secure https://zfsonlinux.org address. Reviewed-By: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Garrett Fields <ghfields@gmail.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9837	2020-01-13 16:43:59 -08:00
Tom Caputi	ba0ba69e50	Add 'zfs send --saved' flag This commit adds the --saved (-S) to the 'zfs send' command. This flag allows a user to send a partially received dataset, which can be useful when migrating a backup server to new hardware. This flag is compatible with resumable receives, so even if the saved send is interrupted, it can be resumed. The flag does not require any user / kernel ABI changes or any new feature flags in the send stream format. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Reviewed-by: Christian Schwarz <me@cschwarz.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9007	2020-01-10 10:16:58 -08:00
DeHackEd	67709516db	zfs-module-parameters(5): add all missing items I ran a report against the output of `modinfo zfs.ko`. This commit adds everything missing and corrects a few renamed module parameters. Specifically: * zfs_checksums_per second renamed in `ad796b8a3` * vdev_ms_count_limit renamed in `c853f382d` Also fixes some variable type inconsistencies (unsigned int => uint) Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #9809	2020-01-07 12:41:53 -08:00
Brian Behlendorf	33dc49e0c0	Fix ZPOOL_VDEV_NAME_PATH option description The corresponding zpool status option is -P and not -p. Update this description to reference the correct option. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9803	2020-01-06 10:43:32 -08:00
Tony Hutter	9fb2771aa5	Colorize zpool status output If the ZFS_COLOR env variable is set, then use ANSI color output in zpool status: - Column headers are bold - Degraded or offline pools/vdevs are yellow - Non-zero error counters and faulted vdevs/pools are red - The 'status:' and 'action:' sections are yellow if they're displaying a warning. This also includes a new 'faketty' function in libtest.shlib that is compatible with FreeBSD (code provided by @freqlabs). Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #9340	2019-12-19 16:26:07 -08:00
Matthew Macy	4bc721965f	Add FreeBSD jail support hooks Add the 'zfs jail/unjail' subcommands along with the relevant documentation from FreeBSD. This feature is not supported on Linux and still requires the match kernel ioctls which will be included when the FreeBSD platform code is integrated. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9686	2019-12-11 11:58:37 -08:00
Paul Zuchowski	f0bf435176	zio_decompress_data always ASSERTs successful decompression This interferes with zdb_read_block trying all the decompression algorithms when the 'd' flag is specified, as some are expected to fail. Also control the output when guessing algorithms, try the more common compression types first, allow specifying lsize/psize, and fix an uninitialized variable. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9612 Closes #9630	2019-12-10 15:51:58 -08:00
Matthew Macy	f95704ca5e	Disable EDONR on FreeBSD FreeBSD uses its own crypto framework in-kernel which, at this time, has no EDONR implementation. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9664	2019-12-05 13:10:29 -08:00
Brian Behlendorf	624222ae31	Increase allowed 'special_small_blocks' maximum value There may be circumstances where it's desirable that all blocks in a specified dataset be stored on the special device. Relax the artificial 128K limit and allow the special_small_blocks property to be set up to 1M. When blocks >1MB have been enabled via the zfs_max_recordsize module option, this limit is increased accordingly. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9131 Closes #9355	2019-12-03 09:58:03 -08:00
Brian Behlendorf	9e17e6f254	Remove zfs_vdev_elevator module option As described in commit `f81d5ef6` the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9417 Closes #9609	2019-11-27 10:35:49 -08:00
Paul Zuchowski	894f6696b4	Add display of checksums to zdb -R The function zdb_read_block (zdb -R) was always intended to have a :c flag which would read the DVA and length supplied by the user, and display the checksum. Since we don't know which checksum goes with the data, we should calculate and display them all. For each checksum in the table, read in the data at the supplied DVA:length, calculate the checksum, and display it. Update the man page and create a zfs test for the new feature. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9607	2019-11-27 10:08:18 -08:00
Kjeld Schouten	64c77c4bad	Change zed.service to zfs-zed.service in man page zed.service does not exist replaced with correct service name in man. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Closes #9581	2019-11-13 10:23:23 -08:00
Ross Williams	c5ebfbbe19	Reorganize zpool(8) man page into sections Moved subcommand topics into individual manpages. Reordered and grouped the list of subcommands by topic. Moved concepts overview to `zpoolconcepts.8` and the long list of available pool properties to `zpoolprops.8`. Internal cross-references copied from `zpool.8` needed to be converted to `.Xr` external references to new subcommand manual pages. Move `autotrim` into lexical order, autotrim tacked onto the end of a list. Now it is in alphabetical order. Clarify attach/detach description. Description was too specific to command syntax. Overview clarifies reason for attaching or detaching a device. Clarify replace description, don't refer to subcommand arguments. Clarify split command description, say what split actually does and why you'd want to do it. Clarify description of upgrade, and simplify the zpool.8 wording of the zpool-upgrade(8) description. Clarify description of import, detail what zpool-import(8) actually does. Add appropriate SEE ALSO sections. Divided zpool subcommand manual pages need their own SEE ALSO sections. Also modified fsck.zfs.8 to point directly to zfs-scrub.8 and zed.8.in to include a direct reference to zfs-events.8 Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ross Williams <ross@ross-williams.net> Closes #9564	2019-11-13 09:21:07 -08:00
Ross Williams	870fc32fc9	Reorganize zfs(8) man page into sections Most subcommands got their own manpages (e.g. create). Some related commands grouped into a single manpage and symlinks created (e.g. set, get, and inherit). I did this when topics were either too short to warrant their own file or so interrelated that a user would want to refer between commands in the same file. Corrected .Sx internal references to .Xr cross refs; lots of .Sx references from when text was all in zfs.8 needed to be changed to .Xr zfs-$SUBCOMMAND 8 cross references. Divided subcommand list in zfs(8) into sections of related functionality. This required writing new descriptions for some commands. Preserved ".Os Linux", `.Os` macro parsing behavior differs between mandoc from the "BSD" mandoc package (available on Ubuntu) and man from Ubuntu's man-db package, which calls groff to format the manpages. Groff handles the `.Os` macro differently and wrongly, defaulting it to "BSD" in `/usr/share/groff//tmac/mdoc/doc-common`, instead of getting the default from `uname`. A future set of changes will introduce build-time preprocessing of manpages for platform-specific documentation and can insert the correct operating system name. Added SEE ALSO sections, the newly-divided zfs-.8 subcommand man pages needed their own SEE ALSO sections pointing to related subcommands and, in some cases, documentation from other packages (e.g. zfs-share.8). Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ross Williams <ross@ross-williams.net> Closes #9559	2019-11-12 11:17:40 -08:00
Romain Dolbeau	0b2a642351	Add AVX512BW variant of fletcher It is much faster than AVX512F when byteswapping on Skylake-SP and newer, as we can do the byteswap in a single vshufb instead of many instructions. Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #9517	2019-10-30 12:26:14 -07:00
Brian Behlendorf	d292307212	Modify sharenfs=on default behavior While it may sometimes be convenient to export an NFS filesystem with no_root_squash it should not be the default behavior. Align the default behavior with the Linux NFS server defaults. To restore the previous behavior use 'zfs set sharenfs="no_root_squash,..."'. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9397 Closes #9425	2019-10-13 19:13:26 -07:00
Richard Yao	803884217f	Implement ZPOOL_IMPORT_UDEV_TIMEOUT_MS Since 0.7.0, zpool import would unconditionally block on udev for 30 seconds. This introduced a regression in initramfs environments that lack udev (particularly mdev based environments), yet use a zfs userland tools intended for the system that had been built against udev. Gentoo's genkernel is the main example, although custom user initramfs environments would be similarly impacted unless special builds of the ZFS userland utilities were done for them. Such environments already have their own mechanisms for blocking until device nodes are ready (such as genkernel's scandelay parameter), so it is unnecessary for zpool import to block on a non-existent udev until a timeout is reached inside of them. Rather than trying to intelligently determine whether udev is available on the system to avoid unnecessarily blocking in such environments, it seems best to just allow the environment to override the timeout. I propose that we add an environment variable called ZPOOL_IMPORT_UDEV_TIMEOUT_MS. Setting it to 0 would restore the 0.6.x behavior that was more desirable in mdev based initramfs environments. This allows the system user land utilities to be reused when building mdev-based initramfs archives. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Georgy Yakovlev <gyakovlev@gentoo.org> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #9436	2019-10-11 09:52:48 -07:00
Brian Behlendorf	f3dc4a85e9	Update `zfs program` command usage Update the zfs(8) man page to clearly describe that arguments for channel programs are to be listed after the -- sentinel which terminates argument processing. This behavior is supported by getopt on Linux, FreeBSD, and Illumos according to each platforms respective man pages. zfs program [-jn] [-t instruction-limit] [-m memory-limit] pool script [--] arg1 ... Reviewed-by: Clint Armstrong <clint@clintarmstrong.net> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9056 Closes #9428	2019-10-10 09:49:45 -07:00
Brian Behlendorf	f81d5ef686	Add warning for zfs_vdev_elevator option removal Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibly code to set the elevator via a usermodehelper. While well intentioned this introduced a bug which could cause a system hang, that issue was subsequently fixed by commit `2a0d4188`. In order to avoid future bugs in this area, and to simplify the code, this functionality is being deprecated. A console warning has been added to notify any existing consumers and the documentation updated accordingly. This option will remain for the lifetime of the 0.8.x series for compatibility but if planned to be phased out of master. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9317	2019-09-25 09:23:29 -07:00
John Gallagher	e60e158eff	Add subcommand to wait for background zfs activity to complete Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Gallagher <john.gallagher@delphix.com> Closes #9162	2019-09-13 18:09:06 -07:00
Andrea Gelmini	ac3d4d0cf6	Fix typos in man/ Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #9233	2019-08-30 09:41:35 -07:00
Paul Dagnelie	eef0f4d84e	Keep more metaslabs loaded With the other metaslab changes loaded onto a system, we can significantly reduce the memory usage of each loaded metaslab and unload them on demand if there is memory pressure. However, none of those changes actually result in us keeping more metaslabs loaded. If we don't keep more metaslabs loaded, we will still have to wait for demand-loading to finish when no loaded metaslab can satisfy our allocation, which can cause ZIL performance issues. In addition, performance is traditionally measured by IOs per unit time, while unloading is currently done on a txg-count basis. Txgs can take a widely varying range of times, from tenths of a second to several seconds. This can result in confusing, hard to predict behavior. This change simply adds a time-based component to metaslab unloading. A metaslab will remain loaded for one minute and 8 txgs (by default) after it was last used, unless it is evicted due to memory pressure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-65016 External-issue: DLPX-65047 Closes #9197	2019-08-29 10:20:36 -07:00
Paul Dagnelie	f09fda5071	Cap metaslab memory usage On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations. First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9128	2019-08-16 09:08:21 -06:00
George Wilson	c8242a96ba	spa_load_verify() may consume too much memory When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue too many I/Os that it consumes all the memory of the system resulting in an OOM message or a hang. To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: George Wilson george.wilson@delphix.com External-issue: DLPX-65237 External-issue: DLPX-65238 Closes #9146	2019-08-13 08:11:57 -06:00
Serapheim Dimitropoulos	3b9edd7b17	Introduce getting holds and listing bookmarks through ZCP Consumers of ZFS Channel Programs can now list bookmarks, and get holds from datasets. A minor-refactoring was also applied to distinguish between user and system properties in ZCP. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Ported-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Dan Kimmel <dan.kimmel@delphix.com> OpenZFS-issue: https://illumos.org/issues/8862 Closes #7902	2019-08-12 10:02:34 -07:00
Paul Dagnelie	c81f1790e2	Metaslab max_size should be persisted while unloaded When we unload metaslabs today in ZFS, the cached max_size value is discarded. We instead use the histogram to determine whether or not we think we can satisfy an allocation from the metaslab. This can result in situations where, if we're doing I/Os of a size not aligned to a histogram bucket, a metaslab is loaded even though it cannot satisfy the allocation we think it can. For example, a metaslab with 16 entries in the 16k-32k bucket may have entirely 16kB entries. If we try to allocate a 24kB buffer, we will load that metaslab because we think it should be able to handle the allocation. Doing so is expensive in CPU time, disk reads, and average IO latency. This is exacerbated if the write being attempted is a sync write. This change makes ZFS cache the max_size after the metaslab is unloaded. If we ever get a free (or a coalesced group of frees) larger than the max_size, we will update it. Otherwise, we leave it as is. When attempting to allocate, we use the max_size as a lower bound, and respect it unless we are in try_hard. However, we do age the max_size out at some point, since we expect the actual max_size to increase as we do more frees. A more sophisticated algorithm here might be helpful, but this works reasonably well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9055	2019-08-05 14:34:27 -07:00
Serapheim Dimitropoulos	cae97c88c4	List log_spacemap feature in zpool-features.5 manual Update zpool-features.5 manpage to describe the log_spacemap feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #9096	2019-07-31 09:29:01 -07:00
Sara Hartse	37f03da8ba	Fast Clone Deletion Deleting a clone requires finding blocks are clone-only, not shared with the snapshot. This was done by traversing the entire block tree which results in a large performance penalty for sparsely written clones. This is new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it’s time to delete, the clone-specific blocks are already at hand. We see performance improvements because now deletion work is proportional to the number of clone-modified blocks, not the size of the original dataset. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Closes #8416	2019-07-26 10:54:14 -07:00
Pavel Zakharov	26b6047469	New service that waits on zvol links to be created The zfs-volume-wait.service scans existing zvols and waits for their links under /dev to be created. Any service that depends on zvol links to be there should add a dependency on zfs-volumes.target. By default, this target is not enabled. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Zakharov <pzakharov@delphix.com> Closes #8975	2019-07-17 15:33:05 -07:00
Mike Gerdts	d45d7f08fa	Add zfs create dryrun Adds the ability to sanity check zfs create arguments and to see the value of any additional properties that will local to the dataset. For example, automation that may need to adjust quota on a parent filesystem before creating a volume may call `zfs create -nP -V <size> <volume>` to obtain the value of refreservation. This adds the following options to zfs create: - -n dry-run (no-op) - -v verbose - -P parseable (implies verbose) Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Jerry Jelinek <jerry.jelinek@joyent.com> Signed-off-by: Mike Gerdts <mike.gerdts@joyent.com> Closes #8974	2019-07-16 11:19:24 -07:00
Serapheim Dimitropoulos	93e28d661e	Log Spacemap Project = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent `db587941c5` Update vdev_is_spacemap_addressable() for new spacemap encoding `419ba59145` Simplify spa_sync by breaking it up to smaller functions `8dc2197b7b` Factor metaslab_load_wait() in metaslab_load() `b194fab0fb` Rename range_tree_verify to range_tree_verify_not_present `df72b8bebe` Change target size of metaslabs from 256GB to 16GB `c853f382db` zdb -L should skip leak detection altogether `21e7cf5da8` vs_alloc can underflow in L2ARC vdevs `7558997d2f` Simplify log vdev removal code `6c926f426a` Get rid of space_map_update() for ms_synced_length `425d3237ee` Introduce auxiliary metaslab histograms `928e8ad47d` Error path in metaslab_load_impl() forgets to drop ms_sync_lock `8eef997679` = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8442	2019-07-16 10:11:49 -07:00
Antonio Russo	f88d069cbb	systemd encryption key support Modify zfs-mount-generator to produce a dependency on new zfs-import-key-*.service units, dynamically created at boot to call zfs load-key for the encryption root, before attempting to mount any encrypted datasets. These units are created by zfs-mount-generator, and RequiresMountsFor on the keyfile, if present, or call systemd-ask-password if a passphrase is requested. This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and @rlaager, as well an adaptation of @rlaager's script to retry on incorrect password entry. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #8750 Closes #8848	2019-07-15 16:31:47 -07:00
loli10K	3b5fe2c351	Fix zfs "redact" misc issues * zfs redact error messages do not end with newline character * `30af21b0` inadvertently removed some ZFS_PROP comments * man/zfs: zfs redact <redaction_snapshot> is not optional Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8988	2019-07-05 16:38:17 -07:00
Matthew Ahrens	aa79910726	Fix typo in zpool-features.5, section bookmark_written The full property name includes "delphix", not "delphxi". Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8985	2019-07-03 12:57:05 -07:00
loli10K	ac6d54160a	Fix zfs(8) mandoc errors This was accidentally introduced in 765d1f06: mandoc: ./man/man8/zfs.8: ERROR: skipping item outside list: It Ar filesystem Ns \| Ns Ar mountpoint mandoc: ./man/man8/zfs.8: ERROR: skipping item outside list: It Xo mandoc: ./man/man8/zfs.8: ERROR: skipping end of block that is not open: Xc mandoc: ./man/man8/zfs.8: ERROR: skipping item outside list: It Xo mandoc: ./man/man8/zfs.8: ERROR: skipping end of block that is not open: Xc Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8980	2019-07-02 08:19:55 -07:00
Tom Caputi	765d1f0644	Add 'zfs umount -u' for encrypted datasets This patch adds the ability for the user to unload keys for datasets as they are being unmounted. This is analogous to 'zfs mount -l'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes: #8917 Closes: #8952	2019-06-28 12:38:37 -07:00
Matthew Ahrens	050d720c43	Remove dedupditto functionality If dedup is in use, the `dedupditto` property can be set, causing ZFS to keep an extra copy of data that is referenced many times (>100x). The idea was that this data is more important than other data and thus we want to be really sure that it is not lost if the disk experiences a small amount of random corruption. ZFS (and system administrators) rely on the pool-level redundancy to protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin doesn't have control over what data will be offered extra redundancy by dedupditto, this extra redundancy is not very useful. The bulk of the data is still vulnerable to loss based on the pool-level redundancy. For example, if particle strikes corrupt 0.1% of blocks, you will either be saved by mirror/raidz, or you will be sad. This is true even if dedupditto saved another 0.01% of blocks from being corrupted. Therefore, the dedupditto functionality is rarely enabled (i.e. the property is rarely set), and it fulfills its promise of increased redundancy even more rarely. Additionally, this feature does not work as advertised (on existing releases), because scrub/resilver did not repair the extra (dedupditto) copy (see https://github.com/zfsonlinux/zfs/pull/8270). In summary, this seldom-used feature doesn't work, and even if it did it wouldn't provide useful data protection. It has a non-trivial maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270). We should remove the dedupditto functionality. For backwards compatibility with the existing CLI, "zpool set dedupditto" will still "succeed" (exit code zero), but won't have any effect. For backwards compatibility with existing pools that had dedupditto enabled at some point, the code will still be able to understand dedupditto blocks and free them when appropriate. However, ZFS won't write any new dedupditto blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Issue #8270 Closes #8310	2019-06-19 14:54:02 -07:00
Paul Dagnelie	30af21b025	Implement Redacted Send/Receive Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7958	2019-06-19 09:48:12 -07:00
Matthew Ahrens	53dce5acc6	panic in removal_remap test on 4K devices If the zfs_remove_max_segment tunable is changed to be not a multiple of the sector size, then the device removal code will malfunction and try to create mappings that are smaller than one sector, leading to a panic. On debug bits this assertion will fail in spa_vdev_copy_segment(): ASSERT3U(DVA_GET_ASIZE(&dst), ==, size); On nondebug, the system panics with a stack like: metaslab_free_concrete() metaslab_free_impl() metaslab_free_impl_cb() vdev_indirect_remap() free_from_removing_vdev() metaslab_free_impl() metaslab_free_dva() metaslab_free() Fortunately, the default for zfs_remove_max_segment is 1MB, so this can't occur by default. We hit it during this test because removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on 4KB-sector disks, we hit the bug. This change makes the zfs_remove_max_segment tunable more robust, automatically rounding it up to a multiple of the sector size. We also turn some key assertions into VERIFY's so that similar bugs would be caught before they are encoded on disk (and thus avoid a panic-reboot-loop). Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-61342 Closes #8893	2019-06-13 13:12:39 -07:00
Matthew Ahrens	be89734a29	compress metadata in later sync passes Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable compression (including of metadata). Ostensibly this helps the sync passes to converge (i.e. for a sync pass to not need to allocate anything because it is 100% overwrites). However, in practice it increases the average number of sync passes, because when we turn compression off, a lot of block's size will change and thus we have to re-allocate (not overwrite) them. It also increases the number of 128KB allocations (e.g. for indirect blocks and spacemaps) because these will not be compressed. The 128K allocations are especially detrimental to performance on highly fragmented systems, which may have very few free segments of this size, and may need to load new metaslabs to satisfy 128K allocations. We should increase zfs_sync_pass_dont_compress. In practice on a highly fragmented system we see a few 5-pass txg's, a tiny number of 6-pass txg's, and no txg's with more than 6 passes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-63431 Closes #8892	2019-06-13 13:10:18 -07:00
Matthew Ahrens	d3230d761a	looping in metaslab_block_picker impacts performance on fragmented pools On fragmented pools with high-performance storage, the looping in metaslab_block_picker() can become the performance-limiting bottleneck. When looking for a larger block (e.g. a 128K block for the ZIL), we may search through many free segments (up to hundreds of thousands) to find one that is large enough to satisfy the allocation. This can take a long time (up to dozens of ms), and is done while holding the ms_lock, which other threads may spin waiting for. When this performance problem is encountered, profiling will show high CPU time in metaslab_block_picker, as well as in mutex_enter from various callers. The problem is very evident on a test system with a sync write workload with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage, 84% full and 88% fragmented. It has also been observed on production systems with 90TB of storage, 76% full and 87% fragmented. The fix is to change metaslab_df_alloc() to search only up to 16MB from the previous allocation (of this alignment). After that, we will pick a segment that is of the exact size requested (or larger). This reduces the number of iterations to a few hundred on fragmented pools (a ~100x improvement). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-62324 Closes #8877	2019-06-13 13:06:15 -07:00
Matthew Ahrens	d9b4bf0665	fat zap should prefetch when iterating When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins. For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-58347 Closes #8862	2019-06-12 13:13:09 -07:00
Matthew Ahrens	b8738257c2	make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865	2019-06-10 11:48:42 -07:00
Serapheim Dimitropoulos	cb020f0d86	Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically in a hypothetical pool with two vdevs, what can happen is the following: Vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth utilizing one vdev at a time with IOPS taking a hit. Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases but we don't remove (at least for now) just in case we hit any weird behavior in the future. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8859	2019-06-06 13:08:41 -07:00
Peter Wirdemo	2ecc2020ef	Fixed a small typo in man/man1/raidz_test.1 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Peter Wirdemo <peter.wirdemo@gmail.com> Closes #8855	2019-06-05 09:09:17 -07:00
Richard Laager	e7ce9759ac	Correct man page dates Various changes (many by me) have been made to the man pages without bumping their dates. I have now corrected them based on the last commit to each file. I also added the script I used to make these changes. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8710	2019-05-08 10:59:32 -07:00
DeHackEd	1f02ecc5a5	Make zfs_special_class_metadata_reserve_pct into a parameter Exported and documented a new module parameter. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #8706	2019-05-07 15:34:42 -07:00
Brian Behlendorf	caf9dd209f	Fix send/recv lost spill block When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes in to account the DRR_OBJECT record and the existing object properties. This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept. What's needed to resolve this is an additional flag in the DRR_OBJECT which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set then the object references a spill block and it must be kept. Either it is update to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed. Since previous versions of ZFS do not understand this new flag additional DRR_SPILL records will be inserted in to the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks which can be identified by checking for the DRR_SPILL_UNMODIFIED flag. The small downside to this approach is that is may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS spill blocks may still be lost. OpenZFS-issue: https://www.illumos.org/issues/9952 FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277 Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8668	2019-05-07 15:18:44 -07:00
Richard Laager	c6eaa14620	Cleanup special/dedup language This standardizes the language on "deduplication tables" rather than "dedup data" (which might be read as the data blocks rather than the DDT). Likewise, it standardizes on "small file blocks". It also standardizes on "normal" rather than using both "normal" and "general" in the same paragraph. I also replaced "non-specified" with the more explicit "non-dedup/special". Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8713	2019-05-07 09:51:09 -07:00
Richard Laager	bca06413f7	OpenZFS 10473 - zfs(1M) missing cross-reference to zfs-program(1M) Authored by: Jason King <jason.king@joyent.com> Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Andy Fiddaman <andy@omniosce.org> Reviewed by: Peter Tribble <peter.tribble@gmail.com> Reviewed by: Gergő Mihály Doma <domag02@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Richard Laager <rlaager@wiktel.com> OpenZFS-issue: https://www.illumos.org/issues/10473 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/736e67003 Closes #8711	2019-05-06 10:37:18 -07:00
Tom Caputi	fa24166074	Add feature check for 'zpool resilver' command The 'zpool resilver' command requires that the resilver_defer feature is active on the pool. Unfortunately, the check for this was left out of the original patch. This commit simply corrects this so that the command properly returns an error in this case. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8700	2019-05-02 16:42:31 -07:00
Tom Caputi	85bdc68401	Fix estimated scrub completion time Currently, it is possible for the 'zpool scrub' command to progress slightly beyond 100% due to concurrent changes happening on the live pool. This behavior is expected, but the userspace code for 'zpool status' would subtract the expected amount of data from the amount of data already scrubbed, resulting in a negative integer being casted to a large positive one. This number was then used to calculate the estimated completion time, resulting in wildly wrong results. This code changes the behavior so that 'zpool status' does not attempt to report an estimate during this period. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8611 Closes #8687	2019-05-01 17:34:24 -07:00
Richard Laager	77449a1ab0	Clarify that deduped data is encrypted Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8691	2019-04-30 13:53:54 -07:00
Richard Laager	2b127afb44	Clarify and improve encryption documentation - Remove the language that "all user data" is encrypted. This is to avoid misunderstandings or arguments about what is "user data", especially in light of "user properties". - Document that properties are unencrypted. - Document that snapshot names are unencrypted. - For consistency with the rest of the zfs.8 man page, use "ZFS" as the generic noun, not (bolded) "zfs". The latter refers to the command. Likewise, use "ZFS" instead of "the kernel module". - Give "a passphrase" as an example of a "user's key". Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8652	2019-04-24 17:14:24 -07:00
Richard Laager	6e81f9b21b	Duplicate the encryption copies=3 limitation This adds the encryption copies=3 limitation language into the copies property section. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8651	2019-04-24 17:12:14 -07:00
Richard Laager	7b337fda40	Deprecate dedupditto This documents, in zpool.8, that dedupditto is deprecated and will be made to have no effect in a future release. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8650	2019-04-24 17:10:47 -07:00
Richard Laager	0531c83c18	Eliminate useless double-bolding in man pages As far as I know and can tell from testing, \fB\fB...\fR\fR is exactly equivalent to \fB...\fR. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:04:35 -07:00
Richard Laager	a5a6d82dda	Alphabetize zpool-features.5 by short name The features are sorted in the en_US locale, not the C locale. Specifically, that means that bookmark_v2 comes _after_ bookmarks. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:04:28 -07:00
Richard Laager	6f07780147	Standardize .RE placement in zpool-features.5 This command is being used to unindent, so it should be at the end of each block. This is consistent with the other man pages. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:04:20 -07:00
Richard Laager	1644708fdc	Add missing formatting to sha512 in zpool-features.5 Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:04:13 -07:00
Richard Laager	e1822912a9	Correct GUID of large_blocks in zpool-features.5 It is org.open-zfs:large_blocks (plural). Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:04:06 -07:00
Tom Caputi	d0c3aa9cdd	Change wording in zfs-module-parameters.5 Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:03:59 -07:00
Richard Laager	063edd7fd8	Document that hole_birth is effectively useless The first sentence of this commit comes from the wiki, and was originally written by: Rich Ercolani <rincebrain@gmail.com> with changes by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641 Closes #8642	2019-04-24 17:03:27 -07:00
Richard Laager	6ce7b2d9ad	Document send_holes_without_birth_time Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:03:21 -07:00
Richard Laager	e1fbe77110	Correct bookmark_v2 dependencies encryption depends on bookmark_v2. bookmark_v2 depends on bookmarks. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:03:12 -07:00
Richard Laager	4d256e897a	Fix formatting for multi_vdev_crash_dump in zpool-features.5 This needs to use tabs instead of spaces to display correctly (i.e. with things lined up). Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8641	2019-04-24 17:02:45 -07:00
Tomohiro Kusumi	2de17298de	Fix incorrect use of .Nm directive for ZPOOL_VDEV_NAME_GUID in zpool(8) It should only affect "zpool". Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8644	2019-04-23 10:23:00 -07:00
cfzhu	5090f72743	Code improvement and bug fixes for QAT support 1. Support QAT when ZFS is root file-system: When ZFS module is loaded before QAT started, the QAT can be started again in post-process, e.g.: echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable 2. Verify alder checksum of the de-compress result 3. Allocate Digest, IV and AAD buffer in physical contiguous memory by QAT_PHYS_CONTIG_ALLOC. 4. Update the documentation for zfs_qat_compress_disable, zfs_qat_checksum_disable, zfs_qat_encrypt_disable. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Closes #8323 Closes #8610	2019-04-16 12:38:36 -07:00
TerraTech	50478c6dad	Add option [-V\|--version] to emit version string Add the 'zfs version' and 'zpool version' subcommands to display the version of the user space utilities and loaded zfs kernel module. For example: $ zfs version zfs-0.8.0-rc3_169_g67e0366b88 zfs-kmod-0.8.0-rc3_169_g67e0366b88 The '-V' and '--version' aliases were added to support the common convention of using 'zfs --version` to obtain the version information. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: TerraTech <1118433+TerraTech@users.noreply.github.com> Closes #2501 Closes #8567	2019-04-16 12:24:06 -07:00
Richard Laager	8750edf1f7	zfs allow refreservation needed for zfs create -V When creating a non-sparse volume, zfs create sets a refreservation. Accordingly, one needs the "refreservation" ability in addition to the "create" ability in order to create a non-sparse volume. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: github.com/homerlinux Reported-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8531 Closes #8624	2019-04-16 10:12:08 -07:00
Richard Laager	6c0f78f8a3	Clarify GRUB's lack of support for sha512, skein, edonr zfs.8 correctly said that GRUB did not support them, but zpool-features.5 said that "Booting off pools...is supported." Now, zpool-features.5 discusses GRUB specifically and indicates its lack of support for these features. Also, I have clarified the wording in both places to indicate that the pool feature cannot be used. It's not a filesystem dataset thing, but pool-wide. I described this as "cannot be used". I think technically the feature can be enabled, just not active. However, the effect is essentially the same: you cannot enable those checksum algorithms on any dataset in the pool, so you might as well not enable the feature (which is just pointing a loaded gun at your foot). In the past, an argument could be made that having all the features enabled was useful for simplicity, as long as you didn't activate the GRUB-incompatible features, but that's getting less and less realistic over time. A user can still do that, but we should not encourage that. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626 Closes #8446	2019-04-16 10:02:46 -07:00
Richard Laager	7886aa8a79	Reword the dedup limitation for Edon-R The old wording was effectively "You can not use this (except you can)", which just seems confusing. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:02:07 -07:00
Richard Laager	393363c5ec	Consistently captialize GUID for features Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:01:51 -07:00
Richard Laager	612c4930dd	Fix the spelling of deferred Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:01:45 -07:00
Richard Laager	c349137a7e	Update zdb.8's fsck reference On Linux, this is in man section 8, not 1M. Also, there is no fsdb on Linux, so I removed that. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:01:38 -07:00
Richard Laager	9042ca0e94	Refer to commands consistently in zpool-features.5 This had a mix of command vs subcommand, quoted vs not quoted, and bolded vs. not bolded command names. Also, fix man page sections from 1M (Solaris) to 8 (Linux). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:01:31 -07:00
Richard Laager	9810410a53	Eliminate most mentions of "special" Previously, the "spare" vdev type was described as "A special pseudo-vdev which...". I wanted to eliminate the word "special" from that, now that the allocation_classes feature exists and there is such a thing as a "special vdev". I ended up eliminating almost all instances of the word "special" that are not referencing the allocation_classes feature. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 09:59:57 -07:00
Josh Soref	af65079300	Hint about zpool free vs zfs available Also describe free/allocated/fragmentation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> Closes #7565 Closes #8483	2019-04-04 10:00:16 -07:00
Josh Soref	f72ecb8d27	Fix man(1) warnings The macOS man app strenuously objects to blank lines in man files. mdoc warning: Empty input line #xyz Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> Closes #8559	2019-04-02 11:05:09 -07:00
Tom Caputi	dd29864b01	Update raw send documentation This patch simply clarifies some of the limitations related to raw sends in the man page. No functional changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jason Cohen <jwittlincohen@gmail.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8503 Closes #8544	2019-03-29 15:05:55 -07:00
Brian Behlendorf	1b939560be	Add TRIM support UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com> Contributions-by: Tim Chase <tim@chase2k.com> Contributions-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8419 Closes #598	2019-03-29 09:13:20 -07:00
Evan Allrich	74580a9411	Correct a very minor grammar issue Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Evan Allrich <evan@unguku.com> Closes #8535	2019-03-26 12:27:29 -07:00
Olaf Faaland	060f0226e6	MMP interval and fail_intervals in uberblock When Multihost is enabled, and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, is not enough information. If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but does not depend on ub_mmp_delay: zfs_multihost_fail_intervals * zfs_multihost_interval and if zfs_multihost_fail_intervals == 0 on that node, there is no such well defined safe duration, but the importing host cannot tell whether mmp_delay is high due to I/O delays, or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary. This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. It calculates the activity test duration differently depending on whether the new fields are present or not; for importing pools with only ub_mmp_delay, it uses (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals Which results in an activity test duration less sensitive to the leaf count. In addition, it makes a few other improvements: * It updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner, when the pool is idle, as it is not limited to the granularity of ub_timestamp (1 second). * It issues writes immediately when zfs_multihost_interval is changed so remote hosts see the updated value as soon as possible. * It fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension. * Update tests to verify activity check duration is based on recorded tunable values, not tunable values on importing host. * Update tests to verify the expected number of uberblocks have valid MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number), that sequence number is incrementing, and that uberblock values match tunable settings. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7842	2019-03-21 12:47:57 -07:00
Tom Caputi	ab7615d92c	Multiple DVA Scrubbing Fix Currently, there is an issue in the sequential scrub code which prevents self healing from working in some cases. The scrub code will split up all DVA copies of a bp and issue each of them separately. The problem is that, since each of the DVAs is no longer associated with the others, the self healing code doesn't have the opportunity to repair problems that show up in one of the DVAs with the data from the others. This patch fixes this issue by ensuring that all IOs issued by the sequential scrub code include all DVAs. Initially, only the first DVA of each is attempted. If an issue arises, the IO is retried with all available copies, giving the self healing code a chance to correct the issue. To test this change, this patch also adds the ability for zinject to specify individual DVAs to inject read errors into. We then add a new test case that utilizes this functionality to ensure scrubs and self-healing reads can handle and transparently fix issues with individual copies of blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8453	2019-03-15 14:14:31 -07:00
Alexander Motin	1af240f3b5	Add separate aggregation limit for non-rotating media Before sequential scrub patches ZFS never aggregated I/Os above 128KB. Sequential scrub bumped that to 1MB, supposedly to reduce number of head seeks for spinning disks. But for SSDs it makes little to no sense, especially on FreeBSD, where due to MAXPHYS limitation device will likely still see bunch of 128KB I/Os instead of one large. Having more strict aggregation limit for SSDs allows to avoid allocation of large memory buffer and copy to/from it, that is a serious problem when throughput reaches gigabytes per second. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8494	2019-03-13 12:00:10 -07:00
Tom Caputi	f00ab3f22c	Detect and prevent mixed raw and non-raw sends Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know about how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset. This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem. This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8308	2019-03-13 11:00:43 -07:00
Tom Caputi	579ce7c5ae	Add bookmark v2 on-disk feature This patch adds the bookmark v2 feature to the on-disk format. This feature will be needed for the upcoming redacted sends and for an upcoming fix that for raw receives. The feature is not currently used by any code and thus this change is a no-op, aside from the fact that the user can now enable the feature. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8308	2019-03-13 10:58:39 -07:00
Olaf Faaland	db2af93d72	Increase default zfs_multihost_fail_intervals and import_intervals By default, when multihost is enabled for a pool, the pool is suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms pass without a successful MMP write. This is the recommended configuration. The default value for zfs_multihost_fail_intervals has been 5, and the default value for zfs_multihost_interval has been 1000, so pool suspension occurred at 5 seconds. There have been multiple cases where a single misbehaving device in a pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This in turn caused MMP to suspend the pool. In the cases observed, the rest of the devices were healthy and the pool was otherwise correctly performing I/O. The reset was handled correctly by ZFS, and by suspending the pool MMP made replacing the device more difficult as well as forcing the host to be rebooted. Increase the default value of zfs_multihost_fail_intervals to 10, so that MMP tolerates up to 10 seconds of failed MMP writes before suspending the pool. Increase the default value of zfs_multihost_import_intervals to 20, to maintain the 2:1 safety factor. This results in a force import taking approximately 20 seconds when MMP is enabled, with default values. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7709 Closes #8495	2019-03-13 09:50:48 -07:00
Olaf Faaland	8133679ff0	Do not resume a pool if multihost is enabled When multihost is enabled, and a pool is suspended, return EINVAL in response to "zpool clear <pool>". The pool may have been imported on another host while I/O was suspended. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6933 Closes #8460	2019-02-28 17:56:19 -08:00
Olaf Faaland	4f3218aed8	Warn user about accidentally sharing devices Improve the man page text to warn the user about the risk of adding the same device to multiple pools via simultaneous "zpool create", "zpool add", "zpool replace", etc. State that MMP/multihost does not protect against these scenarios. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6473 Closes #8457	2019-02-28 17:54:36 -08:00
Matthew Ahrens	87c25d567f	abd_alloc should use scatter for >1K allocations abd_alloc() normally does scatter allocations, thus solving the problem that ABD originally set out to: the bulk of ZFS's allocations are single pages, which are faster to allocate and free, and don't suffer from internal fragmentation (and the inability to reclaim memory because some buffers in the slab are still allocated). However, the current code does linear allocations for 4KB and smaller allocations, defeating the purpose of ABD. Scatter ABD's use at least one page each, so sub-page allocations waste some space when allocated as scatter (e.g. 2KB scatter allocation wastes half of each page). Using linear ABD's for small allocations means that they will be put on slabs which contain many allocations. This can improve memory efficiency, but it also makes it much harder for ARC evictions to actually free pages, because all the buffers on one slab need to be freed in order for the slab (and underlying pages) to be freed. Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it's possible for them to actually waste more memory than scatter (one page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting 15/16th). Spill blocks are typically 512B and are heavily used on systems running selinux with the default dnode size and the `xattr=sa` property set. By default we will use linear allocations for 512B and 1KB, and scatter allocations for larger (1.5KB and up). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8455	2019-02-28 17:52:55 -08:00
Matthew Ahrens	c568ab8d99	zfs.8 has wrong description of "zfs program -t" The "-t" argument to "zfs program" specifies a limit on the number of LUA instructions that can be executed. The zfs.8 manpage has the wrong description. It should be updated to match what's in zfs-program.8 Also fix the formatting of the zfs help message. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8410	2019-02-26 11:15:28 -08:00
DeHackEd	ba7b05cb25	zfs(8): improve document of compression behaviours Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: DHE <git@dehacked.net> Closes #4660 Closes #8423	2019-02-25 11:10:16 -08:00
kpande	f8bb2a7e0c	Clarify zpool iostat statistics reporting Document expected behavior for zpool iostat statistics reporting. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Kash Pande <kash@tripleback.net> Closes #2888 Closes #8417	2019-02-21 14:00:48 -08:00
Anatoly Borodin	f23b0242b6	Fix '-T u\|d' descriptions in zpool(8) In -T u\|d Display a time stamp. Specify -u for a printed representation of the internal representation of time. See time(2). Specify -d for standard date format. See date(1). 'Specify u' and 'Specify d' should be used instead. `zpool list -T -u` does not work. Bring the descriptions in `zpool list` and `zpool status` in sync with `zpool iostat`. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Anatoly Borodin <anatoly.borodin@gmail.com> Closes #8438	2019-02-21 11:22:06 -08:00
kpande	11f6127aba	zfs mount man page should document legacy behaviour Document legacy mount behavior. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Kash Pande <kash@tripleback.net> Closes #2900 Closes #8414	2019-02-19 11:10:57 -08:00
Tim Chase	638dd5f44e	zio_deadman_impl() fix and enhancement Add the zio_deadman_log_all tunable to print all zios in zio_deadman_impl(). Also, in all cases, display the depth of the zio relative to the original parent zio. This is meant to be used by developers to gain diagnostic information for hangs which don't involve fully set-up zio trees or are otherwise stuck or hung in an early stage. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #8362	2019-02-15 12:44:24 -08:00
Paul Zuchowski	9c5e88b1de	zfs should optionally send holds Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7513	2019-02-15 12:41:38 -08:00
Alek P	65282ee9e0	Freeing throttle should account for holes Deletion throttle currently does not account for holes in a file. This means that it can activate when it shouldn't. To fix it we switch the throttle to be based on the number of L1 blocks we will have to dirty when freeing Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #7725 Closes #7888	2019-02-12 12:01:08 -08:00
Alek P	dcec0a12c8	port async unlinked drain from illumos-nexenta This patch is an async implementation of the existing sync zfs_unlinked_drain() function. This function is called at mount time and is responsible for freeing znodes that we didn't get to freeing before. We don't have to hold mounting of the dataset until the unlinked list is fully drained as is done now. Since we can process the unlinked set asynchronously this results in a better user experience when mounting a dataset with entries in the unlinked set. Reviewed by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8142	2019-02-12 10:41:15 -08:00
loli10K	4417096956	Pool allocation classes misplacing small file blocks Due to an off-by-one condition in spa_preferred_class() we are picking the "normal" allocation class instead of the "special" one for file blocks with size equal to the special_small_blocks property value. This change fix the small code issue, update the ZFS Test Suite and the zfs(8) man page. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8351 Closes #8361	2019-02-08 12:32:12 -08:00
loli10K	1a745ef62e	zstreamdump: -d option is not documented in manpage This change simply documents the missing -d (dump contents) option in zstreamdump(8). Reviewed-by: bunder2015 <omfgbunder@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8369	2019-02-04 09:13:00 -08:00
Serapheim Dimitropoulos	21e7cf5da8	zdb -L should skip leak detection altogether Currently the point of -L option in zdb is to disable leak tracing and the loading of space maps because they are expensive, yet still do leak detection in terms of space. Unfortunately, there is a scenario where this is a lie. If we are using zdb -L on a pool where a vdev is being removed, zdb_claim_removing() will open the metaslab space maps of that device. This patch makes it so zdb -L skips leak detection altogether and ensures that no space maps are loaded. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8335	2019-01-30 09:54:27 -08:00
Serapheim Dimitropoulos	c853f382db	Change target size of metaslabs from 256GB to 16GB = Old behavior For vdev sizes 100GB to 50TB we keep ~200 metaslabs per vdev and the metaslab size grows from 512MB to 256GB. For vdev's bigger than that we start increasing the number of metaslabs until we hit the 128K limit. = New Behavior For vdev sizes 100GB to 3TB we keep ~200 metaslabs per vdev and the metaslab size grows from 512MB to 16GB. For vdev's bigger than that we start increasing the number of metaslabs until we hit the 128K limit. = Reasoning The old behavior makes metaslabs grow in size when the vdev range is between 3TB (ms_size 16GB) and 32PB (ms_size 256GB). Even though keeping the number of metaslabs is good in terms of potential number of I/Os per TXG, these bigger metaslabs take longer to be loaded and after they are loaded they can take up a lot of memory because of their range trees. This change tries to put a boundary in memory and loading time for the specific range of vdev sizes. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8324	2019-01-25 16:38:27 -08:00
Damian Wojsław	8fccfa8e17	zpool iostat should print headers when terminal fills When `zpool iostat` fills the terminal the headers should be printed again. `zpool iostat -n` can be used to suppress this. If the command is not attached to a tty, headers will not be printed so as to not break existing scripts. Reviewed-by: Joshua M. Clulow <josh@sysmgr.org> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Damian Wojsław <damian@wojslaw.pl> Closes #8235 Closes #8262	2019-01-23 13:29:49 -08:00
Brian Behlendorf	64bdf63f5c	ztest: split block reconstruction Increase the default allowed number of reconstruction attempts. There's not an exact right number for this setting. It needs to be set large enough to cover any realistic failure scenarios and small enough to avoid stalling the IO pipeline and invoking the dead man detection. The current value of 256 was empirically determined to be too low based on multi-day runs of ztest. The fault injection code would inject more damage than could be reconstructed given the relatively small number of attempts. However, in all observed cases the block could be reconstructed using a slightly higher limit. Based on local testing increasing the default value to 4096 was determined to strike the best balance. Checking all combinations takes less than 10s in the worst case, and has so far eliminated the vast majority of false positives detected by ztest. This delay is roughly on par with how long retries may be performed to a misbehaving HDD and was deemed to be reasonable. Better to err on the side of a brief delay rather than fail to reconstruct the data. Lastly, the -Y flag has been added to zdb to make it easy to try all possible combinations when performing split block reconstruction. For badly damaged blocks with 18 splits, they can be fully enumerated within a few minutes. This has been done to ensure permanent errors are never incorrectly reported when ztest verifies the pool with zdb. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8271	2019-01-16 14:10:02 -08:00
Brian Behlendorf	6e91a72fe3	Disable 'zfs remap' command The implementation of 'zfs remap' has proven to be problematic since it modifies the objset (but not its logical contents) by dirtying metadata without owning it. The consequence of which is that dmu_objset_remap_indirects() is vulnerable to certain races. For example, if we are in the middle of receiving into the filesystem while it is being remapped. Then it is possible we could evict the objset when the receive completes (see dsl_dataset_clone_swap_sync_impl, or dmu_recv_end_sync), but dmu_objset_remap_indirects() may be still using the objset. The result of which would be a panic. Extended runs of ztest(8) have exposed other possible races which can occur when using 'zfs remap'. Several of these have been fixed but there may be others which have not yet been encountered and diagnosed. Furthermore, the ability to manually remap a filesystem is no longer particularly useful now that the removal code can map large chunks. Coupled with the fact that explaining what this command does and why it may be useful requires a detailed understanding of the internals of device removal. These are details users should not be bothered with. Therefore, the 'zfs remap' command is being disabled but not entirely removed. It may be removed in the future or potentially reworked to address the issues described above. Since 'zfs remap' has never been part of a tagged release its removal is expected to have minimal impact. The ZTS tests have been updated to continue to exercise the command to prevent atrophy, but it has been removed entirely from ztest(8). Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8238	2019-01-15 15:46:58 -08:00
Benjamin Gentil	22448f0894	zfs.8 uses wrong snapshot names in Example 15 Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Benjamin Gentil <benjamin@gentil.io> Closes #8241	2019-01-07 11:08:54 -08:00
Brian Behlendorf	a769fb53a1	Add 'zpool status -i' option Only display the full details of the vdev initialization state in 'zpool status' output when requested with the -i option. By default display '(initializing)' after vdevs when they are being actively initialized. This is consistent with the established precident of appending '(resilvering), etc' and fits within the default 80 column terminal width making it easy to read. Additionally, updated the 'zpool initialize' documentation to make it clear the options are mutually exclusive, but allow duplicate options like all other zfs/zpool commands. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8230	2019-01-07 11:03:18 -08:00
George Wilson	619f097693	OpenZFS 9102 - zfs should be able to initialize storage devices PROBLEM ======== The first access to a block incurs a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are "thick provisioned", where supported by the platform (VMware). This can create a large delay in getting a new virtual machines up and running (or adding storage to an existing Engine). If the thick provision step is omitted, write performance will be suboptimal until all blocks on the LUN have been written. SOLUTION ========= This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty. When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design: - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...] - start, suspend, or cancel initialization - Creates new open-context thread for each vdev - Thread iterates through all metaslabs in this vdev - Each metaslab: - select a metaslab - load the metaslab - mark the metaslab as being zeroed - walk all free ranges within that metaslab and translate them to ranges on the leaf vdev - issue a "zeroing" I/O on the leaf vdev that corresponds to a free range on the metaslab we're working on - continue until all free ranges for this metaslab have been "zeroed" - reset/unmark the metaslab being zeroed - if more metaslabs exist, then repeat above tasks. - if no more metaslabs, then we're done. - progress for the initialization is stored on-disk in the vdev’s leaf zap object. The following information is stored: - the last offset that has been initialized - the state of the initialization process (i.e. active, suspended, or canceled) - the start time for the initialization - progress is reported via the zpool status command and shows information for each of the vdevs that are initializing Porting notes: - Added zfs_initialize_value module parameter to set the pattern written by "zpool initialize". - Added zfs_vdev_{initializing,removal}_{min,max}_active module options. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: loli10K <ezomori.nozomu@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Signed-off-by: Tim Chase <tim@chase2k.com> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/9102 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb Closes #8230	2019-01-07 10:37:26 -08:00
Richard Laager	06f3fc2a4b	Minor tweaks to zfs.8 man page for POSIX ACLs * Capitalize POSIX in POSIX ACLs. This change makes the POSIX in POSIX ACLs all caps, which is both correct and consistent with the rest of the man page. * Slightly reword part of zfs.8. I tweaked a sentence to add a missing comma, and as long as I was editing, removed a couple unnecessary words. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8220	2018-12-26 13:50:14 -08:00
Tony Hutter	c66401fae0	Add enclosure_symlinks option to vdev_id Add an 'enclosure_symlinks' option to vdev_id.conf. This creates consistently named symlinks to the enclosure devices (/dev/sg*) based off the configuration in vdev_id.conf. The enclosure symlinks show up in /dev/by-enclosure/<prefix>-<channel><num>. The links make it make it easy to run sg_ses on a particular enclosure device. The enclosure links are created in addition to the normal /dev/disk/by-vdev links. 'enclosure_symlinks' is only valid in sas_direct configurations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Simon Guest <simon.guest@tesujimath.org> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8194	2018-12-14 17:27:49 -08:00
Prakash Surya	53b1f5eac6	OpenZFS 9963 - Separate tunable for disabling ZIL vdev flush Porting Notes: * Add options to zfs-module-parameters(5) man page. * zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since the latter doesn't get built for user space. Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: George Melikov <mail@gmelikov.ru> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9963 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125 Closes #8186	2018-12-07 11:06:29 -08:00
Brian Behlendorf	7c9a42921e	Detect IO errors during device removal * Detect IO errors during device removal While device removal cannot verify the checksums of individual blocks during device removal, it can reasonably detect hard IO errors from the leaf vdevs. Failure to perform this error checking can result in device removal completing successfully, but moving no data which will permanently corrupt the pool. Situation 1: faulted/degraded vdevs In the configuration shown below, the removal of mirror-0 will permanently corrupt the pool. Device removal will preferentially copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. Which in this case will result in nothing being copied since one vdev in each of those groups in unavailable. However, device removal will complete successfully since all IO errors are ignored. tank DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 /var/tmp/vdev1 FAULTED 0 0 0 external fault /var/tmp/vdev2 ONLINE 0 0 0 mirror-1 DEGRADED 0 0 0 /var/tmp/vdev3 ONLINE 0 0 0 /var/tmp/vdev4 FAULTED 0 0 0 external fault This issue is resolved by updating the source child selection logic to exclude unreadable leaf vdevs. Additionally, unwritable destination child vdevs which can never succeed are skipped to prevent generating a large number of write IO errors. Situation 2: individual hard IO errors During removal if an unexpected hard IO error is encountered when either reading or writing the child vdev the entire removal operation is cancelled. While it may be possible to reconstruct the data after removal that cannot be guaranteed. The only strictly safe thing to do is to cancel the removal. As a future improvement we may want to instead suspend the removal process and allow the damaged region to be retried. But that work is left for another time, hard IO errors during the removal process are expected to be exceptionally rare. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6900 Closes #8161	2018-12-04 09:37:37 -08:00
Rich Ercolani	62ee31adce	Fix typo in update to zfs-module-parameters(5) Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #8153	2018-11-26 10:23:58 -08:00
Christian Schwarz	bd9c195805	man/zfs.8: document 'received' property source Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #8134	2018-11-20 09:59:43 -08:00
Tony Hutter	ad796b8a3b	Add zpool status -s (slow I/Os) and -p (parseable) This patch adds a new slow I/Os (-s) column to zpool status to show the number of VDEV slow I/Os. This is the number of I/Os that didn't complete in zio_slow_io_ms milliseconds. It also adds a new parsable (-p) flag to display exact values. NAME STATE READ WRITE CKSUM SLOW testpool ONLINE 0 0 0 - mirror-0 ONLINE 0 0 0 - loop0 ONLINE 0 0 0 20 loop1 ONLINE 0 0 0 0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7756 Closes #6885	2018-11-08 16:47:24 -08:00
Richard Elling	6644e5bb6e	Update zfs-events.5 with info from PSARC 2009/497 Update zfs-events.5 with info from PSARC 2009/497 regarding ereport fields. Also updates ZIO_STAGE_* and ZIO_FLAG_* descriptions to match current source. Reviewed by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Closes #8057	2018-11-01 15:54:55 -07:00
Tom Caputi	80a91e7469	Defer new resilvers until the current one ends Currently, if a resilver is triggered for any reason while an existing one is running, zfs will immediately restart the existing resilver from the beginning to include the new drive. This causes problems for system administrators when a drive fails while another is already resilvering. In this case, the optimal thing to do to reduce risk of data loss is to wait for the current resilver to end before immediately replacing the second failed drive, which allows the system to operate with two incomplete drives for the minimum amount of time. This patch introduces the resilver_defer feature that essentially does this for the admin without forcing them to wait and monitor the resilver manually. The change requires an on-disk feature since we must mark drives that are part of a deferred resilver in the vdev config to ensure that we do not assume they are done resilvering when an existing resilver completes. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: @mmaybee Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7732	2018-10-18 21:06:18 -07:00
Matthew Ahrens	dfbe267503	OpenZFS 9617 - too-frequent TXG sync causes excessive write inflation Porting notes: * Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and changed the type to be consistent with the other dirty module params. * Updated zfs-module-parameters.5 accordingly. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9617 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba Closes #7976	2018-10-04 13:13:28 -07:00
Yuri Pankov	cb110f254e	OpenZFS 9763 - zfs(1M): broken formatting in allow/unallow description Porting notes: * Two of the three changes from the upstream patch were already applied for Linux. Only the last one is required. Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9763 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8a702e55 Closes #7972	2018-10-02 16:12:54 -07:00
Brian Behlendorf	1258bd778e	Refine split block reconstruction Due to a flaw in `4589f3ae` the number of unique combinations could be calculated incorrectly. This could result in the random combinations reconstruction being used when it would have been possible to check all combinations. This change fixes the unique combinations calculation and simplifies the reconstruction logic by maintaining a per- segment list of unique copies. The vdev_indirect_splits_damage() function was introduced to validate both the enumeration and random reconstruction logic with ztest. It is implemented such it will never make a known recoverable block unrecoverable. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6900 Closes #7934	2018-10-01 10:36:34 -07:00
DeHackEd	9d489ab3a8	Fix reference to zpool-features(5) Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: DHE <git@dehacked.net> Closes #7938	2018-09-21 09:41:08 -07:00
DeHackEd	81155b296d	Fix allocation_classes GUID in zpool-features(5) Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: DHE <git@dehacked.net> Closes #7920	2018-09-18 08:59:47 -07:00
Gregor Kopka	48b0b649fd	Man page fixes - zpool/zfs optional parameters The man pages for zpool and zfs (get command) listed the pool/dataset parameter as required, but these are optional. Fixed that. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregor Kopka <gregor@kopka.net> Closes #7916	2018-09-18 08:55:33 -07:00
Brian Behlendorf	2ced3cf0b2	Clarify 'zpool remove' restrictions Update zpool(8) to clarify what type of vdevs may be safely removed and that the existence of any top-level raidz device which is part of the primary pool will prevent device removal. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7880 Closes #7893	2018-09-17 17:28:18 -07:00
LOLi	0238a9755b	Fix 'zfs allow' for create time permissions When no permission set is defined for a dataset the create time permissions are incorrectly shown as if they were a permission set. This change simply correct how allow permissions are displayed. This commit also fixes a small manpage formatting issue and adds the "zfs_allow_003_pos" test case to the ZFS Test Suite. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7519 Closes #7860	2018-09-06 13:11:21 -07:00
Don Brady	cc99f275a2	Pool allocation classes Allocation Classes add the ability to have allocation classes in a pool that are dedicated to serving specific block categories, such as DDT data, metadata, and small file blocks. A pool can opt-in to this feature by adding a 'special' or 'dedup' top-level VDEV. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Gregor Kopka <gregor@kopka.net> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #5182	2018-09-05 18:33:36 -07:00
Tom Caputi	c3bd3fb4ac	OpenZFS 9403 - assertion failed in arc_buf_destroy() Assertion failed in arc_buf_destroy() when concurrently reading block with checksum error. Porting notes: * The ability to zinject decompression errors has been added, but this only works at the zio_decompress() level, where we have all of the info we need to match against the user's zinject options. * The decompress_fault test has been added to test the new zinject functionality * We attempted to set zio_decompress_fail_fraction to (1 << 18) in ztest for further test coverage. Although this did uncover a few low priority issues, this unfortuantely also causes ztest to ASSERT in many locations where the code is working correctly since it is designed to fail on IO errors. Developers can manually set this variable with the '-o' option to find and debug issues. Authored by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Tom Caputi <tcaputi@datto.com> OpenZFS-issue: https://illumos.org/issues/9403 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9 Closes #7822	2018-08-29 11:33:33 -07:00
Serapheim Dimitropoulos	a448a2557e	Introduce read/write kstats per dataset The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number. The motivation for this change is to provide some preliminary analytics on dataset usage/performance. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #7705	2018-08-20 09:52:37 -07:00
LOLi	a9d6270acb	'zfs holds' scripted mode is not documented This change simply documents the existing "scripted mode" option in both command help and man page. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7798	2018-08-18 15:47:41 -07:00
Tom Caputi	d9c460a0b6	Added encryption support for zfs recv -o / -x One small integration that was absent from b52563 was support for zfs recv -o / -x with regards to encryption parameters. The main use cases of this are as follows: * Receiving an unencrypted stream as encrypted without needing to create a "dummy" encrypted parent so that encryption can be inheritted. * Allowing users to change their keylocation on receive, so long as the receiving dataset is an encryption root. * Allowing users to explicitly exclude or override the encryption property from an unencrypted properties stream, allowing it to be received as encrypted. * Receiving a recursive heirarchy of unencrypted datasets, encrypting the top-level one and forcing all children to inherit the encryption. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7650	2018-08-15 09:48:49 -07:00
Toomas Soome	5fadb7fb0c	OpenZFS 8906 - uts: illumos rootfs should support salted cksum Porting notes: * As of grub-2.02 these checksums are not supported. However, as pointed out in #6501 there are alternatives such as EFISTUB which work and have no such restriction. A warning was added to the checksum property section of the zfs.8 man page. Authored by: Toomas Soome <tsoome@me.com> Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/8906 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f Closes #6501 Closes #7714	2018-07-27 08:35:28 -07:00
Matthew Ahrens	2e5dc449c1	OpenZFS 9337 - zfs get all is slow due to uncached metadata This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much). There are two parts: The dbuf_metadata_cache. We identify what to put into the cache based on the object type of each dbuf. Caching objset properties os {version,normalization,utf8only,casesensitivity} in the objset_t. The reason these needed to be cached is that although they are queried frequently, they aren't stored in a dbuf type which we can easily recognize and cache in the dbuf layer; instead, we have to explicitly store them. There's already existing infrastructure for maintaining cached properties in the objset setup code, so I simply used that. Performance Testing: - Disabled kmem_flags - Tuned dbuf_cache_max_bytes very low (128K) - Tuned zfs_arc_max very low (64M) Created test pool with 400 filesystems, and 100 snapshots per filesystem. Later on in testing, added 600 more filesystems (with no snapshots) to make sure scaling didn't look different between snapshots and filesystems. Results: \| Test \| Time (trunk / diff) \| I/Os (trunk / diff) \| +------------------------+---------------------+---------------------+ \| zpool import \| 0:05 / 0:06 \| 12.9k / 12.9k \| \| zfs get all (uncached) \| 1:36 / 0:53 \| 16.7k / 5.7k \| \| zfs get all (cached) \| 1:36 / 0:51 \| 16.0k / 6.0k \| Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> OpenZFS-issue: https://illumos.org/issues/9337 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f Closes #7668	2018-07-12 10:49:27 -07:00
Don Brady	e4e94ca315	OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap Authored by: Don Brady <don.brady@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> OpenZFS-issue: https://www.illumos.org/issues/9426 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1 Closes #7700	2018-07-11 15:55:48 -07:00
Serapheim Dimitropoulos	a7ed98d8b5	OpenZFS 9330 - stack overflow when creating a deeply nested dataset Datasets that are deeply nested (~100 levels) are impractical. We just put a limit of 50 levels to newly created datasets. Existing datasets should work without a problem. The problem can be seen by attempting to create a dataset using the -p option with many levels: panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff fffffffffbc3aa60 unix:die+100 () fffffffffbc3ab70 unix:trap+157d () ffffff00083d7020 unix:_patch_xrstorq_rbx+196 () ffffff00083d7050 zfs:dbuf_rele+2e () ... ffffff00083d7080 zfs:dsl_dir_close+32 () ffffff00083d70b0 zfs:dsl_dir_evict+30 () ffffff00083d70d0 zfs:dbuf_evict_user+4a () ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 () ffffff00083d7130 zfs:dbuf_rele+2e () ... The block above repeats once per directory in the ... ... create -p command, working towards the root ... ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 () ffffff00083dba20 zfs:dsl_dataset_rele+42 () ffffff00083dba70 zfs:dmu_objset_prefetch+e4 () ffffff00083dbaa0 zfs:findfunc+23 () ffffff00083dbb80 zfs:dmu_objset_find_spa+38c () ffffff00083dbbc0 zfs:dmu_objset_find+40 () ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b () ffffff00083dbcc0 zfs:zfsdev_ioctl+347 () ffffff00083dbd00 genunix:cdev_ioctl+45 () ffffff00083dbd40 specfs:spec_ioctl+5a () ffffff00083dbdc0 genunix:fop_ioctl+7b () ffffff00083dbec0 genunix:ioctl+18e () ffffff00083dbf10 unix:brand_sys_sysenter+1c9 () Porting notes: * Added zfs_max_dataset_nesting module option with documentation. * Updated zfs_rename_014_neg.ksh for Linux. * Increase the zfs.sh stack warning to 15K. Enough time has passed that 16K can be reasonably assumed to be the default value. It was increased in the 3.15 kernel released in June of 2014. Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> OpenZFS-issue: https://www.illumos.org/issues/9330 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a Closes #7681	2018-07-09 13:02:50 -07:00
Serapheim Dimitropoulos	4d044c4c1d	OpenZFS 9238 - ZFS Spacemap Encoding V2 Motivation ========== The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). New encoding ============ The layout can be found at space_map.h and it remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps (expected to be used in the log spacemap project). One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. Other references: ================= The new encoding is also discussed towards the end of the Log Space Map presentation from 2017's OpenZFS summit. Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d OpenZFS-issue: https://www.illumos.org/issues/9238 Closes #7665	2018-07-05 12:02:34 -07:00
Tim Chase	3be1eb29da	Fix formatting in zpool-features(5) The formatting of the features beginning with large_blocks was broken when the zpool_checkpoint feature was added. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7658	2018-06-27 09:34:25 -07:00
Eitan Adler	fb8a10d5be	OpenZFS 9521 - Add checkpoint field Add checkpoint field in the default list of the zpool-list man page Authored by: Eitan Adler <lists@eitanadler.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: kpande <github@tripleback.net> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9521 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5a860f7b Closes #7658	2018-06-27 09:33:37 -07:00
Serapheim Dimitropoulos	d2734cce68	OpenZFS 9166 - zfs storage pool checkpoint Details about the motivation of this feature and its usage can be found in this blogpost: https://sdimitro.github.io/post/zpool-checkpoint/ A lightning talk of this feature can be found here: https://www.youtube.com/watch?v=fPQA8K40jAM Implementation details can be found in big block comment of spa_checkpoint.c Side-changes that are relevant to this commit but not explained elsewhere: * renames members of "struct metaslab trees to be shorter without losing meaning * space_map_{alloc,truncate}() accept a block size as a parameter. The reason is that in the current state all space maps that we allocate through the DMU use a global tunable (space_map_blksz) which defauls to 4KB. This is ok for metaslab space maps in terms of bandwirdth since they are scattered all over the disk. But for other space maps this default is probably not what we want. Examples are device removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this review. Both of these have a 1:1 relationship with each vdev and could benefit from a bigger block size. Porting notes: * The part of dsl_scan_sync() which handles async destroys has been moved into the new dsl_process_async_destroys() function. * Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write to block device backed pools. * ZTS: * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg". * Don't use large dd block sizes on /dev/urandom under Linux in checkpoint_capacity. * Adopt Delphix-OS's setting of 4 (spa_asize_inflation = SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed its attempts to fill the pool * Create the base and nested pools with sync=disabled to speed up the "setup" phase. * Clear labels in test pool between checkpoint tests to avoid duplicate pool issues. * The import_rewind_device_replaced test has been marked as "known to fail" for the reasons listed in its DISCLAIMER. * New module parameters: zfs_spa_discard_memory_limit, zfs_remove_max_bytes_pause (not documented - debugging only) vdev_max_ms_count (formerly metaslabs_per_vdev) vdev_min_ms_count Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9166 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8 Closes #7570	2018-06-26 10:07:42 -07:00
ajs124	a8577bdb32	Fix duplicate "fB" typo Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: ajs124 <git@ajs124.de> Closes #7649	2018-06-25 09:50:01 -07:00
John Gallagher	917f475fba	Add tunables for channel programs This patch adds tunables for modifying the maximum memory limit and maximum instruction limit that can be specified when running a channel program. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: John Gallagher <john.gallagher@delphix.com> External-issue: LX-1085 Closes #7618	2018-06-15 15:10:42 -07:00
Antonio Russo	39042f9736	Tunable directory for zfs runtime scripts zpool and zed place scripts in subdirectories of libexecdir. Some distributions locate architecture independent scripts in other locations (e.g. Debian). To avoid these paths getting out of sync, centralize the definitions. Build zfs-test's default.cfg by Makefile. Use the new directory logic building tests/zfs-tests/include/default.cfg.in. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7597	2018-06-07 09:59:59 -07:00
Antonio Russo	7106b23640	Minor documentation, logging, and testing typos This patch collects some minor inconsistencies and typos in the documentation, logging and testing infrastructure. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7608	2018-06-07 09:38:39 -07:00
bunder2015	85912983a4	Fix typoes in zpool man page Fixed some highlighting in the zpool man page Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #7596	2018-06-04 09:06:16 -07:00
Brian Behlendorf	93ce2b4ca5	Update build system and packaging Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_. Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7556	2018-05-29 16:00:33 -07:00
Brian Behlendorf	1272941f49	Merge branch 'zfsonlinux/merge-spl' Merge a minimal version of the zfsonlinux/spl repository in to the zfsonlinux/zfs repository. Care was taken to prevent file conflicts when merging and to preserve the spl repository history. The spl kernel module remains under the GPLv2 license as documented by the additional THIRDPARTYLICENSE.gplv2 file. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-05-29 14:57:55 -07:00
Brian Behlendorf	a91258913f	Prepare SPL repo to merge with ZFS repo This commit removes everything from the repository except the core SPL implementation for Linux. Those files which remain have been moved to non-conflicting locations to facilitate the merge. The README.md and associated files have been updated accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-05-29 14:51:39 -07:00
Matthew Ahrens	0dc2f70c5c	OpenZFS 9486 - reduce memory used by device removal on fragmented pools Device removal allocates a new location for each allocated segment on the disk that's being removed. Each allocation results in one entry in the mapping table, which maps from old location + length to new location. When a fragmented disk is removed, this can result in a large number of mapping entries, and thus a large amount of memory consumed by the mapping table. In the worst real-world cases, we've seen around 1GB of RAM per 1TB of storage removed. We can improve on this situation by allocating larger segments, which span across both allocated and free regions of the device being removed. By including free regions in the allocation (and thus mapping), we reduce the number of mapping entries. For example, if we have a 4K allocation followed by 1K free and then 4K allocated, we would allocate 4+1+4 = 9KB, and then move the entire region (including allocated and free parts). In this case we used one mapping where previously we would have used two, but often the ratio is much higher (up to 20:1 in real-world use). We then need to mark the regions that were free on the removing device as free in the new locations, and also obsolete in the mapping entry. This method preserves the fragmentation of the removing device, rather than consolidating its allocated space into a small number of chunks where possible. But it results in drastic reduction of memory used by the mapping table - around 20x in the most-fragmented cases. In the most fragmented real-world cases, this reduces memory used by the mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less fragmented cases will typically also see around 50-100MB of RAM per 1TB of storage. Porting notes: * Add the following as module parameters: * zfs_condense_indirect_vdevs_enable * zfs_condense_max_obsolete_bytes * Document the following module parameters: * zfs_condense_indirect_vdevs_enable * zfs_condense_max_obsolete_bytes * zfs_condense_min_mapping_bytes Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9486 OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c External-issue: DLPX-57962 Closes #7536	2018-05-24 10:18:07 -07:00
Brian Behlendorf	b669ab83bb	Ignore *.o.ur-safe build artifacts Generated when building on Ubuntu 18.04. Also ignore the new dynamically generated zfs-mount-generator.8 man page, and the module/.cache.mk file. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7534	2018-05-13 18:59:02 -07:00
Antonio Russo	68fded8146	Add canonical mount options zfs-mount-generator lib/libzfs/libzfs_mount.c:zfs_add_options provides the canonical mount options used by a `zfs mount` command. Because we cannot call `zfs mount` directly from a systemd.mount unit, we mirror that logic in zfs-mount-generator. The zed script is updated to cache these properties as well. Include a mini-tutorial in the manual page, properly substitute configuration paths in zfs-mount-generator.8.in, and standardize the Makefile. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7453	2018-05-11 12:44:14 -07:00
Pavel Zakharov	6cb8e5306d	OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: 7638 Refactor spa_load_impl into several functions 8961 SPA load/import should tell us why it failed 7277 zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Porting notes (ZTS): * Fix 'make dist' target in zpool_import * The maximum path length allowed by tar is 99 characters. Several of the new test cases exceeded this limit resulting in them not being included in the tarball. Shorten the names slightly. * Set/get tunables using accessor functions. * Get last synced txg via the "zfs_txg_history" mechanism. * Clear zinject handlers in cleanup for import_cache_device_replaced and import_rewind_device_replaced in order that the zpool can be exported if there is an error. * Increase FILESIZE to 8G in zfs-test.sh to allow for a larger ext4 file system to be created on ZFS_DISK2. Also, there's no need to partition ZFS_DISK2 at all. The partitioning had already been disabled for multipath devices. Among other things, the partitioning steals some space from the ext4 file system, makes it difficult to accurately calculate the paramters to parted and can make some of the tests fail. * Increase FS_SIZE and FILE_SIZE in the zpool_import test configuration now that FILESIZE is larger. * Write more data in order that device evacuation take lonnger in a couple tests. * Use mkdir -p to avoid errors when the directory already exists. * Remove use of sudo in import_rewind_config_changed. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9075 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123 Closes #7459	2018-05-08 21:35:27 -07:00
Paul Dagnelie	ca0845d59e	OpenZFS 9256 - zfs send space estimation off by > 10% on some datasets Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: * Added tuning to man page. * Test case changes dropped, default behavior unchanged. OpenZFS-issue: https://www.illumos.org/issues/9256 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/32356b3c56 Closes #7470	2018-05-08 08:59:24 -07:00
Tom Caputi	be9a5c355c	Add support for decryption faults in zinject This patch adds the ability for zinject to trigger decryption and authentication faults in the ZIO and ARC layers. This functionality is exposed via the new "decrypt" error type, which may be provided for "data" object types. This patch also refactors some of the core encryption / decryption functions so that they have consistent prototypes, handle errors consistently, and do not have unused arguments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7474	2018-05-02 15:36:20 -07:00
George Melikov	eb201f50ac	Add back iostat -y or -w descriptions The iostat -y and -w descriptions were left in `cda0317e`, get them back. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #7479 Closes #7483	2018-04-30 13:42:58 -05:00
Matthew Ahrens	964c2d69a9	OpenZFS 9236 - nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Patch Notes: * Removed ZFS_DEBUG_SPA from zfs-module-parameters.5 OpenZFS-issue: https://www.illumos.org/issues/9236 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668 Closes #7467	2018-04-30 10:19:48 -07:00
Matt Ahrens	d830d4795a	OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices Authored by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9280 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c Closes #7445	2018-04-17 10:44:50 -07:00
Brian Behlendorf	4589f3ae4c	Optimize possible split block search space Remove duplicate segment copies to minimize the possible search space for reconstruction. Once reduced an accurate assessment can be made regarding the difficulty in reconstructing the block. Also, ztest will now run zdb with zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt to avoid checksum errors. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6900	2018-04-14 12:22:43 -07:00
Matthew Ahrens	9e052db462	OpenZFS 9290 - device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. Note that in no case is incorrect data returned, but we might get a checksum error when we should have been able to find the right data. There are two underlying problems: 1. When we remove a mirror device, we only read one side of the mirror. Since we can't verify the checksum, this side may be silently bad, but the good data is on the other side of the mirror (which we didn't read). This can cause the removal to "bake in" the busted data – all copies of the data in the new location are the same, busted version, while we left the good version behind. The fix for this is to read and copy both sides of the mirror. If the old and new vdevs are mirrors, we will read both sides of the old mirror, and write each copy to the corresponding side of the new mirror. (If the old and new vdevs have a different number of children, we will do this as best as possible.) Even though we aren't verifying checksums, this ensures that as long as there's a good copy of the data, we'll have a good copy after the removal, even if there's silent damage to one side of the mirror. If we're removing a mirror that has some silent damage, we'll have exactly the same damage in the new location (assuming that the new location is also a mirror). 2. When we read from an indirect vdev that points to a mirror vdev, we only consider one copy of the data. This can lead to reduced effective redundancy, because we might read a bad copy of the data from one side of the mirror, and not retry the other, good side of the mirror. Note that the problem is not with the removal process, but rather after the removal has completed (having copied correct data to both sides of the mirror), if one side of the new mirror is silently damaged, we encounter the problem when reading the relocated data via the indirect vdev. Also note that the problem doesn't occur when ZFS knows that one side of the mirror is bad, e.g. when a disk entirely fails or is offlined. The impact is that reads (from indirect vdevs that point to mirrors) may return a checksum error even though the good data exists on one side of the mirror, and scrub doesn't repair all data on the mirror (if some of it is pointed to via an indirect vdev). The fix for this is complicated by "split blocks" - one logical block may be split into two (or more) pieces with each piece moved to a different new location. In this case we need to read all versions of each split (one from each side of the mirror), and figure out which combination of versions results in the correct checksum, and then repair the incorrect versions. This ensures that we supply the same redundancy whether you use device removal or not. For example, if a mirror has small silent errors on all of its children, we can still reconstruct the correct data, as long as those errors are at sufficiently-separated offsets (specifically, separated by the largest block size - default of 128KB, but up to 16MB). Porting notes: * A new indirect vdev check was moved from dsl_scan_needs_resilver_cb() to dsl_scan_needs_resilver(), which was added to ZoL as part of the sequential scrub work. * Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t parameter. The extra parameter is unique to ZoL. * When posting indirect checksum errors the ABD can be passed directly, zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9290 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591 Closes #6900	2018-04-14 12:21:39 -07:00
Matthew Ahrens	a1d477c24c	OpenZFS 7614, 9064 - zfs device evacuation/removal OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Reece <alex@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900	2018-04-14 12:16:17 -07:00
Mike Gerdts	d22f3a8244	OpenZFS 9286 - want refreservation=auto Authored by: Mike Gerdts <mike.gerdts@joyent.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Don Brady <don.brady@delphix.com> Porting Notes: * Adopted destroy_dataset in ZTS test cleanup * Use ksh shebang instead of bash for new tests OpenZFS-issue: https://www.illumos.org/issues/9286 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/723d0c85 Closes #7387	2018-04-11 14:52:13 -07:00
DeHackEd	dfb1ad027f	zfs(8): fix dedup omission during mdoc overhaul The property description has been updated with new algorithms as well. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: DHE <git@dehacked.net> Closes #7377	2018-04-10 17:19:01 -07:00
Brian Behlendorf	3b0d99289a	Fix 'zfs send/recv' hang with 16M blocks When using 16MB blocks the send/recv queue's aren't quite big enough. This change leaves the default 16M queue size which a good value for most pools. But it additionally ensures that the queue sizes are at least twice the allowed zfs_max_recordsize. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7365 Closes #7404	2018-04-08 19:41:15 -07:00
Antonio Russo	55d80e651a	systemd mount generator and tracking ZEDLET zfs-mount-generator implements the "systemd generator" protocol, producing systemd.mount units from the cached outputs of zfs list, during early boot, integrating with systemd. Each pool has an indpendent cache of the command zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool which is kept synchronized by the ZEDLET history_event-zfs-list-cacher.sh Datasets not in the cache will be loaded later in the boot process by zfs-mount.service, including pools without a cache. Among other things, this allows for complex mount hierarchies. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7329	2018-04-06 14:11:09 -07:00
Peter Ashford	910f3ce739	Clarify zpool actions for an intent log device Updated the "Intent Log" section of the "zpool" manual page to properly reflect the actions that may be performed. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Peter Ashford <ashford@accs.com> Closes #6938 Closes #7318	2018-03-22 15:12:08 -07:00
Alek P	272b5d730f	Add JSON output support to channel programs The changes piggyback JSON output support on top of channel programs (#6558). This way the JSON output support is targeted to scripting use cases and is easily maintainable since it really only touches one function (zfs_do_channel_program()). This patch ports Joyent's JSON nvlist library from illumos to enable easy JSON printing of channel program output nvlist. To keep the delta small I also took advantage of the fact that printing in zfs_do_channel_program() was almost always done before exiting the program. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #7281	2018-03-19 12:40:58 -07:00
Brian Behlendorf	de4f8d5d26	OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Porting Notes: * Added modules options to zfs-module-parameters.5 man page. * Preserved scaling based on target ARC size rather than max ARC size. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9188 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564 Upstream bug: DLPX-46942 Closes #7273	2018-03-13 10:52:48 -07:00
Tim Chase	02638a30ef	Add zfs_scan_ignore_errors tunable When it's set, a DTL range will be cleared even if its scan/scrub had errors. This allows to work around resilver/scrub upon import when the pool has errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7293	2018-03-13 10:43:14 -07:00
Paul Zuchowski	83362e8e67	Destroy makes full snap list before destroying Change zfs destroy logic so destroying begins before the entire list of snapshots is built. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7271	2018-03-12 15:24:08 -07:00
Tomohiro Kusumi	5ee220ba5c	Document allowed pool names PR #7208 was a patch to allow non-reserved pool names which begin with mirror, raidz, spare (but do not equal), however we'd rather document it in the man page for compatibility with other OpenZFS implementations, to avoid pool names that may not work on non-Linux platforms. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #7216	2018-03-09 14:04:15 -08:00
Tom Caputi	cf63739191	QAT support for AES-GCM This patch adds support for acceleration of AES-GCM encryption with Intel Quick Assist Technology. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7282	2018-03-09 13:37:15 -08:00
Tony Hutter	80d52c3919	Change checksum & IO delay ratelimit values Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec. This allows zed to actually trigger if a bunch of these events arrive in a short period of time (zed has a threshold of 10 events in 10 sec). Previously, if you had, say, 100 checksum errors in 1 sec, it would get ratelimited to 5/sec which wouldn't trigger zed to fault the drive. Also, convert the checksum and IO delay thresholds to module params for easy testing. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7252	2018-03-04 17:34:51 -08:00
John Eismeier	d699aaef09	Fix some typos Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: John Eismeier <john.eismeier@gmail.com> Closes #7237	2018-02-28 08:57:10 -08:00
Tomohiro Kusumi	d72cd017dd	Fix zpool(8) list example to match actual format `a05dfd00` (Illumos 5147) has swapped FRAG and EXPANDSZ, so it's natural to modify these examples. # zpool list \| head -1 NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT ^^^^^^^^^^^^^^^ Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #7244	2018-02-28 08:54:53 -08:00
Tony Hutter	bf95a000c4	Add scrub after resilver zed script * Add a zed script to kick off a scrub after a resilver. The script is disabled by default. * Add a optional $PATH (-P) option to zed to allow it to use a custom $PATH for its zedlets. This is needed when you're running zed under the ZTS in a local workspace. * Update test scripts to not copy in all-debug.sh and all-syslog.sh by default. They can be optionally copied in as part of zed_setup(). These scripts slow down zed considerably under heavy events loads and can cause events to be dropped or their delivery delayed. This was causing some sporadic failures in the 'fault' tests. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4662 Closes #7086	2018-02-23 11:38:05 -08:00
LOLi	faa97c1619	Want 'zfs send -b' This change implements 'zfs send -b' which can be used to send only received property values whether or not they are overridden by local settings. This can be very useful during "restore" operations from a backup pool because it allows to send only the property values originally sent from the backup source, even though they were later modified on the destination either by a 'zfs set' operation, explicit 'zfs inherit' or overridden during the receive process via 'zfs receive -o\|-x'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7156	2018-02-21 12:32:06 -08:00
Tom Caputi	b1d217338a	Raw receives must compress metadnode blocks Currently, the DMU relies on ZIO layer compression to free LO dnode blocks that no longer have objects in them. However, raw receives disable all compression, meaning that these blocks can never be freed. In addition to the obvious space concerns, this could also cause incremental raw receives to fail to mount since the MAC of a hole is different from that of a completely zeroed block. This patch corrects this issue by adding a special case in zio_write_compress() which will attempt to compress these blocks to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also removes the zfs_mdcomp_disable tunable, since tuning it could cause these same issues. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7198	2018-02-21 12:28:52 -08:00
Olaf Faaland	ec7c1b914c	Clarify zinject(8) explanation of -e Error injection of EIO or ENXIO simply sets the zio's io_error value, rather than preventing the read or write from occurring. This is important information as it affects how the probes must be used. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7172	2018-02-15 09:50:06 -08:00
Nasf-Fan	9c5167d19f	Project Quota on ZFS Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d\|-r] <file\|directory ...> zfs project -C [-k] [-r] <file\|directory ...> zfs project -c [-0] [-d\|-r] [-p id] <file\|directory ...> zfs project [-p id] [-r] [-s] <file\|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by Ned Bass <bass6@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290	2018-02-13 14:54:54 -08:00
Chunwei Chen	950e17c215	Fix zdb -R decompression There are some issues in the zdb -R decompression implementation. The first is that ZLE can easily decompress non-ZLE streams. So we add ZDB_NO_ZLE env to make zdb skip ZLE. The second is the random bytes appended to pabd, pbuf2 stuff. This serve no purpose at all, those bytes shouldn't be read during decompression anyway. Instead, we randomize lbuf2, so that we can make sure decompression fill exactly to lsize by bcmp lbuf and lbuf2. The last one is the condition to detect fail is wrong. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099 Closes #4984	2018-02-09 10:11:02 -08:00
Serapheim Dimitropoulos	5b72a38d68	OpenZFS 8677 - Open-Context Channel Programs Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> We want to be able to run channel programs outside of synching context. This would greatly improve performance for channel programs that just gather information, as they won't have to wait for synching context anymore. === What is implemented? This feature introduces the following: - A new command line flag in "zfs program" to specify our intention to run in open context. (The -n option) - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages. === How do we handle zfs.sync functions in open context? When such a function is found by the interpreter and we are running in open context we abort the script and we spit out a descriptive runtime error. For example, given the script below ... arg = ... fs = arg["argv"][1] err = zfs.sync.destroy(fs) msg = "destroying " .. fs .. " err=" .. err return msg if we run it in open context, we will get back the following error: Channel program execution failed: [string "channel program"]:3: running functions from the zfs.sync submodule requires passing sync=TRUE to lzc_channel_program() (i.e. do not specify the "-n" command line argument) stack traceback: [C]: in function 'destroy' [string "channel program"]:3: in main chunk === What about testing? We've introduced new wrappers for all channel program tests that run each channel program as both (startard & open-context) and expect the appropriate behavior depending on the program using the zfs.sync module. OpenZFS-issue: https://www.illumos.org/issues/8677 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15 Closes #6558	2018-02-08 16:05:57 -08:00
Chris Williamson	234c91c508	OpenZFS 8600 - ZFS channel programs - snapshot Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> ZFS channel programs should be able to create snapshots. In addition to the base snapshot functionality, this entails extra logic to handle edge cases which were formerly not possible, such as creating then destroying a snapshot in the same transaction sync. OpenZFS-issue: https://www.illumos.org/issues/8600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b	2018-02-08 15:29:24 -08:00
Brad Lewis	af07368986	OpenZFS 8592 - ZFS channel programs - rollback Authored by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> ZFS channel programs should be able to perform a rollback. OpenZFS-issue: https://www.illumos.org/issues/8592 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6	2018-02-08 15:29:14 -08:00
Chris Williamson	475eca4908	OpenZFS 8605 - zfs channel programs fix zfs.exists Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> zfs.exists() in channel programs doesn't return any result, and should have a man page entry. This patch corrects zfs.exists so that it returns a value indicating if the dataset exists or not. It also adds documentation about it in the man page. OpenZFS-issue: https://www.illumos.org/issues/8605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1e85e111	2018-02-08 15:28:52 -08:00
Chris Williamson	d99a015343	OpenZFS 7431 - ZFS Channel Programs Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Don Brady <don.brady@delphix.com> Ported-by: John Kennedy <john.kennedy@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7431 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533 Porting Notes: * The CLI long option arguments for '-t' and '-m' don't parse on linux * Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc * Lua implementation is built as its own module (zlua.ko) * Lua headers consumed directly by zfs code moved to 'include/sys/lua/' * There is no native setjmp/longjump available in stock Linux kernel. Brought over implementations from illumos and FreeBSD * The get_temporary_prop() was adapted due to VFS platform differences * Use of inline functions in lua parser to reduce stack usage per C call * Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow	2018-02-08 15:28:18 -08:00
Richard Elling	6b810d04bd	Remove deprecated zfs_arc_p_aggressive_disable zfs_arc_p_aggressive_disable is no more. This PR removes docs and module parameters for zfs_arc_p_aggressive_disable. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Richard Elling <Richard.Elling@RichardElling.com> Closes #7135	2018-02-07 11:54:20 -08:00
Tom Caputi	2b84817f66	Adjust ARC prefetch tunables to match docs Currently, the ARC exposes 2 tunables (zfs_arc_min_prefetch_ms and zfs_arc_min_prescient_prefetch_ms) which are documented to be specified in milliseconds. However, the code actually uses the values as though they were in seconds. This patch adjusts the code to match the names and documentation of the tunables. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7126	2018-02-05 16:57:53 -08:00
bunder2015	405ec516ab	Fix zpool-features(5) large_block inconsistency Large_blocks feature activation was not consistent with man page, which erroneously stated that the feature was active when the recordsize was increased past the stock 128KB. It actually becomes active when data is written to the dataset. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #6275 Closes #7093	2018-01-29 15:10:32 -08:00
LOLi	63f88c12b4	Fix style issues in man pages and commands help * Remove 'zfs snap' from zfs help message (OpenZFS sync) * Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot' * Enforce 80 columns limit in help messages * Remove zfs_disable_dup_eviction from zfs-module-parameters(5) * Expose zfs_scan_max_ext_gap as a kernel module parameter. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7087	2018-01-29 15:05:03 -08:00
Chunwei Chen	522db29275	zpool import -d to specify device path When we know which devices have the pool we are looking for, sometime it's better if we can directly pass those device paths to zpool import instead of letting it to search through all unrelated stuff, which might take a lot of time if you have hundreds of disks. This patch allows option -d <dev_path> to zpool import. You can have multiple pairs of -d <dev_path>, and zpool import will only search through those devices. For example: zpool import -d /dev/sda -d /dev/sdb Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7077	2018-01-26 10:49:46 -08:00
Brian Behlendorf	8fb1ede146	Extend deadman logic The intent of this patch is extend the existing deadman code such that it's flexible enough to be used by both ztest and on production systems. The proposed changes include: * Added a new `zfs_deadman_failmode` module option which is used to dynamically control the behavior of the deadman. It's loosely modeled after, but independant from, the pool failmode property. It can be set to wait, continue, or panic. * wait - Wait for the "hung" I/O (default) * continue - Attempt to recover from a "hung" I/O * panic - Panic the system * Added a new `zfs_deadman_ziotime_ms` module option which is analogous to `zfs_deadman_synctime_ms` except instead of applying to a pool TXG sync it applies to zio_wait(). A default value of 300s is used to define a "hung" zio. * The ztest deadman thread has been re-enabled by default, aligned with the upstream OpenZFS code, and then extended to terminate the process when it takes significantly longer to complete than expected. * The -G option was added to ztest to print the internal debug log when a fatal error is encountered. This same option was previously added to zdb in commit `fa603f82`. Update zloop.sh to unconditionally pass -G to obtain additional debugging. * The FM_EREPORT_ZFS_DELAY event which was previously posted when the deadman detect a "hung" pool has been replaced by a new dedicated FM_EREPORT_ZFS_DEADMAN event. * The proposed recovery logic attempts to restart a "hung" zio by calling zio_interrupt() on any outstanding leaf zios. We may want to further restrict this to zios in either the ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages. Calling zio_interrupt() is expected to only be useful for cases when an IO has been submitted to the physical device but for some reasonable the completion callback hasn't been called by the lower layers. This shouldn't be possible but has been observed and may be caused by kernel/driver bugs. * The 'zfs_deadman_synctime_ms' default value was reduced from 1000s to 600s. * Depending on how ztest fails there may be no cache file to move. This should not be considered fatal, collect the logs which are available and carry on. * Add deadman test cases for spa_deadman() and zio_wait(). * Increase default zfs_deadman_checktime_ms to 60s. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6999	2018-01-25 13:40:38 -08:00
Sean Eric Fagan	43cb30b3ce	OpenZFS 8959 - Add notifications when a scrub is paused or resumed Authored by: Sean Eric Fagan <sef@ixsystems.com> Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: - Brought #defines in eventdefs.h in line with ZFS on Linux format. - Updated zfs-events.5 with the new events. OpenZFS-issue: https://www.illumos.org/issues/8959 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eea Closes #7049	2018-01-17 10:31:00 -08:00
DeHackEd	d658b2caa9	Remove l2arc_nocompress from zfs-module-parameters(5) Parameter was removed in `d3c2ae1c08` (OpenZFS 6950 - ARC should cache compressed data) Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #7043	2018-01-16 10:18:08 -08:00
Yuri Pankov	6df9f8ebd7	OpenZFS 8899 - zpool list property documentation doesn't match actual behaviour Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Alexander Pyhalov <alp@rsu.ru> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8899 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b0e142e57d Closes #7032	2018-01-11 13:54:34 -08:00
Yuri Pankov	bcb1a8a25e	OpenZFS 8898 - creating fs with checksum=skein on the boot pools fails ungracefully Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8898 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9fa2266d9a Closes #7031	2018-01-11 13:53:04 -08:00
George Amanakis	be54a13c3e	Fix percentage styling in zfs-module-parameters.5 Replace "percent" with "%", add bold to default values. Reviewed-by: bunder2015 <omfgbunder@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #7018	2018-01-09 11:51:11 -08:00
Prakash Surya	2fe61a7ecc	OpenZFS 8909 - 8585 can cause a use-after-free kernel panic Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Prakash Surya <prakash.surya@delphix.com> PROBLEM ======= There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race is present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnecessary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`. SOLUTION ======== The solution to this, is to ensure all outstanding lwb's complete before calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch accomplishes this goal by forcing the normal `zil_commit` logic when called from `zil_sync`. Now, `zil_suspend` will call `zil_commit_impl` which will always use the normal logic of waiting/issuing lwb's to disk before it returns. As a result, any lwb's outstanding when `zil_commit_impl` is called will be guaranteed to reach the `LWB_STATE_DONE` state by the time it returns. Further, no new lwb's will be created via `zil_commit` since the zilog's `zl_suspend` flag will be set. This will force all new callers of `zil_commit` to use `txg_wait_synced` instead of creating and issuing new lwb's. Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is called will be in the `LWB_STATE_DONE` state, and we'll avoid this race condition. OpenZFS-issue: https://www.illumos.org/issues/8909 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d Closes #6940	2017-12-28 10:18:04 -08:00
Simon Guest	993669a7bf	vdev_id: new slot type ses This extends vdev_id to support a new slot type, ses, for SCSI Enclosure Services. With slot type ses, the disk slot numbers are determined by using the device slot number reported by sg_ses for the device with matching SAS address, found by querying all available enclosures. This is primarily of use on systems with a deficient driver omitting support for bay_identifier in /sys/devices. In my testing, I found that the existing slot types of port and id were not stable across disk replacement, so an alternative was required. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Simon Guest <simon.guest@tesujimath.org> Closes #6956	2017-12-20 09:42:07 -08:00
DeHackEd	1c68856bca	zpool(8): Fix "zpool import -t" Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: DHE <git@dehacked.net> Closes #6894	2017-11-28 11:10:52 -06:00

... 3 4 5 6 7 ...

827 Commits