Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Serapheim Dimitropoulos	93e28d661e	Log Spacemap Project = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent `db587941c5` Update vdev_is_spacemap_addressable() for new spacemap encoding `419ba59145` Simplify spa_sync by breaking it up to smaller functions `8dc2197b7b` Factor metaslab_load_wait() in metaslab_load() `b194fab0fb` Rename range_tree_verify to range_tree_verify_not_present `df72b8bebe` Change target size of metaslabs from 256GB to 16GB `c853f382db` zdb -L should skip leak detection altogether `21e7cf5da8` vs_alloc can underflow in L2ARC vdevs `7558997d2f` Simplify log vdev removal code `6c926f426a` Get rid of space_map_update() for ms_synced_length `425d3237ee` Introduce auxiliary metaslab histograms `928e8ad47d` Error path in metaslab_load_impl() forgets to drop ms_sync_lock `8eef997679` = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8442	2019-07-16 10:11:49 -07:00
Antonio Russo	df834a7ccc	Enable zfs-mount-generator by default Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #8750 Closes #8848	2019-07-15 16:31:57 -07:00
Antonio Russo	f88d069cbb	systemd encryption key support Modify zfs-mount-generator to produce a dependency on new zfs-import-key-*.service units, dynamically created at boot to call zfs load-key for the encryption root, before attempting to mount any encrypted datasets. These units are created by zfs-mount-generator, and RequiresMountsFor on the keyfile, if present, or call systemd-ask-password if a passphrase is requested. This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and @rlaager, as well an adaptation of @rlaager's script to retry on incorrect password entry. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #8750 Closes #8848	2019-07-15 16:31:47 -07:00
Brian Behlendorf	e5db313494	Linux 5.0 compat: SIMD compatibility Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0 and newer kernels. This is accomplished by leveraging the fact that by definition dedicated kernel threads never need to concern themselves with saving and restoring the user FPU state. Therefore, they may use the FPU as long as we can guarantee user tasks always restore their FPU state before context switching back to user space. For the 5.0 and 5.1 kernels disabling preemption and local interrupts is sufficient to allow the FPU to be used. All non-kernel threads will restore the preserved user FPU state. For 5.2 and latter kernels the user FPU state restoration will be skipped if the kernel determines the registers have not changed. Therefore, for these kernels we need to perform the additional step of saving and restoring the FPU registers. Invalidating the per-cpu global tracking the FPU state would force a restore but that functionality is private to the core x86 FPU implementation and unavailable. In practice, restricting SIMD to kernel threads is not a major restriction for ZFS. The vast majority of SIMD operations are already performed by the IO pipeline. The remaining cases are relatively infrequent and can be handled by the generic code without significant impact. The two most noteworthy cases are: 1) Decrypting the wrapping key for an encrypted dataset, i.e. `zfs load-key`. All other encryption and decryption operations will use the SIMD optimized implementations. 2) Generating the payload checksums for a `zfs send` stream. In order to avoid making any changes to the higher layers of ZFS all of the `*_get_ops()` functions were updated to take in to consideration the calling context. This allows for the fastest implementation to be used as appropriate (see kfpu_allowed()). The only other notable instance of SIMD operations being used outside a kernel thread was at module load time. This code was moved in to a taskq in order to accommodate the new kernel thread restriction. Finally, a few other modifications were made in order to further harden this code and facilitate testing. They include updating each implementations operations structure to be declared as a constant. And allowing "cycle" to be set when selecting the preferred ops in the kernel as well as user space. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8754 Closes #8793 Closes #8965	2019-07-12 09:31:20 -07:00
loli10K	1d20b763bb	zfs send does not handle invalid input gracefully Due to some changes introduced in `30af21b` 'zfs send' can crash when provided with invalid inputs: this change attempts to add more checks to the affected code paths. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9001	2019-07-08 15:10:23 -07:00
loli10K	3b5fe2c351	Fix zfs "redact" misc issues * zfs redact error messages do not end with newline character * `30af21b0` inadvertently removed some ZFS_PROP comments * man/zfs: zfs redact <redaction_snapshot> is not optional Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8988	2019-07-05 16:38:17 -07:00
Mike Gerdts	341166c843	OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocks When a volume is created in a pool with raidz vdevs and volblocksize != 128k, the volume can reference more space than is reserved with the automatically calculated refreservation. There are two deficiencies in vol_volsize_to_reservation that contribute to this: 1) Skip blocks may be added to keep each allocation a multiple of parity + 1. This is the dominating factor when volblocksize is close to 2^ashift. 2) raidz deflation for 128 KB blocks is different for most other block sizes. See "The theory of raidz space accounting" comment in libzfs_dataset.c for a full explanation. Authored by: Mike Gerdts <mike.gerdts@joyent.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Kody Kantor <kody.kantor@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Mike Gerdts <mike.gerdts@joyent.com> Porting Notes: * ZTS: wait for zvols to exist before writing * ZTS: use log_must_busy with {zpool\|zfs} destroy OpenZFS-issue: https://www.illumos.org/issues/9318 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0 Closes #8973	2019-07-05 15:35:15 -07:00
Tom Caputi	765d1f0644	Add 'zfs umount -u' for encrypted datasets This patch adds the ability for the user to unload keys for datasets as they are being unmounted. This is analogous to 'zfs mount -l'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes: #8917 Closes: #8952	2019-06-28 12:38:37 -07:00
Paul Dagnelie	3fab4d9e08	zdb -vvvvv on ztest pool dies with "out of memory" ztest creates some extremely large files as part of its operation. When zdb tries to dump a large enough file, it can run out of memory or spend an extremely long time attempting to print millions or billions of uint64_ts. We cap the amount of data from a uint64 object that we are willing to read and print. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-53814 Closes #8947	2019-06-25 12:50:37 -07:00
loli10K	5279ae918b	Redacted Send/Receive causes zdb to dump core When used with verbosity >= 4 zdb fails an assertion in dump_bookmarks() because it expects snprintf() to retun 0 on success. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8948	2019-06-24 18:06:26 -07:00
Matthew Ahrens	59ec30a329	Remove code for zfs remap The "zfs remap" command was disabled by `6e91a72fe3`, because it has little utility and introduced some tricky bugs. This commit removes the code for it, the associated ZFS_IOC_REMAP ioctl, and tests. Note that the ioctl and property will remain, but have no functionality. This allows older software to fail gracefully if it attempts to use these, and avoids a backwards incompatibility that would be introduced if we renumbered the later ioctls/props. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8944	2019-06-24 16:44:01 -07:00
Brian Behlendorf	8f12a4f8d2	Fix out-of-tree build failures Resolve the incorrect use of srcdir and builddir references for various files in the build system. These have crept in over time and went unnoticed because when building in the top level directory srcdir and builddir are identical. With this change it's again possible to build in a subdirectory. $ mkdir obj $ cd obj $ ../configure $ make Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8921 Closes #8943	2019-06-24 09:32:47 -07:00
Don Brady	a9cd8bfde7	Let zfs mount all tolerate in-progress mounts The zfs-mount service can unexpectedly fail to start when zfs encounters a mount that is in progress. This service uses zfs mount -a, which has a window between the time it checks if the dataset was mounted and when the actual mount (via mount.zfs binary) occurs. The reason for the racing mounts is that both zfs-mount.target and zfs-share.target are allowed to execute concurrently after the import. This is more of an issue with the relatively recent addition of parallel mounting, and we should consider serializing the mount and share targets. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #8881	2019-06-22 16:41:21 -07:00
Allan Jude	fb6e6f1ffb	zstreamdump: add per-record-type counters and an overhead counter Count the bytes of payload for each replication record type Count the bytes of overhead (replication records themselves) Include these counters in the output summary at the end of the run. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Sponsored-By: Klara Systems and Catalogic Closes #8432	2019-06-22 16:33:44 -07:00
loli10K	3976fd65d3	Redacted Send/Receive broke zfs(8) help message Since `30af21b0` was merged 'zfs send' help message format is broken and lists "-r" as a valid option: this commit corrects these small issues. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8942	2019-06-21 09:38:15 -07:00
Matthew Ahrens	050d720c43	Remove dedupditto functionality If dedup is in use, the `dedupditto` property can be set, causing ZFS to keep an extra copy of data that is referenced many times (>100x). The idea was that this data is more important than other data and thus we want to be really sure that it is not lost if the disk experiences a small amount of random corruption. ZFS (and system administrators) rely on the pool-level redundancy to protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin doesn't have control over what data will be offered extra redundancy by dedupditto, this extra redundancy is not very useful. The bulk of the data is still vulnerable to loss based on the pool-level redundancy. For example, if particle strikes corrupt 0.1% of blocks, you will either be saved by mirror/raidz, or you will be sad. This is true even if dedupditto saved another 0.01% of blocks from being corrupted. Therefore, the dedupditto functionality is rarely enabled (i.e. the property is rarely set), and it fulfills its promise of increased redundancy even more rarely. Additionally, this feature does not work as advertised (on existing releases), because scrub/resilver did not repair the extra (dedupditto) copy (see https://github.com/zfsonlinux/zfs/pull/8270). In summary, this seldom-used feature doesn't work, and even if it did it wouldn't provide useful data protection. It has a non-trivial maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270). We should remove the dedupditto functionality. For backwards compatibility with the existing CLI, "zpool set dedupditto" will still "succeed" (exit code zero), but won't have any effect. For backwards compatibility with existing pools that had dedupditto enabled at some point, the code will still be able to understand dedupditto blocks and free them when appropriate. However, ZFS won't write any new dedupditto blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Issue #8270 Closes #8310	2019-06-19 14:54:02 -07:00
Michael Niewöhner	0b755ec3d5	Fix memory leak in check_disk() Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #8897 Closes #8911	2019-06-19 11:53:37 -07:00
Paul Dagnelie	30af21b025	Implement Redacted Send/Receive Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7958	2019-06-19 09:48:12 -07:00
Matthew Ahrens	b8738257c2	make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865	2019-06-10 11:48:42 -07:00
Eli Schwartz	215e4fe4d2	arc_summary: prefer python3 version and install when there is no python This matches the behavior of other python scripts, such as arcstat and dbufstat, which are always installed but whose install-exec-hook actions will simply touch up the shebang if a python interpreter was configured and that interpreter is a python2 interpreter. Fixes installation in a minimal build chroot without python available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@freqlabs.com> Signed-off-by: Eli Schwartz <eschwartz@archlinux.org> Closes #8851	2019-06-10 09:08:53 -07:00
Ryan Moeller	1a132f0638	Make Python detection optional and more portable Previously, --without-python would cause ./configure to fail. Now it is able to proceed, and the Python scripts will not be built. Use portable parameter expansion matching instead of nonstandard substring matching to detect the Python version. This test is duplicated in several places, so define a function for it. Don't assume the full path to binaries, since different platforms do install things in different places. Use AC_CHECK_PROGS instead. When building without Python, also build without pyzfs. Sponsored by: iXsystems, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Eli Schwartz <eschwartz93@gmail.com> Signed-off-by: Ryan Moeller <ryan@freqlabs.com> Closes #8809 Closes #8731	2019-06-04 18:05:46 -07:00
Josh Soref	46df7e6cc9	grammar: it is / plural agreement Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> Closes #8818	2019-05-28 15:58:32 -07:00
Ryan Moeller	227d379385	Update comments to match code s/get_vdev_spec/make_root_vdev The former doesn't exist anymore. Sponsored by: iXsystems, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Ryan Moeller <ryan@freqlabs.com> Closes #8759	2019-05-28 15:18:31 -07:00
loli10K	0869b74a1e	Endless loop in zpool_do_remove() on platforms with unsigned char On systems where "char" is an unsigned type the value returned by getopt() will never be negative (-1), leading to an endless loop: this issue prevents both 'zpool remove' and 'zstreamdump' for working on some systems. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8789	2019-05-28 11:14:58 -07:00
Tom Caputi	8e3c3ed1b3	Disable parallel processing for 'zfs mount -l' Currently, 'zfs mount -a' will always attempt to parallelize work related to mounting as best it can. Unfortunately, when the user passes the '-l' option to load keys, this causes all threads to prompt the user for their keys at once, resulting in a confusing and racy user experience. This patch simply disables parallel mounting when using the '-l' flag. Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8762 Closes #8811	2019-05-25 13:46:32 -07:00
loli10K	43977948aa	zpool: status -t is not documented in help message This commit adds the undocumented "-t" option to zpool(8) help message. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8782	2019-05-24 14:16:00 -07:00
loli10K	62b2152eca	zfs: missing newline character in zfs_do_channel_program() error message This commit simply adds a missing newline ("\n") character to the error message printed by the zfs command when the provided pool parameter can't be found. Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8783	2019-05-24 13:54:36 -07:00
loli10K	e55db32ad0	zpool: trim -p is not a valid option This commit removes the documented but not handled "-p" option from zpool(8) help message. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8781	2019-05-24 09:40:46 -07:00
Alexander Motin	4bb40c1c82	Fix dataset name comparison in zfs_compare() The code never returned match comparing two datasets (not snapshots). As result, uu_avl_find(), called from zfs_callback(), never succeeded, allowing to add same dataset into the list multiple times, for example: # zfs get name pers pers pers@z pers@z NAME PROPERTY VALUE SOURCE pers name pers - pers name pers - pers@z name pers@z - With the patch: # zfs get name pers pers pers@z pers@z NAME PROPERTY VALUE SOURCE pers name pers - pers@z name pers@z - Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8723	2019-05-08 16:42:39 -07:00
JMoVS	e2dddb7e58	Fix typesetting of Errata #4 Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Justin Scholz <git@justinscholz.de> Closes #8712 Closes #8721	2019-05-08 16:04:45 -07:00
JMoVS	b3b60984ee	Clearer wording on Errata #4 Users of existing pools, especially pools with top-level encrypted datasets, could run into trouble trying to work around Errata #4. Clarify that removing encrypted snapshots and bookmarks is enough to clear the errata. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Justin Scholz <git@justinscholz.de> Closes #8682 Closes #8683	2019-05-02 16:52:57 -07:00
Tom Caputi	85bdc68401	Fix estimated scrub completion time Currently, it is possible for the 'zpool scrub' command to progress slightly beyond 100% due to concurrent changes happening on the live pool. This behavior is expected, but the userspace code for 'zpool status' would subtract the expected amount of data from the amount of data already scrubbed, resulting in a negative integer being casted to a large positive one. This number was then used to calculate the estimated completion time, resulting in wildly wrong results. This code changes the behavior so that 'zpool status' does not attempt to report an estimate during this period. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8611 Closes #8687	2019-05-01 17:34:24 -07:00
Tomohiro Kusumi	2a15c00f89	Use sigaction(2) instead of sigignore(3) for portability sigignore(3) isn't portable. This code fails to compile on platforms without sigignore(3). Use sigaction(2). -- zfs_main.c: In function 'zfs_do_diff': zfs_main.c:7178:9: error: implicit declaration of function 'sigignore' [-Werror=implicit-function-declaration] (void) sigignore(SIGPIPE); ^~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8593	2019-04-30 20:46:15 -07:00
TerraTech	50478c6dad	Add option [-V\|--version] to emit version string Add the 'zfs version' and 'zpool version' subcommands to display the version of the user space utilities and loaded zfs kernel module. For example: $ zfs version zfs-0.8.0-rc3_169_g67e0366b88 zfs-kmod-0.8.0-rc3_169_g67e0366b88 The '-V' and '--version' aliases were added to support the common convention of using 'zfs --version` to obtain the version information. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: TerraTech <1118433+TerraTech@users.noreply.github.com> Closes #2501 Closes #8567	2019-04-16 12:24:06 -07:00
Tom Caputi	7dcd318832	Cleanup nits from `ab7615d92` This patch simply up cleans up a nit and corrects an error message issue that were introduced in the Multiple DVA scrub patch. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8619	2019-04-14 11:03:06 -07:00
Brian Behlendorf	c375c69eca	Fix 'zfs list -t snapshot' depth Commit `df583073` introduced the ability to list the snapshots for a specified dataset. This change inadvertently resulted in only the top- level snapshots being listed when no dataset was specified. Fix this issue by adding an additional check to determine if a dataset was provided to avoid incorrectly restricting the depth. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8591 Closes #8594	2019-04-08 09:14:45 -07:00
Sara Hartse	a887d653b3	Restrict kstats and print real pointers There are several places where we use zfs_dbgmsg and %p to print pointers. In the Linux kernel, these values obfuscated to prevent information leaks which means the pointers aren't very useful for debugging crash dumps. We decided to restrict the permissions of dbgmsg (and some other kstats while we were at it) and print pointers with %px in zfs_dbgmsg as well as spl_dumpstack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: sara hartse <sara.hartse@delphix.com> Closes #8467 Closes #8476	2019-04-04 18:57:06 -07:00
Tom Caputi	df583073eb	Do not iterate through filesystems unnecessarily Currently, when attempting to list snapshots ZFS may do a lot of extra work checking child datasets. This is because the code does not realize that it will not be able to reach any snapshots contained within snapshots that are at the depth limit since the snapshots of those datasets are counted as an additional layer deeper. This patch corrects this issue. In addition, this patch adds the ability to do perform the commands: $ zfs list -t snapshot <dataset> $ zfs get -t snapshot <prop> <dataset> as a convenient way to list out properties of all snapshots of a given dataset without having to use the depth limit. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8539	2019-04-01 11:58:59 -07:00
Brian Behlendorf	1b939560be	Add TRIM support UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com> Contributions-by: Tim Chase <tim@chase2k.com> Contributions-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8419 Closes #598	2019-03-29 09:13:20 -07:00
Olaf Faaland	060f0226e6	MMP interval and fail_intervals in uberblock When Multihost is enabled, and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, is not enough information. If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but does not depend on ub_mmp_delay: zfs_multihost_fail_intervals * zfs_multihost_interval and if zfs_multihost_fail_intervals == 0 on that node, there is no such well defined safe duration, but the importing host cannot tell whether mmp_delay is high due to I/O delays, or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary. This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. It calculates the activity test duration differently depending on whether the new fields are present or not; for importing pools with only ub_mmp_delay, it uses (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals Which results in an activity test duration less sensitive to the leaf count. In addition, it makes a few other improvements: * It updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner, when the pool is idle, as it is not limited to the granularity of ub_timestamp (1 second). * It issues writes immediately when zfs_multihost_interval is changed so remote hosts see the updated value as soon as possible. * It fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension. * Update tests to verify activity check duration is based on recorded tunable values, not tunable values on importing host. * Update tests to verify the expected number of uberblocks have valid MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number), that sequence number is incrementing, and that uberblock values match tunable settings. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7842	2019-03-21 12:47:57 -07:00
Brian Behlendorf	066da71e7f	Improve `zpool labelclear` 1) As implemented the `zpool labelclear` command overwrites the calculated offsets of all four vdev labels even when only a single valid label is found. If the device as been re-purposed but still contains a valid label this can result in space no longer owned by ZFS being zeroed. Prevent this by verifying every label removed is intact before it's overwritten. 2) Address a small bug in zpool_do_labelclear() which prevented labelclear from working on file vdevs. Only block devices support BLKFLSBUF, try the ioctl() but when it's reported as unsupported this should not be fatal. 3) Fix `zpool labelclear` so it can be run on vdevs which were removed from the pool with `zpool remove`. Additionally, allow intact but partial labels to be cleared as in the case of a failed `zpool attach` or `zpool replace`. 4) Remove LABELCLEAR and LABELREAD variables for test cases. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8500 Closes #8373 Closes #6261	2019-03-21 10:13:01 -07:00
Tom Caputi	ab7615d92c	Multiple DVA Scrubbing Fix Currently, there is an issue in the sequential scrub code which prevents self healing from working in some cases. The scrub code will split up all DVA copies of a bp and issue each of them separately. The problem is that, since each of the DVAs is no longer associated with the others, the self healing code doesn't have the opportunity to repair problems that show up in one of the DVAs with the data from the others. This patch fixes this issue by ensuring that all IOs issued by the sequential scrub code include all DVAs. Initially, only the first DVA of each is attempted. If an issue arises, the IO is retried with all available copies, giving the self healing code a chance to correct the issue. To test this change, this patch also adds the ability for zinject to specify individual DVAs to inject read errors into. We then add a new test case that utilizes this functionality to ensure scrubs and self-healing reads can handle and transparently fix issues with individual copies of blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8453	2019-03-15 14:14:31 -07:00
Tom Caputi	eaed840542	Better user experience for errata 4 This patch attempts to address some user concerns that have arisen since errata 4 was introduced. * The errata warning has been made less scary for users without any encrypted datasets. * The errata warning now clears itself without a pool reimport if the bookmark_v2 feature is enabled and no encrypted datasets exist. * It is no longer possible to create new encrypted datasets without enabling the bookmark_v2 feature, thus helping to ensure that the errata is resolved. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue ##8308 Closes #8504	2019-03-14 16:48:30 -07:00
kpande	98310e5d1a	Update commented zed.rc values to defaults Update zed.rc values reflect their default value. This helps avoid confusion if a user expects functionality to be enabled. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Kash Pande <kash@tripleback.net> Closes #8498	2019-03-14 09:53:34 -07:00
Tom Caputi	5cc9ba5cf0	Make zstreamdump -v more greppable Currently, the verbose output of zstreamdump includes new line characters within some individual records. Presumably, this was originally done to keep the output from getting too wide to fit on a terminal. However, since new flags and struct members have been added, these rules have not been maintained consistently. In addition, these newlines can make it hard to grep the output in some scenarios. This patch simply removes these newlines, making the output easier to grep and removing the inconsistency. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8493	2019-03-13 11:19:23 -07:00
Tom Caputi	f00ab3f22c	Detect and prevent mixed raw and non-raw sends Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know about how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset. This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem. This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8308	2019-03-13 11:00:43 -07:00
jwittlincohen	146bdc414c	Fix typo in arc_summary3 This is a simple fix for a typo ("perfetch" rather than "prefetch") in arc_summary3. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Jason Cohen <jwittlincohen@gmail.com> Closes #8499	2019-03-13 10:43:55 -07:00
Alek P	4c0883fb4a	Avoid retrieving unused snapshot props This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8077	2019-03-12 13:13:22 -07:00
Olaf Faaland	8133679ff0	Do not resume a pool if multihost is enabled When multihost is enabled, and a pool is suspended, return EINVAL in response to "zpool clear <pool>". The pool may have been imported on another host while I/O was suspended. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6933 Closes #8460	2019-02-28 17:56:19 -08:00
Allan Jude	d6838ae649	zstreamdump: include embedded writes when dumping raw data (-d) When feeding a replication stream to `zstreamdump -d` (raw dump mode), it does not print the raw data for DRR_WRITE_EMBEDDED records. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #8430	2019-02-27 17:55:25 -08:00
Matthew Ahrens	c568ab8d99	zfs.8 has wrong description of "zfs program -t" The "-t" argument to "zfs program" specifies a limit on the number of LUA instructions that can be executed. The zfs.8 manpage has the wrong description. It should be updated to match what's in zfs-program.8 Also fix the formatting of the zfs help message. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8410	2019-02-26 11:15:28 -08:00
Paul Zuchowski	9c5e88b1de	zfs should optionally send holds Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7513	2019-02-15 12:41:38 -08:00
Igor K	cf89a4ec9d	zdb: replace label_t to zdb_label_t for reduce collisions with builds on illumos based platform we can see build issue because label_t has been redefined. for reduce build issues on others platforms we should rename label_t to zdb_label_t. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #8397	2019-02-13 11:28:36 -08:00
Serapheim Dimitropoulos	425d3237ee	Get rid of space_map_update() for ms_synced_length Initially, metaslabs and space maps used to be the same thing in ZFS. Later, we started differentiating them by referring to the space map as the on-disk state of the metaslab, making the metaslab a higher-level concept that is metadata that deals with space accounting. Today we've managed to split that code furthermore, with the space map being its own on-disk data structure used in areas of ZFS besides metaslabs (e.g. the vdev-wide space maps used for zpool checkpoint or vdev removal features). This patch refactors the space map code to further split the space map code from the metaslab code. It does so by getting rid of the idea that the space map can have a different in-core and on-disk length (sm_length vs smp_length) which is something that is only used for the metaslab code, and other consumers of space maps just have to deal with. Instead, this patch introduces changes that move the old in-core length of the metaslab's space map to the metaslab structure itself (see ms_synced_length field) while making the space map code only care about the actual space map's length on-disk. The result of this is that space map consumers no longer have to deal with syncing two different lengths for the same structure (e.g. space_map_update() goes away) while metaslab specific behavior stays within the metaslab code. Specifically, the ms_synced_length field keeps track of the amount of data metaslab_load() can read from the metaslab's space map while working concurrently with metaslab_sync() that may be appending to that same space map. As a side note, the patch also adds a few comments around the metaslab code documenting some assumptions and expected behavior. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8328	2019-02-12 10:38:11 -08:00
bunder2015	bf6ca0a631	shellcheck pass note: which is non-standard. Use builtin 'command -v' instead. [SC2230] note: Use -n instead of ! -z. [SC2236] Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #8367	2019-02-04 09:07:19 -08:00
Tony Hutter	57dc41de96	Fix zpool iostat -w header names The zpool iostat latency histograms (-w) has column names 'sync_queue' and 'async_queue', which do not match the man page, nor the equivalent columns in average latency. Change the column names to be 'syncq_wait' and 'asyncq_wait' to be consistent. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8338	2019-01-31 10:51:18 -08:00
Serapheim Dimitropoulos	21e7cf5da8	zdb -L should skip leak detection altogether Currently the point of -L option in zdb is to disable leak tracing and the loading of space maps because they are expensive, yet still do leak detection in terms of space. Unfortunately, there is a scenario where this is a lie. If we are using zdb -L on a pool where a vdev is being removed, zdb_claim_removing() will open the metaslab space maps of that device. This patch makes it so zdb -L skips leak detection altogether and ensures that no space maps are loaded. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8335	2019-01-30 09:54:27 -08:00
Tony Hutter	caacc6e4c4	GCC 9.0: Fix ztest "directive argument is not a nul-terminated string" GCC 9.0 is complaining because we're trying to print strings that are defined like this: .zo_pool = { 'z', 't', 'e', 's', 't', '\0' }, Fix them by making them actual strings. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8330	2019-01-28 10:11:45 -08:00
Serapheim Dimitropoulos	df72b8bebe	Rename range_tree_verify to range_tree_verify_not_present The range_tree_verify function looks for a segment in a range tree and panics if the segment is present on the tree. This patch gives the function a more descriptive name. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8327	2019-01-25 09:51:24 -08:00
loli10K	7646af20ad	zfs userspace dumps core when used on ZVOLs If you try to get the userspace, groupspace or projectspace on a ZVOL, the generated error results in passing EINVAL to zfs_standard_error_fmt() when we should return a specific error to inform the user that those properties aren't available on volumes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8279	2019-01-25 09:47:52 -08:00
Damian Wojsław	8fccfa8e17	zpool iostat should print headers when terminal fills When `zpool iostat` fills the terminal the headers should be printed again. `zpool iostat -n` can be used to suppress this. If the command is not attached to a tty, headers will not be printed so as to not break existing scripts. Reviewed-by: Joshua M. Clulow <josh@sysmgr.org> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Damian Wojsław <damian@wojslaw.pl> Closes #8235 Closes #8262	2019-01-23 13:29:49 -08:00
Serapheim Dimitropoulos	b194fab0fb	Factor metaslab_load_wait() in metaslab_load() Most callers that need to operate on a loaded metaslab, always call metaslab_load_wait() before loading the metaslab just in case someone else is already doing the work. Factoring metaslab_load_wait() within metaslab_load() makes the later more robust, as callers won't have to do the load-wait check explicitly every time they need to load a metaslab. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8290	2019-01-18 11:10:32 -08:00
Brian Behlendorf	ce5fb2a7c6	ztest: scrub verification By design ztest will never inject non-repairable damage in to the pool. Update the ztest_scrub() test case such that it waits for the scrub to complete and verifies the pool is always repairable. After enabling scrub verification two scenarios were encountered which are the result of how ztest manages failure injection. The first case is straight forward and pertains to detaching a mirror vdev. In this case, the pool must always be scrubbed prior the detach. Failure to do so can potentially lock in previously repairable data corruption by removing all good copies of a block leaving only damaged ones. The second is a little more subtle. The child/offset selection logic in ztest_fault_inject() depends on the calculated number of leaves always remaining constant between injection passes. This is true within a single execution of ztest, but when using zloop.sh random values are selected for each restart. Therefore, when ztest imports an existing pool it must be scrubbed before failure injection can be safely enabled. Otherwise it is possible that it will inject non-repairable damage. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8269	2019-01-18 09:47:55 -08:00
Tom Caputi	305781da4b	Fix error handling incallers of dbuf_hold_level() Currently, the functions dbuf_prefetch_indirect_done() and dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot fail. In the event of an error the former will cause a NULL pointer dereference and the later will trigger a VERIFY. This patch adds error handling to these functions and their callers where necessary. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8291	2019-01-17 15:47:08 -08:00
Brian Behlendorf	52b684236d	ztest: scrub ddt repair The ztest_ddt_repair() test is designed inflict damage to the ddt which can be repairable by a scrub. Unfortunately, this repair logic was broken at some point and it went undetected. This issue is not specific to ztest, but thankfully this extra redundancy is rarely enabled and even more rarely needed. The root cause was identified to be the ddt_bp_create() function called by dsl_scan_ddt_entry() which did not set the dedup bit of the generated block pointer. The consequence of this was that the ZIO_DDT_READ_PIPELINE was never enabled for the block pointer during the scrub, and the dedup ditto repair logic was never run. Note that for demand reads which don't rely on ddt_bp_create() the required pipeline stages would be enabled and the repair performed. This was resolved by unconditionally setting the dedup bit in ddt_bp_create(). This way all codes paths which may need to perform a repair from a block pointer generated from the dtt entry will be able too. The only exception is that the dedup bit is cleared in ddt_phys_free() which is required to avoid leaking space. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8270	2019-01-17 15:25:00 -08:00
Brian Behlendorf	64bdf63f5c	ztest: split block reconstruction Increase the default allowed number of reconstruction attempts. There's not an exact right number for this setting. It needs to be set large enough to cover any realistic failure scenarios and small enough to avoid stalling the IO pipeline and invoking the dead man detection. The current value of 256 was empirically determined to be too low based on multi-day runs of ztest. The fault injection code would inject more damage than could be reconstructed given the relatively small number of attempts. However, in all observed cases the block could be reconstructed using a slightly higher limit. Based on local testing increasing the default value to 4096 was determined to strike the best balance. Checking all combinations takes less than 10s in the worst case, and has so far eliminated the vast majority of false positives detected by ztest. This delay is roughly on par with how long retries may be performed to a misbehaving HDD and was deemed to be reasonable. Better to err on the side of a brief delay rather than fail to reconstruct the data. Lastly, the -Y flag has been added to zdb to make it easy to try all possible combinations when performing split block reconstruction. For badly damaged blocks with 18 splits, they can be fully enumerated within a few minutes. This has been done to ensure permanent errors are never incorrectly reported when ztest verifies the pool with zdb. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8271	2019-01-16 14:10:02 -08:00
Brian Behlendorf	6e91a72fe3	Disable 'zfs remap' command The implementation of 'zfs remap' has proven to be problematic since it modifies the objset (but not its logical contents) by dirtying metadata without owning it. The consequence of which is that dmu_objset_remap_indirects() is vulnerable to certain races. For example, if we are in the middle of receiving into the filesystem while it is being remapped. Then it is possible we could evict the objset when the receive completes (see dsl_dataset_clone_swap_sync_impl, or dmu_recv_end_sync), but dmu_objset_remap_indirects() may be still using the objset. The result of which would be a panic. Extended runs of ztest(8) have exposed other possible races which can occur when using 'zfs remap'. Several of these have been fixed but there may be others which have not yet been encountered and diagnosed. Furthermore, the ability to manually remap a filesystem is no longer particularly useful now that the removal code can map large chunks. Coupled with the fact that explaining what this command does and why it may be useful requires a detailed understanding of the internals of device removal. These are details users should not be bothered with. Therefore, the 'zfs remap' command is being disabled but not entirely removed. It may be removed in the future or potentially reworked to address the issues described above. Since 'zfs remap' has never been part of a tagged release its removal is expected to have minimal impact. The ZTS tests have been updated to continue to exercise the command to prevent atrophy, but it has been removed entirely from ztest(8). Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8238	2019-01-15 15:46:58 -08:00
Brian Behlendorf	d611989fdc	Minor spelling corrections Some minor spelling mistakes and typos. No functional changes. Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8272	2019-01-13 10:11:52 -08:00
Brian Behlendorf	a769fb53a1	Add 'zpool status -i' option Only display the full details of the vdev initialization state in 'zpool status' output when requested with the -i option. By default display '(initializing)' after vdevs when they are being actively initialized. This is consistent with the established precident of appending '(resilvering), etc' and fits within the default 80 column terminal width making it easy to read. Additionally, updated the 'zpool initialize' documentation to make it clear the options are mutually exclusive, but allow duplicate options like all other zfs/zpool commands. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8230	2019-01-07 11:03:18 -08:00
George Wilson	c10d37dd9f	zfs initialize performance enhancements PROBLEM ======== When invoking "zpool initialize" on a pool the command will create a thread to initialize each disk. Unfortunately, it does this serially across many transaction groups which can result in commands taking a long time to return to the user and may appear hung. The same thing is true when trying to suspend/cancel the operation. SOLUTION ========= This change refactors the way we invoke the initialize interface to ensure we can start or stop the intialization in just a few transaction groups. When stopping or cancelling a vdev initialization perform it in two phases. First signal each vdev initialization thread that it should exit, then after all threads have been signaled wait for them to exit. On a pool with 40 leaf vdevs this reduces the vdev initialize stop/cancel time from ~10 minutes to under a second. The reason for this is spa_vdev_initialize() no longer needs to wait on multiple full TXGs per leaf vdev being stopped. This commit additionally adds some missing checks for the passed "initialize_vdevs" input nvlist. The contents of the user provided input "initialize_vdevs" nvlist must be validated to ensure all values are uint64s. This is done in zfs_ioc_pool_initialize() in order to keep all of these checks in a single location. Updated the innvl and outnvl comments to match the formatting used for all other new sytle ioctls. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8230	2019-01-07 11:03:08 -08:00
George Wilson	619f097693	OpenZFS 9102 - zfs should be able to initialize storage devices PROBLEM ======== The first access to a block incurs a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are "thick provisioned", where supported by the platform (VMware). This can create a large delay in getting a new virtual machines up and running (or adding storage to an existing Engine). If the thick provision step is omitted, write performance will be suboptimal until all blocks on the LUN have been written. SOLUTION ========= This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty. When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design: - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...] - start, suspend, or cancel initialization - Creates new open-context thread for each vdev - Thread iterates through all metaslabs in this vdev - Each metaslab: - select a metaslab - load the metaslab - mark the metaslab as being zeroed - walk all free ranges within that metaslab and translate them to ranges on the leaf vdev - issue a "zeroing" I/O on the leaf vdev that corresponds to a free range on the metaslab we're working on - continue until all free ranges for this metaslab have been "zeroed" - reset/unmark the metaslab being zeroed - if more metaslabs exist, then repeat above tasks. - if no more metaslabs, then we're done. - progress for the initialization is stored on-disk in the vdev’s leaf zap object. The following information is stored: - the last offset that has been initialized - the state of the initialization process (i.e. active, suspended, or canceled) - the start time for the initialization - progress is reported via the zpool status command and shows information for each of the vdevs that are initializing Porting notes: - Added zfs_initialize_value module parameter to set the pattern written by "zpool initialize". - Added zfs_vdev_{initializing,removal}_{min,max}_active module options. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: loli10K <ezomori.nozomu@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Signed-off-by: Tim Chase <tim@chase2k.com> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/9102 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb Closes #8230	2019-01-07 10:37:26 -08:00
Brian Behlendorf	6e72a5b9b6	pyzfs: python3 support (build system) Almost all of the Python code in the respository has been updated to be compatibile with Python 2.6, Python 3.4, or newer. The only exceptions are arc_summery3.py which requires Python 3, and pyzfs which requires at least Python 2.7. This allows us to maintain a single version of the code and support most default versions of python. This change does the following: * Sets the default shebang for all Python scripts to python3. If only Python 2 is available, then at install time scripts which are compatible with Python 2 will have their shebangs replaced with /usr/bin/python. This is done for compatibility until Python 2 goes end of life. Since only the installed versions are changed this means Python 3 must be installed on the system for test-runner when testing in-tree. * Added --with-python=<2\|3\|3.4,etc> configure option which sets the PYTHON environment variable to target a specific python version. By default the newest installed version of Python will be used or the preferred distribution version when creating pacakges. * Fixed --enable-pyzfs configure checks so they are run when --enable-pyzfs=check and --enable-pyzfs=yes. * Enabled pyzfs for Python 3.4 and newer, which is now supported. * Renamed pyzfs package to python<VERSION>-pyzfs and updated to install in the appropriate site location. For example, when building with --with-python=3.4 a python34-pyzfs will be created which installs in /usr/lib/python3.4/site-packages/. * Renamed the following python scripts according to the Fedora guidance for packaging utilities in /bin - dbufstat.py -> dbufstat - arcstat.py -> arcstat - arc_summary.py -> arc_summary - arc_summary3.py -> arc_summary3 * Updated python-cffi package name. On CentOS 6, CentOS 7, and Amazon Linux it's called python-cffi, not python2-cffi. For Python3 it's called python3-cffi or python3x-cffi. * Install one version of arc_summary. Depending on the version of Python available install either arc_summary2 or arc_summary3 as arc_summary. The user output is only slightly different. Reviewed-by: John Ramsden <johnramsden@riseup.net> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8096	2019-01-06 10:39:41 -08:00
Brian Behlendorf	65ca2c1eb9	Fix 'zpool remap' freeing race The dmu_objset_remap_indirects_impl() logic depends on dnode_hold() returning ENOENT for dnodes which will be freed and should be skipped. This behavior can only be relied upon when taking a new hold and while the caller has an open transaction. This ensures that the open txg cannot advance and that a concurrent free will end up in the same txg (which is critical). Relying on an existing hold will not prevent dnode_free() from succeeding. The solution is to take an additional dnode_hold() after assigning the transaction. This ensures the remap will never dirty the dnode if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT). Randomly set zfs_object_remap_one_indirect_delay_ms in ztest. This increases the likelihood of an operation racing with the remap. Converted from ticks to milliseconds. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8215	2019-01-02 11:46:04 -08:00
Tony Hutter	c66401fae0	Add enclosure_symlinks option to vdev_id Add an 'enclosure_symlinks' option to vdev_id.conf. This creates consistently named symlinks to the enclosure devices (/dev/sg*) based off the configuration in vdev_id.conf. The enclosure symlinks show up in /dev/by-enclosure/<prefix>-<channel><num>. The links make it make it easy to run sg_ses on a particular enclosure device. The enclosure links are created in addition to the normal /dev/disk/by-vdev links. 'enclosure_symlinks' is only valid in sas_direct configurations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Simon Guest <simon.guest@tesujimath.org> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8194	2018-12-14 17:27:49 -08:00
Brian Behlendorf	0dd6b6bfcb	ztest: ENOSPC in ztest_objset_destroy_cb() While unlikely it is possible for dsl_destroy_head() to return ENOSPC in the ztest_objset_destroy_cb(). This can occur even when ZFS_SPACE_CHECK_DESTROY is used with the dsl_sync_task(). Both the existence of a checkpoint and pending deferred frees can cause this. Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8206	2018-12-14 10:03:05 -08:00
Prakash Surya	900d09b285	OpenZFS 9962 - zil_commit should omit cache thrash As a result of the changes made in 8585, it's possible for an excessive amount of vdev flush commands to be issued under some workloads. Specifically, when the workload consists of mostly async write activity, interspersed with some sync write and/or fsync activity, we can end up issuing more flush commands to the underlying storage than is actually necessary. As a result of these flush commands, the write latency and overall throughput of the pool can be poorly impacted (latency increases, throughput decreases). Currently, any time an lwb completes, the vdev(s) written to as a result of that lwb will be issued a flush command. The intenion is so the data written to that vdev is on stable storage, prior to communicating to any waiting threads that their data is safe on disk. The problem with this scheme, is that sometimes an lwb will not have any threads waiting for it to complete. This can occur when there's async activity that gets "converted" to sync requests, as a result of calling the zil_async_to_sync() function via zil_commit_impl(). When this occurs, the current code may issue many lwbs that don't have waiters associated with them, resulting in many flush commands, potentially to the same vdev(s). For example, given a pool with a single vdev, and a single fsync() call that results in 10 lwbs being written out (e.g. due to other async writes), that will result in 10 flush commands to that single vdev (a flush issued after each lwb write completes). Ideally, we'd only issue a single flush command to that vdev, after all 10 lwb writes completed. Further, and most important as it pertains to this change, since the flush commands are often very impactful to the performance of the pool's underlying storage, unnecessarily issuing these flush commands can poorly impact the performance of the lwb writes themselves. Thus, we need to avoid issuing flush commands when possible, in order to acheive the best possible performance out of the pool's underlying storage. This change attempts to address this problem by changing the ZIL's logic to only issue a vdev flush command when it detects an lwb that has a thread waiting for it to complete. When an lwb does not have threads waiting for it, the responsibility of issuing the flush command to the vdevs involved with that lwb's write is passed on to the "next" lwb. It's only once a write for an lwb with waiters completes, do we issue the vdev flush command(s). As a result, now when we issue the flush(s), we will issue them to the vdevs involved with that specific lwb's write, but potentially also to vdevs involved with "previous" lwb writes (i.e. if the previous lwbs did not have waiters associated with them). Thus, in our prior example with 10 lwbs, it's only once the last lwb completes (which will be the lwb containing the waiter for the thread that called fsync) will we issue the vdev flush command; all of the other lwbs will find they have no waiters, so they'll pass the responsibility of the flush to the "next" lwb (until reaching the last lwb that has the waiter). Porting Notes: * Reconciled conflicts with the fastwrite feature. Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9962 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6 Closes #8188	2018-12-07 11:09:42 -08:00
Tom Caputi	e3c85c0938	Move assert in dump_dir() in zdb This one line patch moves an assert in the function dump_dir() below an error check that ensures it ran correctly. This ensures zdb dumps the error that actually caused the problem, as opposed to one of its symptoms. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8171	2018-12-05 09:30:28 -08:00
Brian Behlendorf	c5eea0ab9c	Fix 'zpool list -v' alignment The verbose output of 'zpool list' was not correctly aligned due to differences in the vdev name lengths. Minimally update the code the correct the alignment using the same strategy employed by 'zpool status'. Missing dashes were added for the empty defaults columns, and the vdev state is now printed for all vdevs. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7308 Closes #8147	2018-12-04 10:17:54 -08:00
Tom Caputi	0b606cb33f	Fix ztest deadlock in ztest_zil_remount() This patch fixes a small race condition in ztest_zil_remount() that could result in a deadlock. ztest_device_removal() calls spa_vdev_remove() which may eventually call spa_reset_logs(). If ztest_zil_remount() attempts to call zil_close() while this is happening, it may fail when it asserts !zilog_is_dirty(zilog). This patch simply adds locking to correct the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8154	2018-12-04 09:43:31 -08:00
LOLi	bdbd5477bc	Fix ASSERT in zfs_receive_one() This commit fixes the following ASSERT in zfs_receive_one() when receiving a send stream from a root dataset with the "-e" option: $ sudo zfs snap source@snap $ sudo zfs send source@snap \| sudo zfs recv -e destination/recv chopprefix > drrb->drr_toname ASSERT at libzfs_sendrecv.c:3804:zfs_receive_one() Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8121	2018-12-04 09:38:55 -08:00
Tom Caputi	c40a1124e1	Fix consistency of ztest_device_removal_active ztest currently uses the boolean flag ztest_device_removal_active to protect some tests that may not run successfully if they occur at the same time as ztest_device_removal(). Unfortunately, in the event that ztest is in the middle of a device removal when it decides to issue a SIGKILL, the device removal will be automatically restarted (without setting the flag) when the pool is re-imported on the next run. This patch corrects this by ensuring that any in-progress removals are completed before running further tests after the re-import. This patch also makes a few small changes to prevent race conditions involving the creation and destruction of spa->spa_vdev_removal, since this field is not protected by any locks. Some checks that may run concurrently with setting / unsetting this field have been updated to check spa->spa_removing_phys.sr_state instead. The most significant change here is that spa_removal_get_stats() no longer accounts for in-flight work done, since that could result in a NULL pointer dereference. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8105	2018-11-28 20:47:09 -08:00
LOLi	0cd5c941d0	zpool: allow split with whole-disk devices This change allows 'zpool split' to work with whole-disk devices and updates the ZFS Test Suite with a new script to exercise this functionality. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6643 Closes #8133	2018-11-20 10:22:53 -08:00
Sebastien Roy	a10d50f999	OpenZFS 8115 - parallel zfs mount Porting Notes: * Use thread pools (tpool) API instead of introducing taskq interfaces to libzfs. * Use pthread_mutext for locks as mutex_t isn't available. * Ignore alternative libshare initialization since OpenZFS-7955 is not present on zfsonlinux. Authored by: Sebastien Roy <seb@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Authored by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Don Brady <don.brady@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/8115 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3f0e2b569 Closes #8092	2018-11-15 11:33:58 -08:00
loli10K	d48091de81	zed: detect and offline physically removed devices This commit adds a new test case to the ZFS Test Suite to verify ZED can detect when a device is physically removed from a running system: the device will be offlined if a spare is not available in the pool. We implement this by using the existing libudev functionality and without relying solely on the FM kernel module capabilities which have been observed to be unreliable with some kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1537 Closes #7926	2018-11-09 11:17:24 -08:00
Tony Hutter	ad796b8a3b	Add zpool status -s (slow I/Os) and -p (parseable) This patch adds a new slow I/Os (-s) column to zpool status to show the number of VDEV slow I/Os. This is the number of I/Os that didn't complete in zio_slow_io_ms milliseconds. It also adds a new parsable (-p) flag to display exact values. NAME STATE READ WRITE CKSUM SLOW testpool ONLINE 0 0 0 - mirror-0 ONLINE 0 0 0 - loop0 ONLINE 0 0 0 20 loop1 ONLINE 0 0 0 0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7756 Closes #6885	2018-11-08 16:47:24 -08:00
Tom Caputi	a2d88f778a	Fix !zilog_is_dirty() assert during ztest ztest occasionally hits an assert that !zilog_is_dirty() during zil_close(). This is caused by an interaction between 2 threads. First, ztest_run() waits for each test thread to complete and closes the associated dataset as soon as the thread joins. At the same time, the ztest_vdev_add_remove() test may attempt to remove the slog, which will open, dirty, and reset the logs on every dataset in the pool (including those of other threads). This patch simply ensures that we always join all of the test threads before closing any datasets. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8094	2018-11-07 15:46:50 -08:00
Tom Caputi	f44ad9297d	Replay logs before starting ztest workers This patch ensures that logs are replayed on all datasets prior to starting ztest workers. This ensures that the call to vdev_offline() a log device in ztest_fault_inject() will not fail due to the log device being required for replay. This patch also fixes a small issue found during testing where spa_keystore_load_wkey() does not check that the dataset specified is an encryption root. This check was present in libzfs, however. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8084	2018-11-07 15:40:24 -08:00
Brian Behlendorf	09b85f2ded	ztest: reduce gangblock creation In order to validate the gang block code ztest is configured to artificially force a fraction of large blocks to be written as gang blocks. The default setting chosen for this was to write 25% of all blocks 32k or larger using gang blocks. The confluence of an unrealistically large number of gang blocks, the aggressive fault injection done by ztest, and the split segment reconstruction logic introduced by device removal has resulted in the following type of failure: zdb -bccsv -G -d ... exit code 3 Specifically, zdb was unable to open the pool because it was unable to reconstruct a damaged block. Manual investigation of multiple failures clearly showed that the block could be reconstructed. However, due to the large number of damaged segments (>35) it could not be done in the allotted time. Furthermore, the large number of gang blocks was determined to be the reason for the unrealistically large number of damaged segments. In order to make this situation less likely, this change both increases the forced gang block size to 64k and reduces the frequency to 3% of blocks. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8080	2018-11-05 11:53:49 -08:00
Don Brady	e89f1295d4	Add libzutil for libzfs or libzpool consumers Adds a libzutil for utility functions that are common to libzfs and libzpool consumers (most of what was in libzfs_import.c). This removes the need for utilities to link against both libzpool and libzfs. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #8050	2018-11-05 11:22:33 -08:00
Serapheim Dimitropoulos	0a544c174d	zdb -k does not work on Linux when used with -e This minor bug was introduced with the port of the feature from OpenZFS to ZoL. This patch fixes the issue that was caused by a minor re-ordering from the original code. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8001	2018-10-30 11:46:18 -05:00
Gregor Kopka	63a77ae3cf	Added column definitions to arcstat.py grow: ARC Grow enabled (!arc_no_grow) free: ARC Free memory (arc_sys_free) need: ARC Reclaim need (arc_need_free) Fixed alignment issues (mread had wrong width). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregor Kopka <gregor@kopka.net> Closes #8058	2018-10-29 18:18:20 -05:00
Brian Behlendorf	bea7578356	ZTS: Fix auto_replace_001_pos test The root cause of these failures is that udev can notify the ZED of newly created partition before its links are created. Handle this by allowing an auto-replace to briefly wait until udev confirms the links exist. Distill this test case down to its essentials so it can be run reliably. What we need to check is that: 1) A new disk, in the same physical location, is automatically brought online when added to the system, 2) It completes the replacement process, and 3) The pool is now ONLINE and healthy. There is no need to remove the scsi_debug module. After exporting the pool the disk can be zeroed, removed, and then re-added to the system as a new disk. Reviewed by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8051	2018-10-29 15:05:14 -05:00
Brian Behlendorf	b74f48fe1b	Fix flake8 "invalid escape sequence 'x'" warning From, https://lintlyci.github.io/Flake8Rules/rules/W605.html As of Python 3.6, a backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. Note 'float_pobj' was simply removed from arcstat.py since it was entirely unused. Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8056	2018-10-24 23:26:08 -07:00
Tom Caputi	7d658d29cf	Fix waiting in ztest_device_removal() spa->spa_vdev_removal is created in a sync task that is initiated via dsl_sync_task_nowait(). Since the task may not run before spa_vdev_remove() returns, we must wait at least 1 txg to ensure that the removal struct has been created. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8010	2018-10-24 14:37:35 -07:00
Tom Caputi	4a7eb69a5a	Fix ztest deadman panic with indirect vdev damage This patch fixes an issue where ztest's deadman thread would trigger a panic because reconstructing artifically damaged blocks would take too long to reconstruct. This patch simply limits how often ztest inflicts split-block damage and how many segments it can damage when it does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8010	2018-10-24 14:37:31 -07:00
Tom Caputi	ab4c009e3d	Fix dbgmsg printing in ztest and zdb This patch resolves a problem where the -G option in both zdb and ztest would cause the code to call __dprintf() to print zfs_dbgmsg output. This function was not properly wired to add messages to the dbgmsg log as it is in userspace and so the messages were simply dropped. This patch also tries to add some degree of distinction to dprintf() (which now prints directly to stdout) and zfs_dbgmsg() (which adds messages to an internal list that can be dumped with zfs_dbgmsg_print()). In addition, this patch corrects an issue where ztest used a global variable to decide whether to dump the dbgmsg buffer on a crash. This did not work because ztest spins up more instances of itself using execv(), which did not copy the global variable to the new process. The option has been moved to the ztest_shared_opts_t which already exists for interprocess communication. This patch also changes zfs_dbgmsg_print() to use write() calls instead of printf() so that it will not fail when used in a signal handler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8010	2018-10-24 14:36:50 -07:00
Tom Caputi	9410257800	Fix random ztest_deadman_thread failures The zloop test has been failing in buildbot for the last few weeks with various failures in ztest_deadman_thread(). This is due to the fact that this thread is not stopped when performing pool import / export tests as it should be. This patch simply corrects this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8010	2018-10-24 14:36:21 -07:00
Serapheim Dimitropoulos	9b2266e3d8	OpenZFS 9682 - page fault in dsl_async_clone_destroy() while opening pool Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/9682 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ade2c82828 Closes #8037	2018-10-19 12:06:21 -07:00
Matthew Ahrens	d637db98e1	OpenZFS 9681 - ztest failure in spa_history_log_internal due to spa_rename() Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Tom Caputi <tcaputi@datto.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9681 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6aee0ad7 Closes #8041	2018-10-19 12:02:28 -07:00
Tom Caputi	80a91e7469	Defer new resilvers until the current one ends Currently, if a resilver is triggered for any reason while an existing one is running, zfs will immediately restart the existing resilver from the beginning to include the new drive. This causes problems for system administrators when a drive fails while another is already resilvering. In this case, the optimal thing to do to reduce risk of data loss is to wait for the current resilver to end before immediately replacing the second failed drive, which allows the system to operate with two incomplete drives for the minimum amount of time. This patch introduces the resilver_defer feature that essentially does this for the admin without forcing them to wait and monitor the resilver manually. The change requires an on-disk feature since we must mark drives that are part of a deferred resilver in the vdev config to ensure that we do not assume they are done resilvering when an existing resilver completes. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: @mmaybee Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7732	2018-10-18 21:06:18 -07:00
LOLi	2e55034471	zpool: allow sharing of spare device among pools ZFS allows, by default, sharing of spare devices among different pools; this commit simply restores this functionality for disk devices and adds an additional tests case to the ZFS Test Suite to prevent future regression. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7999	2018-10-17 11:21:07 -07:00
Paul Dagnelie	d52d80b700	Add types to featureflags in zfs The boolean featureflags in use thus far in ZFS are extremely useful, but because they take advantage of the zap layer, more interesting data than just a true/false value can be stored in a featureflag. In redacted send/receive, this is used to store the list of redaction snapshots for a redacted dataset. This change adds the ability for ZFS to store types other than a boolean in a featureflag. The only other implemented type is a uint64_t array. It also modifies the interfaces around dataset features to accomodate the new capabilities, and adds a few new functions to increase encapsulation. This functionality will be used by the Redacted Send/Receive feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7981	2018-10-16 11:15:04 -07:00
Matthew Ahrens	0aa5916a30	OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects (#7979 ) OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects We're leaking the dd_clones objects in dsl_dir_destroy_sync. This bug appears to have been around forever. Thankfully the amount of space typically involved is tiny. In addition this adds a mechanism in ZDB to find objects in the MOS which are leaked (not referenced anywhere). Porting notes: * Added dd_crypto_obj to ZDB MOS object leak tracking Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com> OpenZFS-issue: https://illumos.org/issues/9847 Closes #7979	2018-10-12 11:28:26 -07:00
Brian Behlendorf	27f80e85c2	Improved error handling for extreme rewinds The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and vdev_obsolete_counts_are_precise() functions assume that the only way a zap_lookup() can fail is if the requested entry is missing. While this is the most common cause, it's not the only cause. Attemping to access a damaged ZAP will result in other errors. The most likely scenario for accessing a damaged ZAP is during an extreme rewind pool import. Under these conditions the pool is expected to contain damaged objects and the import code was updated to handle this gracefully. Getting an ECKSUM error from these ZAPs after the pool in import a far less likely, therefore the behavior for call paths was not modified. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7809 Closes #7921	2018-10-12 11:24:04 -07:00
Matt Ahrens	5d43cc9a59	OpenZFS 9689 - zfs range lock code should not be zpl-specific The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific data structures, specifically znode_t. However, it's also used by the ZVOL code, which uses a "dummy" znode_t to pass to the range locking code. We should clean this up so that the range locking code is generic and can be used equally by ZPL and ZVOL, and also can be used by future consumers that may need to run in userland (libzpool) as well as the kernel. Porting notes: * Added missing sys/avl.h include to sys/zfs_rlock.h. * Removed 'dbuf is within the locked range' ASSERTs from dmu_sync(). This was needed because ztest does not yet use a locked_range_t. * Removed "Approved by:" tag requirement from OpenZFS commit check to prevent needless warnings when integrating changes which has not been merged to illumos. * Reverted free_list range lock changes which were originally needed to defer the cv_destroy() which was called immediately after cv_broadcast(). With `d2733258` this should be safe but if not we may need to reintroduce this logic. * Reverts: The following two commits were reverted and squashed in to this change in order to make it easier to apply OpenZFS 9689. - `d88895a0`, which removed the dummy znode from zvol_state - `e3a07cd0`, which updated ztest to use range locks * Preserved optimized rangelock comparison function. Preserved the rangelock free list. The cv_destroy() function will block waiting for all processes in cv_wait() to be scheduled and drop their reference. This is done to ensure it's safe to free the condition variable. However, blocking while holding the rl->rl_lock mutex can result in a deadlock on Linux. A free list is introduced to defer the cv_destroy() and kmem_free() until after the mutex is released. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9689 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680 External-issue: DLPX-58662 Closes #7980	2018-10-11 10:19:33 -07:00
Tony Hutter	2ef0f8c329	Print "(repairing)" in zpool status again Historically, zpool status prints "(repairing)" for any drives that have errors during a scrub: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 /tmp/file1 ONLINE 13 0 0 (repairing) /tmp/file2 ONLINE 0 0 0 /tmp/file3 ONLINE 0 0 0 This was accidentally broken in "OpenZFS 9166 - zfs storage pool checkpoint" (`d2734cc`). This patch adds it back in. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7779 Closes #7978	2018-10-09 20:30:32 -07:00
Tom Caputi	52ce99dd61	Refcounted DSL Crypto Key Mappings Since native ZFS encryption was merged, we have been fighting against a series of bugs that come down to the same problem: Key mappings (which must be present during all I/O operations) are created and destroyed based on dataset ownership, but I/Os can have traditionally been allowed to "leak" into the next txg after the dataset is disowned. In the past we have attempted to solve this problem by trying to ensure that datasets are disowned ater all I/O is finished by calling txg_wait_synced(), but we have repeatedly found edge cases that need to be squashed and code paths that might incur a high number of txg syncs. This patch attempts to resolve this issue differently, by adding a reference to the key mapping for each txg it is dirtied in. By doing so, we can remove many of the unnecessary calls to txg_wait_synced() we have added in the past and ensure we don't need to deal with this problem in the future. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7949	2018-10-03 09:47:11 -07:00
Tim Schumacher	424fd7c3e0	Prefix all refcount functions with zfs_ Recent changes in the Linux kernel made it necessary to prefix the refcount_add() function with zfs_ due to a name collision. To bring the other functions in line with that and to avoid future collisions, prefix the other refcount functions as well. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Schumacher <timschumi@gmx.de> Closes #7963	2018-10-01 10:42:05 -07:00
Brian Behlendorf	1258bd778e	Refine split block reconstruction Due to a flaw in `4589f3ae` the number of unique combinations could be calculated incorrectly. This could result in the random combinations reconstruction being used when it would have been possible to check all combinations. This change fixes the unique combinations calculation and simplifies the reconstruction logic by maintaining a per- segment list of unique copies. The vdev_indirect_splits_damage() function was introduced to validate both the enumeration and random reconstruction logic with ztest. It is implemented such it will never make a known recoverable block unrecoverable. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6900 Closes #7934	2018-10-01 10:36:34 -07:00
Tim Schumacher	c13060e478	Linux 4.19-rc3+ compat: Remove refcount_t compat torvalds/linux@59b57717f ("blkcg: delay blkg destruction until after writeback has finished") added a refcount_t to the blkcg structure. Due to the refcount_t compatibility code, zfs_refcount_t was used by mistake. Resolve this by removing the compatibility code and replacing the occurrences of refcount_t with zfs_refcount_t. Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Schumacher <timschumi@gmx.de> Closes #7885 Closes #7932	2018-09-26 10:29:26 -07:00
Gregor Kopka	b954e36e51	Zpool iostat: remove latency/queue scaling Bandwidth and iops are average per second while _wait are averages per request for latency or, for queue depths, an instantaneous measurement at the end of an interval (according to man zpool). When calculating the first two it makes sense to do x/interval_duration (x being the increase in total bytes or number of requests over the duration of the interval, interval_duration in seconds) to 'scale' from amount/interval_duration to amount/second. But applying the same math for the latter (_wait latencies/queue) is wrong as there is no interval_duration component in the values (these are time/requests to get to average_time/request or already an absulute number). This bug leads to the only correct continuous *_wait figures for both latencies and queue depths from 'zpool iostat -l/q' being with duration=1 as then the wrong math cancels itself (x/1 is a nop). This removes temporal scaling from latency and queue depth figures. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregor Kopka <gregor@kopka.net> Closes #7945 Closes #7694	2018-09-25 16:29:16 -07:00
LOLi	e0b7ff46c9	zstreamdump dumps core printing truncated nvlist This change prevents zstreamdump from crashing when trying to print invalid nvlist data (DRR_BEGIN record) from a truncated send stream. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7917	2018-09-18 09:43:09 -07:00
LOLi	5140a58f3b	zpool should detect invalid fs property on create This change improve the handling of invalid filesystem properties when specified at pool creation: this is useful when 'zpool create -n' (dry run) is executed to detect invalid fs-level options (-O) before the actual command is run. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7620 Closes #7878	2018-09-13 13:37:42 -07:00
LOLi	0238a9755b	Fix 'zfs allow' for create time permissions When no permission set is defined for a dataset the create time permissions are incorrectly shown as if they were a permission set. This change simply correct how allow permissions are displayed. This commit also fixes a small manpage formatting issue and adds the "zfs_allow_003_pos" test case to the ZFS Test Suite. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7519 Closes #7860	2018-09-06 13:11:21 -07:00
Don Brady	cc99f275a2	Pool allocation classes Allocation Classes add the ability to have allocation classes in a pool that are dedicated to serving specific block categories, such as DDT data, metadata, and small file blocks. A pool can opt-in to this feature by adding a 'special' or 'dedup' top-level VDEV. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Gregor Kopka <gregor@kopka.net> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #5182	2018-09-05 18:33:36 -07:00
Don Brady	b83a0e2dc1	Add basic zfs ioc input nvpair validation We want newer versions of libzfs_core to run against an existing zfs kernel module (i.e. a deferred reboot or module reload after an update). Programmatically document, via a zfs_ioc_key_t, the valid arguments for the ioc commands that rely on nvpair input arguments (i.e. non legacy commands from libzfs_core). Automatically verify the expected pairs before dispatching a command. This initial phase focuses on the non-legacy ioctls. A follow-on change can address the legacy ioctl input from the zfs_cmd_t. The zfs_ioc_key_t for zfs_keys_channel_program looks like: static const zfs_ioc_key_t zfs_keys_channel_program[] = { {"program", DATA_TYPE_STRING, 0}, {"arg", DATA_TYPE_UNKNOWN, 0}, {"sync", DATA_TYPE_BOOLEAN_VALUE, ZK_OPTIONAL}, {"instrlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, {"memlimit", DATA_TYPE_UINT64, ZK_OPTIONAL}, }; Introduce four input errors to identify specific input failures (in addition to generic argument value errors like EINVAL, ERANGE, EBADF, and E2BIG). ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #7780	2018-09-02 12:14:01 -07:00
bernie1995	0fe7c953b3	ZTS: path cleanup Removing hardcoded paths in many scripts. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bernie1995 <bernie.pikes@gmail.com> Issue #7507 Closes #7843	2018-08-30 13:46:55 -07:00
Tom Caputi	c3bd3fb4ac	OpenZFS 9403 - assertion failed in arc_buf_destroy() Assertion failed in arc_buf_destroy() when concurrently reading block with checksum error. Porting notes: * The ability to zinject decompression errors has been added, but this only works at the zio_decompress() level, where we have all of the info we need to match against the user's zinject options. * The decompress_fault test has been added to test the new zinject functionality * We attempted to set zio_decompress_fail_fraction to (1 << 18) in ztest for further test coverage. Although this did uncover a few low priority issues, this unfortuantely also causes ztest to ASSERT in many locations where the code is working correctly since it is designed to fail on IO errors. Developers can manually set this variable with the '-o' option to find and debug issues. Authored by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Tom Caputi <tcaputi@datto.com> OpenZFS-issue: https://illumos.org/issues/9403 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9 Closes #7822	2018-08-29 11:33:33 -07:00
Rich Ercolani	e8a8208eef	Added metadata/dnode cache info to arc_summary Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #7815	2018-08-22 09:35:20 -07:00
Olaf Faaland	34fe773e30	Skip import activity test in more zdb code paths Since zdb opens the pools read-only, it cannot damage the pool in the event the pool is already imported either on the same host or on another one. If the pool vdev structure is changing while zdb is importing the pool, it may cause zdb to crash. However this is unlikely, and in any case it's a user space process and can simply be run again. For this reason, zdb should disable the multihost activity test on import that is normally run. This commit fixes a few zdb code paths where that had been overlooked. It also adds tests to ensure that several common use cases handle this properly in the future. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Gu Zheng <guzheng2331314@163.com> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7797 Closes #7801	2018-08-20 10:05:23 -07:00
DeHackEd	edc05fdb34	Don't modify argv[] in user tools argv[] gets modified during string parsing for input arguments. This is reflected in the live process listing. Don't do that. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #7760	2018-08-20 09:55:18 -07:00
LOLi	a9d6270acb	'zfs holds' scripted mode is not documented This change simply documents the existing "scripted mode" option in both command help and man page. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7798	2018-08-18 15:47:41 -07:00
LOLi	1f87313ac8	Fix arcstat.py handling of unsupported options This change allows the arcstat.py script to handle unsupported options gracefully and print both error and usage messages when one such option is provided. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7799	2018-08-18 13:10:36 -07:00
Olaf Faaland	5200237711	MMP should not suspend pool in ztest When running ztest, never suspend the pool due to failed or delayed MMP writes. There are many sources of long delays within ztest, such as device opens, closes, etc. which in combination, may delay MMP writes too long and cause MMP to suspend the pool. Some of these delays also affect real pools, and should be fixed. That is being worked separately. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7776	2018-08-13 08:12:25 -07:00
Nathan Lewis	010d12474c	Add support for selecting encryption backend - Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl) that control the crypto implementation. At the moment there is a choice between generic and aesni (on platforms that support it). - This enables support for AES-NI and PCLMULQDQ-NI on AMD Family 15h (bulldozer) and newer CPUs (zen). - Modify aes_key_t to track what implementation it was generated with as key schedules generated with various implementations are not necessarily interchangable. Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com> Closes #7102 Closes #7103	2018-08-02 11:59:24 -07:00
Serapheim Dimitropoulos	6b64382b17	OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations = Motivation While dealing with another performance issue (see 126118f) we noticed that we spend a lot of time in various places in the kernel when constructing long nvlists. The problem is that when an nvlist is created with the NV_UNIQUE_NAME set (which is the case most of the time), we do a linear search through the whole list to ensure uniqueness for every entry we add. An example of the above scenario can be seen in the following flamegraph, where more than have the time of the zfsdev_ioctl() is spent on constructing nvlists. Flamegraph: https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg Adding a table to speed up lookups will help situations where we just construct an nvlist (like the scenario above), in addition to regular lookups and removals. = What this patch does In this diff we've implemented a hash-table on top of the nvlist code that converts most nvlist operations from O(# number of entries) to O(1)* (the start is for amortized time as the hash-table grows and shrinks depending on the # of entries - plain lookup is strictly O(1)). = Performance Analysis To analyze the performance improvement I just used the setup from the snapshot deletion issue mentioned above in the Motivation section. Basically I created 10K filesystems with one snapshot each and then I just used the API of libZFS_Core to pass down an nvlist of all the snapshots to have them deleted. The reason I used my own driver program was to have clean performance results of what actually happens in the kernel. The flamegraphs and wall clock times mentioned below were gathered from the start to the end of the driver program's run. Between trials the testpool used was completely destroyed, the system was rebooted and the testpool was completely recreated. The reason for this dance was to get consistent results. == Results (before patch): === Sampling Flamegraphs [Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg [Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg [Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg === Wall clock times (in seconds) ``` [Trial 4] real 5.3 user 0.4 sys 2.3 [Trial 5] real 8.2 user 0.4 sys 2.4 [Trial 6] real 6.0 user 0.5 sys 2.3 ``` == Results (after patch): === Sampling Flamegraphs [Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg [Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg [Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg === Wall clock times (in seconds) ``` [Trial 4] real 4.9 user 0.0 sys 0.9 [Trial 5] real 3.8 user 0.0 sys 0.9 [Trial 6] real 3.6 user 0.0 sys 0.9 ``` == Analysis The results between the trials are consistent so in this sections I will only talk about the flamegraph results from trial-1 and the wall-clock results from trial-4. From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996 samples counts. Specifically, the samples from fnvlist_add_nvlist() and spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples), leaving zfs_ioc_destroy_snaps() to dominate most samples from zfs_dev_ioctl(). From trial-4 we see that the user time dropped to 0 secods. I believe the consistent 0.4 seconds before my patch was applied was due to my driver program constructing the long nvlist of snapshots so it can pass it to the kernel. As for the system time, the effect there is more clear (2.3 down to 0.9 seconds). Porting Notes: * DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and zpool_do_events_nvprint(). Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9580 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1 Closes #7748	2018-07-30 11:30:03 -07:00
Brian Behlendorf	d441e85dd7	Add support for autoexpand property While the autoexpand property may seem like a small feature it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that with a few modifications for Linux it can be supported. Auto-expand works as follows; when a block device is modified (re-sized, closed after being open r/w, etc) a change uevent is generated for udev. The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid. From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size. Finally, if the property autoexpand=on a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size which allows the new capacity to be used. In order to make all of the above possible the following changes were required: * Updated the zpool_expand_001_pos and zpool_expand_003_pos tests. These tests now create a pool which is layered on a loopback, scsi_debug, and file vdev. This allows for testing of non- partitioned block device (loopback), a partition block device (scsi_debug), and a file which does not receive udev change events. This provided for better test coverage, and by removing the layering on ZFS volumes there issues surrounding layering one pool on another are avoided. * zpool_find_vdev_by_physpath() updated to accept a vdev guid. This allows for matching by guid rather than path which is a more reliable way for the ZED to reference a vdev. * Fixed zfs_zevent_wait() signal handling which could result in the ZED spinning when a signal was not handled. * Removed vdev_disk_rrpart() functionality which can be abandoned in favor of kernel provided blkdev_reread_part() function. * Added a rwlock which is held as a writer while a disk is being reopened. This is important to prevent errors from occurring for any configuration related IOs which bypass the SCL_ZIO lock. The zpool_reopen_007_pos.ksh test case was added to verify IO error are never observed when reopening. This is not expected to impact IO performance. Additional fixes which aren't critical but were discovered and resolved in the course of developing this functionality. * Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for ZFS volumes. This is as good as a unique physical path, while the volumes are not used in the test cases anymore for other reasons this improvement was included. Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #120 Closes #2437 Closes #5771 Closes #7366 Closes #7582 Closes #7629	2018-07-23 15:40:15 -07:00
Brian Behlendorf	e2cc448b60	Reduce zdb output when pool contains checkpoint When running zdb without additional arguments against a pool containing a checkpoint the entire checkpoint spacemap should not be dumped. Make this behavior conditional upon passing the -mmmm option as described in the zdb(8) man page. -mmmm Display every spacemap record. Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7702	2018-07-10 21:23:17 -07:00
Troels Nørgaard	94370f5955	Default ashift for Amazon EC2 NVMe devices Add a default 4 KiB ashift for Amazon EC2 NVMe devices on instances with NVMe ephemeral devices, such as the types c5d, f1, i3 and m5d. As per the official documentation [1] a 4096 byte blocksize should be used to match the underlying hardware. The string was identified via: $ sudo sginfo -M /dev/nvme0n1 INQUIRY response (cmd: 0x12) ---------------------------- Device Type 0 Vendor: NVMe Product: Amazon EC2 NVMe Revision level: $ lsblk -io KNAME,TYPE,SIZE,MODEL KNAME TYPE SIZE MODEL nvme0n1 disk 442.4G Amazon EC2 NVMe Instance Storage [1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ storage-optimized-instances.html Retrived 2018-07-03 Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Troels Nørgaard <tnn@tradeshift.com> Closes #7676	2018-07-06 16:15:19 -07:00
Serapheim Dimitropoulos	4d044c4c1d	OpenZFS 9238 - ZFS Spacemap Encoding V2 Motivation ========== The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). New encoding ============ The layout can be found at space_map.h and it remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps (expected to be used in the log spacemap project). One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. Other references: ================= The new encoding is also discussed towards the end of the Log Space Map presentation from 2017's OpenZFS summit. Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d OpenZFS-issue: https://www.illumos.org/issues/9238 Closes #7665	2018-07-05 12:02:34 -07:00
Low-power	aebfb84851	Fix missing option '-e' in zpool online usage Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #7655	2018-06-26 10:17:55 -07:00
Serapheim Dimitropoulos	d2734cce68	OpenZFS 9166 - zfs storage pool checkpoint Details about the motivation of this feature and its usage can be found in this blogpost: https://sdimitro.github.io/post/zpool-checkpoint/ A lightning talk of this feature can be found here: https://www.youtube.com/watch?v=fPQA8K40jAM Implementation details can be found in big block comment of spa_checkpoint.c Side-changes that are relevant to this commit but not explained elsewhere: * renames members of "struct metaslab trees to be shorter without losing meaning * space_map_{alloc,truncate}() accept a block size as a parameter. The reason is that in the current state all space maps that we allocate through the DMU use a global tunable (space_map_blksz) which defauls to 4KB. This is ok for metaslab space maps in terms of bandwirdth since they are scattered all over the disk. But for other space maps this default is probably not what we want. Examples are device removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this review. Both of these have a 1:1 relationship with each vdev and could benefit from a bigger block size. Porting notes: * The part of dsl_scan_sync() which handles async destroys has been moved into the new dsl_process_async_destroys() function. * Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write to block device backed pools. * ZTS: * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg". * Don't use large dd block sizes on /dev/urandom under Linux in checkpoint_capacity. * Adopt Delphix-OS's setting of 4 (spa_asize_inflation = SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed its attempts to fill the pool * Create the base and nested pools with sync=disabled to speed up the "setup" phase. * Clear labels in test pool between checkpoint tests to avoid duplicate pool issues. * The import_rewind_device_replaced test has been marked as "known to fail" for the reasons listed in its DISCLAIMER. * New module parameters: zfs_spa_discard_memory_limit, zfs_remove_max_bytes_pause (not documented - debugging only) vdev_max_ms_count (formerly metaslabs_per_vdev) vdev_min_ms_count Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9166 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8 Closes #7570	2018-06-26 10:07:42 -07:00
kpande	aeb39df726	Fix typo in comment, handeling->handling Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Kash Pande <kash@tripleback.net> Closes #7641	2018-06-18 21:51:06 -07:00
John Gallagher	917f475fba	Add tunables for channel programs This patch adds tunables for modifying the maximum memory limit and maximum instruction limit that can be specified when running a channel program. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: John Gallagher <john.gallagher@delphix.com> External-issue: LX-1085 Closes #7618	2018-06-15 15:10:42 -07:00
Brian Behlendorf	c91cf36fc2	Fix ztest_vdev_add_remove() test case Commit `2ffd89fc` allowed two new errors to be reported by zil_reset() in order to provide a descriptive error message regarding why a log device could not be removed. However, the new return values were not handled in the ztest_vdev_add_remove() test case resulting in ztest failures during automated testing. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7630	2018-06-14 09:41:27 -07:00
Antonio Russo	39042f9736	Tunable directory for zfs runtime scripts zpool and zed place scripts in subdirectories of libexecdir. Some distributions locate architecture independent scripts in other locations (e.g. Debian). To avoid these paths getting out of sync, centralize the definitions. Build zfs-test's default.cfg by Makefile. Use the new directory logic building tests/zfs-tests/include/default.cfg.in. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7597	2018-06-07 09:59:59 -07:00
Alek P	6969afcefd	Always continue recursive destroy after error Currently, during a recursive zfs destroy the first error that is encountered will stop the destruction of the datasets. Errors may happen for a variety of reasons including competing deletions and busy datasets. This patch switches recursive destroy to always do a best-effort recursive dataset destroy. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #7574	2018-06-06 10:14:52 -07:00
Tony Hutter	f0ed6c7448	Add pool state /proc entry, "SUSPENDED" pools 1. Add a proc entry to display the pool's state: $ cat /proc/spl/kstat/zfs/tank/state ONLINE This is done without using the spa config locks, so it will never hang. 2. Fix 'zpool status' and 'zpool list -o health' output to print "SUSPENDED" instead of "ONLINE" for suspended pools. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7331 Closes #7563	2018-06-06 09:33:54 -07:00
Brian Behlendorf	2d9142c9d4	Remove rwlock wrappers The only remaining consumer of the rwlock compatibility wrappers is ztest. Remove the wrappers and convert the few remaining calls to the underlying pthread functions. rwlock_init() -> pthread_rwlock_init() rwlock_destroy() -> pthread_rwlock_destroy() rw_rdlock() -> pthread_rwlock_rdlock() rw_wrlock() -> pthread_rwlock_wrlock() rw_unlock() -> pthread_rwlock_unlock() Note pthread_rwlock_init() defaults to PTHREAD_PROCESS_PRIVATE which is equivilant to the USYNC_THREAD behavior. There is no functional change. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7591	2018-06-04 16:52:10 -07:00
Pavel Zakharov	8a393be353	OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_t We want to be able to pass various settings during import/open of a pool, which are not only related to rewind. Instead of adding a new policy and duplicate a bunch of code, we should just rename rewind_policy to a more generic term like load_policy. For instance, we'd like to set spa->spa_import_flags from the nvlist, rather from a flags parameter passed to spa_import as in some cases we want those flags not only for the import case, but also for the open case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would allow zfs to open a pool when logs are missing. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9235 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44 Closes #7532	2018-06-04 14:54:20 -07:00
Brian Behlendorf	93ce2b4ca5	Update build system and packaging Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_. Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7556	2018-05-29 16:00:33 -07:00
Jorgen Lundman	561ba8d1b1	OpenZFS 9523 - Large alloc in zdb can cause trouble 16MB alloc in zdb_embedded_block() can cause cores in certain situations (clang, gcc55). Authored by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: * Replaces an equivalent fix previously made for Linux. OpenZFS-issue: https://illumos.org/issues/9523 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2c1964a Closes #7561	2018-05-25 17:24:57 -07:00
Antonio Russo	68fded8146	Add canonical mount options zfs-mount-generator lib/libzfs/libzfs_mount.c:zfs_add_options provides the canonical mount options used by a `zfs mount` command. Because we cannot call `zfs mount` directly from a systemd.mount unit, we mirror that logic in zfs-mount-generator. The zed script is updated to cache these properties as well. Include a mini-tutorial in the manual page, properly substitute configuration paths in zfs-mount-generator.8.in, and standardize the Makefile. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7453	2018-05-11 12:44:14 -07:00
Pavel Zakharov	6cb8e5306d	OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: 7638 Refactor spa_load_impl into several functions 8961 SPA load/import should tell us why it failed 7277 zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Porting notes (ZTS): * Fix 'make dist' target in zpool_import * The maximum path length allowed by tar is 99 characters. Several of the new test cases exceeded this limit resulting in them not being included in the tarball. Shorten the names slightly. * Set/get tunables using accessor functions. * Get last synced txg via the "zfs_txg_history" mechanism. * Clear zinject handlers in cleanup for import_cache_device_replaced and import_rewind_device_replaced in order that the zpool can be exported if there is an error. * Increase FILESIZE to 8G in zfs-test.sh to allow for a larger ext4 file system to be created on ZFS_DISK2. Also, there's no need to partition ZFS_DISK2 at all. The partitioning had already been disabled for multipath devices. Among other things, the partitioning steals some space from the ext4 file system, makes it difficult to accurately calculate the paramters to parted and can make some of the tests fail. * Increase FS_SIZE and FILE_SIZE in the zpool_import test configuration now that FILESIZE is larger. * Write more data in order that device evacuation take lonnger in a couple tests. * Use mkdir -p to avoid errors when the directory already exists. * Remove use of sudo in import_rewind_config_changed. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9075 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123 Closes #7459	2018-05-08 21:35:27 -07:00
Pavel Zakharov	afd2f7b711	OpenZFS 8962 - zdb should work on non-idle pools Currently `zdb` consistently fails to examine non-idle pools as it fails during the `spa_load()` process. The main problem seems to be that `spa_load_verify()` fails as can be seen below: $ sudo zdb -d -G dcenter zdb: can't open 'dcenter': I/O error ZFS_DBGMSG(zdb): spa_open_common: opening dcenter spa_load(dcenter): LOADING disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950 spa_load(dcenter): using uberblock with txg=40824950 spa_load(dcenter): UNLOADING spa_load(dcenter): RELOADING spa_load(dcenter): LOADING disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952 spa_load(dcenter): using uberblock with txg=40824952 spa_load(dcenter): FAILED: spa_load_verify failed [error=5] spa_load(dcenter): UNLOADING This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is done by creating a global flag in zfs and then setting it in `zdb`. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/8962 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/180ad792 Closes #7459	2018-05-08 21:32:57 -07:00
Paul Dagnelie	64c1dcefe3	OpenZFS 9421, 9422 - zdb show possibly leaked objects 9421 zdb should detect and print out the number of "leaked" objects 9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue It is possible for zfs to "leak" objects in such a way that they are not freed, but are also not accessible via the POSIX interface. As the only way to know that this is happened is to see one of them directly in a zdb run, or by noting unaccounted space usage, zdb should be enhanced to count these objects and return failure if some are detected. We have access to the delete queue through the zfs_get_deleteq function; we should call it in dump_znode to determine if the object is on the delete queue. This is not the most efficient possible method, but it is the simplest to implement, and should suffice for the common case where there few objects on the delete queue. Also zfs diff and zdb currently traverse every single dnode in a dataset and tries to figure out the path of the object by following it's parent. When an object is placed on the delete queue, for all practical purposes it's already discarded, it's parent might not exist anymore, and another object might now have the object number that belonged to the parent. While all of the above makes sense, when trying to figure out the path of an object that is on the delete queue, we can run into issues where either it is impossible to determine the path because the parent is gone, or another dnode has taken it's place and thus we are returned a wrong path. We should therefore avoid trying to determine the path of an object on the delete queue and mark the object itself as being on the delete queue to avoid confusion. To achieve this, we currently have two ideas: 1. When putting an object on the delete queue, change it's parent object number to a known constant that means NULL. 2. When displaying objects, first check if it is present on the delete queue. Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9421 OpenZFS-issue: https://illumos.org/issues/9422 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca Closes #7500	2018-05-04 10:50:24 -07:00
Tom Caputi	be9a5c355c	Add support for decryption faults in zinject This patch adds the ability for zinject to trigger decryption and authentication faults in the ZIO and ARC layers. This functionality is exposed via the new "decrypt" error type, which may be provided for "data" object types. This patch also refactors some of the core encryption / decryption functions so that they have consistent prototypes, handle errors consistently, and do not have unused arguments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7474	2018-05-02 15:36:20 -07:00
Matthew Ahrens	964c2d69a9	OpenZFS 9236 - nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Patch Notes: * Removed ZFS_DEBUG_SPA from zfs-module-parameters.5 OpenZFS-issue: https://www.illumos.org/issues/9236 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668 Closes #7467	2018-04-30 10:19:48 -07:00
LOLi	b4555c777a	Fix 'zfs remap <poolname@snapname>' Only filesystems and volumes are valid 'zfs remap' parameters: when passed a snapshot name zfs_remap_indirects() does not handle the EINVAL returned from libzfs_core, which results in failing an assertion and consequently crashing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7454	2018-04-19 09:45:17 -07:00
Tom Caputi	b0ee5946aa	Fix issues with raw sends of spill blocks This patch fixes 2 issues in how spill blocks are processed during raw sends. The first problem is that compressed spill blocks were using the logical length rather than the physical length to determine how much data to dump into the send stream. The second issue is a typo that caused the spill record's object number to be used where the objset's ID number was required. Both issues have been corrected, and the payload_size is now printed in zstreamdump for future debugging. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7378 Closes #7432	2018-04-17 11:19:03 -07:00
Matt Ahrens	d830d4795a	OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices Authored by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9280 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c Closes #7445	2018-04-17 10:44:50 -07:00
Brian Behlendorf	4589f3ae4c	Optimize possible split block search space Remove duplicate segment copies to minimize the possible search space for reconstruction. Once reduced an accurate assessment can be made regarding the difficulty in reconstructing the block. Also, ztest will now run zdb with zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt to avoid checksum errors. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6900	2018-04-14 12:22:43 -07:00
Matthew Ahrens	9e052db462	OpenZFS 9290 - device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. Note that in no case is incorrect data returned, but we might get a checksum error when we should have been able to find the right data. There are two underlying problems: 1. When we remove a mirror device, we only read one side of the mirror. Since we can't verify the checksum, this side may be silently bad, but the good data is on the other side of the mirror (which we didn't read). This can cause the removal to "bake in" the busted data – all copies of the data in the new location are the same, busted version, while we left the good version behind. The fix for this is to read and copy both sides of the mirror. If the old and new vdevs are mirrors, we will read both sides of the old mirror, and write each copy to the corresponding side of the new mirror. (If the old and new vdevs have a different number of children, we will do this as best as possible.) Even though we aren't verifying checksums, this ensures that as long as there's a good copy of the data, we'll have a good copy after the removal, even if there's silent damage to one side of the mirror. If we're removing a mirror that has some silent damage, we'll have exactly the same damage in the new location (assuming that the new location is also a mirror). 2. When we read from an indirect vdev that points to a mirror vdev, we only consider one copy of the data. This can lead to reduced effective redundancy, because we might read a bad copy of the data from one side of the mirror, and not retry the other, good side of the mirror. Note that the problem is not with the removal process, but rather after the removal has completed (having copied correct data to both sides of the mirror), if one side of the new mirror is silently damaged, we encounter the problem when reading the relocated data via the indirect vdev. Also note that the problem doesn't occur when ZFS knows that one side of the mirror is bad, e.g. when a disk entirely fails or is offlined. The impact is that reads (from indirect vdevs that point to mirrors) may return a checksum error even though the good data exists on one side of the mirror, and scrub doesn't repair all data on the mirror (if some of it is pointed to via an indirect vdev). The fix for this is complicated by "split blocks" - one logical block may be split into two (or more) pieces with each piece moved to a different new location. In this case we need to read all versions of each split (one from each side of the mirror), and figure out which combination of versions results in the correct checksum, and then repair the incorrect versions. This ensures that we supply the same redundancy whether you use device removal or not. For example, if a mirror has small silent errors on all of its children, we can still reconstruct the correct data, as long as those errors are at sufficiently-separated offsets (specifically, separated by the largest block size - default of 128KB, but up to 16MB). Porting notes: * A new indirect vdev check was moved from dsl_scan_needs_resilver_cb() to dsl_scan_needs_resilver(), which was added to ZoL as part of the sequential scrub work. * Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t parameter. The extra parameter is unique to ZoL. * When posting indirect checksum errors the ABD can be passed directly, zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9290 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591 Closes #6900	2018-04-14 12:21:39 -07:00
Matthew Ahrens	a1d477c24c	OpenZFS 7614, 9064 - zfs device evacuation/removal OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Reece <alex@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900	2018-04-14 12:16:17 -07:00
Antonio Russo	55d80e651a	systemd mount generator and tracking ZEDLET zfs-mount-generator implements the "systemd generator" protocol, producing systemd.mount units from the cached outputs of zfs list, during early boot, integrating with systemd. Each pool has an indpendent cache of the command zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool which is kept synchronized by the ZEDLET history_event-zfs-list-cacher.sh Datasets not in the cache will be loaded later in the boot process by zfs-mount.service, including pools without a cache. Among other things, this allows for complex mount hierarchies. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes #7329	2018-04-06 14:11:09 -07:00
Tom Caputi	1bf9a552bb	Make encrypted "zfs mount -a" failures consistent Currently, "zfs mount -a" will print a warning and fail to mount any encrypted datasets that do not have a key loaded. This patch makes the behavior of this failure consistent with other failure modes ("zfs mount -a" will silently continue, explict "zfs mount" will print a message and return an error code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7382	2018-04-06 13:28:15 -07:00
Tony Hutter	21a4f5cc86	Fedora 28: Fix misc bounds check compiler warnings Fix a bunch of (mostly) sprintf/snprintf truncation compiler warnings that show up on Fedora 28 (GCC 8.0.1). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7361 Closes #7368	2018-04-04 10:16:47 -07:00
Alek P	272b5d730f	Add JSON output support to channel programs The changes piggyback JSON output support on top of channel programs (#6558). This way the JSON output support is targeted to scripting use cases and is easily maintainable since it really only touches one function (zfs_do_channel_program()). This patch ports Joyent's JSON nvlist library from illumos to enable easy JSON printing of channel program output nvlist. To keep the delta small I also took advantage of the fact that printing in zfs_do_channel_program() was almost always done before exiting the program. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #7281	2018-03-19 12:40:58 -07:00
Olaf Faaland	cec3a0a1bb	Report pool suspended due to MMP When the pool is suspended, record whether it was due to an I/O error or due to MMP writes failing to succeed within the required time. Change spa_suspended from uint8_t to zio_suspend_reason_t to store the reason. When userspace queries pool status via spa_tryimport(), report the reason the pool was suspended in a new key, ZPOOL_CONFIG_SUSPENDED_REASON. In libzfs, when interpreting the returned config nvlist, report suspension due to MMP with a new pool status enum value, ZPOOL_STATUS_IO_FAILURE_MMP. In status_callback(), which generates and emits the message when 'zpool status' is executed, add a case to print an appropriate message for the new pool status enum value. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7296	2018-03-15 10:56:55 -07:00
Paul Zuchowski	83362e8e67	Destroy makes full snap list before destroying Change zfs destroy logic so destroying begins before the entire list of snapshots is built. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7271	2018-03-12 15:24:08 -07:00
Tomohiro Kusumi	6b8655ad3f	Change functions which return literals to return `const char` get_format_prompt_string() and zpool_state_to_name() return a string literal which is read-only, thus they should return `const char`. zpool_get_prop_string() returns a non-const string after successful nv-lookup, and returns a string literal otherwise. Since this function is designed to be used for read-only purpose, the return type should also be `const char*`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #7285	2018-03-09 13:47:32 -08:00
Tony Hutter	639b18944a	Allow to limit zed's syslog chattiness Some usage patterns like send/recv of replication streams can produce a large number of events. In such a case, the current all-syslog.sh zedlet will hold up to its name, and flood the logs with mostly redundant information. Two mitigate this situation, this changeset introduces to new variables ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE to zed.rc that give more control over which event classes end up in the syslog. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Daniel Kobras <d.kobras@science-computing.de> Closes #6886 Closes #7260	2018-03-06 15:41:52 -08:00
Nasf-Fan	2705ebf0a7	Misc fixes and cleanup for project quota 1) The Coverity Scan reports some issues for the project quota patch, including: 1.1) zfs_prop_get_userquota() directly uses the const quota type value as the condition check by wrong. 1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid to be overwritten by dnode::dn->dn_oldprojid. 2) This patch fixes related issues. It also enhances the logic for zfs_project_item_alloc() to avoid buffer overflow. 3) Skip project quota ability check if does not change project quota related things (id or flag). Otherwise, it will cause chattr (for other non project quota flags) operation failed if project quota disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> Closes #7251 Closes #7265	2018-03-05 12:56:27 -08:00
John Eismeier	d699aaef09	Fix some typos Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: John Eismeier <john.eismeier@gmail.com> Closes #7237	2018-02-28 08:57:10 -08:00
Scot W. Stevenson	19528cf949	Add Python 3 rewrite of arc_summary.py Add new script arc_summary3.py as a complete rewrite of the arc_summary.py tool (see issue #6873) Add new options: -g/--graph - Display crude graphic representation of ARC status and quit -r/--raw - Print all available information as minimally formatted list (for grep) -s/--section - Print a single section. This replaces -p/--page, which is kept for backwards use but marked as depreciated Add new sections with information on ZIL and SPL. Notify user if sections L2ARC and VDEV are skipped instead of failing silently. Add warning that -p/--page option is depreciated. Developed for Python 3.5. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com> Closes #6873 Closes #6892	2018-02-28 08:52:34 -08:00
Tony Hutter	3e9c9d8a89	Add SMART self-test results to zpool status -c Add in SMART self-test results to zpool status\|iostat -c. This works for both SAS and SATA drives. Also, add plumbing to allow the 'smart' script to take smartctl output from a directory of output text files instead of running it against the vdevs. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7178	2018-02-27 09:31:27 -08:00
LOLi	4af6873af6	Fix segfault in zfs_do_bookmark() When invoked with wrong parameters 'zfs bookmark' fails to gracefully validate user input and crashes. This is a regression accidentally introduced in 587e228; this commit adds additional tests to the ZFS Test Suite to exercise this codepath. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: KireinaHoro <i@jsteward.moe> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7228 Closes #7229	2018-02-26 09:55:18 -08:00
Tony Hutter	bf95a000c4	Add scrub after resilver zed script * Add a zed script to kick off a scrub after a resilver. The script is disabled by default. * Add a optional $PATH (-P) option to zed to allow it to use a custom $PATH for its zedlets. This is needed when you're running zed under the ZTS in a local workspace. * Update test scripts to not copy in all-debug.sh and all-syslog.sh by default. They can be optionally copied in as part of zed_setup(). These scripts slow down zed considerably under heavy events loads and can cause events to be dropped or their delivery delayed. This was causing some sporadic failures in the 'fault' tests. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4662 Closes #7086	2018-02-23 11:38:05 -08:00
bunder2015	ca0b376604	Add SMART attributes for SSD and NVMe This adds the SMART attributes required to probe Samsung SSD and NVMe (and possibly others) disks when using the "zpool status -c" command. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #7183 Closes #7193	2018-02-21 13:52:47 -08:00
LOLi	faa97c1619	Want 'zfs send -b' This change implements 'zfs send -b' which can be used to send only received property values whether or not they are overridden by local settings. This can be very useful during "restore" operations from a backup pool because it allows to send only the property values originally sent from the backup source, even though they were later modified on the destination either by a 'zfs set' operation, explicit 'zfs inherit' or overridden during the receive process via 'zfs receive -o\|-x'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7156	2018-02-21 12:32:06 -08:00
Don Brady	cbce581353	Fix coverity defects: zfs channel programs CID 173243, 173245: Memory - corruptions (OVERRUN) Added size argument to lcompat_sprintf() to avoid use of INT_MAX CID 173244: Integer handling issues (OVERFLOW_BEFORE_WIDEN) Added cast to uint64_t to avoid a 32 bit overflow warning CID 173242: Integer handling issues (CONSTANT_EXPRESSION_RESULT) Conditionally removed unused luai_numisnan() floating point check CID 173241: Resource leaks (RESOURCE_LEAK) Added missing close(fd) on error path CID 173240: (UNINIT) Fixed uninitialized variable in get_special_prop() CID 147560: Null pointer dereferences (NULL_RETURNS) Cleaned up bad code merge in dsl_dataset_promote_check() CID 28475: Memory - illegal accesses (OVERRUN) Fixed lcompat_sprintf() to use a size paramater CID 28418, 28422: Error handling issues (CHECKED_RETURN) Added function result cast to (void) to avoid warning CID 23935, 28411, 28412: Memory - corruptions (ARRAY_VS_SINGLETON) Added casts to avoid exposing result as an array Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #7181	2018-02-20 11:19:42 -08:00
Nasf-Fan	9c5167d19f	Project Quota on ZFS Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d\|-r] <file\|directory ...> zfs project -C [-k] [-r] <file\|directory ...> zfs project -c [-0] [-d\|-r] [-p id] <file\|directory ...> zfs project [-p id] [-r] [-s] <file\|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by Ned Bass <bass6@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290	2018-02-13 14:54:54 -08:00
LOLi	c03f04708c	'zfs receive' fails with "dataset is busy" Receiving an incremental stream after an interrupted "zfs receive -s" fails with the message "dataset is busy": this is because we still have the hidden clone ../%recv from the resumable receive. Improve the error message suggesting the existence of a partially complete resumable stream from "zfs receive -s" which can be either aborted ("zfs receive -A") or resumed ("zfs send -t"). Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7129 Closes #7154	2018-02-12 12:28:59 -08:00
Chunwei Chen	eb9c4532dd	Fix zdb -ed on objset for exported pool zdb -ed on objset for exported pool would failed with: failed to own dataset 'qq/fs0': No such file or directory The reason is that zdb pass objset name to spa_import, it uses that name to create a spa. Later, when dmu_objset_own tries to lookup the spa using real pool name, it can't find one. We fix this by make sure we pass pool name rather than objset name to spa_import. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099 Closes #6464	2018-02-09 10:11:34 -08:00
Chunwei Chen	5e3bd0e684	Fix zdb -E segfault SPA_MAXBLOCKSIZE is too large for stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-02-09 10:11:19 -08:00
Chunwei Chen	950e17c215	Fix zdb -R decompression There are some issues in the zdb -R decompression implementation. The first is that ZLE can easily decompress non-ZLE streams. So we add ZDB_NO_ZLE env to make zdb skip ZLE. The second is the random bytes appended to pabd, pbuf2 stuff. This serve no purpose at all, those bytes shouldn't be read during decompression anyway. Instead, we randomize lbuf2, so that we can make sure decompression fill exactly to lsize by bcmp lbuf and lbuf2. The last one is the condition to detect fail is wrong. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099 Closes #4984	2018-02-09 10:11:02 -08:00
Chunwei Chen	0c0b0ad48a	Fix racy assignment of zcb.zcb_haderrors zcb_haderrors will be modified in zdb_blkptr_done, which is asynchronous. So we must move this assignment after zio_wait. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-02-09 10:08:40 -08:00
Serapheim Dimitropoulos	5b72a38d68	OpenZFS 8677 - Open-Context Channel Programs Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> We want to be able to run channel programs outside of synching context. This would greatly improve performance for channel programs that just gather information, as they won't have to wait for synching context anymore. === What is implemented? This feature introduces the following: - A new command line flag in "zfs program" to specify our intention to run in open context. (The -n option) - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages. === How do we handle zfs.sync functions in open context? When such a function is found by the interpreter and we are running in open context we abort the script and we spit out a descriptive runtime error. For example, given the script below ... arg = ... fs = arg["argv"][1] err = zfs.sync.destroy(fs) msg = "destroying " .. fs .. " err=" .. err return msg if we run it in open context, we will get back the following error: Channel program execution failed: [string "channel program"]:3: running functions from the zfs.sync submodule requires passing sync=TRUE to lzc_channel_program() (i.e. do not specify the "-n" command line argument) stack traceback: [C]: in function 'destroy' [string "channel program"]:3: in main chunk === What about testing? We've introduced new wrappers for all channel program tests that run each channel program as both (startard & open-context) and expect the appropriate behavior depending on the program using the zfs.sync module. OpenZFS-issue: https://www.illumos.org/issues/8677 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15 Closes #6558	2018-02-08 16:05:57 -08:00
Chris Williamson	d99a015343	OpenZFS 7431 - ZFS Channel Programs Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Don Brady <don.brady@delphix.com> Ported-by: John Kennedy <john.kennedy@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7431 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533 Porting Notes: * The CLI long option arguments for '-t' and '-m' don't parse on linux * Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc * Lua implementation is built as its own module (zlua.ko) * Lua headers consumed directly by zfs code moved to 'include/sys/lua/' * There is no native setjmp/longjump available in stock Linux kernel. Brought over implementations from illumos and FreeBSD * The get_temporary_prop() was adapted due to VFS platform differences * Use of inline functions in lua parser to reduce stack usage per C call * Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow	2018-02-08 15:28:18 -08:00
DeHackEd	0cf3430d2d	Fix `zpool status` overflow on fast scrubs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: DHE <git@dehacked.net> Closes #7133	2018-02-06 16:35:16 -08:00
Brian Behlendorf	60b8207496	Set persistent ztest failure mode In order to reliably detect deadlocks in the create and import path ztest should set the failure mode property. This ensures that the pool is always using the correct failmode behavior. Removed insidious use of local variable in MAXFAULTS macro. Converted VERIFY() to VERIFY0() where appropriate. Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7111	2018-02-05 12:00:26 -08:00
Tom Caputi	047116ac76	Raw sends must be able to decrease nlevels Currently, when a raw zfs send file includes a DRR_OBJECT record that would decrease the number of levels of an existing object, the object is reallocated with dmu_object_reclaim() which creates the new dnode using the old object's nlevels. For non-raw sends this doesn't really matter, but raw sends require that nlevels on the receive side match that of the send side so that the checksum-of-MAC tree can be properly maintained. This patch corrects the issue by freeing the object completely before allocating it again in this case. This patch also corrects several issues with dnode_hold_impl() and related functions that prevented dnodes (particularly multi-slot dnodes) from being reallocated properly due to the fact that existing dnodes were not being fully cleaned up when they were freed. This patch adds a test to make sure that zfs recv functions properly with incremental streams containing dnodes of different sizes. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6821 Closes #6864	2018-02-02 11:43:11 -08:00
Tom Caputi	ae76f45cda	Encryption Stability and On-Disk Format Fixes The on-disk format for encrypted datasets protects not only the encrypted and authenticated blocks themselves, but also the order and interpretation of these blocks. In order to make this work while maintaining the ability to do raw sends, the indirect bps maintain a secure checksum of all the MACs in the block below it along with a few other fields that determine how the data is interpreted. Unfortunately, the current on-disk format erroneously includes some fields which are not portable and thus cannot support raw sends. It is not possible to easily work around this issue due to a separate and much smaller bug which causes indirect blocks for encrypted dnodes to not be compressed, which conflicts with the previous bug. In addition, the current code generates incompatible on-disk formats on big endian and little endian systems due to an issue with how block pointers are authenticated. Finally, raw send streams do not currently include dn_maxblkid when sending both the metadnode and normal dnodes which are needed in order to ensure that we are correctly maintaining the portable objset MAC. This patch zero's out the offending fields when computing the bp MAC and ensures that these MACs are always calculated in little endian order (regardless of the host system's byte order). This patch also registers an errata for the old on-disk format, which we detect by adding a "version" field to newly created DSL Crypto Keys. We allow datasets without a version (version 0) to only be mounted for read so that they can easily be migrated. We also now include dn_maxblkid in raw send streams to ensure the MAC can be maintained correctly. This patch also contains minor bug fixes and cleanups. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6845 Closes #6864 Closes #7052	2018-02-02 11:37:16 -08:00
LOLi	63f88c12b4	Fix style issues in man pages and commands help * Remove 'zfs snap' from zfs help message (OpenZFS sync) * Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot' * Enforce 80 columns limit in help messages * Remove zfs_disable_dup_eviction from zfs-module-parameters(5) * Expose zfs_scan_max_ext_gap as a kernel module parameter. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7087	2018-01-29 15:05:03 -08:00
Giuseppe Di Natale	5e021f56d3	Add dbuf hash and dbuf cache kstats Introduce kstats about the dbuf hash and dbuf cache to make it easier to inspect state. This should help with debugging and understanding of these portions of the codebase. Correct format of dbuf kstat file. Introduce a dbc column to dbufs kstat to indicate if a dbuf is in the dbuf cache. Introduce field filtering in the dbufstat python script. Introduce a no header option to the dbufstat python script. Introduce a test case to test basic mru->mfu list movement in the ARC. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6906	2018-01-29 10:24:52 -08:00
Brian Behlendorf	bb25362553	Prevent zdb(8) from occasionally hanging on I/O The zdb(8) command may not terminate in the case where the pool gets suspended and there is a caller in zio_wait() blocking on an outstanding read I/O that will never complete. This can in turn cause ztest(1) to block indefinitely despite the deadman. Resolve the issue by setting the default failure mode for zdb(8) to panic. In user space we always want the command to terminate when forward progress is no longer possible. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6999	2018-01-25 13:41:51 -08:00
Brian Behlendorf	8fb1ede146	Extend deadman logic The intent of this patch is extend the existing deadman code such that it's flexible enough to be used by both ztest and on production systems. The proposed changes include: * Added a new `zfs_deadman_failmode` module option which is used to dynamically control the behavior of the deadman. It's loosely modeled after, but independant from, the pool failmode property. It can be set to wait, continue, or panic. * wait - Wait for the "hung" I/O (default) * continue - Attempt to recover from a "hung" I/O * panic - Panic the system * Added a new `zfs_deadman_ziotime_ms` module option which is analogous to `zfs_deadman_synctime_ms` except instead of applying to a pool TXG sync it applies to zio_wait(). A default value of 300s is used to define a "hung" zio. * The ztest deadman thread has been re-enabled by default, aligned with the upstream OpenZFS code, and then extended to terminate the process when it takes significantly longer to complete than expected. * The -G option was added to ztest to print the internal debug log when a fatal error is encountered. This same option was previously added to zdb in commit `fa603f82`. Update zloop.sh to unconditionally pass -G to obtain additional debugging. * The FM_EREPORT_ZFS_DELAY event which was previously posted when the deadman detect a "hung" pool has been replaced by a new dedicated FM_EREPORT_ZFS_DEADMAN event. * The proposed recovery logic attempts to restart a "hung" zio by calling zio_interrupt() on any outstanding leaf zios. We may want to further restrict this to zios in either the ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages. Calling zio_interrupt() is expected to only be useful for cases when an IO has been submitted to the physical device but for some reasonable the completion callback hasn't been called by the lower layers. This shouldn't be possible but has been observed and may be caused by kernel/driver bugs. * The 'zfs_deadman_synctime_ms' default value was reduced from 1000s to 600s. * Depending on how ztest fails there may be no cache file to move. This should not be considered fatal, collect the logs which are available and carry on. * Add deadman test cases for spa_deadman() and zio_wait(). * Increase default zfs_deadman_checktime_ms to 60s. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6999	2018-01-25 13:40:38 -08:00
Allan Jude	6bc4a2376c	OpenZFS 8972 - zfs holds: In scripted mode, do not pad columns with spaces Authored by: Allan Jude <allanjude@freebsd.org> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8972 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3aace5c077 Closes #7063	2018-01-19 09:36:17 -08:00
Brian Behlendorf	31864e3d8c	OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL usr/src/uts/common/sys/fs/zfs.h Change ZPROP_INVAL and ZPROP_CONT from macros to enum values. Clang and GCC both prefer to use unsigned ints to store enums. That was causing tautological comparison warnings (and likely eliminating error handling code at compile time) whenever a zfs_prop_t or zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT. Making the error flags be explicity enum values forces the enum types to be signed. ZPROP_INVAL was also compared against two different enum types. I had to change its name to ZPOOL_PROP_INVAL whenever its compared to a zpool_prop_t. There are still some places where ZPROP_INVAL or ZPROP_CONT is compared to a plain int, in code that doesn't know whether the int is storing a zfs_prop_t or a zpool_prop_t. usr/src/uts/common/fs/zfs/spa.c s/ZPROP_INVAL/ZPOOL_PROP_INVAL/ Authored by: Alan Somers <asomers@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8652 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74 Closes #7061	2018-01-19 09:22:37 -08:00
Brian Behlendorf	3da3488e63	Fix shellcheck v0.4.6 warnings Resolve new warnings reported after upgrading to shellcheck version 0.4.6. This patch contains no functional changes. * egrep is non-standard and deprecated. Use grep -E instead. [SC2196] * Check exit code directly with e.g. 'if mycmd;', not indirectly with $?. [SC2181] Suppressed. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7040	2018-01-17 10:17:16 -08:00
Brian Behlendorf	e1a0850c35	Force ztest to always use /dev/urandom For ztest, which is solely for testing, using a pseudo random is entirely reasonable. Using /dev/urandom ensures the system entropy pool doesn't get depleted thus stalling the testing. This is a particular problem when testing in VMs. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7017 Closes #7036	2018-01-12 09:36:26 -08:00
Brian Behlendorf	fed90353d7	Support -fsanitize=address with --enable-asan When --enable-asan is provided to configure then build all user space components with fsanitize=address. For kernel support use the Linux KASAN feature instead. https://github.com/google/sanitizers/wiki/AddressSanitizer When using gcc version 4.8 any test case which intentionally generates a core dump will fail when using --enable-asan. The default behavior is to disable core dumps and only newer versions allow this behavior to be controled at run time with the ASAN_OPTIONS environment variable. Additionally, this patch includes some build system cleanup. * Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS, and AM_LDFLAGS. Any additional flags should be added on a per-Makefile basic. The --enable-debug and --enable-asan options apply to all user space binaries and libraries. * Compiler checks consolidated in always-compiler-options.m4 and renamed for consistency. * -fstack-check compiler flag was removed, this functionality is provided by asan when configured with --enable-asan. * Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and DEBUG_LDFLAGS. * Moved default kernel build flags in to module/Makefile.in and split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These flags are set with the standard ccflags-y kbuild mechanism. * -Wframe-larger-than checks applied only to binaries or libraries which include source files which are built in both user space and kernel space. This restriction is relaxed for user space only utilities. * -Wno-unused-but-set-variable applied only to libzfs and libzpool. The remaining warnings are the result of an ASSERT using a variable when is always declared. * -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped because they are Solaris specific and thus not needed. * Ensure $GDB is defined as gdb by default in zloop.sh. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7027	2018-01-10 10:49:27 -08:00
Brian Behlendorf	bfe27ace0d	Fix unused variable warnings Resolved unused variable warnings observed after restricting -Wno-unused-but-set-variable to only libzfs and libzpool. Reviewed-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6941	2018-01-09 12:28:03 -08:00
Brian Behlendorf	06401e4222	Fix ztest_verify_dnode_bt() test case In ztest_verify_dnode_bt the ztest_object_lock must be held in order to safely verify the unused bonus space. Reviewed-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6941	2018-01-09 12:27:12 -08:00
Nathaniel Wesley Filardo	8b20a9f996	zhack: fix getopt return type This fixes zhack's command processing on ARM. On ARM char is unsigned, and so, in promotion to an int, it will never compare equal to -1. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu> Closes #7016	2018-01-09 11:14:45 -08:00
LOLi	390d679acd	Fix 'zpool add' handling of nested interior VDEVs When replacing a faulted device which was previously handled by a spare multiple levels of nested interior VDEVs will be present in the pool configuration; the following example illustrates one of the possible situations: NAME STATE READ WRITE CKSUM testpool DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 spare-0 DEGRADED 0 0 0 replacing-0 DEGRADED 0 0 0 /var/tmp/fault-dev UNAVAIL 0 0 0 cannot open /var/tmp/replace-dev ONLINE 0 0 0 /var/tmp/spare-dev1 ONLINE 0 0 0 /var/tmp/safe-dev ONLINE 0 0 0 spares /var/tmp/spare-dev1 INUSE currently in use This is safe and allowed, but get_replication() needs to handle this situation gracefully to let zpool add new devices to the pool. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6678 Closes #6996	2017-12-28 10:15:32 -08:00
Simon Guest	993669a7bf	vdev_id: new slot type ses This extends vdev_id to support a new slot type, ses, for SCSI Enclosure Services. With slot type ses, the disk slot numbers are determined by using the device slot number reported by sg_ses for the device with matching SAS address, found by querying all available enclosures. This is primarily of use on systems with a deficient driver omitting support for bay_identifier in /sys/devices. In my testing, I found that the existing slot types of port and id were not stable across disk replacement, so an alternative was required. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Simon Guest <simon.guest@tesujimath.org> Closes #6956	2017-12-20 09:42:07 -08:00
Giuseppe Di Natale	89a66a0457	Handle broken pipes in arc_summary Using a command similar to 'arc_summary.py \| head' causes a broken pipe exception. Gracefully exit in the case of a broken pipe in arc_summary.py. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6965 Closes #6969	2017-12-19 13:19:24 -08:00
LOLi	c4ba46dead	Handle invalid options in arc_summary If an invalid option is provided to arc_summary.py we handle any error thrown from the getopt Python module and print the usage help message. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6983	2017-12-19 13:02:40 -08:00
LOLi	c30e34faa1	ZTS: Fix create-o_ashift test case The function that fills the uberblock ring buffer on every device label has been reworked to avoid occasional failures caused by a race condition that prevents 'zpool sync' from writing some uberblock sequentially: this happens when the pool sync ioctl dispatch code calls txg_wait_synced() while we're already waiting for a TXG to sync. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6924 Closes #6977	2017-12-19 10:49:33 -08:00

... 2 3 4 5 6 ...

1007 Commits