This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#6247Closes#9024Closes#10277Closes#10278
The 'zfs share -a' currently skips any filesystems which
have 'canmount=noauto' set. This behavior is unexpected since the
one would expect 'zfs share -a' to share any mounted filesystem
that has the 'sharenfs' property already set.
This changes the behavior of 'zfs share -a' to allow the sharing
of 'canmount=noauto' datasets if they are mounted.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-71313
Closes#10688
zfs_jail was not using zfs_ioctl so failed to map the IOC number
correctly. Use zfs_ioctl to perform the jail ioctl and add a test
case for FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10658
FreeBSD recently integrated a change which causes \s in a regex to
throw an error instead of silently being misinterpreted as an s.
Change the regex in zpool_colors.ksh to use [[:space:]].
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#10651
When a clone is promoted, its livelist is no longer accurate, so it is
discarded. If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.
Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B
If we promote C, it discards C's livelist. It should discard B's
livelist, but that is not happening. The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C. The incorrectly-freed blocks can be reallocated causing
checksum errors. And when C is destroyed it can double-free the
incorrectly-freed blocks.
The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir. So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed. As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10652
Livelists and spacemaps are data structures that are logs of allocations
and frees. Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.
These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).
ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab. However, it
generally halts processing when the error is detected.
When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.
The specific checks added are:
Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]
Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
spacemap log, then report any double FREEs and double ALLOCs
Verify that livelists are consistenet with spacemaps. The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes#10515
== Motivation and Context
The current implementation of 'sharenfs' and 'sharesmb' relies on
the use of the sharetab file. The use of this file is os-specific
and not required by linux or freebsd. Currently the code must
maintain updates to this file which adds complexity and presents
a significant performance impact when sharing many datasets. In
addition, concurrently running 'zfs sharenfs' command results in
missing entries in the sharetab file leading to unexpected failures.
== Description
This change removes the sharetab logic from the linux and freebsd
implementation of 'sharenfs' and 'sharesmb'. It still preserves an
os-specific library which contains the logic required for sharing
NFS or SMB. The following entry points exist in the vastly simplified
libshare library:
- sa_enable_share -- shares a dataset but may not commit the change
- sa_disable_share -- unshares a dataset but may not commit the change
- sa_is_shared -- determine if a dataset is shared
- sa_commit_share -- notify NFS/SMB subsystem to commit the shares
- sa_validate_shareopts -- determine if sharing options are valid
The sa_commit_share entry point is provided as a performance enhancement
and is not required. The sa_enable_share/sa_disable_share may commit
the share as part of the implementation. Libshare provides a framework
for both NFS and SMB but some operating systems may not fully support
these protocols or all features of the protocol.
NFS Operation:
For linux, libshare updates /etc/exports.d/zfs.exports to add
and remove shares and then commits the changes by invoking
'exportfs -r'. This file, is automatically read by the kernel NFS
implementation which makes for better integration with the NFS systemd
service. For FreeBSD, libshare updates /etc/zfs/exports to add and
remove shares and then commits the changes by sending a SIGHUP to
mountd.
SMB Operation:
For linux, libshare adds and removes files in /var/lib/samba/usershares
by calling the 'net' command directly. There is no need to commit the
changes. FreeBSD does not support SMB.
== Performance Results
To test sharing performance we created a pool with an increasing number
of datasets and invoked various zfs actions that would enable and
disable sharing. The performance testing was limited to NFS sharing.
The following tests were performed on an 8 vCPU system with 128GB and
a pool comprised of 4 50GB SSDs:
Scale testing:
- Share all filesystems in parallel -- zfs sharenfs=on <dataset> &
- Unshare all filesystems in parallel -- zfs sharenfs=off <dataset> &
Functional testing:
- share each filesystem serially -- zfs share -a
- unshare each filesystem serially -- zfs unshare -a
- reset sharenfs property and unshare -- zfs inherit -r sharenfs <pool>
For 'zfs sharenfs=on' scale testing we saw an average reduction in time
of 89.43% and for 'zfs sharenfs=off' we saw an average reduction in time
of 83.36%.
Functional testing also shows a huge improvement:
- zfs share -- 97.97% reduction in time
- zfs unshare -- 96.47% reduction in time
- zfs inhert -r sharenfs -- 99.01% reduction in time
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Bryant G. Ly <bryangly@gmail.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-68690
Closes#1603Closes#7692Closes#7943Closes#10300
The device_rebuild feature enables sequential reconstruction when
resilvering. Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices. However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.
The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.
zpool attach -s <pool> <existing vdev> <new vdev>
zpool replace -s <pool> <old vdev> <new vdev>
The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.
The `zpool wait -t resilver` command was extended to wait on
sequential resilvers. From this perspective they are no different
than healing resilvers.
Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.
As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory. Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.
Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10349
The block histogram tracks the changes to psize, lsize and asize
both in the count of the number of blocks (by blocksize) and the
total length of all of the blocks for that blocksize. It also
keeps a running total of the cumulative size of all of the blocks
up to each size to help determine the size of caching SSDs to be
added to zfs hardware deployments.
The block history counts and lengths are summarized in bins
which are powers of two. Even rows with counts of zero are printed.
This change is accessed by specifying one of two options:
zdb -bbb pool
zdb -Pbbb pool
The first version prints the table in fixed size columns.
The second prints in "parseable" output that can be placed into
a CSV file.
Fixed Column, nicenum output sample:
block psize lsize asize
size Count Length Cum. Count Length Cum. Count Length Cum.
512: 3.50K 1.75M 1.75M 3.43K 1.71M 1.71M 3.41K 1.71M 1.71M
1K: 3.65K 3.67M 5.43M 3.43K 3.44M 5.15M 3.50K 3.51M 5.22M
2K: 3.45K 6.92M 12.3M 3.41K 6.83M 12.0M 3.59K 7.26M 12.5M
4K: 3.44K 13.8M 26.1M 3.43K 13.7M 25.7M 3.49K 14.1M 26.6M
8K: 3.42K 27.3M 53.5M 3.41K 27.3M 53.0M 3.44K 27.6M 54.2M
16K: 3.43K 54.9M 108M 3.50K 56.1M 109M 3.42K 54.7M 109M
32K: 3.44K 110M 219M 3.41K 109M 218M 3.43K 110M 219M
64K: 3.41K 218M 437M 3.41K 218M 437M 3.44K 221M 439M
128K: 3.41K 437M 874M 3.70K 474M 911M 3.41K 437M 876M
256K: 3.41K 874M 1.71G 3.41K 874M 1.74G 3.41K 874M 1.71G
512K: 3.41K 1.71G 3.41G 3.41K 1.71G 3.45G 3.41K 1.71G 3.42G
1M: 3.41K 3.41G 6.82G 3.41K 3.41G 6.86G 3.41K 3.41G 6.83G
2M: 0 0 6.82G 0 0 6.86G 0 0 6.83G
4M: 0 0 6.82G 0 0 6.86G 0 0 6.83G
8M: 0 0 6.82G 0 0 6.86G 0 0 6.83G
16M: 0 0 6.82G 0 0 6.86G 0 0 6.83G
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Robert E. Novak <novak5@llnl.gov>
Closes: #9158Closes#10315
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.
Add some files that were missing altogether in various Makefile's.
The changes to disted files in this commit (excluding deleted files):
+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10501
Correct various typos in the comments and tests.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#10423
Use option '-o' after action for compatibility
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#10426
It can take a moment for the NFS server to give up the mountpoint
after unsharing a filesystem.
Use log_must_busy to retry export/destroy a few times after switching
off sharenfs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10380
Update cleanup_filesystem to use destroy_dataset when performing
cleanup. This ensures the destroy is retried if the pool is busy
preventing occasional failures.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10358
Small program that converts a dataset id and an object id to a path
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10204
The cleanup routine for this test attempts to remove some temporary
files with `rm -f $VDEV_*`, but VDEV_ is undefined. As a result, all
files in the current working directory (/var/tmp/test_results/current)
get removed instead. This includes the complete log file of all tests.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#10324
When a top-level vdev is removed from a pool it is converted to an
indirect vdev. Until now splitting such mirrored pools was not possible
with zpool split. This patch enables handling of indirect vdevs and
splitting of those pools with zpool split.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10283
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes#10243
Round up the volume size requested in `zfs create -V size` to the next
higher multiple of the volblocksize. Updates the man page and adds a
test to verify the new behavior.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: puffi <puffi@users.noreply.github.com>
Signed-off-by: Alex John <alex@stty.io>
Closes#8541Closes#10196
Add the FreeBSD platform code to the OpenZFS repository. As of this
commit the source can be compiled and tested on FreeBSD 11 and 12.
Subsequent commits are now required to compile on FreeBSD and Linux.
Additionally, they must pass the ZFS Test Suite on FreeBSD which is
being run by the CI. As of this commit 1230 tests pass on FreeBSD
and there are no unexpected failures.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#898Closes#8987
Deduplicated send and receive is deprecated. To ease migration to the
new dedup-send-less world, the commit adds a `zstream redup` utility to
convert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.
The new `zstream` command also replaces the functionality of
`zstreamdump`, by way of the `zstream dump` subcommand. The
`zstreamdump` command is replaced by a shell script which invokes
`zstream dump`.
The way that `zstream redup` works under the hood is that as we read the
send stream, we build up a hash table which maps from `<GUID, object,
offset> -> <file_offset>`.
Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is `drr_toguid, drr_object, drr_offset`.)
For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).
For WRITE_BYREF records, we change them to WRITE records. We find the
referenced WRITE record by looking in the hash table (for the record
with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading
the record header and payload from the specified offset in the stream
file. This is why the stream can not be a pipe. The found WRITE record
replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`,
and `drr_offset` fields changed to be the same as the WRITE_BYREF's
(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).
This algorithm requires memory proportional to the number of WRITE
records (same as `zfs send -D`), but the size per WRITE record is
relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream
with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to
"redup".
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10124Closes#10156
Add a mechanism to wait for delete queue to drain.
When doing redacted send/recv, many workflows involve deleting files
that contain sensitive data. Because of the way zfs handles file
deletions, snapshots taken quickly after a rm operation can sometimes
still contain the file in question, especially if the file is very
large. This can result in issues for redacted send/recv users who
expect the deleted files to be redacted in the send streams, and not
appear in their clones.
This change duplicates much of the zpool wait related logic into a
zfs wait command, which can be used to wait until the internal
deleteq has been drained. Additional wait activities may be added
in the future.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9707
libzfs aborts and dumps core on EINVAL from the kernel when trying to
do a redacted send with a bookmark that is not a redaction bookmark.
Move redacted bookmark validation into libzfs.
Check if the bookmark given for redactions is actually a redaction
bookmark. Print an error message and exit gracefully if it is not.
Don't abort on EINVAL in zfs_send_one.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10138
Currently when the dataset is in use we can't receive snapshots.
zfs send test/1@asd | zfs recv -FM test/2
cannot unmount '/test/2': Device busy
This commits add option 'M' which attempts to forcibly unmount the
dataset. Thanks to this we can enforce receiving snapshots in a
single step.
Note that this functionality is not supported on Linux because the
VFS will prevent active mounted filesystems from being unmounted,
even with the force option. This is the intended VFS behavior.
Test cases were added to verify the expected behavior based on
the platform.
Discussed-with: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
External-issue: https://reviews.freebsd.org/D22306Closes#9904
Filesystems allow overlay mounts by default on FreeBSD and Linux.
Respect the native convention by switching the default to overlay=on,
while retaining the option to turn the property off for compatibility
with other operating systems' conventions.
Update documentation and tests accordingly.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10030
`default_setup` takes a disk list as the first argument and has
optional additional arguments that control secondary functionality.
A couple of test setups mistakenly call `default_setup $DISKS`.
Add quotes so the second and subsequent disks are correctly included
in the pool as vdevs rather than triggering unwanted behavior from
`default_setup`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10097
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#10071
Use file vdevs if we are short on $DISKS.
Also fixed vol recursion for FreeBSD in 004.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10060
* Set geom debug flags in corrupt_blocks_at_level
* Use the right time zone for history tests
* Add missing commands.cfg entry for diskinfo
* Rewrite get_last_txg_synced to use zdb
* Don't check ulimits for sparse files
* Suspend removal before removing a vdev, not after
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10054
`zfs recv` of an incremental stream that already exists is ignored, with
a message like:
receiving incremental stream of pool/fs@incsnap into pool/fs@incsnap
snap testpool/testfs@incsnap already exists; ignoring
And the command exits successfully (exit code 0).
The zfs_receive_004_neg test is expecting that a this case will fail,
with nonzero exit code.
The fix is to remove this specific command from the test case. This
lets us check that the remaining commands do in fact fail.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10055
The function `get_used_prop` does not exist.
Use `get_prop used` instead.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10059
* Check for mountd in is_shared to avoid timeout when not running
* Enhance robustness of some cleanup functions
* Simplify atime lookup
* Skip sharenfs validation for now
* Don't add mountpoint property to inheritance validation on FreeBSD
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10047
* Force UFS sync before snap in vol rollback tests
* rw is not a valid share option on FreeBSD, use ro instead
* zfs_unmount_nested: mountpoint is in the pool, rmdir *before* export
* Fix some more platform checks
* Fix disappearing group in delegate tests
* Don't try delegating for jailed, only root can set it
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10038
The zpool destroy tests partition a single disk to create two pools.
This can be done using two disks and no partitioning instead.
And temporarily allow vol recursion for FreeBSD while in here.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10036
FreeBSD doesn't have a `share` command. It does have showmount.
Split the separate platform impls out of is_shared_impl.
Dispatch to the correct platform impl function from is_shared.
Eliminate the use of is_shared_impl from tests. is_shared works.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10037
FreeBSD does not support the "devices" and "nodevices" mount options.
Do not check these options on FreeBSD.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10028
These tests can be made to work without a bunch of complex
partitioning of physical disks.
Use the 3 disks directly, creating a few file disks if needed for a
compelling reason.
Reduce the use of shared variables that don't have a clear utility.
Catch the fallout in tests that include cfg/shlib from zpool_create.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10002
For some unknown reason, egrep was misbehaving with this pattern on
FreeBSD. The command works fine run interactively from a shell, but
in the test the output of egrep is empty.
Work around the issue by using a filter in the awk script instead.
While here, add a bit of diagnostic output and other simplifications
to the awk script as well.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10023
FreeBSD doesn't have the -n flag for cmp.
Read the area for the first four labels from the disk to a separate
file to compare instead of using the special flag to limit the size.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10024
We have have made the necessary changes in our module code to expose
zevents through both devd and the zpool events ioctl. Now the tunables
can be exposed and zpool events tests can be enabled on both platforms.
A few minor tweaks to the tests were needed to accommodate the way wc
formats output on FreeBSD.
zed remains to be ported.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10008
Moving forward, we wish to use org.openzfs (no dash) rather than
org.open-zfs or org.zfsonlinux for feature GUIDs and property names.
The existing feature GUIDs cannot be changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#10003
Add missing logic for FreeBSD to a few test scripts.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9994
The zpool_add tests include zpool_create.shlib for a few silly
variables.
Don't use those variables for the file names. Include zpool_add.kshlib
for whatever variables we still need.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9997
These tests do not need to use partitions.
Get rid of the partitioning and just use the disks directly.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9996
Cleanup functions should make a best effort to clean up as much as
possible.
Do a consistency pass in a bunch of tests to make the cleanup
functions less prone to failure and fix a few typos here and there.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9993
Missed this in the review, but wc output on FreeBSD is indented,
so string comparisons mismatch when comparing to an unindented number.
Compare counts as integers instead of strings.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9980
This feature allows copying existing bookmarks using
zfs bookmark fs#target fs#newbookmark
There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.
Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).
ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.
Overview:
- Terminology:
- source = existing snapshot or bookmark
- new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
- create new bookmark node
- copy source's `zbn_phys` to new's `zbn_phys`
- zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
bookmarks as sources
- => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
and bookmark sources
- Adjust CLI
- refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
- warn about redaction bookmark handling
- Add test cases
- CLI
- pyyzfs libzfs_core bindings
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9571
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display. Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9640Closes#9729
Currently SIMD accelerated AES-GCM performance is limited by two
factors:
a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.
b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.
To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.
Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#9749
FreeBSD has jails, not zones.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9899
Linux was missing a default value for DEV_DSKDIR. Set it to /dev.
Fix resulting fallout.
SLICE_PREFIX seems like a good candidate for including in the defaults.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9898
FreeBSD doesn't support feature@edonr.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9901
Allow a range of object identifiers to dump with -d. This may
be useful when dumping a large dataset and you want to break
it up into multiple phases, or to resume where a previous scan
left off. Object type selection flags are supported to reduce
the performance overhead of verbosely dumping unwanted objects,
and to reduce the amount of post-processing work needed to
filter out unwanted objects from zdb output.
This change extends existing syntax in a backward-compatible
way. That is, the base case of a range is to specify a single
object identifier to dump. Ranges and object identifiers can
be intermixed as command line parameters.
Usage synopsis:
Object ranges take the form <start>:<end>[:<flags>]
start Starting object number
end Ending object number, or -1 for no upper bound
flags Optional flags to select object types:
A All objects (this is the default)
d ZFS directories
f ZFS files
m SPA space maps
z ZAPs
- Negate effect of next flag
Examples:
# Dump all file objects
zdb -dd tank/fish 0👎f
# Dump all file and directory objects
zdb -dd tank/fish 0👎fd
# Dump all types except file and directory objects
zdb -dd tank/fish 0👎A-f-d
# Dump object IDs in a specific range
zdb -dd tank/fish 1000:2000
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#9832
When used with non-loop devices, zdb_004_pos fails because the disk
argument provided is the partition rather than the expected whole disk
name.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#9876
There doesn't seem to be a need for this complexity.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9854
Ensure the capture ends at the first DVA in case there are multiple
DVAs on the same line by only capturing up to the first '>' character.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9851
Remove the blkid version check from zpool_create_008_pos.ksh
so the test case will not be skipped.
All versions of blkid tested by the CI are either new enough
to not suffer from this issue, or have been patched as is
the case with CentOS 7 (libblkid-2.23.2-61).
Additionally, add a block_device_wait after device partitioning
to ensure the expected partitions will exist.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9853
This check isn't meant to be used for command substitution.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9844
As an alternative to the dataset name, zdb now allows the decimal
or hexadecimal objset ID to be specified. When permanent errors
are reported as 2 hexadecimal numbers (objset ID : object ID) in
zpool status; you can now use 'zdb <pool>[/objset ID] object' to
determine the names of the objset and object which have the error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9733
On FreeBSD grep does not support Perl extensions
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9841
These are no longer needed after fixing device name matching for whole
disks in libzutil on FreeBSD.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9839
Update tests to use the variables for tunable names.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9831
The name overlaps with a command needed by FreeBSD.
There is also no sense having two 'random' functions that do nearly
the same thing, so consolidate to just the more general one and name
it 'random_int_between'.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9820
For dedup, special and log devices "zpool add -n" does not print
correctly their vdev type:
~# zpool add -n pool dedup /tmp/dedup special /tmp/special log /tmp/log
would update 'pool' to the following configuration:
pool
/tmp/normal
/tmp/dedup
/tmp/special
/tmp/log
This could lead storage administrators to modify their ZFS pools to
unexpected and unintended vdev configurations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9783Closes#9390
Replace `test -b` and equivalents with `is_disk_device`, so that `-c`
is used instead on FreeBSD which has no block cache layer for devices.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9795
Neither FreeBSD nor Linux support dumping to zvols.
DilOS still uses these tests, so the files are kept and the tests have
been relocated to sunos.run.
An `is_illumos` function was added to libtest.shlib to eliminate some
awkward platform checks.
A few functions that are not expected to be used outside of illumos
have been sanitized of extraneous FreeBSD adaptations.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9794
Remove a few hardcoded instances of /var/tmp. This should use
the $TEST_BASE_DIR in order to allow the ZTS to be optionally
run in an alternate directory using `zfs-tests.sh -d <path>`.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9775
As of Python 3.5 the default behavior of json.tool was changed to
preserve the input order rather than lexical order. The test case
expects the output to be sorted so apply the --sort-keys option
to the json.tool command when using Python 3.5 and the option is
supported.
https://docs.python.org/3/library/json.html#module-json.tool
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9774
If the ZFS_COLOR env variable is set, then use ANSI color
output in zpool status:
- Column headers are bold
- Degraded or offline pools/vdevs are yellow
- Non-zero error counters and faulted vdevs/pools are red
- The 'status:' and 'action:' sections are yellow if they're
displaying a warning.
This also includes a new 'faketty' function in libtest.shlib that is
compatible with FreeBSD (code provided by @freqlabs).
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#9340
Update the common ZTS scripts and individual test cases as needed
in order to allow them to be run on FreeBSD. The high level goal
is to provide compatibility wrappers whenever possible to minimize
changes to individual test cases.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9692
This interferes with zdb_read_block trying all the decompression
algorithms when the 'd' flag is specified, as some are
expected to fail. Also control the output when guessing
algorithms, try the more common compression types first, allow
specifying lsize/psize, and fix an uninitialized variable.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9612Closes#9630
- Moves compression algorithms for tests to properties.shlib
- Removes all compression algorithms levels from general tests
- Replaces on with lz4 for compression tests
- Removes random algorithm selection, if not needed
- Cleans copyright header formatting
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Closes#9645
The `-o ashift` option must appear after attach to be properly
interpreted by getopt(3) on all platforms.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9636
The function zdb_read_block (zdb -R) was always intended to have a :c
flag which would read the DVA and length supplied by the user, and
display the checksum. Since we don't know which checksum goes with
the data, we should calculate and display them all.
For each checksum in the table, read in the data at the supplied
DVA:length, calculate the checksum, and display it. Update the man
page and create a zfs test for the new feature.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9607
df58307 removed the need to specify -d 1 when zfs list and zfs get are
called with -t snapshot on a datset. This commit extends the same
behaviour to -t bookmark.
This commit also introduces the 'snap' shorthand for snapshots from
zfs list to zfs get.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9589
Currently, when you call 'zfs change-key' on an encrypted dataset
that has an unencrypted child, the code will trigger a VERIFY.
This VERIFY is leftover from before we allowed unencrypted
datasets to exist underneath encrypted ones. This patch fixes the
issue by simply replacing the VERIFY with an early return when
recursing through datasets.
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9524
* Use .ksh extension for ksh scripts, not .sh
* Remove .ksh extension from tests in common.run
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9502
With 4k disks, this test will fail in the last section because the
expected human readable value of 20.0M is reported as 20.1M. Rather than
use the human readable property, switch to the parsable property and
verify that the values are reasonably close.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#9477
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9445
Mostly whitespace changes, no functional changes intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9447
Currently, the recv_fix_encryption_hierarchy() function accepts
'destsnap' as one of its parameters. Originally, this was intended
to be the top-level dataset of a receive (whether or not the
receive was recursive). Unfortunately, this parameter actually is
simply the input that is passed in from the command line. When
the user specifies 'zfs recv -d', this string is actually only the
name of the receiving pool since the rest of the name is derived
from the send stream. This causes the function to fail, leaving
some datasets with an invalid encryption hierarchy.
This patch resolves this problem by passing in the top_zfs variable
instead. In order to make this work, this patch also includes some
changes that ensure the value is always present when we need it.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9273Closes#9309
lot_must -> log_must
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9362
Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try
to access its interior slots when dnodesize > sizeof(dnode_phys_t).
This is normally not an issue because the interior slots are
zero-filled, which report_dnode() handles calling
report_free_dnode_range(). However this is not the case for encrypted
large dnodes or filesystem using many SA based xattrs where the extra
data past the legacy dnode size boundary is interpreted as a
dnode_phys_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7678Closes#8931Closes#9343
Currently, spa_keystore_change_key_sync_impl() does not recurse
into clones when updating encryption roots for either a call to
'zfs promote' or 'zfs change-key'. This can cause children of
these clones to end up in a state where they point to the wrong
dataset as the encryption root. It can also trigger ASSERTs in
some cases where the code checks reference counts on wrapping
keys. This patch fixes this issue by ensuring that this function
properly recurses into clones during processing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9267Closes#9294
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#9162
Get rid of the `get_used_prop` function. `get_prop used` works fine.
Fix the comment describing the function parameters. The type does not
have a default, and mntp is also used for ext2.
Rename the variable for the number of copies from `copy` to `copies`.
Use a `case` statement to match the type parameter, order the cases
alphabetically, and add a little sanity checking for good measure.
Use eval to make sure the output of commands is silenced rather than
the log messages when redirecting output to /dev/null.
Simplify cases where zfs requires special behavior.
Don't allow the test to loop forever in the event space usage does not
change. Bail out of the loop and fail after an arbitrary number of
iterations.
Add more information to the log message when the test fails, to help
debugging.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9286
Currently, the noop receive code fails to work with raw send streams
and resuming send streams. This happens because zfs_receive_impl()
reads the DRR_BEGIN payload without reading the payload itself.
Normally, the kernel expects to read this itself, but in this case
the recv_skip() code runs instead and it is not prepared to handle
the stream being left at any place other than the beginning of a
record.
This patch resolves this issue by manually reading the DRR_BEGIN
payload in the dry-run case. This patch also includes a number of
small fixups in this code path.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9221Closes#9173
Remove a lot of unnecessary setting and incrementing of `i`.
Remove unused variable `j`.
Instead of calling out to Python in a loop to generate the same string
repeatedly, generate the string once using shell constructs before
entering the loop.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9284
md5sum in particular but also sha256sum to a lesser extent is used
in several areas of the test suite for computing checksums. The vast
majority of invocations are followed by `| awk '{ print $1 }'`.
Introduce functions to wrap up `md5sum $file | awk '{ print $1 }'` and
likewise for sha256sum. These also serve as a convenient interface for
alternative implementations on other platforms.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9280
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9250
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9249
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9247
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9248
BSD getopt() and getopt_long() want options before arguments.
Reorder arguments to zfs/zpool in tests to put all the options first.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9228
Other than this test, zpool list -p is not well tested by any of the
automated tests. Add a test for zpool list -p.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9134
The mdb_set_uint32 function requires that the values passed in be
decimal. This was overlooked initially because the matching Linux
function accepts both decimal and hexadecimal values.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9125Closes#9195
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.
This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.
We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes#8416
The tests in tests/functional/cli_root/zpool_status should all require
root. However, linux.run has "user =" specified for those tests, which
means they run as a normal user. When I removed that line to run them
as root, the following tests did not pass:
zpool_status_003_pos
zpool_status_-c_disable
zpool_status_-c_homedir
zpool_status_-c_searchpath
These tests need to be run as a normal user. To fix this, move these
tests to a new tests/functional/cli_user/zpool_status directory.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#9057
Adds the ability to sanity check zfs create arguments and to see the
value of any additional properties that will local to the dataset. For
example, automation that may need to adjust quota on a parent filesystem
before creating a volume may call `zfs create -nP -V <size> <volume>` to
obtain the value of refreservation. This adds the following options to
zfs create:
- -n dry-run (no-op)
- -v verbose
- -P parseable (implies verbose)
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Mike Gerdts <mike.gerdts@joyent.com>
Closes#8974
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe
Change target size of metaslabs from 256GB to 16GB
c853f382db
zdb -L should skip leak detection altogether
21e7cf5da8
vs_alloc can underflow in L2ARC vdevs
7558997d2f
Simplify log vdev removal code
6c926f426a
Get rid of space_map_update() for ms_synced_length
425d3237ee
Introduce auxiliary metaslab histograms
928e8ad47d
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8442
Strategy of parallel mount is as follows.
1) Initial thread dispatching is to select sets of mount points that
don't have dependencies on other sets, hence threads can/should run
lock-less and shouldn't race with other threads for other sets. Each
thread dispatched corresponds to top level directory which may or may
not have datasets to be mounted on sub directories.
2) Subsequent recursive thread dispatching for each thread from 1)
is to mount datasets for each set of mount points. The mount points
within each set have dependencies (i.e. child directories), so child
directories are processed only after parent directory completes.
The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.
There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.
1) #8833 case where input is `/a /a /a/b` after sorting.
The problem is that libzfs_path_contains() can't correctly handle an
input list with two same top level directories.
There is a race between two POSIX threads A and B,
* ThreadA for "/a" for test1 and "/a/b"
* ThreadB for "/a" for test0/a
and in case of #8833, ThreadA won the race. Two threads were created
because "/a" wasn't considered as `"/a" contains "/a"`.
2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
The problem is that libzfs_path_contains() can't correctly handle an
input list containing "/".
There is a race between two POSIX threads A and B,
* ThreadA for "/" and "/var/data/test"
* ThreadB for "/var/data"
and in case of #8450, ThreadA won the race. Two threads were created
because "/var/data" wasn't considered as `"/" contains "/var/data"`.
In other words, if there is (at least one) "/" in the input list,
the initial thread dispatching must be single-threaded since every
directory is a child of "/", meaning they all directly or indirectly
depend on "/".
In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8450Closes#8833Closes#8878
This patch adds the ability for the user to unload keys for
datasets as they are being unmounted. This is analogous to
'zfs mount -l'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes: #8917Closes: #8952
The -Y option was added for ztest to test split block reconstruction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8926
The "zfs remap" command was disabled by
6e91a72fe3, because it has little utility
and introduced some tricky bugs. This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.
Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8944
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.
Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8737Closes#8870
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x). The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.
ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful. The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad. This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.
Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.
Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).
In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection. It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).
We should remove the dedupditto functionality. For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect. For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate. However, ZFS won't write any new dedupditto
blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Issue #8270Closes#8310
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#7958
This change restricts filesystem creation if the given name
contains either '.' or '..'
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes#8842Closes#8564
This commits fixes a double-free in zfs_ioc_pool_create() triggered by
specifying an unsupported combination of properties when creating a pool
with encryption enabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8791
files in dist_*_SCRIPTS get installed with 0755, those in dist_*_DATA
with 0644. This commit moves all .kshlib, .shlib and .cfg files in the
testsuite to dist_pkgdata_DATA, and removes the shebang from
zpool_import.kshlib.
This ensures that the files are installed with appropriate permissions
and silences some warnings from lintian
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes#8803
This change prevents the following warning when packaging some zfs-tests
files:
*** WARNING: ./usr/src/zfs-0.8.0/tests/zfs-tests/include/zpool_script.shlib
is executable but has empty or no shebang, removing executable bit
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8787
The 'zpool resilver' command requires that the resilver_defer
feature is active on the pool. Unfortunately, the check for
this was left out of the original patch. This commit simply
corrects this so that the command properly returns an error
in this case.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8700
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#8563Closes#8622
The features.kernel layout should match features.pool.
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#8566
Currently, when attempting to list snapshots ZFS may do a lot of
extra work checking child datasets. This is because the code does
not realize that it will not be able to reach any snapshots
contained within snapshots that are at the depth limit since the
snapshots of those datasets are counted as an additional layer
deeper. This patch corrects this issue.
In addition, this patch adds the ability to do perform the commands:
$ zfs list -t snapshot <dataset>
$ zfs get -t snapshot <prop> <dataset>
as a convenient way to list out properties of all snapshots of a
given dataset without having to use the depth limit.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8539
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends. By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.
This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool. The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience. The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.
The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq. This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.
In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property. It relies on the exact same infrastructure as the
manual TRIM. However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab. When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree. The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.
Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`. This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them. An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8419Closes#598
1) As implemented the `zpool labelclear` command overwrites
the calculated offsets of all four vdev labels even when only a
single valid label is found. If the device as been re-purposed
but still contains a valid label this can result in space no
longer owned by ZFS being zeroed. Prevent this by verifying
every label removed is intact before it's overwritten.
2) Address a small bug in zpool_do_labelclear() which prevented
labelclear from working on file vdevs. Only block devices support
BLKFLSBUF, try the ioctl() but when it's reported as unsupported
this should not be fatal.
3) Fix `zpool labelclear` so it can be run on vdevs which were
removed from the pool with `zpool remove`. Additionally, allow
intact but partial labels to be cleared as in the case of a failed
`zpool attach` or `zpool replace`.
4) Remove LABELCLEAR and LABELREAD variables for test cases.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8500Closes#8373Closes#6261
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.
This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.
To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8453
The number of IO and checksum events should match the number of errors
seen in zpool status. Previously there was a mismatch between the
two counts because zpool status would only count unrecovered errors,
while zpool events would get an event for *all* errors (recovered or
not). This lead to situations where disks could be faulted for
"too many errors", while at the same time showing zero errors in zpool
status.
This fixes the zpool status error counters to increment at the same
times we post the error events.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4851Closes#7817
This patch attempts to address some user concerns that have arisen
since errata 4 was introduced.
* The errata warning has been made less scary for users without
any encrypted datasets.
* The errata warning now clears itself without a pool reimport if
the bookmark_v2 feature is enabled and no encrypted datasets
exist.
* It is no longer possible to create new encrypted datasets without
enabling the bookmark_v2 feature, thus helping to ensure that the
errata is resolved.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue ##8308
Closes#8504
The ksh 'echo -n' behavior on Illumos and Linux differs. For
compatibility with others platforms switch to "printf '%s' ".
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8501
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.
This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.
This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8308
The "-t" argument to "zfs program" specifies a limit on the number of
LUA instructions that can be executed. The zfs.8 manpage has the wrong
description. It should be updated to match what's in zfs-program.8
Also fix the formatting of the zfs help message.
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8410
During the cleanup function of this test, an attempt to destroy a volume
can fail because the volume is busy. This leaves the system with
unexpected datasets which in turn causes subsequent failures.
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#8422
= Old behavior
For vdev sizes 100GB to 50TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 256GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.
= New Behavior
For vdev sizes 100GB to 3TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 16GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.
= Reasoning
The old behavior makes metaslabs grow in size when
the vdev range is between 3TB (ms_size 16GB) and
32PB (ms_size 256GB). Even though keeping the number
of metaslabs is good in terms of potential number of
I/Os per TXG, these bigger metaslabs take longer
to be loaded and after they are loaded they can
take up a lot of memory because of their range trees.
This change tries to put a boundary in memory and
loading time for the specific range of vdev sizes.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8324
The implementation of 'zfs remap' has proven to be problematic since
it modifies the objset (but not its logical contents) by dirtying
metadata without owning it. The consequence of which is that
dmu_objset_remap_indirects() is vulnerable to certain races.
For example, if we are in the middle of receiving into the filesystem
while it is being remapped. Then it is possible we could evict the
objset when the receive completes (see dsl_dataset_clone_swap_sync_impl,
or dmu_recv_end_sync), but dmu_objset_remap_indirects() may be still
using the objset. The result of which would be a panic.
Extended runs of ztest(8) have exposed other possible races which
can occur when using 'zfs remap'. Several of these have been fixed
but there may be others which have not yet been encountered and
diagnosed.
Furthermore, the ability to manually remap a filesystem is no longer
particularly useful now that the removal code can map large chunks.
Coupled with the fact that explaining what this command does and why
it may be useful requires a detailed understanding of the internals
of device removal. These are details users should not be bothered
with.
Therefore, the 'zfs remap' command is being disabled but not entirely
removed. It may be removed in the future or potentially reworked
to address the issues described above. Since 'zfs remap' has never
been part of a tagged release its removal is expected to have
minimal impact.
The ZTS tests have been updated to continue to exercise the command
to prevent atrophy, but it has been removed entirely from ztest(8).
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8238
Since the vdev initialize feature was integrated the ZTS
zpool_resilver_restart test has been hitting its internal
timeout more frequently. This happens most often on
the coverage builder but not exclusively. Increasing the
timeout for this test case prevents any false positives.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8273
Only display the full details of the vdev initialization state
in 'zpool status' output when requested with the -i option.
By default display '(initializing)' after vdevs when they are
being actively initialized. This is consistent with the
established precident of appending '(resilvering), etc' and
fits within the default 80 column terminal width making it
easy to read.
Additionally, updated the 'zpool initialize' documentation to
make it clear the options are mutually exclusive, but allow
duplicate options like all other zfs/zpool commands.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8230
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210ebCloses#8230
When MMP was merged the status codes in libzfs_status were not
updated to add the status code for ZPOOL_STATUS_IO_FAILURE_MMP. This
commit corrects this and adds comments to help keep track of which
code is used for which status.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes#8148Closes#8222
This commit fixes the following ASSERT in zfs_receive_one() when
receiving a send stream from a root dataset with the "-e" option:
$ sudo zfs snap source@snap
$ sudo zfs send source@snap | sudo zfs recv -e destination/recv
chopprefix > drrb->drr_toname
ASSERT at libzfs_sendrecv.c:3804:zfs_receive_one()
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8121
Currently, several tests in the ZFS Test Suite that attempt to
test scrub and resilver behavior occasionally fail. A big reason
for this is that these tests use a combination of zinject and
zfs_scan_vdev_limit to attempt to slow these operations enough
to verify their test commands. This method works most of the time,
but provides no guarantees and leads to flaky behavior. This patch
adds a new tunable, zfs_scan_suspend_progress, that ensures that
scans make no progress, guaranteeing that tests can be run without
racing.
This patch also changes zfs_remove_max_bytes_pause to match this
new tunable. This provides some consistency between these two
similar tunables and ensures that the tunable will not misbehave
on 32-bit systems.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8111
This commit fixes several "not found" errors caused by calling undefined
or incorrect shell functions in the following ZFS Test Suite groups:
* alloc_class
* channel_program/lua_core
* channel_program/synctask_core
* cli_root/zpool_import
* cli_user/misc
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8152
This change allows 'zpool split' to work with whole-disk devices and
updates the ZFS Test Suite with a new script to exercise this
functionality.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6643Closes#8133
Porting Notes:
* Use thread pools (tpool) API instead of introducing taskq interfaces
to libzfs.
* Use pthread_mutext for locks as mutex_t isn't available.
* Ignore alternative libshare initialization since OpenZFS-7955 is
not present on zfsonlinux.
Authored by: Sebastien Roy <seb@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Authored by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/8115
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3f0e2b569Closes#8092
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.
We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#1537Closes#7926
zfs_rename_006_pos has been flaky in the past because it was
missing a call to block_device_wait to ensure the zvols it creates
are present before running dd. Whenever this this happened,
zfs_rename_009_neg would also fail because the first test would
leak a zvol clone that it did not know how to clean up. This patch
fixes the root cause and reenables the test. It also fixes some
minor grammar errors.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#5647Closes#5648Closes#8088
For Linux, place a file in the mount point folder so it will be
considered "busy". Fix the while loop so it doesn't rm in
directories above the testdir. Add Linux-specific code to test
overlay on|off.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#4990Closes#8081
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.
This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7732
Commit 0c6d093 caused a regression in the inherit codepath.
The fix is to restrict the changelist iteration on mountpoints and
add proper handling for 'legacy' mountpoints
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#7988Closes#7991
Historically, zpool status prints "(repairing)" for any drives that
have errors during a scrub:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/tmp/file1 ONLINE 13 0 0 (repairing)
/tmp/file2 ONLINE 0 0 0
/tmp/file3 ONLINE 0 0 0
This was accidentally broken in "OpenZFS 9166 - zfs storage pool
checkpoint" (d2734cc). This patch adds it back in.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7779Closes#7978
This change adds a new test case to the zfs-test suite to verify that
when 'zfs destroy' is used on a shared dataset, the dataset will be
unshared after the destroy operation completes.
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#7941
Modified changelist_gather()ing for the mountpoint property.
Now instead of iterating on all dataset descendants, we read
/proc/self/mounts and iterate on the mounted descendant datasets only.
Switched changelist implementation from a uu_list_* to uu_avl_* in
order to reduce changlist code-path's worst case time complexity.
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#7967
This change improve the handling of invalid filesystem properties when
specified at pool creation: this is useful when 'zpool create -n'
(dry run) is executed to detect invalid fs-level options (-O) before
the actual command is run.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7620Closes#7878
Added vdev_resilver_needed() check to verify VDEVs are fully
synced, so that after split the new pool will not be corrupted.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes#7865Closes#7881
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#5182
This extends our sysfs '/sys/module/zfs' entry to include feature
and property attributes. The primary consumer of this information
is user processes, like the zfs CLI, that need to know what the
current loaded ZFS module supports. The libzfs binary will consult
this information when instantiating the zfs and zpool property
tables and the pool features table.
This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
features.runtime
features.pool
properties.dataset
properties.pool
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#7706
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup functions
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy. This was done not only for
volumes but also for file systems for consistency.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7854
Removing hardcoded paths in many scripts.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bernie1995 <bernie.pikes@gmail.com>
Issue #7507Closes#7843
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7847
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.
While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7753Closes#7795
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#7705
Allow the 'zpool replace' to run slowly without overwhelming the vdev
queues by setting zfs_scan_vdev_limit=128k. This limits the number of
concurrent slow IOs which need to be handled. The net effect is the
test case runs approximately 3x faster putting it well under the 10
minute per-test time limit.
Rename import_cache* test cases to imprt_cachefile*. Originally
these were renamed due to a maximum tar name limit, this limit was
removed by commit 1dfde3d9b.
Replaced instances of /var/tmp in zpool_import.cfg with $TEST_BASE_DIR.
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7765Closes#7802
Traditionally Automake has defaulted to the V7 tar format when
creating tarballs for distributions. One of the many limitions
of this format is a 99 character maximum path + file name limit.
This can cause problems when adding new test cases to the ZTS
due to the depth of the sub-tree and descriptive test names.
This change switches the build system to the posix (aliased as
pax) tar format which conforms to the POSIX.1-2001 specification.
This format does not suffer from the V7 limitations, was designed
to be compatible, and will become the default format in future
versions of GNU tar.
https://www.gnu.org/software/tar/manual/html_chapter/tar_8.html
As part of this change the blockfiles directories which were
originally removed due to this limit have been readded.
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7767
This change modifies how 'checksum' and 'dedup' properties are verified
in zfs_check_settable() handling the case where they are explicitly
inherited in the dataset hierarchy when receiving a recursive send
stream.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7755Closes#7576Closes#7757
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7763
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure. Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.
Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev. The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.
From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid. If
a match is found the ZED reopens the pool vdev. This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read. Otherwise, it wouldn't
be possible to report the maximum possible expansion size.
Finally, if the property autoexpand=on a vdev expansion will be
attempted. After performing some sanity checks on the disk to
verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated. The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.
In order to make all of the above possible the following changes
were required:
* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
These tests now create a pool which is layered on a loopback,
scsi_debug, and file vdev. This allows for testing of non-
partitioned block device (loopback), a partition block device
(scsi_debug), and a file which does not receive udev change
events. This provided for better test coverage, and by removing
the layering on ZFS volumes there issues surrounding layering
one pool on another are avoided.
* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
This allows for matching by guid rather than path which is a
more reliable way for the ZED to reference a vdev.
* Fixed zfs_zevent_wait() signal handling which could result
in the ZED spinning when a signal was not handled.
* Removed vdev_disk_rrpart() functionality which can be abandoned
in favor of kernel provided blkdev_reread_part() function.
* Added a rwlock which is held as a writer while a disk is being
reopened. This is important to prevent errors from occurring
for any configuration related IOs which bypass the SCL_ZIO lock.
The zpool_reopen_007_pos.ksh test case was added to verify IO
error are never observed when reopening. This is not expected
to impact IO performance.
Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.
* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
ZFS volumes. This is as good as a unique physical path, while the
volumes are not used in the test cases anymore for other reasons
this improvement was included.
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#120Closes#2437Closes#5771Closes#7366Closes#7582Closes#7629
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.
The problem can be seen by attempting to create a dataset using the -p
option with many levels:
panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff
fffffffffbc3aa60 unix:die+100 ()
fffffffffbc3ab70 unix:trap+157d ()
ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
ffffff00083d7050 zfs:dbuf_rele+2e ()
...
ffffff00083d7080 zfs:dsl_dir_close+32 ()
ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
ffffff00083d7130 zfs:dbuf_rele+2e ()
... The block above repeats once per directory in the ...
... create -p command, working towards the root ...
ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
ffffff00083dbaa0 zfs:findfunc+23 ()
ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
ffffff00083dbd00 genunix:cdev_ioctl+45 ()
ffffff00083dbd40 specfs:spec_ioctl+5a ()
ffffff00083dbdc0 genunix:fop_ioctl+7b ()
ffffff00083dbec0 genunix:ioctl+18e ()
ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()
Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K. Enough time has passed
that 16K can be reasonably assumed to be the default value. It
was increased in the 3.15 kernel released in June of 2014.
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75aCloses#7681
Motivation
==========
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
New encoding
============
The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.
Other references:
=================
The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238Closes#7665
Details about the motivation of this feature and its usage can
be found in this blogpost:
https://sdimitro.github.io/post/zpool-checkpoint/
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
* renames members of "struct metaslab trees to be shorter without
losing meaning
* space_map_{alloc,truncate}() accept a block size as a
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for metaslab
space maps in terms of bandwirdth since they are scattered all
over the disk. But for other space maps this default is probably
not what we want. Examples are device removal's vdev_obsolete_sm
or vdev_chedkpoint_sm from this review. Both of these have a
1:1 relationship with each vdev and could benefit from a bigger
block size.
Porting notes:
* The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.
* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
to block device backed pools.
* ZTS:
* Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".
* Don't use large dd block sizes on /dev/urandom under Linux in
checkpoint_capacity.
* Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
its attempts to fill the pool
* Create the base and nested pools with sync=disabled to speed up
the "setup" phase.
* Clear labels in test pool between checkpoint tests to avoid
duplicate pool issues.
* The import_rewind_device_replaced test has been marked as "known
to fail" for the reasons listed in its DISCLAIMER.
* New module parameters:
zfs_spa_discard_memory_limit,
zfs_remove_max_bytes_pause (not documented - debugging only)
vdev_max_ms_count (formerly metaslabs_per_vdev)
vdev_min_ms_count
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8Closes#7570
Adopt and extend the OpenZFS ZTS results analysis script for use
with ZFS on Linux. This allows for automatic analysis of tests
which may be skipped for a variety or reasons or which are not
entirely reliable.
In addition to the list of 'known' failures, which have been updated
for ZFS on Linux, there in a new 'maybe' section. This mapping
include tests which might be correctly skipped depending on the
test environment. This may be because of a missing dependency or
lack of required kernel support. This list also includes tests
which normally pass but might on occasion fail for a harmless
reason.
The script was also extended include a reason for why a given test
might be skipped or may fail. The reason will be included after
the test in the "results other than PASS that are expected" section.
For failures it is preferable to set the reason to the GitHub issue
number and for skipped tests several generic reasons are available.
You may also specify a custom reason if needed.
All tests were added back in to the linux.run file even if they are
expected to failed. There is value in running tests which may not
pass, the expected results for these tests has been encoded in
the new analysis script.
All tests which were disabled because they ran more slowly on a
32-bit system have been re-enabled. Developers working on 32-bit
systems should assess what it reasonable for their environment.
The unnecessary dependency on physical block devices was removed for
the checksum, grow_pool, and grow_replicas test groups so they are
no longer skipped. Updated the filetest_001_pos test case to run
properly now that it is enabled and moved the grow tests in to a
single directory.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7638
Currently, during a recursive zfs destroy the first error that is
encountered will stop the destruction of the datasets. Errors may
happen for a variety of reasons including competing deletions
and busy datasets.
This patch switches recursive destroy to always do a best-effort
recursive dataset destroy.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#7574
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the efi parition and partition alignment) and therefore detect
expanded space.
Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes#7546
Issue #7582
It's possible for the `zpool attach` portion of this test case
to complete before the `zpool scrub` can be issued. Update the
test case to force the resilvering phase to take longer.
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5444Closes#7541
Must clear slow-disk zinject injections in test cleanup routine.
Otherwise, when this test fails, it causes most subsequent tests
to fail.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9190
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/762c6b4Closes#7530
Incorrect shebangs were used when porting.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes#7523Closes#7524
Don't create an ext4 file system atop $DEV_DISKDIR/$DISK2.
There's likely to not be sufficient space for it to succeed.
Instead, simply create the vdev files in the directory where it
would have been mounted.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7459
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123Closes#7459
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
* Added tuning to man page.
* Test case changes dropped, default behavior unchanged.
OpenZFS-issue: https://www.illumos.org/issues/9256
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/32356b3c56Closes#7470
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.
Fix this by specifying the correct pool name when setting the dataset
properties.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7502Closes#7509
Commit 7fab6361 inadvertently disabled the MMP test cases by creating
and not removing an /etc/hostid file in the new zpool_split_props test
case. When the file exists the ZTS skips the entire MMP test group
rather than modify what may be a system which is already configured.
Update the test case to remove the file.
Additionally, because the MMP tests were disabled a regression slipped
in as part of commit 9eb7b46ed0. Fix it.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7514
When using mawk instead of gawk zfs_diff_timestamp fails consistently:
this is due to a subtle difference in how mawk handles substr().
From awk(1):
---
Finally, here is how mawk handles exceptional cases not discussed in
the AWK book or the Posix draft. It is unsafe to assume consistency
across awks and safe to skip to the next section.
substr(s, i, n) returns the characters of s in the intersection of
the closed interval [1, length(s)] and the half-open interval [i, i+n).
When this intersection is empty, the empty string is returned; so
substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) = "A".
---
To support running zfs_diff_timestamp with both gawk and mawk change
the second parameter passed to substr().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7503Closes#7510
Two deadlocks / ASSERT failures were introduced in a2c2ed1b which
would occur whenever arc_buf_fill() failed to decrypt a block of
data. This occurred because the call to arc_buf_destroy() which
was responsible for cleaning up the newly created buffer would
attempt to take out the hdr lock that it was already holding. This
was resolved by calling the underlying functions directly without
retaking the lock.
In addition, the dmu_diff() code did not properly ensure that keys
were loaded and mapped before begining dataset traversal. It turns
out that this code does not need to look at any encrypted values,
so the code was altered to perform raw IO only.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7354Closes#7456
When receiving an incremental send stream with intermediary snapshots
zfs_receive_one() does not correctly identify the top-level dataset:
consequently we restore said snapshots as if they were children
datasets in the hierarchy, forcing inheritance of any property received
with 'zfs send -o' and effectively removing any locally set value.
The test case did not correctly verify this situation because it uses
adjacent snapshots, basically testing 'zfs send -i' instead of
'zfs send -I': this commit adds an additional intermediary snapshot to
the test script.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7478
Only filesystems and volumes are valid 'zfs remap' parameters: when
passed a snapshot name zfs_remap_indirects() does not handle the
EINVAL returned from libzfs_core, which results in failing an assertion
and consequently crashing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7454
zpool_create_002_pos did not clean up filedisk files left over from
running the test.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes#7435Closes#7439
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1ebCloses#6900
This test performs a rapid offline/online cycle of each of several
mirror vdevs. It can run so quickly that there isn't sufficient pool
redundancy to perform an offline. The solution is to wait until the
pool is resilvered following the online operation.
Also, add a pool sync before the offline operation to help reduce
spurious errors.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #6900
Currently mounting an already mounted zfs dataset results in an
error, whereas it is typically allowed with other filesystems.
This causes some bad interactions with mount namespaces. Take
this sequence for example:
- Create a dataset
- Create a snapshot of the dataset
- Create a clone of the snapshot
- Create a new mount namespace
- Rename the original dataset
The rename results in unmounting and remounting the clone in the
original mount namespace, however the remount fails because the
dataset is still mounted in the new mount namespace. (Note that
this means the mount in the new mount namespace is never being
unmounted, so perhaps the unmount/remount of the clone isn't
actually necessary.)
The problem here is a result of the way mounting is implemented
in the kernel module. Since it is not mounting block devices it
uses mount_nodev() instead of the usual mount_bdev(). However,
mount_nodev() is written for filesystems for which each mount is
a new instance (i.e. a new super block), and zfs should be able
to detect when a mount request can be satisfied using an existing
super block.
Change zpl_mount() to call sget() directly with it's own test
callback. Passing the objset_t object as the fs data allows
checking if a superblock already exists for the dataset, and in
that case we just need to return a new reference for the sb's
root dentry.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Closes#5796Closes#7207
zfs_receive_raw_incremental did not clean up ibackup_trunc.* files
left over from running the test.
Also changing the path of the ibackup files so they can be placed
in the correct directories when /var/tmp is not the temporary
directory.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes#7430
This change adds five new tests to the ZTS:
* zpool_split_cliargs: verify command line options and arguments
* zpool_split_devices: verify zpool split accepts a device list
* zpool_split_encryption: verify zpool can split encrypted pools
* zpool_split_props: verify zpool split can set property values
* zpool_split_vdevs: verify vdev layout when splitting the pool
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7409
Commit e4010f2 accidentally allows zpool to set pool features to
"disabled"; this should only be allowed at pool creation. This commit
adds additional checks and test coverage to 'zpool set'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7402
Most kshlib files are imported by other scripts
and do not have a shebang at the top of their files.
Make all kshlib follow this convention.
Remove shebangs from cfg files as well.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Close#7406
Fedora 28's RPM build checks warn when executable files don't have a
shebang line. These warnings are caused when we (incorrectly)
include data & config files in the_SCRIPTS automake lines. Files in
_SCRIPTS are marked executable by automake. This patch fixes the
issue by including non-executable scripts in a _DATA line instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7359Closes#7395
Currently, "zfs mount -a" will print a warning and fail to mount
any encrypted datasets that do not have a key loaded. This patch
makes the behavior of this failure consistent with other failure
modes ("zfs mount -a" will silently continue, explict "zfs mount"
will print a message and return an error code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7382
Fix a bunch of (mostly) sprintf/snprintf truncation compiler
warnings that show up on Fedora 28 (GCC 8.0.1).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7361Closes#7368
Use 'zpool reopen' instead of 'zpool scrub' to kick in the spare device:
this is required to avoid spurious failures caused by a race condition
in events processing by the ZFS Event Daemon:
P1 (zpool scrub) P2 (zed)
---
zfs_ioc_pool_scan()
-> dsl_scan()
-> vdev_reopen()
-> vdev_set_state(VDEV_STATE_CANT_OPEN)
zfs_ioc_vdev_attach()
-> spa_vdev_attach()
-> dsl_resilver_restart()
-> dsl_sync_task()
-> dsl_scan_setup_check()
<- dsl_scan_setup_check(): EBUSY
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7247Closes#7342
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
to successfully run in under 5 minutes.
OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06aCloses#6112Closes#7336
The changes piggyback JSON output support on top of channel programs
(#6558). This way the JSON output support is targeted to scripting
use cases and is easily maintainable since it really only touches
one function (zfs_do_channel_program()).
This patch ports Joyent's JSON nvlist library from illumos to enable
easy JSON printing of channel program output nvlist. To keep the
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting
the program.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#7281
Due to zpool create auto-partioning in Linux (i.e. sdb1),
certain utilities need to use the parition (sdb1) while
others use the whole disk name (sdb).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#6939Closes#7261
Datasets can be busy when calling zfs destroy. Introduce
a helper function to destroy datasets and use it to destroy
datasets in zfs_allow_004_pos, zfs_promote_008_pos, and
zfs_destroy_002_pos.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#7224Closes#7246Closes#7249Closes#7267
When invoked with wrong parameters 'zfs bookmark' fails to gracefully
validate user input and crashes.
This is a regression accidentally introduced in 587e228; this commit
adds additional tests to the ZFS Test Suite to exercise this codepath.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: KireinaHoro <i@jsteward.moe>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7228Closes#7229
This change implements 'zfs send -b' which can be used to send only
received property values whether or not they are overridden by local
settings.
This can be very useful during "restore" operations from a backup pool
because it allows to send only the property values originally sent
from the backup source, even though they were later modified on the
destination either by a 'zfs set' operation, explicit 'zfs inherit' or
overridden during the receive process via 'zfs receive -o|-x'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7156
Currently, raw zfs sends transfer the encrypted master keys and
objset_phys_t encryption parameters in the DRR_BEGIN payload of
each send file. Both of these are processed as soon as they are
read in dmu_recv_stream(), meaning that the new keys are set
before the new snapshot is received. In addition to the fact that
this changes the user's keys for the dataset earlier than they
might expect, the keys were never reset to what they originally
were in the event that the receive failed. This patch splits the
processing into objset handling and key handling, the later of
which is moved to dmu_recv_end() so that they key change can be
done atomically.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7200
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.
Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').
By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.
Support the following commands and functionalities:
zfs set projectquota@project
zfs set projectobjquota@project
zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project
zfs projectspace
zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused
zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused
chattr +/-P
chattr -p project_id
lsattr -p
This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>
For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes#6290
zdb -ed on objset for exported pool would failed with:
failed to own dataset 'qq/fs0': No such file or directory
The reason is that zdb pass objset name to spa_import, it uses that
name to create a spa. Later, when dmu_objset_own tries to lookup the spa
using real pool name, it can't find one.
We fix this by make sure we pass pool name rather than objset name to
spa_import.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7099Closes#6464
This change adds a make target 'testscheck' which verifies all ksh test
scripts, part of the ZFS Test Suite, have their executable bit set:
this should avoid adding new test scripts that cannot be executed by
the test-runner.py which would result in the following warning message:
Warning: Test removed from TestGroup because it failed verification.
This also verifies both *.cfg and *.kshlib files used in the ZFS Test
Suite are not executable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7131