The arc_adapt() function tunes LRU/MLU balance according to 4 types of
cache hits (which is passed as state agrument): ghost LRU, LRU, MRU,
ghost MRU. If this function is called with wrong cache hit (state),
adaptation will be sub-optimal and performance will suffer.
Some time ago upstream received this commit:
6950 ARC should cache compressed data) in arc_read() do next
sequence (access to ghost buffer)
Before this commit, hit to any ghost list was passed arc_adapt() before
call to arc_access() which revive element in cache and change state from
ghost to real hit.
After this commit, the order of calls was reverted and arc_adapt() is
now called only with «real» hits even if hit was in one of two ghost
lists, which renders ghost lists useless and breaks the ARC algorithm.
FreeBSD fixed this problem locally in Change D19094 / Commit r348772.
This change is an adaptation of the above commit to the current arc
code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10548Closes#10618
Kernel 5.10 removed check_disk_change() in favor of callers using
the faster bdev_check_media_change() instead, and explicitly forcing
bdev revalidation when they desire that behavior. To preserve prior
behavior, I have wrapped this into a zfs_check_media_change() macro
that calls an inline function for the new API that mimics the old
behavior when check_disk_change() doesn't exist, and just calls
check_disk_change() if it exists.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11085
In Linux 5.10 the linux/frame.h header was renamed linux/objtool.h.
Add a configure check to detect and use the correctly named header.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11085
Missing struct initialization in a config test results in the
interface being incorrectly detected.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Mathieu Velten <matmaul@gmail.com>
Closes#10713Closes#11049
The icp Makefile was referencing absolute paths to the zfs source tree for
include files. The result was that even though the headers get added to
the Linux kernel tree, building fails if you move/remove the zfs checkout.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
This creates a bit of a mess in the kernel tree, and is largely unnecessary
unless you're debugging copy-builtin.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
While it makes sense to make sure we don't control the generated
zfs_gitrev.h in the zfs source tree, once we copy the module sources into a
kernel source tree that's a problem. If you're keeping your zfs sources in
a kernel tree on a branch, you want to commit all the required files.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
This increases the Linux kernel version to 5.9 from 5.8
as most compatibility fixes should already be included.
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Bump max kernel version to 5.8 to match the supported releases
indicated in the release notes of 0.8.5
Signed-off-by: Nick Danyluk <ndanyluk7@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
The kernel seq_read() helper function expects ->next() to update
the passed position even there are no more entries. Failure to
do so results in the following warning being logged.
seq_file: buggy .next function procfs_list_seq_next [spl]
did not update position index
Functionally there is no issue with the way procfs_list_seq_next()
is implemented and the warning is harmless. However, we want to
silence this some what scary incorrect warning. This commit
updates the Linux procfs code to advance the position even for
the last entry.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10984Closes#10996
This check was accidentally broken when the kABI checks were updated
to run in parallel, commit 608f874. The check must be for the
config_debug_lock_alloc_license name to determine if the symbol
is license compatible.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10991
The m4 objtool configure check can incorrectly fail because of a
missing header in the test. This appears to be the result of a
recent kernel change and was observed on the Fedora 5.8.11-200
kernel.
In file included from /home/fedora/zfs/build/objtool/objtool.c:75:
./arch/x86/include/asm/frame.h💯57: error: 'struct pt_regs'
declared inside parameter list will not be visible outside
of this definition or declaration [-Werror]
The consequence of this is that the "stack_frame_non_standard"
check is never run and HAVE_STACK_FRAME_NON_STANDARD is set
incorrectly which results in a build failure. This change adds
the appropriate header to the "objtool" check so it now behaves
as intended.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10990
With PREEMPTION=y and BLK_CGROUP=y preempt_schedule_notrace() is being
used on arm64 which is a GPL-only function and hence the build of the
DKMS kernel module fails.
Fix that by redefining preempt_schedule_notrace() to preempt_schedule()
which should be safe as long as tracing is not used.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Closes#8545Closes#9948Closes#10416Closes#10973
The following test cases may still occasionally fail and are being
added to the "maybe" list for Linux until they can be updated to be
entirely reliable.
cli_root/zfs_rename/zfs_rename_002_pos.ksh
cli_root/zpool_reopen/zpool_reopen_003_pos.ksh
refreserv/refreserv_raidz
These 6 tests consistently fail only on Fedora 31+, the failures
are related to the kernel rescanning the partition table on loopback
devices which is no longer reliable unless partprobe is used. In
order to enable the Fedora bot by default they are also being added
to the list until the tests can be updated. Any significant regression
in functionality covered by these tests will still be detected by the
FreeBSD builders.
alloc_class/alloc_class_009_pos
alloc_class/alloc_class_010_pos
cli_root/zpool_expand/zpool_expand_001_pos
cli_root/zpool_expand/zpool_expand_005_pos
rsend/rsend_007_pos
rsend/rsend_010_pos
rsend/rsend_011_pos
snapshot/rollback_003_pos
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfs-load-key-DATASET.service was gaining an
After=systemd-journald.socket due to its stdout/stderr going to the
journal (which is the default). systemd-journald.socket has an After
(via RequiresMountsFor=/run/systemd/journal) on -.mount. If the root
filesystem is encrypted, -.mount gets an After
zfs-load-key-DATASET.service.
By setting stdout and stderr to null on the key load services, we avoid
this loop.
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: InsanePrawn <insane.prawny@gmail.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#10356Closes#10388
When generating units with zfs-mount-generator, if the pool is already
imported, zfs-import.target is not needed. This avoids a dependency
loop on root-on-ZFS systems:
systemd-random-seed.service After (via RequiresMountsFor)
var-lib.mount After
zfs-import.target After
zfs-import-{cache,scan}.service After
cryptsetup.service After
systemd-random-seed.service
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: InsanePrawn <insane.prawny@gmail.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#10388
This is a minor change to the systemd service templates that verifies
the zfs kernel module is loaded by the kernel prior to attempting to
import any zpool.
The services check for the presence of /sys/module/zfs which indicates
the zfs is module is loaded. This uses the systemd built-in check
ConditionPathIsDirectory.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Jonathon Fernyhough <jonathon.fernyhough@york.ac.uk>
Closes#10663
This is a minor change to the systemd service templates that verifies the zfs
kernel module is loaded by the kernel prior to attempting to import any zpool.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jonathon Fernyhough <jonathon.fernyhough@york.ac.uk>
Closes#10627
Linux already defines setjmp/longjmp for powerpc, which leads to
duplicate symbols in a statically linked build.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sterlng Jensen <sterlingjensen@users.noreply.github.com>
Closes#10795
The unit was failing instead of stopping if someone manually unloaded
the key before stopping the unit (zfs unload-key is failing on an
unavailable key).
Follow a similar logic than for loading the key, checking for the key
status before unloading it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Co-authored-by: Didier Roche <didrocks@ubuntu.com>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#10477
We need a stronger dependency between the mount unit and its keyload unit
when we know that the dataset is encrypted.
If the keyload unit fails, Wants= will still try to mount the dataset,
which will then fail.
It’s better to show that the failure is due to a dependency failing, the
keyload unit, by tighting up the dependency. We can do this as we know
that we generate both units in the generator and so, it’s not an
optional dependency.
BindsTo enable as well that if the keyload unit fails at any point, the
associated mountpoint will be then unmounted.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Didier Roche <didrocks@ubuntu.com>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#10477
Drop Before=zfs.mount dependency explicity on generated key-load .service
unit.
Indeed, the associated mount unit is After=<dataset-key-load>.service.
This is thus the mount point which controls at what point it wants to be
mounted (Before=zfs-mount.service in stock generator), but this can be
an automount point, or triggered by another service.
This additional dependency from the key load service is not needed thus.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Didier Roche <didrocks@ubuntu.com>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#10477
When a resilver finishes, vdev_dtl_reassess is called to hopefully
excise DTL_MISSING (amongst other things). If there are errors during
the resilver, they are tracked in DTL_SCRUB, as spelled out in the
block comment in vdev.c. DTL_SCRUB is in-core only, so it can only
be used if the pool was online for the whole resilver. This state is
tracked with the spa_scrub_started flag, which only gets set when
the scan is initialized. Unfortunately, this flag gets cleared right
before vdev_dtl_reassess gets called, so if there are any errors
during the scan, DTL_MISSING will never get excised and the resilver
will just continually restart. This fix simply moves clearing that
flag until after the call to vdev_dtl_reasses.
In addition, if a pool is imported and already has scn_errors > 0,
this change will restart the resilver immediately instead of doing
the rest of the scan and then restarting it from the beginning. On
the other hand, if scn_errors == 0 at import, then no errors have
been encountered so far, so the spa_scrub_started flag can be safely
set.
A test has been added to verify that resilver does not restart when
relevant DTL's are available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes#10291
Due to commit d48091d a removed device is now explicitly offlined by
the ZED if no spare is available, rather than the letting ZFS detect
it as UNAVAIL. This broke auto-replacing of whole-disk devices, as
described in issue #10577. In short, when a new device is reinserted
in the same slot, the ZED will try to ONLINE it without letting ZFS
recreate the necessary partition table.
This change simply avoids setting the device OFFLINE when removed if
no spare is available (or if spare_on_remove is false). This change
has been left minimal to allow it to be backported to 0.8.x release.
The auto_offline_001_pos ZTS test has been updated accordingly.
Some follow up work is planned to update the ZED so it transitions
the vdev to a REMOVED state. This is a state which has always
existed but there is no current interface the ZED can use to
accomplish this. Therefore it's being left to a follow up PR.
Reviewed-by: Gionatan Danti <g.danti@assyoma.it>
Co-authored-by: Gionatan Danti <g.danti@assyoma.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10577Closes#10730
The zts-auto_offline_001_pos test could exceed the 10 minute test
limit and be KILLED by the test infrastructure. To prevent this
speed up the test case by:
* Removing redundant pool configurations. Each of the following
vdev types is tested once: mirror, raidz, cache, and special.
* The block_device_wait function need only wait on the block
device which has been removed as part of the test.
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9827
The cache of struct svc_export and struct svc_expkey by nfsd and
rpc.mountd for the snapshot holds references to the mount point.
We need to flush them out before unmounting, otherwise umount
would fail with EBUSY.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#6000Closes#10783
Export the dmu_offset_next() symbol for use by Lustre.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10796
The `zfs program` subcommand invokes a LUA interpreter to run ZFS
"channel programs". This interpreter runs in a constrained environment,
with defined memory limits. The LUA stack (used for LUA functions that
call each other) is allocated in the kernel's heap, and is limited by
the `-m MEMORY-LIMIT` flag and the `zfs_lua_max_memlimit` module
parameter. The C stack is used by certain LUA features that are
implemented in C. The C stack is limited by `LUAI_MAXCCALLS=20`, which
limits call depth.
Some LUA C calls use more stack space than others, and `gsub()` uses an
unusually large amount. With a programming trick, it can be invoked
recursively using the C stack (rather than the LUA stack). This
overflows the 16KB Linux kernel stack after about 11 iterations, less
than the limit of 20.
One solution would be to decrease `LUAI_MAXCCALLS`. This could be made
to work, but it has a few drawbacks:
1. The existing test suite does not pass with `LUAI_MAXCCALLS=10`.
2. There may be other LUA functions that use a lot of stack space, and
the stack space may change depending on compiler version and options.
This commit addresses the problem by adding a new limit on the amount of
free space (in bytes) remaining on the C stack while running the LUA
interpreter: `LUAI_MINCSTACK=4096`. If there is less than this amount
of stack space remaining, a LUA runtime error is generated.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10611Closes#10613
Due to hotplug support or BIOS bugs sometimes max_ncpus can be
an absurdly high value. I have a system with 32 cores/threads
but reports max_ncpus == 440. This many threads potentially
cripples the system during arc_prune floods for example.
boot_ncpus is the number of working CPUs when called so use
that instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#10282
A great deal of time may go by between when mmp_init() is called and
the MMP thread starts, particularly if there are bad devices, because
there is I/O checking configs etc. If this time is too long,
(gethrtime() - mmp_last_write) > mmp_fail_ns
at the time the MMP thread starts. If MMP is configured to suspend
the pool, the pool will be suspended immediately.
This can be seen in issue #10838
The value of mmp_last_write doesn't matter before the mmp thread
starts. To give the MMP thread time to issue and land MMP writes,
initialize mmp_last_write when the MMP thread starts.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Closes#10873
Increase the size of DDT_NAMELEN and MNT_LINE_MAX to appease GCC
snprintf truncation warnings.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris McDonough <chrism@plope.com>
Closes#10712Closes#10766
spl-generic.c defines some of the libgcc integer library functions on
32-bit. Don't bother checking -Wmissing-prototypes since nothing should
directly call these functions from C code.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10470
gcc10.1 complains with:
../../include/sys/dmu.h:373:24: error: ‘%s’ directive output may be
truncated writing up to 95 bytes into a region of size 75
[-Werror=format-truncation=]
373 | #define DMU_POOL_DDT "DDT-%s-%s-%s"
| ^~~~~~~~~~~~~~
../../module/zfs/ddt.c:256:37: note: in expansion of macro
‘DMU_POOL_DDT’
256 | (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
| ^~~~~~~~~~~~
../../include/sys/dmu.h:373:32: note: format string is defined here
373 | #define DMU_POOL_DDT "DDT-%s-%s-%s"
| ^~
../../module/zfs/ddt.c:256:9: note: ‘snprintf’ output 7 or more bytes
(assuming 102) into a destination of size 80
256 | (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
257 | zio_checksum_table[ddt->ddt_checksum].ci_name,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
258 | ddt_ops[type]->ddt_op_name, ddt_class_name[class]);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Increasing DTT_NAMELEN fixes it.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10433
Commit dcdc12e added compatibility code to treat NR_SLAB_RECLAIMABLE_B
as if it were the same as NR_SLAB_RECLAIMABLE. However, the new value
is in bytes while the old value was in pages which means they are not
interchangeable.
The only place the reclaimable slab size is used is as a component of
the calculation done by arc_free_memory(). This function returns the
amount of memory the ARC considers to be free or reclaimable at little
cost. Rather than switch to a new interface to get this value it has
been removed it from the calculation. It is normally a minor component
compared to the number of inactive or free pages, and removing it
aligns the behavior with the FreeBSD version of arc_free_memory().
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Backported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Closes#10834
struct task_struct is needed for lockdep_off() in mutex.h
This has popped up after e616cb8daadf (in linux-5.7-rc7).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#10741
(cherry picked from commit 772c69d230)
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
The make_request_fn and associated API was replaced recently in a
Linux 5.9 merge, to replace its functionality with a new submit_bio
member in struct block_device_operations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#10696
(cherry picked from commit d817c17100)
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
This change appears to primarily be a name change for the enum. Had
to update the test logic so that it works so long as either one of
these is present (favoring the newer one). Additionally, as this is
newer, it only shows up in node_page_item, so this commit doesn't
test zone_page_item for the same enum.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#10696
(cherry picked from commit dcdc12e8ba)
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
Many of the block device operations (often functions with bdev in
the name) were moved into linux/blkdev.h from linux/fs.h. Seems
that this header is already included where needed in the code, but
in the autoconf tests it was missing causing false negatives. This
commit has those tests include linux/fs.h (old location) and now
also linux/blkdev.h (new locations).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#10696
(cherry picked from commit 1823c8fe6a)
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
libzfs is an architecture-specific library, thus its pkgconfig files
should also be installed into the architecture-specific path for
pkgconfig files so that systems that support multilib or multiarch
installation will be able to work properly.
Signed-off-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
The `pgprot` argument has been removed from `__vmalloc` in Linux 5.8,
being `PAGE_KERNEL` always now [1].
Detect this during configure and define a wrapper for older kernels.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/mm/vmalloc.c?h=next-20200605&id=88dca4ca5a93d2c09e5bbc6a62fbfc3af83c4fca
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#10422
(cherry picked from commit 080102a1b6)
- apply to 0.8.4 before certain files were moved around
- config/kernel-kmem.m4 exists in git but not release tarballs because
it is unused; introduce it in a new file to prevent conflicts
- linux/mm.h is included in git master via sys/kmem.h; do not remove it
here or the build will error due to undefined is_vmalloc_addr()
Original-patch-by: Michael Niewöhner <c0d3z3r0@users.noreply.github.com>
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used. If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().
The biggest change is adding a size parameter to zfs_id_to_fuidstr().
The various *_impl_get() functions are only used on linux and have
not yet been updated.
As we do not expect the destination of these strncpy calls to be NULL
terminated, substitute them with memcpy.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10346
Due to a mismatch between the text and a regex looking for that text,
the `%preuninstall` script would never run the `dkms remove` command
necessary to avoid corrupting the DKMS data configuration. Increase
regex specificity to avoid this issue.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Lindee <chris.lindee+github@gmail.com>
Closes: #9891Closes#10327
When a Thumb-2 kernel is being used, then longjmp must be implemented
using the Thumb-2 instruction set in module/lua/setjmp/setjmp_arm.S.
Original-patch-by: @jsrlabs
Reviewed-by: @awehrfritz
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7408Closes#9957Closes#9967
Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified
the blk_alloc_queue() interface by updating it to take the request
queue as an argument. Add a wrapper function which accepts the new
arguments and internally uses the available interfaces.
Other minor changes include increasing the Linux-Maximum to 5.6 now
that 5.6 has been released. It was not bumped to 5.7 because this
release has not yet been finalized and is still subject to change.
Added local 'struct zvol_state_os *zso' variable to zvol_alloc.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10181Closes#10187
A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
This change is a follow up to 2cc479d0.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9534
When zfs is built in-tree using --enable-linux-builtin, the compile
commands are executed from the kernel build directory. If the build
directory is different from the kernel source directory, passing
-Ifs/zfs/icp will not find the headers as they are not present in the
build directory.
Fix this by adding @abs_top_srcdir@ to pull the headers from the zfs
source tree instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10021
There are a couple of x86_64 architectures which support all needed
features to make the accelerated GCM implementation work but the
MOVBE instruction. Those are mainly Intel Sandy- and Ivy-Bridge
and AMD Bulldozer, Piledriver, and Steamroller.
By using MOVBE only if available and replacing it with a MOV
followed by a BSWAP if not, those architectures now benefit from
the new GCM routines and performance is considerably better
compared to the original implementation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Followup #9749Closes#10029
Currently SIMD accelerated AES-GCM performance is limited by two
factors:
a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.
b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.
To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.
Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#9749
In zfs_write(), the loop continues to the next iteration without
accounting for partial copies occurring in uiomove_iov when
copy_from_user/__copy_from_user_inatomic return a non-zero status.
This results in "zfs: accessing past end of object..." in the
kernel log, and the write failing.
Account for partial copies and update uio struct before returning
EFAULT, leave a comment explaining the reason why this is done.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: ilbsmart <wgqimut@gmail.com>
Signed-off-by: Fabio Scaccabarozzi <fsvm88@gmail.com>
Closes#8673Closes#10148
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:
__schedule at ffffffff81610f2e
schedule at ffffffff81611558
schedule_preempt_disabled at ffffffff8161184a
__mutex_lock at ffffffff816131e8
arc_buf_destroy at ffffffffa0bf37d7 [zfs]
dbuf_destroy at ffffffffa0bfa6fe [zfs]
dbuf_evict_one at ffffffffa0bfaa96 [zfs]
dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
osd_object_delete at ffffffffa0b64ecc [osd_zfs]
lu_object_free at ffffffffa06d6a74 [obdclass]
lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
shrink_slab at ffffffff811ca9d8
shrink_node at ffffffff811cfd94
do_try_to_free_pages at ffffffff811cfe63
try_to_free_pages at ffffffff811d01c4
__alloc_pages_slowpath at ffffffff811be7f2
__alloc_pages_nodemask at ffffffff811bf3ed
new_slab at ffffffff81226304
___slab_alloc at ffffffff812272ab
__slab_alloc at ffffffff8122740c
kmem_cache_alloc at ffffffff81227578
spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
arc_read at ffffffffa0bf0924 [zfs]
dbuf_read at ffffffffa0bf9083 [zfs]
dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes#9987
Attempt to run scrub or resilver on a new pool containing only special
allocations (special vdev added on creation) caused infinite loop
because of dsl_scan_should_clear() limiting memory usage to 5% of pool
size, which it calculated accounting only normal allocation class.
Addition of special and just in case dedup classes fixes the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#10106Closes#8694
The crypto_cipher_init_prov and crypto_cipher_init are declared static
and should not be exported by the ICP. This resolves the following
warnings observed when building with the 5.4 kernel.
WARNING: "crypto_cipher_init" [.../icp] is a static EXPORT_SYMBOL
WARNING: "crypto_cipher_init_prov" [.../icp] is a static EXPORT_SYMBOL
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9791
The proc_ops structure was introduced to replace the use of of the
file_operations structure when registering proc handlers. This
change creates a new kstat_proc_op_t typedef for compatibility
which can be used to pass around the correct structure.
This change additionally adds the 'const' keyword to all of the
existing proc operations structures.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9961
The timestamp_truncate() function was added, it replaces the existing
timespec64_trunc() function. This change renames our wrapper function
to be consistent with the upstream name and updates the compatibility
code for older kernels accordingly.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9956Closes#9961
The getrawmonotonic() and getrawmonotonic64() interfaces have been
fully retired. Update gethrtime() to use the replacement interface
ktime_get_raw_ts64() which was introduced in the 4.18 kernel.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10052Closes#10064
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired. Callers are now required to use the time64_t type.
Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code. They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.
Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.
Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10052Closes#10064
-fno-common is the new default in GCC 10, replacing -fcommon in
GCC <= 9, so static data must only be allocated once.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes#9943
Issue #10090 reported that snapshots created between midnight and 1 AM
are missing a padded zero in the creation property
This change fixes the bug reported in issue #10090 where snapshots
created between midnight and 1 AM were missing a padded zero in the
creation timestamp output.
The leading zero was missing because the time format string used `%k`
which formats the hour as a decimal number from 0 to 23 where single
digits are preceded by blanks[0] and is fixed by changing it to `%H`
which formats the hour as 00-23.
The difference in output is as below
```
-Thu Mar 26 0:39 2020
+Thu Mar 26 00:39 2020
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alex John <alex@stty.io>
Closes#10090Closes#10153
Dedup send can only deduplicate over the set of blocks in the send
command being invoked, and it does not take advantage of the dedup table
to do so. This is a very common misconception among not only users, but
developers, and makes the feature seem more useful than it is. As a
result, many users are using the feature but not getting any benefit
from it.
Dedup send requires a nontrivial expenditure of memory and CPU to
operate, especially if the dataset(s) being sent is (are) not already
using a dedup-strength checksum.
Dedup send adds developer burden. It expands the test matrix when
developing new features, causing bugs in released code, and delaying
development efforts by forcing more testing to be done.
As a result, we are deprecating the use of `zfs send -D` and receiving
of such streams. This change adds a warning to the man page, and also
prints the warning whenever dedup send or receive are used.
In a future release, we plan to:
1. remove the kernel code for generating deduplicated streams
2. make `zfs send -D` generate regular, non-deduplicated streams
3. remove the kernel code for receiving deduplicated streams
4. make `zfs receive` of deduplicated streams process them in userland
to "re-duplicate" them, so that they can still be received.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7887Closes#10117
This fixes a bug where the generated zfs-functions was being included
along with original zfs-functions.in in the make dist tarball. This
caused an unfortunate series of events during build/packaging that
resulted in the RPM-installed /etc/zfs/zfs-functions listing the
paths as:
ZFS="/usr/local/sbin/zfs"
ZED="/usr/local/sbin/zed"
ZPOOL="/usr/local/sbin/zpool"
When they should have been:
ZFS="/sbin/zfs"
ZED="/sbin/zed"
ZPOOL="/sbin/zpool"
This affects init.d (non-systemd) distros like CentOS 6.
/etc/default/zfs and /etc/zfs/zfs-functions are also used by the
initramfs, so they need to be built even when init.d support is not.
They have been moved to the (new) etc/default and (existing) etc/zfs
source directories, respectively.
Fixes: #9443
Co-authored-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
The double-colon looked like a typo, but it's actually an obscure
feature. Rules with :: may appear multiple times and are run
independently of one another in the order they appear. The use of ::
for distclean-local was conventional, not accidental.
Add comments to indicate the intentional use of double-colon rules.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9210
Previously the generated keyload units for encryption roots with
keylocation=file://* didn't contain the code to detect if the key
was already loaded and would be marked failed in such situations.
Move the code to check whether the key is already loaded
from keylocation=prompt handling to general key loading code.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#10103
This commit refactors the systemd mount generators and makes the
following major changes:
- The generator now generates units for datasets marked canmount=noauto,
too. These units are NOT WantedBy local-fs.target.
If there are multiple noauto datasets for a path, no noauto unit will
be created. Datasets with canmount=on are prioritized.
- Introduces handling of new user properties which are now included in
the zfs-list.cache files:
- org.openzfs.systemd:requires:
List of units to require for this mount unit
- org.openzfs.systemd:requires-mounts-for:
List of mounts to require by this mount unit
- org.openzfs.systemd:before:
List of units to order after this mount unit
- org.openzfs.systemd:after:
List of units to order before this mount unit
- org.openzfs.systemd:wanted-by:
List of units to add a Wants dependency on this mount unit to
- org.openzfs.systemd:required-by:
List of units to add a Requires dependency on this mount unit to
- org.openzfs.systemd:nofail:
Toggles between a wants and a requires dependency.
- org.openzfs.systemd:ignore:
Do not generate a mount unit for this dataset.
Consult the updated man page for detailed documentation.
- Restructures and extends the zfs-mount-generator(8) man page with the
above properties, information on unit ordering and a license header.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9649
Silences a warning about an intentionally unquoted variable.
Fixes a warning caused by strings split across lines by slightly
refactoring keyloadcmd.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9649
When configuring as builtin (--enable-linux-builtin) for kernels
without loadable module support (CONFIG_MODULES=n) only the object
file is created. Never a loadable kmod.
Update ZFS_LINUX_TRY_COMPILE to handle this in a manor similar to
the ZFS_LINUX_TEST_COMPILE_ALL macro.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9887Closes#10063
Commit https://github.com/torvalds/linux/commit/9e8d42a0f accidentally
converted the static inline function blkg_tryget() to GPL-only for
kernels built with CONFIG_PREEMPT_RCU=y and CONFIG_BLK_CGROUP=y.
Resolve the build issue by providing our own equivalent functionality
when needed which uses rcu_read_lock_sched() internally as before.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9745Closes#10072
The correct name for the mount unit for / is "-.mount", not ".mount".
Reviewed-by: InsanePrawn <insane.prawny@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Antonio Russo <antonio.e.russo@gmail.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#9970
When growing the size of a (VMEM or KVMEM) kmem cache, spl_cache_grow()
always does taskq_dispatch(spl_cache_grow_work), and then waits for the
KMC_BIT_GROWING to be cleared by the taskq thread.
The taskq thread (spl_cache_grow_work()) does:
1. allocate new slab and add to list
2. wake_up_all(skc_waitq)
3. clear_bit(KMC_BIT_GROWING)
Therefore, the waiting thread can wake up before GROWING has been
cleared. It will see that the growing has not yet completed, and go
back to sleep until it hits the 100ms timeout.
This can have an extreme performance impact on workloads that alloc/free
more than fits in the (statically-sized) magazines. These workloads
allocate and free slabs with high frequency.
The problem can be observed with `funclatency spl_cache_grow`, which on
some workloads shows that 99.5% of the time it takes <64us to allocate
slabs, but we spend ~70% of our time in outliers, waiting for the 100ms
timeout.
The fix is to do `clear_bit(KMC_BIT_GROWING)` before
`wake_up_all(skc_waitq)`.
A future investigation should evaluate if we still actually need to
taskq_dispatch() at all, and if so on which kernel versions.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9989
If someone is using both multipathd and ZFS, they are probably using
them together. Ordering the zpool imports after multipathd is ready
fixes import issues for multipath configurations.
Tested-by: Mike Pastore <mike@oobak.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#9863
On some systems - openSUSE, for example - there is not yet a writeable
temporary file system available, so bash bails out with an error,
'cannot create temp file for here-document: Read-only file system',
on the here documents in zfs-mount-generator. The simple fix is to
change these into a multi-line echo statement.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Lorenz Hüdepohl <dev@stellardeath.org>
Closes#9802
The resilver restart test was reported as failing about 2% of the
time. Two issues were found:
- The event log wasn't large enough, so resilver events were missing
- One 'zpool sync' wasn't enough for resilver to start after zinject
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #9588Closes#9677Closes#9703
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840Closes#9155Closes#9378Closes#9551Closes#9588
When qat_compress() fails to allocate the required contiguous memory
it mistakenly returns success. This prevents the fallback software
compression from taking over and (un)compressing the block.
Resolve the issue by correctly setting the local 'status' variable
on all exit paths. Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9784Closes#9788
The cleanup_devices function should remove any partitions created
on the device and force the partition table to be reread. This
is needed to ensure that blkid has an up to date version of what
devices and partitions are used by zfs.
The cleanup_devices call was removed from inuse_008_pos.ksh since
it operated on partitions instead of devices and was not needed.
Lastly ddidecode may be called by parted and was therefore added
to the constrained path.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9806
A change[1] was merged yesterday that should refer
to the zfs binary in the initramfs, but is actually
an unset shell variable.
This commit changes this line to call `zfs` directly
like the surrounding code.
[1]: cb5b875b27
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Ben Cordero <bencord0@condi.me>
Closes#9780
The externally faulted vdev should be brought back online and have
its errors cleared before the pool is destroyed. Failure to do so
will leave a vdev with a valid active label. This vdev may then
not be used to create a new pool without the -f flag potentially
leading to subsequent test failures.
Additionally remove an unreachable log_pass from setup.ksh.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9777
Remove a few hardcoded instances of /var/tmp. This should use
the $TEST_BASE_DIR in order to allow the ZTS to be optionally
run in an alternate directory using `zfs-tests.sh -d <path>`.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9775
As of Python 3.5 the default behavior of json.tool was changed to
preserve the input order rather than lexical order. The test case
expects the output to be sorted so apply the --sort-keys option
to the json.tool command when using Python 3.5 and the option is
supported.
https://docs.python.org/3/library/json.html#module-json.tool
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9774
Update the devices_001_pos and devices_002_neg test cases such that the
special block device file created is backed by a ZFS volume. Specifying
a specific device allows the major and minor numbers to be easily
determined. Furthermore, this avoids the potentially dangerous behavior
of opening the first block device we happen to find under /dev/.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9773
- Skip invalid DVAs when importing pools in readonly mode
(in addition to when the config is untrusted).
- Upon encountering a DVA with a null VDEV, fail gracefully
instead of panicking with a NULL pointer dereference.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Steve Mokris <smokris@softpixel.com>
Closes#9022
If the encryption key is stored in a file, the initramfs should not
prompt for the password. For example, this could be the case if the boot
partition is stored on removable media that is only present at boot time
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Sam Lunt <samuel.j.lunt@gmail.com>
Closes#9764
Rather than defining a new instance of 'aok' in every compilation
unit which includes this header, there is a single instance
defined in zone.c, and the header now only declares an extern.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Nick Black <dankamongmen@gmail.com>
Closes#9752
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8602Closes#9751
* large_dnode_008_pos - Force a pool sync before invoking zdb to
ensure the updated dnode blocks have been persisted to disk.
* refreserv_raidz - Wait for the /dev/zvol links to be both created
and removed, this is important because the same device volume
names are being used repeatedly.
* btree_test - Add missing .gitignore file for btree_test binary.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9769
Increase the maximum supported kernel version to 5.4. This was
verified using the Fedora 5.4.2-300.fc31.x86_64 kernel.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9754Closes#9759
* devices_001_pos and devices_002_neg - Failing after FreeBSD ZTS
merged due to missing 'function' keyword for create_dev_file_linux.
* pool_state - Occasionally fails due to an insufficient delay
before checking 'zpool status'. Increasing the delay from 1 to 3
seconds resolved the issue in local testing.
* procfs_list_basic - Fails when run in-tree because the logged
command is actually 'lt-zfs'. Updated the regex accordingly.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9748
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa837 code review.
A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.
This is a backport of #9489 provided from:
https://github.com/zfsonlinux/zfs/pull/9776#issuecomment-569418370
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Move the 'nvh = (void *)buf' assignment after the 'buf == NULL'
check to resolve the warning. Interestingly, cppcheck 1.88
correctly determines that the existing code is safe, while
cppcheck 1.86 reports the warning.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
Suppress autoVariables warnings in the lua interpreter. The usage
here while unconventional in intentional and the same as upstream.
[module/lua/ldebug.c:327]: (error) Address of local auto-variable
assigned to a function parameter.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
As indicated by the VERIFY the local who_perm variable can never
be NULL in parse_fs_perm(). Due to the existence of the is_set
conditional, which is always true, cppcheck 1.88 was reporting
a possible NULL reference. Resolve the issue by removing the
extraneous is_set variable.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
The dnp argument can only be set to NULL when the DNODE_DRY_RUN flag
is set. In which case, an early return path will be executed and a
NULL pointer dereference at the given location is impossible. Add
an additional ASSERT to silence the cppcheck warning and document
that dbp must never be NULL at the point in the function.
[module/zfs/dnode.c:1566]: (warning) Possible null pointer deref: dnp
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
Resolve the reported memory leak by using a dedicated local vptr
variable to store the pointer reported by calloc(). Only assign the
passed **vtoc function argument on success, in all other cases vptr
is freed.
[lib/libefi/rdwr_efi.c:403]: (error) Memory leak: vtoc
[lib/libefi/rdwr_efi.c:422]: (error) Memory leak: vtoc
[lib/libefi/rdwr_efi.c:440]: (error) Memory leak: vtoc
[lib/libefi/rdwr_efi.c:454]: (error) Memory leak: vtoc
[lib/libefi/rdwr_efi.c:470]: (error) Memory leak: vtoc
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
As of cppcheck 1.82 surpress the warning regarding shifting too many
bits for __divdi3() implemention. The algorithm used here is correct.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
As of cppcheck 1.82 warnings are issued when using the list_for_each_*
functions with an uninitialized variable. Functionally, this is fine
but to resolve the warning initialize these variables.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
Resolve the following uninitialized variable warnings. In practice
these were unreachable due to the goto. Replacing the goto with a
return resolves the warning and yields more readable code.
[module/icp/algs/modes/ccm.c:892]: (error) Uninitialized variable: ccm_param
[module/icp/algs/modes/ccm.c:893]: (error) Uninitialized variable: ccm_param
[module/icp/algs/modes/gcm.c:564]: (error) Uninitialized variable: gcm_param
[module/icp/algs/modes/gcm.c:565]: (error) Uninitialized variable: gcm_param
[module/icp/algs/modes/gcm.c:599]: (error) Uninitialized variable: gmac_param
[module/icp/algs/modes/gcm.c:600]: (error) Uninitialized variable: gmac_param
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
The existing rules miss nvme disk devices because of the trailing
digits in the KERNEL device name, e.g. nvme0n1. Partitions of nvme
disk devices are already properly handled by the existing rule for
ENV{DEVTYPE}=="partition".
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Thomas Geppert <geppi@digitx.de>
Closes#9730
On systems that utilize TTY for password entry, if the kernel
option "quiet" is set, the system would appear to freeze on a
blank screen, when in fact it is waiting for password entry
from the user.
Since TTY is the fallback method, this has no effect on systemd
or plymouth password prompting.
By temporarily setting "printk" to "7", running the command,
then resuming with the original "printk" state, the user can
see the password prompt.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Garrett Fields <ghfields@gmail.com>
Closes#9731
From Steve Langasek <steve.langasek@canonical.com>:
> The poorly-named 'FRAMEBUFFER' option in initramfs-tools controls
> whether the console_setup and plymouth scripts are included and used
> in the initramfs. These are required for any initramfs which will be
> prompting for user input: console_setup because without it the user's
> configured keymap will not be set up, and plymouth because you are
> not guaranteed to have working video output in the initramfs without
> it (e.g. some nvidia+UEFI configurations with the default GRUB
> behavior).
> The zfs initramfs script may need to prompt the user for passphrases
> for encrypted zfs datasets, and we don't know definitively whether
> this is the case or not at the time the initramfs is constructed (and
> it's difficult to dynamically populate initramfs config variables
> anyway), therefore the zfs-initramfs package should just set
> FRAMEBUFFER=yes in a conf snippet the same way that the
> cryptsetup-initramfs package does
> (/usr/share/initramfs-tools/conf-hooks.d/cryptsetup).
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1856408
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Steve Langasek <steve.langasek@canonical.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#9723
Apply umask to `mode` which will eventually be applied to inode.
This is needed since VFS doesn't apply umask for O_TMPFILE files.
(Note that zpl_init_acl() applies `ip->i_mode &= ~current_umask();`
only when POSIX ACL is used.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8997Closes#8998
Currently, 'zfs list' and 'zfs get' commands can be slow when
working with snapshots that have a ds_props_obj. This is
because the code that discovers all of the properties for these
snapshots needs to read this object for each snapshot, which
almost always ends up causing an extra random synchronous read
for each snapshot. This performance penalty exists even if the
properties on that snapshot have been unset because the object
is normally only freed when the snapshot is freed, even though
it is only created when it is needed.
This patch allows the user to regain 'zfs list' performance on
these snapshots by destroying the ds_props_obj when it no longer
has any entries left. In practice on a production machine, this
optimization seems to make 'zfs list' about 55% faster.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9704
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed). Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().
Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.
Found by AddressSanitizer:
ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
#0 0x7fe88b0c166d (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
#1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
#2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
#3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
#4 0x55ffbc769fba in ztest_execute ztest.c:6714
#5 0x55ffbc779a90 in ztest_thread ztest.c:6761
#6 0x7fe889cbc6da in start_thread
#7 0x7fe8899e588e in __clone
0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
#0 0x7fe88b14e7b8 in __interceptor_free
#1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
#2 0x7fe88ae543ba in nvpair_free nvpair.c:844
#3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
#4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
#5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
#6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
#7 0x55ffbc769fba in ztest_execute ztest.c:6714
#8 0x55ffbc779a90 in ztest_thread ztest.c:6761
#9 0x7fe889cbc6da in start_thread
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9706
This interferes with zdb_read_block trying all the decompression
algorithms when the 'd' flag is specified, as some are
expected to fail. Also control the output when guessing
algorithms, try the more common compression types first, allow
specifying lsize/psize, and fix an uninitialized variable.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9612Closes#9630
This change allows us to align the code dump logic across platforms.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9691
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust. Additionally, on failure log
both the time waited and requested timeout to the internal log.
The default maximum allowed open retry time has been increased
from 500ms to 1000ms.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9680
Conflicts:
This sets send_realloc_files.ksh to use properties.shlib
(like the other compression related tests)
It was missing from #9645
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Issue #9645Closes#9679
arc_summary3 reports L2ARC hits and misses as Bytes, whereas they
should be reported as events. arc_summary2 reports these correctly.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#9669
The checksum display code of zdb_read_block uses a zio
to read in the block and then calls zio_checksum_compute.
Use a new zio in the call to zio_checksum_compute not the zio
from the read which has been destroyed by zio_wait.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9644Closes#9657
In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO
to read data from the original storage device. Unfortunately pointer
to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if
some other read try to bump the ZIO priority, it will crash.
The problem is reproducible by corrupting L2ARC content and reading
some data with prefetch if l2arc_noprefetch tunable is changed to 0.
With the default setting the issue is probably not reproducible now.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9648
There may be circumstances where it's desirable that all blocks
in a specified dataset be stored on the special device. Relax
the artificial 128K limit and allow the special_small_blocks
property to be set up to 1M. When blocks >1MB have been enabled
via the zfs_max_recordsize module option, this limit is increased
accordingly.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9131Closes#9355
Remove the specific gitignore rules for module left-overs and add a
generic one in modules/.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#9656
Previously the generator would skip a dataset if it wasn't mountable by
'zfs mount -a' (legacy/none mountpoint, canmount off/noauto). This also
skipped the generation of key-load units for such datasets, breaking
the dependency handling for mountable child datasets.
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9611
Systemd will ignore units that try to execute programs from non-absolute
paths. Use hardcoded /bin/sh instead.
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9611
The command line switch -A (ignore ASSERTs) has always been available
in zdb but was never connected up to the correct global variable.
There are times when you need zdb to ignore asserts and keep dumping
out whatever information it can get despite the ASSERT(s) failing.
It was always intended to be part of zdb but was incomplete.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9610
As described in commit f81d5ef6 the zfs_vdev_elevator module
option is being removed. Users who require this functionality
should update their systems to set the disk scheduler using a
udev rule.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664Closes#9417Closes#9609
The function zdb_read_block (zdb -R) was always intended to have a :c
flag which would read the DVA and length supplied by the user, and
display the checksum. Since we don't know which checksum goes with
the data, we should calculate and display them all.
For each checksum in the table, read in the data at the supplied
DVA:length, calculate the checksum, and display it. Update the man
page and create a zfs test for the new feature.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9607
The changes in commit 41e1aa2a / PR #9583 introduced a regression on
tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started
to fail with errno ENODATA:
openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3
<...>
fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA
The originally proposed change on PR #9583 is not susceptible to it,
so just move the code/if-checks around back in that way, to fix it.
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes#9602
The tst.terminate_by_signal test case may occasionally fail when
running in a less consistent virtual environment. For all observed
failures the process was terminated correctly but it took longer than
expected resulting in too many snapshot being created.
To minimize the likelyhood of this occuring increase the threshold
from 50 to 90 snapshots. The larger limit will still verifiy that
the channel program was correctly terminated early.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9601
Use `printf` to properly interpret unicode characters.
Illumos uses a utility called `zlook` to allow additional flags to be
provided to readdir and lookup for testing. This functionality could
be ported to Linux, but even without it several of the tests can be
enabled by instead using the standard `test` command.
Additional, work is required to enable the remaining test cases.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Issue #7633Closes#8812
df58307 removed the need to specify -d 1 when zfs list and zfs get are
called with -t snapshot on a datset. This commit extends the same
behaviour to -t bookmark.
This commit also introduces the 'snap' shorthand for snapshots from
zfs list to zfs get.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9589
If zp->z_unlinked is set, we're working with a znode that has been
marked for deletion. If that's the case, we can skip the "goto again"
loop and return ENOENT, as the znode should not be discovered.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com>
Closes#9583
Removes an incorrect error message from libzfs that suggests applying
'-r' when a zfs subcommand is called with a filesystem path while
expecting either a snapshot or bookmark path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#9574
zed.service does not exist
replaced with correct service name in man.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Closes#9581
blkg_tryget() as shipped in EL8 kernels does not seem to handle NULL
@blkg as input; this is different from its mainline counterpart where
NULL is accepted. To prevent dereferencing a NULL pointer when dealing
with block devices which do not set a root_blkg on the request queue
perform the NULL check in vdev_bio_associate_blkg().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9546Closes#9577
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#9034
When `zpool create -o <property>` is run without root permissions
and the pool property requested is not specifically enumerated in
zpool_valid_proplist(). Then an incorrect error message referring
to an invalid property is printed rather than the expected permission
denied error.
Specifying a pool property at create time should be handled the same
way as filesystem properties in zfs_valid_proplist(). There should
not be default zfs_error_aux() set for properties which are not
listed.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9550Closes#9568
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it
became ~127KB+/1KB+ in two 128KB log blocks to free space in the second
block for another record. Unfortunately in case of 128KB only writes,
when space in the second block remained unused, that change increased
write latency by unbalancing checksum computation and write times
between parallel threads. It also didn't help with SLOG space
efficiency in that case.
This change introduces new 68KB log block size, used for both writes
below 67KB and 128KB-sharp writes. Writes of 68-127KB are still using
one 128KB block to not increase processing overhead. Writes above
131KB are still using full 128KB blocks, since possible saving there
is small. Mixed loads will likely also fall back to previous 128KB,
since code uses maximum of the last 16 requested block sizes.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#9409
Don't ask for the password / try to load the key if the key for the
encryptionroot is already loaded. The user might have loaded the key
manually or by other means before the scripts get called.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com>
Closes#9495Closes#9529
Some systemd users may want to change configurations in
/etc/defaults/zfs, but these settings won't affect systemd
services.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mo Zhou <cdluminate@gmail.com>
Closes#9544
Address two prototype related warnings emitted by clang.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9535
Removes the 'ZFS=' prefix from $BOOTFS instead of $root. This makes sure
that the 'zfs:' prefix remains stripped so that users with
'root=zfs:dataset' cmdline can have key loaded on boot again.
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Dacian Reece-Stremtan <dacianstremtan@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Hiếu Lê <leorize+oss@disroot.org>
Closes#9520
Remove the stray leading + from the Makefile. This was
preventing the autosnap.lua channel program from being
properly included by `make dist`.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9527
Currently, when you call 'zfs change-key' on an encrypted dataset
that has an unencrypted child, the code will trigger a VERIFY.
This VERIFY is leftover from before we allowed unencrypted
datasets to exist underneath encrypted ones. This patch fixes the
issue by simply replacing the VERIFY with an early return when
recursing through datasets.
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9524
In original implementation, zpool history will read the whole history
before printing anything, causing memory usage goes unbounded. We fix
this by breaking it into read-print iterations.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#9516
Currently, incremental recursive encrypted receives fail to work
for any snapshot after the first. The reason for this is because
the check in zfs_setup_cmdline_props() did not properly realize
that when the user attempts to use '-x encryption' in this
situation, they are not really overriding the existing encryption
property and instead are attempting to prevent it from changing.
This resulted in an error message stating: "encryption property
'encryption' cannot be set or excluded for raw or incremental
streams".
This problem is fixed by updating the logic to expect this use
case.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9494
* Use .ksh extension for ksh scripts, not .sh
* Remove .ksh extension from tests in common.run
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9502
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9486
With 4k disks, this test will fail in the last section because the
expected human readable value of 20.0M is reported as 20.1M. Rather than
use the human readable property, switch to the parsable property and
verify that the values are reasonably close.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#9477
Giving a name to this enum makes it discoverable from
debugging tools like DRGN and SDB. For example, with
the name proposed on this patch we can iterate over
these values in DRGN:
```
>>> prog.type('enum kmc_bit').enumerators
(('KMC_BIT_NOTOUCH', 0), ('KMC_BIT_NODEBUG', 1),
('KMC_BIT_NOMAGAZINE', 2), ('KMC_BIT_NOHASH', 3),
('KMC_BIT_QCACHE', 4), ('KMC_BIT_KMEM', 5),
('KMC_BIT_VMEM', 6), ('KMC_BIT_SLAB', 7),
...
```
This enables SDB to easily pretty-print the flags of
the spl_kmem_caches in the system like this:
```
> spl_kmem_caches -o "name,flags,total_memory"
name flags total_memory
------------------------ ----------------------- ------------
abd_t KMC_NOMAGAZINE|KMC_SLAB 4.5MB
arc_buf_hdr_t_full KMC_NOMAGAZINE|KMC_SLAB 12.3MB
... <cropped> ...
ddt_cache KMC_VMEM 583.7KB
ddt_entry_cache KMC_NOMAGAZINE|KMC_SLAB 0.0B
... <cropped> ...
zio_buf_1048576 KMC_NODEBUG|KMC_VMEM 0.0B
... <cropped> ...
```
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9478
Currently, for certain sizes and classes of allocations we use
SPL caches that are backed by caches in the Linux Slab allocator
to reduce fragmentation and increase utilization of memory. The
way things are implemented for these caches as of now though is
that we don't keep any statistics of the allocations that we
make from these caches.
This patch enables the tracking of allocated objects in those
SPL caches by making the trade-off of grabbing the cache lock
at every object allocation and free to update the respective
counter.
Additionally, this patch makes those caches visible in the
/proc/spl/kmem/slab special file.
As a side note, enabling the specific counter for those caches
enables SDB to create a more user-friendly interface than
/proc/spl/kmem/slab that can also cross-reference data from
slabinfo. Here is for example the output of one of those
caches in SDB that outputs the name of the underlying Linux
cache, the memory of SPL objects allocated in that cache,
and the percentage of those objects compared to all the
objects in it:
```
> spl_kmem_caches | filter obj.skc_name == "zio_buf_512" | pp
name ... source total_memory util
----------- ... ----------------- ------------ ----
zio_buf_512 ... kmalloc-512[SLUB] 16.9MB 8
```
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9474
While it may sometimes be convenient to export an NFS filesystem with
no_root_squash it should not be the default behavior. Align the
default behavior with the Linux NFS server defaults. To restore
the previous behavior use 'zfs set sharenfs="no_root_squash,..."'.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9397Closes#9425
After commit 5e74ac51 which split and reordered the run files the
`zpool_status_-s` test began failing. The new ordering placed
the test after a previous test which used `zpool replace` to replace
a disk but did not clear its label. This resulted in the next test,
`zpool_status_-s`, failing because of the potentially active
pool being detected on the replaced vdev.
/dev/loop0 is part of potentially active pool 'testpool'
Use the default_mirror_setup_noexit() and default_cleanup_noexit()
functions to create the pool in `zpool_status_-s`. They use the -f
flag by default.
In the `scrub_after_resilver` test wipe the label during cleanup
to prevent future failures if the tests are again reordered.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9451
Since 0.7.0, zpool import would unconditionally block on udev for 30
seconds. This introduced a regression in initramfs environments that
lack udev (particularly mdev based environments), yet use a zfs userland
tools intended for the system that had been built against udev. Gentoo's
genkernel is the main example, although custom user initramfs
environments would be similarly impacted unless special builds of the
ZFS userland utilities were done for them. Such environments already
have their own mechanisms for blocking until device nodes are ready
(such as genkernel's scandelay parameter), so it is unnecessary for
zpool import to block on a non-existent udev until a timeout is reached
inside of them.
Rather than trying to intelligently determine whether udev is available
on the system to avoid unnecessarily blocking in such environments, it
seems best to just allow the environment to override the timeout. I
propose that we add an environment variable called
ZPOOL_IMPORT_UDEV_TIMEOUT_MS. Setting it to 0 would restore the 0.6.x
behavior that was more desirable in mdev based initramfs environments.
This allows the system user land utilities to be reused when building
mdev-based initramfs archives.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#9436
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9445
Mostly whitespace changes, no functional changes intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9447
When "feature@allocation_classes" is not enabled on the pool no vdev
with "special" or "dedup" allocation type should be allowed to exist in
the vdev tree.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9427Closes#9429
Update the zfs(8) man page to clearly describe that arguments for
channel programs are to be listed after the -- sentinel which
terminates argument processing. This behavior is supported by
getopt on Linux, FreeBSD, and Illumos according to each platforms
respective man pages.
zfs program [-jn] [-t instruction-limit] [-m memory-limit]
pool script [--] arg1 ...
Reviewed-by: Clint Armstrong <clint@clintarmstrong.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9056Closes#9428
Correctly use the `mntpnt_fs` variable, and include additional
logic to ensure the /etc/hostid is correct set up and cleaned up.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9349
If stdin if empty - don't run xargs command,
otherwise we can get `cp: missing file operand`
error.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes#9418
There have been occasional CI failures which occur when the trimmed
vdev size exactly matches the target size. Resolve this by slightly
relaxing the conditional and checking for -ge rather than -gt. In
all of the cases observer, the values match exactly. For example:
Failure /mnt/trim-vdev1 is 768 MB which is not -gt than 768 MB
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9399
Commit 093bb64 resolved an automount failures for chroot'd processes
but inadvertently broke automounting for root filesystems where the
vfs_mntpoint is NULL. Resolve the issue by checking for NULL in order
to generate the correct path.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9381Closes#9384
A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9402
Update cleanup_upgrade to use destroy_dataset and destroy_pool
when performing cleanup. These wrappers retry if the pool is busy
preventing occasional failures like those observed when running
tests upgrade_readonly_pool. For example:
SUCCESS: test enabled == enabled
User accounting upgrade is not executed on readonly pool
NOTE: Performing local cleanup via log_onexit (cleanup_upgrade)
cannot destroy 'testpool': pool is busy
ERROR: zpool destroy testpool exited 1
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9400
If /var/lib is a dataset not under <pool>/ROOT/<root_dataset>, as
proposed in the ubuntu root on zfs upstream guide
(https://github.com/zfsonlinux/zfs/wiki/Ubuntu-18.04-Root-on-ZFS),
we end up with a race where some services, like systemd-random-seed
are writing under /var/lib, while zfs-mount is called. zfs mount will
then potentially fail because of /var/lib isn't empty and so, can't be
mounted.
Order those 2 units for now (more may be needed) as we can't declare
virtually a provide mount point to match
"RequiresMountsFor=/var/lib/systemd/random-seed" from
systemd-random-seed.service.
The optional generator for zfs 0.8 fixes it, but it's not enabled
by default nor necessarily required.
Example:
- rpool/ROOT/ubuntu (mountpoint = /)
- rpool/var/ (mountpoint = /var)
- rpool/var/lib (mountpoint = /var/lib)
Both zfs-mount.service and systemd-random-seed.service are starting
After=systemd-remount-fs.service. zfs-mount.service should be done
before local-fs.target while systemd-random-seed.service should finish
before sysinit.target (which is a later target).
Ideally, we would have a way for zfs mount -a unit to declare all paths
or move systemd-random-seed after local-fs.target.
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#9360
Line 31 and 32 overwrote the ${root} variable which broke mount-zfs.sh
We have create a new variable for the dataset instead of overwriting the
${root} variable in zfs-load-key.sh${root} variable in zfs-load-key.sh
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dacian Reece-Stremtan <dacianstremtan@gmail.com>
Closes#8913Closes#9379
Reduce the time required for ./configure to perform the needed
KABI checks by allowing kbuild to compile multiple test cases in
parallel. This was accomplished by splitting each test's source
code from the logic handling whether that code could be compiled
or not.
By introducing this split it's possible to minimize the number of
times kbuild needs to be invoked. As importantly, it means all of
the tests can be built in parallel. This does require a little extra
care since we expect some tests to fail, so the --keep-going (-k)
option must be provided otherwise some tests may not get compiled.
Furthermore, since a failure during the kbuild modpost phase will
result in an early exit; the final linking phase is limited to tests
which passed the initial compilation and produced an object file.
Once everything has been built the configure script proceeds as
previously. The only significant difference is that it now merely
needs to test for the existence of a .ko file to determine the
result of a given test. This vastly speeds up the entire process.
New test cases should use ZFS_LINUX_TEST_SRC to declare their test
source code and ZFS_LINUX_TEST_RESULT to check the result. All of
the existing kernel-*.m4 files have been updated accordingly, see
config/kernel-current-time.m4 for a basic example. The legacy
ZFS_LINUX_TRY_COMPILE macro has been kept to handle special cases
but it's use is not encouraged.
master (secs) patched (secs)
------------- ----------------
autogen.sh 61 68
configure 137 24 (~17% of current run time)
make -j $(nproc) 44 44
make rpms 287 150
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8547Closes#9132Closes#9341
Conflicts:
Makefile.am
config/kernel-fpu.m4
fxsave and xsave require the target address to be 16-/64-byte aligned.
kmalloc(_node) does not (yet) offer such fine-grained control over
alignment[0,1], even though it does "the right thing" most of the time
for power-of-2 sizes. unfortunately, alignment is completely off when
using certain debugging or hardening features/configs, such as KASAN,
slub_debug=Z or the not-yet-upstream SLAB_CANARY.
Use alloc_pages_node() instead which allows us to allocate page-aligned
memory. Since fpregs_state is padded to a full page anyway, and this
code is only relevant for x86 which has 4k pages, this approach should
not allocate any unnecessary memory but still guarantee the needed
alignment.
0: https://lwn.net/Articles/787740/
1: https://lore.kernel.org/linux-block/20190826111627.7505-1-vbabka@suse.cz/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9608Closes#9674
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels.
This commit squashes the following commits from master in to
a single commit which can be applied to 0.8.2.
10fa2545 - Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore
b88ca2ac - Enable SIMD for encryption
095b5412 - Fix CONFIG_X86_DEBUG_FPU build failure
e5db3134 - Linux 5.0 compat: SIMD compatibility
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
When the xattr/cleanup.ksh script is unable to remove the test group
due to an active process then it will not call default_cleanup. This
will result in a zvol_ENOSPC/setup failure when attempting to create
the /mnt/testdir directory which will already exist.
Resolve the issue by performing the default_cleanup before removing
the test user and group to ensure this step always happens. Also
allow one more retry to further minimize the likelihood of the
cleanup failing.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9358
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices. At the time this was simple
because the kernel provided an API function which did exactly this.
This API was then removed in the Linux 4.12 kernel which prompted
us to add compatibly code to set the elevator via a usermodehelper.
While well intentioned this introduced a bug which could cause a
system hang, that issue was subsequently fixed by commit 2a0d4188.
In order to avoid future bugs in this area, and to simplify the code,
this functionality is being deprecated. A console warning has been
added to notify any existing consumers and the documentation updated
accordingly. This option will remain for the lifetime of the 0.8.x
series for compatibility but if planned to be phased out of master.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664Closes#9317
Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try
to access its interior slots when dnodesize > sizeof(dnode_phys_t).
This is normally not an issue because the interior slots are
zero-filled, which report_dnode() handles calling
report_free_dnode_range(). However this is not the case for encrypted
large dnodes or filesystem using many SA based xattrs where the extra
data past the legacy dnode size boundary is interpreted as a
dnode_phys_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7678Closes#8931Closes#9343
The difference between the sizes could be positive or negative. Leaving
the types as unsigned means the result overflows when the difference is
negative and removing the labs() means we'll have introduced a bug. The
subtraction results in the correct value when the unsigned integer is
interpreted as a signed integer by labs().
Clang doesn't see that we're doing a subtraction and abusing the types.
It sees the result of the subtraction, an unsigned value, being passed
to an absolute value function and emits a warning which we treat as an
error.
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9355
Move the trailing newlines from the error message strings to the format
strings to more closely match the other error messages.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9330
This commit fixes a NULL pointer dereference triggered in
spa_vdev_remove_top_check() by trying to "zpool remove" an indirect
vdev.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9327
Since 4f342e45 env(1) must be able to find a "python2" executable in
the "constrained path" on systems configured with --with-python=2.x
otherwise the ZFS Test Suite won't be able to use Python scripts.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9325
Currently, spa_keystore_change_key_sync_impl() does not recurse
into clones when updating encryption roots for either a call to
'zfs promote' or 'zfs change-key'. This can cause children of
these clones to end up in a state where they point to the wrong
dataset as the encryption root. It can also trigger ASSERTs in
some cases where the code checks reference counts on wrapping
keys. This patch fixes this issue by ensuring that this function
properly recurses into clones during processing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9267Closes#9294
/usr/bin/env python3 is the suggested[1] shebang for Python in general
(likewise for python2) and is conventional across platforms. This eases
development on systems where python is not installed in /usr/bin
(FreeBSD for example) and makes it possible to develop in virtual
environments (venv) for isolating dependencies.
Many packaging guidelines discourage the use of /usr/bin/env, but since
this is the canonical way of writing shebangs in the Python community,
many packaging scripts are already equipped to handle substituting the
appropriate absolute path to python automatically.
Some RPM package builders lacking brp-mangle-shebangs need a small
fallback mechanism in the package spec to stamp the appropriate shebang
on installed Python scripts.
[1]: https://docs.python.org/3/using/unix.html?#miscellaneous
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9314
Currently, the DSL scan code figures out when it should suspend
processing and allow a txg to continue by calling the function
dsl_scan_check_suspend(). Unfortunately, this function only
allows the scan to suspend at a level 0 block. In the event that
the system is scanning a bunch of empty snapshots or a resilver
is running with a high enough scn_cur_min_txg, the scan will
stop processing each dataset at the root level, deciding it
has nothing left to do. This means that the check_suspend
function is never called and the txg remains stuck until a
dataset is found that has data to scan.
This patch fixes the problem by allowing scans to suspend at
the root level of the objset. For backwards compatibility, we
use the bookmark <objsetid, 0, 0, 0> when we suspend here so
that older versions of the code will work as intended.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9300
filetest_001_pos verifies that various checksum algorithms detect
corruption by overwriting the underlying vdev on which a file resides.
It is possible for the overwrite to miss the blocks of a file, causing a
spurious failure. This change introduces a function to corrupt the
individual blocks of a file as determined by zdb.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#9288
Get rid of the `get_used_prop` function. `get_prop used` works fine.
Fix the comment describing the function parameters. The type does not
have a default, and mntp is also used for ext2.
Rename the variable for the number of copies from `copy` to `copies`.
Use a `case` statement to match the type parameter, order the cases
alphabetically, and add a little sanity checking for good measure.
Use eval to make sure the output of commands is silenced rather than
the log messages when redirecting output to /dev/null.
Simplify cases where zfs requires special behavior.
Don't allow the test to loop forever in the event space usage does not
change. Bail out of the loop and fail after an arbitrary number of
iterations.
Add more information to the log message when the test fails, to help
debugging.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9286
Currently, the noop receive code fails to work with raw send streams
and resuming send streams. This happens because zfs_receive_impl()
reads the DRR_BEGIN payload without reading the payload itself.
Normally, the kernel expects to read this itself, but in this case
the recv_skip() code runs instead and it is not prepared to handle
the stream being left at any place other than the beginning of a
record.
This patch resolves this issue by manually reading the DRR_BEGIN
payload in the dry-run case. This patch also includes a number of
small fixups in this code path.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9221Closes#9173
Remove a lot of unnecessary setting and incrementing of `i`.
Remove unused variable `j`.
Instead of calling out to Python in a loop to generate the same string
repeatedly, generate the string once using shell constructs before
entering the loop.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9284
md5sum in particular but also sha256sum to a lesser extent is used
in several areas of the test suite for computing checksums. The vast
majority of invocations are followed by `| awk '{ print $1 }'`.
Introduce functions to wrap up `md5sum $file | awk '{ print $1 }'` and
likewise for sha256sum. These also serve as a convenient interface for
alternative implementations on other platforms.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9280
TRUE and FALSE happen to be defined, but we should use B_TRUE and
B_FALSE for the sake of consistency.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9264
Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value
is ignored in the Linux kstat code but resolves the issue for other
platforms.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9254Closes#9151
Create a larger file to extend the time required to perform the
removal. Occasional failures were observed due to the removal
completing before the cancel could be requested.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9259
When running on larger memory systems, we can overflow the value of
maxinflight. This can result in maxinflight having a value of 0 causing
the system to hang.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#9272
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9251
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9250
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9249
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9247
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9246
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9244
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9243
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9242
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9240
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9237
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9248
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9241
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9239
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9238
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9236
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9235
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9234
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9233
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9232
Must use 'zfs' instead of '$ZFS' which is undefined.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9257
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9185Closes#9186Closes#9231Closes#9253
Eliminate unnecessary code duplication. We can use a for-loop instead
of a while-loop. There is no need to echo $DISKSARRAY in a subshell or
return 0. Declare all variables with typeset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9224
BSD getopt() and getopt_long() want options before arguments.
Reorder arguments to zfs/zpool in tests to put all the options first.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9228
For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.
This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.
Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
and smaller gains as thread count increases. At >64 threads, 2-5%
decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
leveling out around 64 threads. At >64 threads, 5-10% decrease in
performance was observed.
- 128K sequential write sees 1-5 for the 128K. No observed regression at
high thread count.
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes#9217
Defining a special constant to make an infinite loop is excessive,
especially when the name clashes with symbols commonly defined on
some platforms (ie FreeBSD).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9219
Other than this test, zpool list -p is not well tested by any of the
automated tests. Add a test for zpool list -p.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9134
Split the arguments for ${TEST_RUNNER} across multiple lines for
clarity. Also added quotes in the message to match the invoked command.
Unquoted variables in argument lists are subject to splitting. In this
particular case we can't quote the variable because it is an optional
argument. Use the method suggested in the description linked below,
instead.
The technique is to use an unquoted variable with an alternate value.
https://github.com/koalaman/shellcheck/wiki/SC2086
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9212
Commit a887d653 updated the dbufstats such that escalated privileges
are required. Since all tests under cli_user are run with normal
privileges move this test case to a location where it will be run
required privileges.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9118Closes#9196
Document the ZFS_DKMS_ENABLE_DEBUGINFO option in the userland
configuration file, as done with the other ZFS_DKMS_* options.
It has been introduced with commit e45c1734a6 ("dkms: Enable
debuginfo option to be set with zfs sysconfig file") but isn't
mentioned anywhere other than the 'dkms.conf' file (generated).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes#9191
Reuse enum value ZFS_IOC_BASE for `('Z' << 8)`.
This is helpful on FreeBSD where ZFS_IOC_BASE has a different value and
`('Z' << 8)` is wrong.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9188
When checking ZFS_IOC_* numbers, print which numbers are wrong rather
than silently failing.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9187
The ancient version of blkid (v2.17.2) used in CentOS 6 will not
detect the newly created pool unless it has been written to.
Force a pool sync so `zpool import` will detect the newly created
pool.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9199
Split long lines where adding license info to dist archive.
Remove extra colon from target line.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9189
$fs used with the wrong sed command where should be $mntpnt instead
to match a variable exported by read_mtab()
The fix is mostly to reuse the sed command found in read_mtab()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Alexey Smirnoff <fling@member.fsf.org>
Closes#9168
There are two different deadlock scenarios, but they share a common
link, which is
thread 1 holding sa_lock and trying to get zap->zap_rwlock:
zap_lockdir_impl+0x858/0x16c0 [zfs]
zap_lockdir+0xd2/0x100 [zfs]
zap_lookup_norm+0x7f/0x100 [zfs]
zap_lookup+0x12/0x20 [zfs]
sa_setup+0x902/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
and thread 2 trying to get sa_lock, either in sa_setup:
sa_setup+0x742/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
sa_build_index+0x13d/0x790 [zfs]
sa_handle_get_from_db+0x368/0x500 [zfs]
zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
zfs_znode_alloc+0x3da/0x1a40 [zfs]
zfs_zget+0x39a/0x6e0 [zfs]
zfs_root+0x101/0x160 [zfs]
zfs_domount+0x91f/0xea0 [zfs]
From there, there are different locking paths back to something
holding zap->zap_rwlock.
The deadlock scenarios involve multiple different ZFS filesystems
being mounted. sa_lock is common to these scenarios, and the sa
struct involved is private to a mount. Therefore, these must be
referring to different sa_lock instances and these deadlocks can't
occur in practice.
The fix, from Brian Behlendorf, is to remove sa_lock from lockdep
coverage by initializing it with MUTEX_NOLOCKDEP.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#9110
Existing zfs initramfs script logic will attempt to set the 'noop'
scheduler if it's available on the vdev block devices. Newer kernels
have the similar 'none' scheduler on multiqueue devices; this change
alters the initramfs script logic to also attempt to set this scheduler
if it's available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes#9042
It used to be possible for zfs receive (and other operations related
to clone swap) to bypass refquotas. This can cause a number of issues,
and there should be an automated test for it.
Added tests for rollback and receive not overriding refquota.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9139
* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions
At least debian needs /etc/default/zfs and /etc/zfs/zfs-functions for
its initramfs. Include both in build when initramfs is configured.
* contrib/initramfs: include 60-zvol.rules and zvol_id
Include 60-zvol.rules and zvol_id and set udev as predependency instead
of debians zdev. This makes debians additional zdev hook unneeded.
* Correct initconfdir substitution for some distros
Not every Linux distro is using @sysconfdir@/default but @initconfdir@
which is already determined by configure. Let's use it.
* systemd: prevent possible conflict between systemd and sysvinit
Systemd will not load a sysvinit service if a unit exists with the same
name. This prevents conflicts between sysvinit and systemd.
In ZFS there is one sysvinit service that does not have a systemd
service but a target counterpart, zfs-import.target.
Usually it does not make any sense to install both but it is possisble.
Let's prevent any conflict by masking zfs-import.service by default.
This does not harm even if init.d/zfs-import does not exist.
Reviewed-by: Chris Wedgwood <cw@f00f.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Alex Ingram <reimu@reimuhakurei.net>
Tested-by: Dreamcat4 <dreamcat4@gmail.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#7904Closes#9089
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call in one of the above function but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes#9137
In zfs_log_write(), we can use dmu_read_by_dnode() rather than
dmu_read() thus avoiding unnecessary dnode_hold() calls.
We get a 2-5% performance gain for large sequential_writes tests, >=128K
writes to files with recordsize=8K.
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes#9156
This patch introduces an assertion that can catch pitfalls in
development where there is a mismatch between the size of
reads and writes between a *_phys structure and its respective
in-core structure when bonus buffers are used.
This debugging-aid should be complementary to the verification
done by ztest in ztest_verify_dnode_bt().
A side to this patch is that we now clear out any extra bytes
past a bonus buffer's new size when the buffer is shrinking.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8348
The call to txg_wait_synced in zfsvfs_teardown should
be made conditional on the objset having dirty data.
This can prevent unnecessary txg_wait_synced during
some unmount operations.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9115
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run
into a race condition where that vdev is temporarily unavailable. This
happens when a device removal operation and the old vdev_t has been
removed from the array, but the new indirect vdev has not yet been
inserted.
We hold the spa_config_lock while doing our sensitive verification.
To ensure that we don't deadlock, we only grab the lock if we don't
have config_writer held. In addition, I had to const the tags of the
refcounts and the spa_config_lock arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9112
When running on an ESXi based VM, I've found that "zpool online -e" will
not expand the zpool, if the disk was expanded in ESXi while the VM was
powered off.
For example, take the following scenario:
1. VM running on top of VMware ESXi
2. ZFS pool created with a given device "sda" of size 8GB
3. VM powered off
4. Device "sda" size expanded to 16GB
5. VM powered on
6. "zpool online -e" used on device "sda"
In this situation, after (2) the zpool will be roughly 8GB in size.
After (6), the expectation is the zpool's size will expand to roughly
16GB in size; i.e. expand to the new size of the "sda" device.
Unfortunately, I've seen that after (6), the zpool size does not change.
What's happening is after (5), the EFI label of the "sda" device will be
such that fields "efi_last_u_lba", "efi_last_lba", and "efi_altern_lba"
all reflect the new size of the disk; i.e. "33554398", "33554431", and
"33554431" respectively.
Thus, the check that we perform in "efi_use_whole_disk":
if ((efi_label->efi_altern_lba == 1) || (efi_label->efi_altern_lba
>= efi_label->efi_last_lba)) {
This will return true, and then we return from the function without
having expanded the size of the zpool/device.
In contrast, if we remove steps (3) and (5) in the sequence above, i.e.
the device is expanded while the VM is powered on, things change. In
that case, the fields "efi_last_u_lba" and "efi_altern_lba" do not
change (i.e. they still reflect the old 8GB device size), but the
"efi_last_lba" field does change (i.e. it now reflects the new 16GB
device size). Thus, when we evaluate the same conditional in
"efi_use_whole_disk", it'll return false, so the zpool is expanded.
Taking all of this into account, this PR updates "efi_use_whole_disk" to
properly expand the zpool when the underlying disk is expanded while the
VM is powered off.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#9111
When a pool is imported it will scan the pool to verify the integrity
of the data and metadata. The amount it scans will depend on the
import flags provided. On systems with small amounts of memory or
when importing a pool from the crash kernel, it's possible for
spa_load_verify to issue too many I/Os that it consumes all the memory
of the system resulting in an OOM message or a hang.
To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes#9146
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9092
Conflicts:
include/sys/zfs_znode.h
module/zfs/zfs_znode.c
This is not implemented. If it were implemented, using it would risk
deadlocks on pre-3.18 kernels. Lets just drop it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#9119
ZED can prevent CPU's from properly sleeping.
Rather than periodically waking up in the zevents code, just go to sleep and wait for a wakeup.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#9091
This patch adds a new test that sanity checks cancelling a removal.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9101
Conflicts:
tests/zfs-tests/tests/functional/removal/Makefile.am
This fixes a lockdep warning by breaking a link between ->tx_sync_lock
and ->dp_lock.
The deadlock envisioned by lockdep is this:
thread 1 holds db->db_mtx and tries to get dp->dp_lock:
dsl_pool_dirty_space+0x70/0x2d0 [zfs]
dbuf_dirty+0x778/0x31d0 [zfs]
thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
dmu_buf_will_dirty_impl
dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
bpobj_iterate_impl+0xbe6/0x1410 [zfs]
thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
bpobj_space+0x63/0x470 [zfs]
dsl_scan_active+0x340/0x3d0 [zfs]
txg_sync_thread+0x3f2/0x1370 [zfs]
thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock
txg_kick+0x61/0x420 [zfs]
dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]
This patch is orginally from Brian Behlendorf and slightly simplified
by me.
It breaks this cycle in thread 4 by moving the call from
dsl_pool_need_dirty_delay to txg_kick outside the section controlled
by dp->dp_lock.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#9094
Channel programs that many users find useful should be included with zfs
in the /contrib directory. This is the first of these contributions. A
channel program to recursively take snapshots of datasets with the
property com.sun:auto-snapshot=true.
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clint Armstrong <clint@clintarmstrong.net>
Closes#8443Closes#9050
* rpm: correct pkgconfig path
pkconfig files get installed to $datarootdir/pkgconfig but rpm expects
them to be at $datadir. This works when $datarootdir==$datadir which is
the case most of the time but will fail when they differ.
* install: make initramfs-tools path static
Since initramfs-tools' path is nothing we can control as it is an
external package it does not make any sense to install zfs additions
anywhere else. Simply use /usr/share/initramfs-tools as path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#9087
We return ENOSPC in metaslab_activate if the metaslab has weight 0,
to avoid activating a metaslab with no space available. For sanity
checking, we also assert that there is no free space in the range
tree in that case.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#8968
When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation. There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:
1) Skip blocks may be added to keep each allocation a multiple
of parity + 1. This is the dominating factor when volblocksize
is close to 2^ashift.
2) raidz deflation for 128 KB blocks is different for most other
block sizes.
See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.
Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Mike Gerdts <mike.gerdts@joyent.com>
Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy
OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0Closes#8973
With the new parallel allocators scheme, there is a possibility for
a problem where two threads, allocating from the same allocator at
the same time, conflict with each other. There are two primary cases
to worry about. First, another thread working on another allocator
activates the same metaslab that the first thread was trying to
activate. This results in the first thread needing to go back and
reselect a new metaslab, even though it may have waited a long time
for this metaslab to load. Second, another thread working on the same
allocator may have activated a different metaslab while the first
thread was waiting for its metaslab to load. Both of these cases
can cause the first thread to be significantly delayed in issuing
its IOs. The second case can also cause metaslab load/unload churn;
because the metaslab is loaded but not fully activated, we never set
the selected_txg, which results in the metaslab being immediately
unloaded again. This process can repeat many times, wasting disk and
cpu resources. This is more likely to happen when the IO of the first
thread is a larger one (like a ZIL write) and the other thread is
doing a smaller write, because it is more likely to find an
acceptable metaslab quickly.
There are two primary changes. The first is to always proceed with
the allocation when returning from metaslab_activate if we were
preempted in either of the ways described in the previous section.
The second change is to set the selected_txg before we do the call
to activate so that even if the metaslab is not used for an
allocation, we won't immediately attempt to unload it.
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-61314
Closes#8843
With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8951
Conflicts:
include/sys/spa.h
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.
Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete. Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.
Porting notes: Added missing return value from cv_wait_sig()
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926Closes#8904
On fragmented pools with high-performance storage, the looping in
metaslab_block_picker() can become the performance-limiting bottleneck.
When looking for a larger block (e.g. a 128K block for the ZIL), we may
search through many free segments (up to hundreds of thousands) to find
one that is large enough to satisfy the allocation. This can take a long
time (up to dozens of ms), and is done while holding the ms_lock, which
other threads may spin waiting for.
When this performance problem is encountered, profiling will show
high CPU time in metaslab_block_picker, as well as in mutex_enter from
various callers.
The problem is very evident on a test system with a sync write workload
with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage,
84% full and 88% fragmented. It has also been observed on production
systems with 90TB of storage, 76% full and 87% fragmented.
The fix is to change metaslab_df_alloc() to search only up to 16MB from
the previous allocation (of this alignment). After that, we will pick a
segment that is of the exact size requested (or larger). This reduces
the number of iterations to a few hundred on fragmented pools (a ~100x
improvement).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-62324
Closes#8877
Scatter ABD's are allocated from a number of pages. In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer. Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer. This can have a
measurable performance overhead on some workloads.
https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter. For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path. For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).
This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.
All single-page (4K) ABD's can be represented this way. Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages). This is often the case for 2-page (8K)
ABD's.
Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.
Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space. This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8580
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations. The large allocations are for ZIL blocks. If there is a
lot of fragmentation, the large allocations can be hard to satisfy.
The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more. In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.
To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.
External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8865
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.
This was originally fixed in illumos-joyent as OS-7982.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes#9299Closes#9338
The was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.
In both cases, if ds1 was already present, then it would be first
replaced with ds2 and then ds would be replaced back with ds1.
Also, both cases did not properly handle a situation where both ds1 and
ds2 are already queued. A duplicate insertion would be attempted and
its failure would result in a panic.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#9140Closes#9163
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices. At the time this was simple
because the kernel provided an API function which did exactly this.
This API was then removed in the Linux 4.12 kernel which prompted
us to add compatibly code to set the elevator via a usermodehelper.
Unfortunately changing the evelator via usermodehelper requires reading
some userland binaries, most notably modprobe(8) or sh(1), from a zfs
dataset on systems with root-on-zfs. This can deadlock the system if
used during the following call path because it may need, if the data
is not already cached in the ARC, reading directly from disk while
holding the spa config lock as a writer:
zfs_ioc_pool_scan()
-> spa_scan()
-> spa_scan()
-> vdev_reopen()
-> vdev_elevator_switch()
-> call_usermodehelper()
While the usermodehelper waits sh(1), modprobe(8) is blocked in the
ZIO pipeline trying to read from disk:
INFO: task modprobe:2650 blocked for more than 10 seconds.
Tainted: P OE 5.2.14
modprobe D 0 2650 206 0x00000000
Call Trace:
? __schedule+0x244/0x5f0
schedule+0x2f/0xa0
cv_wait_common+0x156/0x290 [spl]
? do_wait_intr_irq+0xb0/0xb0
spa_config_enter+0x13b/0x1e0 [zfs]
zio_vdev_io_start+0x51d/0x590 [zfs]
? tsd_get_by_thread+0x3b/0x80 [spl]
zio_nowait+0x142/0x2f0 [zfs]
arc_read+0xb2d/0x19d0 [zfs]
...
zpl_iter_read+0xfa/0x170 [zfs]
new_sync_read+0x124/0x1b0
vfs_read+0x91/0x140
ksys_read+0x59/0xd0
do_syscall_64+0x4f/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This commit changes how we use the usermodehelper functionality from
synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents
scrubs, and other vdev_elevator_switch() consumers, from triggering the
aforementioned issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue #8664Closes#9321
1. Fix issue: Kernel BUG with QAT during decompression #9276.
Now it is uninterruptible for a specific given QAT request,
but Ctrl-C interrupt still works in user-space process.
2. Copy the digest result to the buffer only when doing encryption,
and vise-versa for decryption.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com>
Closes#9276Closes#9303
Determine the location of depmod on the system, either /sbin/depmod or
/usr/sbin/depmod. Then use that path when generating the specfile.
Additionally, update the Requires lines to reference the package which
provides depmod rather than the binary itself. For CentOS/RHEL 7+8
and all supported Fedora releases this is the kmod package, and for
CentOS/RHEL 6 it is the module-init-tools package.
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8724Closes#9310
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify(). A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool. This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer. For example, to perform a
`zpool attach` as shown in the abbreviated stack below.
To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property. The cached
value is then relied upon for subsequent accesses.
Call Trace:
spa_config_enter+0x1e8/0x350 [zfs]
zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
zio_read+0x6c/0x140 [zfs]
...
vfs_read+0xfc/0x1e0
kernel_read+0x50/0x90
...
spa_get_hostid+0x1c/0x38 [zfs]
spa_config_generate+0x1a0/0x610 [zfs]
vdev_label_init+0xa0/0xc80 [zfs]
vdev_create+0x98/0xe0 [zfs]
spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9256Closes#9285
Building against RHEL 8 requires libtirpc-devel, as with fedora 28.
Add rhel8 and centos8 options to the test, to account for that.
BuildRequires Originally added for fedora 28 via commit
1a62a305be
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#9289
Both 'detach' and 'online' zpool subcommands, when provided with an
unsupported option, forget to print it in the error message:
# zpool online -t rpool vda3
invalid option ''
usage:
online [-e] <pool> <device> ...
This changes fixes the error message in order to include the actual
option that is not supported.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9270
Debian zfs-dkms package generated by alien doesn't call the prerm script
(rpm's %preun) with an integer as first parameter, which results in the
following warning when the package is uninstalled:
"zfs-dkms.prerm: line 3: [: remove: integer expression expected"
Modify the if-condition to avoid the warning.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9271
Partially received zvols won't have links in /dev/zvol.
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Closes#9260
The zfs-volume-wait.service scans existing zvols and waits for their
links under /dev to be created. Any service that depends on zvol
links to be there should add a dependency on zfs-volumes.target.
By default, this target is not enabled.
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pzakharov@delphix.com>
Closes#8975
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).
Additionally, distinguish incremental and resume streams in error
messages.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#9252
This change use the compat code introduced in 9cc1844a.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9268Closes#9269
Remove the x86_64 warning, it's no longer the case that this is the
only supported architecture.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Closes: #9177
This is a typical case of use after free. We would call zfs_close(zhp)
which would free the handle, and then call zfs_iter_children() on that
handle later. This change ensures that the zfs_handle is only closed
when we are ready to return.
Running `zfs inherit -r sharenfs pool` was failing with an error
code without any error messages. After some debugging I've pinpointed
the issue to be memory corruption, which would cause zfs to try to
issue an ioctl to the wrong device and receive ENOTTY.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Issue #7967Closes#9165
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.
We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7151Closes#8910Closes#9123Closes#9145
Previously, the permissions were checked on the pool which was obviously
incorrect.
After this change, zfs_check_userprops() only validates the properties
without any permission checks. The permissions are checked individually
for each snapshotted dataset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#9179Closes#9180
Entering the ZFS encryption passphrase under Plymouth wasn't working
because in the ZFS initrd script, Plymouth was calling zfs via
"--command", which wasn't passing through the filesystem argument to
zfs load-key properly (it was passing through the single quotes around
the filesystem name intended to handle spaces literally,
which zfs load-key couldn't understand).
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Signed-off-by: Richard Allen <belperite@gmail.com>
Issue #9193Closes#9202
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.
This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9203
The slog tests fail when attempting to create pools using file vdevs
that already exist from previous test runs. Remove these files in the
setup for the test.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9194
Fix some switch() fall-though compiler errors:
abd.c:1504:9: error: this statement may fall through
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#9170
Uses obj-m instead, due to kernel changes.
See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Dominic Pearson <dsp@technoanimal.net>
Closes#9169
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.
We fix this by only calling zil_remove_async when the file is fully
unlinked.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#8769Closes#9061
When creating hundreds of clones (for example using containers with
LXD) cloning slows down as the number of clones increases over time.
The reason for this is that the fetching of the clone information
using a small zcmd buffer requires two ioctl calls, one to determine
the size and a second to return the data. However, this requires
gathering the data twice, once to determine the size and again to
populate the zcmd buffer to return it to userspace.
These are expensive ioctl() calls, so instead, make the default buffer
size much larger: 256K.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#9084
In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand. This eliminates some calls to dnode_hold(), which can be
expensive. This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.
We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.
This also removes some unnecessary casts that are in the area.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9081
When adapting the original sources for s390x the JMP_BUF_CNT was
mistakenly halved due to an incorrect assumption of the size of
a unsigned long. They are 8 bytes for the s390x architecture.
Increase JMP_BUF_CNT accordingly.
Authored-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Colin Ian King <canonical.com>
Tested-by: Colin Ian King <canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8992Closes#9080
When a system boots the zfs-mount.service and the
zfs-share.service can start simultaneously. What may be
unclear is that sharing a filesystem will first mount
the filesystem if it's not already mounted. This means
that both service can race to mount the same fileystem.
This race can result in a SEGFAULT or EBUSY conditions.
This change explicitly defines the start ordering between the
two services such that the zfs-mount.service is solely
responsible for mounting filesystems eliminating the race
between "zfs mount -a" and "zfs share -a" commands.
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#9083
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.
https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9035Closes#9043
zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
gs_devid is NULL, but the nvl has a "devid" entry.
zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is
present in nvl, but then later it and zfs_agent_iter_vdev() assume that
DEV_IDENTIFIER is present and thus gs_devid is set.
Typically this is not a problem because usually either all vdevs have
devid's, or none of them do. Since zfs_agent_iter_vdev() first checks if
the vdev has devid before dereferencing gs_devid, the problem isn't
typically encountered. However, if some vdevs have devid's and some do
not, then the problem is easily reproduced. This can happen if the pool
has been moved from a system that has devid's to one that does not.
The fix is for zfs_agent_iter_vdev() to only try to match the devid's if
both nvl and gsp have devid's present.
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65090
Closes#9054Closes#9060
Cast to uintptr_t first for portability on integer to/from pointer
conversion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9065
The tests in tests/functional/cli_root/zpool_status should all require
root. However, linux.run has "user =" specified for those tests, which
means they run as a normal user. When I removed that line to run them
as root, the following tests did not pass:
zpool_status_003_pos
zpool_status_-c_disable
zpool_status_-c_homedir
zpool_status_-c_searchpath
These tests need to be run as a normal user. To fix this, move these
tests to a new tests/functional/cli_user/zpool_status directory.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#9057
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).
This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9015Closes#9044
There exists a race condition were hdr_recl() calls
zthr_wakeup() on a destroyed zthr. The timeline is the
following:
[1] hdr_recl() runs first and goes intro zthr_wakeup()
because arc_initialized is set.
[2] arc_fini() is called by another thread, zeroes
that flag, destroying the zthr, and goes into
buf_init().
[3] hdr_recl() tries to enter the destroyed mutex
and we blow up.
This patch ensures that the ARC's zthrs are not offloaded
any new work once arc_initialized is set and then destroys
them after all of the ARC state has been deleted.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9047
These aren't tunable; illumos has this comment fixed in
"3742 zfs comments need cleaner, more consistent style",
so sync with that.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9052
These functions are unused and can be removed along
with the spl-mutex.c and spl-rwlock.c source files.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9029
The Linux kernel's rwsem's have never provided an interface to
allow a reader to be upgraded to a writer. Historically, this
functionality has been implemented by a SPL wrapper function.
However, this approach depends on internal knowledge of the
rw_semaphore and is therefore rather brittle.
Since the ZFS code must always be able to fallback to rw_exit()
and rw_enter() when an rw_tryupgrade() fails; this functionality
isn't critical. Furthermore, the only potentially performance
sensitive consumer is dmu_zfetch() and no decrease in performance
was observed with this change applied. See the PR comments for
additional testing details.
Therefore, it is being retired to make the build more robust and
to simplify the rwlock implementation.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9029
Commit https://github.com/torvalds/linux/commit/94a9717b updated the
rwsem's owner field to contain additional flags describing the rwsem's
state. Rather then update the wrappers to mask out these bits, the
code no longer relies on the owner stored by the kernel. This does
increase the size of a krwlock_t but it makes the implementation
less sensitive to future kernel changes.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9029
lockdep reports a possible recursive lock in dbuf_destroy.
It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode. However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.
This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.
In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.
This is the lockdep report:
[ 122.950921] ============================================
[ 122.950921] WARNING: possible recursive locking detected
[ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O
[ 122.950921] --------------------------------------------
[ 122.950921] dbu_evict/1457 is trying to acquire lock:
[ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[ 122.950921]
but task is already holding lock:
[ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[ 122.950921]
other info that might help us debug this:
[ 122.950921] Possible unsafe locking scenario:
[ 122.950921] CPU0
[ 122.950921] ----
[ 122.950921] lock(&dn->dn_dbufs_mtx);
[ 122.950921] lock(&dn->dn_dbufs_mtx);
[ 122.950921]
*** DEADLOCK ***
[ 122.950921] May be due to missing lock nesting notation
[ 122.950921] 1 lock held by dbu_evict/1457:
[ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[ 122.950921]
stack backtrace:
[ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1
[ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009
[ 122.950921] Call Trace:
[ 122.950921] dump_stack+0x91/0xeb
[ 122.950921] __lock_acquire+0x2ca7/0x4f10
[ 122.950921] lock_acquire+0x153/0x330
[ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs]
[ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs]
[ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs]
[ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs]
[ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs]
[ 122.950921] taskq_thread+0x979/0x1480 [spl]
[ 122.950921] kthread+0x2e7/0x3e0
[ 122.950921] ret_from_fork+0x27/0x50
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#8984
Make use of __GFP_HIGHMEM flag in vmem_alloc, which is required for
some 32-bit systems to make use of full available memory.
While kernel versions >=4.12-rc1 add this flag implicitly, older
kernels do not.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#9031
zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this file.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9039
Resolve an assortment of style inconsistencies including
use of white space, typos, capitalization, and line wrapping.
There is no functional change.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9030
The cast of the size_t returned by strlcpy() to a uint64_t by the
VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE
is set. This is due to the additional hardening. Since the token
is expected to always fit in strval the VERIFY3U has been removed.
If somehow it doesn't, it will still be safely truncated.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8999Closes#9020
Modify zfs-mount-generator to produce a dependency on new
zfs-import-key-*.service units, dynamically created at boot to call
zfs load-key for the encryption root, before attempting to mount any
encrypted datasets.
These units are created by zfs-mount-generator, and RequiresMountsFor on
the keyfile, if present, or call systemd-ask-password if a passphrase is
requested.
This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and
@rlaager, as well an adaptation of @rlaager's script to retry on
incorrect password entry.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes#8750Closes#8848
ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(),
so no need to test again on POSIX ACL access.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9009
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8994Closes#9027
This patch corrects a small issue where the dsl_destroy_head()
code that runs when the async_destroy feature is disabled would
not properly decrypt the dataset before beginning processing.
If the dataset is not able to be decrypted, the optimization
code now simply does not run and the dataset is completely
destroyed in the DSL sync task.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9021
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.
Saves stack a bit on lookup and unlink.
(#if0'd members instead of removing since comments refer to them.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9025
Large allocation over the spl_kmem_alloc_warn value was being performed.
Switched to vmem_alloc interface as specified for large allocations.
Changed the subsequent frees to match.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: nmattis <nickm970@gmail.com>
Closes#8934Closes#9011
log_neg_expect was using the wrong exit status to detect if a process
got killed by SIGSEGV or SIGBUS, resulting in false positives.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#9003
Use python -Esc to set __python_sitelib.
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Closes#8969
Strategy of parallel mount is as follows.
1) Initial thread dispatching is to select sets of mount points that
don't have dependencies on other sets, hence threads can/should run
lock-less and shouldn't race with other threads for other sets. Each
thread dispatched corresponds to top level directory which may or may
not have datasets to be mounted on sub directories.
2) Subsequent recursive thread dispatching for each thread from 1)
is to mount datasets for each set of mount points. The mount points
within each set have dependencies (i.e. child directories), so child
directories are processed only after parent directory completes.
The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.
There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.
1) #8833 case where input is `/a /a /a/b` after sorting.
The problem is that libzfs_path_contains() can't correctly handle an
input list with two same top level directories.
There is a race between two POSIX threads A and B,
* ThreadA for "/a" for test1 and "/a/b"
* ThreadB for "/a" for test0/a
and in case of #8833, ThreadA won the race. Two threads were created
because "/a" wasn't considered as `"/a" contains "/a"`.
2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
The problem is that libzfs_path_contains() can't correctly handle an
input list containing "/".
There is a race between two POSIX threads A and B,
* ThreadA for "/" and "/var/data/test"
* ThreadB for "/var/data"
and in case of #8450, ThreadA won the race. Two threads were created
because "/var/data" wasn't considered as `"/" contains "/var/data"`.
In other words, if there is (at least one) "/" in the input list,
the initial thread dispatching must be single-threaded since every
directory is a child of "/", meaning they all directly or indirectly
depend on "/".
In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8450Closes#8833Closes#8878
This commit ensures make(1) targets that build .deb packages fail if
alien(1) can't convert all .rpm files; additionally it also updates
the zfs-dracut package name which was changed to "noarch" in ca4e5a7.
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8990Closes#8991
This patch fixes an issue where dsl_dataset_crypt_stats() would
VERIFY that it was able to hold the encryption root. This function
should instead silently continue without populating the related
field in the nvlist, as is the convention for this code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8976
Having the mountpoint and dataset name both in the message made it
confusing to read. Additionally, convert this to a zfs_dbgmsg rather than
sending it to the console.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#8959
The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY
is set. Therefore, the EQUIV check must be wrapped accordingly.
For the same reason the ASSERT in arc_buf_fill() in unsafe.
However, since it's largely redundant it has simply been removed.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8979
This small patch fixes the EINVAL case for zfs_receive_one(). A
missing 'else' has been added to the two possible cases, which
will ensure the intended error message is printed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8977
Chroot'd process fails to automount snapshots due to realpath(3)
failure in mount.zfs(8).
Construct a mount point path from sb of the ctldir inode and dirent
name, instead of from d_path(), so that chroot'd process doesn't get
affected by its view of fs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8903Closes#8966
After device removal, performing nopwrites on a dmu_sync-ed block
will result in a panic. This panic can show up in two ways:
1. an attempt to issue an IOCTL in vdev_indirect_io_start()
2. a failed comparison of zio->io_bp and zio->io_bp_orig in
zio_done()
To resolve both of these panics, nopwrites of blocks on indirect
vdevs should be ignored and new allocations should be performed on
concrete vdevs.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes#8957
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.
This change adds check for empty sublists to avoid this.
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#8909
The -Y option was added for ztest to test split block reconstruction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8926
This patch corrects the error message reported when attempting
to promote a dataset outside of its encryption root.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8905Closes#8935
Resolve the incorrect use of srcdir and builddir references for
various files in the build system. These have crept in over time
and went unnoticed because when building in the top level directory
srcdir and builddir are identical.
With this change it's again possible to build in a subdirectory.
$ mkdir obj
$ cd obj
$ ../configure
$ make
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8921Closes#8943
The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock
while assigning the tx, because this can lead to deadlock. Specifically,
if this dnode is already assigned to an earlier txg, this thread may
need to wait for that txg to sync (the ERESTART case below). The other
thread that has assigned this dnode to an earlier txg prevents this txg
from syncing until its tx can complete (calling dmu_tx_commit()), but it
may need to acquire the dn_struct_rwlock to do so (e.g. via
dmu_buf_hold*()).
This commit adds an assertion to dmu_tx_try_assign() to ensure that this
deadlock is not inadvertently introduced.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8929
Remove arch and relax version dependency for zfs-dracut
package.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gordan Bobic <gordan@redsleeve.org>
Issue #8913Closes#8914
Functions such as `fnvlist_lookup_nvlist` need libnvpair to be linked.
Default pkg-config file did not contain it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Harry Mallon <hjmallon@gmail.com>
Closes#8919
The zfs-mount service can unexpectedly fail to start when zfs
encounters a mount that is in progress. This service uses
zfs mount -a, which has a window between the time it checks if
the dataset was mounted and when the actual mount (via mount.zfs
binary) occurs.
The reason for the racing mounts is that both zfs-mount.target
and zfs-share.target are allowed to execute concurrently after
the import. This is more of an issue with the relatively recent
addition of parallel mounting, and we should consider serializing
the mount and share targets.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#8881
Count the bytes of payload for each replication record type
Count the bytes of overhead (replication records themselves)
Include these counters in the output summary at the end of the run.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Sponsored-By: Klara Systems and Catalogic
Closes#8432
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#8945
When exporting ZVOLs as SCSI LUNs, by default Windows will not
issue them UNMAP commands. This reduces storage efficiency in
many cases.
We add the SCSI_PASSTHROUGH flag to the zvol's device queue,
which lets the SCSI target logic know that it can handle SCSI
commands.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#8933
`show_str` could be a pointer to a local variable in stack
which is out-of-scope by the time
`return (snprintf(buf, buflen, "%s\n", show_str));`
is called.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8924Closes#8940
The logic to handle strong checksum collisions where the data doesn't
match is incorrect. It is not clearing the dedup bit of the blkptr,
which can cause a panic later in zio_ddt_free() due to the dedup table
not matching what is in the blkptr.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-48097
Closes#8936
Align vdev_ops_t from illumos for better compatibility.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8925
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.
Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8737Closes#8870
The whereis command should not be used since it may not exist
in the initramfs. The dracut plymouth module also uses the type
command instead of whereis.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Signed-off-by: Dacian Reece-Stremtan <dacianstremtan@gmail.com>
Closes#8920Closes#8938
The rest of the code/comments use ZFS_DEV, so sync with that.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8912
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#8897Closes#8911
When configure is run with --with-spec=redhat, and rpms are built, the
kmod-zfs-devel package is missing
Provides: kmod-spl-devel = %{version}
which is required by software such as Lustre which builds against zfs
kmods. Adding it makes it easier for such software to build against
both zfs-0.7 (where SPL is separate and may be missing) and zfs-0.8.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#8930
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter. This change in behavior was observed with
ksh-2020.0.0-alpha1. Resolve the issue by using the external cat
command which fails gracefully as expected.
Additionally, remove some incorrect quotes around the $? return values.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8906
For busy ARC situation when arc_size close to arc_c is desired. But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result. Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.
Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.
I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#8901
Don't require Python at configure/build unless building pyzfs.
Move ZFS_AC_PYTHON_MODULE to always-pyzfs.m4 where it is used.
Make test syntax more consistent.
Sponsored by: iXsystems, Inc.
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#8895
`lz4_decompress_abd` is declared in zio_compress.h but it is not defined
anywhere. The declaration should be removed.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-47477
Closes#8894
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.
On debug bits this assertion will fail in spa_vdev_copy_segment():
ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);
On nondebug, the system panics with a stack like:
metaslab_free_concrete()
metaslab_free_impl()
metaslab_free_impl_cb()
vdev_indirect_remap()
free_from_removing_vdev()
metaslab_free_impl()
metaslab_free_dva()
metaslab_free()
Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default. We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.
This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes#8893
Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable
compression (including of metadata). Ostensibly this helps the sync
passes to converge (i.e. for a sync pass to not need to allocate
anything because it is 100% overwrites).
However, in practice it increases the average number of sync passes,
because when we turn compression off, a lot of block's size will change
and thus we have to re-allocate (not overwrite) them. It also increases
the number of 128KB allocations (e.g. for indirect blocks and spacemaps)
because these will not be compressed. The 128K allocations are
especially detrimental to performance on highly fragmented systems,
which may have very few free segments of this size, and may need to load
new metaslabs to satisfy 128K allocations.
We should increase zfs_sync_pass_dont_compress. In practice on a highly
fragmented system we see a few 5-pass txg's, a tiny number of 6-pass
txg's, and no txg's with more than 6 passes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-63431
Closes#8892
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.
Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#8890
This change restricts filesystem creation if the given name
contains either '.' or '..'
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes#8842Closes#8564
When running zloop, we occasionally see the following crash:
dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0)
ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3]
The error value 0x1c is ENOSPC.
The transaction used by spa_vdev_remove_thread() should not be able to
fail due to being out of space. i.e. we should not call
dmu_tx_hold_space(). This will allow the removal thread to schedule its
work even when the pool is low on space. The "slop space" will provide
enough free space to sync out the txg.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-37853
Closes#8889
sysfs_attr_init() is required to make lockdep happy for dynamically
allocated sysfs attributes. This fixed#8868 on Fedora 29 running
kernel-debug.
This requirement was introduced in 2.6.34.
See include/linux/sysfs.h for what it actually does.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8868Closes#8884
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes#8862
Sometimes the target ARC size is reduced to arc_c_min, which impacts
performance. We've seen this happen as part of the random_reads
performance regression test, where the ARC size is reduced before the
reads test starts which impacts how long it takes for system to reach
good IOPS performance.
We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE,
and arc_available_memory() is less than arc_c>>arc_shrink_shift.
However, arc_available_memory() could easily be low, even when arc_c is
low, because we can have tons of unused bufs in the abd kmem cache. This
would be especially true just after the DMU requests a bunch of stuff be
evicted from the ARC (e.g. due to "zpool export").
To fix this, the ARC should reduce arc_c by the requested amount, not
all the way down to arc_size (or arc_c_min), which can be very small.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59431
Closes#8864
The udevadm settle timeout can be 120 or 180 seconds by default
for some distributions. If a long delay is experienced, it could
be due to some strangeness in a malfunctioning device that isn't
related to the devices under test. To help debug this condition,
a notice is given if settle takes too long.
Arguments can now be passed to block_device_wait. The expected
arguments are block device pathnames.
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes#8839
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes#8839
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes#8839
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes#8839
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure,
on heavily fragmented pools they can still consume a significant
amount of memory. This problem is amplified when we fail to unload
metaslabs that we aren't using. Currently, we only unload a metaslab
during metaslab_sync_done; in order for that function to be called
on a given metaslab in a given txg, we have to have dirtied that
metaslab in that txg. If the dirtying was the result of an allocation,
we wouldn't be unloading it (since it wouldn't be 8 txgs since it
was selected), so in effect we only unload a metaslab during txgs
where it's being freed from.
We move the unload logic from sync_done to a new function, and
call that function on all metaslabs in a given vdev during
vdev_sync_done().
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#8837
Build process would always re-compile spa_history.c due to touching
zfs_gitrev.h - avoid if no change in gitrev.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#8860
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes#8822
Historically while doing performance testing we've noticed that IOPS
can be significantly reduced when all vdevs in the pool are hitting
the zfs_mg_fragmentation_threshold percentage. Specifically in a
hypothetical pool with two vdevs, what can happen is the following:
Vdev A would go above that threshold and only vdev B would be used.
Then vdev B would pass that threshold but vdev A would go below it
(we've been freeing from A to allocate to B). The allocations would
go back and forth utilizing one vdev at a time with IOPS taking a hit.
Empirically, we've seen that our vdev selection for allocations is
good enough that fragmentation increases uniformly across all vdevs
the majority of the time. Thus we set the threshold percentage high
enough to avoid hitting the speed bump on pools that are being pushed
to the edge. We effectively disable its effect in the majority of the
cases but we don't remove (at least for now) just in case we hit any
weird behavior in the future.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8859
Since zfs_znode_alloc() already takes dmu_buf_t*, taking another
uint64_t argument for objid is redundant. inode's ->i_ino does and
needs to match znode's ->z_id.
zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument
since vnode doesn't have vnode# in VFS (hence ->z_id exists).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes#8841
Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and
vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading
zfs.ko.
The rest of initialization functions being called here after cwd set
to / don't depend on cwd of the process except for spa_config_load().
spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when
`rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /,
so just unconditionally use the absolute path without "./", so that
`vn_set_pwd("/")` as well as the entire functions can be removed.
This is also what FreeBSD does.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes#8826
For recursive renaming, simplify the code by moving `zhrp` and
`parentname` to inner scope. `zhrp` is only used to test existence
of a parent dataset for recursive dataset dir scan since ba6a24026c.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes#8815
s/get_vdev_spec/make_root_vdev
The former doesn't exist anymore.
Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes#8759
These descriptions are not uptodate with the code.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8767
In `config/kernel-timer.m4` refactor slightly to check more generally
for the new `timer_setup()` APIs, but also check the callback signature
because some kernels (notably 4.14) have the new `timer_setup()` API but
use the old callback signature. Also add a check for a `flags` member in
`struct timer_list`, which was added in 4.1-rc8.
Add compatibility shims to `include/spl/sys/timer.h` to allow using the
new timer APIs with the only two caveats being that the callback
argument type must be declared as `spl_timer_list_t` and an explicit
assignment is required to get the timer variable for the `timer_of()`
macro. So the callback would look like this:
```c
__cv_wakeup(spl_timer_list_t t)
{
struct timer_list *tmr = (struct timer_list *)t;
struct thing *parent = from_timer(parent, tmr,
parent_timer_field);
... /* do stuff with parent */
```
Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the
new timer APIs instead of conditional code.
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes#8647
2019-09-25 11:27:46 -07:00
753 changed files with 16538 additions and 6282 deletions