ZFS prefetch is currently governed by the zfs_prefetch_disable
tunable. However, this is a module-wide settings - if a specific
dataset benefits from prefetch, while others have issue with it,
an optimal solution does not exists.
This commit introduce the "prefetch" tri-state property, which enable
granular control (at dataset/volume level) for prefetching.
This patch does not remove the zfs_prefetch_disable, which remains
a system-wide switch for enable/disable prefetch. However, to avoid
duplication, it would be preferable to deprecate and then remove
the module tunable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Co-authored-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15237Closes#15436
This reverts commit 797f55ef12 which
was causing a VERIFY failure when running the project quota tests.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15438
spa_thread() and the "System Duty Cycle" scheduling class are from
Illumos and have not yet been adapted to Linux or FreeBSD.
HAVE_SPA_THREAD has long been explicitly undefined and used to mark
spa_thread(), but there's some related taskq code that can never be
invoked without it, which makes some already-tricky code harder to read.
HAVE_SYSDC is introduced in this commit to mark the SDC parts. SDC
requires spa_thread(), but the inverse is not true, so they are
separate.
I don't want to make the call to just remove it because I still harbour
hopes that OpenZFS could become a first-class citizen on Illumos
someday. But hopefully this will at least make the reason it exists a
bit clearer for people without long memories and/or an interest in
history.
For those that are interested in the history, the original FreeBSD port
of ZFS (before ZFS-on-Linux was adopted there) did have a spa_thread(),
but not SDC. The last version of that before it was removed can be read
here:
22df1ffd81/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
Meanwhile, more information on the SDC scheduling class is here:
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/disp/sysdc.c
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15406
In the zfs_id_over*quota functions, there is a short-circuit to skip
the zap_lookup when the quota zap does not exist. If quotas are never
used in a zpool, then the quota zap will never exist. But if
user/group/project quotas are ever used, the zap objects will be
created and will persist even if the quotas are deleted.
The quota zap_lookup in the write path can become a bottleneck for
write-heavy small I/O workloads. Before this commit, it was not
possible to remove this lookup without creating a new zpool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sam Atkinson <samatk@amazon.com>
Closes#14721
In my understanding ARC_BUF_SHARED() and arc_buf_is_shared() should
return identical results, except the second also asserts it deeper.
The first is much cheaper though, saving few pointer dereferences.
Replace production arc_buf_is_shared() calls with ARC_BUF_SHARED(),
and call arc_buf_is_shared() in random assertions, while making it
even more strict.
On my tests this in half reduces arc_buf_destroy_impl() time, that
noticeably reduces hash_lock congestion under heavy dbuf eviction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15397
Torn reads/writes of dp_dirty_total are unlikely: on 64-bit systems
due to register size, while on 32-bit due to memory constraints.
And even if we hit some race, the code implementing the delay takes
the lock any way.
Removal of the poll-wide lock acquisition saves ~1% of CPU time on
8-thread 8KB write workload.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15390
Variable 'uma_align_cache' has not been used since commit "FreeBSD: Use
a hash table for taskqid lookups" (3933305ea). Moreover, it is soon
going to become private to FreeBSD's UMA in 15.0-CURRENT (main),
14.0-STABLE (stable/14) and 13.2-STABLE (stable/13). Should accessing
this information become necessary again, one will have to use the new
accessors for recent versions.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olivier Certner <olce.freebsd@certner.fr>
Closes#15416
When a vdev is to be expanded -- either via `zpool online -e` or via
the autoexpand option -- a SPA_ASYNC_CONFIG_UPDATE request is queued
to be handled via an asynchronous worker thread (spa_async_thread).
This normally happens almost immediately; but will be delayed up to
zfs_ccw_retry_interval seconds (default 5 minutes) if an attempt to
write the zpool configuration cache failed.
When FreeBSD boots ZFS-root VM images generated using `makefs -t zfs`,
the zpoolupgrade rc.d script runs `zpool upgrade`, which modifies the
pool configuration and triggers an attempt to write to the cache file.
This attempted write fails because the filesystem is still mounted
read-only at this point in the boot process, triggering a 5-minute
cooldown before SPA_ASYNC_CONFIG_UPDATE requests will be handled by
the asynchronous worker thread.
When expanding a vdev, reset the "when did a configuration cache
write last fail" value so that the SPA_ASYNC_CONFIG_UPDATE request
will be handled promptly. A cleaner but more intrusive option would
be to use separate SPA_ASYNC_ flags for "configuration changed" and
"try writing the configuration cache again", but with FreeBSD 14.0
coming very soon I'd prefer to leave such refactoring for a later
date.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colin Percival <cperciva@FreeBSD.org>
Closes#15405
For synchronous write workloads with large IO sizes, a pool configured
with a slog performs worse than one with an embedded zil:
sequential_writes 1m sync ios, 16 threads
Write IOPS: 1292 438 -66.10%
Write Bandwidth: 1323570 448910 -66.08%
Write Latency: 12128400 36330970 3.0x
sequential_writes 1m sync ios, 32 threads
Write IOPS: 1293 430 -66.74%
Write Bandwidth: 1324184 441188 -66.68%
Write Latency: 24486278 74028536 3.0x
The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`,
if a zil block is greater than 768K, the priority of the write is
downgraded from sync to async. Increasing the value allows greater
throughput. To select a value for this PR, I ran an fio workload with
the following values for `zil_slog_bulk`:
zil_slog_bulk KiB/s
1048576 422132
2097152 478935
4194304 533645
8388608 623031
12582912 827158
16777216 1038359
25165824 1142210
33554432 1211472
50331648 1292847
67108864 1308506
100663296 1306821
134217728 1304998
At 64M, the results with a slog are now improved to parity with an
embedded zil:
sequential_writes 1m sync ios, 16 threads
Write IOPS: 438 1288 2.9x
Write Bandwidth: 448910 1319062 2.9x
Write Latency: 36330970 12163408 -66.52%
sequential_writes 1m sync ios, 32 threads
Write IOPS: 430 1290 3.0x
Write Bandwidth: 441188 1321693 3.0x
Write Latency: 74028536 24519698 -66.88%
None of the other tests in the performance suite (run with a zil or
slog) had a significant change, including the random_write_zil tests,
which use multiple datasets.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes#14378
- Group tqent_task and tqent_timeout_task into a union. They are
never used same time. This shrinks taskq_ent_t from 192 to 160 bytes.
- Remove tqent_registered. Use tqent_id != 0 instead.
- Remove tqent_cancelled. Use taskqueue pending counter instead.
- Change tqent_type into uint_t. We don't need to pack it any more.
- Change tqent_rc into uint_t, matching refcount(9).
- Take shared locks in taskq_lookup().
- Call proper taskqueue_drain_timeout() for TIMEOUT_TASK in
taskq_cancel_id() and taskq_wait_id().
- Switch from CK_LIST to regular LIST.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15356
When doing a manual TRIM on a zpool, the metaslab being TRIMmed is
potentially re-enabled before all queued TRIM zios for that metaslab
have completed. Since TRIM zios have the lowest priority, it is
possible to get into a situation where allocations occur from the
just re-enabled metaslab and cut ahead of queued TRIMs to the same
metaslab. If the ranges overlap, this will cause corruption.
We were able to trigger this pretty consistently with a small single
top-level vdev zpool (i.e. small number of metaslabs) with heavy
parallel write activity while performing a manual TRIM against a
somewhat 'slow' device (so TRIMs took a bit of time to complete).
With the patch, we've not been able to recreate it since. It was on
illumos, but inspection of the OpenZFS trim code looks like the
relevant pieces are largely unchanged and so it appears it would be
vulnerable to the same issue.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jking@racktopsystems.com>
Illumos-issue: https://www.illumos.org/issues/15939Closes#15395
dmu_tx_check_ioerr() pre-reads blocks that are going to be dirtied
as part of transaction to both prefetch them and check for errors.
But it makes no sense to do it for holes, since there are no disk
reads to prefetch and there can be no errors. On the other side
those blocks are anonymous, and they are freed immediately by the
dbuf_rele() without even being put into dbuf cache, so we just
burn CPU time on decompression and overheads and get absolutely
no result at the end.
Use of dbuf_hold_impl() with fail_sparse parameter allows to skip
the extra work, and on my tests with sequential 8KB writes to empty
ZVOL with 32KB blocks shows throughput increase from 1.7 to 2GB/s.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15371
Right now, zpl_ioctl_ficlone and zpl_ioctl_ficlonerange do not call
put on the src fd if the source and destination are on two different
devices. This leaves the source file held open in this case.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Daniel Berlin <dberlin@dberlin.org>
Closes#15386
There was a report of zvol data loss (#15351) after enabling blk-mq on a
zvol backed with 16k physical block sized disks. Out of an abundance of
caution, do not allow the user to enable blk-mq until we can look into
the issue.
Note that blk-mq was not enabled by default on zvols. It was always
opt-in via the zvol_use_blk_mq module parameter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Addresses: #15351Closes#15378
This includes random small tweaks, primarily a build fixes, required
when ZFS is built as part of FreeBSD base.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15368
Benchmarks show that at certain write sizes range lock/unlock take
not so much time as extra memory copy. The exact threshold is not
obvious due to other overheads, but it is definitely lower than
~63KB used before. Make it configurable, defaulting at 7.5KB,
that is 8KB of nearest malloc() size minus itx and lr structs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15353
Previously, if a cachefile is passed to zpool import, the cached config
is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and
the results are taken as the canonical pool config and handed back to
ZFS_IOC_POOL_IMPORT.
In the course of its operation, spa_load() will inspect the pool and
build a new config from what it finds on disk. However, it then
regenerates a new config ready to import, and so rightly sets the hostid
and hostname for the local host in the config it returns.
Because of this, the "require force" checks always decide the pool is
exported and last touched by the local host, even if this is not true,
which is possible in a HA environment when MMP is not enabled. The pool
may be imported on another head, but the import checks still pass here,
so the pool ends up imported on both.
(This doesn't happen when a cachefile isn't used, because the pool
config is discovered in userspace in zpool_find_import(), and that does
find the on-disk hostid and hostname correctly).
Since the systemd zfs-import-cache.service unit uses cachefile imports,
this can lead to a system returning after a crash with a "valid"
cachefile on disk and automatically, quietly, importing a pool that has
already been taken up by a secondary head.
This commit causes the on-disk hostid and hostname to be included in the
ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the
"force" checks for zpool import to use them if present.
This method should give no change in behaviour for old userspace on new
kernels (they won't know to look for the new config items) and for new
userspace on old kernels (the won't find the new config items).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#15290
Before this change ZFS created threads for 50% of CPUs for each top-
level vdev. Plus it created the same number of threads for embedded
log groups (that have only one metaslab and don't need any preload).
As result, on system with 80 CPUs and pool of 60 vdevs this resulted
in 4800 metaslab preload threads, that is absolutely insane.
This patch changes the preload threads to 50% of CPUs in one taskq
per pool, so on the mentioned system it will be only 40 threads.
Among other things this fixes zdb on the mentioned system and pool
on FreeBSD, that failed to create so many threads in one process.
Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15319
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15293Closes#15347
In most cases we do not care about exact number of buffers linked
to the header, we just need to know if it is zero, non-zero or one.
That can easily be checked just looking on b_buf pointer or in some
cases derefencing it.
b_ebufcnt is read only once, and in that case we already traverse
the list as part of arc_buf_remove(), so second traverse should not
be expensive.
This reduces L1 ARC header size by 8 bytes and full crypto header by
16 bytes, down to 176 and 232 bytes on FreeBSD respectively.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15350
Earlier as part of #14123 I've removed one use of b_cv. This patch
reuses the same approach to remove the other one from much more
rare code path.
This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15340
The Arm Branch Target Identification (BTI) extension guards against
branching to an unintended instruction.
To support BTI add the landing pad instructions to the SHA2 functions.
These are from the hint space so are a nop on hardware that lacks BTI
support or if BTI isn't enabled.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Andrew Turner <andrew.turner4@arm.com>
Closes#14862Closes#15339
Commit 8af1104f does not actually store the ashift of cache devices in
their label. However, in order to facilitate reporting the ashift
through zdb, we enable this in the present commit. We also document
how the retrieval of the ashift is done.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#15331
If we are copying only one block and it is smaller than recordsize
property, do not allow destination to grow beyond one block if it
is not there yet. Otherwise the destination will get stuck with
that block size forever, that can be as small as 512 bytes, no
matter how big the destination grow later.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15321
Vendor testing shows we should be able to get a little more
performance if we further relax the hard limit which we're hitting.
Authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15324
When unlinking multiple files from a pool at 100% capacity, it
was possible for ENOSPC to be returned after the first few unlinks.
This issue was fixed previously by PR #13172 but then this was
again introduced by PR #13839.
This is resolved using the existing mechanism of returning ERESTART
when over quota as long as we know enough space will shortly be
available after processing the pending deferred frees.
Also, updated the existing testcase which reliably reproduced the
issue without this patch.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes#15312
In Linux commit 560e20e4bf6484a0c12f9f3c7a1aa55056948e1e, the
fsync_bdev() function was removed in favor of sync_blockdev() to do
(roughly) the same thing, given the same input. This change
conditionally attempts to call sync_blockdev() if fsync_bdev() isn't
discovered during configure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15263
In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument
for the request_mask was added to generic_fillattr. This is the same
request_mask for statx that's present in the most recent API implemented
by zpl_getattr_impl. This commit conditionally adds it to the
zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...)
implementation, when configure determines it's present in the kernel's
generic_fillattr(...).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15263
In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access
to the i_ctime member of struct inode was removed. The new approach is
to use accessor methods that exclusively handle passing the timestamp
around by value. This change adds new tests for each of these functions
and introduces zpl_* equivalents in include/os/linux/zfs/sys/zpl.h. In
where the inode_get/set_ctime*() functions exist, these zpl_* calls will
be mapped to the new functions. On older kernels, these macros just wrap
direct-access calls. The code that operated on an address of ip->i_ctime
to call ZFS_TIME_DECODE() now will take a local copy using
zpl_inode_get_ctime(), and then pass the address of the local copy when
performing the ZFS_TIME_DECODE() call, in all cases, rather than
directly accessing the member.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15263Closes#15257
When failmode=continue is set and the pool suspends, both 'zpool status'
and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree
state. There's no clear indicator that the pool is suspended. This is
unlike suspend in failmode=wait, or suspend due to MMP check failure,
which both report "SUSPENDED" explicitly.
This commit changes it so SUSPENDED is reported for failmode=continue
the same as for other modes.
Rationale:
The historical behaviour of failmode=continue is roughly, "press on as
though all is well". To this end, the fact that the pool had suspended
was not shown, to maintain the façade that all is well.
Its unclear why hiding this information was considered appropriate. One
possibility is that it was expected that a true pool fault would always
be reported as DEGRADED or FAULTED, and that the pool could not suspend
without these happening.
That is not necessarily true, as vdev health and suspend state are only
loosely connected, such that a pool in (apparent) good health can be
suspended for good reasons, and of course a degraded pool does not lead
to suspension. Even if that expectation were true, there's still a
difference in urgency - a degraded pool may not need to be attended to
for hours, while a suspended pool is most often unusable until an
operator intervenes.
An operator that has set failmode=continue has presumably done so
because their workload is one that can continue to operate in a useful
way when the pool suspends. In this case the operator still needs a
clear indicator that there is a problem that needs attending to.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15297
zil_lwb_set_zio_dependency() can not set write ZIO dependency on
previous LWB's write ZIO if one is already in done handler and set
state to LWB_STATE_WRITE_DONE. So theoretically done handler of
next LWB's write ZIO may run before done handler of previous LWB
write ZIO completes. In such case we can not defer flushes, since
the flush issue process is not locked.
This may fix some reported assertions of lwb_vdev_tree not being
empty inside zil_free_lwb().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15278
In #13375 we modified the allocation size of the buffer that we use
to apply l2arc transforms to be the size of the arc hdr we're using,
rather than the allocation size that will be in place on the disk,
because sometimes the hdr size is larger. Unfortunately, sometimes
the allocation size is larger, which means that we overflow the buffer
in that case. This change modifies the allocation to be the max of
the two values
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15177Closes#15248
spa_upgrade_errlog() does not update the MOS directory when the
head_errlog feature is enabled. In this case if spa_errlog_sync() is not
called, the MOS dir references the old errlog_last and errlog_sync
objects. Thus when doing a scrub a panic will occur:
Call Trace:
dump_stack+0x6d/0x8b
panic+0x101/0x2e3
spl_panic+0xcf/0x102 [spl]
delete_errlog+0x124/0x130 [zfs]
spa_errlog_sync+0x256/0x260 [zfs]
spa_sync_iterate_to_convergence+0xe5/0x250 [zfs]
spa_sync+0x2f7/0x670 [zfs]
txg_sync_thread+0x22d/0x2d0 [zfs]
thread_generic_wrapper+0x83/0xa0 [spl]
kthread+0x104/0x140
ret_from_fork+0x1f/0x40
Fix this by updating the related MOS directory objects in
spa_upgrade_errlog().
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#15279Closes#15277
Added in ab26409db7 ("Linux 3.1 compat, super_block->s_shrink"), with
the only consumer which needed the count getting retired in 066e825221
("Linux compat: Minimum kernel version 3.10").
The counter gets in the way of not maintaining the list to begin with.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#15274
- We cannot clone into files with smaller block size if there is
more than one block, since we can not grow the block size.
- Block size must be power-of-2 if destination offset != 0, since
there can be no multiple blocks of non-power-of-2 size.
The first should handle the case when destination file has several
blocks but still is not bigger than one block of the source file.
The second fixes panic in dmu_buf_hold_array_by_dnode() on attempt
to concatenate files with equal but non-power-of-2 block sizes.
While there, assert that error is reported if we made no progress.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15251
When register_sysctl_table() is unavailable we fail to properly
unregister sysctl entries under "kernel/spl".
This leads to errors like the following when spl is unloaded/reloaded,
making impossible to properly reload the spl module:
[ 746.995704] sysctl duplicate entry: /kernel/spl/kmem/slab_kvmem_total
Fix by cleaning up all the sub-entries inside "kernel/spl" when the
spl module is unloaded.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Closes#15239
ZFS historically has had several space allocators that were
dynamically selectable. While these have been retained in
OpenZFS, only a single allocator has been statically compiled
in. This patch compiles all allocators for OpenZFS and provides
a module parameter to allow for manual selection between them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15218
If we fail to create a proc entry in spl_proc_init() we may end up
calling unregister_sysctl_table() twice: one in the failure path of
spl_proc_init() and another time during spl_proc_fini().
Avoid the double call to unregister_sysctl_table() and while at it
refactor the code a bit to reduce code duplication.
This was accidentally introduced when the spl code was
updated for Linux 6.5 compatibility.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Closes#15234Closes#15235
In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio,
we have no right to access lwb->lwb_child_zio. If it was not there,
the first two ZIOs may have already completed and freed the lwb.
ZIOs issue in opposite order from children to parent should keep
the lwb valid till the end, since the lwb can be freed only after
lwb_root_zio completion callback.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15233
While I have no reports of it, I suspect possible use-after-free
scenario when zil_commit_waiter() tries to dereference zcw_lwb
for lwb already freed by zil_sync(), while zcw_done is not set.
Extension of zl_lock scope as it was originally should block
zil_sync() from freeing the lwb, closing this race.
This reverts #14959 and couple chunks of #14841.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15228
In zil_free_lwb() we should first assert lwb_state or the rest of
assertions can be misleading if it is false.
Add lwb_state assertions in zil_lwb_add_block() to make sure we are
not trying to add elements to lwb_vdev_tree after it was processed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15227
Building module/zfs/dbuf.c for 32-bit targets can result in a warning:
In file included from
/usr/src/sys/contrib/openzfs/include/sys/zfs_context.h:97,
from /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:32:
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c: In function
'dmu_buf_will_clone':
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:116:33: error:
cast from pointer to integer of different size
[-Werror=pointer-to-int-cast]
116 | const uint64_t __left = (uint64_t)(LEFT);
\
| ^
/usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:148:25: note:
in expansion of macro 'VERIFY0'
148 | #define ASSERT0 VERIFY0
| ^~~~~~~
/usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2704:9: note: in
expansion of macro 'ASSERT0'
2704 | ASSERT0(dbuf_find_dirty_eq(db, tx->tx_txg));
| ^~~~~~~
This is because dbuf_find_dirty_eq() returns a pointer, which if
pointers are 32-bit results in a warning about the cast to uint64_t.
Instead, use the ASSERT3P() macro, with == and NULL as second and third
arguments, which should work regardless of the target's bitness.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes#15224
Currently redaction bookmarks and their associated redaction lists
have a relatively low limit of 36 redaction snapshots. This is imposed
by the number of snapshot GUIDs that fit in the bonus buffer of the
redaction list object. While this is more than enough for most use
cases, there are some limited cases where larger numbers would be
useful to support.
We tweak the redaction list creation code to use a spill block if
the number of redaction snapshots is above the amount that would fit
in the bonus buffer. We also make a small change to allow spill blocks
to be use for types of data besides SA. In order to fully leverage
this logic, we also change the redaction code to use vmem_alloc, to
handle extremely large allocations if needed. Finally, small tweaks
were made to the zfs commands and the test suite.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15018
As part of some internal gang block testing within Delphix
we hit the assertion removed by this patch. The assertion
was triggered by a ZIO that had two copies and was a gang
block making the following expression equal to 3:
```
MIN(zp->zp_copies + BP_IS_GANG(bp), spa_max_replication(spa))
```
and failing when we expected the above to be equal to
`BP_GET_NDVAS(bp)`.
The assertion is no longer valid since the following commit:
```
commit 14872aaa4f
Author: Matthew Ahrens <matthew.ahrens@delphix.com>
Date: Mon Feb 6 09:37:06 2023 -0800
EIO caused by encryption + recursive gang
```
The above commit changed gang block headers so they can't
have more than 2 copies but the assertion in question from
this PR was never updated.
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#15180
The previous patch #14841 appeared to have significant flaw, causing
deadlocks if zl_get_data callback got blocked waiting for TXG sync. I
already handled some of such cases in the original patch, but issue
#14982 shown cases that were impossible to solve in that design.
This patch fixes the problem by postponing log blocks allocation till
the very end, just before the zios issue, leaving nothing blocking after
that point to cause deadlocks. Before that point though any sleeps are
now allowed, not causing sync thread blockage. This require slightly
more complicated lwb state machine to allocate blocks and issue zios
in proper order. But with removal of special early issue workarounds
the new code is much cleaner now, and should even be more efficient.
Since this patch uses null zios between write, I've found that null
zios do not wait for logical children ready status in zio_ready(),
that makes parent write to proceed prematurely, producing incorrect
log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children()
fixes it.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15122
In 019dea0a5 we removed the conversion from EAGAIN->EXDEV inside
zfs_clone_range(), but forgot to add a test for EAGAIN to the
copy_file_range() entry points to trigger fallback to a content copy.
This commit fixes that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15170Closes#15172