Return the more descriptive error codes instead of `EXDEV` when
the parameters don't match the requirements of the clone function.
Updated the comments in `brt.c` accordingly.
The first three errors are just invalid parameters, which zfs can
not handle.
The fourth error indicates that the block which should be cloned
is created and cloned or modified in the same transaction
group (`txg`).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes#15148
The generic_file_splice_read function was removed in Linux 6.5 in favor
of filemap_splice_read. Add an autoconf test for filemap_splice_read and
use it if it is found as the handler for .splice_read in the
file_operations struct. Additionally, ITER_PIPE was removed in 6.5. This
change removes the ITER_* macros that OpenZFS doesn't use from being
tested in config/kernel-vfs-iov_iter.m4. The removal of ITER_PIPE was
causing the test to fail, which also affected the code responsible for
setting the .splice_read handler, above. That behavior caused run-time
panics on Linux 6.5.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15155
- Split dmu_prefetch_dnode() from dmu_prefetch() into a separate
function. It is quite inconvenient to read the code where len = 0
means dnode prefetch instead indirect/data prefetch. One function
doing both has no benefits, since the code paths are independent.
- Improve dmu_prefetch() handling of long block ranges. Instead
of limiting L0 data length to prefetch for to dmu_prefetch_max,
make dmu_prefetch_max limit the actual amount of prefetch at the
specified level, and, if there is more, prefetch all the rest at
higher indirection level. It should improve random access times
within the prefetched range of any length, reducing importance of
specific dmu_prefetch_max value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15076
Additionally, the .child element of ctl_table has been removed in 6.5.
This change adds a new test for the pre-6.5 register_sysctl_table()
function, and uses the old code in that case. If it isn't found, then
the parentage entries in the tables are removed, and the register_sysctl
call is provided the paths of "kernel/spl", "kernel/spl/kmem", and
"kernel/spl/kstat" directly, to populate each subdirectory over three
calls, as is the new API.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15138
This reverts commit b35374fd64 as there
are error messages when loading the SPL module. Errors seemed to be tied
to duplicate a duplicate entry.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#15134
Before Linux 5.3, the filesystem's copy_file_range handler had to signal
back to the kernel that we can't fulfill the request and it should
fallback to a content copy. This is done by returning -EOPNOTSUPP.
This commit converts the EXDEV return from zfs_clone_range to
EOPNOTSUPP, to force the kernel to fallback for all the valid reasons it
might be unable to clone. Without it the copy_file_range() syscall will
return EXDEV to userspace, breaking its semantics.
Add test for copy_file_range fallbacks. copy_file_range should always
fallback to a content copy whenever ZFS can't service the request with
cloning.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15131
This gives `zdb -b` support for clone blocks.
Previously, it didn't know what clones were, so would count their space
allocation multiple times and then report leaked space (or, in debug,
would assert trying to claim blocks a second time).
This commit fixes those bugs, and reports the number of clones and the
space "used" (saved) by them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15123
If looking up a snapdir inode failed, hold pool config – hold the
snapshot – get its creation property – release it – release it,
then use that as the [amc]time in the allocated inode. If that
fails then fall back to current time. No performance impact since
this is only done when allocating a new snapdir inode.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#15110Closes#15117
An iov_iter_type() function to access the "type" member of the struct
iov_iter was added at one point. Move the conditional logic to decide
which method to use for accessing it into a macro and simplify the
zpl_uio_init code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15100
The iov_iter->iov member is now iov_iter->__iov and must be accessed via
the accessor function iter_iov(). Create a wrapper that is conditionally
compiled to use the access method appropriate for the target kernel
version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15100
Multiple changes to the blkdev API were introduced in Linux 6.5. This
includes passing (void* holder) to blkdev_put, adding a new
blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type
that replaces uses of fmode_t, and removing an argument from the release
handler on block_device_operations that we weren't using. The open
function definition has also changed to take gendisk* and blk_mode_t, so
update it accordingly, too.
Implement local wrappers for blkdev_get_by_path() and
vdev_blkdev_put() so that the in-line calls are cleaner, and place the
conditionally-compiled implementation details inside of both of these
local wrappers. Both calls are exclusively used within vdev_disk.c, at
this time.
Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE
The wrapper function is now used for testing using the appropriate
method for the kernel, whether the open mode is writable or not.
Emphasize fmode_t arg in zvol_release is not used
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15099
Additionally, the .child element of ctl_table has been removed in 6.5.
This change adds a new test for the pre-6.5 register_sysctl_table()
function, and uses the old code in that case. If it isn't found, then
the parentage entries in the tables are removed, and the register_sysctl
call is provided the paths of "kernel/spl", "kernel/spl/kmem", and
"kernel/spl/kstat" directly, to populate each subdirectory over three
calls, as is the new API.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15098
Fastwrite was introduced many years ago to improve ZIL writes spread
between multiple top-level vdevs by tracking number of allocated but
not written blocks and choosing vdev with smaller count. It suposed
to reduce ZIL knowledge about allocation, but actually made ZIL to
even more actively report allocation code about the allocations,
complicating both ZIL and metaslabs code.
On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync()
was lost many years ago, that was one of the declared benefits. Plus
introduction of embedded log metaslab class solved another problem
with allocation rotor accounting both normal and log allocations,
since in most cases those are now in different metaslab classes.
After all that, I'd prefer to simplify already too complicated ZIL,
ZIO and metaslab code if the benefit of complexity is not obvious.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15107
Return the more descriptive EOPNOTSUPP instead of EXDEV when the
storage pool doesn't support block cloning.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes#15097
The transaction there does not produce any dirty data or log blocks,
so it should not be throttled. All other cases wait for TXG sync, by
which time the log block we are writing will be obsolete, so we can
skip waiting and just return error here instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15096
This locking was recently added as part of #14979. But appears it
is illegal to take zl_issuer_lock while holding dp_config_rwlock,
taken by dsl_pool_hold(). It causes deadlock with sync thread in
spa_sync_upgrades(). On a second thought, we should not
need this locking, since zil_commit_impl() we call below takes
zl_issuer_lock, that should sufficiently protect zl_suspend reads,
combined with other logic from #14979.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15103
Redhat have backported copy_file_range and clone_file_range to the EL7
kernel using an "extended file operations" wrapper structure. This
connects all that up to let cloning work there too.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and
were implemented as regular filesystem-specific ioctls. This implements
those ioctls directly in OpenZFS, allowing cloning to work on older
kernels.
There's no need to gate these behind version checks; on later kernels
Linux will simply never deliver these ioctls, instead calling the
approprate VFS op.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
This implements the Linux VFS ops required to service the file
copy/clone APIs:
.copy_file_range (4.5+)
.clone_file_range (4.5-4.19)
.dedupe_file_range (4.5-4.19)
.remap_file_range (4.20+)
Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are
hooked up here, but are not implemented yet.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
Block cloning introduced a new state transition from DB_NOFILL to
DB_READ. This occurs when a block is cloned and then read on the
current txg.
In this case, the clone will move the dbuf to DB_NOFILL, and then the
read will be issued for the overidden block pointer. If that read is
still outstanding when it comes time to write, the dbuf will be in
DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus
tripping the assertions.
This updates those checks to allow DB_READ as a valid state iff the
dirty record is for a BRT write and there is a override block pointer.
This is a safe situation because the block already exists, so there's
nothing that could change from underneath the read.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Original-patch-by: Kay Pedersen <mail@mkwg.de>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
dbuf_undirty() will (correctly) only removed dirty records for the given
(open) txg. If there is a dirty record for an earlier closed txg that
has not been synced out yet, then db_dirty_records will still have
entries on it, tripping the assertion.
Instead, change the assertion to only consider the current txg. To some
extent this is redundant, as its really just saying "did dbuf_undirty()
work?", but it it doesn't hurt and accurately expresses our
expectations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Original-patch-by: Kay Pedersen <mail@mkwg.de>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
bv_entcount can be a relatively large allocation (see comment for
BRT_RANGESIZE), so get it from the big allocator.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
Just silencing the warning about large allocations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
When we have some LWBs closed and their ZIOs ready to be issued, we
can not afford sleeping on config lock if somebody else try to lock
it as writer, or it will cause a deadlock.
To solve it, move spa_config_enter() from zil_lwb_write_issue() to
zil_lwb_write_close() under zl_issuer_lock to enforce lock ordering
with other threads. Now if we can't immediately lock config, issue
all previously closed LWBs so that they could drop their config
locks after completion, and only then allow sleeping on our lock.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15078Closes#15080
metaslab_force_ganging isn't enough to actually force ganging, because
it still only forces 3% of the time. This adds
metaslab_force_ganging_pct so we can configure how often to force
ganging.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#15088
- Reduce maximum prefetch distance for 32bit platforms to 8MB as it
was previously. Those systems didn't grow much probably, so better
stay conservative there.
- Retire array_rd_sz tunable, blocking prefetch for large requests.
We should not penalize applications trying to be more efficient. The
speculative prefetcher by itself has reasonable distance limits, and
1MB is not much at all these days.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15072
To simplify error handling bpobj_iterate_blkptrs() iterates through
the list of block pointers backwards. Unfortunately speculative
prefetcher is currently unable to detect such patterns, that makes
each block read there synchronous and very slow on HDD pools.
According to my tests, added explicit prefetch reduces time needed
to asynchronously delete 8 snapshots of 4 million blocks each from
20 seconds to less than one, that should free sync thread for other
useful work, such as async writes, scrub, etc.
While there, plug one memory leak in case of bpobj_open() error and
harmonize some variable names.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15071
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely. So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match". The harm in sending them is simply that they bloat the
event. In particular, on FreeBSD the event must fit into a 1016 byte
buffer.
Fixes#14717 for mirrored pools.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#14717Closes#15052
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete. With modern storage hardware, they will
almost always look like white noise; all bits will be wrong. They only
serve to bloat the event. That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.
This fixes issue #14717 for RAIDZ pools, but not for mirror pools.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15052
Since spa_min_alloc may not be a power of 2, unlike ashifts, in the
case of DRAID, we should not select the minimal value among several
vdevs. Rounding to a multiple of it is unlikely to work for other
vdevs. Instead, using the greatest common divisor produces smaller
yet more reasonable results.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15067
Check that vdev has valid zap and bail out early.
While here, move objid selection out of the loop, it's not going to
change.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15063
Ashift can be set for a vdev only during its creation, and the
top-level vdev does not change when a vdev is attached or replaced.
The ashift property should not be used during attachment, as it
does not allow attaching/replacing a vdev if the pool's ashift
property is increased after the existing vdev was created. Instead,
we should be able to attach the vdev if the attached vdev can
satisfy the ashift requirement with its parent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15061
Set ARC_FLAG_NO_BUF when prefetching data L1 buffers for scan. We
do not prefetch data L0 buffers, so we do not need the L1 buffers,
only want them to be ready in ARC. This saves some CPU time on the
buffers decompression.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15029
As it turns out having autotrim default to 'on' on FreeBSD never really
worked due to mess with defines where userland and kernel module were
getting different default values (userland was defaulting to 'off',
module was thinking it's 'on').
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15079
Unlike regular receive, raw receive require destination to have the
same block structure as the source. In case of dnode reclaim this
triggers two special cases, requiring special handling:
- If dn_nlevels == 1, we can change the ibs, but dnode_set_blksz()
should not dirty the data buffer if block size does not change, or
durign receive dbuf_dirty_lightweight() will trigger assertion.
- If dn_nlevels > 1, we just can't change the ibs, dnode_set_blksz()
would fail and receive_object would trigger assertion, so we should
destroy and recreate the dnode from scratch.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15039
My analysis in PR #14716 was incorrect. Each histogram bucket contains
the number of incorrect bits, by position in a 64-bit word, over the
entire record. 8-bit buckets can overflow for record sizes above 2k.
To forestall that, saturate each bucket at 255. That should still get
the point across: either all bits are equally wrong, or just a couple
are.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15049
Since we are already iterating the ZAP, we have exact string key to
remove, we do not need to call zap_remove_int() with the int key we
just converted, we can call zap_remove() for the original string.
This should make no functional change, only a micro-optimization.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15056
It seems 9c5167d19f "Project Quota on ZFS" missed to add prefetch
for DMU_PROJECTUSED_OBJECT during scan (scrub/resilver). It should
not cause visible problems, but may affect scub/resilver performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15024
Starting approximately from version 1302506 vn_lock_pair() grown two
additional arguments following head. There is a one week hole, but
that is closet reference point we have.
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15047
The default timeout for ZVOL opens may not be sufficient for all cases,
so we should enable the value to be more easily tuned to account for
systems where the default value is insufficient.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#15023
The DDT is really inefficient on 4k and up vdevs, because it always
allocates 4k blocks, and while compression could save us somewhat
at ashift 9, that stops being true.
So let's change the default to 32 KiB, which seems like a reasonable
compromise between improved space savings and inflated write sizes
for DDT updates.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#14654
The previous comment wondered if this case could happen; it turns out
that it really can't.
This block can only be entered if dde_type and dde_class are "real";
that only happens when a ddt entry has been previously synced to a ddt
store, that is, it was created on a previous txg. Since its gone through
that sync, its dde_refcount must be >0.
ddt_addref() is called from brt_pending_apply(), which is called at the
beginning of spa_sync(), before pending DMU writes/frees are issued.
Freeing a dedup block is the only thing that can decrement dde_refcount,
so there's no way for it to drop to zero before applying the clone bumps
it.
Further, even if it _could_ go to zero, it wouldn't be necessary to fill
the entry from the block. The phys content is not cleared until the free
is issued, which happens when the refcount goes to zero, when the last
real free comes through. The cloned block should be identical to what's
in the phys already, so the fill should be a no-op anyway.
I've replaced this with an assertion because this is all very dependent
on the ordering in which BRT and DDT changes are applied, and that might
change in the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes#15004
With zl_suspend read in zil_commit() not protected by any locks it
is possible for new ZIL writes to be in progress while zil_destroy()
called by zil_suspend() freeing them. This patch closes the race
by taking zl_issuer_lock in zil_suspend() and adding the second
zl_suspend check to zil_get_commit_list(), protected by the lock.
It allows all already queued transactions to be logged normally,
while blocks any new ones, calling txg_wait_synced() for the TXGs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14979
- Pack struct zio_prop by 4 bytes from 84 to 80.
- Skip new child ZIO locking while linking to parent. The newly
allocated ZIO is not externally visible yet, so nobody should care.
- Skip io_bp_copy writes when not used (write && non-debug).
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14985
Scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, that caused
reporting of hugely over-inflated numbers, having nothing to do
with actual disk I/O. This change utilizes never used field in
struct dsl_scan_phys to account such skipped bytes, allowing to
report how much data were actually scrubbed/resilvered and what
is the actual I/O speed. While formally it is an on-disk format
change, it should be compatible both ways, so should not need a
feature flag.
This should partially address the same issue as c85ac731a0, but
from a different perspective, complementing it.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15007
lwb->lwb_issued_txg can not be accessed after lwb_state is set to
LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
freed by zil_sync(). We must save the txg number before that.
This is similar to the 55b1842f92, but as I see the bug is not new.
It existed for quite a while, just was not triggered due to smaller
race window.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14988Closes#14999
When ZFS appends files in chunks bigger than recordsize, it borrows
buffer from ARC and fills it before opening transaction. This
supposed to help in case of page faults to not hold transaction open
indefinitely. The problem appears when recordsize is set lower than
default 128KB. Since each block is committed in separate transaction,
per-transaction overhead becomes significant, and what is even worse,
active use of of per-dataset and per-pool locks to protect space use
accounting for each transaction badly hurts the code SMP scalability.
The same transaction size limitation applies in case of file rewrite,
but without even excuse of buffer borrowing.
To address the issue, disable the borrowing mechanism if recordsize
is smaller than default and the write request is 4x bigger than it.
In such case writes up to 32MB are executed in single transaction,
that dramatically reduces overhead and lock contention. Since the
borrowing mechanism is not used for file rewrites, and it was never
used by zvols, which seem to work fine, I don't think this change
should create significant problems, partially because in addition to
the borrowing mechanism there are also used pre-faults.
My tests with 4/8 threads writing several files same time on datasets
with 32KB recordsize in 1MB requests show reduction of CPU usage by
the user threads by 25-35%. I would measure it in GB/s, but at that
block size we are now limited by the lock contention of single write
issue taskqueue, which is a separate problem we are going to work on.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14964
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists. AVL-trees are too expensive
for such a simple task. To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.
To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue. Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member. Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().
Introduce deadline of ~0.5s for LBA-sorted queues. Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation. With this change I no longer see it. I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal. For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.
Arrange members in struct zio to access only one cache line when
searching through vdev queues. While there, remove io_alloc_node,
reusing the io_queue_node instead. Those two are never used same
time.
Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests. Just remove the tree.
Remove locking from txg_all_lists_empty(). It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.
With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14925