Track history in context of bursts, not individual log blocks. It
allows to not blow away all the history by single large burst of
many block, and same time allows optimizations covering multiple
blocks in a burst and even predicted following burst. For each
burst account its optimal block size and minimal first block size.
Use that statistics from the last 8 bursts to predict first block
size of the next burst.
Remove predefined set of block sizes. Allocate any size we see fit,
multiple of 4KB, as required by ZIL now. With compression enabled
by default, ZFS already writes pretty random block sizes, so this
should not surprise space allocator any more.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15635
Add `zpool` flags to control the slot power to drives. This assumes
your SAS or NVMe enclosure supports slot power control via sysfs.
The new `--power` flag is added to `zpool offline|online|clear`:
zpool offline --power <pool> <device> Turn off device slot power
zpool online --power <pool> <device> Turn on device slot power
zpool clear --power <pool> [device] Turn on device slot power
If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power'
option is automatically implied for `zpool online` and `zpool clear`
and does not need to be passed.
zpool status also gets a --power option to print the slot power status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mart Frauenlob <AllKind@fastest.cc>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15662
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there's very few ways to change behaviour without patching.
This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.
This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.
Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. Its expected that anyone using this is going
to have the source code there as well.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15675
6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accomodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robnCloses#15681
In 6.7 the superblock shrinker member s_shrink has changed from being an
embedded struct to a pointer. Detect this, and don't take a reference if
it already is one.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robnCloses#15681
6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and
i_mtime. This extends the method used for ctime in b37f29341 to atime
and mtime as well.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robnCloses#15681
6.7 changed the names of the time members in struct inode, so we can't
assign back to it because we don't know its name. In practice this
doesn't matter though - if we're missing current_time(), then we must be
on <4.9, and we know our fallback will need to return timespec.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robnCloses#15681
PR#15634 removes 128K into 2x68K LWB split optimization, since it
was found to cause LWB buffer overflow while trying to write 128KB
TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer,
with multiple VDEVs ZIL.
This commit adds a test for this particular scenario by writing
maximum sizes TX_CLONE_RANE record with 1022 block pointers into
68KB buffer, with two SLOG devices.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15672
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location. In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). It is required
to allow in case of error restoring the buffer to a state after
the cloning, not not before it, that would happen if we just call
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15665
Block cloning normally creates dirty record without dr_data. But if
the block is read after cloning, it is moved into DB_CACHED state and
receives the data buffer. If after that we call dbuf_unoverride()
to convert the dirty record into normal write, we should give it the
data buffer from dbuf and release one.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15654Closes#15656
In some cases dbuf_assign_arcbuf() may be called on a block that
was recently cloned. If it happened in current TXG we must undo
the block cloning first, since the only one dirty record per TXG
can't and shouldn't mean both cloning and overwrite same time.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15653
Align the raidz_expand_005_pos test with the raidz_expand_004_pos test
and only verify no errors were reported. Allow scrub repair IO.
Reviewed-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15663
While evicting dbufs of a dnode, a marker node is added to the AVL.
The marker node should be inserted in AVL tree ahead of the dbuf its
trying to delete. The blkid and level is used to ensure this. However,
this could go wrong there's another dbufs with the same blkid and level
in DB_EVICTING state but not yet removed from AVL tree. dbuf_compare()
could fail to give the right location or could cause confusion and
trigger ASSERTs.
To ensure that the marker is inserted before the deleting dbuf, use
the pointer value of the original dbuf for comparision.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sanjeev Bagewadi <sanjeev.bagewadi@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#12482Closes#15643
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
https://github.com/openzfs/zfs/issues/15526
The bug was fixed in https://github.com/openzfs/zfs/pull/15571 and
was backported to 2.2.2 and 2.1.14. This test case is just to
make sure it does not come back.
seekflood.c originally written by Rob Norris.
Reviewed-by: Graham Perrin <grahamperrin@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15608
The zpool_import_status.ksh test case was not being run because
it was not included in the Makefile.am.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15655
The io_uring test fails on CentOS 9 with the following fio error.
Disable the test for the benefit of the CI until this can be fully
investigated. This basic test passes as expected on newer kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15636
dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in
case dbuf_hold() failed. I don't have reproduction for this, but
it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15644
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15614
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15614
Replace ENCLO_US_RE with ENCLO_SU_RE in the name of the variable.
Note this changes the user-visible string in zed.rc, thus might
break current users with the wrong string, but it's ~2 months
since zfs-2.2.0 tag is out, thus should not be widespread yet.
Mechanical change:
$ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT
cmd/zed/zed.d/zed.rc
cmd/zed/zed.d/statechange-slot_off.sh
$ sed -i 's/ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT/<linebreak>
ZED_POWER_OFF_ENCLOSURE_SLOT_ON_FAULT/g' \
cmd/zed/zed.d/zed.rc \
cmd/zed/zed.d/statechange-slot_off.sh
$ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT
$
Fixes 11fbcacf37
("zed: Add zedlet to power off slot when drive is faulted")
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes#15651
Just silencing a warning. Its totally fine for a hostid to not be there.
Reported-by: Coverity (CID-1573336)
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15650
Reported-by: Coverity (CID-1573333)
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15649
Coverity noticed that sometimes we ignore the return, and sometimes we
don't. Its not wrong, and I like consistent style, so here we are.
Reported-by: Coverity (CID-1564584)
Reported-by: Coverity (CID-1564585)
Reported-by: Coverity (CID-1564586)
Reported-by: Coverity (CID-1564587)
Reported-by: Coverity (CID-1564588)
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15647
Otherwise the field is left uninitialized, leading to a possible kernel
memory disclosure to userspace or to the network. Use the same
initialization value we use in zfsctl_common_getattr().
Reported-by: KMSAN
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ed Maste <emaste@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15639
Without this patch on pool of 60 vdevs with ZFS_DEBUG enabled clone
takes much more time than copy, while heavily trashing dbgmsg for
no good reason, repeatedly dumping all vdevs BRTs again and again,
even unmodified ones.
I am generally not sure this dumping is not excessive, but decided
to keep it for now, just restricting its scope to more reasonable.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15625
To improve 128KB block write performance in case of multiple VDEVs
ZIL used to spit those writes into two 64KB ones. Unfortunately it
was found to cause LWB buffer overflow, trying to write maximum-
sizes 128KB TX_CLONE_RANGE record with 1022 block pointers into
68KB buffer, since unlike TX_WRITE ZIL code can't split it.
This is a minimally-invasive temporary block cloning fix until the
following more invasive prediction code refactoring.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15634
Block pointers are not encrypted in TX_WRITE and TX_CLONE_RANGE
records, so we can dump them, that may be useful for debugging.
Related to #15543.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15629
Since Linux 6.2, the implementation of flush_dcache_page on riscv
references GPL-only symbol `PageHuge`, breaking the build of zfs.
This patch uses existing mechanism to override flush_dcache_page,
removing the call to `PageHuge`. According to comments in kernel,
it is only used to do some check against HugeTLB pages, which only
exist in userspace. ZFS uses flush_dcache_page only on kernel pages,
thus this patch will not introduce any behaviour change.
See also: torvalds/linux@d33deda, openzfs/zfs@589f59b
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#14974Closes#15627
Detail the import progress of log spacemaps as they can take a very
long time. Also grab the spa_note() messages to, as they provide
insight into what is happening
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes#15539
My merged pull request #15557 fixes compilation of sha2 kernels on arm
v5/6. However, the compiler guards only allows sha256/512_armv7_impl to
be used when __ARM_ARCH > 6. This patch enables these ASM kernels on all
arm architectures. Some compiler guards are adjusted accordingly to
avoid the unnecessary compilation of SIMD (e.g., neon, armv8ce) kernels
on old architectures.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#15623
Several zpool commands (status, list, iostat) have modes that present
some information, sleep a while, present the current state, sleep, etc.
Some of those had ways to invoke them that when piped would appear to do
nothing for a while, because non-terminals are block-buffered, not
line-buffered, by default. Fix this by forcing a flush before sleeping.
In particular, all of these buffered:
- zpool status <pool> <interval>
- zpool iostat -y<m> <pool> <interval>
- zpool list <pool> <interval>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15593
When two datasets share the same master encryption key, it is safe
to clone encrypted blocks. Currently only snapshots and clones
of a dataset share with it the same encryption key.
Added a test for:
- Clone from encrypted sibling to encrypted sibling with
non encrypted parent
- Clone from encrypted parent to inherited encrypted child
- Clone from child to sibling with encrypted parent
- Clone from snapshot to the original datasets
- Clone from foreign snapshot to a foreign dataset
- Cloning from non-encrypted to encrypted datasets
- Cloning from encrypted to non-encrypted datasets
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes#15544
ZIL claim can not handle block pointers cloned from the future,
since they are not yet allocated at that point. It may happen
either if the block was just written when it was cloned, or if
the pool was frozen or somehow else rewound on import.
Handle it from two sides: prevent cloning of blocks with physical
birth time from not yet synced or frozen TXG, and abort ZIL claim
if we still detect such blocks due to rewind or something else.
While there, assert that any cloned blocks we claim are really
allocated by calling metaslab_check_free().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15617
This commit adds the zed_notify_ntfy() function and hooks it
into zed_notify(). This will allow ZED to send notifications
to ntfy.sh or a self-hosted Ntfy service, which can be received
on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC,
ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc.
Reviewed-by: @classabbyamp
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dex Wood <slash2314@gmail.com>
Closes#15584
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means we do not
need to do anything (drop the references) for not implemented yet
TX_CLONE_RANGE replay for ZVOLs.
This is a logical follow up to #15603.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15612
Since we use a limited set of kmem caches, quite often we have unused
memory after the end of the buffer. Put there up to a 512-byte canary
when built with debug to detect buffer overflows at the free time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15553
Check for the existence of execvpe(3) and only provide the FreeBSD
compat version if required.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#15609
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means on actual
replay we must take additional references, which would stay in BRT.
Without this blocks could be freed prematurely when either original
file or its clone are destroyed. I've observed BRT being emptied
and the feature being deactivated after ZIL replay completion, which
should not have happened. With the patch I see expected stats.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15603
getzoneid() should return GLOBAL_ZONEID instead of 0 when USER_NS is disabled.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ilkka Sovanto <github@ilkka.kapsi.fi>
Closes#15560
Instead of using only the 3rd element return the entire string after
the split to handle device names with dashes.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Vaibhav Bhanawat <vaibhav.bhanawat@delphix.com>
Closes#15567
Bug introduced in 213d682967.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Warner Losh <imp@FreeBSD.org>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes#15606
This should make sure we have log written without overflows.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15517
The `adr` insn in neon kernel generates an compiling
error on armv5/6 target. Fix that by using `ldr`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#15557
This patch uses __ARM_ARCH set by compiler (both
GCC and Clang have this) whenever possible instead
of hardcoding it to 7. This change allows code to
compile on earlier ARM architectures such as armv5te.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#15557
With Linux v6.6.x and clang 16, a configure step fails on a warning that
later results in an error while building, due to 'ts' being
uninitialized. Add a trivial initialization to silence the warning.
Signed-off-by: Jaron Kent-Dobias <jaron@kent-dobias.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Previously, dmu_buf_will_clone() would roll back any dirty record, but
would not clean out the modified data nor reset the state before
releasing the lock. That leaves the last-written data in db_data, but
the dbuf in the wrong state.
This is eventually corrected when the dbuf state is made NOFILL, and
dbuf_noread() called (which clears out the old data), but at this point
its too late, because the lock was already dropped with that invalid
state.
Any caller acquiring the lock before the call into
dmu_buf_will_not_fill() can find what appears to be a clean, readable
buffer, and would take the wrong state from it: it should be getting the
data from the cloned block, not from earlier (unwritten) dirty data.
Even after the state was switched to NOFILL, the old data was still not
cleaned out until dbuf_noread(), which is another gap for a caller to
take the lock and read the wrong data.
This commit fixes all this by properly cleaning up the previous state
and then setting the new state before dropping the lock. The
DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the
lock is down.
Sponsored-by: Klara, Inc.
Sponsored-By: OpenDrives Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15566Closes#15526
Clean up code in dsl_scan_visitbp() by removing an unnecessary
alloc/free and `goto`. This has the side benefit of reducing CPU usage,
which is only really noticeable if we are not doing i/o for the leaf
blocks, like when `zfs_no_scrub_io` is set.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#15549
Over its history this the dirty dnode test has been changed between
checking for a dnodes being on `os_dirty_dnodes` (`dn_dirty_link`) and
`dn_dirty_record`.
de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
2531ce372 Revert "Report holes when there are only metadata changes"
ec4f9b8f3 Report holes when there are only metadata changes
454365bba Fix dirty check in dmu_offset_next()
66aca2473 SEEK_HOLE should not block on txg_wait_synced()
Also illumos/illumos-gate@c543ec060dillumos/illumos-gate@2bcf0248e9
It turns out both are actually required.
In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added. Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.
The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.
Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.
This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.
This change simply updates the dirty check to check both types of dirty.
If there's anything dirty at all, we immediately go to the "wait for
sync" stage, It doesn't really matter after that; both changes are on
disk, so the dirty fields should be correct.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15571Closes#15526
Call vfs_exjail_clone() for mounts created under .zfs/snapshot
to fill in the mnt_exjail field for the mount. If this is not
done, the snapshots under .zfs/snapshot with not be accessible
over NFS.
This version has the argument name in vfs.h fixed to match that
of the name in spl_vfs.c, although it really does not matter.
External-issue: https://reviews.freebsd.org/D42672
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#15563