When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location. In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). It is required
to allow in case of error restoring the buffer to a state after
the cloning, not not before it, that would happen if we just call
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15665
Block cloning normally creates dirty record without dr_data. But if
the block is read after cloning, it is moved into DB_CACHED state and
receives the data buffer. If after that we call dbuf_unoverride()
to convert the dirty record into normal write, we should give it the
data buffer from dbuf and release one.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15654Closes#15656
In some cases dbuf_assign_arcbuf() may be called on a block that
was recently cloned. If it happened in current TXG we must undo
the block cloning first, since the only one dirty record per TXG
can't and shouldn't mean both cloning and overwrite same time.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15653
While evicting dbufs of a dnode, a marker node is added to the AVL.
The marker node should be inserted in AVL tree ahead of the dbuf its
trying to delete. The blkid and level is used to ensure this. However,
this could go wrong there's another dbufs with the same blkid and level
in DB_EVICTING state but not yet removed from AVL tree. dbuf_compare()
could fail to give the right location or could cause confusion and
trigger ASSERTs.
To ensure that the marker is inserted before the deleting dbuf, use
the pointer value of the original dbuf for comparision.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sanjeev Bagewadi <sanjeev.bagewadi@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#12482Closes#15643
dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in
case dbuf_hold() failed. I don't have reproduction for this, but
it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15644
Coverity noticed that sometimes we ignore the return, and sometimes we
don't. Its not wrong, and I like consistent style, so here we are.
Reported-by: Coverity (CID-1564584)
Reported-by: Coverity (CID-1564585)
Reported-by: Coverity (CID-1564586)
Reported-by: Coverity (CID-1564587)
Reported-by: Coverity (CID-1564588)
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15647
Otherwise the field is left uninitialized, leading to a possible kernel
memory disclosure to userspace or to the network. Use the same
initialization value we use in zfsctl_common_getattr().
Reported-by: KMSAN
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ed Maste <emaste@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15639
Without this patch on pool of 60 vdevs with ZFS_DEBUG enabled clone
takes much more time than copy, while heavily trashing dbgmsg for
no good reason, repeatedly dumping all vdevs BRTs again and again,
even unmodified ones.
I am generally not sure this dumping is not excessive, but decided
to keep it for now, just restricting its scope to more reasonable.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15625
To improve 128KB block write performance in case of multiple VDEVs
ZIL used to spit those writes into two 64KB ones. Unfortunately it
was found to cause LWB buffer overflow, trying to write maximum-
sizes 128KB TX_CLONE_RANGE record with 1022 block pointers into
68KB buffer, since unlike TX_WRITE ZIL code can't split it.
This is a minimally-invasive temporary block cloning fix until the
following more invasive prediction code refactoring.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15634
Detail the import progress of log spacemaps as they can take a very
long time. Also grab the spa_note() messages to, as they provide
insight into what is happening
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes#15539
My merged pull request #15557 fixes compilation of sha2 kernels on arm
v5/6. However, the compiler guards only allows sha256/512_armv7_impl to
be used when __ARM_ARCH > 6. This patch enables these ASM kernels on all
arm architectures. Some compiler guards are adjusted accordingly to
avoid the unnecessary compilation of SIMD (e.g., neon, armv8ce) kernels
on old architectures.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#15623
When two datasets share the same master encryption key, it is safe
to clone encrypted blocks. Currently only snapshots and clones
of a dataset share with it the same encryption key.
Added a test for:
- Clone from encrypted sibling to encrypted sibling with
non encrypted parent
- Clone from encrypted parent to inherited encrypted child
- Clone from child to sibling with encrypted parent
- Clone from snapshot to the original datasets
- Clone from foreign snapshot to a foreign dataset
- Cloning from non-encrypted to encrypted datasets
- Cloning from encrypted to non-encrypted datasets
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes#15544
ZIL claim can not handle block pointers cloned from the future,
since they are not yet allocated at that point. It may happen
either if the block was just written when it was cloned, or if
the pool was frozen or somehow else rewound on import.
Handle it from two sides: prevent cloning of blocks with physical
birth time from not yet synced or frozen TXG, and abort ZIL claim
if we still detect such blocks due to rewind or something else.
While there, assert that any cloned blocks we claim are really
allocated by calling metaslab_check_free().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15617
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means we do not
need to do anything (drop the references) for not implemented yet
TX_CLONE_RANGE replay for ZVOLs.
This is a logical follow up to #15603.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15612
Since we use a limited set of kmem caches, quite often we have unused
memory after the end of the buffer. Put there up to a 512-byte canary
when built with debug to detect buffer overflows at the free time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15553
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means on actual
replay we must take additional references, which would stay in BRT.
Without this blocks could be freed prematurely when either original
file or its clone are destroyed. I've observed BRT being emptied
and the feature being deactivated after ZIL replay completion, which
should not have happened. With the patch I see expected stats.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15603
This should make sure we have log written without overflows.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15517
The `adr` insn in neon kernel generates an compiling
error on armv5/6 target. Fix that by using `ldr`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#15557
This patch uses __ARM_ARCH set by compiler (both
GCC and Clang have this) whenever possible instead
of hardcoding it to 7. This change allows code to
compile on earlier ARM architectures such as armv5te.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#15557
Previously, dmu_buf_will_clone() would roll back any dirty record, but
would not clean out the modified data nor reset the state before
releasing the lock. That leaves the last-written data in db_data, but
the dbuf in the wrong state.
This is eventually corrected when the dbuf state is made NOFILL, and
dbuf_noread() called (which clears out the old data), but at this point
its too late, because the lock was already dropped with that invalid
state.
Any caller acquiring the lock before the call into
dmu_buf_will_not_fill() can find what appears to be a clean, readable
buffer, and would take the wrong state from it: it should be getting the
data from the cloned block, not from earlier (unwritten) dirty data.
Even after the state was switched to NOFILL, the old data was still not
cleaned out until dbuf_noread(), which is another gap for a caller to
take the lock and read the wrong data.
This commit fixes all this by properly cleaning up the previous state
and then setting the new state before dropping the lock. The
DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the
lock is down.
Sponsored-by: Klara, Inc.
Sponsored-By: OpenDrives Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15566Closes#15526
Clean up code in dsl_scan_visitbp() by removing an unnecessary
alloc/free and `goto`. This has the side benefit of reducing CPU usage,
which is only really noticeable if we are not doing i/o for the leaf
blocks, like when `zfs_no_scrub_io` is set.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#15549
Over its history this the dirty dnode test has been changed between
checking for a dnodes being on `os_dirty_dnodes` (`dn_dirty_link`) and
`dn_dirty_record`.
de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
2531ce372 Revert "Report holes when there are only metadata changes"
ec4f9b8f3 Report holes when there are only metadata changes
454365bba Fix dirty check in dmu_offset_next()
66aca2473 SEEK_HOLE should not block on txg_wait_synced()
Also illumos/illumos-gate@c543ec060dillumos/illumos-gate@2bcf0248e9
It turns out both are actually required.
In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added. Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.
The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.
Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.
This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.
This change simply updates the dirty check to check both types of dirty.
If there's anything dirty at all, we immediately go to the "wait for
sync" stage, It doesn't really matter after that; both changes are on
disk, so the dirty fields should be correct.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15571Closes#15526
Call vfs_exjail_clone() for mounts created under .zfs/snapshot
to fill in the mnt_exjail field for the mount. If this is not
done, the snapshots under .zfs/snapshot with not be accessible
over NFS.
This version has the argument name in vfs.h fixed to match that
of the name in spl_vfs.c, although it really does not matter.
External-issue: https://reviews.freebsd.org/D42672
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#15563
So that zdb (and others!) can get at the BRT on-disk structures.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15541
- Remove zsda_tx field, it is used only once.
- Remove unneeded string lengths checks, all names are terminated.
- Replace few explicit MAXNAMELEN usages with sizeof().
- Change dsname from MAXNAMELEN to ZFS_MAX_DATASET_NAME_LEN, as
expected by dsl_dataset_name(). Both are 256 bytes now, but it is
better to be safe.
This should have no functional difference.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15535
It was broken for several reasons:
* VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using
VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in.
* The location of the zlib header moved in 13.0 and 12.1. We can drop
support for building on 12.0, which is EoL.
* knlist_init lost an argument in 13.0. OpenZFS change 9d0887402b
assumed 13.0 or later.
* FreeBSD 13.0 added copy_file_range, and OpenZFS change 67a1b03791
assumed 13.0 or later.
Sponsored-by: Axcient
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#15551
It should be purely textual change to make the code more readable.
Should cause no functional difference.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15543Closes#15513
In case of crash cloned blocks need to be claimed on pool import.
It is only possible if they (lr_bps) and their count (lr_nbps) are
not encrypted but only authenticated, similar to block pointer in
lr_write_t. Few other fields can be and are still encrypted.
This should fix panic on ZIL claim after crash when block cloning
is actively used.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15543Closes#15513
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#15536Closes#15564
With FreeBSD's switch to git the $FreeBSD$ string is no longer expanded
and they have mostly been removed upstream. Stop using __FBSDID and
remove the no-longer needed sys/cdefs.h includes.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#15527
- Generalize vdev_nowritecache handling by traversing through the
VDEV tree and skipping children ZIOs where not supported.
- Remove intermediate zio_null() in case of several VDEV children.
- Remove children handling from zio_ioctl(). There are no other
use cases for this code beside DKIOCFLUSHWRITECACHED, and would there
be, I doubt they would so straightforward apply to all VDEV children.
Comparing to removed previous optimization this should improve cases
of redundant ZILs/SLOGs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15515
In several places abd_zero() cleaned ABD filled at the next line.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15514
Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory accounting
appears to show. In some cases (eg a metadnode data block with all child
dnode_t slots allocated), the actual size can be as much as 3x as what
the cache believes it to be.
This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t. On the other
hand, in the above case of dnodes, evicting cached metadnode dbufs is
the only current way to reclaim the dnode objects, and can lead to the
situation where the dbuf cache appears to be comfortably within its
target memory window and yet is holding enormous amounts of slab memory
that cannot be reclaimed.
This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.
metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This informs the dbuf cache machinery of that fact, allowing it to make
better decisions when evicting dbufs.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15511
Copy the disable parameter that FreeBSD implemented, and extend it to
work on Linux as well, until we're sure this is stable.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#15529
PR #15457 exposed weird logic in L2ARC write sizing. If it appeared
bigger than device size, instead of liming write it reset all the
system-wide tunables to their default. Aside of being excessive,
it did not actually help with the problem, still allowing infinite
loop to happen.
This patch removes the tunables reverting logic, but instead limits
L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15519
Use goto out instead of return for early exit to make sure
snap_obj_array is freed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#15516
This gets around UBSAN errors when using arrays at the end of
structs. It converts some zero-length arrays to variable length
arrays and disables UBSAN checking on certain modules.
It is based off of the patch from #15460.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Co-authored-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #15145Closes#15510
It is unused for 3 years since #10576.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15507
Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).
These values were selected 15+ years ago based on then-current SSDs
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.
For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15457
Private read/write mapping can't be used to modify the mapped files, so
they will remain be immutable. Private read/write mappings are usually
used to load the data segment of executable files, rejecting them will
rendering immutable executable files to stop working.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes#15344
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes#15022
With Linux v6.6.0 and GCC 12, when debug build is configured,
implicit conversion error is raised while converting
'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of
'true' to fix the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15489
Add a ZFS feature flag to indicate OpenZFS availability.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gordon Tetlow <gordon@freebsd.org>
Closes#15484
When retrieving a system attribute, the size of the supplied
buffer is ignored. If the buffer is too small to hold the attribute,
sa_attr_op() will write past the end of the buffer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jking@racktopsystems.com>
Closes#15476
Previously taskq_init_ent() was an empty macro, while actual init
was done by taskq_dispatch_ent(). It could be slightly faster in
case taskq never enqueued. But without it taskq_empty_ent() relied
on the structure being zeroed by somebody else, that is not good.
As a side effect this allows the same task to be queued several
times, that is normal on FreeBSD, that may or may not get useful
here also one day.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15455
- Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl
output.
- Use much faster sbuf_cat() instead of sbuf_printf("%s").
Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes
to seconds, making dbufstat almost usable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15495
Add a dataset_kstats_rename function, and call it when renaming
a zvol on FreeBSD and Linux.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15482Closes#15486
Otherwise callbacks may trigger KMSAN violations in the dlen == 0 case.
For example, raidz_syn_pq_abd() will compare an uninitialized pointer
with itself before returning. This seems harmless, but let's maintain
good hygiene and avoid passing uninitialized variables, if only to
placate KMSAN.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15491
Currently vdev_queue_class_length is responsible for checking how long
the queue length is, however, it doesn't check the length when a list
is used, rather it just returns whether it is empty or not. To fix this
I added a counter variable to vdev_queue_class to keep track of the sync
IO ops, and changed vdev_queue_class_length to reference this variable
instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes#15478
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).
There are two problems:
1. Each ZVOL in a pool is a separate dataset/objset having a single
dnode. This means the objsets are synchronized serially, which
leads to a bottleneck of ~330K blocks written per second per pool.
2. In the case of multiple dirty dnodes/files on a dataset/objset on a
big system they will be sync'd in parallel taskq threads. However,
it is inefficient to to use 75% of CPU cores of a big system to do
that, because of (a) bottlenecks on a single write issue taskq, and
(b) allocation throttling. In addition, if not for the allocation
throttling sorting write requests by bookmarks (logical address),
writes for different files may reach space allocators interleaved,
leading to unwanted fragmentation.
The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
the pool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15197