module/zfs/ddt.c:2612:6: error: variable 'total' set but not used
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
zfs_acl_node_alloc allocates an uninitialized data buffer, but upstack
zfs_acl_chmod only partially initializes it. KMSAN reported that this
memory remained uninitialized at the point when it was read by
lzjb_compress, which suggests a possible kernel memory disclosure bug.
The full KMSAN warning may be found in the PR.
https://github.com/openzfs/zfs/pull/16511
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored by: Axcient
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
When reviewing logs after a failure, its useful to see where
unsuspend/resume was requested.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
All callers to spa_prop_get() and spa_prop_get_nvlist() supplied their
own preallocated nvlist (except ztest), so we can remove the option to
have them allocate one if none is supplied.
This sidesteps a bug in spa_prop_get(), where the error var wasn't
initialised, which could lead to the provided nvlist being freed at the
end.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16505
This adds zfs_valstr, a collection of pretty printers for bitfields and
enums. These are useful in debugging, logging and other display contexts
where raw values are difficult for the untrained (or even trained!) eye
to decipher.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Requires the new 'flat' physical data which has the start
time for a class entry.
The amount to prune can be based on a target percentage of
the unique entries or based on the age (i.e., every entry
older than N days).
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16277
The simplest thing first: add the FDT and log objects to the list of
objects to be considered when checking for leaks.
The rest is based on a conceptual change in all of this patch stack: a
block on disk with a 'D' bit is not necessarily in the DDT at all
(pruned), or in the DDT ZAPs (still on the log).
As such, walking the DDT up front is difficult (for all the reasons that
walking an unflushed log is difficult) and not really useful, since it's
not a reflection of what's on disk anyway.
Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#16277
In 4938d01db (#14086) zio_flag_t was converted from an enum (generally
signed 32-bit) to a uint64_t. The corresponding change wasn't made to
the error reporting subsystem, limiting the error flags being delivered
to zed to 32 bits. This bumps the whole pipeline to use uint64s.
A tiny bit of compatibility is added for newer zed working agsinst an
older kernel module, because its easy to do and misdetecting
scrub/resilver errors and taking action is potentially dangerous. Making
it work for new kernel modules against older zed seems to be far more
invasive for far less benefit, so I have not.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16469
When process got SIGSTOP/SIGTSTP, issig() dequeue them and return 0.
But process could still have another signal pending after dequeue. So,
after dequeue, check and return 1, if signal_pending.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#16464
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.
This change also adds some general tests for zpool-reguid(8).
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Dispatched delayed tasks were not added to tasks_total, and cancelled
tasks were not removed. This notably could make tasks_total go to
UNIT64_MAX, but just generally meant the count could be wrong. So lets
not!
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16473
Commit 329e2ffa4b has made mount.zfs(8) to
call libzfs function 'zfs_mount_at', in order to propagate dataset
properties into mount options. This fix however, is limited to a special
use case where mount.zfs(8) is used in initrd with option '-o zfsutil'.
If either initrd or the user need to use mount.zfs(8) to mount a file
system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and
the original issue #7947 will still happen.
Since the existing code already excluded the possibility of calling
'zfs_mount_at' when it was invoked as a helper program from zfs(8), by
checking 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to
avoid calling 'zfs_mount_at' without '-o zfsutil'.
An exception however, is when mount.zfs(8) was invoked with '-o remount'
to update the mount options for an existing mount point. In this case
call mount(2) directly without modifying the mount options passed from
command line.
Furthermore, don't run mount.zfs(8) helper for automounting snapshot.
The above change to make mount.zfs(8) to call 'zfs_mount_at'
apparently caused it to trigger an automount for the snapshot
directory. When the helper was invoked as a result of a snapshot
automount, an infinite recursion will occur.
Since the need of invoking user mode mount(8) for automounting was to
overcome that the 'vfs_kern_mount' being GPL-only, just run mount(8)
without the mount.zfs(8) helper by adding option '-i'.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <whr@rivoreo.one>
Closes#16393
Some callers (eg `do_corrective_recv()`) pass in a dest buffer much
smaller than the wanted 87.5% of the source buffer, because the
incoming abd is larger than the source data and they "know" what the
decompressed size with be.
However, `abd_borrow_buf()` rightly asserts if we try to borrow more
than is available, so these callers fail.
Previously when all we had was a dest buffer, we didn't know how big it
was, so we couldn't do anything. Now we have a dest abd, with a size, so
we can clamp dest size to the abd size.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This commit changes the frontend zio_compress_data and
zio_decompress_data APIs to take ABD points instead of buffer pointers.
All callers are updated to match. Any that already have an appropriate
ABD nearby now use it directly, while at the rest we create an one.
Internally, the ABDs are passed through to the provider directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This commit changes the provider compress and decompress API to take ABD
pointers instead of buffer pointers for both data source and
destination. It then updates all providers to match.
This doesn't actually change the providers to do chunked compression,
just changes the API to allow such an update in the future. Helper
macros are added to easily adapt the ABD functions to their buffer-based
implementations.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This is mostly to make searching easier.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This will make future refactoring easier.
There are two we can't change for the moment, because zio_compress_data
does hole detection & collapsing which zio_decompress_data does not
actually know how to handle.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This allows a simple "wrapping" ABD for an existing linear buffer to be
allocated on the stack, avoiding an allocation.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Removing the platform #ifdefs from shared headers in favour of
per-platform headers. Makes abd_t much leaner, among other things.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16253
The Linux abd_os.c serves double-duty as the userspace scatter abd
implementation, by carrying an emulation of kernel scatterlists. This
commit lifts common and userspace-specific parts out into a separate
abd_os.c for libzpool.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16253
Makes it harder to use memory debuggers like valgrind directly, because
they can't see canary overruns.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16253
Nothing ever checks it.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16253
zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse
due to excessive memory reclaim. However, when the kernel is
in direct reclaim mode (ie: low on memory), limiting ARC reclaim
increases OOM risk. This is especially true on system without
(or with inadequate) swap.
This patch ignores zfs_arc_shrinker_limit when the kernel is in
direct reclaim mode, avoiding most OOM. It also restores
"echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop
(almost) all ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes#16313
Rob Noris suggested that we could clean up redundant limits for the case
of non-blk mq scenario.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16462
In kernels 6.8 and later, the zvol block device is allocated with
qlimits passed during initialization. However, the zvol driver does not
set `max_hw_discard_sectors`, which is necessary to properly
initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test
to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting
`max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve
the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16462
These had minimal useful information for the admin, didn't work properly
in some places, and knew far too much about taskq internals.
With the new stats available, these should never be needed anymore.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes#16171
This adds /proc/spl/kstats/taskq/summary, which attempts to show a
useful subset of stats for all taskqs in the system.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes#16171
This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.
These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.
To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes#16171
For spl-taskq to use the kstats infrastructure, it has to be available
first.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes#16171
Skip ro check for snapshots since they are always ro regardless if ro
flag is passed by mount or not. This allows multi-mounting snapshots
without requiring to specify ro flag.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#16299
A problem with zio_flush() is that it issues a flush ZIO to a top-level
vdev, which then recursively issues child flush ZIOs until the real leaf
devices are flushed. As usual, an error in a child ZIO results in the
parent ZIO erroring too, so if a leaf device has failed, it's flush ZIO
will fail, and so will the entire flush operation.
This didn't matter when we used to ignore flush errors, but now that we
propagate them, the flush error propagates into the ZIL write ZIO. This
causes the ZIL to believe its write failed, and fall back to a full txg
wait. This still provides correct behaviour for zil_commit() callers (eg
fsync()) but it ruins performance.
We cannot simply skip flushing failed vdevs, because the associated
write may have succeeded before the vdev failed, which would give the
appearance the write is fully flushed when it is not. Neither can we
issue a "syncing write" to the device (eg SCSI FUA), as this also
degrades performance.
The answer is that we must bind writes and flushes together in a way
such that we only flush the physical devices that we wrote to.
This adds a "vdev tracing" facility to ZIOs, zio_vdev_trace. When
enabled on a ZIO with ZIO_FLAG_VDEV_TRACE, then upon successful
completion (in the _done handler), zio->io_vdev_trace_tree will have a
list of zio_vdev_trace_t objects that each describe a vdev that was
involved in the successful completion of the ZIO.
A companion function, zio_vdev_trace_flush(), is included, that issues a
flush ZIO to the child vdevs on the given trace tree.
zil_lwb_write_done() is updated to use this to bind ZIL writes and
flushes together.
The tracing facility is similar in many ways to the "deferred flushing"
facility inside the ZIL, to the point where it can replace it. Now, if
the flush should be deferred, the trace records from the writing ZIO are
captured and combined with any captured from previous writes. When its
finally time to issue the flush, we issue it to the entire accumulated
set of traced vdevs.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
The first time a device returns ENOTSUP in repsonse to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend the succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!
So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.
Along the way, it seems there's no good reason for vdev_disk & vdev_geom
to explicitly detect no support for flush and set vdev_nowritecache;
just letting the error through to zio_vdev_io_assess() will cause it all
to fall out nicely. So remove those checks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Since the beginning, ZFS' "flush" operation has always ignored
errors[1]. Write errors are captured and dealt with, but if a write
succeeds but the subsequent flush fails, the operation as a whole will
appear to succeed[2].
In the end-of-transaction uberblock+label write+flush ceremony, it's
very difficult for this situation to occur. Since all devices are
written to, typically the first write will succeed, the first flush will
fail unobserved, but then the second write will fail, and the entire
transaction is aborted. It's difficult to imagine a real-world scenario
where all the writes in that sequence could succeed even as the flushes
are failing (understanding that the OS is still seeing hardware problems
and taking devices offline).
In the ZIL however, it's another story. Since only the write response is
checked, if that write succeeds but the flush then fails, the ZIL will
believe that it succeeds, and zil_commit() (and thus fsync()) will
return success rather than the "correct" behaviour of falling back into
txg_wait_synced()[3].
This commit fixes this by adding a simple flag to zio_flush() to
indicate whether or not the caller wants to receive flush errors. This
flag is enabled for ZIL calls. The existing zio chaining inside the ZIL
and the flush handler zil_lwb_flush_vdevs_done() already has all the
necessary support to properly handle a flush failure and fail the entire
zio chain. This causes zil_commit() to correct fall back to
txg_wait_synced() rather than returning success prematurely.
1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support
for flushing devices with write caches with the DKIOCFLUSHWRITECACHE
ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from
from the time shows the thinking:
/*
* Wait for all the flushes to complete. Not all devices actually
* support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
*/
2. It's not entirely clear from the code history why this was acceptable
for devices that _do_ have write caches. Our best guess is that this
was an oversight: between the combination of hardware, pool topology
and application behaviour required to hit this, it basically didn't
come up.
3. Somewhat frustratingly, zil.c contains comments describing this exact
behaviour, and further discussion in #12443 (September 2021). It
appears that those involved saw the potential, but were looking at a
different problem and so didn't have the context to recognise it for
what it was.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
`l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU
data and metadata. However it can be useful to cache as much
metadata as possible while, at the same time, restricting data
cache to MFU buffers only.
This patch allow for such behavior by setting `l2arc_mfuonly` to 2
(or higher). The list of possible values is the following:
0: cache both MRU and MFU for both data and metadata;
1: cache only MFU for both data and metadata;
2: cache both MRU and MFU for metadata, but only MFU for data.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes#16343Closes#16402
Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15895
The dedup log does not have a stable cursor, so its not possible to
persist our current scan location within it across pool reloads.
Beccause of this, when walking (scanning), we can't treat it like just
another source of dedup entries.
Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.
Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15895
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.
Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.
A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.
The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.
All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15895
All objects stored in the MOS get copies=3. For a large dedup table,
this requires significant extra IO and disk space, when its not really
necessary - the dedup table itself isn't needed to read or write data,
only to keep data usage down. Losing the dedup table does not render the
pool unusable, it just messes up the accounting somewhat.
This adds a dmu_ddt_copies tuneable. When set to 0, the existing
behaviour is used. When set higher, dedup table blocks (ZAP and log)
will have this many copies rather than the usual 3, while indirect
blocks will have one more again.
This is a tuneable for now mostly for testing. Losing a dedup table can
cause blocks to be leaked, and we currently have no facilities to repair
that.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15895
This yields substantial performance improvements when we only write out
some small % of entries at a time, as it will cause entries that will go
into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and
so touch fewer blocks. Without this, the distribution is an even spread,
so we touch a lot more ZAP leaf nodes for any given number of entries.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15895
Both the API and the code were kinda mangled and I was really struggling
to follow it. The worst offender was the old ddt_stat_add(); after
fixing it up the rest of the changes are mostly knock-on effects and
targets of opportunity.
Note that the old ddt_stat_add() was safe against overflows - it could
produce crazy numbers, but the compiler wouldn't do anything stupid. The
assertions in ddt_stat_sub() go a lot of the way to protecting against
this; getting in a position where overflows are a problem is definitely
a programming error.
Also expanding ddt_stat_add() and ddt_histogram_empty() produces less
efficient assembly. I'm not bothered about this right now though; these
should not be hot functions, and if they are we'll optimise them later.
If we have to go back to the old form, we'll comment it like crazy.
Finally, I've removed the assertion that the bucket will never be
negative, as it will soon be possible to have entries with zero
refcounts: an entry for a block that is no longer on the pool, but is on
the log waiting to be synced out. It might be better to have a separate
bucket for these, since they're still using real space on disk, but
ultimately these stats are driving UI, and for now I've chosen to keep
them matching how they've looked in the past, as well as match the
operators mental model - pool usage is managed elsewhere.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15895
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.
This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.
Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.
Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15893
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.
The variable allocation for dde_phys is to support the upcoming flat
format.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15893
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.
A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15893
The "flat phys" feature will use only a single phys slot for all
entries, which means the old "single", "double" etc naming now makes no
sense, and more importantly, means that choosing the right slot for a
given block pointer will depend on how many slots are in use for a given
DDT.
This removes the old names, and adds accessor macros to decouple
specific phys array indexes from any particular meaning.
(These macros look strange in isolation, mainly in the way they take the
ddt_t* as an arg but don't use it. This is mostly a separate commit to
introduce the concept to the reader before the "flat phys" commit
extends it).
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15893
The upcoming dedup features break the long held assumption that all
blocks on disk with a 'D' dedup bit will always be present in the DDT,
or will have the same set of DVA allocations on disk as in the DDT.
If the DDT is no longer a complete picture of all the dedup blocks that
will be and should be on disk, then it does us no good to walk and prime
it up front, since it won't necessarily match up with every block we'll
see anyway.
Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.
The DDT and BRT are moved ahead of the space accounting. This will
become important for the "flat" feature, which may need to count a
modified version of the block.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15892
This is the supporting infrastructure for the upcoming dedup features.
Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.
The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.
This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. The is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.
This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.
For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.
ddt_configure() is expected to determine the version and flags features
currently in operation based on whether or not the fast_dedup feature is
enabled, and from what's available on disk. In this way, its possible to
support both old and new tables.
This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later and allows the
possibility that newer features could be applied to existing dedup
tables.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15892
zvol queue limits initialization depends on `zv_volblocksize`, but it is
initialized later, leading to several limits being initialized with
incorrect values, including `max_discard_*` limits. This also causes
`blkdiscard` command to consistently fail, as `blk_ioctl_discard` reads
`bdev_max_discard_sectors()` limits as 0, leading to failure. The fix is
straightforward: initialize `zv->zv_volblocksize` early, before setting
the queue limits. This PR should fix `zvol/zvol_misc/zvol_misc_trim`
failure on recent PRs, as the test case issues `blkdiscard` for a zvol.
Additionally, `zvol_misc_trim` was recently enabled in `6c7d41a`,
which is why the issue wasn't identified earlier.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16454