Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory than the
accounting appears to show. In some cases (e.g. a metadnode data block
with all child dnode_t slots allocated), the actual size can be as much
as 3x what the cache believes it to be.
This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t itself. On
the other hand, in the above case of dnodes, evicting cached metadnode
dbufs is currently the only way to reclaim the dnode objects, which can
lead to a situation where the dbuf cache appears to be comfortably
within its target memory window and yet is holding enormous amounts of
slab memory that cannot be reclaimed.
This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.
Metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This commit informs the dbuf cache machinery of that fact, allowing it
to make better decisions when evicting dbufs.
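For illustration, a sketch of how a dbuf user might report that extra
size, assuming helpers named dmu_buf_add_user_size() and
dmu_buf_sub_user_size() (the exact interface in the patch may differ):

    /* When attaching a dnode_t to a metadnode dbuf's user data: */
    dmu_buf_add_user_size(&db->db, sizeof (dnode_t));

    /* And when detaching it again: */
    dmu_buf_sub_user_size(&db->db, sizeof (dnode_t));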
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15511
(cherry picked from commit 92dc4ad83d)
Over its history, the dirty dnode test has been changed back and forth
between checking for the dnode being on `os_dirty_dnodes`
(`dn_dirty_link`) and checking `dn_dirty_record`:
de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
2531ce372 Revert "Report holes when there are only metadata changes"
ec4f9b8f3 Report holes when there are only metadata changes
454365bba Fix dirty check in dmu_offset_next()
66aca2473 SEEK_HOLE should not block on txg_wait_synced()
Also illumos/illumos-gate@c543ec060d and illumos/illumos-gate@2bcf0248e9.
It turns out both are actually required.
In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added. Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.
The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.
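For reference, the failing call looks like this (a hedged sketch; the
real race needs a concurrent appender, so a single run of this will not
reliably trip it):

    #define _GNU_SOURCE     /* for SEEK_DATA on Linux */
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stdio.h>

    int
    main(void)
    {
        int fd = open("newfile", O_CREAT | O_RDWR, 0644);
        (void) write(fd, "data", 4);
        /*
         * Expected: 0 (data at offset 0). In the buggy window:
         * -1 with errno == ENXIO, as if the file had no data.
         */
        if (lseek(fd, 0, SEEK_DATA) < 0 && errno == ENXIO)
            (void) printf("no data found\n");
        (void) close(fd);
        return (0);
    }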
Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.
This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.
This change simply updates the dirty check to check both types of
dirty. If there's anything dirty at all, we immediately go to the "wait
for sync" stage. It doesn't really matter after that; both changes are
on disk, so the dirty fields should be correct.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15571
Closes #15526
(cherry picked from commit 30d581121b)
The previous flushing algorithm limited only the total number of log
blocks, to the minimum of 256K and 4x the number of metaslabs in the
pool. As a result, a system with 1500 disks of 1000 metaslabs each,
touching several new metaslabs each TXG, could grow the spacemap log to
a huge size without much benefit. We've observed one such system take
about 45 minutes to import the pool.
This patch improves the situation in five ways:
- By limiting the maximum period between flushes of any metaslab to
1000 TXGs, which effectively limits the maximum number of per-TXG
spacemap logs to load to the same number.
- By making flushing smoother, accounting for the number of metaslabs
that were touched after the last flush and actually need another flush,
not just a ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not to all metaslabs in the
pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making the log spacemap load process for wide HDD pools
CPU-bound and accelerating it many times over.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, halving
the (single-threaded by nature) log processing time from ~10 to ~5
minutes.
As a further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change requiring a new pool feature.
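The first point might look roughly like this (sketch; the tunable name
is an assumption following the existing zfs_unflushed_* naming):

    /* Flush is overdue once a metaslab's log age reaches the cap. */
    extern uint64_t zfs_unflushed_log_txg_max;      /* default 1000 */

    static boolean_t
    metaslab_flush_overdue(const metaslab_t *ms, uint64_t txg)
    {
        return (txg - ms->ms_unflushed_txg >=
            zfs_unflushed_log_txg_max);
    }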
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12789
(cherry picked from commit cbfe5cb849518dd8fb65bf94a72fd88a15093a67)
Use the error thresholds from the policy to control whether to scrub
data and/or metadata. If a threshold is set to UINT64_MAX, the caller
probably does not care about the result and we may skip that part
entirely. By default, import neither sets the data error threshold nor
reads the data error counter, so skip the data scrub for a faster
import. Metadata is still scrubbed, and the import fails if even a
single error is found.
While there, just for symmetry, return the number of metadata errors
when the threshold is not set to zero and we haven't reached it.
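A hedged sketch of the threshold handling, using the zpool_load_policy_t
field names; the verify_*() helpers and error code are hypothetical
stand-ins:

    /* UINT64_MAX means the caller doesn't care; skip that part. */
    if (policy.zlp_maxmeta != UINT64_MAX) {
        meta_errors = verify_metadata(spa);     /* hypothetical */
        if (meta_errors > policy.zlp_maxmeta)
            return (SET_ERROR(EACCES));     /* illustrative */
    }
    if (policy.zlp_maxdata != UINT64_MAX) {
        data_errors = verify_data(spa);         /* hypothetical */
        if (data_errors > policy.zlp_maxdata)
            return (SET_ERROR(EACCES));     /* illustrative */
    }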
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13022
(cherry picked from commit f2c5bc150e)
Enables vdev traces for ZIL writes, and then issues flushes only to
the vdevs that were actually written to.
This simplifies a few things. We no longer have to extract the toplevel
vdevs to flush from the block pointer; instead we just look at what was
written. The vdev tree remains as a means to defer flushes to the next
lwb, which means a bit more copying of trees, but also means we no
longer have to lock the tree.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
If you have a trace tree from, say, a write, hand it directly to
zio_flush_traced() to flush only the leaf vdevs that were involved in
the write.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Meant for external callers to be able to build trace trees that can
later be submitted back to zio for work. It's hardly necessary, but it
saves needing to double up on the kmem cache and comparison function.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
The idea here is that you can add a flag to a zio, and every vdev that
contributed to the successful completion of that zio will be referenced
in the "trace tree". You can poke around in there from your _done
handler to do any per-vdev followup work.
The actual use case is to track the vdevs that were actually written
to, in order to have a list of vdevs that we should flush. That's why
it looks like the ZIL vdev flush tracker; the only difference is that
it will also list interior and leaf vdevs, not just toplevel vdevs.
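A sketch of what a trace node might carry (this is out-of-tree work, so
all names here are assumptions; keying by vdev guid keeps the tree
stable across reconfiguration):

    typedef struct zio_vdev_trace {
        uint64_t    zvt_guid;   /* contributing vdev */
        avl_node_t  zvt_node;   /* link in the trace tree */
    } zio_vdev_trace_t;

    static int
    zio_vdev_trace_compare(const void *va, const void *vb)
    {
        const zio_vdev_trace_t *a = va;
        const zio_vdev_trace_t *b = vb;
        return (TREE_CMP(a->zvt_guid, b->zvt_guid));
    }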
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.
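The shape of the change, with a couple of representative flags (the
full set follows the same pattern):

    /* Was: enum zio_flag, limited to 32 distinct bits. */
    typedef uint64_t zio_flag_t;

    #define ZIO_FLAG_DONT_AGGREGATE (1ULL << 0)
    #define ZIO_FLAG_IO_REPAIR      (1ULL << 1)
    /* ... bits up to 63 now available ... */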
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086
zio_flush() is the only user of zio_ioctl(), whose structure and flag
use is fairly specific to flushing. So here we bring the guts of
zio_ioctl() into zio_flush(), allowing some light reorganising (mostly
around how zio_nowait() is called) and a better signature.
This will help in the future when changing the way flush works, as it's
clear where the change should be made, and there is no wondering
whether zio_ioctl() is being used somewhere else.
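For context, the old arrangement was roughly this (quoted from memory
of the 2.x-era code, so treat it as a sketch); the change moves this
construction inside zio_flush() and retires zio_ioctl():

    void
    zio_flush(zio_t *zio, vdev_t *vd)
    {
        zio_nowait(zio_ioctl(zio, zio->io_spa, vd,
            DKIOCFLUSHWRITECACHE, NULL, NULL,
            ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
            ZIO_FLAG_DONT_RETRY));
    }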
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
After the ZIL is reopened, itxs created before the failure are held on
the failure itxg until the cleaner thread comes through and cleans them
up by calling zil_clean(). That's an asynchronous job, so it may not
run immediately.
Previously, if the ZIL failed again while there were still itxs on the
fail list, it would trip assertions in debug mode, while in production
the itx list would be leaked and the previously outstanding fsync()
calls would be lost.
This commit makes it so that if the ZIL fails before the previous
failure has been cleaned up, it will be immediately cleaned up before
being filled with currently outstanding itxs.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
This is possible if spa_sync() completes before the ZIL write/flush
does, which then errors. At this point all itxs are in the past, leaving
us with nothing to wait for.
In a perfect world we would not fail the ZIL, but at this point we've
already locked out the itxgs early, so we have to see it through.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
The previous change added a check to fall back to waiting forever if the
ZIL failed. This check was inverted; it actually caused it to always
enter a timed wait when it failed. When combined with the fact that the
last lwb issued likely would have failed quickly and so had a very small
latency, this caused effectively an infinite loop.
I initially fixed the check, but on further study I decided that this
loop doesn't need to exist. The way the whole logic falls out of the
original code in 2.1.5 is: if the lwb is OPENED, wait, then issue it;
if not (or post-issue), wait forever. The loop will never see more than
two iterations, one for each half of the OPENED check, and it will stop
as soon as the waiter is signaled (zcw_done true), so it can be far
more simply expressed as a linear sequence:
    if (!issued) {
        wait a few
        if (done)
            return
        issue IO
    }
    if (!done)
        wait forever
This still holds when the ZIL fails, because zil_commit_waiter_timeout()
will check for failure under zl_issuer_lock, which zil_fail() will wait
for, and in turn, zil_fail() will wait on zcw_lock and then signal the
waiter before it releases zl_issuer_lock. Taken together, that means
that zil_commit_waiter_timeout() will do all it can under the
circumstances, and waiting forever for the waiter to complete is all we
can do past that point.
(cherry picked from commit c57c2ddd6f803f429da1e2b53abab277d781a5a3)
Various bits of output for catching broken bios.
(cherry picked from commit b1a5bc49acce3cbec56f3bf0638539f836aa2208)
Signed-off-by: Allan Jude <allan@klarasystems.com>
This is the same change as the previous commit, but for scatter abds.
It's less clear if this change is needed. Since scatter abds are only
ever added a page at a time, both sides of the split should always be
added in consecutive segments.
Intuitively though, it may be possible for a partially-filled bio to be
used, or a bio with an odd number of iovecs, and that could then lead
to a misaligned bio. While I've not been able to reproduce this at all,
it seems to make sense to protect against it.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cbdf21fd1a32a5e696a22cad497d9211221fa309)
If we encounter a split page, we add two iovecs to the bio, one for
the fragment of the buffer on each side of the split. In order to do
this safely, we must be sure that we always have room for both
fragments.
It's possible for a linear abd to span multiple pages, in which case we
want to add the "left" fragment, then a run of proper 4K pages, then
the "right" fragment. In this way we can keep whole pages together as
much as possible.
This change handles both cases by noticing a split page. If we don't
have at least two iovecs remaining in the bio, then we abort outright
(allowing the caller to allocate a new bio and try again). We add the
"left" fragment, and note how big we expect the "right" fragment to be.
Then we load in as many full pages as are available.
When we reach the last iovec, we close out the bio by taking as much as
is necessary to restore alignment.
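The guard itself is simple (a sketch using the Linux bio fields; the
surrounding abd-walking code is elided):

    /*
     * A split page needs two iovecs, one fragment for each side.
     * If the bio can't take both, bail out so the caller can
     * allocate a fresh bio and try again.
     */
    if (bio->bi_vcnt + 2 > bio->bi_max_vecs)
        return (B_FALSE);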
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 173cafcc3d8b6c94c61844c705d7a410f412a18e)
A single "page" in an ABD does not necessarily correspond to one segment
in a bio, because of how ZFS does ABD allocations and how it breaks them
up with adding them to a bio. Because of this, simply dividing the ABD
size by the page size can only ever give a minimum number of segments
required, rather than the correct number.
Until we can fix that, we'll just make each bio as large as they can be
for as many segments as the device queue will permit without needing to
split the the bio. This is a little wasteful if we don't intend to put
that many segments in the bio, but its not a lot of memory and its only
lost until the bio is completed.
This also adds a tunable, vdev_disk_max_segs, allowing the value to be
set by the operator. This is very useful for debugging.
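Roughly (sketch; the helper name is an assumption, the tunable is the
one described above):

    /*
     * Size each bio for as many segments as the device queue will
     * take, unless the operator has capped it for debugging.
     */
    static unsigned int
    vdev_bio_max_segs(struct request_queue *q)
    {
        if (vdev_disk_max_segs > 0)
            return (vdev_disk_max_segs);
        return (queue_max_segments(q));
    }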
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit a3a438d1bedb0626417cd73ba10b1479a06bef7f)
This code should be kept in line with the upstream lua version as much
as possible. Therefore, we simply want to silence the warning. This
check was enabled by default as part of -Wall in gcc 12.1.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 4d0c1f14e77cf83d06de7c730de7f93f8a85c2eb)
Extend the buffer slightly to resolve the warning.
cmd/zfs/zfs_main.c: In function ‘upgrade_set_callback’:
cmd/zfs/zfs_main.c:2446:22: error: ‘%llu’ directive output
may be truncated writing between 1 and 20 bytes into a
region of size 16 [-Werror=format-truncation=]
cmd/zfs/zfs_main.c:2445:24: note: ‘snprintf’ output between
2 and 21 bytes into a destination of size 16
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 4a8ce916f9a1836db34b8c0c7d878adaae5bcf5a)
Switch to using asprintf() to satisfy the compiler and resolve the
potential format-overflow warning. Note that the conditional before the
sprintf() would have prevented this regardless.
cmd/zfs/zfs_project.c: In function ‘zfs_project_handle_dir’:
cmd/zfs/zfs_project.c:241:38: error: ‘/’ directive writing
1 byte into a region of size between 0 and 4352
[-Werror=format-overflow=]
cmd/zfs/zfs_project.c:241:17: note: ‘sprintf’ output between
2 and 4609 bytes into a destination of size 4352
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 8e15c80f90f3c80a4026c1f9ed248b4ea8ae41d0)
The ZIL may be failed any time we don't have the issuer lock, which
means we can end up working through zil_commit_impl() even when the ZIL
has already failed.
So, each time we gain the issuer lock, recheck the fail state, and do
not issue IO if it's failed. The waiter will eventually go to sleep
waiting to be signalled when the ZIL reopens in zil_sync()->zil_clean().
(cherry picked from commit 49fa92c8db389f257a15029d643fb026fa5b6dc2)
zil_failed() is called unlocked, while itxg_failed is only checked with
itxg_lock held. This made the ZIL appear not to have failed even while
zil_fail() was in progress, scanning the itx lists. With zil_failed()
returning false, zil_commit() would continue into zil_commit_impl() and
also start processing the itx lists, racing zil_fail() for the locks.
So instead, set the fail state as early as possible, before we start
processing the itx lists. This won't stop new itxs arriving on the itxgs
proper, but it will avoid additional commit itxs being created and will
stop any attempts to collect and commit them.
(cherry picked from commit 17579a79a2b481e746879d5a033626754931441e)
Early on I thought it would be necessary to stop the world making
progress under us, and taking the namespace lock was my initial idea of
how to do that. The right way is a bit more nuanced than that, but as
it turns out, we don't even need it.
To fail the ZIL is effectively to stop it in its tracks and hold onto
all itxs stored within until the operations they represent are
committed to the pool by some other means (i.e. the regular txg sync).
It doesn't matter if the pool makes progress while we're doing this.
If the pool does progress, then zil_clean() will be called to process
any itxs now completed. It will take the itxg_lock, process and destroy
the itxs, and release the lock, leaving the itxg empty.
If zil_fail() is running at the same time, then either zil_clean() will
have cleaned up the itxg and zil_fail() will find it empty, or
zil_fail() will get there first and empty it onto the fail itxg.
(cherry picked from commit 83ce694898f5a89bd382dda0ba09bb8a04ac5666)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719
(cherry picked from commit ff73574cd8)
The type of the sysctls had to be changed from uint_t to int to match
other sysctls in OpenZFS 2.1.5.
Spares can be in multiple pools, even while in use. This means that
the AVAIL/INUSE status check is a bit tricky. spa_add_spares does not
need to be called, but we do need to do the equivalent, which is now
done directly.
Some hardware has issues when issued a write of 0 bytes.
Add a new module parameter, zio_suppress_zero_writes, that when enabled
(the default) will complete these I/Os without sending them to the
hardware.
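Presumably something like this at the point the write would be issued
(sketch; the exact hook point in the pipeline is an assumption):

    /* Complete zero-length writes without touching the hardware. */
    static int zio_suppress_zero_writes = 1;    /* module parameter */

    if (zio_suppress_zero_writes && zio->io_type == ZIO_TYPE_WRITE &&
        zio->io_size == 0) {
        zio_interrupt(zio);     /* complete it; skip the device */
        return (NULL);
    }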
Signed-off-by: Allan Jude <allan@klarasystems.com>
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.
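For example, something like `zpool reguid -g 0x1234567890abcdef tank`
would assign that specific GUID to the pool (exact argument format per
the man page).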
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.