Commit Graph

3658 Commits

Don Brady f2d25f56f5 vdev probe to slow disk can stall mmp write checker
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.

Also allow zpool clear if no evidence of another host
is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15839
2024-04-30 16:48:01 +10:00
George Wilson dbf9aed3cf Parallel pool import
This commit allows spa_load() to drop the spa_namespace_lock so
that imports can happen concurrently. Prior to dropping the
spa_namespace_lock, the import logic will set the spa_load_thread
value to track the thread which is doing the import.

Consumers of spa_lookup() retain the same behavior by blocking
when either a thread is holding the spa_namespace_lock or the
spa_load_thread value is set. This will ensure that critical
concurrent operations cannot take place while a pool is being
imported.

The zpool command is also enhanced to provide multi-threaded support
when invoking zpool import -a.

Lastly, zinject provides a mechanism to insert artificial delays
when importing a pool, and new zfs tests are added to verify parallel
import functionality.

Contributions-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #16093
2024-04-30 16:47:59 +10:00
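
As a rough illustration of the gating described above, here is a minimal
sketch; spa_load_thread is the field named in the commit, while the
wrapper name and locking details are assumptions, not the actual
spa_misc.c code.

    /* Hedged sketch: block while another thread is importing this pool. */
    static spa_t *
    spa_lookup_blocking(const char *name)
    {
        spa_t *spa;

        mutex_enter(&spa_namespace_lock);
        while ((spa = spa_lookup(name)) != NULL &&
            spa->spa_load_thread != NULL &&
            spa->spa_load_thread != curthread) {
            /* an import owns this spa; wait for it to finish */
            cv_wait(&spa_namespace_cv, &spa_namespace_lock);
        }
        mutex_exit(&spa_namespace_lock);
        return (spa);
    }
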
Don Brady 4296626ab0 Extend import_progress kstat with a notes field
Detail the import progress of log spacemaps as they can take a very
long time.  Also grab the spa_note() messages too, as they provide
insight into what is happening.

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes #15539
2024-04-30 16:33:14 +10:00
Rob Norris 0ab4172e4c import: require force when cachefile hostid doesn't match on-disk
Previously, if a cachefile is passed to zpool import, the cached config
is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and
the results are taken as the canonical pool config and handed back to
ZFS_IOC_POOL_IMPORT.

In the course of its operation, spa_load() will inspect the pool and
build a new config from what it finds on disk. However, it then
regenerates a new config ready to import, and so rightly sets the hostid
and hostname for the local host in the config it returns.

Because of this, the "require force" checks always decide the pool is
exported and last touched by the local host, even if this is not true,
which is possible in a HA environment when MMP is not enabled. The pool
may be imported on another head, but the import checks still pass here,
so the pool ends up imported on both.

(This doesn't happen when a cachefile isn't used, because the pool
config is discovered in userspace in zpool_find_import(), and that does
find the on-disk hostid and hostname correctly).

Since the systemd zfs-import-cache.service unit uses cachefile imports,
this can lead to a system returning after a crash with a "valid"
cachefile on disk and automatically, quietly, importing a pool that has
already been taken up by a secondary head.

This commit causes the on-disk hostid and hostname to be included in the
ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the
"force" checks for zpool import to use them if present.

This method should give no change in behaviour for old userspace on new
kernels (they won't know to look for the new config items) and for new
userspace on old kernels (they won't find the new config items).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15290
(cherry picked from commit 54b1b1d893)
2024-04-03 10:23:13 +11:00
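
A hedged sketch of the nvlist change described above; the helper is
hypothetical and the exact keys placed under ZPOOL_CONFIG_LOAD_INFO may
differ from these.

    /* Hedged sketch: surface the on-disk host identity to userspace. */
    static void
    spa_add_ondisk_host_info(nvlist_t *load_info, uint64_t ondisk_hostid,
        const char *ondisk_hostname)
    {
        fnvlist_add_uint64(load_info, ZPOOL_CONFIG_HOSTID, ondisk_hostid);
        fnvlist_add_string(load_info, ZPOOL_CONFIG_HOSTNAME, ondisk_hostname);
    }
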
Rob Norris 831fdad1b0 slack: require module param before enabling slack compression
Just an extra explicit opt-in, since the compression method it switches
to isn't upstreamed yet.
2024-04-03 10:23:13 +11:00
Rob Norris 014ff864d8 slack: fix decompression
Turns out decompression never worked at all; likely an oversight
converting the original "transparent" versions to a true compression
option. This makes it work.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris 4e9cbd04a4 spa: add zio_taskq_trylock tuneable
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris 7d74fc7a68 spa_taskq_dispatch_ent: look for an unlocked taskq before waiting
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris c85d31a88b taskq: add taskq_try_dispatch_ent()
Non-blocking form of taskq_dispatch_ent(), returns false if it can't
acquire the taskq lock.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
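
A minimal sketch of the non-blocking form described above;
taskq_dispatch_ent_locked() is a hypothetical stand-in for the existing
dispatch body, and the real taskq lock type may differ.

    /* Hedged sketch: fail fast instead of sleeping on a contended taskq. */
    boolean_t
    taskq_try_dispatch_ent(taskq_t *tq, task_func_t func, void *arg,
        uint_t flags, taskq_ent_t *t)
    {
        if (!mutex_tryenter(&tq->tq_lock))
            return (B_FALSE);    /* caller may try another taskq or block */

        taskq_dispatch_ent_locked(tq, func, arg, flags, t);
        mutex_exit(&tq->tq_lock);
        return (B_TRUE);
    }
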
Rob Norris 2fec57bf03 trace: spa_taskqs_ent trace class, dispatch and dispatched tracepoints
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Mark Johnston 79d092987b spa: Let spa_taskq_param_get()'s addition of a newline be optional
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
(cherry picked from commit 4cf943769d3decd73b379eb069cebe0c148e3f39)
2024-04-03 10:23:13 +11:00
Mark Johnston 92063ad7d4 spa: Fix FreeBSD sysctl handlers
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl().  In particular, this code triggers an assertion
failure in sbuf_clear().

Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.

Fixes: 6930ecbb7 ("spa: make read/write queues configurable")
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
(cherry picked from commit e5230fb9d2c3670feda07b8a0759ac446749c247)
2024-04-03 10:23:13 +11:00
Rob Norris 1339b7e0ac spa: make read/write queues configurable
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there are very few ways to change behaviour without patching.

This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.

This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kinds of systems.

Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. It's expected that anyone using this is going
to have the source code there as well.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 12a031a3f5)
2024-04-03 10:23:13 +11:00
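
For illustration only, here is one generic way a load-time string
parameter can be exposed on Linux; the actual OpenZFS plumbing, and the
format of the zio_taskq_read/zio_taskq_write strings, live in
module/zfs/spa.c and are not reproduced here.

    /* Hedged illustration: generic module parameter, not the ZFS wiring. */
    #include <linux/module.h>

    static char zio_taskq_read[256];    /* taskq config string, parsed at load */
    module_param_string(zio_taskq_read, zio_taskq_read,
        sizeof (zio_taskq_read), 0444);
    MODULE_PARM_DESC(zio_taskq_read, "Configure the READ IO taskqs");
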
Rob Norris 657c1f75ad vdev_disk: don't touch vbio after it's handed off to the kernel
After IO is unplugged, it may complete immediately and vbio_completion
may be called in interrupt context. That may interrupt or deschedule our
task. If it's the last bio, the vbio will be freed. Then, we get
rescheduled, and try to write to freed memory through vbio->.

This patch just removes the cleanup and the corresponding assert.
These were leftovers from a previous iteration of vbio_submit() and were
always "belt and suspenders" ops anyway, never strictly required.

Reported-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 34f662ad22206af6852020fd923ceccd836a855f)
2024-04-03 10:22:54 +11:00
Rob Norris 9e1afd0d91 abd_iter_page: don't use compound heads on Linux <4.5
Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
in a compound page were refcounted separately. This means that using the
head page without taking a reference to it could see it cleaned up later
before we're finished with it. Specifically, bio_add_page() would take a
reference, and drop its reference after the bio completion callback
returns.

If the zio is executed immediately from the completion callback, this is
usually ok, as any data is referenced through the tail page referenced
by the ABD, and so becomes "live" that way. If there's a delay in zio
execution (high load, error injection), then the head page can be freed,
along with any dirty flags or other indicators that the underlying
memory is used. Later, when the zio completes and that memory is
accessed, it's either unmapped and an unhandled fault takes down the
entire system, or it is mapped and we end up messing around in someone
else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head
page when we use it, and release it when we're done. There's not really
a sensible way under our current structure to do this; the "best" would
be to keep a list of head page references in the ABD, and release them
when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where
head and tail pages share refcounts, I've opted to simply not use the
compound head in ABD page iteration there. This is theoretically less
efficient (though cleaning up head page references would add overhead),
but it's safe, and we still get the other benefits of not mapping pages
before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we
can match on to discover this behaviour in configure (and the mm/page
APIs have changed a lot since then anyway), so I've gone with a simple
version check.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit c6be6ce175)
2024-04-03 10:15:02 +11:00
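
A hedged sketch of the version check described above; the helper name is
illustrative.

    #include <linux/version.h>
    #include <linux/mm.h>

    /* Hedged sketch: only use the compound head where head/tail share refcounts. */
    static inline struct page *
    abd_bio_page(struct page *pg)
    {
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 5, 0)
        return (compound_head(pg));
    #else
        return (pg);    /* pre-4.5: using the head risks a use-after-free */
    #endif
    }
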
Rob Norris ce5719554f vdev_disk: use bio_chain() to submit multiple BIOs
Simplifies our code a lot, since we no longer have to wait for each BIO
and reassemble them.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 72fd834c47)
2024-04-03 10:12:51 +11:00
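
A hedged sketch of the stock bio_chain() pattern this moves to: each BIO
is chained to the next, so only the final completion fires once every
BIO is done. This is illustrative, not the vdev_disk code.

    /* Hedged sketch: bios[n-1]'s end_io runs only after all n bios complete. */
    static void
    submit_bios_chained(struct bio **bios, int n)
    {
        for (int i = 0; i < n - 1; i++)
            bio_chain(bios[i], bios[i + 1]);    /* bios[i+1] waits on bios[i] */
        for (int i = 0; i < n; i++)
            submit_bio(bios[i]);
    }
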
Rob Norris ba6d0fb060 vdev_disk: add module parameter to select BIO submission method
This makes the submission method selectable at module load time via the
`zfs_vdev_disk_classic` parameter, allowing this change to be backported
to 2.2 safely, and disabled in favour of the "classic" submission method
if new problems come up.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 2382fdb0a83a5a3c6cf3860695d3f29281773170)
2024-04-03 10:10:20 +11:00
Rob Norris 1b16b90ae1 vdev_disk: rewrite BIO filling machinery to avoid split pages
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.

### BIO segment limits are set incorrectly

The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`), and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.

Leaving aside the fact that the page count method is wrong (see below),
not limiting to the device segment max means that the device driver will
need to split the BIO in half. This alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.

### BIOs are filled inefficiently

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
It can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This also
further contributes to the problem caused by the incorrect segment limit
calculation, as it's much easier to go past the device limit and so
require a split.

Again, this is not a problem on its own.

### Incomplete pages are submitted to BIOs

The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.

In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:

     -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |        |
          :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).

---

In ideal circumstances, these shortcomings don't cause any particular
problems. It's when they start to interact with other ZFS features that
things get interesting.

### Aggregation

Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.

Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up having either a
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.

### Gang blocks

If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.

Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, it's a single 512-byte segment.

### Aggregation with gang blocks

Pulling all this together, consider a large aggregated write of gang
blocks. This results in a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.

### Configure maximum segment size for device

`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.

### ABDs checked up front for page count and alignment

We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point without creating a misaligned BIO. Along the
way, we determine how many BIO segments we'll need to handle the entire
ABD, accounting for BIO fill limits (including segment and byte limits).

If the pages in the BIO are not usable (as in any of the above
situations), the ABD is linearised, and then checked again. This is the
same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's
variable page size and allocator quirks.

In the end, a count of segments is produced, which is then used to
determine how many BIOs will be allocated.

### Virtual block IO object

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 588a6a2d38f20cd6e0d458042feda1831b302207)
2024-04-03 10:10:12 +11:00
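
As a rough illustration of the "checked up front" step described above,
a hedged sketch follows; the chunk type is hypothetical and stands in
for the (page, offset, length) runs produced by scanning the ABD.

    /* Hedged sketch: interior chunks must cover whole, page-aligned pages. */
    typedef struct page_chunk {
        struct page *pc_page;
        size_t      pc_off;
        size_t      pc_len;
    } page_chunk_t;

    static boolean_t
    chunks_can_fill_bios(const page_chunk_t *pc, int nchunks)
    {
        for (int i = 0; i < nchunks; i++) {
            boolean_t first = (i == 0);
            boolean_t last = (i == nchunks - 1);

            if (!first && pc[i].pc_off != 0)
                return (B_FALSE);    /* starts mid-page: a split could misalign */
            if (!last && (pc[i].pc_off + pc[i].pc_len) % PAGE_SIZE != 0)
                return (B_FALSE);    /* ends mid-page: a split could misalign */
        }
        return (B_TRUE);    /* safe to load as-is; caller linearises on failure */
    }
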
Rob Norris 21f6808780 vdev_disk: make read/write IO function configurable
This is just setting up for the next couple of commits, which will add a
new IO function and a parameter to select it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 7ee83696cffac172eea89844ccc5e6b6899781ac)
2024-04-03 10:10:08 +11:00
Rob Norris 1feddf1ed6 vdev_disk: reorganise vdev_disk_io_start
Light reshuffle to make it a bit more linear to read and get rid of a
bunch of args that aren't needed in all cases.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit ad847ff6acb77fbba0f3ab2e864784225fd41007)
2024-04-03 10:10:04 +11:00
Rob Norris d00ab549d4 vdev_disk: rename existing functions to vdev_classic_*
This is just renaming the existing functions we're about to replace and
grouping them together to make the next commits easier to follow.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 9bf6a7c8c3bdcc4e5975fa5baf6e9ff6f279a553)
2024-04-03 10:09:57 +11:00
Rob Norris edf4cf0ce7 abd: add page iterator
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.

This adds abd_iterate_page_func, which yields unmapped struct page
instead.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 930b785c61e9724f0a3a0e09571032ed397f368c)
2024-04-03 10:09:53 +11:00
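
A hedged sketch of a page-level callback, here counting BIO segment
candidates without mapping anything; the exact callback signature used
by abd_iterate_page_func is assumed.

    /* Hedged sketch: count page runs via the unmapped-page iterator. */
    static int
    abd_count_pages_cb(struct page *pg, size_t off, size_t len, void *priv)
    {
        (*(unsigned int *)priv)++;    /* one candidate BIO segment per run */
        return (0);                   /* non-zero would stop the iteration */
    }
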
Rob N 40b2b48fa5 Consider `dnode_t` allocations in dbuf cache size accounting
Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory accounting
appears to show. In some cases (eg a metadnode data block with all child
dnode_t slots allocated), the actual size can be as much as 3x as what
the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t. On the other
hand, in the above case of dnodes, evicting cached metadnode dbufs is
the only current way to reclaim the dnode objects, and can lead to the
situation where the dbuf cache appears to be comfortably within its
target memory window and yet is holding enormous amounts of slab memory
that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.

Metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This informs the dbuf cache machinery of that fact, allowing it to make
better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15511
(cherry picked from commit 92dc4ad83d)
2024-04-03 09:58:25 +11:00
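
A hedged sketch of how a dbuf user might report its footprint;
dmu_buf_add_user_size()/dmu_buf_sub_user_size() follow the upstream
change referenced in the cherry-pick, but treat the exact signatures as
assumptions.

    /* Hedged sketch: tell the dbuf cache about attached dnode_t allocations. */
    static void
    dnode_buf_user_size_adjust(dmu_buf_t *db, int nslots, boolean_t attach)
    {
        uint64_t size = (uint64_t)nslots * sizeof (dnode_t);

        if (attach)
            dmu_buf_add_user_size(db, size);
        else
            dmu_buf_sub_user_size(db, size);
    }
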
Rob N 66586d604a dnode_is_dirty: check dnode and its data for dirtiness
Over its history, the dirty dnode test has been changed between
checking for a dnode being on `os_dirty_dnodes` (`dn_dirty_link`) and
`dn_dirty_record`.

  de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
  2531ce372 Revert "Report holes when there are only metadata changes"
  ec4f9b8f3 Report holes when there are only metadata changes
  454365bba Fix dirty check in dmu_offset_next()
  66aca2473 SEEK_HOLE should not block on txg_wait_synced()

Also illumos/illumos-gate@c543ec060d illumos/illumos-gate@2bcf0248e9

It turns out both are actually required.

In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added.  Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.

The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.

Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.

This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.

This change simply updates the dirty check to check both types of dirty.
If there's anything dirty at all, we immediately go to the "wait for
sync" stage, It doesn't really matter after that; both changes are on
disk, so the dirty fields should be correct.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15571
Closes #15526
(cherry picked from commit 30d581121b)
2023-11-29 22:19:29 +00:00
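
Per the description above, the corrected check looks along these lines
(a sketch; field names follow the commit text and may differ slightly
from the tree):

    boolean_t
    dnode_is_dirty(dnode_t *dn)
    {
        mutex_enter(&dn->dn_mtx);
        for (int i = 0; i < TXG_SIZE; i++) {
            /* dirty if the dnode is on a dirty list OR has dirty records */
            if (multilist_link_active(&dn->dn_dirty_link[i]) ||
                !list_is_empty(&dn->dn_dirty_records[i])) {
                mutex_exit(&dn->dn_mtx);
                return (B_TRUE);
            }
        }
        mutex_exit(&dn->dn_mtx);
        return (B_FALSE);
    }
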
Alexander Motin 976c01223a Improve log spacemap load time
The previous flushing algorithm limited only the total number of log
blocks to the minimum of 256K and 4x the number of metaslabs in the
pool.  As a result, a system with 1500 disks, each with 1000 metaslabs,
touching several new metaslabs each TXG, could grow the spacemap log to
a huge size without much benefit.  We've observed one such system take
about 45 minutes to import a pool.

This patch improves the situation in five ways:
 - By limiting the maximum period for each metaslab to be flushed to 1000
TXGs, which effectively limits the maximum number of per-TXG spacemap
logs to load to the same number.
 - By making flushing smoother, accounting the number of metaslabs that
were touched after the last flush and actually need another flush, not
just an ms_unflushed_txg bump.
 - By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
 - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making the log spacemap load process for wide HDD pools
CPU-bound, accelerating it by many times.
 - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
the inherently single-threaded log processing time from ~10 to ~5
minutes.

As a further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change requiring a new pool feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12789
(cherry picked from commit cbfe5cb849518dd8fb65bf94a72fd88a15093a67)
2023-10-23 21:41:18 +00:00
Alexander Motin fc1bd532b7 Add more control/visibility to spa_load_verify().
Use error thresholds from the policy to control whether to scrub data
and/or metadata.  If a threshold is set to UINT64_MAX, then the caller
probably does not care about the result and we may skip that part.

By default, import neither sets the data error threshold nor reads
the error counter, so skip the data scrub for a faster import.
Metadata are still scrubbed, and the import fails if even a single
error is found.

While here, just for symmetry, return the number of metadata errors in
case the threshold is not set to zero and we haven't reached it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13022
(cherry picked from commit f2c5bc150e)
2023-09-27 20:30:22 +00:00
Rob Norris dd5e7171dc zil: only flush leaf vdevs that were actually written to
Enables vdev traces for ZIL writes, and then only issues flushes to
things that were written to.

This simplifies a few things. We no longer have to extract the toplevel
vdevs to flush from the block pointer; instead we just look at what was
written. The vdev tree remains as a means to defer flushes to the next
lwb, which means a bit more copying of trees, but also means we no longer
have to lock the tree.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
Rob Norris ca18c876e5 zio: function to issue flushes by trace tree
If you have a trace tree from, say, a write, hand it directly to
zio_flush_traced() to flush only the leaf vdevs that were involved in
the write.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
Rob Norris 4b07fb9a98 zio: expose trace node alloc/free/compare
Meant for external callers to be able to build trace trees that can
later be submitted back to zio for work. It's hardly necessary, but it
saves needing to double up on the kmem cache and comparison function.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
Rob Norris efeeeec2a4 zio: add vdev tracing machinery
The idea here is that you can add a flag to a zio, and every vdev that
contributed to the successful completion of that zio will be referenced
on the "trace tree". You can poke around in here from your _done handler
to do any per-vdev followup work.

The actual use case is to track the vdevs that were actually written to,
in order to have a list of vdevs that we should flush. That's why it
looks like the ZIL vdev flush tracker - the only difference is that it
will also list interior and leaf vdevs, not just toplevel vdevs.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
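
A hedged sketch of what a trace-tree node and its comparator might look
like; the type and field names here are assumptions, not necessarily
what the patch uses.

    /* Hedged sketch: one AVL node per contributing vdev, keyed by guid. */
    typedef struct zio_vdev_trace {
        uint64_t    zvt_guid;    /* guid of a vdev that completed part of the zio */
        avl_node_t  zvt_node;
    } zio_vdev_trace_t;

    static int
    zio_vdev_trace_compare(const void *a, const void *b)
    {
        const zio_vdev_trace_t *za = a;
        const zio_vdev_trace_t *zb = b;

        return (TREE_CMP(za->zvt_guid, zb->zvt_guid));
    }
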
Richard Yao 609550bc5f Convert enum zio_flag to uint64_t
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086
2023-09-07 18:54:12 +00:00
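
In sketch form, the change amounts to widening the type and defining the
flags as 64-bit shifts (only the first couple of existing flag names are
shown):

    typedef uint64_t zio_flag_t;    /* was enum zio_flag, capped at 32 bits */

    #define ZIO_FLAG_DONT_AGGREGATE    (1ULL << 0)
    #define ZIO_FLAG_IO_REPAIR         (1ULL << 1)
    /* ... remaining flags continue as (1ULL << n), with room beyond 32 ... */
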
Rob Norris 1a160e3b08 zio: refactor zio_flush()
zio_flush() is the only user of zio_ioctl(), and zio_ioctl()'s structure
and flag use is fairly specific to flushing. So here we bring the guts of
zio_ioctl() into zio_flush(), allowing some light reorganising (mostly
around how zio_nowait() is called) and a better signature.

This will help in the future when changing the way flush works, as it's
clear where the change should be made, with no wondering whether
zio_ioctl() is being used somewhere else.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:11 +00:00
Rob Norris 97b5e0bbbd zil_fail: handle failure when previous failure not cleaned up yet
After the ZIL is reopened, itxs created before the failure are held on
the failure itxg until the cleaner thread comes through and cleans them
up by calling zil_clean(). That's an asynchronous job, so it may not run
immediately.

Previously, if the ZIL failed again while there were still itxs on the
fail list, it would trip assertions in debug mode, and in production,
the itx list would be leaked and the previous outstanding fsync() calls
would be lost.

This commit makes it so that if the ZIL fails before the previous
failure has been cleaned up, it will be immediately cleaned up before
being filled with currently outstanding itxs.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:11 +00:00
Rob Norris 27e39dd59a zil_fail: handle the case where no itxs are found
This is possible if spa_sync() completes before the ZIL write/flush
does, which then errors. At this point all itxs are in the past, leaving
us with nothing to wait for.

In a perfect world we would not fail the ZIL, but at this point we've
already locked out the itxgs early, so we have to see it through.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:11 +00:00
Rob Norris 5b98ced63c zil: refactor zil_commit_waiter()
The previous change added a check to fall back to waiting forever if the
ZIL failed. This check was inverted; it actually caused it to always
enter a timed wait when it failed. When combined with the fact that the
last lwb issued likely would have failed quickly and so had a very small
latency, this caused effectively an infinite loop.

I initially fixed the check, but on further study I decided that this
loop doesn't need to exist. The way the whole logic falls out of the
original code in 2.1.5 is that if the lwb is OPENED, wait then issue it,
and if not (or post issue), wait forever. The loop will never see more
than two iterations, one for each half of the OPENED check, and it will
stop as soon as the waiter is signaled (zcw_done true), so it can be far
more simply expressed as a linear sequence:

    if (!issued) {
        wait a few
        if (done)
            return
        issue IO
    }
    if (!done)
        wait forever

This still holds when the ZIL fails, because zil_commit_waiter_timeout()
will check for failure under zl_issuer_lock, which zil_fail() will wait
for, and in turn, zil_fail() will wait on zcw_lock and then signal the
waiter before it releases zl_issuer_lock. Taken together, that means
that zil_commit_waiter_timeout() will do all it can under the
circumstances, and waiting forever for the waiter to complete is all we
can do past that point.

(cherry picked from commit c57c2ddd6f803f429da1e2b53abab277d781a5a3)
2023-08-10 19:10:59 +00:00
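
Rendered as a hedged C sketch (locking detail and error paths elided;
the zcw/lwb field names follow the existing code), the linear sequence
above is roughly:

    static void
    zil_commit_waiter_sketch(zilog_t *zilog, zil_commit_waiter_t *zcw)
    {
        mutex_enter(&zcw->zcw_lock);

        lwb_t *lwb = zcw->zcw_lwb;
        if (lwb != NULL && lwb->lwb_state == LWB_STATE_OPENED) {
            /* "wait a few"; on timeout this path issues the lwb itself */
            zil_commit_waiter_timeout(zilog, zcw);
        }

        /* issued (or past issue): wait forever until signalled */
        while (!zcw->zcw_done)
            cv_wait(&zcw->zcw_cv, &zcw->zcw_lock);

        mutex_exit(&zcw->zcw_lock);
    }
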
Rob Norris d6d343bb87 Revert debug code from bio-seg-limit
This patch reverts parts of bf08bc108dce6f1ecd0820c8b5a67b6fb7962c7e.
2023-08-10 19:10:59 +00:00
Rob Norris bf18873541 bio-seg-limit debug
Various bits of output for catching broken bios.

(cherry picked from commit b1a5bc49acce3cbec56f3bf0638539f836aa2208)
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-31 15:05:56 +00:00
Rob Norris 74e8091130 abd_bio_map_off: avoid splitting scatter pages across bios
This is the same change as the previous commit, but for scatter abds.

It's less clear whether this change is needed. Since scatter abds are
only ever added a page at a time, both sides of the split should always be
added in consecutive segments.

Intuitively though, it may be possible for a partially-filled bio to be
used, or a bio with an odd number of iovecs, and that then could lead to
a misaligned bio. While I've not been able to reproduce this at all, it
seems to make sense to protect against it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cbdf21fd1a32a5e696a22cad497d9211221fa309)
2023-07-31 15:05:56 +00:00
Rob Norris 1bd475d6c6 bio_map: avoid splitting abd pages across bios
If we encounter a split page, we add two iovecs to the bio, one for the
fragment of the buffer on each side of the split. In order to do this
safely, we must be sure that we always have room for both fragments.

It's possible for a linear abd to have multiple pages, in which case we
want to add the "left" fragment, then a run of proper 4K pages, then the
"right" fragment. In this way we can keep whole pages together as much
as possible.

This change handles both cases by noticing a split page. If we don't
have at least two iovecs remaining in the bio, then we abort outright
(allowing the caller to allocate a new bio and try again). We add the
"left" fragment, and note how big we expect the right fragment to be.
Then we load in as many full pages as are available.

When we reach the last iovec, we close out the bio by taking as much as
is necessary to restore alignment.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 173cafcc3d8b6c94c61844c705d7a410f412a18e)
2023-07-31 15:05:56 +00:00
Rob Norris 98bcc390e8 vdev_disk: rework bio max segment calculation
A single "page" in an ABD does not necessarily correspond to one segment
in a bio, because of how ZFS does ABD allocations and how it breaks them
up when adding them to a bio. Because of this, simply dividing the ABD
size by the page size can only ever give a minimum number of segments
required, rather than the correct number.

Until we can fix that, we'll just make each bio as large as it can be,
with room for as many segments as the device queue will permit without
needing to split the bio. This is a little wasteful if we don't intend to
put that many segments in the bio, but it's not a lot of memory and it's
only lost until the bio is completed.

This also adds a tuneable, vdev_disk_max_segs, to allow this value to
be set by the operator. This is very useful for debugging.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit a3a438d1bedb0626417cd73ba10b1479a06bef7f)
2023-07-31 15:05:56 +00:00
Rob Norris f615207ee6 metaslab: add tuneables to better control when to force ganging
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cf152e9d4da8fc82c940355ba444de915a00d754)
2023-07-31 15:05:56 +00:00
Brian Behlendorf 564e07e5c8 Silence -Winfinite-recursion warning in luaD_throw()
This code should be kept in line with the upstream lua version as much
as possible.  Therefore, we simply want to silence the warning.  This
check was enabled by default as part of -Wall in gcc 12.1.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 4d0c1f14e77cf83d06de7c730de7f93f8a85c2eb)
2023-07-31 15:05:56 +00:00
Rob Norris 7121e678a2 zil_fail: recheck ZIL fail state before issuing IO
The ZIL may be failed any time we don't have the issuer lock, which
means we can end up working through zil_commit_impl() even when the ZIL
already failed.

So, each time we gain the issuer lock, recheck the fail state, and do
not issue IO if it has failed. The waiter will eventually go to sleep,
waiting to be signalled when the ZIL reopens in zil_sync()->zil_clean().

(cherry picked from commit 49fa92c8db389f257a15029d643fb026fa5b6dc2)
2023-07-31 15:05:56 +00:00
Rob Norris 88bb9a3add zil_fail: set fail state as early as possible
zil_failed() is called unlocked, and itxg_failed is only checked with
itxg_lock held. This was making the ZIL appear to be not failed even as
zil_fail() was in progress, scanning the itx lists. With zil_failed()
returning false, zil_commit() would continue to zil_commit_impl() and
also start processing itx lists, racing each other for locks.

So instead, set the fail state as early as possible, before we start
processing the itx lists. This won't stop new itxs arriving on the itxgs
proper, but it will avoid additional commit itxs being created and will
stop any attempts to collect and commit them.

(cherry picked from commit 17579a79a2b481e746879d5a033626754931441e)
2023-07-31 15:05:56 +00:00
Rob Norris 43f45f8df0 zil_fail: don't take the namespace lock
Early on I thought it would be necessary to stop the world making
progress under us, and taking the namespace lock was my initial idea of
how to do that. The right way is a bit more nuanced than that, but as it
turns out, we don't even need it.

To fail the ZIL is effectively to stop it in its tracks and hold onto
all itxs stored within until the operations they represent are
committed to the pool by some other means (ie the regular txg sync).

It doesn't matter if the pool makes progress while we're doing this. If
the pool does progress, then zil_clean() will be called to process any
itxs now completed. That means taking the itxg_lock, processing and
destroying the itxs, and releasing the lock, leaving the itxg empty.

If zil_fail() is running at the same time, then either zil_clean() will
have cleaned up the itxg and zil_fail() will find it empty, or
zil_fail() will get there first and empty it onto the fail itxg.

(cherry picked from commit 83ce694898f5a89bd382dda0ba09bb8a04ac5666)
2023-07-31 15:05:56 +00:00
Rob Norris 3cb140863f zil_fail: fix infinite loop in commit itx search
(cherry picked from commit ba1f888858f8599e10211aa9957d942e7bcc36ce)
2023-07-31 15:05:56 +00:00
Rob N 6c36d72e71 vdev: expose zfs_vdev_max_ms_shift as a module parameter
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719

(cherry picked from commit ff73574cd8)

The type of the sysctls had to be changed from uint_t to int to match
other sysctls in OpenZFS 2.1.5.
2023-07-31 15:05:56 +00:00
Fred Weigel b037880efb JSON spares state readout
Spares can be in multiple pools, even if in use. This means that the
AVAIL/INUSE status check is a bit tricky. spa_add_spares does not
need to be called, but we do need to do the equivalent, which is
now done directly.
2023-07-05 13:27:31 +00:00
Fred Weigel f7574ccff3 Update to json output.
Adds the missing proper state output for spare devices. Everything else
should be OK.
2023-07-05 13:27:31 +00:00
Rob Norris 802c258fc1 compress: add "slack" compression options
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00