Commit Graph

7324 Commits

Author SHA1 Message Date
Rob Norris 98bcc390e8 vdev_disk: rework bio max segment calculation
A single "page" in an ABD does not necessarily correspond to one segment
in a bio, because of how ZFS does ABD allocations and how it breaks them
up when adding them to a bio. Because of this, simply dividing the ABD
size by the page size can only ever give a minimum number of segments
required, rather than the correct number.

Until we can fix that, we'll just make each bio as large as it can be
for as many segments as the device queue will permit without needing to
split the bio. This is a little wasteful if we don't intend to put
that many segments in the bio, but it's not a lot of memory and it's only
lost until the bio is completed.

This also adds a tuneable, vdev_disk_max_segs, to allow this
value to be set by the operator. This is very useful for debugging.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit a3a438d1bedb0626417cd73ba10b1479a06bef7f)
2023-07-31 15:05:56 +00:00
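To illustrate why dividing the ABD size by the page size only yields a minimum, as the vdev_disk commit above describes, here is a minimal sketch. It is a simplified model, not ZFS or kernel code: `chunk_t`, `min_segments()` and `touched_segments()` are hypothetical names, the page size is assumed to be 4096 bytes, and each chunk is assumed to be split at page boundaries.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* Hypothetical description of one chunk of a scattered buffer. */
typedef struct {
    size_t offset; /* byte offset of the chunk's start */
    size_t len;    /* length in bytes */
} chunk_t;

/* Lower bound: total size divided by page size, rounded up. */
static size_t
min_segments(size_t total_len)
{
    return ((total_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Pages actually touched when each chunk is split at page boundaries. */
static size_t
touched_segments(const chunk_t *chunks, size_t n)
{
    size_t segs = 0;

    assert(chunks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (chunks[i].len == 0)
            continue;
        /* Offset within the first page the chunk touches. */
        size_t start = chunks[i].offset % SKETCH_PAGE_SIZE;
        segs += (start + chunks[i].len + SKETCH_PAGE_SIZE - 1) /
            SKETCH_PAGE_SIZE;
    }
    return (segs);
}
```

In this model, two 4096-byte chunks total 8192 bytes, so the lower bound is two segments, but if one chunk starts mid-page it straddles two pages and the real count is three.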
Rob Norris f615207ee6 metaslab: add tuneables to better control when to force ganging
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cf152e9d4da8fc82c940355ba444de915a00d754)
2023-07-31 15:05:56 +00:00
Brian Behlendorf 564e07e5c8 Silence -Winfinite-recursion warning in luaD_throw()
This code should be kept in line with the upstream lua version as much
as possible.  Therefore, we simply want to silence the warning.  This
check was enabled by default as part of -Wall in gcc 12.1.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 4d0c1f14e77cf83d06de7c730de7f93f8a85c2eb)
2023-07-31 15:05:56 +00:00
Brian Behlendorf 83bb712a3a Fix -Wformat-truncation warning in upgrade_set_callback()
Extend the buffer slightly to resolve the warning.

    cmd/zfs/zfs_main.c: In function ‘upgrade_set_callback’:
    cmd/zfs/zfs_main.c:2446:22: error: ‘%llu’ directive output
    may be truncated writing between 1 and 20 bytes into a
    region of size 16 [-Werror=format-truncation=]
    cmd/zfs/zfs_main.c:2445:24: note: ‘snprintf’ output between
    2 and 21 bytes into a destination of size 16

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 4a8ce916f9a1836db34b8c0c7d878adaae5bcf5a)
2023-07-31 15:05:56 +00:00
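The arithmetic behind the -Wformat-truncation warning above can be sketched as follows. This is an illustration only, not the code from zfs_main.c: `format_u64()` and `U64_DEC_BUFLEN` are hypothetical names, and a 64-bit `unsigned long long` is assumed.

```c
#include <assert.h>
#include <stdio.h>

/*
 * "%llu" can print up to 20 digits for a 64-bit value, so the buffer
 * needs 21 bytes including the terminating NUL. A 16-byte buffer is
 * too small, which is what the compiler reported.
 */
#define U64_DEC_BUFLEN 21

/* Returns 0 on success, -1 if the output would have been truncated. */
static int
format_u64(char *buf, size_t buflen, unsigned long long v)
{
    assert(buf != NULL && buflen > 0);
    int n = snprintf(buf, buflen, "%llu", v);
    return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}
```

Checking the return value of snprintf() against the buffer size is a general way to detect truncation at run time, independent of how large the buffer is made.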
Brian Behlendorf 7933440875 Fix -Wformat-overflow warning in zfs_project_handle_dir()
Switch to using asprintf() to satisfy the compiler and resolve the
potential format-overflow warning.  Note the conditional before the
sprintf() would have prevented this regardless.

    cmd/zfs/zfs_project.c: In function ‘zfs_project_handle_dir’:
    cmd/zfs/zfs_project.c:241:38: error: ‘/’ directive writing
    1 byte into a region of size between 0 and 4352
    [-Werror=format-overflow=]
    cmd/zfs/zfs_project.c:241:17: note: ‘sprintf’ output between
    2 and 4609 bytes into a destination of size 4352

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 8e15c80f90f3c80a4026c1f9ed248b4ea8ae41d0)
2023-07-31 15:05:56 +00:00
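The commit above switches to asprintf(), which allocates a buffer of exactly the required size. Because asprintf() is an extension rather than ISO C, the sketch below swaps in a portable equivalent (measure, allocate, then format) to show the same idea. `join_path()` is a hypothetical helper, not the function from zfs_project.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Join a directory and an entry name into a freshly allocated string.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *
join_path(const char *dir, const char *name)
{
    assert(dir != NULL && name != NULL);

    /* Room for both strings, the '/' separator and the NUL. */
    size_t len = strlen(dir) + 1 + strlen(name) + 1;
    char *out = malloc(len);

    if (out == NULL)
        return (NULL);
    (void) snprintf(out, len, "%s/%s", dir, name);
    return (out);
}
```

Sizing the buffer from the inputs removes the fixed-size destination that the compiler could not prove was large enough.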
Rob Norris 7121e678a2 zil_fail: recheck ZIL fail state before issuing IO
The ZIL may be failed any time we don't have the issuer lock, which
means we can end up working through zil_commit_impl() even when the ZIL
already failed.

So, each time we gain the issuer lock, recheck the fail state, and do
not issue IO if it's failed. The waiter will eventually go to sleep
waiting to be signalled when the ZIL reopens in zil_sync()->zil_clean().

(cherry picked from commit 49fa92c8db389f257a15029d643fb026fa5b6dc2)
2023-07-31 15:05:56 +00:00
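The recheck-after-locking pattern described in the commit above can be sketched as follows. This is a generic illustration using POSIX mutexes, not the ZIL code: `log_t` and `log_issue()` are hypothetical, and the sketch assumes the fail flag is only changed by a thread holding the issuer lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state relevant to this pattern. */
typedef struct {
    pthread_mutex_t issuer_lock;
    bool failed; /* set by another thread while it holds issuer_lock */
    int issued;  /* count of IOs issued, for illustration */
} log_t;

/* Returns true if IO was issued, false if the log had already failed. */
static bool
log_issue(log_t *lg)
{
    assert(lg != NULL);

    pthread_mutex_lock(&lg->issuer_lock);
    /* The state may have changed while we waited for the lock. */
    if (lg->failed) {
        pthread_mutex_unlock(&lg->issuer_lock);
        return (false);
    }
    lg->issued++; /* stand-in for actually issuing the IO */
    pthread_mutex_unlock(&lg->issuer_lock);
    return (true);
}
```

Any check made before acquiring the lock is only a hint; the decision to issue IO is made on the value read while the lock is held.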
Rob Norris 88bb9a3add zil_fail: set fail state as early as possible
zil_failed() is called unlocked, and itxg_failed is only checked with
itxg_lock held. This was making the ZIL appear to be not failed even as
zil_fail() was in progress, scanning the itx lists. With zil_failed()
returning false, zil_commit() would continue to zil_commit_impl() and
also start processing itx lists, racing each other for locks.

So instead, set the fail state as early as possible, before we start
processing the itx lists. This won't stop new itxs arriving on the itxgs
proper, but it will avoid additional commit itxs being created and will
stop any attempts to collect and commit them.

(cherry picked from commit 17579a79a2b481e746879d5a033626754931441e)
2023-07-31 15:05:56 +00:00
Rob Norris 43f45f8df0 zil_fail: don't take the namespace lock
Early on I thought it would be necessary to stop the world making
progress under us, and taking the namespace lock was my initial idea of
how to do that. The right way is a bit more nuanced than that, but as it
turns out, we don't even need it.

To fail the ZIL is effectively to stop it in its tracks and hold onto
all itxs stored within until the operations they represent are
committed to the pool by some other means (ie the regular txg sync).

It doesn't matter if the pool makes progress while we're doing this. If
the pool does progress, then zil_clean() will be called to process any
itxs now completed. That will take the itxg_lock, process and
destroy the itxs, and release the lock, leaving the itxg empty.

If zil_fail() is running at the same time, then either zil_clean() will
have cleaned up the itxg and zil_fail() will find it empty, or
zil_fail() will get there first and empty it onto the fail itxg.

(cherry picked from commit 83ce694898f5a89bd382dda0ba09bb8a04ac5666)
2023-07-31 15:05:56 +00:00
Rob Norris 3cb140863f zil_fail: fix infinite loop in commit itx search
(cherry picked from commit ba1f888858f8599e10211aa9957d942e7bcc36ce)
2023-07-31 15:05:56 +00:00
Rob N 6c36d72e71 vdev: expose zfs_vdev_max_ms_shift as a module parameter
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719

(cherry picked from commit ff73574cd8)

The type of sysctls had to be changed from uint_t to int to match other
sysctls in OpenZFS 2.1.5.
2023-07-31 15:05:56 +00:00
Geoff Amey 8d34aa5b66 Update META for zfs-2.1.5.4-1wasabi0 tag 2023-07-05 13:47:08 +00:00
Fred Weigel b037880efb JSON spares state readout
Spares can be in multiple pools, even if in use. This means that
status check AVAIL/INUSE is a bit tricky. spa_add_spares does not
need to be called, but we do need to do the equivalent. Which is
now done directly.
2023-07-05 13:27:31 +00:00
Fred Weigel f7574ccff3 Update to json output.
Missing proper state output for spare devices. Everything else
should be ok.
2023-07-05 13:27:31 +00:00
Rob Norris 802c258fc1 compress: add "slack" compression options
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Allan Jude 066532da51 Add module parameter to block 0 byte writes
Some hardware has issues when issued a write of 0 bytes.
Add a new module parameter, zio_suppress_zero_writes,
that when enabled (default) will just complete these I/Os
without sending them to the hardware.

Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
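The behaviour described in the commit above can be sketched as follows. Only the tunable name `zio_suppress_zero_writes` and its default come from the commit message; `io_t` and `io_write()` are hypothetical stand-ins, not the ZIO pipeline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tunable named in the commit; enabled by default per its message. */
static bool zio_suppress_zero_writes = true;

/* Hypothetical IO request for this sketch. */
typedef struct {
    size_t size;     /* bytes to write */
    bool sent_to_hw; /* true if the request reached the device */
    bool done;       /* true once the request has completed */
} io_t;

static void
io_write(io_t *io)
{
    assert(io != NULL);

    if (io->size == 0 && zio_suppress_zero_writes) {
        io->done = true; /* complete without touching hardware */
        return;
    }
    io->sent_to_hw = true; /* stand-in for issuing to the device */
    io->done = true;
}
```

With the tunable disabled, zero-byte writes take the normal path, which is useful when checking whether a given device is actually affected.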
Mateusz Piotrowski 91d6b61268 json: Define PRId64 and PRIu64 on FreeBSD
On FreeBSD, these types are long instead of long long.
2023-07-05 13:27:31 +00:00
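As a small illustration of why the PRId64 and PRIu64 macros in the commit above matter: the underlying type of a 64-bit integer differs between platforms, so a hard-coded "%llu" or "%lu" is only correct on some of them. The sketch below uses the standard <inttypes.h> macros in userland; `format_counter()` is a hypothetical helper.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Format a 64-bit counter using the platform's conversion specifier. */
static int
format_counter(char *buf, size_t len, uint64_t v)
{
    assert(buf != NULL && len > 0);
    /* PRIu64 expands to "lu", "llu", etc. as the platform requires. */
    return (snprintf(buf, len, "%" PRIu64, v));
}
```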
Mateusz Piotrowski 95d6d8d32f json: Drop problematic casts in nvlist_to_json()
The NVP_NAME() macro requires its argument to be castable to char *.
The compiler complains if const char * is provided instead.
2023-07-05 13:27:31 +00:00
Mateusz Piotrowski a7d67aed05 freebsd: Fix ZFS_ENTER_UNMOUNTOK and ZFS_ENTER on FreeBSD
There was a typo in zfs_znode_impl. The two macros were lowercase
instead of all caps, which caused compilation problems on FreeBSD.
2023-07-05 13:27:31 +00:00
Mateusz Piotrowski 6ee35af1a4 zil: Drop an unnecessary if statement
We already check for error != 0 earlier and return if true. The compiler
error here is a false positive.
2023-07-05 13:27:31 +00:00
Mateusz Piotrowski d744cdb77c json: null_filter(): Use __maybe_unused
The function fails to compile with -Wself-assign.
2023-07-05 13:27:31 +00:00
Mateusz Piotrowski 9c2c6124be zpool: Provide GUID to zpool-reguid(8) with -g
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara Inc.
2023-07-05 13:27:31 +00:00
Allan Jude 9c9eed9737 Make zpool clear reset the removed flag on vdevs
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Allan Jude 41b06f70c6 Make zpool clear reset the removed flag on vdevs
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
2023-07-05 13:27:31 +00:00
Fred Weigel b6a9054a0e Fix checkstyle for zil.c
Returns are to be parenthesized
2023-07-05 13:27:31 +00:00
Fred Weigel eb3607bcec Fixes for Wasabi json endpoint
Corrects status output.
2023-07-05 13:27:31 +00:00
Fred Weigel cf5a6fbc82 Change 5 char tag limit to 255
Changes 5 character maximum tag to 255 characters.
2023-07-05 13:27:31 +00:00
Fred Weigel 6ccb1a75af Klara update for json
Fix checkstyle indicated errors, source format fixes

Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
2023-07-05 13:27:31 +00:00
Allan Jude 2284c4d200 Add module parameter to block 0 byte writes
Some hardware has issues when issued a write of 0 bytes.
Add a new module parameter, zio_suppress_zero_writes,
that when enabled (default) will just complete these I/Os
without sending them to the hardware.

Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Rob Norris f882884358 btree: fix double-free in zfs_btree_remove_idx
We applied 03c0ee94b to fix two use-after-free cases, backporting 13f2b8fb9
from upstream. Unfortunately that patch seems to have been misapplied,
introducing a double-free in one of them. This commit fixes that.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-07-05 13:27:31 +00:00
Fred Weigel 506f987972 Added jprint.h and json_status.h to allow dist build
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Rob Norris 88149e0873 zil_create: don't try to deallocate a block we never allocated
(cherry picked from commit 8a35cfdcdd62ffc47e7628616f0dcb2ef172cf4b)
2023-07-05 13:27:31 +00:00
Rob Norris 5a256eaed1 zil_close: don't try to deallocate on-disk blocks
If we're force-exporting or failed then there's no guarantee the IO will
get anywhere. If it's a clean shutdown then that's actually the lead
block and it'll be sorted out during replay or next txg.

(cherry picked from commit 01e04a4eef7811a31a6258c99d0cc51217732758)
2023-07-05 13:27:31 +00:00
Allan Jude 11d3cff47b Normalize the endpoint name
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
fredw 43b705c787 stats_version: 2, scan_stats added even if never done. pass_scrub_scrub_spent_paused is now pass_scrub_spent_paused. stats is stats.json
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00
Mateusz Piotrowski 3828f754f1 json_stats.c: Rename the stats file to "status.json" 2023-07-05 13:27:31 +00:00
Rob Norris 2724bcb3d6 zil: allow the ZIL to fail and restart independently of the pool
zil_commit() has always returned void, and thus, cannot fail. Everything
inside it assumed that if anything ever went wrong, it could fall back
on txg_wait_synced() until the txg covering the operations being flushed
from the ZIL has fully committed. This meant that if the pool failed and
failmode=continue was set, syncing operations like fsync() would still
block.

Unblocking zil_commit() means largely the same approach. The difficulty
is that the ZIL carries the record of uncommitted VFS operations (vs the
changed data), and attached to those, callbacks and cvs that will
release userspace callers once the data is on disk. So if we can't write
the ZIL, we also can't release those records until the data is on disk.

This wasn't a problem before, because the zil_commit() would block. If
we change zil_commit() to return error, we still need to track those
entries until the data they represent hits the disk. We also need to
accept new records; just because the ZIL fails may not necessarily mean
the pool itself is unavailable.

This commit reorganises the ZIL to allow zil_commit() to return failure.
If ZIL writes or flushes fail, the ZIL is moved into a "failed" state,
and no further writes are done; all zil_commit() calls are serviced by
the regular txg mechanism. Outstanding records (itx_ts) are held until
the main pool writes their associated txg out. The records are then
released. Once all records are cleared, the ZIL is reset and reopened.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit af821006f6602261e690fe6635689cabdeefcadf)
2023-07-05 13:27:31 +00:00
Rob Norris cdaf041d39 zil: ensure flush errors are received
It's possible for a hardware failure to occur in a way that the ZIL block
writes appear to succeed, but the flush fails.

Because flush errors were being ignored, the lwb chain would finish with
a zero error code, which would result in zil_commit() returning and thus
fsync() returning success to the caller, even though the data was not
recorded in the ZIL.

If the ZIL is on the main pool (no SLOG device) it would typically
suspend around the same time. If that happened before the txg committed,
then those writes are now totally lost - not on the pool, not in the
ZIL.

zil_lwb_flush_vdevs_done() has the necessary code to deal with this
situation, but zio_flush() would never return failure, so it never saw
it. This just allows flushes to report failure, and now we never miss a
failed ZIL write.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit d9db5dccc56b551d0bf66bc9022b6c19a659b7e1)
2023-07-05 13:27:31 +00:00
Rob Norris 8ec175d7e1 zio_flush: require caller to decide if errors should propagate
Ignoring flush errors makes it possible for callers to never know that
their writes didn't succeed, and allows writes to be lost if the pool
fails.

This commit gives zio_flush() a flag argument, and updates the call
sites to pass ZIO_FLAG_DONT_PROPAGATE to it. Thus, this commit does not
change any behaviour, but opens the floor for further changes to allow
those callers to handle flush failures sensibly.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 6d0deb8a5a0c3d6bbc69d9625d55fc776bb98ea3)
2023-07-05 13:27:31 +00:00
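The idea of letting the caller decide whether a flush error propagates, as in the zio_flush commit above, can be sketched as below. This is a deliberately simplified model: `SKETCH_FLAG_DONT_PROPAGATE` is a hypothetical value standing in for ZIO_FLAG_DONT_PROPAGATE, and `flush_result()` is not a ZFS function.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag value; the real flag is defined by ZFS. */
#define SKETCH_FLAG_DONT_PROPAGATE 0x1

/*
 * Given the error a flush finished with, return the error the caller
 * observes under the flags it passed.
 */
static int
flush_result(int flush_err, int flags)
{
    assert(flush_err >= 0); /* errno-style: 0 means success */

    if (flags & SKETCH_FLAG_DONT_PROPAGATE)
        return (0); /* error is swallowed, as before the change */
    return (flush_err);
}
```

Passing the flag at every existing call site preserves the old behaviour, while a caller that omits it sees the failure.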
Rob Norris 589cea17a9 dmu_tx_wait: handle pool suspension when failmode=continue
Let txg_wait_synced_tx fail, so the caller can retry.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit d560d64dbdf853d8fb9e18fc7570bd309091b2e4)
2023-07-05 13:27:30 +00:00
Rob Norris 7b7af8ba02 vnops: thread DMU_TX_ASSIGN_CONTINUE to a bunch of vnops
These are ones that I'm reasonably sure connect to a real syscall and
have a reasonable error response.

I've left stuff like `dirty_inode`, `zfs_inactive`, etc, which are
internal kernel housekeeping things, as well as anything that looks like
it belongs to zvols, ioctls, admin commands, etc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 39c2801c611e27b521d716fea8f771307820362e)
2023-07-05 13:27:30 +00:00
Rob Norris aea007e336 dmu: add DMU_TX_ASSIGN_CONTINUE flag
This is like DMU_TX_ASSIGN_NOSUSPEND, but only when failmode=continue,
and returning EIO if the pool is suspended. It's designed to be easy to
use from syscalls and similar without the ceremony of checking for
EAGAIN and failmode every time.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 6bed8644dd2afa0e39727e9e90642479c2416521)
2023-07-05 13:27:30 +00:00
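One reading of the DMU_TX_ASSIGN_CONTINUE semantics described above is sketched below. The types and `assign_continue()` are hypothetical, and the assumption that the call simply blocks when failmode is not "continue" is an interpretation of the commit message, not confirmed behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the pool's failmode property. */
typedef enum { FAILMODE_WAIT, FAILMODE_CONTINUE, FAILMODE_PANIC } failmode_t;

/* Hypothetical outcomes of a transaction assignment attempt. */
typedef enum { TX_PROCEED, TX_BLOCK, TX_FAIL_EIO } tx_outcome_t;

static tx_outcome_t
assign_continue(bool pool_suspended, failmode_t fm)
{
    assert(fm == FAILMODE_WAIT || fm == FAILMODE_CONTINUE ||
        fm == FAILMODE_PANIC);

    if (!pool_suspended)
        return (TX_PROCEED);
    if (fm == FAILMODE_CONTINUE)
        return (TX_FAIL_EIO); /* caller can return EIO to userspace */
    return (TX_BLOCK); /* assumed: wait for the pool to resume */
}
```

Folding the failmode check into the flag means each syscall path handles one error value instead of repeating the EAGAIN and failmode logic.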
Rob Norris 48a48059c7 dmu: rename dmu_tx_assign flags
Their names clash with those for txg_wait_synced_tx, and they aren't
directly compatible, leading to confusion.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 1f0fb1dae7c1e84de3b39e669e09b8b3d5b80b87)
2023-07-05 13:27:30 +00:00
Rob Norris b0d75996ba zio: don't report suspend IOs if the pool is already suspended
This can happen if the pool suspended and then new IO is issued which
then fails too. This doesn't change behaviour, just silences the noise.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 3fa696404fb40205ed631538c62ec1a54d8ee6cd)
2023-07-05 13:27:30 +00:00
Rob Norris 3aea149bf8 linux: reject syncing ops if the filesystem is unmounting
The kernel can call these during unmount, so we have to handle them
directly to prevent any further IO being issued.

zfs_fsync reorganised slightly to not set up zfs_fsyncer_key until after
the teardown lock is acquired, just in case we don't get it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 900c26570ddcdd1d3ca135e6aee5df6456f6bfd6)
2023-07-05 13:27:30 +00:00
Rob Norris 7e4a9cbaee zpool_disable_datasets: on Linux, detach mounts when forcing export
On Linux, MNT_FORCE makes the kernel inform the filesystem that it's
about to call its unmount method so it can begin to eject active IO,
making it more likely that the unmount will succeed. This however does
not arrange for the unmount method to always succeed; new IO between the
two filesystem calls can dirty the filesystem. This is very difficult to
lock out properly within ZFS, as not all operations that cause the
kernel to dirty the filesystem can easily be locked out (eg zfs_lookup).

So, we add MNT_DETACH as well. This causes the kernel to first remove
the mount from the user namespace, giving the appearance that it has
been unmounted (ie no longer appears in /proc/mounts), so that userspace
can't reference the filesystem anymore. The unmount then proceeds in the
background.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit d2e1634fc935288aa851b5915feaa670c791265c)
2023-07-05 13:27:30 +00:00
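The flag combination described in the commit above can be sketched with the Linux umount2(2) call. This is a Linux-only illustration, not the code in zpool_disable_datasets; `export_umount_flags()` and `force_unmount()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mount.h>

/* Flags to use when unmounting as part of an export. */
static int
export_umount_flags(bool force)
{
    /* MNT_DETACH hides the mount at once; teardown finishes later. */
    return (force ? (MNT_FORCE | MNT_DETACH) : 0);
}

/* Forcibly unmount one mountpoint; returns 0 or -1 with errno set. */
static int
force_unmount(const char *mountpoint)
{
    assert(mountpoint != NULL);
    return (umount2(mountpoint, export_umount_flags(true)));
}
```

Because the detach is lazy, a successful return means the mount is no longer reachable from the namespace, not that the filesystem has finished unmounting.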
Mariusz Zaborski 40a9efd0e8 zfs: support force exporting pools
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed.  This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export
  by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported.  Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly
  exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and closing any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any
  changes that were not able to be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos,
  which have VFS-level support for yanking filesystems out from under
  users.  However, this only helps when the user is actively performing
  I/O, while not sitting on the filesystem.  In particular, this allows
  test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.

Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress

This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.

Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by:  Klara, Inc.
Sponsored-by:  Catalogics, Inc.
Sponsored-by:  Wasabi Technology, Inc.
Closes #3461
(cherry picked from commit 852e633772217d779a63e8c46fe3c5f81dd8960e)
2023-07-05 13:27:30 +00:00
Mateusz Piotrowski f65b59c5e5 module/zfs/Makefile.in: Add jprint.o and json_stats.o 2023-07-05 13:27:30 +00:00
Mateusz Piotrowski dcf745c378 Remove remaining bits of zpool addlog and ZFS_IOC_ADD_LOG 2023-07-05 13:27:30 +00:00
Mateusz Piotrowski 676d1dcc8c json_stats.c: Do not print value of vs_noalloc
The vs_noalloc member of the vdev_stat structure was implemented in
2a673e76a9. It is not available in ZFS
2.1.5, so code using it needs to be disabled.
2023-07-05 13:27:30 +00:00
Mateusz Piotrowski bcde0da8e4 json_stats.c: Move variable declarations out of a switch statement
This patch fixes the following compilation error:

```
../../module/zfs/json_stats.c: In function ‘nvlist_to_json’:
../../module/zfs/json_stats.c:92:4: error: a label can only be part of a statement and a declaration is not a statement
    uint64_t *u = (uint64_t *)p;
    ^~~~~~~~
../../module/zfs/json_stats.c:102:4: error: a label can only be part of a statement and a declaration is not a statement
    nvlist_t **a = (nvlist_t **)p;
    ^~~~~~~~
```
2023-07-05 13:27:30 +00:00
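The compiler error quoted above arises because older C standards do not allow a declaration to follow a case label directly. The sketch below shows the fix the commit title describes, moving the declarations out of the switch. It is a self-contained illustration: `kind_t` and `read_value()` are hypothetical and unrelated to the real nvlist_to_json().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value kinds for this sketch. */
typedef enum { KIND_U64, KIND_I32 } kind_t;

static uint64_t
read_value(kind_t kind, const void *p)
{
    /* Declared before the switch, so no label precedes a declaration. */
    const uint64_t *u;
    const int32_t *i;

    assert(p != NULL);
    switch (kind) {
    case KIND_U64:
        u = (const uint64_t *)p;
        return (*u);
    case KIND_I32:
        i = (const int32_t *)p;
        return ((uint64_t)*i);
    }
    return (0);
}
```

An alternative with the same effect is to wrap each case body in braces, which turns the declaration into part of a compound statement.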