Let txg_wait_synced_tx fail, so the caller can retry.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit d560d64dbdf853d8fb9e18fc7570bd309091b2e4)
These are ones that I'm reasonably sure connect to a real syscall and
have a reasonable error response.
I've left stuff like `dirty_inode`, `zfs_inactive`, etc, which are
internal kernel housekeeping things, as well as anything that looks like
it belongs to zvols, ioctls, admin commands, etc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 39c2801c611e27b521d716fea8f771307820362e)
This is like DMU_TX_ASSIGN_NOSUSPEND, but only when failmode=continue,
and returning EIO if the pool is suspended. It's designed to be easy to
use from syscalls and similar without the ceremony of checking for
EAGAIN and failmode every time.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 6bed8644dd2afa0e39727e9e90642479c2416521)
Their names clash with those for txg_wait_synced_tx, and they aren't
directly compatible, leading to confusion.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 1f0fb1dae7c1e84de3b39e669e09b8b3d5b80b87)
This can happen if the pool suspended and then new IO is issued which
then fails too. This doesn't change behaviour, just silences the noise.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 3fa696404fb40205ed631538c62ec1a54d8ee6cd)
The kernel can call these during unmount, so we have to handle them
directly to prevent any further IO being issued.
zfs_fsync reorganised slightly to not set up zfs_fsyncer_key until after
the teardown lock is acquired, just in case we don't get it.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 900c26570ddcdd1d3ca135e6aee5df6456f6bfd6)
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.
Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed. This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export
  by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported. Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly
  exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and closing any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any
  changes that were not able to be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos,
  which have VFS-level support for yanking filesystems out from under
  users. However, this only helps when the user is actively performing
  I/O, while not sitting on the filesystem. In particular, this allows
  test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.
Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress
This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.
Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #3461
(cherry picked from commit 852e633772217d779a63e8c46fe3c5f81dd8960e)
The vs_noalloc member of the vdev_stat structure was implemented in
2a673e76a9. It is not available in ZFS
2.1.5, so code using it needs to be disabled.
This patch fixes the following compilation error:
```
../../module/zfs/json_stats.c: In function ‘nvlist_to_json’:
../../module/zfs/json_stats.c:92:4: error: a label can only be part of a statement and a declaration is not a statement
uint64_t *u = (uint64_t *)p;
^~~~~~~~
../../module/zfs/json_stats.c:102:4: error: a label can only be part of a statement and a declaration is not a statement
nvlist_t **a = (nvlist_t **)p;
^~~~~~~~
```
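For context, the error arises because, before C23, a `case` label must be followed by a statement, and a declaration is not a statement. A minimal sketch of one usual fix, giving each case body its own block, is shown below. The enum and function names are hypothetical and are not taken from json_stats.c, and the actual patch may have resolved the error differently.

```c
#include <stdint.h>

/* Hypothetical value tags, for illustration only. */
enum val_type { VAL_UINT64, VAL_INT32 };

/*
 * Before C23 a declaration cannot directly follow a case label.
 * Wrapping each case body in braces makes the declaration the first
 * item of a compound statement, which is legal.
 */
static uint64_t
read_value(enum val_type type, const void *p)
{
	switch (type) {
	case VAL_UINT64: {
		const uint64_t *u = (const uint64_t *)p;
		return (*u);
	}
	case VAL_INT32: {
		const int32_t *i = (const int32_t *)p;
		return ((uint64_t)*i);
	}
	}
	return (0);
}
```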
This is a squashed commit of the commits from
03a64568f318c696b9e4be19429e72b446c97462 to
1c64f0c8832b34bfa82645125351d6c62815ae21 developed by Fred Weigel.
Usage:
cat /proc/spl/kstat/zfs/POOLNAME/stats
The following changes have been applied during the rebase of the patches
on top of the 2.1.5 branch:
- Drop ZFS_IOC_ADD_LOG. This ioctl was introduced to support introducing
  messages into the ZFS kernel log. It was used for debugging during
  development. The implementation of this debugging feature made `zpool
  addlog` output messages to /proc/spl/kstat/zfs/dbgmsg. The messages
  could later be retrieved with `zdbgmsg show`.
- Change the fmgw.c entry in lib/libzpool/Makefile.am to json_stats.c.
  The fmgw.c file has already been renamed to json_stats.c in other
  places.
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
(cherry picked from commit 75f3395d7fc0c93c02c8a8e792515f3e821aa05a)
Coverity static analysis found these.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #10989
Closes #13861
(cherry picked from commit 13f2b8fb92)
The default_bs and default_ibs tunables control the default block size
and indirect block size.
So far, default_bs and default_ibs were tunable only on FreeBSD, e.g.,
sysctl vfs.zfs.default_ibs
Remove the FreeBSD-specific sysctl code and expose default_bs and
default_ibs as tunables on both Linux and FreeBSD using
ZFS_MODULE_PARAM.
One of the use cases for changing the values of those tunables is to
lower the indirect block size, which may improve performance of large
directories (as discussed during the OpenZFS Leadership Meeting
on 2022-08-16).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14293
(cherry picked from commit 926715b9fc)
This change turns `MZAP_MAX_BLKSZ` into a `ZFS_MODULE_PARAM()` called
`zap_micro_max_size`. As a result, we can experiment with different
micro ZAP sizes to improve directory size scaling.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com>
Co-authored-by: Toomas Soome <toomas.soome@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14292
(cherry picked from commit a4b21eadec)
Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should
count 2x sizeof (range_seg_gap_t) per node. And since average B-tree
memory efficiency is about 75%, we should increase it to 3x.
Previous code under-counted up to 30% of the memory usage.
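The arithmetic can be sketched as follows: two trees at roughly 75% fill cost about 2 / 0.75 ≈ 2.67 segments' worth of memory per extent, which is rounded up to 3. The struct below is a hypothetical stand-in with three 64-bit fields; the real range_seg_gap_t layout and the function that does the accounting may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a gap-tracking range segment. */
typedef struct toy_seg_gap {
	uint64_t start;
	uint64_t end;
	uint64_t fill;
} toy_seg_gap_t;

/*
 * Estimated bytes used to track nextents extents: each extent has one
 * entry in each of two B-trees (2x), and the nodes are assumed to be
 * about 75% full on average, so the multiplier is rounded up to 3x.
 */
static size_t
extent_mem_estimate(size_t nextents)
{
	return (nextents * 3 * sizeof (toy_seg_gap_t));
}
```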
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #13537
I genuinely don't know why this didn't come up before,
but adding the LZ4 early abort pointed out this flaw,
in which we're allocating a buffer of one size, and
then telling the compressor that we're handing it buffers
of a different size, which may be Very Different - say,
allocating 512b and then telling it the inputs are 128k.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #13375
It is typical, but not generally true that if log summary has more
blocks it must also have unflushed metaslabs. Normally with metaslabs
flushed in order it works, but there are known exceptions, such as
device removal or metaslab being loaded during its flush attempt.
Before 600a02b884 if spa_flush_metaslabs() hit loading metaslab it
usually stopped (unless memlimit is also exceeded), but now it may
flush more metaslabs, just skipping that particular one. This
increased chances of assertion to fire when the skipped metaslab is
flushed on next iteration if all other metaslabs in that summary
entry are already flushed out of order.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #13486
Closes #13513
As of the Linux 5.19 kernel the readpage() address space operation
has been replaced by read_folio().
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure
erase functionality from the blkdev_issue_discard() function.
blkdev_issue_secure_erase() must now be called to issue
a secure erase.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function. The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for secure erase support.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function. The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
As of the Linux 5.18 kernel bio_alloc() expects a block_device struct
as an argument. This removes the need for the bio_set_dev() compatibility
code for 5.18 and newer kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
Refcount creation for abd_zero_scatter->abd_children is redundant in
abd_alloc_zero_scatter, as it has been done in abd_init_struct.
In addition, abd_children is undefined when ZFS_DEBUG is disabled, so
the reference to abd_children in abd_alloc_zero_scatter breaks the build
of libzpool when ZFS_DEBUG is disabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13429
clang-15 emits the following error message for functions without
a prototype:
fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error:
a function declaration without a prototype is deprecated
in all versions of C [-Werror,-Wstrict-prototypes]
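For context, before C23 an empty parameter list `()` in a declaration means "unspecified parameters", not "no parameters"; writing `(void)` turns the declaration into a real prototype. A minimal sketch, using a hypothetical function name rather than anything from spl-kmem-cache.c:

```c
/*
 * Deprecated form, rejected by clang-15 under
 * -Werror,-Wstrict-prototypes:
 *
 *     static int cache_count();
 *
 * Prototype form: "(void)" states that there are no parameters.
 */
static int cache_count(void);

static int
cache_count(void)
{
	return (3);
}
```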
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aidan Harris <me@aidanharr.is>
Closes #13421
On Linux 5.12 for PPC, the get_user() and __copy_from_user_inatomic()
inline helpers very indirectly include a reference to the GPL'd
array mmu_feature_keys[] and fail to build. Work around this by
using copy_from_user() and throwing EFAULT for any calls to
__copy_from_user_inatomic(). This is a workaround until the issue
introduced by Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458
"powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #11958
Closes #12590
Closes #13367
On some architectures ZERO_PAGE is unavailable because it references
a GPL exported symbol of empty_zero_page. Originally e08b993 removed
the call to ZERO_PAGE(0) for assignment to the abd_zero_page. However,
a simple check can be done to avoid a kernel allocation and free for
the abd_zero_page if ZERO_PAGE is available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #13199
This adds support for hole-punching facilities in the FreeBSD kernel
starting from __FreeBSD_version 1400032.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ka Ho Ng <khng@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #12458
Holding a dbuf is a common operation which can become highly contended
in dbuf_find() when acquiring the dbuf hash mutex. This is particularly
true on Linux when reading/writing volumes since by default up to 32
threads from the zvol_taskq may be taking a hold of the same dbuf.
This should also be observable on FreeBSD as long as there are enough
processes accessing the volume concurrently.
This is further aggravated by the fact that only the block id will
be unique when calculating the dbuf hash for a single volume. The
objset id, object id, and level will be the same for data blocks.
This has been observed to result in a somewhat less than uniform hash
distribution and a longer than expected max hash chain depth (~20)
on a large memory system (256 GB) using volumes.
This commit improves the situation by switching the hash mutex to
an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS
from 2048 to 8192 to further reduce the odds of a hash collision.
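To illustrate why more lock stripes help, here is a minimal sketch of power-of-two lock striping. It only models how a hash value selects a lock; the helper names are hypothetical and the real dbuf hash code may compute the index differently.

```c
#include <stdint.h>

#define	OLD_LOCKS	2048u
#define	NEW_LOCKS	8192u

/* Map a hash value to a lock stripe; nlocks must be a power of two. */
static uint32_t
lock_index(uint64_t hash, uint32_t nlocks)
{
	return ((uint32_t)(hash & (nlocks - 1)));
}

/* Nonzero when two hash values contend for the same lock stripe. */
static int
same_stripe(uint64_t a, uint64_t b, uint32_t nlocks)
{
	return (lock_index(a, nlocks) == lock_index(b, nlocks));
}
```

With 2048 stripes, hash values 5 and 2053 share a lock; with 8192 stripes they do not. The switch to an rwlock additionally lets lookups that do land on the same stripe proceed concurrently as readers.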
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13405
Clang 13.0.0 added support for `-Wunused-but-set-parameter` and
`-Wunused-but-set-variable` which correctly detect two unused
variables in zstd resulting in a build failure. This commit
annotates these instances accordingly.
https://releases.llvm.org/13.0.1/tools/clang/docs/ReleaseNotes.html#id6
In FSE_createCTable(), malloc() is intentionally defined as NULL when
compiled in the kernel so the variable is unused.
zstd/lib/compress/fse_compress.c:307:12: error: variable 'size'
set but not used [-Werror,-Wunused-but-set-variable]
Additionally, in ZSTD_seqDecompressedSize() the assert is compiled
out similarly resulting in an unused variable.
zstd/lib/compress/zstd_compress_superblock.c:412:12: error: variable
'litLengthSum' set but not used [-Werror,-Wunused-but-set-variable]
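As an illustration of the pattern, the sketch below shows a variable that is only consumed by an assertion which is compiled out, and one common way to annotate it (a `(void)` cast). The macro and function names are hypothetical, and the zstd change itself may use a different annotation.

```c
#include <stddef.h>

/* Stand-in for an assert that is compiled out in release builds. */
#define	DBG_ASSERT(cond)	((void)0)

static size_t
count_nonzero(const size_t *lens, size_t n)
{
	size_t sum = 0;		/* only consumed by the assertion below */
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		sum += lens[i];
		if (lens[i] != 0)
			count++;
	}
	DBG_ASSERT(sum >= count);
	(void) sum;	/* mark as used so -Wunused-but-set-variable is quiet */
	return (count);
}
```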
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Only the zdb utility needs to read metaslab-related data during a
read-only pool import, because of spacemap validation. Add a global
variable which allows zdb to read spacemaps in read-only import
mode.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #9095
Closes #12687
When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks(). The updated pos variable was incorrectly
ignored resulting in the current offset being used.
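The shape of the bug can be modelled in user space as below. This is a toy sketch only: `toy_write_checks()` is a hypothetical stand-in for the kernel's write-check step, which moves the position to end-of-file for appends, and it is not the zpl_aio_write() code.

```c
#include <stdint.h>

/* Toy write-check helper: for appends, move *pos to end-of-file. */
static void
toy_write_checks(int append, int64_t file_size, int64_t *pos)
{
	if (append)
		*pos = file_size;
}

/* Return the offset at which the write should actually start. */
static int64_t
write_offset(int append, int64_t file_size, int64_t requested)
{
	int64_t pos = requested;

	toy_write_checks(append, file_size, &pos);
	/* Use the updated pos, not the caller's original offset. */
	return (pos);
}
```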
This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13370
Closes #13377
In the hypothetical case of a non-linear ABD with a single segment whose
size is a multiple of the page size but which is not aligned to it,
vdev_geom_fill_unmap_cb() could fill one page less into the bio_ma array.
I am not sure it is exploitable, but better to be safe than sorry.
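The off-by-one can be seen with a little arithmetic: a buffer whose length is an exact multiple of the page size, but which starts at a non-zero offset within a page, touches one more page than length / page size. A minimal sketch, assuming a 4096-byte page and a hypothetical helper name (this is not the vdev_geom code):

```c
#include <stddef.h>

#define	TOY_PAGE_SIZE	4096u

/* Number of pages touched by len bytes starting at byte offset. */
static size_t
pages_spanned(size_t offset, size_t len)
{
	size_t first, last;

	if (len == 0)
		return (0);
	first = offset / TOY_PAGE_SIZE;
	last = (offset + len - 1) / TOY_PAGE_SIZE;
	return (last - first + 1);
}
```

For example, 4096 bytes starting at offset 0 span one page, while the same 4096 bytes starting at offset 512 span two.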
Reported-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit 5352f85cdd)
It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.
Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.
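The semantic difference can be sketched as a small mode check, based on the Linux fallocate(2) rule that FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE while FALLOC_FL_ZERO_RANGE may be used with or without it. The function name is hypothetical and this is not the ZFS implementation; the fallback defines are only there so the sketch builds where the Linux header is unavailable.

```c
#ifdef __linux__
#include <linux/falloc.h>
#endif

/* Fallback values matching the Linux UAPI header. */
#ifndef FALLOC_FL_KEEP_SIZE
#define	FALLOC_FL_KEEP_SIZE	0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define	FALLOC_FL_PUNCH_HOLE	0x02
#endif
#ifndef FALLOC_FL_ZERO_RANGE
#define	FALLOC_FL_ZERO_RANGE	0x10
#endif

/* Return 1 if the KEEP_SIZE requirement for this mode is satisfied. */
static int
falloc_keep_size_ok(int mode)
{
	/* PUNCH_HOLE is only valid together with KEEP_SIZE. */
	if (mode & FALLOC_FL_PUNCH_HOLE)
		return ((mode & FALLOC_FL_KEEP_SIZE) != 0);
	/* ZERO_RANGE accepts KEEP_SIZE but does not require it. */
	return (1);
}
```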
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #13329
Closes #13338
As of the 5.17 kernel the GENHD_FL_EXT_DEVT flag has been removed
and the GENHD_FL_NO_PART_SCAN flag renamed GENHD_FL_NO_PART. Update
zvol_alloc() to set GENHD_FL_NO_PART for the newer kernels which
is sufficient. The behavior for prior kernels remains unchanged.
1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT")
46e7eac6 ("block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART")
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13294
Closes #13297