This commit fixes a bug whereby an owner@ ACL entry that limits
WRITE_DATA access for the owner of a file was not being properly
enforced. The owner of a file should be prevented from write access
in this case, but being the owner of the file should still allow the
file owner to chmod, chown, and setacl.
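The intended split can be illustrated with a small userspace model. This
is only a sketch in Python, not the ZFS code: owner_may() and the
operation names are hypothetical, chosen to mirror the wording above.

```python
# Operations that ownership alone should permit, per the text above.
# The names are illustrative labels, not ZFS or NFSv4 identifiers.
OWNER_ONLY_OPS = frozenset({"chmod", "chown", "setacl"})


def owner_may(op, owner_acl_allows):
    """Return True if the file owner may perform `op`.

    owner_acl_allows is the set of data operations that the owner@
    entry grants, for example {"read"} when WRITE_DATA is withheld.
    """
    if op in OWNER_ONLY_OPS:
        # Ownership still confers the right to manage permissions.
        return True
    # Data access is decided by the owner@ entry alone.
    return op in owner_acl_allows
```

With an owner@ entry that grants only read, owner_may("write", {"read"})
is False while owner_may("chmod", {"read"}) is True.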
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
This function can be frequently called with MAY_EXEC|MAY_NOT_BLOCK
during RCU path walk. Where possible we should try not to break
out of it. In this case we check whether flag ZFS_NO_EXECS_DENIED is
set and check mode (similar to fastexecute check in zfs_acl.c).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
This removes an extra memory allocation / free from the
NFS4 ACL xattr handler. Initially this was written rather
quickly in the alpha cycle of SCALE and implemented in a
way that ensured the xattr exactly matched the format
used internally in Samba's vfs_acl_xattr module. Since
then, a more efficient conversion between the Samba
format and various other ones was added for the purpose
of inclusion in the kernel NFS server.
This change simplifies conversion between internal NFS ACL and
external xattr representation, but has no impact on userspace
and kernel consumers of this xattr (format does not change).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
MS-FSCC 2.6 is the governing document for
DOS attribute behavior. It specifies the following:
For a file, applications can read the file but
cannot write to it or delete it. For a directory,
applications cannot delete it, but applications can
create and delete files from the directory.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
When a new file is created on FreeBSD it is given the group
of the directory which contains it. On Linux it is given
either the effective GID of the process (System V semantics)
or the GID of the parent directory (BSD semantics).
Since there is no hard-and-fast rule about creation semantics
for NFSv4 ACLs on Linux, we should opt for what is least likely
to break users' permissions on a change from FreeBSD to Linux.
Avoid actually setting the SGID bit on dirs unless
it was explicitly set.
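The two sets of semantics can be compared with a short model. This is a
sketch under stated assumptions, not the ZFS implementation:
new_inode_group() and its parameters are hypothetical, and the System V
branch reflects the usual Linux behavior for a setgid parent directory.

```python
import stat


def new_inode_group(parent_gid, parent_mode, egid, mode, is_dir,
                    bsd_group_semantics):
    """Return (gid, mode) for a newly created file or directory."""
    if bsd_group_semantics:
        # Group comes from the parent directory. SGID is not forced on,
        # so it is present only if the caller asked for it in `mode`.
        return parent_gid, mode
    if parent_mode & stat.S_ISGID:
        # System V semantics with a setgid parent: inherit the group,
        # and new directories also inherit the setgid bit.
        if is_dir:
            mode |= stat.S_ISGID
        return parent_gid, mode
    # System V semantics otherwise: use the creator's effective GID.
    return egid, mode
```

The first branch corresponds to the behavior argued for above: the
parent's group is inherited without the SGID bit appearing on new
directories as a side effect.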
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
On Linux, POSIX ACLs can be removed via removexattr() for the
relevant system xattrs. On FreeBSD a non-trivial ACL
can be converted to one that is described by the mode with
no loss of info via a combination of acl_get_file(), acl_strip_np(),
and acl_set_file(). Since there's no libc equivalent of these
ops in Linux for NFSv4 ACLs, this commit makes this less error
prone by handling it entirely in ZFS. When a user performs
removexattr(), vfs_setxattr() is called with a value of NULL and a
length of 0. Add special handling for this situation in the xattr
handler for the NFSv4 ACL so that we generate a new ACL and run
zfs_acl_chmod() with the existing mode of the file, then set the ACL.
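From userspace, stripping the ACL then reduces to a single xattr removal.
A minimal sketch in Python, assuming the xattr name system.nfs4_acl_xdr
used elsewhere in this series; strip_nfs4_acl() is a hypothetical helper,
and the injectable removexattr argument exists only so the sketch can be
exercised without a ZFS dataset.

```python
import os

# Assumed xattr name, taken from the other commits in this series.
NFS4_ACL_XATTR = "system.nfs4_acl_xdr"


def strip_nfs4_acl(path, removexattr=None):
    """Ask the filesystem to replace the ACL with one derived from the mode."""
    if removexattr is None:
        # os.removexattr is only available on Linux.
        removexattr = os.removexattr
    removexattr(path, NFS4_ACL_XATTR)
```

On a dataset with this handler, strip_nfs4_acl("/mnt/tank/file") would be
expected to leave a trivial ACL matching the file's existing mode; the
path here is only an example.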
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Add ACL_IS_TRIVIAL and ACL_IS_DIR flags as ACL-wide flags
in the system.nfs4_acl_xdr generated on getxattr requests.
These are non-RFC flags that are useful for userspace applications
(especially the ACL_IS_TRIVIAL flag as it can be used to avoid
relatively expensive ACL-related operations).
Also add system.nfs4_acl_xdr to listxattr results if the ACL is not
trivial. This duplicates POSIX ACL behavior where whether an ACL is
set on a path can be determined via listxattr(). Since the ACL
is not actually removed, we check whether the ZFS_ACL_TRIVIAL flag
is set. If the flag is set, then we omit the xattr name from
the list. This allows users to determine whether the ACL is trivial
from listxattr().
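A userspace check built on this behavior might look like the following
Python sketch. has_nontrivial_acl() is a hypothetical helper, and the
injectable listxattr argument is there only so the logic can be tried
without a ZFS dataset.

```python
import os

NFS4_ACL_XATTR = "system.nfs4_acl_xdr"


def has_nontrivial_acl(path, listxattr=None):
    """Return True if the ACL xattr name is listed for `path`.

    Relies on the name being listed only when the ACL is not trivial.
    """
    if listxattr is None:
        # os.listxattr is only available on Linux.
        listxattr = os.listxattr
    return NFS4_ACL_XATTR in listxattr(path)
```

This avoids reading and decoding the ACL itself just to learn that it is
equivalent to the mode bits.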
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
The "permission" inode operation takes a new `struct user_namespace *`
parameter starting in Linux 5.12.
Add a configure check and adapt accordingly.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Properly evaluate edge cases where user credential may grant capability
to override DAC in various situations. Switch to using ns-aware checks
rather than capable().
Expand the optimization to allow bypass of zfs_zaccess() in the case of
a trivial ACL if MAY_OPEN is included in the requested mask. This will
be evaluated in the generic_permission() check, which is RCU walk safe.
This means that in most cases evaluating permissions on a boot volume
with NFSv4 ACLs will follow the fast path on checking inode permissions.
Additionally, CAP_SYS_ADMIN is granted to nfsd process, and so override
for this capability in access2 policy check is removed in favor of a
simple check for fsid == 0. Checks for CAP_DAC_OVERRIDE and other
override capabilities are kept as-is.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
This implements NFSv41 (RFC 5661) ACLs in a manner
compatible with vfs_nfs4acl_xattr in Samba and
nfs4xdr-acl-tools.
There are three key areas of change in this commit:
1) NFSv4 ACL management through system.nfs4_acl_xdr xattr.
Install an xattr handler for "system.nfs4_acl_xdr" that
presents an xattr containing full NFSv41 ACL structures
generated through rpcgen using specification from the Samba
project. This xattr is used by userspace programs to read and
set permissions.
2) Add an i_op->permissions endpoint: zpl_permissions(). This
is used by the VFS in Linux to determine whether to allow /
deny an operation. Wherever possible, we try to avoid having
to call zfs_access(). If the kernel has the NFSv4 patch for the VFS,
then perform a more complete check of the available access mask.
3) Add capability-based overrides to secpolicy_vnode_access2().
There are various situations in which the ACL may need to be
overridden based on capabilities. This logic is almost directly
copied from the Linux VFS. For instance, root needs to be able to
always read / write ACLs (otherwise the admin can get locked out
from files).
This commit was initially inspired by work from Paul B. Henson
to implement NFSv4.0 (RFC3530) ACLs in ZFS on Linux. Key areas of
divergence are as follows:
- ACL specification, xattr format, xattr name
- Addition of handling for NFSv4 masks from Linux VFS
- Addition of ACL overrides based on capabilities
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
SB_LARGEXATTR is used in TrueNAS SCALE to indicate to the kernel
that the filesystem supports large-size xattrs (greater than 64KiB).
This flag is used to evaluate whether to allow large xattr read
or write requests (up to 2 MiB).
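The size check this flag controls can be modelled as below. This is a
Python sketch of the rule as described, not kernel code:
xattr_size_allowed() is hypothetical, and the flag is represented by a
plain boolean because its numeric value is not given here.

```python
XATTR_SIZE_MAX = 64 * 1024               # stock Linux xattr size limit
LARGE_XATTR_SIZE_MAX = 2 * 1024 * 1024   # limit described above for SCALE


def xattr_size_allowed(size, sb_largexattr):
    """Return True if an xattr value of `size` bytes would be accepted."""
    limit = LARGE_XATTR_SIZE_MAX if sb_largexattr else XATTR_SIZE_MAX
    return size <= limit
```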
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Follow up to 99495ba6ab, which
accidentally introduced this regression.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Quartz <yyhran@163.com>
Closes #15907
This changes taskq_thread_should_stop() to limit the maximum exit rate
for idle threads to one per 5 seconds. I believe the previous logic
was broken: it did not allow any thread exits when tasks arrived more
than one at a time and so completed while others were still running.
Also while there:
- Remove taskq_thread_spawn() calls on task allocation errors.
- Remove extra taskq_thread_should_stop() call.
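The exit-rate limit described above can be sketched as a small model.
This is a Python illustration only: IdleExitLimiter is hypothetical, and
the real taskq_thread_should_stop() is expected to apply further
conditions that are not modelled here.

```python
class IdleExitLimiter:
    """Allow at most one idle-thread exit per `min_interval` seconds."""

    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval
        self.last_exit = None  # time of the most recent permitted exit

    def should_stop(self, now, idle):
        """Decide whether the calling thread may exit at time `now`."""
        if not idle:
            return False
        if (self.last_exit is not None and
                now - self.last_exit < self.min_interval):
            # Another thread exited too recently; keep this one around.
            return False
        self.last_exit = now
        return True
```

A burst of threads that go idle together therefore drains gradually
instead of all exiting at once.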
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15873
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error. The range may be for an entire file. While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.
As part of this change, support for the REMAP_FILE_CAN_SHORTEN flag
has been added. When REMAP_FILE_CAN_SHORTEN is set, zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it is clear, zfs_clone_range() will block and wait for the records
to be written out, allowing the blocks to be cloned.
Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning. This doesn't quite
provide atomic semantics, since if an error is encountered only a
portion of the range may be cloned. This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller. However, the destination file range is left in an undefined
state.
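The all-or-nothing rule can be summarized with a short model. This is a
Python sketch, not the ZFS code: clone_result() and PartialCloneError are
hypothetical, and the exception stands in for whatever errno the kernel
actually returns.

```python
class PartialCloneError(Exception):
    """Stand-in for the error returned when a clone is cut short."""


def clone_result(requested_len, cloned_len, can_shorten):
    """Map the length actually cloned to what the caller should see."""
    if cloned_len == requested_len:
        return cloned_len
    if can_shorten:
        # The caller opted in to short clones and can retry the rest.
        return cloned_len
    # Without REMAP_FILE_CAN_SHORTEN a partial clone becomes an error.
    raise PartialCloneError(
        "cloned %d of %d bytes" % (cloned_len, requested_len))
```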
A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15728
Closes #15842
On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue
trim commands, which is synchronous.
This commit updates vdev_disk_io_trim to use __blkdev_issue_discard,
which is asynchronous. Unfortunately there isn't any asynchronous
version for blkdev_issue_secure_erase, so performance of secure trim
will still suffer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15843
MAX_ORDER has been renamed to MAX_PAGE_ORDER. Rather than just
redefining it, instead define our own name and set it consistently from
the start.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
Linux has removed strlcpy in favour of strscpy. This implements a
fallback implementation of strlcpy for this case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
blkdev_get_by_path() and blkdev_put() have been replaced by
bdev_open_by_path() and bdev_release(), which return a "handle" object
with the bdev object itself inside.
This adds detection for the new functions, and macros to handle the old
and new forms consistently.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
In db4fc559c I messed up and changed this bit of code to set the inode
atime to an uninitialised value, when actually it was just supposed to
load the atime from the inode to be stored in the SA. This changes it
to what it should have been.
Ensure times change by the right amount. Previously, we only checked
if the times changed at all, which missed a bug where the atime was
being set to an undefined value.
Now ensure the times change by two seconds (or thereabouts), ensuring
we catch cases where we set the time to something bonkers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15762
Closes #15773
On Linux x86_64, a kmem cache can have a size of up to 4M;
however, increasing spl_kmem_cache_slab_limit can lead
to a crash due to a size check inconsistency.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #15757
We need to wait until after having done a zfs_enter() to load some
fields from the zfsvfs structure. Otherwise a use-after-free is
possible in the face of a concurrent rollback.
Other functions in this file are careful to avoid this bug; I believe
this is the only instance.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15752
On some systems we already have blkdev_get_by_path() with 4 args
but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined.
The vdev_bdev_mode() function was added to handle this case
but there was no generic way to specify exclusive access.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15692
6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accommodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681
In 6.7 the superblock shrinker member s_shrink has changed from being an
embedded struct to a pointer. Detect this, and don't take a reference if
it already is one.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681
6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and
i_mtime. This extends the method used for ctime in b37f29341 to atime
and mtime as well.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681
This should make sure we have the log written without overflows.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15517
- Remove the zsda_tx field; it is used only once.
- Remove unneeded string length checks; all names are terminated.
- Replace a few explicit MAXNAMELEN usages with sizeof().
- Change dsname from MAXNAMELEN to ZFS_MAX_DATASET_NAME_LEN, as
expected by dsl_dataset_name(). Both are 256 bytes now, but it is
better to be safe.
This should have no functional difference.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15535
This should be a purely textual change to make the code more readable.
It should cause no functional difference.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15543
Closes #15513
After a crash, cloned blocks need to be claimed on pool import.
This is only possible if they (lr_bps) and their count (lr_nbps) are
not encrypted but only authenticated, similar to the block pointer in
lr_write_t. A few other fields can be and still are encrypted.
This should fix a panic on ZIL claim after a crash when block cloning
is actively used.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15543
Closes #15513
Copy the disable parameter that FreeBSD implemented, and extend it to
work on Linux as well, until we're sure this is stable.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15529
It has been unused for 3 years, since #10576.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15507
Private read/write mappings can't be used to modify the mapped files, so
the files will remain immutable. Private read/write mappings are usually
used to load the data segment of executable files; rejecting them would
cause immutable executable files to stop working.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #15344
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs cannot be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
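The ratios above can be worked through numerically. This Python sketch is
an idealized illustration only: the function names are hypothetical, and
it assumes new blocks are allocated at the new width but accounted at the
old data-to-parity ratio, ignoring padding and partial stripes.

```python
from fractions import Fraction


def data_fraction(width, parity):
    """Fraction of each full stripe that holds data rather than parity."""
    return Fraction(width - parity, width)


def reported_vs_actual(old_width, new_width, parity):
    """Idealized ratio of reported size to logical size for a new block.

    Space is consumed at the new data fraction but converted back to a
    logical size using the old (assumed) data fraction.
    """
    return data_fraction(old_width, parity) / data_fraction(new_width, parity)


old = data_fraction(5, 2)   # 5-wide RAIDZ2: 3 data to 2 parity
new = data_fraction(6, 2)   # expanded to 6-wide: 4 data to 2 parity
print(old, new, reported_vs_actual(5, 6, 2))
```

Under these assumptions the script prints `3/5 2/3 9/10`. The last value
is the ratio between the size tools report for a newly written block and
what the new ratio alone would suggest; the true figure depends on block
size and padding.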
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
With Linux v6.6.0 and GCC 12, when a debug build is configured, an
implicit conversion error is raised while converting
'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of
'true' to fix the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #15489
Add a dataset_kstats_rename function, and call it when renaming
a zvol on FreeBSD and Linux.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15482
Closes #15486
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).
There are two problems:
1. Each ZVOL in a pool is a separate dataset/objset having a single
dnode. This means the objsets are synchronized serially, which
leads to a bottleneck of ~330K blocks written per second per pool.
2. In the case of multiple dirty dnodes/files on a dataset/objset on a
big system they will be sync'd in parallel taskq threads. However,
it is inefficient to use 75% of CPU cores of a big system to do
that, because of (a) bottlenecks on a single write issue taskq, and
(b) allocation throttling. In addition, if not for the allocation
throttling sorting write requests by bookmarks (logical address),
writes for different files may reach space allocators interleaved,
leading to unwanted fragmentation.
The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
in the pool.
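The sizing rule can be sketched as follows. This Python model is purely
illustrative: plan_sync_tasks() is hypothetical, it covers only the
splitting of one set of dirty dnodes, and it does not reflect how the
real code distributes work.

```python
def plan_sync_tasks(dirty_dnodes, n_allocators):
    """Split dirty dnodes into at most `n_allocators` parallel tasks.

    Uses as many tasks as there are allocators when enough dnodes are
    dirty, and never more tasks than allocators.
    """
    n_tasks = min(len(dirty_dnodes), n_allocators)
    # Round-robin assignment keeps the task sizes within one of each other.
    return [dirty_dnodes[i::n_tasks] for i in range(n_tasks)]
```

With ten dirty dnodes and four allocators this yields four tasks; with
two dirty dnodes it yields two.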
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15197
Currently, zvol threading can be switched through the zvol_request_sync
module parameter system-wide. By making it a zvol property, zvol
threading can be switched per zvol.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15409
There is no sense in having separate implementations for FreeBSD and
Linux. Make the Linux code shared, as it is more functional, and just
register a FreeBSD-specific prune callback with the
arc_add_prune_callback() API.
Aside from code cleanup, this should fix excessive pruning on FreeBSD:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15456
ZVOL:
- Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only
one object, it makes no sense to maintain async queue and on each
commit merge it into sync. Single sync queue is just cheaper, while
it changes nothing until actual commit request arrives.
- Remove zsd_sync_cnt and the zil_async_to_sync() calls since we
are no longer switching between sync and async queues.
ZFS:
- Mark write transactions as sync based only on the number of sync
opens (z_sync_cnt). We cannot randomly jump between sync and
async unless we want data corruption due to write reordering.
- When a file is first opened with O_SYNC (z_sync_cnt incremented to 1)
call zil_async_to_sync() for it to preserve correct ordering between
past and future writes.
- Drop the zfs_fsyncer_key logic. It looks like it was an optimization
for workloads heavily intermixing async writes with tons of fsyncs.
But first, it was broken 8 years ago due to the Linux tsd implementation
not allowing data storage between syscalls, and second, I doubt it
is safe to switch from async to sync so often and without calling
zil_async_to_sync().
- Rename the sync argument of *_log_write() to commit, now only
signalling the caller's intent to call zil_commit() soon after. It
allows WR_COPIED optimizations without extra other meanings.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15366
Commits 518b487 and 23bdb07 changed the default ARC size limit on
Linux systems to 1/2 of physical memory, which has become too
strict for modern systems with large amounts of RAM. This patch
changes the default limit to match that of FreeBSD, so ZFS may
have a unified value on both platforms.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15437
This reverts commit aefb6a2bd6.
aefb6a2bd temporarily disabled blk-mq until we could find a fix for
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15439
This fix removes a dubious optimization in zfs_uiomove_bvec_rq()
that saved the iterator contents of a rq_for_each_segment(). This
optimization allowed restoring the "saved state" from a previous
rq_for_each_segment() call on the same uio so that you wouldn't
need to iterate through each bvec on every zfs_uiomove_bvec_rq() call.
However, if the kernel is manipulating the requests/bios/bvecs under
the covers between zfs_uiomove_bvec_rq() calls, then it could result
in corruption from using the "saved state". This optimization
resulted in an unbootable system after installing an OS on a zvol
with blk-mq enabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15351
Right now, zpl_ioctl_ficlone and zpl_ioctl_ficlonerange do not call
put on the src fd if the source and destination are on two different
devices. This leaves the source file held open in this case.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Daniel Berlin <dberlin@dberlin.org>
Closes #15386
There was a report of zvol data loss (#15351) after enabling blk-mq on a
zvol backed with 16k physical block sized disks. Out of an abundance of
caution, do not allow the user to enable blk-mq until we can look into
the issue.
Note that blk-mq was not enabled by default on zvols. It was always
opt-in via the zvol_use_blk_mq module parameter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Addresses: #15351
Closes #15378
In Linux commit 560e20e4bf6484a0c12f9f3c7a1aa55056948e1e, the
fsync_bdev() function was removed in favor of sync_blockdev() to do
(roughly) the same thing, given the same input. This change
conditionally attempts to call sync_blockdev() if fsync_bdev() isn't
discovered during configure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263
In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument
for the request_mask was added to generic_fillattr. This is the same
request_mask for statx that's present in the most recent API implemented
by zpl_getattr_impl. This commit conditionally adds it to the
zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...)
implementation, when configure determines it's present in the kernel's
generic_fillattr(...).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263