It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...
I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destruction was taking up so
much time...
So I added a configurable delay so that it does not tear down taskq
threads the first time it notices them idle. The total number of
threads at steady state went up, but the amount of time being burned
just tearing down/spinning up new ones almost vanished.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
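The idea behind the delay can be sketched as follows. This is a minimal illustration, not the actual taskq code: the `idle_tracker_t` structure, the `should_destroy_thread()` helper, and the use of seconds as the time unit are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread bookkeeping; not the real taskq structures. */
typedef struct idle_tracker {
	bool     it_idle;       /* thread was idle at the previous check */
	uint64_t it_idle_since; /* time at which idleness was first seen */
} idle_tracker_t;

/*
 * Return true if an idle thread may be torn down at time `now`.
 * The first time the thread is seen idle we only record the time;
 * teardown is allowed once it has stayed idle for `delay` time units.
 */
bool
should_destroy_thread(idle_tracker_t *it, bool idle_now, uint64_t now,
    uint64_t delay)
{
	assert(it != NULL);

	if (!idle_now) {
		/* The thread did work, so forget any earlier idle period. */
		it->it_idle = false;
		return (false);
	}
	if (!it->it_idle) {
		/* First time we notice it idle: start the clock. */
		it->it_idle = true;
		it->it_idle_since = now;
		return (delay == 0);
	}
	return (now - it->it_idle_since >= delay);
}
```

With a nonzero delay, a thread that alternates quickly between busy and idle is never destroyed, which is the churn the change is meant to avoid.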
This function can be frequently called with MAY_EXEC|MAY_NOT_BLOCK
during RCU path walk. Where possible we should try not to break
out of it. In this case we check whether flag ZFS_NO_EXECS_DENIED is
set and check mode (similar to fastexecute check in zfs_acl.c).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
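A rough sketch of the kind of check described above. The constant values and the `exec_fast_path_ok()` helper are illustrative stand-ins, and reading "check mode" as "at least one execute bit is set" is an assumption; the real code uses the kernel and ZFS definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; the kernel and ZFS headers define the real ones. */
#define	SK_MAY_EXEC		0x01u
#define	SK_MAY_NOT_BLOCK	0x80u
#define	SK_NO_EXECS_DENIED	0x01u	/* stands in for ZFS_NO_EXECS_DENIED */
#define	SK_ANY_EXEC		0111u	/* owner, group and other execute bits */

/*
 * Return true if an execute request made during RCU path walk can be
 * granted without blocking. A false result means the caller has to fall
 * back to the slower, blocking permission check.
 */
bool
exec_fast_path_ok(unsigned int mask, uint64_t zflags, unsigned int mode)
{
	/* The sketch only models requests that carry MAY_NOT_BLOCK. */
	assert((mask & SK_MAY_NOT_BLOCK) != 0);

	/* Anything other than a pure execute request is not handled here. */
	if ((mask & ~SK_MAY_NOT_BLOCK) != SK_MAY_EXEC)
		return (false);
	/* The ACL must be known to contain no execute denials. */
	if ((zflags & SK_NO_EXECS_DENIED) == 0)
		return (false);
	/* Assumption: the mode check requires some execute bit to be set. */
	return ((mode & SK_ANY_EXEC) != 0);
}
```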
The dkms package layout changed in Bookworm, with dh-dkms split out
into its own package. Debhelper in Bookworm is updated to use
dh-sequence-dkms instead of dkms.
GitHub Actions are updated to use Ubuntu 22.04 instead of Ubuntu
20.04, since dh-sequence-dkms is not available on Ubuntu 20.04.
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
This removes an extra memory allocation / free from the
NFS4 ACL xattr handler. Initially this was written rather
quickly in the alpha cycle of SCALE and implemented in a
way to ensure that the xattr exactly matched the format
used internally in samba's vfs_acl_xattr module. Since
then, a more efficient conversion between the Samba
format and various other ones was added for the purpose
of inclusion in the kernel NFS server.
This change simplifies conversion between internal NFS ACL and
external xattr representation, but has no impact on userspace
and kernel consumers of this xattr (format does not change).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
MS-FSCC 2.6 is the governing document for
DOS attribute behavior. It specifies the following:
For a file, applications can read the file but
cannot write to it or delete it. For a directory,
applications cannot delete it, but applications can
create and delete files from the directory.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
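The quoted MS-FSCC rule can be written down as a small decision function. This is only a sketch of the read-only attribute semantics as quoted above; the `dos_op_t` enum and `readonly_allows()` are hypothetical names, and regular permission checks are out of scope.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation classes used only for this illustration. */
typedef enum dos_op {
	DOS_OP_READ,		/* read the object */
	DOS_OP_WRITE,		/* write to a file's data */
	DOS_OP_DELETE_SELF,	/* delete the object itself */
	DOS_OP_CREATE_CHILD,	/* create an entry in a directory */
	DOS_OP_DELETE_CHILD	/* delete an entry from a directory */
} dos_op_t;

/*
 * Return true if the operation is permitted on an object that has the
 * DOS read-only attribute set, following the rule quoted above.
 */
bool
readonly_allows(bool is_dir, dos_op_t op)
{
	assert(op >= DOS_OP_READ && op <= DOS_OP_DELETE_CHILD);

	/* Neither a read-only file nor a read-only directory can be deleted. */
	if (op == DOS_OP_DELETE_SELF)
		return (false);
	if (is_dir) {
		/* Entries may still be created in and deleted from it. */
		return (op == DOS_OP_READ || op == DOS_OP_CREATE_CHILD ||
		    op == DOS_OP_DELETE_CHILD);
	}
	/* A read-only file can be read but not written. */
	return (op == DOS_OP_READ);
}
```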
We never want to partition vdevs automatically from ZFS in SCALE.
Ignore the wholedisk flag in SCALE and skip the tests that expect
auto partitioning to work.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
With --enable-debuginfo configured, ZFS packages are built with
debug symbols embedded into the binaries.
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
TrueNAS SCALE doesn't boot from pools on top of LVM, and the scan can
take a significant amount of time on systems with a large number of
disks.
Skip the lvm commands in our local-top/zfs script.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
When a new file is created on FreeBSD it is given the group
of the directory which contains it. On Linux it is given
either the effective GID of the process (System V semantics)
or the GID of the parent directory (BSD semantics).
Since there is no hard-and-fast rule about creation semantics
for NFSv4 ACLs on Linux, we should opt for what is least likely
to break users' permissions on change from FreeBSD to Linux.
Avoid actually setting the SGID bit on dirs unless
it was explicitly set.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
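A sketch of the group selection described above, under the assumption that "least likely to break users' permissions" means using BSD semantics whenever NFSv4 ACLs are in use. The `sk_dir_t` type and both helpers are hypothetical and only model the decision, not the ZFS implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the parent directory. */
typedef struct sk_dir {
	unsigned int gid;	/* owning group of the directory */
	bool         sgid;	/* SGID bit was explicitly set on it */
} sk_dir_t;

/*
 * Pick the group for a newly created file.
 * BSD semantics: inherit the parent directory's group.
 * System V semantics: use the creating process's effective GID.
 */
unsigned int
new_file_gid(const sk_dir_t *parent, unsigned int egid, bool nfsv4_acl)
{
	assert(parent != NULL);

	/* Assumption: NFSv4 ACL datasets always follow BSD semantics. */
	if (nfsv4_acl || parent->sgid)
		return (parent->gid);
	return (egid);
}

/*
 * A new subdirectory only carries the SGID bit when the parent had it
 * explicitly set; group inheritance alone does not set the bit.
 */
bool
new_dir_gets_sgid(const sk_dir_t *parent)
{
	assert(parent != NULL);
	return (parent->sgid);
}
```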
On Linux POSIX ACLs can be removed via rmxattr() for the
relevant system xattrs. On FreeBSD a non-trivial ACL
can be converted to one that is described by the mode with
no loss of info via combination of acl_get_file(), acl_strip_np(),
and acl_set_file(). Since there's no libc equivalent of these
ops in Linux for NFSv4 ACLs, this commit makes this less error
prone by handling it entirely in ZFS. When a user performs
rmxattr(), vfs_setxattr() is called with a value of NULL and a length
of 0. Add special handling for this situation in the xattr
handler for the NFSv4 ACL so that we generate a new ACL, apply
zfs_acl_chmod() with the existing mode of the file, then set the ACL.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Add ACL_IS_TRIVIAL and ACL_IS_DIR flags as ACL-wide flags
in the system.nfs4_acl_xdr generated on getxattr requests.
These are non-RFC flags that are useful for userspace applications
(especially the ACL_IS_TRIVIAL flag, as it can be used to avoid
relatively expensive ACL-related operations).
Also add system.nfs4_acl_xdr to xattr results if the ACL is not trivial.
This duplicates POSIX ACL behavior, where whether an ACL is
set on a path can be determined via listxattr(). Since the ACL
is not actually removed, we check whether the ZFS_ACL_TRIVIAL flag
is set. If the flag is set, then we omit the xattr name from
the list. This allows users to determine whether the ACL is trivial
from listxattr().
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
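The listxattr() behavior can be illustrated roughly like this. The flag value and the `list_acl_xattr()` helper are placeholders for the example; only the xattr name comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	SK_ACL_TRIVIAL	0x1u	/* placeholder for ZFS_ACL_TRIVIAL */
#define	SK_ACL_XATTR	"system.nfs4_acl_xdr"

/*
 * Emit the ACL xattr name into a listxattr()-style buffer only when the
 * ACL is not trivial. Returns the number of bytes the name occupies,
 * including the terminating NUL, or 0 when the name is omitted.
 */
size_t
list_acl_xattr(uint64_t pflags, char *buf, size_t buflen)
{
	size_t need = sizeof (SK_ACL_XATTR);	/* counts the NUL byte */

	assert(buf != NULL || buflen == 0);

	if (pflags & SK_ACL_TRIVIAL)
		return (0);
	/* With a too-small buffer we only report the size needed. */
	if (buf != NULL && buflen >= need)
		memcpy(buf, SK_ACL_XATTR, need);
	return (need);
}
```

A caller that sees the name in the list knows a non-trivial ACL is present without having to fetch and decode the ACL itself.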
The "permission" inode operation takes a new `struct user_namespace *`
parameter starting in Linux 5.12.
Add a configure check and adapt accordingly.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Properly evaluate edge cases where user credential may grant capability
to override DAC in various situations. Switch to using ns-aware checks
rather than capable().
Expand the optimization to allow bypassing zfs_zaccess() in the case of
a trivial ACL if MAY_OPEN is included in the requested mask. This will
be evaluated in the generic_permission() check, which is RCU walk safe.
This means that in most cases evaluating permissions on a boot volume
with NFSv4 ACLs will follow the fast path when checking inode
permissions.
Additionally, CAP_SYS_ADMIN is granted to nfsd process, and so override
for this capability in access2 policy check is removed in favor of a
simple check for fsid == 0. Checks for CAP_DAC_OVERRIDE and other
override capabilities are kept as-is.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
The new sysfs attribute makes the kernel wait for all device probes to
complete before returning. Without it, the wait_for_udev call does not
give any guarantees.
Ticket: NAS-108200
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Now that we support NFSv4 ACLs on Linux, this can be made the
default across all platforms.
Update the documentation and tests accordingly.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
This implements NFSv4.1 (RFC 5661) ACLs in a manner
compatible with vfs_nfs4acl_xattr in Samba and
nfs4xdr-acl-tools.
There are three key areas of change in this commit:
1) NFSv4 ACL management through system.nfs4_acl_xdr xattr.
Install an xattr handler for "system.nfs4_acl_xdr" that
presents an xattr containing full NFSv4.1 ACL structures
generated through rpcgen using specification from the Samba
project. This xattr is used by userspace programs to read and
set permissions.
2) Add an i_op->permission endpoint: zpl_permissions(). This
is used by the VFS in Linux to determine whether to allow /
deny an operation. Wherever possible, we try to avoid having
to call zfs_access(). If the kernel has the NFSv4 patch for VFS,
then perform a more complete check of the available access mask.
3) Add capability-based overrides to secpolicy_vnode_access2().
There are various situations in which an ACL may need to be
overridden based on capabilities. This logic is almost directly
copied from Linux VFS. For instance, root needs to be able to
always read / write ACLs (otherwise admin can get locked out
from files).
This commit was initially inspired by work from Paul B. Henson
to implement NFSv4.0 (RFC 3530) ACLs in ZFS on Linux. Key areas of
divergence are as follows:
- ACL specification, xattr format, xattr name
- Addition of handling for NFSv4 masks from Linux VFS
- Addition of ACL overrides based on capabilities
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
SB_LARGEXATTR is used in TrueNAS SCALE to indicate to the kernel
that the filesystem supports large-size xattrs (greater than 64KiB).
This flag is used to evaluate whether to allow large xattr read
or write requests (up to 2 MiB).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
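The size check this flag enables amounts to something like the sketch below. The limits come from the description above (64 KiB by default, up to 2 MiB with the flag); the macro and function names are made up for the example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limits taken from the description above; the names are hypothetical. */
#define	SK_XATTR_DEFAULT_MAX	((size_t)64 * 1024)
#define	SK_XATTR_LARGE_MAX	((size_t)2 * 1024 * 1024)

/*
 * Decide whether an xattr read or write of `size` bytes is acceptable,
 * given whether the superblock advertises large xattr support.
 */
bool
xattr_size_allowed(bool sb_largexattr, size_t size)
{
	size_t limit = sb_largexattr ?
	    SK_XATTR_LARGE_MAX : SK_XATTR_DEFAULT_MAX;

	assert(SK_XATTR_DEFAULT_MAX < SK_XATTR_LARGE_MAX);
	return (size <= limit);
}
```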
The assert does not account for the case where there is a single
buffer in the chain that is decompressed and has a valid
checksum.
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
leaking up to 128KB of memory per dataset during ZIL replay.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14987
With the latest L2ARC fixes, 2 seconds is too long to wait for
quiescence of arcstats like l2_size. Shorten this interval to avoid
having the persistent L2ARC tests in ZTS prematurely terminated.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14981
Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg. They allow
accounting for the completion of individual physical writes within a
logical one, improving cases where some of the physical writes complete
much sooner than others, gradually opening the write throttle.
A few years after that, ZFS got allocation throttling, working at the
level of logical writes and limiting the number of writes queued to
vdevs at any point, and so limiting the latency distribution between
the physical writes and especially writes of multiple copies.
The addition of the scheduling deadline I proposed in #14925 should
further reduce the latency distribution. The growth of memory sizes
over the past 10 years should also reduce the importance of the
smoothing.
While the use of the physdone callback may still in theory provide
some smoother throttling, there are cases where we simply cannot
afford it. Since dirty data accounting is protected by a pool-wide
lock, in the case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show a radical reduction of the lock spinning
time on workloads where smaller blocks are written to RAIDZ pools,
with each of the disks receiving 8-16KB chunks but the total rate
reaching 100K+ blocks per second. At the same time, attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
On FreeBSD 14 this test runs slowly in the CI environment
and is killed by the 10 minute timeout. Skip the test on
FreeBSD until the slowdown is resolved.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14961
With a large number of tracked references, list searches under the lock
become too expensive, creating enormous lock contention.
In my tests with ZFS_DEBUG enabled this increases write throughput
with 32KB blocks from ~1.2GB/s to ~7.5GB/s.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14970
If this is not done, and the pool has an ashift other than the default
(at the moment 9), then the following happens:
1) vdev_alloc() assigns the ashift of the pool to the L2ARC device, but
upon export it is not stored anywhere
2) at the first import, vdev_open() sees a vdev_ashift() of 0 and
assigns the logical_ashift, which is 9
3) reading the contents of L2ARC, including the header, fails
4) L2ARC buffers are not restored in ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14313
Closes #14963
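The failure sequence above can be illustrated with a small sketch. It assumes the fix makes a nonzero ashift visible at import time; `sk_open_ashift()` and `sk_align_up()` are hypothetical helpers, not the vdev code, and only show why a fallback to ashift 9 changes how on-disk structures are sized and aligned.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose the ashift when opening a device: keep a previously stored
 * value if there is one, otherwise fall back to the logical ashift.
 */
uint64_t
sk_open_ashift(uint64_t stored_ashift, uint64_t logical_ashift)
{
	return (stored_ashift != 0 ? stored_ashift : logical_ashift);
}

/* Round `size` up to a multiple of the sector size implied by ashift. */
uint64_t
sk_align_up(uint64_t size, uint64_t ashift)
{
	uint64_t unit;

	assert(ashift < 64);
	unit = (uint64_t)1 << ashift;
	return ((size + unit - 1) & ~(unit - 1));
}
```

For a 1000-byte structure, alignment with ashift 12 gives 4096 bytes while ashift 9 gives 1024 bytes, so a device written with one value and read with the other does not agree on where things start and end.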
While commit bcd5321 adjusts the write size based on the size of the log
block, this happens after comparing the unadjusted write size to the
evicted (target) size.
In this case l2ad_hand will exceed l2ad_evict and violate an assertion
at the end of l2arc_write_buffers().
Fix this by adding the max log block size to the allocated size of the
buffer to be committed before comparing the result to the target
size.
Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
write size exceeds the size of the L2ARC device.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14936
Closes #14954
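The corrected comparison can be sketched as below. The function and parameter names are invented for the illustration; the point is only that the log block overhead is added before the comparison with the target size rather than after it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if committing one more buffer would push the write past
 * the evicted (target) size, once the largest possible log block is
 * accounted for.
 *
 * written_asize: allocated size already queued in this pass
 * buf_asize:     allocated size of the buffer about to be committed
 * max_log_blk:   maximum size of a log block (overhead)
 * target_sz:     size of the region that was evicted ahead of the hand
 */
bool
write_exceeds_target(uint64_t written_asize, uint64_t buf_asize,
    uint64_t max_log_blk, uint64_t target_sz)
{
	/* Guard the sketch against wrap-around in the sum below. */
	assert(written_asize <= UINT64_MAX - buf_asize);
	assert(written_asize + buf_asize <= UINT64_MAX - max_log_blk);

	return (written_asize + buf_asize + max_log_blk > target_sz);
}
```

Without the overhead term, a buffer that just fits the target could still be followed by a log block that does not, which is the situation in which the write hand overtakes the evict hand.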
It was a vdev level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result. But since it has almost no idea about what is going
on, with the exception of the ZIO_FLAG_DONT_CACHE flag set by higher
layers, it was found to do more harm than good, for which reason it
was disabled for the past 12 years. These days we have much better
instruments to enlarge the I/Os, such as speculative and prescient
prefetches, I/O scheduler, I/O aggregation etc.
Besides just the dead code removal, this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), which was not
otherwise disabled and was still trying to do some work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14953
Until the ASSERT which is occasionally hit while running
checkpoint_discard_busy is resolved, skip this test case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12053
Closes #14952
- Do not report L2ARC as FAULTED in presence of in-flight writes.
- Report read and write I/Os, bytes and errors.
- Remove a few numbers not important to the average user.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #12304
Closes #14946
... instead of list_head() + list_remove(). On FreeBSD the list
functions are not inlined, so in addition to more compact code
this also saves another function call.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14955
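To show what a combined remove-head operation looks like, here is a self-contained sketch on a minimal circular doubly linked list. The `sk_` types and functions are written for this example and are not the SPL list implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list with a sentinel head node. */
typedef struct sk_node {
	struct sk_node *next;
	struct sk_node *prev;
} sk_node_t;

typedef struct sk_list {
	sk_node_t head;
} sk_list_t;

void
sk_list_init(sk_list_t *l)
{
	assert(l != NULL);
	l->head.next = l->head.prev = &l->head;
}

void
sk_list_insert_tail(sk_list_t *l, sk_node_t *n)
{
	assert(l != NULL && n != NULL);
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
}

/*
 * Detach and return the first node, or NULL if the list is empty.
 * One call replaces the "look up the head, then remove it" pair.
 */
sk_node_t *
sk_list_remove_head(sk_list_t *l)
{
	sk_node_t *n;

	assert(l != NULL);
	n = l->head.next;
	if (n == &l->head)
		return (NULL);
	l->head.next = n->next;
	n->next->prev = &l->head;
	n->next = n->prev = NULL;
	return (n);
}
```

A drain loop then becomes `while ((n = sk_list_remove_head(&l)) != NULL) { ... }`, with one call per element instead of two.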
We are not allowed to access lwb after setting the
LWB_STATE_FLUSH_DONE state and dropping zl_lock, since it may be freed
by zil_sync(). To free itxs and waiters after dropping the lock we
need to move the lwb_itxs and lwb_waiters list elements to local
storage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14957
Closes #14959
This reverts commit 79b20949b2 since it
doesn't work with the systemd version shipped with RHEL7-based systems.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14943
Closes #14945
When a kmem cache is exhausted and needs to be expanded a new
slab is allocated. KM_SLEEP callers can block and wait for the
allocation, but KM_NOSLEEP callers were incorrectly allowed to
block as well.
Resolve this by attempting an emergency allocation as a best
effort. This may fail but that's fine since any KM_NOSLEEP
consumer is required to handle an allocation failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
l2arc_write_size() should return the write size after adjusting for trim
and overhead of the L2ARC log blocks. Also take into account the
allocated size of log blocks when deciding when to stop writing buffers
to L2ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14939