The DDT is really inefficient on 4k and up vdevs, because it always
allocates 4k blocks, and while compression could save us somewhat
at ashift 9, that stops being true.
So let's change the default to 32 KiB, which seems like a reasonable
compromise between improved space savings and inflated write sizes
for DDT updates.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#14654
The previous comment wondered if this case could happen; it turns out
that it really can't.
This block can only be entered if dde_type and dde_class are "real";
that only happens when a ddt entry has been previously synced to a ddt
store, that is, it was created on a previous txg. Since its gone through
that sync, its dde_refcount must be >0.
ddt_addref() is called from brt_pending_apply(), which is called at the
beginning of spa_sync(), before pending DMU writes/frees are issued.
Freeing a dedup block is the only thing that can decrement dde_refcount,
so there's no way for it to drop to zero before applying the clone bumps
it.
Further, even if it _could_ go to zero, it wouldn't be necessary to fill
the entry from the block. The phys content is not cleared until the free
is issued, which happens when the refcount goes to zero, when the last
real free comes through. The cloned block should be identical to what's
in the phys already, so the fill should be a no-op anyway.
I've replaced this with an assertion because this is all very dependent
on the ordering in which BRT and DDT changes are applied, and that might
change in the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes#15004
With zl_suspend read in zil_commit() not protected by any locks it
is possible for new ZIL writes to be in progress while zil_destroy()
called by zil_suspend() freeing them. This patch closes the race
by taking zl_issuer_lock in zil_suspend() and adding the second
zl_suspend check to zil_get_commit_list(), protected by the lock.
It allows all already queued transactions to be logged normally,
while blocks any new ones, calling txg_wait_synced() for the TXGs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14979
- Pack struct zio_prop by 4 bytes from 84 to 80.
- Skip new child ZIO locking while linking to parent. The newly
allocated ZIO is not externally visible yet, so nobody should care.
- Skip io_bp_copy writes when not used (write && non-debug).
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14985
Scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, that caused
reporting of hugely over-inflated numbers, having nothing to do
with actual disk I/O. This change utilizes never used field in
struct dsl_scan_phys to account such skipped bytes, allowing to
report how much data were actually scrubbed/resilvered and what
is the actual I/O speed. While formally it is an on-disk format
change, it should be compatible both ways, so should not need a
feature flag.
This should partially address the same issue as c85ac731a0, but
from a different perspective, complementing it.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15007
This patch changes the passing of "size" to snprintf
from hard-coded (openended) to sizeof(errbuf). This
is bringing to standard with rest of the code where-
ever 'errbuf' is used.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com>
Closes#15003
The previous code was checking zfs_is_namespace_prop() only for the
last property on the list. If one was not "namespace", then remount
wasn't called. To fix that move zfs_is_namespace_prop() inside the
loop and remount if at least one of properties was "namespace".
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15000
The issue that this is designed to work around is only applicable to
glibc, since it's caused by glibc's pthread_cancel() implementation
using dlopen on libgcc_s.so.1 (and therefor not triggering dracut to
include it in the initramfs). This commit adds an extra condition to the
workaround that tests for glibc via "ldconfig -p | grep -qF 'libc.so.6'"
(which should only be present on glibc systems).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Violet Purcell <vimproved@inventati.org>
Closes#14992
Consistently get the proper default value for autotrim.
Currently, only the kernel module is built with IN_FREEBSD_BASE,
and libzfs get the wrong default value, leading to confusion and
incorrect output when autotrim value was not set explicitly.
Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15016
lwb->lwb_issued_txg can not be accessed after lwb_state is set to
LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
freed by zil_sync(). We must save the txg number before that.
This is similar to the 55b1842f92, but as I see the bug is not new.
It existed for quite a while, just was not triggered due to smaller
race window.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14988Closes#14999
When ZFS appends files in chunks bigger than recordsize, it borrows
buffer from ARC and fills it before opening transaction. This
supposed to help in case of page faults to not hold transaction open
indefinitely. The problem appears when recordsize is set lower than
default 128KB. Since each block is committed in separate transaction,
per-transaction overhead becomes significant, and what is even worse,
active use of of per-dataset and per-pool locks to protect space use
accounting for each transaction badly hurts the code SMP scalability.
The same transaction size limitation applies in case of file rewrite,
but without even excuse of buffer borrowing.
To address the issue, disable the borrowing mechanism if recordsize
is smaller than default and the write request is 4x bigger than it.
In such case writes up to 32MB are executed in single transaction,
that dramatically reduces overhead and lock contention. Since the
borrowing mechanism is not used for file rewrites, and it was never
used by zvols, which seem to work fine, I don't think this change
should create significant problems, partially because in addition to
the borrowing mechanism there are also used pre-faults.
My tests with 4/8 threads writing several files same time on datasets
with 32KB recordsize in 1MB requests show reduction of CPU usage by
the user threads by 25-35%. I would measure it in GB/s, but at that
block size we are now limited by the lock contention of single write
issue taskqueue, which is a separate problem we are going to work on.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14964
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists. AVL-trees are too expensive
for such a simple task. To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.
To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue. Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member. Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().
Introduce deadline of ~0.5s for LBA-sorted queues. Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation. With this change I no longer see it. I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal. For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.
Arrange members in struct zio to access only one cache line when
searching through vdev queues. While there, remove io_alloc_node,
reusing the io_queue_node instead. Those two are never used same
time.
Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests. Just remove the tree.
Remove locking from txg_all_lists_empty(). It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.
With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14925
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...
I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...
So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#14938
This function can be frequently called with MAY_EXEC|MAY_NOT_BLOCK
during RCU path walk. Where possible we should try not to break
out of it. In this case we check whether flag ZFS_NO_EXECS_DENIED is
set and check mode (similar to fastexecute check in zfs_acl.c).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
dkms package layout is changed in bookworm and splits into dh-dkms
package. Debhelper in Bookworm is updated to use dh-sequence-dkms
instead of dkms.
GitHub Actions are updated to use Ubuntu 22.04 instead of Ubuntu
20.04, since dh-sequence-dkms is not aavailable on Ubuntu 20.04.
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
This removes an extra memory allocation / free from the
NFS4 ACL xattr handler. Initially this was written rather
quickly in the alpha cycle of SCALE and implemented in a
way to ensure that xattr was exactly matching format
used internally in samba's vfs_acl_xattr module. Since
this time a more efficient conversion between the Samba
format and various other ones was added for the purpose
of inclusion in the Kernel NFS server.
This change simplifies conversion between internal NFS ACL and
external xattr representation, but has no impact on userspace
and kernel consumers of this xattr (format does not change).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
MS-FSCC 2.6 is the governing document for
DOS attribute behavior. It specifies the following:
For a file, applications can read the file but
cannot write to it or delete it. For a directory,
applications cannot delete it, but applications can
create and delete files from the directory.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
We never want to partition vdevs automatically from ZFS in SCALE.
Ignore the wholedisk flag in SCALE and skip the tests that expect
auto partitioning to work.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
With --enable-debuginfo configured, ZFS packages are built with
debug symbols embedded into the binaries.
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
TrueNAS SCALE doesn't boot from pools on top of LVM, and the scan can
take a significant amount of time on systems with a large number of
disks.
Skip the lvm commands in our local-top/zfs script.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
When a new file is created on FreeBSD it is given the group
of the directory which contains it. On Linux it is given
to either the effective GID of the process (System V semantices)
or the GID of the parent directory (BSD semantics).
Since there is no hard-and-fast rule about creation semantics
for NFSv4 ACLs on Linux, we should opt for what is least likely
to break users permissions on change from FreeBSD to Linux.
Avoid setting actually setting the SGID bit on dirs unless
it was explicitly set.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
On Linux POSIX ACLs can be removed via rmxattr() for the
relevant system xattrs. On FreeBSD a non-trivial ACL
can be converted to one that is described by the mode with
no loss of info via combination of acl_get_file(), acl_strip_np(),
and acl_set_file(). Since there's no libc equivalent of these
ops in Linux for NFSv4 ACLs, this commit makes this less error
prone by handling entirely in ZFS. When user performs
rmxattr() vfs_setxattr() is called with value of NULL and length
of 0. Add special handling for this situation in the xattr
handler for the NFSv4 ACL so that we generate a new ACL and
zfs_acl_chmod() with the existing mode of file, then set the ACL.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Add ACL_IS_TRIVIAL and ACL_IS_DIR flags as ACL-wide flags
in the system.nfs4_acl_xdr generated on getxattr requests.
This are non-RFC flags that are useful for userspace applications
(especially the ACL_IS_TRIVIAL flag as it can be used to avoid
relatively expensive ACL-related operations).
Also add system.nfs4_acl_xdr to xattr results if ACL is not trivial.
This duplicates POSIX ACL behavior where whether an ACL is
set on a path can be determined via listxattr(). Since the ACL
is not actually removed, we check whether the ZFS_ACL_TRIVIAL
is set. If the flag is not set, then we omit the xattr name from
the list. This allows users to determine whether ACL is trivial from
listxattr().
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
The "permission" inode operation takes a new `struct user_namespace *`
parameter starting in Linux 5.12.
Add a configure check and adapt accordingly.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Properly evaluate edge cases where user credential may grant capability
to override DAC in various situations. Switch to using ns-aware checks
rather than capable().
Expand optimization allow bypass of zfs_zaccess() in case of trivial
ACL if MAY_OPEN is included in requested mask. This will be evaluated
in generic_permission() check, which is RCU walk safe. This means that
in most cases evaluating permissions on boot volume with NFSv4 ACLs
will follow the fast path on checking inode permissions.
Additionally, CAP_SYS_ADMIN is granted to nfsd process, and so override
for this capability in access2 policy check is removed in favor of a
simple check for fsid == 0. Checks for CAP_DAC_OVERRIDE and other
override capabilities are kept as-is.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
The new sysfs attribute makes kernel to wait for all device probe to
complete before return. Without it wait_for_udev call does not give
any guaranties.
Ticket: NAS-108200
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Now that we support NFSv4 ACLs on Linux, this can now be made the
default across all platforms.
Update the documentation and tests accordingly.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
This implements NFSv41 (RFC 5661) ACLs in a manner
compatible with vfs_nfs4acl_xattr in Samba and
nfs4xdr-acl-tools.
There are three key areas of change in this commit:
1) NFSv4 ACL management through system.nfs4_acl_xdr xattr.
Install an xattr handler for "system.nfs4_acl_xdr" that
presents an xattr containing full NFSv41 ACL structures
generated through rpcgen using specification from the Samba
project. This xattr is used by userspace programs to read and
set permissions.
2) add an i_op->permissions endpoint: zpl_permissions(). This
is used by the VFS in Linux to determine whether to allow /
deny an operation. Wherever possible, we try to avoid having
to call zfs_access(). If kernel has NFSv4 patch for VFS, then
perform more complete check of avaiable access mask.
3) add capability-based overrides to secpolicy_vnode_access2()
there are various situations in which ACL may need to be
overridden based on capabilities. This logic is almost directly
copied from Linux VFS. For instance, root needs to be able to
always read / write ACLs (otherwise admin can get locked out
from files).
This is commit was initially inspired by work from Paul B. Henson
to implement NFSv4.0 (RFC3530) ACLs in ZFS on Linux. Key areas of
divergence are as follows:
- ACL specification, xattr format, xattr name
- Addition of handling for NFSv4 masks from Linux VFS
- Addition of ACL overrides based on capabilities
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
SB_LARGEXATTR is used in TrueNAS SCALE to indicate to the kernel
that the filesystem supports large-size xattrs (greater than 64KiB).
This flag is used to evaluate whether to allow large xattr read
or write requests (up to 2 MiB).
Signed-off-by: Andrew Walker <awalker@ixsystems.com>