Commit Graph

8847 Commits

Author SHA1 Message Date
Alexander Motin eda32dca92
Fix remount when setting multiple properties.
The previous code was checking zfs_is_namespace_prop() only for the
last property on the list.  If that one was not "namespace", then
remount wasn't called.  To fix that, move zfs_is_namespace_prop() inside
the loop and remount if at least one of the properties was "namespace".
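
The difference between the two behaviours can be sketched as follows. This is a minimal model, not the libzfs code: the property set below is an illustrative assumption standing in for whatever zfs_is_namespace_prop() accepts, and the function names are hypothetical.

```python
# Illustrative subset only (assumption); the real answer comes from
# zfs_is_namespace_prop().
NAMESPACE_PROPS = {"atime", "exec", "readonly"}


def is_namespace_prop(name):
    """Stand-in for zfs_is_namespace_prop()."""
    return name in NAMESPACE_PROPS


def needs_remount_last_only(props):
    """Old behaviour as described: only the last property decides."""
    return bool(props) and is_namespace_prop(props[-1])


def needs_remount_any(props):
    """Fixed behaviour: check inside the loop, remount if any one matches."""
    remount = False
    for name in props:
        if is_namespace_prop(name):
            remount = True
    return remount
```

With the properties ["atime", "compression"], the first function reports that no remount is needed, while the second one requests it.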

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15000
2023-06-30 08:36:43 -07:00
vimproved 24554082bd
contrib: dracut: Conditionalize copying of libgcc_s.so.1 to glibc only
The issue that this is designed to work around is only applicable to
glibc, since it's caused by glibc's pthread_cancel() implementation
using dlopen on libgcc_s.so.1 (and therefore not triggering dracut to
include it in the initramfs). This commit adds an extra condition to the
workaround that tests for glibc via "ldconfig -p | grep -qF 'libc.so.6'"
(which should only be present on glibc systems).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Violet Purcell <vimproved@inventati.org>
Closes #14992
2023-06-29 12:54:37 -07:00
Yuri Pankov 77a3bb1f47
spa.h: use IN_BASE instead of IN_FREEBSD_BASE
Consistently get the proper default value for autotrim.

Currently, only the kernel module is built with IN_FREEBSD_BASE,
and libzfs gets the wrong default value, leading to confusion and
incorrect output when the autotrim value was not set explicitly.

Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15016
2023-06-29 11:50:52 -07:00
Mateusz Piotrowski 62ace21a14
zdb: Add missing poolname to -C synopsis
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara Inc.
Closes #15014
2023-06-29 10:54:43 -07:00
Alexander Motin a9d6b0690b
ZIL: Fix another use-after-free.
lwb->lwb_issued_txg can not be accessed after lwb_state is set to
LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
freed by zil_sync().  We must save the txg number before that.

This is similar to 55b1842f92, but as far as I can see the bug is not
new.  It has existed for quite a while, but was not triggered due to a
smaller race window.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14988
Closes #14999
2023-06-27 17:03:37 -07:00
Alexander Motin b0cbc1aa9a
Use big transactions for small recordsize writes.
When ZFS appends to files in chunks bigger than recordsize, it borrows
a buffer from ARC and fills it before opening a transaction.  This is
supposed to help in case of page faults, so that the transaction is not
held open indefinitely.  The problem appears when recordsize is set
lower than the default 128KB.  Since each block is committed in a
separate transaction, per-transaction overhead becomes significant, and
what is even worse, active use of per-dataset and per-pool locks to
protect space use accounting for each transaction badly hurts the
code's SMP scalability.  The same transaction size limitation applies
in case of file rewrite, but without even the excuse of buffer
borrowing.

To address the issue, disable the borrowing mechanism if recordsize
is smaller than the default and the write request is 4x bigger than it.
In such a case writes up to 32MB are executed in a single transaction,
which dramatically reduces overhead and lock contention.  Since the
borrowing mechanism is not used for file rewrites, and it was never
used by zvols, which seem to work fine, I don't think this change
should create significant problems, partially because pre-faults are
also used in addition to the borrowing mechanism.
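
The rule above can be modelled roughly as below. This is a sketch under assumptions, not the ZFS write path: the function names are hypothetical, "4x bigger" is read as "at least 4x", and borrowing is assumed to apply to any request of at least one record.

```python
DEFAULT_RECORDSIZE = 128 * 1024      # default recordsize named in the message
MAX_TX_BYTES = 32 * 1024 * 1024      # "writes up to 32MB" per transaction


def borrows_arc_buffer(recordsize, write_len):
    """Decide whether an append would use the ARC buffer borrowing path."""
    if write_len < recordsize:
        return False                 # assumption: sub-record writes never borrow
    if recordsize < DEFAULT_RECORDSIZE and write_len >= 4 * recordsize:
        return False                 # new rule: small records, large request
    return True


def transactions_needed(recordsize, write_len):
    """Count transactions: one per block when borrowing, else one per 32MB."""
    chunk = recordsize if borrows_arc_buffer(recordsize, write_len) else MAX_TX_BYTES
    return -(-write_len // chunk)    # ceiling division
```

For a 1MB request at 32KB recordsize this model yields one transaction, where committing each block separately would take 32.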

My tests with 4/8 threads writing several files at the same time on
datasets with 32KB recordsize in 1MB requests show a reduction of CPU
usage by the user threads of 25-35%.  I would measure it in GB/s, but
at that block size we are now limited by the lock contention of the
single write issue taskqueue, which is a separate problem we are going
to work on.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14964
2023-06-27 17:00:30 -07:00
Laevos bc9d0084ea
Remove unnecessary commas in zpool-create.8
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
Closes #15011
2023-06-27 16:58:32 -07:00
Alexander Motin 638a717d2a
Merge pull request #142 from truenas/NAS-122578-zfsd-exec-perm
NAS-122578 / None / Make zfsd executable in order to run it from the rc.d script
2023-06-27 18:26:25 -04:00
Ameer Hamza 66066524df Make zfsd executable in order to run it from the rc.d script
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-28 02:52:14 +05:00
Ameer Hamza 1e1523c2e5
Merge pull request #140 from truenas/zfs-2.2-build-fix
linux 6.3 compat changes for truenas/zfs-2.2-release
2023-06-28 02:38:51 +05:00
Ameer Hamza d00a585773 linux 6.3 compat changes for truenas/zfs-2.2-release
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-27 23:24:39 +05:00
Alexander Motin 8469b5aac0
Another set of vdev queue optimizations.
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists.  AVL-trees are too expensive
for such a simple task.  To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.

To avoid checking the number of queued I/Os for each priority, add a
vq_cqueued bitmap to struct vdev_queue.  Update it when adding/removing
I/Os.  Make vq_cactive a separate array instead of a struct
vdev_queue_class member.  Together these avoid lots of cache misses
when looking for work in vdev_queue_class_to_issue().
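
The bitmap idea can be illustrated with a small model. The class and attribute names here are hypothetical and do not mirror struct vdev_queue; the point is only that one integer answers "which priorities have queued I/Os" without touching each per-priority queue.

```python
class VdevQueueSketch:
    """Toy model: per-priority counters plus a summary bitmap."""

    def __init__(self, num_priorities):
        self.queued = [0] * num_priorities  # queued I/O count per priority
        self.cqueued = 0                    # bit p is set iff queued[p] > 0

    def add(self, prio):
        self.queued[prio] += 1
        self.cqueued |= 1 << prio

    def remove(self, prio):
        assert self.queued[prio] > 0
        self.queued[prio] -= 1
        if self.queued[prio] == 0:
            self.cqueued &= ~(1 << prio)    # last I/O gone, clear the bit

    def pending_priorities(self):
        """Priorities with work, found from the bitmap alone."""
        return [p for p in range(len(self.queued)) if self.cqueued & (1 << p)]
```

A caller looking for work can test `cqueued` first and skip every priority whose bit is clear.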

Introduce a deadline of ~0.5s for LBA-sorted queues.  Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation.  With this change I no longer see it.  I
had to make the comparison function slightly more complicated, but
since it uses all the same cache lines the difference is minimal.
For sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses the simpler avl_first(), falling back to avl_find() and
avl_nearest() only when needed.
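
The message does not spell out how the deadline enters the comparison function. One plausible approach, assumed here purely for illustration, is to sort first by a coarse time bucket and then by offset, so that I/Os from an older bucket always win over newer ones regardless of LBA. The shift value and function names are hypothetical.

```python
DEADLINE_SHIFT = 29  # assumption: 2**29 ns is about 0.54 s per time bucket


def queue_key(io):
    """Sort key for an I/O given as (timestamp_ns, offset)."""
    timestamp_ns, offset = io
    return (timestamp_ns >> DEADLINE_SHIFT, offset)


def next_to_issue(queue):
    """Pick the I/O with the smallest key: oldest bucket, then lowest offset."""
    return min(queue, key=queue_key)
```

Within one bucket the order is purely by offset, which keeps sequential access patterns intact; an I/O that has waited past the bucket boundary is served before any newer request.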

Arrange members in struct zio to access only one cache line when
searching through vdev queues.  While there, remove io_alloc_node,
reusing the io_queue_node instead.  Those two are never used at the
same time.

Remove zfs_vdev_aggregate_trim parameter.  It has been disabled for
the 4 years since it was implemented, while time was still wasted
maintaining the offset-sorted tree of TRIM requests.  Just remove the
tree.

Remove locking from txg_all_lists_empty().  It is racy by design,
while 2 pairs of locks/unlocks take noticeable time under the vdev
queue lock.

With these changes in my tests with volblocksize=4KB I measure a vdev
queue lock spin time reduction of 50% on read and 75% on write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14925
2023-06-27 09:09:48 -07:00
Rich Ercolani 35a6247c5f
Add a delay to tearing down threads.
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...

I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...

So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.
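
The effect of such a delay can be sketched with a small model. The class and method names are hypothetical and the time source is passed in explicitly; this is not the taskq implementation.

```python
class IdleReaper:
    """Toy model: destroy a thread only after it has stayed idle for `delay`."""

    def __init__(self, delay):
        self.delay = delay
        self.idle_since = {}  # thread id -> time it was first noticed idle

    def should_destroy(self, tid, is_idle, now):
        if not is_idle:
            self.idle_since.pop(tid, None)  # busy again, forget idle history
            return False
        first_seen = self.idle_since.setdefault(tid, now)
        return now - first_seen >= self.delay
```

With a delay of zero this degenerates to tearing a thread down the first time it is seen idle, which is the behaviour the commit moves away from.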

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
2023-06-26 13:57:12 -07:00
Ameer Hamza a52c8d49c6
Merge pull request #139 from truenas/truenas/zfs-2.2-testing
Forward port truenas/zfs patches to upstream openzfs master
2023-06-22 16:30:04 +05:00
Ameer Hamza 5bf0d5db13 Bump changelog for 2.1.99
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza fc2b4d3458 Skip id-mapped tests for now due to nfsv4 acls incompatibility
2023-06-21 21:29:23 +05:00
Ameer Hamza e3b5817448 Port latest zfsd changes from upstream FreeBSD
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza 06029e211c Port TrueNAS contrib changes and adjust github workflows
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew Walker c16f99d389 Improve zpl_permission performance
This function can be frequently called with MAY_EXEC|MAY_NOT_BLOCK
during RCU path walk. Where possible we should try not to break
out of it. In this case we check whether flag ZFS_NO_EXECS_DENIED is
set and check mode (similar to fastexecute check in zfs_acl.c).

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza f34365ed28 zfsd: add support for hotplugging spares
If you remove an unused spare and then reinsert it, zfsd will now online
it in all pools.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Umer Saleem 0b58a60509 Fix OpenZFS build issue for Debian Bookworm
The dkms package layout has changed in Bookworm and is split into a
dh-dkms package. Debhelper in Bookworm is updated to use
dh-sequence-dkms instead of dkms.

GitHub Actions are updated to use Ubuntu 22.04 instead of Ubuntu
20.04, since dh-sequence-dkms is not available on Ubuntu 20.04.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew Walker 33ec2c3e96 Simplify get/set NFS4 ACL (#113)
This removes an extra memory allocation / free from the
NFS4 ACL xattr handler. Initially this was written rather
quickly in the alpha cycle of SCALE and implemented in a
way to ensure that the xattr exactly matched the format
used internally in samba's vfs_acl_xattr module. Since
then, a more efficient conversion between the Samba
format and various other ones was added for the purpose
of inclusion in the Kernel NFS server.

This change simplifies conversion between internal NFS ACL and
external xattr representation, but has no impact on userspace
and kernel consumers of this xattr (format does not change).

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew Walker 09a0c8a0ee Fix ZFS_READONLY implementation on Linux (#121)
MS-FSCC 2.6 is the governing document for
DOS attribute behavior. It specifies the following:

For a file, applications can read the file but
cannot write to it or delete it. For a directory,
applications cannot delete it, but applications can
create and delete files from the directory.
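
The quoted rule can be written down as a small decision table. This is a sketch of the semantics as stated above only; the operation names are hypothetical labels, not kernel or ZFS identifiers.

```python
# Operations denied on an object that carries the read-only attribute.
FILE_DENIED = {"write", "delete"}
DIR_DENIED = {"delete"}


def readonly_denies(is_dir, op):
    """Return True if `op` on a read-only file or directory must be refused."""
    return op in (DIR_DENIED if is_dir else FILE_DENIED)
```

Creating or deleting entries inside a read-only directory is therefore still allowed, while the directory itself cannot be removed.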

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Umer Saleem 02af6c4175 Update CI workflow for native packages
CI workflow now builds RPM converted Debian packages along with
native debian packages.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ryan Moeller 6115cf6a76 SCALE: ignore wholedisk
We never want to partition vdevs automatically from ZFS in SCALE.

Ignore the wholedisk flag in SCALE and skip the tests that expect
auto partitioning to work.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 21:29:23 +05:00
Umer Saleem f4efe4ea92 Build packages with debug symbols
With --enable-debuginfo configured, ZFS packages are built with
debug symbols embedded into the binaries.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza f41d5dc6f1 Add kfpu entry to kbuild and suppress Cppcheck checks
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ryan Moeller 26b74065b9 Provide kfpu_begin/end from spl
Jira: NAS-115648
2023-06-21 21:29:23 +05:00
Ryan Moeller 631adac5f6 initramfs: Skip lvm scan before boot pool import
TrueNAS SCALE doesn't boot from pools on top of LVM, and the scan can
take a significant amount of time on systems with a large number of
disks.

Skip the lvm commands in our local-top/zfs script.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 21:29:23 +05:00
Andrew ac2420afb0 NAS-116836 / Force BSD semantics for group ownership if NFSV4ACL (#78)
When a new file is created on FreeBSD it is given the group
of the directory which contains it. On Linux it is given
either the effective GID of the process (System V semantics)
or the GID of the parent directory (BSD semantics).

Since there is no hard-and-fast rule about creation semantics
for NFSv4 ACLs on Linux, we should opt for what is least likely
to break users' permissions on change from FreeBSD to Linux.

Avoid actually setting the SGID bit on dirs unless
it was explicitly set.
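
The two semantics can be contrasted in a few lines. This is a sketch with hypothetical names; on Linux the choice is normally driven by the set-group-ID bit on the parent directory, which is general background rather than something stated in the message, and `force_bsd` stands in for the NFSv4 ACL case this commit describes.

```python
def new_file_gid(parent_gid, parent_has_sgid, process_egid, force_bsd=False):
    """Pick the group of a newly created file.

    BSD semantics: inherit the parent directory's GID.
    System V semantics: use the creating process's effective GID.
    """
    if force_bsd or parent_has_sgid:
        return parent_gid
    return process_egid
```

Forcing BSD semantics gives the same result as an SGID parent directory, without the SGID bit having to be set on the directory.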

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza c0d493822b Fix ACL build errors on sync with openzfs/master
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew 6bf8daf376 Add ability for xattr handler to "strip" NFSv4 ACL (#54)
On Linux POSIX ACLs can be removed via rmxattr() for the
relevant system xattrs. On FreeBSD a non-trivial ACL
can be converted to one that is described by the mode with
no loss of info via combination of acl_get_file(), acl_strip_np(),
and acl_set_file(). Since there's no libc equivalent of these
ops in Linux for NFSv4 ACLs, this commit makes this less error
prone by handling it entirely in ZFS. When a user performs
rmxattr(), vfs_setxattr() is called with a value of NULL and a length
of 0. Add special handling for this situation in the xattr
handler for the NFSv4 ACL so that we generate a new ACL and call
zfs_acl_chmod() with the existing mode of the file, then set the ACL.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew 6dc46c7d54 NAS-115465 / 22.12 / expose ZFS_ACL_TRIVIAL to users (#52)
Add ACL_IS_TRIVIAL and ACL_IS_DIR flags as ACL-wide flags
in the system.nfs4_acl_xdr generated on getxattr requests.

These are non-RFC flags that are useful for userspace applications
(especially the ACL_IS_TRIVIAL flag, as it can be used to avoid
relatively expensive ACL-related operations).

Also add system.nfs4_acl_xdr to xattr results if the ACL is not trivial.
This duplicates POSIX ACL behavior, where whether an ACL is
set on a path can be determined via listxattr(). Since the ACL
is not actually removed, we check whether the ZFS_ACL_TRIVIAL flag
is set. If the flag is set, then we omit the xattr name from
the list. This allows users to determine whether an ACL is trivial
from listxattr().
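
The listing rule can be sketched as a filter: the ACL xattr name is reported only when the ACL is not trivial. The function names are hypothetical and the model ignores everything else a real listxattr() handler does.

```python
NFS4_ACL_XATTR = "system.nfs4_acl_xdr"


def list_xattrs(other_names, acl_is_trivial):
    """Return xattr names, adding the ACL xattr only for non-trivial ACLs."""
    names = list(other_names)
    if not acl_is_trivial:
        names.append(NFS4_ACL_XATTR)
    return names


def has_nontrivial_acl(names):
    """What a userspace caller can infer from the listing alone."""
    return NFS4_ACL_XATTR in names
```

A caller can thus skip fetching and decoding the ACL whenever the name is absent from the listing.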

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ryan Moeller e5f1583a08 Make zpl_permission work with 5.12+ kernels
The "permission" inode operation takes a new `struct user_namespace *`
parameter starting in Linux 5.12.

Add a configure check and adapt accordingly.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 02:51:24 +05:00
Ryan Moeller e7904b8280 Switch to production builds for SCALE
Jira: NAS-113186

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 02:51:24 +05:00
Andrew Walker 8503a85e06 Fix access check when cred allows override of ACL
Properly evaluate edge cases where user credential may grant capability
to override DAC in various situations. Switch to using ns-aware checks
rather than capable().

Expand the optimization to allow bypass of zfs_zaccess() in case of a
trivial ACL if MAY_OPEN is included in the requested mask. This will be
evaluated in the generic_permission() check, which is RCU walk safe.
This means that in most cases evaluating permissions on a boot volume
with NFSv4 ACLs will follow the fast path on checking inode permissions.

Additionally, CAP_SYS_ADMIN is granted to the nfsd process, and so the
override for this capability in the access2 policy check is removed in
favor of a simple check for fsid == 0. Checks for CAP_DAC_OVERRIDE and
other override capabilities are kept as-is.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 02:51:24 +05:00
Alexander Motin 4d8b67b164 Write /sys/kernel/wait_for_device_probe before import.
The new sysfs attribute makes the kernel wait for all device probes to
complete before returning.  Without it the wait_for_udev call does not
give any guarantees.

Ticket:	NAS-108200

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
2023-06-21 02:51:24 +05:00
Ryan Moeller c078b8660e Make acltype=nfsv4 the default on Linux, too
Now that we support NFSv4 ACLs on Linux, this can now be made the
default across all platforms.

Update the documentation and tests accordingly.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 02:51:24 +05:00
Ameer Hamza 3c72bef6bd Adjust zfsd Makefiles for openzfs compatibility
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 02:51:15 +05:00
Ryan Moeller 35ca19b591 Add zfsd for FreeBSD
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 00:33:40 +05:00
Andrew c6ba4a01f0 Implement NFSv41 ACLs through xattr
This implements NFSv41 (RFC 5661) ACLs in a manner
compatible with vfs_nfs4acl_xattr in Samba and
nfs4xdr-acl-tools.

There are three key areas of change in this commit:
1) NFSv4 ACL management through system.nfs4_acl_xdr xattr.
  Install an xattr handler for "system.nfs4_acl_xdr" that
  presents an xattr containing full NFSv41 ACL structures
  generated through rpcgen using specification from the Samba
  project. This xattr is used by userspace programs to read and
  set permissions.

2) Add an i_op->permissions endpoint: zpl_permissions(). This
  is used by the VFS in Linux to determine whether to allow /
  deny an operation. Wherever possible, we try to avoid having
  to call zfs_access(). If the kernel has the NFSv4 patch for VFS,
  then perform a more complete check of the available access mask.

3) Add capability-based overrides to secpolicy_vnode_access2().
  There are various situations in which an ACL may need to be
  overridden based on capabilities. This logic is almost directly
  copied from the Linux VFS. For instance, root needs to be able to
  always read / write ACLs (otherwise the admin can get locked out
  from files).

This commit was initially inspired by work from Paul B. Henson
to implement NFSv4.0 (RFC3530) ACLs in ZFS on Linux. Key areas of
divergence are as follows:
- ACL specification, xattr format, xattr name
- Addition of handling for NFSv4 masks from Linux VFS
- Addition of ACL overrides based on capabilities

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 00:33:32 +05:00
Andrew Walker 5e1eba8718 Advertise support for large xattrs on TrueNAS
SB_LARGEXATTR is used in TrueNAS SCALE to indicate to the kernel
that the filesystem supports large-size xattrs (greater than 64KiB).

This flag is used to evaluate whether to allow large xattr read
or write requests (up to 2 MiB).

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 00:33:25 +05:00
Waqar Ahmed cfd08bedb2 Add action to build and push docker image on master update
Signed-off-by: Waqar Ahmed <waqarahmedjoyia@live.com>
2023-06-21 00:33:20 +05:00
Andrew Walker 17d7f9de97 Add check for custom TrueNAS kernel
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 00:33:13 +05:00
Waqar Ahmed fd31804abc Add CI for building zfs package
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 00:33:06 +05:00
Matt Macy ae78a23f75 Fix ZFS_DEBUG_MODIFY assert in arc_buf_try_copy_decompressed_data
The assert does not account for the case where there is a single
buffer in the chain that is decompressed and has a valid
checksum.

Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
2023-06-21 00:32:59 +05:00
Ryan Moeller 23f878a89d Add packaging bits for TrueNAS SCALE
2023-06-21 00:32:51 +05:00
Alexander Motin 8e8acabdca
Fix memory leak in zil_parse().
482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
leaking up to 128KB of memory per dataset during ZIL replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14987
2023-06-17 19:51:37 -07:00
George Amanakis 10e36e1761
Shorten arcstat_quiescence sleep time
With the latest L2ARC fixes, 2 seconds is too long to wait for
quiescence of arcstats like l2_size. Shorten this interval to avoid
having the persistent L2ARC tests in ZTS prematurely terminated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14981
2023-06-15 12:45:36 -07:00
Alexander Motin ccec7fbe1c
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg. They allow
accounting for the completion of individual physical writes within a
logical one, improving cases where some of the physical writes complete
much sooner than others, gradually opening the write throttle.

A few years after that ZFS got allocation throttling, working at the
level of logical writes and limiting the number of writes queued to
vdevs at any point, and so limiting the latency distribution between
the physical writes and especially writes of multiple copies.
The addition of the scheduling deadline I proposed in #14925 should
further reduce the latency distribution.  Grown memory sizes over
the past 10 years should also reduce the importance of the smoothing.

While the use of the physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it.  Since dirty data accounting is protected by a pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.

My tests of this patch show a radical reduction of the lock spinning
time on workloads where smaller blocks are written to RAIDZ pools,
where each of the disks receives 8-16KB chunks, but the total rate
reaches 100K+ blocks per second.  At the same time, attempts to measure
any write time fluctuations didn't show anything noticeable.

While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14948
2023-06-15 10:49:03 -07:00