Commit Graph

8935 Commits

Author SHA1 Message Date
Rich Ercolani 35a6247c5f
Add a delay to tearing down threads.
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...

I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...

So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
2023-06-26 13:57:12 -07:00
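The effect described in the commit above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the taskq code: `simulate` and `idle_delay` are hypothetical names, and time is modelled as discrete ticks. A thread is torn down only after it has been idle for more than `idle_delay` ticks, so a delay of 0 reproduces "tear down the first time it is seen idle".

```python
def simulate(events, idle_delay):
    """Count thread creations and teardowns for a demand pattern.

    events:     per-tick demand (number of busy threads needed).
    idle_delay: hypothetical knob; ticks a thread may sit idle before
                it is torn down.
    Returns (created, destroyed, threads_alive_at_end).
    """
    threads = []  # one idle-tick counter per live thread
    created = destroyed = 0
    for demand in events:
        # Spin up threads when demand exceeds the current pool.
        while len(threads) < demand:
            threads.append(0)
            created += 1
        # The first `demand` threads are busy this tick; the rest idle.
        for i in range(len(threads)):
            threads[i] = 0 if i < demand else threads[i] + 1
        # Tear down threads that have been idle longer than the delay.
        survivors = [t for t in threads if t <= idle_delay]
        destroyed += len(threads) - len(survivors)
        threads = survivors
    return created, destroyed, len(threads)


bursty = [4, 0] * 50  # four tasks, then nothing, repeated
print(simulate(bursty, idle_delay=0))
print(simulate(bursty, idle_delay=1))
```

With the bursty pattern, the zero-delay run keeps recreating the same four threads, while a one-tick delay keeps them alive across the gaps: more threads at steady state, far less churn.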
Ameer Hamza a52c8d49c6
Merge pull request #139 from truenas/truenas/zfs-2.2-testing
Forward port truenas/zfs patches to upstream openzfs master
2023-06-22 16:30:04 +05:00
Ameer Hamza 5bf0d5db13 Bump changelog for 2.1.99
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza fc2b4d3458 Skip id-mapped tests for now due to nfsv4 acls incompatibility 2023-06-21 21:29:23 +05:00
Ameer Hamza e3b5817448 Port latest zfsd changes from upstream FreeBSD
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza 06029e211c Port TrueNAS contrib changes and adjust github workflows
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew Walker c16f99d389 Improve zpl_permission performance
This function can be frequently called with MAY_EXEC|MAY_NOT_BLOCK
during RCU path walk. Where possible we should try not to break
out of it. In this case we check whether flag ZFS_NO_EXECS_DENIED is
set and check mode (similar to fastexecute check in zfs_acl.c).

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza f34365ed28 zfsd: add support for hotplugging spares
If you remove an unused spare and then reinsert it, zfsd will now online
it in all pools.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Umer Saleem 0b58a60509 Fix OpenZFS build issue for Debian Bookworm
The dkms package layout changed in Bookworm, with part of it split
out into the dh-dkms package. Debhelper in Bookworm is updated to use
dh-sequence-dkms instead of dkms.

GitHub Actions are updated to use Ubuntu 22.04 instead of Ubuntu
20.04, since dh-sequence-dkms is not available on Ubuntu 20.04.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew Walker 33ec2c3e96 Simplify get/set NFS4 ACL (#113)
This removes an extra memory allocation / free from the
NFS4 ACL xattr handler. Initially this was written rather
quickly in the alpha cycle of SCALE and implemented in a
way to ensure that xattr was exactly matching format
used internally in samba's vfs_acl_xattr module. Since
this time a more efficient conversion between the Samba
format and various other ones was added for the purpose
of inclusion in the Kernel NFS server.

This change simplifies conversion between internal NFS ACL and
external xattr representation, but has no impact on userspace
and kernel consumers of this xattr (format does not change).

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew Walker 09a0c8a0ee Fix ZFS_READONLY implementation on Linux (#121)
MS-FSCC 2.6 is the governing document for
DOS attribute behavior. It specifies the following:

For a file, applications can read the file but
cannot write to it or delete it. For a directory,
applications cannot delete it, but applications can
create and delete files from the directory.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Umer Saleem 02af6c4175 Update CI workflow for native packages
CI workflow now builds RPM converted Debian packages along with
native debian packages.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ryan Moeller 6115cf6a76 SCALE: ignore wholedisk
We never want to partition vdevs automatically from ZFS in SCALE.

Ignore the wholedisk flag in SCALE and skip the tests that expect
auto partitioning to work.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 21:29:23 +05:00
Umer Saleem f4efe4ea92 Build packages with debug symbols
With --enable-debuginfo configured, ZFS packages are built with
debug symbols embedded into the binaries.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ameer Hamza f41d5dc6f1 Add kfpu entry to kbuild and suppress Cppcheck checks
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Ryan Moeller 26b74065b9 Provide kfpu_begin/end from spl
Jira: NAS-115648
2023-06-21 21:29:23 +05:00
Ryan Moeller 631adac5f6 initramfs: Skip lvm scan before boot pool import
TrueNAS SCALE doesn't boot from pools on top of LVM, and the scan can
take a significant amount of time on systems with a large number of
disks.

Skip the lvm commands in our local-top/zfs script.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 21:29:23 +05:00
Andrew ac2420afb0 NAS-116836 / Force BSD semantics for group ownership if NFSV4ACL (#78)
When a new file is created on FreeBSD it is given the group
of the directory which contains it. On Linux it is given
either the effective GID of the process (System V semantics)
or the GID of the parent directory (BSD semantics).

Since there is no hard-and-fast rule about creation semantics
for NFSv4 ACLs on Linux, we should opt for what is least likely
to break users' permissions on change from FreeBSD to Linux.

Avoid actually setting the SGID bit on dirs unless
it was explicitly set.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
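The two creation semantics named in the commit above can be sketched as a small decision function. This is an illustration only, not the ZFS code: `new_file_gid` and its parameters are hypothetical names. On Linux, System V semantics still inherit the directory's group when the directory carries the SGID bit, which is why forcing BSD semantics makes setting that bit unnecessary.

```python
def new_file_gid(process_egid, dir_gid, dir_has_sgid, bsd_semantics):
    """Pick the group of a newly created file.

    bsd_semantics: True models the forced BSD behaviour (always inherit
                   the parent directory's group).
    dir_has_sgid:  under System V semantics, the SGID bit on the parent
                   directory also causes its group to be inherited.
    """
    if bsd_semantics or dir_has_sgid:
        return dir_gid
    return process_egid


# System V semantics, no SGID on the directory: the process's group wins.
print(new_file_gid(1000, 50, dir_has_sgid=False, bsd_semantics=False))
# BSD semantics: the directory's group wins without needing the SGID bit.
print(new_file_gid(1000, 50, dir_has_sgid=False, bsd_semantics=True))
```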
Ameer Hamza c0d493822b Fix ACL build errors on sync with openzfs/master
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew 6bf8daf376 Add ability for xattr handler to "strip" NFSv4 ACL (#54)
On Linux POSIX ACLs can be removed via rmxattr() for the
relevant system xattrs. On FreeBSD a non-trivial ACL
can be converted to one that is described by the mode with
no loss of info via combination of acl_get_file(), acl_strip_np(),
and acl_set_file(). Since there's no libc equivalent of these
ops in Linux for NFSv4 ACLs, this commit makes this less error
prone by handling entirely in ZFS. When user performs
rmxattr() vfs_setxattr() is called with value of NULL and length
of 0. Add special handling for this situation in the xattr
handler for the NFSv4 ACL so that we generate a new ACL and
zfs_acl_chmod() with the existing mode of file, then set the ACL.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
Andrew 6dc46c7d54 NAS-115465 / 22.12 / expose ZFS_ACL_TRIVIAL to users (#52)
Add ACL_IS_TRIVIAL and ACL_IS_DIR flags as ACL-wide flags
in the system.nfs4_acl_xdr generated on getxattr requests.

These are non-RFC flags that are useful for userspace applications
(especially the ACL_IS_TRIVIAL flag as it can be used to avoid
relatively expensive ACL-related operations).

Also add system.nfs4_acl_xdr to xattr results if ACL is not trivial.
This duplicates POSIX ACL behavior where whether an ACL is
set on a path can be determined via listxattr(). Since the ACL
is not actually removed, we check whether the ZFS_ACL_TRIVIAL
flag is set. If the flag is set, then we omit the xattr name from
the list. This allows users to determine whether ACL is trivial from
listxattr().

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 21:29:23 +05:00
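The listxattr() behaviour described in the commit above (the ACL xattr name appears only when the ACL is not trivial) can be sketched as follows. This is a hedged illustration, not the xattr handler itself: `listxattr_names` and `acl_is_trivial` are hypothetical names, and only the xattr name `system.nfs4_acl_xdr` comes from the commit message.

```python
NFS4_ACL_XATTR = "system.nfs4_acl_xdr"  # name taken from the commit message


def listxattr_names(other_names, acl_is_trivial):
    """Return the xattr names a listing would report.

    other_names:    names of any other xattrs on the file.
    acl_is_trivial: models the ZFS_ACL_TRIVIAL flag being set.
    """
    names = list(other_names)
    if not acl_is_trivial:
        # Only a non-trivial ACL is advertised, mirroring how the
        # presence of a POSIX ACL can be detected from a listing.
        names.append(NFS4_ACL_XATTR)
    return names


print(listxattr_names(["user.comment"], acl_is_trivial=True))
print(listxattr_names(["user.comment"], acl_is_trivial=False))
```

A caller can then test for the name in the listing instead of fetching and parsing the ACL.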
Ryan Moeller e5f1583a08 Make zpl_permission work with 5.12+ kernels
The "permission" inode operation takes a new `struct user_namespace *`
parameter starting in Linux 5.12.

Add a configure check and adapt accordingly.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 02:51:24 +05:00
Ryan Moeller e7904b8280 Switch to production builds for SCALE
Jira: NAS-113186

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 02:51:24 +05:00
Andrew Walker 8503a85e06 Fix access check when cred allows override of ACL
Properly evaluate edge cases where user credential may grant capability
to override DAC in various situations. Switch to using ns-aware checks
rather than capable().

Expand optimization to allow bypass of zfs_zaccess() in case of trivial
ACL if MAY_OPEN is included in requested mask. This will be evaluated
in generic_permission() check, which is RCU walk safe. This means that
in most cases evaluating permissions on boot volume with NFSv4 ACLs
will follow the fast path on checking inode permissions.

Additionally, CAP_SYS_ADMIN is granted to nfsd process, and so override
for this capability in access2 policy check is removed in favor of a
simple check for fsid == 0. Checks for CAP_DAC_OVERRIDE and other
override capabilities are kept as-is.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 02:51:24 +05:00
Alexander Motin 4d8b67b164 Write /sys/kernel/wait_for_device_probe before import.
The new sysfs attribute makes the kernel wait for all device probes to
complete before returning.  Without it the wait_for_udev call does not
give any guarantees.

Ticket:	NAS-108200

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
2023-06-21 02:51:24 +05:00
Ryan Moeller c078b8660e Make acltype=nfsv4 the default on Linux, too
Now that we support NFSv4 ACLs on Linux, this can now be made the
default across all platforms.

Update the documentation and tests accordingly.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 02:51:24 +05:00
Ameer Hamza 3c72bef6bd Adjust zfsd Makefiles for openzfs compatibility
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-06-21 02:51:15 +05:00
Ryan Moeller 35ca19b591 Add zfsd for FreeBSD
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 00:33:40 +05:00
Andrew c6ba4a01f0 Implement NFSv41 ACLs through xattr
This implements NFSv41 (RFC 5661) ACLs in a manner
compatible with vfs_nfs4acl_xattr in Samba and
nfs4xdr-acl-tools.

There are three key areas of change in this commit:
1) NFSv4 ACL management through system.nfs4_acl_xdr xattr.
  Install an xattr handler for "system.nfs4_acl_xdr" that
  presents an xattr containing full NFSv41 ACL structures
  generated through rpcgen using specification from the Samba
  project. This xattr is used by userspace programs to read and
  set permissions.

2) add an i_op->permissions endpoint: zpl_permissions(). This
  is used by the VFS in Linux to determine whether to allow /
  deny an operation. Wherever possible, we try to avoid having
  to call zfs_access(). If kernel has NFSv4 patch for VFS, then
  perform more complete check of available access mask.

3) add capability-based overrides to secpolicy_vnode_access2()
  there are various situations in which ACL may need to be
  overridden based on capabilities. This logic is almost directly
  copied from Linux VFS. For instance, root needs to be able to
  always read / write ACLs (otherwise admin can get locked out
  from files).

This commit was initially inspired by work from Paul B. Henson
to implement NFSv4.0 (RFC3530) ACLs in ZFS on Linux. Key areas of
divergence are as follows:
- ACL specification, xattr format, xattr name
- Addition of handling for NFSv4 masks from Linux VFS
- Addition of ACL overrides based on capabilities

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 00:33:32 +05:00
Andrew Walker 5e1eba8718 Advertise support for large xattrs on TrueNAS
SB_LARGEXATTR is used in TrueNAS SCALE to indicate to the kernel
that the filesystem supports large-size xattrs (greater than 64KiB).

This flag is used to evaluate whether to allow large xattr read
or write requests (up to 2 MiB).

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
2023-06-21 00:33:25 +05:00
Waqar Ahmed cfd08bedb2 Add action to build and push docker image on master update
Signed-off-by: Waqar Ahmed <waqarahmedjoyia@live.com>
2023-06-21 00:33:20 +05:00
Andrew Walker 17d7f9de97 Add check for custom TrueNAS kernel
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 00:33:13 +05:00
Waqar Ahmed fd31804abc Add CI for building zfs package
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
2023-06-21 00:33:06 +05:00
Matt Macy ae78a23f75 Fix ZFS_DEBUG_MODIFY assert in arc_buf_try_copy_decompressed_data
The assert does not account for the case where there is a single
buffer in the chain that is decompressed and has a valid
checksum.

Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
2023-06-21 00:32:59 +05:00
Ryan Moeller 23f878a89d Add packaging bits for TrueNAS SCALE 2023-06-21 00:32:51 +05:00
Alexander Motin 8e8acabdca
Fix memory leak in zil_parse().
482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
leaking up to 128KB of memory per dataset during ZIL replay.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14987
2023-06-17 19:51:37 -07:00
George Amanakis 10e36e1761
Shorten arcstat_quiescence sleep time
With the latest L2ARC fixes, 2 seconds is too long to wait for
quiescence of arcstats like l2_size. Shorten this interval to avoid
having the persistent L2ARC tests in ZTS prematurely terminated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14981
2023-06-15 12:45:36 -07:00
Alexander Motin ccec7fbe1c
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg. They allow
accounting for completion of individual physical writes within a logical
one, improving cases when some of physical writes complete much
sooner than others, gradually opening the write throttle.

A few years after that, ZFS got allocation throttling, working on a
level of logical writes and limiting number of writes queued to
vdevs at any point, and so limiting latency distribution between
the physical writes and especially writes of multiple copies.
The addition of scheduling deadline I proposed in #14925 should
further reduce the latency distribution.  Grown memory sizes over
the past 10 years should also reduce importance of the smoothing.

While the use of physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it.  Since dirty data accounting is protected by pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.

My tests of this patch show radical reduction of the lock spinning
time on workloads when smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaching 100K+ blocks per second.  At the same time, attempts to measure
any write time fluctuations didn't show anything noticeable.

While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14948
2023-06-15 10:49:03 -07:00
Brian Behlendorf e32e326c5b
ZTS: Skip send_raw_ashift on FreeBSD
On FreeBSD 14 this test runs slowly in the CI environment
and is killed by the 10 minute timeout.  Skip the test on
FreeBSD until the slow down is resolved.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14961
2023-06-14 08:04:05 -07:00
Alexander Motin d057807ede
Switch refcount tracking from lists to AVL-trees.
With a large number of tracked references, list searches under the lock
become too expensive, creating enormous lock contention.

In my tests with ZFS_DEBUG enabled this increases write throughput
with 32KB blocks from ~1.2GB/s to ~7.5GB/s.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14970
2023-06-14 08:02:27 -07:00
George Amanakis 8af1104f83
Store the L2ARC device ashift in the vdev label
If this is not done, and the pool has an ashift other than the default
(at the moment 9) then the following happens:

1) vdev_alloc() assigns the ashift of the pool to L2ARC device, but
   upon export it is not stored anywhere
2) at the first import, vdev_open() sees a vdev_ashift() of 0 and
   assigns the logical_ashift, which is 9
3) reading the contents of L2ARC, including the header fails
4) L2ARC buffers are not restored in ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14313 
Closes #14963
2023-06-14 08:01:17 -07:00
George Amanakis feff9dfed3
Fix the L2ARC write size calculating logic (2)
While commit bcd5321 adjusts the write size based on the size of the log
block, this happens after comparing the unadjusted write size to the
evicted (target) size.

In this case l2ad_hand will exceed l2ad_evict and violate an assertion
at the end of l2arc_write_buffers().

Fix this by adding the max log block size to the allocated size of the
buffer to be committed before comparing the result to the target
size.

Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
write size exceeds the size of the L2ARC device.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14936
Closes #14954
2023-06-09 17:05:47 -07:00
Alexander Motin 70ea484e3e
Finally drop long disabled vdev cache.
It was a vdev level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result.  But since it has almost no idea about what is going
on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers,
it was found to do more harm than good, for which reason it was
disabled for the past 12 years.  These days we have much better
instruments to enlarge the I/Os, such as speculative and prescient
prefetches, I/O scheduler, I/O aggregation etc.

Besides just the dead code removal this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), not otherwise
disabled and trying to do some work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14953
2023-06-09 12:40:55 -07:00
Brian Behlendorf 6db4ed51d6
ZTS: Skip checkpoint_discard_busy
Until the ASSERT which is occasionally hit while running
checkpoint_discard_busy is resolved skip this test case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12053
Closes #14952
2023-06-09 11:10:01 -07:00
Alexander Motin 90ccfd426d
Improve l2arc reporting in arc_summary.
- Do not report L2ARC as FAULTED in presence of in-flight writes.
- Report read and write I/Os, bytes and errors.
- Remove a few numbers not important to the average user.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #12304 
Closes #14946
2023-06-09 10:14:05 -07:00
Alexander Motin b3ad3f48d9
Use list_remove_head() where possible.
... instead of list_head() + list_remove().  On FreeBSD the list
functions are not inlined, so in addition to more compact code
this also saves another function call.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14955
2023-06-09 10:12:52 -07:00
Alexander Motin 55b1842f92
ZIL: Fix race introduced by f63811f072.
We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE
state and dropping zl_lock, since it may be freed by zil_sync().
To free itxs and waiters after dropping the lock we need to move
lwb_itxs and lwb_waiters lists elements to local storage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14957
Closes #14959
2023-06-09 10:08:05 -07:00
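The fix described in the commit above follows a general pattern: detach the lists while the lock is still held, then process only the local copies after unlocking. The sketch below shows that pattern in isolation. It is not the ZIL code: `Block`, `finish`, and `notify` are hypothetical names standing in for the lwb, its completion path, and whatever is done for each waiter.

```python
import threading


class Block:
    """Stand-in for an object that may be freed once marked done."""

    def __init__(self):
        self.waiters = []
        self.done = False


def finish(lock, block, notify):
    with lock:
        # Detach the list while the lock still protects `block`.
        local_waiters, block.waiters = block.waiters, []
        block.done = True
    # After the lock is dropped another thread may free `block`,
    # so only the local list is touched from here on.
    for waiter in local_waiters:
        notify(waiter)


lock = threading.Lock()
blk = Block()
blk.waiters = ["w1", "w2"]
seen = []
finish(lock, blk, seen.append)
print(seen)
```

Doing the per-waiter work outside the lock keeps the critical section short, and moving the list first is what makes that safe.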
Rich Ercolani 6c96269024
Revert "systemd: Use non-absolute paths in Exec* lines"
This reverts commit 79b20949b2 since it
doesn't work with the systemd version shipped with RHEL7-based systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14943
Closes #14945
2023-06-07 11:14:05 -07:00
Brian Behlendorf 93f8abeff0
Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)
When a kmem cache is exhausted and needs to be expanded a new
slab is allocated.  KM_SLEEP callers can block and wait for the
allocation, but KM_NOSLEEP callers were incorrectly allowed to
block as well.

Resolve this by attempting an emergency allocation as a best
effort.  This may fail but that's fine since any KM_NOSLEEP
consumer is required to handle an allocation failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2023-06-07 10:43:43 -07:00
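The contract described in the commit above can be sketched with a toy cache. This is an assumption-laden model, not the SPL implementation: the `Cache` class, its `grow` method, the slab size of 8, and the emergency pool are all hypothetical, and only the flag names KM_SLEEP and KM_NOSLEEP come from the commit message. The point is that the KM_NOSLEEP path never reaches the step that may block and instead may return a failure.

```python
KM_SLEEP, KM_NOSLEEP = 0, 1  # flag names from the commit; values arbitrary


class Cache:
    def __init__(self, free_objects, emergency_objects):
        self.free = free_objects
        self.emergency = emergency_objects

    def grow(self):
        """Stand-in for allocating a new slab; this step may block."""
        self.free += 8

    def alloc(self, flags):
        if self.free == 0:
            if flags == KM_NOSLEEP:
                # Best effort only: never grow (and so never block).
                if self.emergency == 0:
                    return None  # caller must handle the failure
                self.emergency -= 1
                return "emergency-object"
            self.grow()  # KM_SLEEP callers may wait for a new slab
        self.free -= 1
        return "object"


cache = Cache(free_objects=0, emergency_objects=1)
print(cache.alloc(KM_NOSLEEP))
print(cache.alloc(KM_NOSLEEP))
print(cache.alloc(KM_SLEEP))
```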
George Amanakis bcd5321039
Fix the L2ARC write size calculating logic
l2arc_write_size() should return the write size after adjusting for trim
and overhead of the L2ARC log blocks. Also take into account the
allocated size of log blocks when deciding when to stop writing buffers
to L2ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14939
2023-06-06 12:32:37 -07:00