Commit Graph

219 Commits

Author SHA1 Message Date
Mateusz Guzik 2c5c8bb0a6 FreeBSD: use zero_region instead of allocating a dedicated page
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #13406
2022-05-20 10:33:24 -07:00
Ka Ho Ng 1f31889046 FreeBSD: Implement hole-punching support
This adds supports for hole-punching facilities in the FreeBSD kernel
starting from __FreeBSD_version 1400032.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ka Ho Ng <khng@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #12458
2022-05-17 11:15:29 -07:00
наб ecec151c14 module: zfs: freebsd: fix unused, remove argsused
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12844
2022-05-02 15:42:58 -07:00
наб a4f582f0b6 FreeBSD: remove unused variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12899
2022-05-02 15:42:58 -07:00
наб b8e1366ee6 zvol: remove unused variable
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12917
2022-05-02 15:42:58 -07:00
Alexander Motin 972637dc06 FreeBSD: Fix translation from ABD to physical pages.
In hypothetical case of non-linear ABD with single segment, multiple
to page size but not aligned to it, vdev_geom_fill_unmap_cb() could
fill one page less into bio_ma array.

I am not sure it is expoitable, but better to be safe than sorry.

Reported-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit 5352f85cdd)
2022-04-21 16:59:09 -07:00
Mark Johnston b7546f92ea FreeBSD: Return Mach error codes from VOP_(GET|PUT)PAGES
FreeBSD's memory management system uses its own error numbers and gets
confused when these VOPs return EIO.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #13311
2022-04-19 10:42:54 -07:00
Mark Johnston e9cd90f6e5 FreeBSD: Parameterize ZFS_ENTER/ZFS_VERIFY_VP with an error code
For legacy reasons, a couple of VOPs have to return error numbers that
don't come from the usual errno namespace.  To handle the cases where
ZFS_ENTER or ZFS_VERIFY_ZP fail, we need to be able to override the
default error return value of EIO.  Extend the macros to permit this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #13311
2022-04-19 10:42:54 -07:00
Ryan Moeller a5a28723bd FreeBSD: Use NDFREE_PNBUF if available
NDF_ONLY_PNBUF has been removed from FreeBSD in favor of NDFREE_PNBUF.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #13277
2022-04-06 10:29:53 -07:00
Brian Behlendorf 847d03060f
Fix ACL checks for NFS kernel server
This PR changes ZFS ACL checks to evaluate
fsuid / fsgid rather than euid / egid to avoid
accidentally granting elevated permissions to
NFS clients.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Andrew Walker <awalker@ixsystems.com>
Co-authored-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #13221
2022-03-20 21:21:18 -07:00
Kyle Evans 421750672b module: freebsd: avoid a taking a destroyed lock in zfs_zevent bits
At shutdown time, we drain all of the zevents and set the
ZEVENT_SHUTDOWN flag.  On FreeBSD, we may end up calling
zfs_zevent_destroy() after the zevent_lock has been destroyed while
the sysevent thread is winding down; we observe ESHUTDOWN, then back
out.

Events have already been drained, so just inline the kmem_free call in
sysevent_worker() to avoid the race, and document the assumption that
zfs_zevent_destroy doesn't do anything else useful at that point.

This fixes a panic that can occur at module unload time.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #13220
2022-03-18 17:11:43 -07:00
Mateusz Guzik 275c756730 FreeBSD: add missing replay check to an assert in zfs_xvattr_set
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #13219
2022-03-18 17:11:43 -07:00
Mark Johnston b3427b18b1 zfs: Fix a deadlock between page busy and the teardown lock
When rolling back a dataset, ZFS has to purge file data resident in the
system page cache.  To do this, it loops over all vnodes for the
mountpoint and calls vn_pages_remove() to purge pages associated with
the vnode's VM object.  Each page is thus exclusively busied while the
dataset's teardown write lock is held.

When handling a page fault on a mapped ZFS file, FreeBSD's page fault
handler busies newly allocated pages and then uses VOP_GETPAGES to fill
them.  The ZFS getpages VOP acquires the teardown read lock with vnode
pages already busied.  This represents a lock order reversal which can
lead to deadlock.

To break the deadlock, observe that zfs_rezget() need only purge those
pages marked valid, and that pages busied by the page fault handler are,
by definition, invalid.  Furthermore, ZFS pages always transition from
invalid to valid with the teardown lock held, and ZFS never creates
partially valid pages.  Thus, zfs_rezget() can use the new
vn_pages_remove_valid() to skip over pages busied by the fault handler.

PR:		258208
Tested by:	pho
Reviewed by:	avg, sef, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32931

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #12828
2022-03-04 15:37:41 -08:00
Alexander Motin 0e2bb1a3ee Really zero the zero page
While switching abd_zero_buf allocation KPI I've missed the fact
that kmem_zalloc() zeroed the allocation, while kmem_cache_alloc()
does not.  Add explicit bzero() after it.

I don't think it should have caused real problems, but leaking one
memory page content all over the pool is not good.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12569
2022-03-04 15:37:33 -08:00
наб 94a4b7ec3d module: zfs: fix unused, remove argsused
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12844
2022-02-16 17:58:56 -08:00
drowfx bc99c809d5 Add dataset_kstats_update.. to mmap read/write paths
This allows reads/writes caused by accesses to mmap files to be
accounted correctly in the per-dataset kstats for both Linux and
FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Matthias Blankertz <matthias@blankertz.org>
Closes #12994 
Closes #13044
2022-02-16 17:58:56 -08:00
George Amanakis e257bd481b Introduce a flag to skip comparing the local mac when raw sending
Raw receiving a snapshot back to the originating dataset is currently
impossible because of user accounting being present in the originating
dataset.

One solution would be resetting user accounting when raw receiving on
the receiving dataset. However, to recalculate it we would have to dirty
all dnodes, which may not be preferable on big datasets.

Instead, we rely on the os_phys flag
OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is
incomplete when raw receiving. Thus, on the next mount of the receiving
dataset the local mac protecting user accounting is zeroed out.
The flag is then cleared when user accounting of the raw received
snapshot is calculated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12981 
Closes #10523
Closes #11221
Closes #11294
Closes #12594
Issue #11300
2022-02-04 16:14:56 -08:00
Ryan Moeller af1630c883 FreeBSD: Fix zvol_cdev_open locking
First open locking changes were correctly applied to zvol_geom_open but
incorrectly applied to zvol_cdev_open, causing spa_namespace_lock to be
held indefinitely.

Make the first open locking in zvol_cdev_open match zvol_geom_open.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #13016
2022-02-03 15:28:01 -08:00
Ryan Moeller 1828b68a0b FreeBSD: Fix zvol_*_open() locking
These are the changes for FreeBSD corresponding to the changes made for
Linux in #12863, see that PR for details.

Changes from #12863 are applied for zvol_geom_open and zvol_cdev_open
on FreeBSD.  This also adds a check for the zvol dying which we had
in zvol_geom_open but was missing in zvol_cdev_open.  The check causes
the open to fail early with ENXIO when we are in the middle of changing
volmode.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12934
2022-02-03 15:28:01 -08:00
наб 36a91d6cef FreeBSD: vfsops: use setgen for error case
Fix from https://github.com/openzfs/zfs/pull/12844#discussion_r774179413

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12905
2022-02-03 15:28:01 -08:00
наб 5d8c081193 FreeBSD: fix unpropagated error
When performing I/O on FreeBSD using a file based vdev ensure all
errors encountered when reading/writing are propagated through the
zio pipeline.  

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12904
2022-02-03 15:28:01 -08:00
Alan Somers 4b2bac5fe9 FreeBSD: Update argument types for VOP_READDIR
A recent commit to FreeBSD changed the type of
vop_readdir_args.a_cookies to a uint64_t**.  There is no functional
impact to ZFS because ZFS only uses 32-bit cookies, which will be
zero-extended to 64-bits by the existing code.

b214fcceac

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #12874
2022-02-03 15:28:01 -08:00
Ryan Moeller 913ae45218 FreeBSD: Provide correct file generation number
va_seq was actually a thin veil over va_gen, so z_gen is a more
appropriate value than z_seq to populate the field with.

Drop the unnecessary compat obfuscation and provide the correct
file generation number.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@freebsd.org>
Closes #12851
2022-02-03 15:28:01 -08:00
Ryan Moeller def73c0735 FreeBSD: Add vop_standard_writecount_nomsync
https://cgit.freebsd.org/src/commit?id=3ffcfa599e29686cf2b3c1a6087408c37acaed78

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #12828
2021-12-13 13:23:07 -08:00
Ryan Moeller effe984148 FreeBSD: Catch up with more VFS changes
Unused thread argument was removed from NDINIT*

https://cgit.freebsd.org/src/commit?id=7e1d3eefd410ca0fbae5a217422821244c3eeee4

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #12828
2021-12-13 13:23:01 -08:00
Mark Johnston 19337332cc Fix several bugs in the FreeBSD rename VOP implementation
- To avoid a use-after-free, zfsvfs->z_log needs to be loaded after the
  teardown lock is acquired with ZFS_ENTER().
- Avoid leaking vnode locks in zfs_rename_relock() and zfs_rename_()
  when the ZFS_ENTER() macros forces an early return.

Refactor the rename implementation so that ZFS_ENTER() can be used
safely.  As a bonus, this lets us use the ZFS_VERIFY_ZP() macro instead
of open-coding its implementation.

Reported-by: Peter Holm <pho@FreeBSD.org>
Tested-by: Peter Holm <pho@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #12717
2021-12-13 13:22:54 -08:00
Pawel Jakub Dawidek b96737b83e Remove (now unused) td argument from zfs_lookup()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #12748
2021-12-13 13:22:47 -08:00
Mark Johnston 4b7bfcf8a0 Exit the teardown section later in rename on FreeBSD
We have to hold the teardown lock while dereferencing zfsvfs->z_os and,
I believe, when committing to the ZIL.

Note that jumping to the "out" label, "error" is always non-zero.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12704
2021-12-13 13:22:41 -08:00
Mark Johnston 07165ce540 Fix potential use-after-frees in FreeBSD getpages and setattr VOPs
The objset object is reallocated during certain dataset operations, such
as rollbacks, so the objset pointer must be loaded after acquiring the
teardown lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12704
2021-12-13 13:22:34 -08:00
Damian Szuberski 64e88992b6 Update `checkstyle` workflow env to ubuntu-20.04
- `checkstyle` workflow uses ubuntu-20.04 environment
- improved `mancheck.sh` readability

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12713
2021-12-08 13:27:56 -08:00
Ryan Moeller 27d9c6ae2b FreeBSD: Catch up with recent VFS changes
cn_thread is always curthread.

https://cgit.freebsd.org/src/commit?id=b4a58fbf640409a1e507d9f7b411c83a3f83a2f3
https://cgit.freebsd.org/src/commit?id=2b68eb8e1dbbdaf6a0df1c83b26f5403ca52d4c3

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #12668
2021-11-02 13:48:54 -07:00
Brian Behlendorf 143476ce8d Use fallthrough macro
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to anotate all intentional fallthrough paths.  Once
all of the kernel code paths have been updated to use fallthrough
the -Wimplicit-fallthrough option will because the default.  To
avoid warnings in the OpenZFS code base when this happens apply
the fallthrough macro.

Additional reading: https://lwn.net/Articles/794944/

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12441
2021-11-02 09:50:30 -07:00
Ryan Moeller 81611683c8 FreeBSD: Don't remove SA xattr if not SA znode
We attempt to remove an existing SA xattr when setting a dir xattr, but
this only makes sense if the znode has been upgraded to the SA format.
Otherwise, we will hit an assert in zfs_sa_get_xattr.

Make sure this is an SA znode before attempting to remove the SA xattr.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12514
2021-09-14 15:11:56 -07:00
Alexander Motin a4862125b8 Remove b_pabd/b_rabd allocation from arc_hdr_alloc()
When a header is allocated for full overwrite it is a waste of time
to allocate b_pabd/b_rabd for it, since arc_write() will free them
without ever being touched.  If it is a read or a partial overwrite
then arc_read() and arc_hdr_decrypt() allocate them explicitly.

Reduced memory allocation in user threads also reduces ARC eviction
throttling there, proportionally increasing it in ZIO threads, that
is not good.  To minimize or even avoid it introduce ARC allocation
reserve, allowing certain arc_get_data_abd() callers to allocate a
bit longer in situations where user threads will already throttle.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12398
2021-09-14 14:31:50 -07:00
Allan Jude 24e51e3749 Restore FreeBSD sysctl processing for arc.min and arc.max
Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
to a disallowed value would return an error.
Since the switch, it instead only generates WARN_IF_TUNING_IGNORED

Keep the ability to set the sysctl's specifically to 0, even though
that is less than the minimum, because some tests depend on this.

Also lost, was the ability to set vfs.zfs.arc_max to a value less
than the default vfs.zfs.arc_min at boot time. Restore this as well.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #12161
2021-09-14 14:31:01 -07:00
Ryan Moeller 744f3009fc zfs: add missed dependency of zfs module on zlib
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Co-authored-by: Konstantin Belousov <kib@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
External-issue: https://reviews.freebsd.org/D31207
Closes #12442
2021-09-14 14:30:39 -07:00
Mark Johnston 451d6da988 Allow disabling of unmapped I/O on FreeBSD
We have a tunable which permits one to disable the use of unmapped I/O
for the buffer cache.  Respect it in ZFS as well.  This is useful for
KMSAN, which cannot easily maintain shadow state for unmapped pages.

No functional change intended, as unmapped I/O is permitted by default
and there's no real reason to disable it in practice except for
debugging.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12446
2021-09-14 14:29:46 -07:00
Alexander Motin 93e11e257b FreeBSD: Ignore make_dev_s() errors
Since errors returned by zvol_create_minor_impl() are ignored by the
common code, it is more convenient to ignore make_dev_s() errors there.
It allows, for example, to get device created for the zvol after later
rename instead of having it further stuck in half-created state.
zvol_rename_minor() already ignores those errors.

While there, switch from MAXPHYS to maxphys in FreeBSD 13+.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12375
2021-09-14 12:40:45 -07:00
Alexander Motin c2c4d05700 FreeBSD: Switch from MAXPHYS to maxphys on FreeBSD 13+
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12378
2021-09-14 12:39:17 -07:00
Alexander Motin 45305a067f Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12279
2021-09-14 12:38:05 -07:00
George Wilson 8415c3c170 file reference counts can get corrupted
Callers of zfs_file_get and zfs_file_put can corrupt the reference
counts for the file structure resulting in a panic or a soft lockup.
When zfs send/recv runs, it will add a reference count to the
open file, and begin to send or recv the stream. If the file descriptor
is closed, then when dmu_recv_stream() or dmu_send() return we will
call zfs_file_put to remove the reference we placed on the file
structure. Unfortunately, because zfs_file_put() uses the file
descriptor to lookup the file structure, it may end up finding that
the file descriptor table no longer contains the file struct, thus
leaking the file structure. Or it might end up finding a file
descriptor for a different file and blindly updating its reference
counts. Other failure modes probably exists.

This change reworks the zfs_file_[get|put] interface to not rely
on the file descriptor but instead pass the zfs_file_t pointer around.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-76119
Closes #12299
2021-09-14 12:37:38 -07:00
Alexander Motin c84670950a FreeBSD: Use unmapped I/O for scattered/gang ABD buffers
Many FreeBSD disk drivers support "unmapped" I/O mode, when data
buffer represented not with a virtually contiguous KVA-mapped address
range, but with a list of physical memory pages.  Originally it was
designed to do I/O from buffers without KVA mapping (unmapped).  But
moving virtual addresses out of equation allows us to operate even
non-contiguous data buffers with one condition: all buffer discon-
tinuities must be aligned to memory page borders.

Doing I/O to capable GEOM device this patch traverses through non-
linear ABD buffers, validating the chunks borders.  If the condition
is met, it supplies GEOM with the list of original physical memory
pages instead of copying the data into temporary contiguous buffer.
On capable hardware on pools with ashift=12 and default ABD chunk of
4KB it should handle all the I/O without additional memory copying.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12320
2021-09-14 12:37:02 -07:00
Alexander Motin 49bb454120 FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE
It makes no sense to set it below PAGE_SIZE, since it increases all
overheads and makes returning memory to OS problematic.  It makes no
sense to set it above PAGE_SIZE, since such allocations and especially
frees are too expensive and cause KVA fragmentation to benefit from
fewer chunks.  After that it makes no sense to keep more complicated
math here.

What may have sense though is just a tunable border between linear and
scatter ABDs, previously also controlled by this tunable.  Retain that
functionality by taking abd_scatter_min_size tunable from Linux, just
with different default value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12328
2021-09-14 12:36:44 -07:00
Jorgen Lundman 035219ee10 Fix abd leak, kmem_free correct size of abd_t
Fix a leak of abd_t that manifested mostly when using
raidzN with at least as many columns as N (e.g. a
four-disk raidz2 but not a three-disk raidz2).
Sufficiently heavy raidz use would eventually run a system
out of memory.

Additionally:

* Switch abd_cache arena to FIRSTFIT, which empirically
improves perofrmance.

* Make abd_chunk_cache more performant and debuggable.

* Allocate the abd_zero_buf from abd_chunk_cache rather
than the heap.

* Don't try to reap non-existent qcaches in abd_cache arena.

* KM_PUSHPAGE->KM_SLEEP when allocating chunks from their
own arena

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: Sean Doran <smd@use.net>
Closes #12295
2021-09-14 12:22:28 -07:00
Ryan Moeller 6fe6192796 FreeBSD: Implement xattr=sa
FreeBSD historically has not cared about the xattr property; it was
always treated as xattr=on.  With xattr=on, xattrs are stored as files
in a hidden xattr directory.  With xattr=sa, xattrs are stored as
system attributes and get cached in nvlists during xattr operations.
This makes SA xattrs simpler and more efficient to manipulate.  FreeBSD
needs to implement the SA xattr operations for feature parity with
Linux and to ensure that SA xattrs are accessible when migrated or
replicated from Linux.

Following the example set by Linux, refactor our existing extattr vnops
to split off the parts handling dir style xattrs, and add the
corresponding SA handling parts.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11997
2021-09-14 12:09:35 -07:00
Ryan Moeller 1826068523 FreeBSD: Clean up ASSERT/VERIFY use in module
Convert use of ASSERT() to ASSERT0(), ASSERT3U(), ASSERT3S(),
ASSERT3P(), and likewise for VERIFY().  In some cases it ended up
making more sense to change the code, such as VERIFY on nvlist
operations that I have converted to use fnvlist instead.  In one
place I changed an internal struct member from int to boolean_t to
match its use.  Some asserts that combined multiple checks with &&
in a single assert have been split to separate asserts, to make it
apparent which check fails.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11971
2021-09-14 12:02:23 -07:00
Rich Ercolani 5e89181544 Annotated dprintf as printf-like
ZFS loves using %llu for uint64_t, but that requires a cast to not 
be noisy - which is even done in many, though not all, places.
Also a couple places used %u for uint64_t, which were promoted
to %llu. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12233
2021-06-24 13:12:36 -07:00
Alexander Motin 6b239d1757 Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172)
wmsum was designed exactly for cases like these with many updates
and rare reads.  It allows to completely avoid atomic operations on
congested global variables.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12172
2021-06-24 13:10:59 -07:00
Alexander Motin efdfb14fc8 Remove pool io kstats
This mostly reverts "3537 want pool io kstats" commit of 8 years ago.

From one side this code using pool-wide locks became pretty bad for
performance, creating significant lock contention in I/O pipeline.
From another, there are more efficient ways now to obtain detailed
statistics, while this statistics is illumos-specific and much less
usable on Linux and FreeBSD, reported only via procfs/sysctls.

This commit does not remove KSTAT_TYPE_IO implementation, that may
be removed later together with already unused KSTAT_TYPE_INTR and
KSTAT_TYPE_TIMER.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12212
2021-06-10 10:50:16 -07:00
Alan Somers 2c3d7283b4 libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat
`getfsstat(2)` is used to retrieve the list of mounted file systems,
which libzfs uses when fetching properties like mountpoint, atime,
setuid, etc.  The `mode` parameter may be `MNT_NOWAIT`, which uses
information in the VFS's cache, or `MNT_WAIT`, which effectively does a
`statfs` on every single mounted file system in order to fetch the most
up-to-date information.  As far as I can tell, the only fields that
libzfs cares about are the filesystem's name, mountpoint, fstypename,
and mount flags.  Those things are always updated on mount and unmount,
so they will always be accurate in the VFS's mount cache except in two
circumstances:

1) When a file system is busy unmounting
2) When a ZFS file system changes the value of a mount-overridable
   property like atime or setuid, but doesn't remount the file system.
   Right now that only happens when the property is changed by an
   unprivileged user who has delegated authority to change the property
   but not to mount the dataset.  But perhaps libzfs could choose to do
   it for other reasons in the future.

Switching to `MNT_NOWAIT` will greatly improve speed with no downside,
as long as we explicitly update the mount cache whenever we change a
mount-overridable property.

For comparison, Illumos gets this information using the native
`getmntany` and `getmntent` functions, which also use cached
information.  The illumos function that would refresh the cache,
`resetmnttab`, is never called by libzfs.

And on GNU/Linux, `getmntany` and `getmntent` don't even communicate
with the kernel directly.  They simply parse the file they are given,
which is usually /etc/mtab or /proc/mounts.  Perhaps the implementation
of /proc/mounts is synchronous, ala MNT_WAIT; I don't know.

Sponsored-by:	Axcient
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes: #12091
2021-06-09 13:05:34 -07:00