Commit Graph

3518 Commits

Author SHA1 Message Date
Brian Behlendorf c23803be84
Restore dirty dnode detection logic
In addition to flushing memory mapped regions when checking holes,
commit de198f2d95 modified the dirty dnode detection logic to check
the dn->dn_dirty_records instead of the dn->dn_dirty_link.  Relying
on the dirty record has not be reliable, switch back to the previous
method.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11900 
Closes #12745
2021-11-10 16:14:32 -08:00
Fedor Uporov e39fe05b69
Skip spacemaps reading in case of pool readonly import
The only zdb utility require to read metaslab-related data during
read-only pool import because of spacemaps validation. Add global
variable which will allow zdb read spacemaps in case of readonly
import mode.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #9095
Closes #12687
2021-11-09 12:50:39 -08:00
Brian Atkinson 345196be18
Single IO issue for raidz writes with skip sector
In order to reduce contention on the vq_lock, optional skip sectors
for Raidz writes can be placed into a single IO request. This is done by
padding out the linear ABD for a parity column to contain the skip
sector and by creating gang ABD to contain the data and skip sector for
data columns.

The vdev_raidz_map_alloc() function now contains specific functions for
both reads and write to allocate the ABD's that will be issued down to
the VDEV chldren.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-By: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #12333
2021-11-09 12:51:33 -07:00
Brian Behlendorf 453c63e9b7 Linux 5.16 compat: submit_bio()
The submit_bio() prototype has changed again.  The version is 5.16
still only expects a single argument but the return type has changed
to void.  Since we never used the returned value before update the
configure check to detect both single arg versions.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12725
2021-11-09 11:27:46 -08:00
Brian Behlendorf de198f2d95
Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
When using lseek(2) to report data/holes memory mapped regions of
the file were ignored.  This could result in incorrect results.
To handle this zfs_holey_common() was updated to asynchronously
writeback any dirty mmap(2) regions prior to reporting holes.

Additionally, while not strictly required, the dn_struct_rwlock is
now held over the dirty check to prevent the dnode structure from
changing.  This ensures that a clean dnode can't be dirtied before
the data/hole is located.  The range lock is now also taken to
ensure the call cannot race with zfs_write().

Furthermore, the code was refactored to provide a dnode_is_dirty()
helper function which checks the dnode for any dirty records to
determine its dirtiness.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11900
Closes #12724
2021-11-07 14:27:44 -07:00
Rich Ercolani 05679465ac
Revert behavior of 59eab109 on not-Linux
It turns out that short-circuiting the EFAULT behavior on a short read
breaks things on FreeBSD. So until there's a nicer solution, let's
just revert the behavior for not-Linux.

Reference:
https://reviews.freebsd.org/R10:70f51f0e474ffe1fb74cb427423a2fba3637544d

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12698
2021-11-04 07:49:40 -06:00
Rich Ercolani e79b6807b8
Workaround issue cleaning up automounted snapshots on Linux
On Linux, sometimes, when ZFS goes to unmount an automounted snap,
it fails a VERIFY check on debug builds, because taskq_cancel_id
returned ENOENT after not finding the taskq it was trying to cancel.

This presumably happens when it already died for some reason; in this
case, we don't really mind it already being dead, since we're just
going to dispatch a new task to unmount it right after.

So we just ignore it if we get back ENOENT trying to cancel here,
retry a couple times if we get back the only other possible condition
(EBUSY), and log to dbgmsg if we got anything but ENOENT or success.

(We also add some locking around taskqid, to avoid one or two cases
of two instances of trying to cancel something at once.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #11632
Closes #12670
2021-11-03 09:00:08 -06:00
Damian Szuberski 6d680e61ef
Update `checkstyle` workflow env to ubuntu-20.04
- `checkstyle` workflow uses ubuntu-20.04 environment
- improved `mancheck.sh` readability

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12713
2021-11-02 14:02:57 -06:00
Paul Dagnelie 58bf6afd3f
Fix cpu hotplug atomic sleep issue
We move the spinlock unlock before the thread creation. This should be
safe because the thread creation code doesn't actually manipulate any
taskq data structures; that's done by the thread once it's created.

We also remove the assertion that the maxthreads is the current threads
plus one; that assertion could fail if multiple hotplug events come in
quick succession, and the first new taskq thread hasn't had a chance to
start processing yet.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
eviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #12714
2021-11-02 10:23:48 -06:00
Mark Johnston 4d1a3ba6ed Exit the teardown section later in rename on FreeBSD
We have to hold the teardown lock while dereferencing zfsvfs->z_os and,
I believe, when committing to the ZIL.

Note that jumping to the "out" label, "error" is always non-zero.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12704
2021-10-29 16:40:25 -07:00
Mark Johnston 68a7a9edc5 Fix potential use-after-frees in FreeBSD getpages and setattr VOPs
The objset object is reallocated during certain dataset operations, such
as rollbacks, so the objset pointer must be loaded after acquiring the
teardown lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12704
2021-10-29 16:39:47 -07:00
Fedor Uporov d5a5ec4693
Remove unused function zvol_set_volblocksize()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #12688
2021-10-26 17:07:53 -07:00
Rich Ercolani 6f57f1e382
Make dsl_scan print the pool name in dbgmsg
If you've got multiple scrubs/resilvers going, it's rather helpful
to know which pool each scan line refers to.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes: #12674
2021-10-26 17:24:14 -06:00
Allan Jude 65ad5d1165
spa.c: Replace VERIFY(nvlist_*(...) == 0) with fnvlist_* (#12678)
The fnvlist versions of the functions are fatal if they fail,
saving each call from having to include checking the result.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Allan Jude <allan@klarasystems.com>
2021-10-26 17:15:38 -06:00
Ryan Moeller 14b69c0929
FreeBSD: Catch up with recent VFS changes
cn_thread is always curthread.

https://cgit.freebsd.org/src/commit?id=b4a58fbf640409a1e507d9f7b411c83a3f83a2f3
https://cgit.freebsd.org/src/commit?id=2b68eb8e1dbbdaf6a0df1c83b26f5403ca52d4c3

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #12668
2021-10-25 09:46:28 -07:00
Pawel Jakub Dawidek a95c82bed8
Remove code duplication
Remove code duplication by moving code responsible for partial block
zeroing to a separate function: dnode_partial_zero().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #12627
2021-10-18 16:50:33 -07:00
Pawel Jakub Dawidek afbc617921
Remove FreeBSD's local copy of the dmu_buf_hold_array() function
Make the main dmu_buf_hold_array() function non-static.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #12628
2021-10-13 11:01:01 -07:00
Brian Behlendorf 280d0f0ce4
Export minimal zfs_refcount interfaces
Lustre makes light use of the zfs_refcount interfaces which
isn't a problem when using a non-debug build of OpenZFS. However,
when debugging is enabled the required symbols are not exported.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12613
2021-10-11 10:54:39 -07:00
Rich Ercolani 9d1407e8f2
Correct refcount_add in dmu_zfetch
refcount_add_many(foo,N) is not the same as
for (i=0; i < N; i++) { refcount_add(foo); }

Unfortunately, this is only actually true with debug kernels and
reference_tracking_enable=1.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12589 
Closes #12602
2021-10-08 11:10:34 -07:00
Tony Hutter 2a8430a260
Rescan enclosure sysfs path on import
When you create a pool, zfs writes vd->vdev_enc_sysfs_path with the
enclosure sysfs path to the fault LEDs, like:

    vdev_enc_sysfs_path = /sys/class/enclosure/0:0:1:0/SLOT8

However, this enclosure path doesn't get updated on successive imports
even if enclosure path to the disk changes.  This patch fixes the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #11950 
Closes #12095
2021-10-04 12:32:16 -07:00
Attila Fülöp 60b618a967
ZFS: Remove a redundant if condition (#12598)
Commit 0c03d21ac9 left in a redundant if condition while
removing some code. Just remove it.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #12598
2021-10-02 12:50:57 -06:00
José Luis Salvador Rufo ed3a3bdb0d
Proper support for DESTDIR and INSTALL_MOD_PATH
The environment variables DESTDIR and INSTALL_MOD_PATH must
be mutually exclusive.

https://www.gnu.org/prep/standards/html_node/DESTDIR.html
https://www.kernel.org/doc/Documentation/kbuild/modules.txt

This issue was discussed in this Buildroot thread:
https://lists.buildroot.org/pipermail/buildroot/2021-August/621350.html

I saw this behavior in other different projects, as:

- Yocto Project:
  https://www.yoctoproject.org/pipermail/meta-freescale/2013-August/004307.html

- Google IA Coral:
  https://coral.googlesource.com/linux-imx-debian/+/refs/heads/master/debian/rules

For the above reasons, INSTALL_MOD_PATH will be set as DESTDIR
by default.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: José Luis Salvador Rufo <salvador.joseluis@gmail.com>
Signed-off-by: Romain Naour <romain.naour@gmail.com>
Closes #12577
2021-10-01 10:44:34 -07:00
Rich Ercolani 59eab1093a
Handle partial reads in zfs_read
Currently, dmu_read_uio_dnode can read 64K of a requested 1M in one
loop, get EFAULT back from zfs_uiomove() (because the iovec only holds
64k), and return EFAULT, which turns into EAGAIN on the way out. EAGAIN
gets interpreted as "I didn't read anything", the caller tries again
without consuming the 64k we already read, and we're stuck.

This apparently works on newer kernels because the caller which breaks
on older Linux kernels by happily passing along a 1M read request and a
64k iovec just requests 64k at a time.

With this, we now won't return EFAULT if we got a partial read.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12370 
Closes #12509
Closes #12516
2021-09-20 10:30:50 -07:00
Alexander Motin 139690d6c3
Really zero the zero page
While switching abd_zero_buf allocation KPI I've missed the fact
that kmem_zalloc() zeroed the allocation, while kmem_cache_alloc()
does not.  Add explicit bzero() after it.

I don't think it should have caused real problems, but leaking one
memory page content all over the pool is not good.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12569
2021-09-17 10:17:18 -07:00
George Amanakis 2a49ebbb4d
Avoid panic in case of pool errors and missing L2ARC
In case an ARC buffer is allocated only on L2ARC, and there are
underlying errors in a pool with the cache device in faulty state, a
panic can occur in arc_read_done()->arc_hdr_destroy()->
arc_hdr_l2arc_destroy()->arc_hdr_clear_flags() when trying to free
the ARC buffer.

Fix this by discarding the buffer's identity in arc_hdr_destroy(), in
case the buffer is not empty, before calling arc_hdr_l2hdr_destroy().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12392
2021-09-16 09:40:15 -07:00
Allan Jude 4a1195ca50
Temporarily use root credentials to mount snapshots in .zfs
When mounting a snapshot in the .zfs/snapshots control directory,
temporarily assume roots credentials to perform the VFS_MOUNT().

This allows regular users and users inside jails to access these
snapshots.

The regular usermount code is not helpful here, since it requires
that the user performing the mount own the mountpoint, which won't
be the case for .zfs/snapshot/<snapname>

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-By: Modirum MDPay
Sponsored-By: Klara Inc.
Closes #11312
2021-09-14 17:10:00 -06:00
Brian Behlendorf 6954c22f35
Use fallthrough macro
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to anotate all intentional fallthrough paths.  Once
all of the kernel code paths have been updated to use fallthrough
the -Wimplicit-fallthrough option will because the default.  To
avoid warnings in the OpenZFS code base when this happens apply
the fallthrough macro.

Additional reading: https://lwn.net/Articles/794944/

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12441
2021-09-14 10:17:54 -06:00
Jorgen Lundman 7443299fe0
Iterate encrypted clones at zvol_create_minor
Userland figures out which encryption-root keys are required to load,
and issues ZFS_IOC_LOAD_KEY.
The tail section of spa_keystore_load_wkey() will call
zvol_create_minors() on the encryption-root object.

Any clones of the encrypted zvol will not be plumbed. This commits
adds additional logic to detect if zvol has clones, and is encrypted,
then adds these to the list of zvols to call zvol_create_minors() on.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12471
2021-09-13 13:27:07 -07:00
Arun KV f82f0279ed
Fixed data integrity issue when underlying disk returns error
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
Closes #12391
Closes #12443
2021-09-13 13:02:39 -07:00
Brian Behlendorf b9ec4a15e5
Verify embedded blkptr's in arc_read()
The block pointer verification check in arc_read() should also
cover embedded block pointers.  While highly unlikely, accessing
a damaged block pointer can result in panic.  To further harden
the code extend the existing check to include embedded block
pointers and add a comment explaining the rational for this
sanity check.  Lastly, correct a flaw in zfs_blkptr_verify()
so the error count is checked even when checking a untrusted
config to verify the non-pool-specific portions of a block
pointer.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12535
2021-09-09 19:02:07 -06:00
Jorgen Lundman 5a54a4e051
Upstream: Add snapshot and zvol events
For kernel to send snapshot mount/unmount events to zed.

For kernel to send symlink creates/removes on zvol plumbing.
(/dev/run/dsk/zvol/$pool/$zvol -> /dev/diskX)

If zed misses the ENODEV, all errors after are EINVAL. Treat any error
as kernel module failure.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12416
2021-09-09 10:44:21 -07:00
Brian Behlendorf 2079111f42
Linux 5.15 compat: get_acl()
Kernel commits

332f606b32b6 ovl: enable RCU'd ->get_acl()
0cad6246621b vfs: add rcu argument to ->get_acl() callback

Added compatibility code to detect the new ->get_acl() interface
and correctly handle the case where the new rcu argument is set.

Reviewed-by: Coleman Kane <ckane@colemankane.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12548
2021-09-09 09:38:35 -07:00
Allan Jude a68e4b5919
Allow sending corrupt snapshots even if metadata is corrupted
When zfs_send_corrupt_data is set, use the TRAVERSE_HARD flag,
so traverse_visitbp() will not fail with ECKSUM if a blockpointer
cannot be read, but rather will continue and send the objects it can.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-By: Klara Inc.
Sponsored-By: WHC Online Solutions Inc.
Closes #12541
2021-09-09 08:17:31 -06:00
Rich Ercolani 7676ffc51f
arc: Drop an incorrect assert
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #9897
Closes #12020
Closes #12246
2021-09-08 14:00:03 -07:00
Paul Dagnelie c634320e51
Compressed receive with different ashift can result in incorrect PSIZE on disk
We round up the psize to the nearest multiple of the asize or to the
lsize, whichever is smaller. Once that's done, we allocate a new
buffer of the appropriate size, zero the tail, and copy the data
into it. This adds a small performance cost to these kinds of writes,
but fixes the bookkeeping problems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #12522
Closes #8462
2021-09-08 13:52:28 -07:00
Ryan Moeller 3b89d9518d
FreeBSD: Don't remove SA xattr if not SA znode
We attempt to remove an existing SA xattr when setting a dir xattr, but
this only makes sense if the znode has been upgraded to the SA format.
Otherwise, we will hit an assert in zfs_sa_get_xattr.

Make sure this is an SA znode before attempting to remove the SA xattr.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12514
2021-08-30 16:01:09 -07:00
Rich Ercolani b1a1c64313
Fix cross-endian interoperability of zstd
It turns out that layouts of union bitfields are a pain, and the
current code results in an inconsistent layout between BE and LE
systems, leading to zstd-active datasets on one erroring out on
the other.

Switch everyone over to the LE layout, and add compatibility code
to read both.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12008
Closes #12022
2021-08-30 14:13:46 -07:00
Ka Ho Ng f3bbeb970e FreeBSD: Implement hole-punching support
This adds supports for hole-punching facilities in the FreeBSD kernel
starting from __FreeBSD_version 1400032.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ka Ho Ng <khng@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes #12458
2021-08-30 13:33:30 -07:00
Trevor Bautista 00888c0898
Extend zpool-iostat to account for ZIO_PRIORITY_REBUILD (#12319)
Previously, zpool-iostat did not display any data regarding rebuild I/Os
in either the latency/size histograms (-w/-l/-r) or the queue data (-q).
This fix essentially utilizes the existing infrastructure for tracking
rebuild queue data and displays this data in the proper places within
zpool-iostat's output.

Signed-off-by: Trevor Bautista <tbautista@newmexicoconsortium.org>
Signed-off-by: Trevor Bautista <tbautista@lanl.gov>
Co-authored-by: Trevor Bautista <tbautista@newmexicoconsortium.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2021-08-26 11:26:49 -07:00
Mark Johnston 3ee9a997a3
Initialize parity blocks before RAID-Z reconstruction benchmarking
benchmark_raidz() allocates a row to benchmark parity calculation and
reconstruction.  In the latter case, the parity blocks are left
uninitialized, leading to reports from KMSAN.

Initialize parity blocks to 0xAA as we do for the data earlier in the
function.  This does not affect the selected RAID-Z implementation on
any of several systems tested.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12473
2021-08-23 11:10:17 -07:00
Richard Yao abbf0bd4eb Linux 4.11 compat: statx support
Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507
2021-08-17 10:25:58 -07:00
Alexander Motin 6b88b4b501
Remove b_pabd/b_rabd allocation from arc_hdr_alloc()
When a header is allocated for full overwrite it is a waste of time
to allocate b_pabd/b_rabd for it, since arc_write() will free them
without ever being touched.  If it is a read or a partial overwrite
then arc_read() and arc_hdr_decrypt() allocate them explicitly.

Reduced memory allocation in user threads also reduces ARC eviction
throttling there, proportionally increasing it in ZIO threads, that
is not good.  To minimize or even avoid it introduce ARC allocation
reserve, allowing certain arc_get_data_abd() callers to allocate a
bit longer in situations where user threads will already throttle.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12398
2021-08-17 10:15:54 -06:00
Alexander Motin bb7ad5d326
Optimize arc_l2c_only lists assertions
It is very expensive and not informative to call multilist_is_empty()
for each arc_change_state() on debug builds to check for impossible.
Instead implement special index function for arc_l2c_only->arcs_list,
multilists, panicking on any attempt to use it.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12421
2021-08-17 09:55:34 -06:00
Alexander Motin cfe8e960f1
Fix/improve dbuf hits accounting
Instead of clearing stats inside arc_buf_alloc_impl() do it inside
arc_hdr_alloc() and arc_release().  It fixes statistics being wiped
every time a new dbuf is filled from the ARC.

Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits.
Since the hits are accounted under hash lock, replace atomics with
simple increments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12422
2021-08-17 09:50:31 -06:00
Alexander Motin 7f9d9e6f39
Avoid vq_lock drop in vdev_queue_aggregate()
vq_lock is already too congested for two more operations per I/O.
Instead of dropping and reacquiring it inside vdev_queue_aggregate()
delegate the zio_vdev_io_bypass() and zio_execute() calls for parent
I/Os to callers, that drop the lock any way to execute the new I/O.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12297
2021-08-17 09:47:00 -06:00
Alexander Motin e829a865bf
Use more atomics in refcounts
Use atomic_load_64() for zfs_refcount_count() to prevent torn reads
on 32-bit platforms.  On 64-bit ones it should not change anything.

When built with ZFS_DEBUG but running without tracking enabled use
atomics instead of mutexes same as for builds without ZFS_DEBUG.
Since rc_tracked can't change live we can check it without lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12420
2021-08-17 09:44:34 -06:00
Allan Jude e945e8d7f4
Restore FreeBSD sysctl processing for arc.min and arc.max
Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
to a disallowed value would return an error.
Since the switch, it instead only generates WARN_IF_TUNING_IGNORED

Keep the ability to set the sysctl's specifically to 0, even though
that is less than the minimum, because some tests depend on this.

Also lost, was the ability to set vfs.zfs.arc_max to a value less
than the default vfs.zfs.arc_min at boot time. Restore this as well.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #12161
2021-08-16 09:35:19 -06:00
Ryan Moeller 41bee4039c
zfs: add missed dependency of zfs module on zlib
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Co-authored-by: Konstantin Belousov <kib@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
External-issue: https://reviews.freebsd.org/D31207
Closes #12442
2021-08-13 13:42:45 -07:00
Tony Nguyen 6bc61d22c4
Run arc_evict thread at higher priority
Run arc_evict thread at higher priority, nice=0, to give it more CPU
time which can improve performance for workload with high ARC evict
activities.

On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #12397
2021-08-10 11:36:26 -06:00
Brian Behlendorf ae1e40b329
Enable /proc/diskstats for zvols
The /proc/diskstats accounting needs to be explicitly enabled
for block devices which do not use multi-queue.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12440
Closes #12066
2021-08-05 15:35:34 -06:00