This commit adds '-u' flag for zfs set operation. With this flag,
mountpoint, sharenfs and sharesmb properties can be updated
without actually mounting or sharing the dataset.
Previously, if dataset was unmounted, and mountpoint property was
updated, dataset was not mounted after the update. This behavior
is changed in #15240. We mount the dataset whenever mountpoint
property is updated, regardless if it's mounted or not.
To provide the user with option to keep the dataset unmounted and
still update the mountpoint without mounting the dataset, '-u'
flag can be used.
If any of mountpoint, sharenfs or sharesmb properties are updated
with '-u' flag, the property is set to desired value but the
operation to (re/un)mount and/or (re/un)share the dataset is not
performed and dataset remains as it was before.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15322
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15315
There are some inconsistencies in the handling of mountpoint
property. This commit updates the behavior and makes it
consistent.
If mountpoint property is set when dataset is unmounted, this
would update the mountpoint property. The mountpoint could be
valid or invalid in this case. Setting the mountpoint property
would result in success in this case. Dataset would still be
unmounted here.
On the other hand, if dataset is mounted and mountpoint
property is updated to something invalid where mount cannot be
successful, for example, setting the mountpoint inside a readonly
directory. This would unmount the dataset, set the mountpoint
property to requested value and tries to mount the dataset. The
mount operation returns error and this error is treated as
overall failure of setting the property while the property is
actually set.
To make the behavior consistent in case dataset is mounted or
unmounted, we should try to mount the dataset whenever mountpoint
property is updated. This would result in mounting the datasets
if canmount property is set to on, regardless if the dataset was
previously unmounted.
The failure in mount operation while setting the mountpoint
property should not be treated as failure, since the property is
actually set now to user requested value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15240
'program help subcommand' is a reasonably common pattern for
multifunction command-line programs. This commit adds support for that
style to the zpool and zfs commands.
When run as 'zpool help [<topic>]' or 'zfs help [<topic>]', executes the
'man' program on the PATH with the most likely manpage name for the
requested topic: "zpool-<topic>" or "zfs-<topic>" for subcommands, or
"zpool<topic>" or "zfs<topic>" for the "concepts" and "props" topics.
If no topic is supplied, uses the top "zpool" or "zfs" pages.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15288
Commit 2d7843401a had previously
updated this hardcoded limit to allow for CI testing. As there
is no deterministic pass/fail value, the need has arisen for
one more small increase.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15252
Currently redaction bookmarks and their associated redaction lists
have a relatively low limit of 36 redaction snapshots. This is imposed
by the number of snapshot GUIDs that fit in the bonus buffer of the
redaction list object. While this is more than enough for most use
cases, there are some limited cases where larger numbers would be
useful to support.
We tweak the redaction list creation code to use a spill block if
the number of redaction snapshots is above the amount that would fit
in the bonus buffer. We also make a small change to allow spill blocks
to be use for types of data besides SA. In order to fully leverage
this logic, we also change the redaction code to use vmem_alloc, to
handle extremely large allocations if needed. Finally, small tweaks
were made to the zfs commands and the test suite.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15018
Ashift can be set for a vdev only during its creation, and the
top-level vdev does not change when a vdev is attached or replaced.
The ashift property should not be used during attachment, as it
does not allow attaching/replacing a vdev if the pool's ashift
property is increased after the existing vdev was created. Instead,
we should be able to attach the vdev if the attached vdev can
satisfy the ashift requirement with its parent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15061
This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.
Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#14642
For draid vdevs it was possible to initiate both the
sequential and healing resilver at same time.
This fixes the following two scenarios.
1) There's a window where a sequential rebuild can
be started via ZED even if a healing resilver has been
scheduled.
- This is fixed by adding additional check in
spa_vdev_attach() for any scheduled resilver and return
appropriate error code when a resilver is already in
progress.
2) It was possible for zpool clear to start a healing
resilver when it wasn't needed at all. This occurs because
during a vdev_open() the device is presumed to be healthy not
until the device is validated by vdev_validate() and it's set
unavailable. However, by this point an async resilver will
have already been requested if the DTL isn't empty.
- This is fixed by cancelling the SPA_ASYNC_RESILVER
request immediately at the end of vdev_reopen() when a resilver
is unneeded.
Finally, added a testcase in ZTS for verification.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes#14881Closes#14892
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#8995Closes#12355
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.
So let's add zpool initialize -u to clear it.
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12451Closes#14873
In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.
Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.
To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.
Before this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: Permanent errors have been detected in the following files:
=============================
After this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: No known data errors
=============================
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14813
For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
dsl_dataset_hold_obj() in order to enable access to the encryption keys
(if loaded). This enables reporting of errors in encrypted filesystems
which are not mounted but have their keys loaded.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14837
spa_import() relies on a pool config fetched by spa_try_import() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device and missing
retrieving the cache device and spare eventually. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14794
This commit expands on the zhack label repair command in d04b5c9 by
adding the -u option to undetach a device by regenerating uberblocks,
in addition to the existing functionality of fixing checksums, now
represented by -c. Previous behavior is retained in the case of no
options.
The changes are heavily inspired by Jeff Bonwick's labelfix
utility, as archived at:
https://gist.github.com/jjwhitney/baaa63144da89726e482
Additionally, it is now capable of properly determining the size of
block devices and other media, as well as handling sizes which are
not divisible by 2^18. This should make it viable for use on physical
devices and partitions, in addition to files.
These changes should make it possible to import zpools that have had
their uberblocks erased, such as in the case of pools rendered
inaccessible by erroneous detach commands.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes#14773
On FreeBSD, `wc` prints some leading spaces, while on Linux it does not.
So we tell ksh to expect an integer, and it does the rest.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#14791Closes#14797
Usage:
zpool set org.freebsd:comment="this is my pool" poolname
Tests are based on zfs_set's user property tests.
Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes#11680
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.
Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2
Notably, this allows to get/set properties on the root vdev:
% zpool set user:prop=value <pool> root-0
Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
% zpool set user:prop=value <pool> mirror-0
This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.
The following tests have been added:
- zpool get all properties from root vdev
- zpool set a property on root vdev
- verify root vdev ZAP is created
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes#14405
Address the following bugs in persistent error log:
1) Check nested clones, eg "fs->snap->clone->snap2->clone2".
2) When deleting files containing error blocks in those clones (from
"clone" the example above), do not break the check chain.
3) When deleting files in the originating fs before syncing the errlog
to disk, do not break the check chain. This happens because at the
time of introducing the error block in the error list, we do not have
its birth txg and the head filesystem. If the original file is
deleted before the error list is synced to the error log (which is
when we actually lookup the birth txg and the head filesystem), then
we do not have access to this info anymore and break the check chain.
The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block).Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).
We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.
The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64
We also introduced two new tests covering the above changes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14633
Fix the manpage. The "SYNOPSIS" section is incorrectly formatted for
receive -c. I also took this opportunity to reword some parts and
fix a run-on sentence in the manpage.
Add large block testing for corrective recv. This adds a new test
that makes sure blocks generated using zfs send -L/--large-block
large-block send flag are able to be used for healing.
Since with unloaded key and errlog feature enabled corruption is not
shown in zpool status #13675 is fixed the zfs_receive_corrective.ksh
test no longer sets -o feature@head_errlog=disabled on pool creation
so that it can also test for regressions related to head_errlog feature.
Note that the zfs_receive_compressed_corrective.ksh and
zfs_receive_large_block_corrective.ksh tests are still creating pools
with -o feature@head_errlog=disabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes#14615
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).
The whole design of block cloning is documented in module/zfs/brt.c.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#13392
The approach is straightforward: for dataset ops, if a key was offered,
find the encryption root and the various encryption parameters, derive a
wrapping key if necessary, and then unlock the encryption root. After
that all the regular dataset ops will return unencrypted data, and
that's kinda the whole thing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#11551Closes#12707Closes#14503
This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.
Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14376
ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.
- Decouple prefetch status tracking from b_refcnt. It made sense
originally, but became highly cryptic over the years. Move all the
logic into arc_access(). While there, clean up and comment state
transitions in arc_access(). Some transitions were weird IMO.
- Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done(). To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
- Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
- Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
- Remove hash_lock argument from functions not needing it.
- Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
- Fix few false positive tests found in the process.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14123
Currently, the receiver fails to override the encryption
property for the plain replicated dataset with the error:
"cannot receive incremental stream: encryption property
'encryption' cannot be set for incremental streams.". The
problem is resolved by allowing the receiver to override
the encryption property for plain replicated send.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14253Closes#13533
Commit 68ddc06b61 introduced support
for receiving unencrypted datasets as children of encrypted ones but
unfortunately got the logic upside down. This resulted in failing to
deny receives of incremental sends into encrypted datasets without
their keys loaded. If receiving a filesystem, the receive was done
into a newly created unencrypted child dataset of the target. In
case of volumes the receive made the target volume undeletable since
a dataset was created below it, which we obviously can't handle.
Incremental streams with embedded blocks are affected as well.
We fix the broken logic to properly deny receives in such cases.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#13598Closes#14055Closes#14119
As investigated by #14026, the zpool_add_004_pos can reliably hang if
the timing is not right. This is caused by a race condition between
zed doing zpool reopen (due to the zvol being added to the zpool),
and the command zpool destroy.
This change adds a delay between zpool add zvol and zpool destroy to
avoid these issue, but does not address the underlying problem.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Issue #14026Closes#14052
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13984Closes#14004
When iterating through children physical ashifts for vdev, prefer
ones above the maximum logical ashift, that we can actually use,
but within the administrator defined maximum.
When selecting top-level vdev ashift, do not set it to the defined
maximum in case physical ashift is even higher, but just ignore one.
Using the maximum does not prevent misaligned writes, but reduces
space efficiency. Since ZFS tries to write data sequentially and
aggregates the writes, in many cases large misanigned writes may be
not as bad as the space penalty otherwise.
Allow internal physical ashifts for vdevs higher than SHIFT_MAX.
May be one day allocator or aggregation could benefit from that.
Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB),
so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB),
but only if it really has to or explicitly told to, but not as an
"optimization".
There are some read-intensive NVMe SSDs that report Preferred Write
Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a
space inefficiency that can't be justified. Instead these changes
make ZFS fall back to logical ashift of 12 (4KB) by default and
only warn user that it may be suboptimal for performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#13798
When scrubbing an encrypted filesystem with unloaded key still report an
error in zpool status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#13675Closes#13717
`zpool_expand_001_pos` was often failing due to not seeing autoexpand
commands in the `zpool history`. During testing, I found this to be
unreliable (sometimes the "online" wouldn't appear in `zpool history`)
and unnecessary, as we could simply check that the pool increased in
size.
This commit revamps the test to check for the expanded pool size
and corresponding new free space.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#13743
This type of recv is used to heal corrupted data when a replica
of the data already exists (in the form of a send file for example).
With the provided send stream, corrective receive will read from
disk blocks described by the WRITE records. When any of the reads
come back with ECKSUM we use the data from the corresponding WRITE
record to rewrite the corrupted block.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes#9372
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes#10058Closes#12918
We drop /multiple/ seconds off the generation, a dozen off a clean
rebuild, 185 files, and trivialise the distribution,
which can now be trivially generated via the provided snippets
Dist diff:
-zfs-2.1.99/tests/zfs-tests/tests/functional/pam/utilities.kshlib
+zfs-2.1.99/tests/zfs-tests/tests/functional/pam/utilities.kshlib.in
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13316
Increase the default allowed maximum recordsize from 1M to 16M.
As described in the zfs(4) man page, there are significant costs
which need to be considered before using very large blocks.
However, there are scenarios where they make good sense and
it should no longer be necessary to artificially restrict their
use behind a module option.
Note that for 32-bit platforms we continue to leave this
restriction in place due to the limited virtual address space
available (256-512MB). On these systems only a handful
of blocks could be cached at any one time severely impacting
performance and potentially stability.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12830Closes#13302
Currently, determining which datasets are affected by corruption is
a manual process.
The primary difficulty in reporting the list of affected snapshots is
that since the error was initially found, the snapshot where the error
originally occurred in, may have been deleted. To solve this issue, we
add the ID of the head dataset of the original snapshot which the error
was detected in, to the stored error report. Then any time a filesystem
is deleted, the errors associated with it are deleted as well. Any time
a clone promote occurs, we modify reports associated with the original
head to refer to the new head. The stored error reports are identified
by this head ID, the birth time of the block which the error occurred
in, as well as some information about the error itself are also stored.
Once this information is stored, we can find the set of datasets
affected by an error by walking back the list of snapshots in the given
head until we find one with the appropriate birth txg, and then traverse
through the snapshots of the clone family, terminating a branch if the
block was replaced in a given snapshot. Then we report this information
back to libzfs, and to the zpool status command, where it is displayed
as follows:
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3
08:27:57 2021
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
sdb ONLINE 0 0 1.58K
errors: Permanent errors have been detected in the following files:
test@1:/test.0.0
/test/test.0.0
/test/1clone/test.0.0
A new feature flag is introduced to mark the presence of this change, as
well as promotion and backwards compatibility logic. This is an updated
version of #9175. Rebase required fixing the tests, updating the ABI of
libzfs, updating the man pages, fixing bugs, fixing the error returns,
and updating the old on-disk error logs to the new format when
activating the feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#9175Closes#12812
Originally it was thought it would be useful to split up the kmods
by functionality. This would allow external consumers to only load
what was needed. However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward. Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13274
What remains is a bunch of anonymous untraceable /tmp/tmp.XXXXXXXXXX
files and bak.root.receive.staff1.3835 from an error branch, testdir.1,
testdir.3, and testroot454470 (with children) in testroot
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13259
Otherwise, they leak past the tests and contaminate the running system,
breaking coredumps entirely
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Yannick Le Pennec <yannick.lepennec@live.fr>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13259