Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Richard Yao	b79e7114bb	Add missing increment to dsl_deadlist_move_bpobj() `dc5c8006f6` was recently merged to prefetch up to 128 deadlists. Unfortunately, a loop was missing an increment, such that it will prefetch all deadlists. The performance properties of that patch probably should be re-evaluated. This was caught by CodeQL's cpp/constant-comparison check in an experimental branch where I am testing the security-and-extended queries. It complained about the `i < 128` part of the loop condition always evaluating to the same thing. The standard CodeQL configuration we use missed this because it does not include that check. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14573	2023-03-06 15:28:26 -08:00
Richard Yao	8846139b45	SHA2Init() should use signed assertions when checking an enum The recent `4c5fec01a4` commit caused Coverity to report that ASSERT3U(algotype, >=, SHA256_MECH_INFO_TYPE); is always true. That is because the signed algotype and signed SHA256_MECH_INFO_TYPE values were cast to unsigned types. To fix this, we switch the assertions to use ASSERT3S(), which retains the signedness of the original values for the comparison. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1535300) Closes #14573	2023-03-06 15:26:43 -08:00
Jorgen Lundman	47119d60ef	Restore ASMABI and other Unify work Make sure all SHA2 transform function has wrappers For ASMABI to work, it is required the calling convention is consistent. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Joergen Lundman <lundman@lundman.net> Closes #14569	2023-03-06 15:24:05 -08:00
Tino Reichardt	620a977f22	Use SECTION_STATIC macro for sha2 x86_64 assembly - instead of ".section .rodata" we should use SECTION_STATIC Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:33 -08:00
Tino Reichardt	f9f9bef22f	Update BLAKE3 for using the new impl handling This commit changes the BLAKE3 implementation handling and also the calls to it from the ztest command. Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:27 -08:00
Tino Reichardt	4c5fec01a4	Add generic implementation handling and SHA2 impl The skeleton file module/icp/include/generic_impl.c can be used for iterating over different implementations of algorithms. It is used by SHA256, SHA512 and BLAKE3 currently. The Solaris SHA2 implementation got replaced with a version which is based on public domain code of cppcrypto v0.10. These assembly files are taken from current openssl master: - sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64) - sha512-x86_64.S: x64, AVX, AVX2 (x86_64) - sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm) - sha512-armv7.S: ARMv7, NEON (arm) - sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64) - sha512-armv8.S: ARMv7, ARMv8-CE (aarch64) - sha256-ppc.S: Generic PPC64 LE/BE (ppc64) - sha512-ppc.S: Generic PPC64 LE/BE (ppc64) - sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) - sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:21 -08:00
Tino Reichardt	3e254aaad0	Remove old or redundant SHA2 files We had three sha2.h headers in different places. The FreeBSD version, the Linux version and the generic solaris version. The only assembly used for acceleration was some old x86-64 openssl implementation for sha256 within the icp module. For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS, these files got removed also. Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:50:21 -08:00
Alexander Motin	5f42d1dbf2	System-wide speculative prefetch limit. With some pathological access patterns it is possible to make ZFS accumulate almost unlimited amount of speculative prefetch ZIOs. Combined with linear ABD allocations in RAIDZ code, it appears to be possible to exhaust system KVA, triggering kernel panic. Address this by introducing a system-wide counter of active prefetch requests and blocking prefetch distance doubling per stream hits if the number of active requests is higher that ~6% of ARC size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14516	2023-03-01 15:27:40 -08:00
Richard Yao	4c856fb333	Fix data race between zil_commit() and zil_suspend() openzfsonwindows/openzfs#206 found that it is possible to trip `VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed by the scheduler long enough for a parallel `zil_suspend()` operation to exit `zil_commit_impl()`. This is a data race. To prevent this, we introduce a `zilog->zl_suspend_lock` rwlock to ensure that all outstanding `zil_commit()` operations finish before `zil_suspend()` begins and that subsequent operations fallback to `txg_wait_synced()` after `zil_suspend()` has begun. On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers from writer starvation. This means that a ZIL intensive system can delay `zil_suspend()` indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against `PREEMPT_RT` Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now. Reported-by: Arun KV <arun.kv@datacore.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14514	2023-03-01 13:23:09 -08:00
Richard Yao	9fd5fedc67	Remove bad kmem_free() oversight from previous zfsdev_state_list patch I forgot to remove the corresponding kmem_free() from zfs_kmod_fini() in `9a14ce43c3`. Clang's static analyzer did not complain, but the Coverity scan that was run after the patch was merged did. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1535275) Closes #14556	2023-03-01 13:20:53 -08:00
Richard Yao	dd108f5d73	Linux: zfs_fillpage() should handle partial pages from end of file After `89cd2197b9` was merged, Clang's static analyzer began complaining about a dead assignment in `zfs_fillpage()`. Upon inspection, I noticed that the dead assignment was because we are not using the calculated io_len that we should use to avoid asking the DMU to read past the end of a file. This should result in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`. This issue predates `89cd2197b9`, but its simplification of zfs_fillpage() eliminated the only use of the assignment to io_len, which made Clang's static analyzer complain about the issue. Also, as a precaution, we add an assertion that io_offset < i_size. If this ever fails, bad things will happen. Otherwise, we are blindly trusting the kernel not to give us invalid offsets. We continue to blindly trust it on non-debug kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14534	2023-03-01 13:19:47 -08:00
Richard Yao	3a7c35119e	Handle unexpected errors in zil_lwb_commit() without ASSERT() We tripped `ASSERT(error == ENOENT \|\| error == EEXIST \|\| error == EALREADY)` in `zil_lwb_commit()` at Klara when doing robustness testing of ZIL against drive power cycles. That assertion presumably exists because when this code was written, the only errors expected from here were EIO, ENOENT, EEXIST and EALREADY, with EIO having its own handling before the assertion. However, upon doing a manual depth first search traversal of the source tree, it turns out that a large number of unexpected errors are possible here. In theory, EINVAL and ENOSPC can come from dnode_hold_impl(). However, most unexpected errors originate in the block layer and come to us from zio_wait() in various ways. One way is ->zl_get_data() -> dmu_buf_hold() -> dbuf_read() -> zio_wait(). From vdev_disk.c on Linux alone, zio_wait() can return the unexpected errors ENXIO, ENOTSUP, EOPNOTSUPP, ETIMEDOUT, ENOSPC, ENOLINK, EREMOTEIO, EBADE, ENODATA, EILSEQ and ENOMEM This was only observed after what have been likely over 1000 test iterations, so we do not expect to reproduce this again to find out what the error code was. However, circumstantial evidence suggests that the error was ENXIO. When ENXIO or any other unexpected error occurs, the `fsync()` or equivalent operation that called zil_commit() will return success, when in fact, dirty data has not been committed to stable storage. This is a violation of the Single UNIX Specification. The code should be able to handle this and any other unknown error by calling `txg_wait_synced()`. In addition to changing the code to call txg_wait_synced() on unexpected errors instead of returning, we modify it to print information about unexpected errors to dmesg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14532	2023-03-01 09:39:41 -08:00
Allan Jude	cbd76b4032	Makefile.bsd: cleanup and sync with FreeBSD xxhash.c was not being compiled, so when FreeBSD's kernel switched to a newer version of ZSTD a few weeks ago, out-of-tree ZFS failed to build Sync module/Makefile.bsd with FreeBSD's sys/modules/zfs/Makefile And restore the alphabetical sort in a number of places Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Klara, Inc. Closes #14508	2023-02-28 17:33:59 -08:00
Richard Yao	ae7e700650	Suppress static analyzer warning in dmu_objset_create_impl_dnstats() Clang's static analyzer claims that dereferencing ds in dmu_objset_create_impl_dnstats() could cause a NULL pointer dereference when a previous NULL check confirms that it is NULL. It is only NULL on the MOS, for which dmu_objset_userused_enabled(os) should always return false, so ds will never be dereferenced when it is NULL. We add an assertion to suppress this warning. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14470	2023-02-28 17:31:58 -08:00
Richard Yao	5cc4950901	Suppress static analyzer warning in dbuf_hold_copy() Clang's static analyzer claims that dbuf_hold_copy() will have a NULL pointer dereference in data->b_data when called by dbuf_hold_impl(). This is impossible because data is dr->dt.dl.dr_data, which is non-NULL whenever db->db_level == 0, which is always the case whenever dbuf_hold_impl() calls dbuf_hold_copy(). We add an assertion to suppress the complaint. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14470	2023-02-28 17:31:50 -08:00
Richard Yao	9a14ce43c3	Statically allocate first node of zfsdev_state_list This avoids a call to kmem_alloc() during module load. It also suppresses a defect report from Clang's static analyzer that claims that we will have a NULL pointer dereference in zfsdev_state_init() because it does not understand that this has already been allocated in zfs_kmod_init(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14470	2023-02-28 17:31:41 -08:00
Richard Yao	7fc48f8378	Suppress static analyzer warning in sa_attr_iter() Clang's static analyzer points out that when IS_SA_BONUSTYPE(type) is true and .sa_length is 0 for an attribute, we have a NULL pointer dereference. We suppress this with an IMPLY() statement. This was also identified by Coverity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1017954) Closes #14470	2023-02-28 17:31:30 -08:00
Richard Yao	4d9bb5514c	Suppress static analyzer warnings in zio_checksum_error_impl() Clang's static analyzer informs us of multiple NULL pointer dereferences involving zio_checksum_error_impl(). The first is a NULL pointer dereference if bp is NULL and ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is false, but bp is NULL implies that ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true, so we add an IMPLY() statement to suppress the report. The second and third are identical, and are duplicated because while the NULL pointer dereference occurs in zio_checksum_gang_verifier(), it is called by zio_checksum_error_impl() and there is a report for each of the two functions. The reports state that when bp is NULL, ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true and checksum is not ZIO_CHECKSUM_LABEL, we also have a NULL pointer dereference. bp is NULL should imply that checksum == ZIO_CHECKSUM_LABEL, so we add an IMPLY() statement to suppress the second report. The two reports are functionally identical. A fourth variation of this was also reported by Coverity. It occurs when checksum == ZIO_CHECKSUM_ZILOG2. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1524672) Closes #14470	2023-02-28 17:31:08 -08:00
Richard Yao	d634d20d1b	icp: Prevent compilers from optimizing away memset() in gcm_clear_ctx() The recently merged `f58e513f74` was intended to zero sensitive data before exit from encryption functions to harden the code against theoretical information leaks. Unfortunately, the method by which it did that is optimized away by the compiler, so some information still leaks. This was confirmed by counting function calls in disassembly. After studying how the OpenBSD, FreeBSD and Linux kernels handle this, and looking at our disassembly, I decided on a two-factor approach to protect us from compiler dead store elimination passes. The first factor is to stop trying to inline gcm_clear_ctx(). GCC does not actually inline it in the first place, and testing suggests that dead store elimination passes appear to become more powerful in a bad way when inlining is forced, so we recognize that and move gcm_clear_ctx() to a C file. The second factor is to implement an explicit_memset() function based on the technique used by `secure_zero_memory()` in FreeBSD's blake2 implementation, which coincidentally is functionally identical to the one used by Linux. The source for this appears to be a LLVM bug: https://llvm.org/bugs/show_bug.cgi?id=15495 Unlike both FreeBSD and Linux, we explicitly avoid the inline keyword, based on my observations that GCC's dead store elimination pass becomes more powerful when inlining is forced, under the assumption that it will be equally powerful when the compiler does decide to inline function calls. Disassembly of GCC's output confirms that all 6 memset() calls are executed with this patch applied. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14544	2023-02-28 17:28:50 -08:00
George Amanakis	13ff72ba0a	Revert zfeature_active() to static Commit `34ce4c4` made zfeature_active() non-static. This is not required. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14546	2023-02-28 14:03:52 -08:00
John Poduska	73c383f541	Prevent incorrect datasets being mounted During a mount, zpl_mount_impl(), uses sget() with the callback zpl_test_super() to find a super_block with a matching objset, stored in z_os. It does so without taking the teardown lock on the zfsvfs. The problem is that operations like rollback will replace the z_os. And, there is a window where the objset in the rollback is freed, but z_os still points to it. Then, a mount like operation, for instance a clone, can reallocate that exact same pointer and zpl_test_super() will then match the super_block associated with the rollback as opposed to the clone. This fix tests for a match and if so, takes the teardown lock before doing the final match test. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #14518	2023-02-27 16:49:34 -08:00
Richard Yao	bff26b0220	Skip memory allocation when compressing holes Hole detection in the zio compression code allows us to opportunistically skip compression on holes. We can go a step further by not doing memory allocations on holes either. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14500	2023-02-27 14:41:02 -08:00
Attila Fülöp	f58e513f74	ICP: AES-GCM: Refactor gcm_clear_ctx() Currently the temporary buffer in which decryption takes place isn't cleared on context destruction. Further in some routines we fail to call gcm_clear_ctx() on error exit. Both flaws may result in leaking sensitive data. We follow best practices and zero out the plaintext buffer before freeing the memory holding it. Also move all cleanup into gcm_clear_ctx() and call it on any context destruction. The performance impact should be negligible. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14528	2023-02-27 14:38:12 -08:00
Mariusz Zaborski	3b9309aabe	Move zap_attribute_t to the heap in dsl_deadlist_merge In the case of a regular compilation, the compiler raises a warning for a dsl_deadlist_merge function, that the stack size is to large. In debug build this can generate an error. Move large structures to heap. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #14524	2023-02-27 14:27:58 -08:00
George Amanakis	d816bc5ec7	Move dmu_buf_rele() after dsl_dataset_sync_done() Otherwise the dataset may be freed after the last dmu_buf_rele() leading to a panic. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14522 Closes #14523	2023-02-23 18:14:52 -07:00
Brian Behlendorf	89cd2197b9	Fix buffered/direct/mmap I/O race When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13608 Closes #14498	2023-02-23 10:57:24 -08:00
Richard Yao	7cb67d627c	Fix NULL pointer dereference in zio_ready() Clang's static analyzer correctly identified a NULL pointer dereference in zio_ready() when ZIO_FLAG_NODATA has been set on a zio that is missing a block pointer. The NULL pointer dereference occurs because we have logic intended to disable ZIO_FLAG_NODATA when it has been set on a gang block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14469	2023-02-23 10:19:08 -08:00
Richard Yao	c9e39da9a4	Use rw_tryupgrade() in dmu_bonus_hold_by_dnode() When dn->dn_bonus == NULL, dmu_bonus_hold_by_dnode() will unlock its read lock on dn->dn_struct_rwlock and grab a write lock. This can be micro-optimized by calling rw_tryupgrade(). Linux will not benefit from this since it does not support rwlock upgrades, but FreeBSD will. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14517	2023-02-22 16:33:23 -08:00
rob-wing	28251d81d7	FreeBSD: don't verify recycled vnode for zfs control directory Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 #7 0xffffffff80669841 at vgone+0x31 #8 0xffffffff8065806d at vfs_hash_insert+0x26d #9 0xffffffff81a39069 at sfs_vgetx+0x149 #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #11 0xffffffff8065a28c at lookup+0x45c #12 0xffffffff806594b9 at namei+0x259 #13 0xffffffff80676a33 at kern_statat+0xf3 #14 0xffffffff8067712f at sys_fstatat+0x2f #15 0xffffffff808ae50c at amd64_syscall+0x10c #16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d #8 0xffffffff81a39069 at sfs_vgetx+0x149 #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #10 0xffffffff80661c2c at lookup+0x45c #11 0xffffffff80660e59 at namei+0x259 #12 0xffffffff8067e3d3 at kern_statat+0xf3 #13 0xffffffff8067eacf at sys_fstatat+0x2f #14 0xffffffff808b5ecc at amd64_syscall+0x10c #15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes #14501	2023-02-21 17:26:33 -08:00
Allan Jude	1d56c6d017	Fix per-jail zfs.mount_snapshot setting When jail.conf set the nopersist flag during startup, it was incorrectly destroying the per-jail ZFS settings. Reported-by: Martin Matuska <mm@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Modirum MDPay Sponsored-by: Klara, Inc. Closes #14509	2023-02-21 17:23:01 -08:00
George Amanakis	0f32b1f728	Partially revert `eee9362a7` With commit `34ce4c42f` applied, there is no need for `eee9362a7`. Revert that aside from the test. All tests introduced in those commits pass. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14502	2023-02-21 09:36:22 -08:00
Richard Yao	9f08b6e31f	Sync thread should avoid holding the spa config write lock when possible spa_sync() currently grabs the write lock due to an old hack that is documented by a comment: We need the write lock here because, for aux vdevs, calling vdev_config_dirty() modifies sav_config. This is ugly and will become unnecessary when we eliminate the aux vdev wart by integrating all vdevs into the root vdev tree. This has lead to deadlocks in rare edge cases from holding the write lock. We can reduce incidence of these deadlocks by not grabbing the write lock on pools without auxillary vdevs. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Closes #14282	2023-02-16 14:10:52 -08:00
Paul Dagnelie	dc72c60ec1	zfs redact fails when dnodesize=auto Add handling to dmu_object_next for the case where *objectp == 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #14479	2023-02-16 09:23:39 -08:00
Richard Yao	f04cb31e7c	Suppress Clang static analyzer complaint in zfs_replay_create() Clang's static analyzer incorrectly complains about an undefined value here when lr->lr_common.lrc_txtype == TX_SYMLINK and txtype == TX_CREATE. This is impossible, because of this line: txtype = (lr->lr_common.lrc_txtype & ~TX_CI((uint64_t)0x1 << 63)); Changing the code to compare against txtype suppresses the report. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14472	2023-02-14 11:05:41 -08:00
Brian Behlendorf	3fc92adc40	Linux: use filemap_range_has_page() As of the 4.13 kernel filemap_range_has_page() can be used to check if there is a page mapped in a given file range. When available this interface should be used which eliminates the need for the zp->z_is_mapped boolean. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14493	2023-02-14 11:04:34 -08:00
Richard Yao	ab672133a9	Give strlcat() full buffer lengths rather than smaller buffer lengths strlcat() is supposed to be given the length of the destination buffer, including the existing contents. Unfortunately, I had been overzealous when I wrote `a51288aabb`, since I gave it the length of the destination buffer, minus the existing contents. This likely caused a regression on large strings. On the topic of being overzealous, the use of strlcat() in dmu_send_estimate_fast() was unnecessary because recv_clone_name is a fixed length string. We continue using strlcat() mostly as defensive programming, in case the string length is ever changed, even though it is unnecessary. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14476	2023-02-14 11:03:42 -08:00
Rich Ercolani	cfd57573ff	quick fix for lingering snapdir unmount problems Unfortunately, even after `e79b6807`, I still, much more rarely, tripped asserts when playing with many ctldir mounts at once. Since this appears to happen if we dispatched twice too fast, just ignore it. We don't actually need to do anything if someone already started doing it for us. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14462	2023-02-13 16:40:13 -08:00
George Amanakis	34ce4c42ff	Fix a race condition in dsl_dataset_sync() when activating features The zio returned from arc_write() in dmu_objset_sync() uses zio_nowait(). However we may reach the end of dsl_dataset_sync() which checks if we need to activate features in the filesystem without knowing if that zio has even run through the ZIO pipeline yet. In that case we will flag features to be activated in dsl_dataset_block_born() but dsl_dataset_sync() has already completed its run and those features will not actually be activated. Mitigate this by moving the feature activation code in dsl_dataset_sync_done(). Also add new ASSERTs in dsl_scan_visitbp() checking if a block contradicts any filesystem flags. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13816	2023-02-13 16:37:46 -08:00
Prakash Surya	13312e2fa1	Reduce need for contiguous memory for ioctls We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free: kernel: Node 0 Normal: 24187584kB (UME) 15495338kB (UE) 016kB 032kB 064kB 0128kB 0256kB 0512kB 01024kB 02048kB 0*4096kB = 22071296kB The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region: kernel: SafeExecutors-4 invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL\|__GFP_NOWARN\|__GFP_COMP\|__GFP_ZERO), order=2, oom_score_adj=0 The offending allocation came from this call trace: kernel: Call Trace: kernel: dump_stack+0x57/0x7a kernel: dump_header+0x4f/0x1e1 kernel: oom_kill_process.cold.33+0xb/0x10 kernel: out_of_memory+0x1ad/0x490 kernel: __alloc_pages_slowpath+0xd55/0xe40 kernel: __alloc_pages_nodemask+0x2df/0x330 kernel: kmalloc_large_node+0x42/0x90 kernel: __kmalloc_node+0x25a/0x320 kernel: ? spl_kmem_free_impl+0x21/0x30 [spl] kernel: spl_kmem_alloc_impl+0xa5/0x100 [spl] kernel: spl_kmem_zalloc+0x19/0x20 [spl] kernel: zfsdev_ioctl+0x2b/0xe0 [zfs] kernel: do_vfs_ioctl+0xa9/0x640 kernel: ? __audit_syscall_entry+0xdd/0x130 kernel: ksys_ioctl+0x67/0x90 kernel: __x64_sys_ioctl+0x1a/0x20 kernel: do_syscall_64+0x5e/0x200 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 kernel: RIP: 0033:0x7fdca3674317 The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system): sdb> sizeof zfs_cmd (size_t)13744 This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request. The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation. Arguably, a better solution would be not to require such a large allocation to begin with (e.g. reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #14474	2023-02-13 16:35:59 -08:00
Alexander Motin	87a4dfa561	Improve arc_read() error reporting Debugging reported NULL de-reference panic in dnode_hold_impl() I found that for certain types of errors arc_read() may only return error code, but not properly report it via done and pio arguments. Lack of done calls may result in reference and/or memory leaks in higher level code. Lack of error reporting via pio may result in unnoticed errors there. For example, dbuf_read(), where dbuf_read_impl() ignores arc_read() return, relies completely on the pio mechanism and missed the errors. This patch makes arc_read() to always call done callback and always propagate errors to parent zio, if either is provided. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14454	2023-02-13 13:21:53 -08:00
Richard Yao	66953686c0	Add assertion and make variables unsigned in abd_alloc_chunks() Clang's static analyzer pointed out that if alloc_pages >= nr_pages before the loop, the value of page will be undefined and will be used anyway. This should not be possible, but as cleanup, we add an assertion. We also recognize that the local variables should be unsigned in the first place, so we make them unsigned. This is not enough to avoid the need for the assertion, since there is still the case that alloc_pages == nr_pages and nr_pages == 0, which the assertion implicitly checks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:10:50 -08:00
Richard Yao	cfb49616cd	Cleanup: spa vdev processing should check NULL pointers The PVS Studio 2016 FreeBSD kernel report stated: \contrib\opensolaris\uts\common\fs\zfs\spa.c (1341): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1341, 1342. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1355): error V595: The 'spa->spa_l2cache.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1355, 1357. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1398): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1398, 1408. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1583): error V595: The 'oldvdevs' pointer was utilized before it was verified against nullptr. Check lines: 1583, 1595. In practice, all of these uses were safe because a NULL pointer implied a 0 vdev count, which kept us from iterating over vdevs. However, rearranging the code to check the pointer first is not a terrible micro-optimization and makes it more readable, so let us do that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:09:28 -08:00
Richard Yao	3a7d2a0ce0	zfs_get_temporary_prop() should not pass NULL to strcpy() `dsl_dir_activity_in_progress()` can call `zfs_get_temporary_prop()` with the forth value set to NULL, which will pass NULL to `strcpy()` when there is a match Clang's static analyzer caught this with the help of CodeChecker for Cross Translation Unit analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:08:57 -08:00
Matthew Ahrens	14872aaa4f	EIO caused by encryption + recursive gang Encrypted blocks can not have 3 DVAs, because they use the space of the 3rd DVA for the IV+salt. zio_write_gang_block() takes this into account, setting `gbh_copies` to no more than 2 in this case. Gang members BP's do not have the X (encrypted) bit set (nor do they have the DMU level and type fields set), because encryption is not handled at this level. The gang block is reassembled, and then encryption (and compression) are handled. To check if this gang block is encrypted, the code in zio_write_gang_block() checks `pio->io_bp`. This is normally fine, because the block that's being ganged is typically the encrypted BP. The problem is that if there is "recursive ganging", where a gang member is itself a gang block, then when zio_write_gang_block() is called to create a gang block for a gang member, `pio->io_bp` is the gang member's BP, which doesn't have the X bit set, so the number of DVA's is not restricted to 2. It should instead be looking at the the "gang leader", i.e. the top-level gang block, to determine how many DVA's can be used, to avoid a "NDVA's inversion" (where a child has more DVA's than its parent). gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's space: ``` DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=... [L0 ZFS plain file] fletcher4 uncompressed encrypted LE gang unique double size=100000L/100000P birth=... fill=1 cksum=... ``` leader's GBH contains a BP with gang bit set and 3 DVA's: ``` DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique double size=55600L/55600P birth=... fill=0 cksum=... DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique double size=55600L/55600P birth=... fill=0 cksum=... DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200> [L0 unallocated] fletcher4 uncompressed unencrypted LE gang unique double size=55400L/55400P birth=... fill=0 cksum=... ``` On nondebug bits, having the 3rd DVA in the gang block works for the most part, because it's true that all 3 DVA's are available in the gang member BP (in the GBH). However, for accounting purposes, gang block DVA's ASIZE include all the space allocated below them, i.e. the 512-byte gang block header (GBH) as well as the gang members below that. We see that above where the gang leader BP is 1MB logical (and after compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB) more than 1MB (0x`100400`). Since thre are 3 copies of a block below it, we increment the ATIME of the 3rd DVA of the gang leader by the space used by the 3rd DVA of the child (1 sector, in this case). But there isn't really a 3rd DVA of the parent; the salt is stored in place of the 3rd DVA's ASIZE. So when zio_write_gang_member_ready() increments the parent's BP's `DVA[2]`'s ASIZE, it's actually incrementing the parent's salt. When we later try to read the encrypted recursively-ganged block, the salt doesn't match what we used to write it, so MAC verification fails and we get an EIO. ``` zio_encrypt(): encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89 zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89 ``` This commit addresses the problem by not increasing the number of copies of the GBH beyond 2 (even for non-encrypted blocks). This simplifies the logic while maintaining the ability to traverse all metadata (including gang blocks) even if one copy is lost. (Note that 3 copies of the GBH will still be created if requested, e.g. for `copies=3` or MOS blocks.) Additionally, the code that increments the parent's DVA's ASIZE is made to check the parent DVA's NDVAS even on nondebug bits. So if there's a similar bug in the future, it will cause a panic when trying to write, rather than corrupting the parent BP and causing an error when reading. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Caused-by: #14356 Closes #14440 Closes #14413	2023-02-06 09:37:06 -08:00
Jorgen Lundman	dffd40b3b6	Unify assembly files with macOS The remaining changes needed to make the assembly files work with macOS. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #14451	2023-02-06 09:27:55 -08:00
Allan Jude	c799866b97	Resolve WS-2021-0184 vulnerability in zstd Pull in d40f55cd950919d7eac951b122668e55e33e5202 from upstream Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14439	2023-02-02 15:12:51 -08:00
Brian Behlendorf	973934b965	Increase default zfs_rebuild_vdev_limit to 64MB When testing distributed rebuild performance with more capable hardware it was observed than increasing the zfs_rebuild_vdev_limit to 64M reduced the rebuild time by 17%. Beyond 64MB there was some improvement (~2%) but it was not significant when weighed against the increased memory usage. Memory usage is capped at 1/4 of arc_c_max. Additionally, vr_bytes_inflight_max has been moved so it's updated per-metaslab to allow the size to be adjust while a rebuild is running. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-01-27 10:02:24 -08:00
Brian Behlendorf	c0aea7cf4e	Increase default zfs_scan_vdev_limit to 16MB For HDD based pools the default zfs_scan_vdev_limit of 4M per-vdev can significantly limit the maximum scrub performance. Increasing the default to 16M can double the scrub speed from 80 MB/s per disk to 160 MB/s per disk. This does increase the memory footprint during scrub/resilver but given the performance win this is a reasonable trade off. Memory usage is capped at 1/4 of arc_c_max. Note that number of outstanding I/Os has not changed and is still limited by zfs_vdev_scrub_max_active. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-01-27 10:01:13 -08:00
Alexander Motin	dc5c8006f6	Prefetch on deadlists merge During snapshot deletion ZFS may issue several reads for each deadlist to merge them into next snapshot's or pool's bpobj. Number of the dead lists increases with number of snapshots. On HDD pools it may take significant time during which sync thread is blocked. This patch introduces prescient prefetch of required blocks for up to 128 deadlists ahead. Tests show reduction of time required to delete dataset with 720 snapshots with randomly overwritten file on wide HDD pool from 75-85 to 22-28 seconds. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Issue #14276 Closes #14402	2023-01-25 11:30:24 -08:00
Brian Behlendorf	c85ac731a0	Improve resilver ETAs When resilvering the estimated time remaining is calculated using the average issue rate over the current pass. Where the current pass starts when a scan was started, or restarted, if the pool was exported/imported. For dRAID pools in particular this can result in wildly optimistic estimates since the issue rate will be very high while scanning when non-degraded regions of the pool are scanned. Once repair I/O starts being issued performance drops to a realistic number but the estimated performance is still significantly skewed. To address this we redefine a pass such that it starts after a scanning phase completes so the issue rate is more reflective of recent performance. Additionally, the zfs_scan_report_txgs module option can be set to reset the pass statistics more often. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14410	2023-01-25 11:28:54 -08:00
Coleman Kane	9cd71c8604	linux 6.2 compat: zpl_set_acl arg2 is now struct dentry Linux 6.2 changes the second argument of the set_acl operation to be a "struct dentry " rather than a "struct inode ". The inode* parameter is still available as dentry->d_inode, so adjust the call to the _impl function call to dereference and pass that pointer to it. Also document that the get_acl -> get_inode_acl member name change from commit `884a693` was an API change also introduced in Linux 6.2. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14415	2023-01-24 11:20:50 -08:00
Alexander Motin	0f740a4f1d	Introduce minimal ZIL block commit delay Despite all optimizations, tests on actual hardware show that FreeBSD kernel can't sleep for less then ~2us. Similar tests on Linux show ~50us delay at least from nanosleep() (haven't tested inside kernel). It means that on very fast log device ZIL may not be able to satisfy zfs_commit_timeout_pct block commit timeout, increasing log latency more than desired. Handle that by introduction of zil_min_commit_timeout parameter, specifying minimal timeout value where additional delays to aggregate writes may be skipped. Also skip delays if the LWB is more than 7/8 full, that often happens if I/O sizes are constant and match one of LWB sizes. Both things are applied only if there were no already outstanding log blocks, that may indicate single-threaded workload, that by definition can not benefit from the commit delays. While there, add short time moving average to zl_last_lwb_latency to make it more stable. Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of 1 there IOPS increase by 5.5% instead of expected 1%. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14418	2023-01-24 09:20:32 -08:00
Attila Fülöp	037e4f2536	x86 asm: Replace .align with .balign The .align directive used to align storage locations is ambiguous. On some platforms and assemblers it takes a byte count, on others the argument is interpreted as a shift value. The current usage expects the first interpretation. Replace it with the unambiguous .balign directive which always expects a byte count, regardless of platform and assembler. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14422	2023-01-24 09:04:39 -08:00
Attila Fülöp	58ca7b1011	IPC: blake3 x86 asm: fix placement of .size directives The .size directive used by the SET_SIZE C macro uses the special dot symbol to calculate the size of a function. The dot symbol refers to the current address, so for the calculation to be meaningful the SET_SIZE macro must be placed immediately after the end of the function the size is calculated for. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14422	2023-01-24 09:03:31 -08:00
David Hedberg	37a27b4306	Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block. If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg. To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com> Sponsored by: Findity AB Closes #11893 Closes #14358	2023-01-23 13:19:43 -08:00
Richard Yao	73968defdd	Reject streams that set ->drr_payloadlen to unreasonably large values In the zstream code, Coverity reported: "The argument could be controlled by an attacker, who could invoke the function with arbitrary values (for example, a very high or negative buffer size)." It did not report this in the kernel. This is likely because the userspace code stored this in an int before passing it into the allocator, while the kernel code stored it in a uint32_t. However, this did reveal a potentially real problem. On 32-bit systems and systems with only 4GB of physical memory or less in general, it is possible to pass a large enough value that the system will hang. Even worse, on Linux systems, the kernel memory allocator is not able to support allocations up to the maximum 4GB allocation size that this allows. This had already been limited in userspace to 64MB by `ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to protect systems. After some discussion, we settle on 256MB as a hard upper limit. Attempting to receive a stream that requires more memory than that will result in E2BIG being returned to user space. Reported-by: Coverity (CID-1529836) Reported-by: Coverity (CID-1529837) Reported-by: Coverity (CID-1529838) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14285	2023-01-23 13:16:22 -08:00
rob-wing	69f024a56e	Configure zed's diagnosis engine with vdev properties Introduce four new vdev properties: checksum_n checksum_t io_n io_t These properties can be used for configuring the thresholds of zed's diagnosis engine and are interpeted as <N> events in T <seconds>. When this property is set to a non-default value on a top-level vdev, those thresholds will also apply to its leaf vdevs. This behavior can be overridden by explicitly setting the property on the leaf vdev. Note that, these properties do not persist across vdev replacement. For this reason, it is advisable to set the property on the top-level vdev instead of the leaf vdev. The default values for zed's diagnosis engine (10 events, 600 seconds) remains unchanged. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology LLC Closes #13805	2023-01-23 13:14:25 -08:00
Richard Yao	f091db9248	free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report In 2016, the authors of PVS Studio ran it on the FreeBSD kernel, which identified a number of bugs / cleanup opportunities in the FreeBSD ZFS kernel code. A few of them persist to the present day: https://reviews.freebsd.org/D5245 Note that the scan was done against freebsd/freebsd-src@46763fd4ca. In particular, we have the following in free_blocks(): \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (174): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (171): error V634: The priority of the '' operation is higher than that of the '<<' operation. It's possible that parentheses should be used in the expression. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (175): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0. A couple of assertions accidentally typecast the arguments they check to unsigned in such a way that the result is always true. Also, parentheses are missing around `1<<epbs` in `(db->db_blkid 1<<epbs)`. This works out to be okay due to multiplication not caring what order of operations we use, but it is better to fix it to be `(db->db_blkid << epbs)`. A few of the function local variables probably never should have been 32-bit in the first place, so we make them 64-bit. We also replace the existing assertions with additional assertions to ensure that 64-bit unsigned arithmetic is safe. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14407	2023-01-23 13:12:37 -08:00
Chunwei Chen	71974946be	Fix reading uninitialized variable in receive_read When zfs_file_read returns error, resid may be uninitialized. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #14404	2023-01-20 11:49:56 -08:00
Richard Yao	856cefcd1c	Cleanup ->dd_space_towrite should be unsigned This is only ever used with unsigned data, so the type itself should be unsigned. Also, PVS Studio's 2016 FreeBSD kernel report correctly identified the following assertion as always being true, so we can drop it: ASSERT3U(dd->dd_space_towrite[i & TXG_MASK], >=, 0); The reason it was always true is because it would do casts to give us unsigned comparisons. This could have been fixed by switching to `ASSERT3S()`, but upon inspection, it turned out that this variable never should have been allowed to be signed in the first place. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14408	2023-01-20 11:10:15 -08:00
Mark Johnston	ebabb93e6c	Micro-optimize dsl_prop_get_dd() Use the saved property index instead of looking it up once per DSL directory when traversing up towards the root. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation Closes #14397	2023-01-20 11:01:41 -08:00
Mark Johnston	7c30100c00	Avoid passing an uninitialized index to dsl_prop_known_index Reported-by: KMSAN Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation Closes #14397	2023-01-20 11:00:38 -08:00
Chunwei Chen	c6dab6dd39	Fix unprotected zfs_znode_dmu_fini In original code, zfs_znode_dmu_fini is called in zfs_rmnode without zfs_znode_hold_enter. It seems to assume it's ok to do so when the znode is unlinked. However this assumption is not correct, as zfs_zget can be called by NFS through zpl_fh_to_dentry as pointed out by Christian in https://github.com/openzfs/zfs/pull/12767, which could result in a use-after-free bug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12767 Closes #14364	2023-01-19 16:59:05 -08:00
Jorgen Lundman	68c0771cc9	Unify Assembler files between Linux and Windows Add new macro ASMABI used by Windows to change calling API to "sysv_abi". Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #14228	2023-01-17 11:09:19 -08:00
Richard Yao	2e7f664f04	Cleanup of dead code suggested by Clang Static Analyzer (#14380 ) I recently gained the ability to run Clang's static analyzer on the linux kernel modules via a few hacks. This extended coverage to code that was previously missed since Clang's static analyzer only looked at code that we built in userspace. Running it against the Linux kernel modules built from my local branch produced a total of 72 reports against my local branch. Of those, 50 were reports of logic errors and 22 were reports of dead code. Since we already had cleaned up all of the previous dead code reports, I felt it would be a good next step to clean up these dead code reports. Clang did a further breakdown of the dead code reports into: Dead assignment 15 Dead increment 2 Dead nested assignment 5 The benefit of cleaning these up, especially in the case of dead nested assignment, is that they can expose places where our error handling is incorrect. A number of them were fairly straight forward. However several were not: In vdev_disk_physio_completion(), not only were we not using the return value from the static function vdev_disk_dio_put(), but nothing used it, so I changed it to return void and removed the existing (void) cast in the other area where we call it in addition to no longer storing it to a stack value. In FSE_createDTable(), the function is dead code. Its helper function FSE_freeDTable() is also dead code, as are the CPP definitions in `module/zstd/include/zstd_compat_wrapper.h`. We just delete it all. In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig() returns 0 if there are waiting signals and 1 if there are none. The Linux SPL version literally returns `signal_pending(current) ? 0 : 1)` and FreeBSD implements the same semantics, we can just do `!cv_wait_sig()` in place of `signal_pending(current)` to avoid unnecessarily calling it again. zfs_setattr() on FreeBSD version did not have error handling issue because the code was removed entirely from FreeBSD version. The error is from updating the attribute directory's files. After some thought, I decided to propapage errors on it to userspace. In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the first check in favor of checking three other permissions. I assume this is intentional. In zfs_create_fs(), the return value of zap_update() was not checked despite setting an important version number. I see no backward compatibility reason to permit failures, so we add an assertion to catch failures. Interestingly, Linux is still using ASSERT(error == 0) from OpenSolaris while FreeBSD has switched to the improved ASSERT0(error) from illumos, although illumos has yet to adopt it here. ASSERT(error == 0) was used on Linux while ASSERT0(error) was used on FreeBSD since the entire file needs conversion and that should be the subject of another patch. dnode_move()'s issue was caused by us not having implemented POINTER_IS_VALID() on Linux. We have a stub in `include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be in `include/os/linux/spl/sys/kmem.h` to be consistent with Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and `POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we copy what it did. Whenever a report was in platform-specific code, I checked the FreeBSD version to see if it also applied to FreeBSD, but it was only relevant a few times. Lastly, the patch that enabled Clang's static analyzer to be run on the Linux kernel modules needs more work before it can be put into a PR. I plan to do that in the future as part of the on-going static analysis work that I am doing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14380	2023-01-17 09:57:12 -08:00
Richard Yao	d27c7ba62f	Linux ppc64le ieee128 compat: Do not redefine __asm on external headers There is an external assembly declaration extension in GNU C that glibc uses when building with ieee128 floating point support on ppc64le. Marking that as volatile makes no sense, so the build breaks. It does not make sense to only mark this as volatile on Linux, since if do not want the compiler reordering things on Linux, we do not want the compiler reordering things on any other platform, so we stop treating Linux specially and just manually inline the CPP macro so that we can eliminate it. This should fix the build on ppc64le. Tested-by: @gyakovlev Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14308 Closes #14384	2023-01-13 10:58:58 -08:00
Richard Yao	4ef69de384	Cleanup: Use NULL when doing NULL pointer comparisons The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/null/badzero.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:37 -08:00
Richard Yao	64195fc89f	Cleanup: Remove unneeded semicolons The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/semicolon.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:30 -08:00
Richard Yao	3b2f9c1ec8	Cleanup: Use MIN() macro The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/minmax.cocci There was a third opportunity to use `MIN()`, but that was in `FSE_minTableLog()` in `module/zstd/lib/compress/fse_compress.c`. Upstream zstd has yet to make this change and I did not want to change header includes just for MIN, or do a one off, so I left it alone. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:23 -08:00
Richard Yao	e6328fda2e	Cleanup: !A \|\| A && B is equivalent to !A \|\| B In zfs_zaccess_dataset_check(), we have the following subexpression: (!IS_DEVVP(ZTOV(zp)) \|\| (IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS))) When !IS_DEVVP(ZTOV(zp)) is false, IS_DEVVP(ZTOV(zp)) is true under the law of the excluded middle since we are not doing pseudoboolean alegbra. Therefore doing: (IS_DEVVP(ZTOV(zp)) && (v4_mode & WRITE_MASK_ATTRS)) Is unnecessary and we can just do: (v4_mode & WRITE_MASK_ATTRS) The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/excluded_middle.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:15 -08:00
Richard Yao	9c8fabffa2	Cleanup: Replace oldstyle struct hack with C99 flexible array members The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/flexible_array.cocci However, unlike the cases where the GNU zero length array extension had been used, coccicheck would not suggest patches for the older style single member arrays. That was good because blindly changing them would break size calculations in most cases. Therefore, this required care to make sure that we did not break size calculations. In the case of `indirect_split_t`, we use `offsetof(indirect_split_t, is_child[is->is_children])` to calculate size. This might be subtly wrong according to an old mailing list thread: https://inbox.sourceware.org/gcc-prs/20021226123454.27019.qmail@sources.redhat.com/T/ That is because the C99 specification should consider the flexible array members to start at the end of a structure, but compilers prefer to put padding at the end. A suggestion was made to allow compilers to allocate padding after the VLA like compilers already did: http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n983.htm However, upon thinking about it, whether or not we allocate end of structure padding does not matter, so using offsetof() to calculate the size of the structure is fine, so long as we do not mix it with sizeof() on structures with no array members. In the case that we mix them and padding causes offsetof(struct_t, vla_member[0]) to differ from sizeof(struct_t), we would be doing unsafe operations if we underallocate via `offsetof()` and then overcopy via sizeof(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:03 -08:00
Richard Yao	d35ccc1f59	Cleanup: Fix indentation in zfs_dbgmsg_t `fdc2d30371` accidentally broke the indentation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:56 -08:00
Richard Yao	8e7ebf4e2d	Cleanup: Use C99 flexible array members instead of zero length arrays The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/flexible_array.cocci The Linux kernel's documentation makes a good case for why we should not use these: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:41 -08:00
Richard Yao	c9c3ce7976	Cleanup: Use kmem_zalloc() instead of memset() to zero memory The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/api/alloc/zalloc-simple.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:28 -08:00
Richard Yao	7384ec65cd	Cleanup: Remove unnecessary explicit casts of pointers from allocators The Linux 5.16.14 kernel's coccicheck caught these. The semantic patch that caught them was: ./scripts/coccinelle/api/alloc/alloc_cast.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:12 -08:00
George Amanakis	eee9362a72	Activate filesystem features only in syncing context When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14304 Closes #14252	2023-01-11 18:00:39 -08:00
Mateusz Piotrowski	926715b9fc	Turn default_bs and default_ibs into ZFS_MODULE_PARAMs The default_bs and default_ibs tunables control the default block size and indirect block size. So far, default_bs and default_ibs were tunable only on FreeBSD, e.g., sysctl vfs.zfs.default_ibs Remove the FreeBSD-specific sysctl code and expose default_bs and default_ibs as tunables on both Linux and FreeBSD using ZFS_MODULE_PARAM. One of the use cases for changing the values of those tunables is to lower the indirect block size, which may improve performance of large directories (as discussed during the OpenZFS Leadership Meeting on 2022-08-16). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14293	2023-01-11 09:38:20 -08:00
Mateusz Piotrowski	a4b21eadec	Add tunable to allow changing micro ZAP's max size This change turns `MZAP_MAX_BLKSZ` into a `ZFS_MODULE_PARAM()` called `zap_micro_max_size`. As a result, we can experiment with different micro ZAP sizes to improve directory size scaling. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com> Co-authored-by: Toomas Soome <toomas.soome@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14292	2023-01-10 13:41:54 -08:00
Matthew Ahrens	fc45975ec8	Batch enqueue/dequeue for bqueue The Blocking Queue (bqueue) code is used by zfs send/receive to send messages between the various threads. It uses a shared linked list, which is locked whenever we enqueue or dequeue. For workloads which process many blocks per second, the locking on the shared list can be quite expensive. This commit changes the bqueue logic to have 3 linked lists: 1. An enquing list, which is used only by the (single) enquing thread, and thus needs no locks. 2. A shared list, with an associated lock. 3. A dequing list, which is used only by the (single) dequing thread, and thus needs no locks. The entire enquing list can be moved to the shared list in constant time, and the entire shared list can be moved to the dequing list in constant time. These operations only happen when the `fill_fraction` is reached, or on an explicit flush request. Therefore, the lock only needs to be acquired infrequently. The API already allows for dequing to block until an explicit flush, so callers don't need to be changed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14121	2023-01-10 13:39:22 -08:00
Brian Behlendorf	0c8fbe5b6a	ztest: update ztest_dmu_snapshot_create_destroy() ECHRNG is returned when the channel program encounters a runtime error. For example, this can happen when a snapshot doesn't exist. We handle this error the same way as the existing EEXIST and ENOENT error checks. Additionally, improve the internal debug message to include the error describing why a pool couldn't be opened. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14351	2023-01-10 13:27:48 -08:00
Matthew Ahrens	40d7e971ff	ztest fails assertion in zio_write_gang_member_ready() Encrypted blocks can have up to 2 DVA's, as the third DVA is reserved for the salt+IV. However, dmu_write_policy() allows non-encrypted blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and allocate 3 DVA's, since they don't need a salt+IV (they are merely authenicated). However, if such a block becomes a gang block, the gang code incorrectly limits the gang block header to 2 DVA's. This leads to a "NDVAs inversion", where a parent block (the gang block header) has less DVA's than its children (the gang members), causing an assertion failure in zio_write_gang_member_ready(). This commit addresses the problem by only restricting the gang block header to 2 DVA's if the block is actually encrypted (and thus its gang block members can have at most 2 DVA's). Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14250 Closes #14356	2023-01-09 16:43:45 -08:00
Ameer Hamza	5091867ee6	zed: add hotplug support for spare vdevs This commit supports for spare vdev hotplug. The spare vdev associated with all the pools will be marked as "Removed" when the drive is physically detached and will become "Available" when the drive is reattached. Currently, the spare vdev status does not change on the drive removal and the same is the case with reattachment. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14295	2023-01-09 12:43:03 -08:00
Alexander Motin	289f7e6adb	Remove some dead ARC code. (#14340 ) Every ARC buffer holds a reference on the header. It means headers with buffers are never evictable. When we are evicting a header, there can be no more buffers to free. Just assert that. b_evict_lock seems not protecting anything now. Remove it. Buffers checksum should also be freed with the last uncompressed buffer, so it should not be there also when we are evicting the header. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.	2023-01-09 10:45:17 -08:00
Coleman Kane	884a69357f	linux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2 Linux 6.2 renamed the get_acl() operation to get_inode_acl() in the inode_operations struct. This should fix Issue #14323. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14323 Closes #14331	2023-01-06 14:40:54 -08:00
Antonio Russo	d27c81847b	Linux 6.1 compat: open inside tmpfile() Linux 863f144 modified the .tmpfile interface to pass a struct file, rather than a struct dentry, and expect the tmpfile implementation to open inside of tmpfile(). This patch implements a configuration test that checks for this new API and appropriately sets a HAVE_TMPFILE_DENTRY flag that tracks this old API. Contingent on this flag, the appropriate API is implemented. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net> Closes #14301 Closes #14343	2023-01-06 14:33:00 -08:00
Richard Yao	1f3bc5ea80	Illumos #15286 : do_composition() needs sign awareness Authored by: Dan McDonald <danmcd@mnx.io> Reviewed by: Patrick Mooney <pmooney@pfmooney.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Illumos-issue: https://www.illumos.org/issues/15286 Illumos-commit: `f137b22e73` Porting Notes: The patch in illumos did not have much of a commit message, and did not provide attribution to the reporter, while original patch proposed to OpenZFS did, so I am listing the reporter (myself) and original patch author (also myself) below while including the original commit message with some minor corrections as part of the porting notes: In do_composition(), we have: size = u8_number_of_bytes[p]; if (size <= 1 \|\| (p + size) > oslast) break; There, we have type promotion from int8_t to size_t, which is unsigned. C will sign extend the value as part of the widening before treating the value as unsigned and the negative values we can counter are error values from U8_ILLEGAL_CHAR and U8_OUT_OF_RANGE_CHAR, which are -1 and -2 respectively. The unsigned versions of these under two's complement are SIZE_MAX and SIZE_MAX-1 respectively. The bounds check is written under the assumption that `size <= 1` does a signed comparison. This is followed by a pointer comparison to see if the string has the correct length, which is fine. A little further down we have: for (i = 0; i < size; i++) tc[i] = p++; When an error condition is encountered, this will attempt to iterate at least SIZE_MAX-1 times, which will massively overflow the buffer, which is not fine. The kernel will kill the loop as soon as it hits the kernel stack guard on Linux systems built with CONFIG_VMAP_STACK=y, which should be just about all of them. That prevents arbitrary code execution and just about any other bad thing that a black hat attacker might attempt with knowledge of this buffer overflow. Other systems' kernels have mitigations for unbounded in-kernel buffer overflows that will catch this too. Also, the patch in illumos-gate made an effort to fix C style issues that had been fixed in the OpenZFS/ZFSOnLinux repository. Those issues had been mentioned in the email that I originally sent them about this issue. One of the fixes had not been already done, so it is included. Another to collect_a_seq()'s arguments was handled differently in OpenZFS. For the sake of avoiding unnecessary differences, it has been adopted. This has the interesting effect that if you correct the paths in the illumos-gate patch to match the current OpenZFS repository, you can reverse apply it cleanly. Original-patch-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Co-authored-by: Dan McDonald <danmcd@mnx.io> Closes #14318 Closes #14342	2023-01-05 11:16:21 -08:00
Mateusz Guzik	f25f1f9091	FreeBSD: catch up to 1400077 Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14328	2023-01-05 10:56:40 -08:00
Alexander Motin	bacf366fe2	Hide b_freeze_* under ZFS_DEBUG This saves 40 bytes per full ARC header, reducing it on FreeBSD from 240 to 200 bytes on production bits. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14315	2023-01-05 10:15:31 -07:00
Alexander Motin	ed2f7ba08d	Implement uncached prefetch Previously the primarycache property was handled only in the dbuf layer. Since the speculative prefetcher is implemented in the ARC, it had to be disabled for uncacheable buffers. This change gives the ARC knowledge about uncacheable buffers via arc_read() and arc_write(). So when remove_reference() drops the last reference on the ARC header, it can either immediately destroy it, or if it is marked as prefetch, put it into a new arc_uncached state. That state is scanned every second, evicting stale buffers that were not demand read. This change also tracks dbufs that were read from the beginning, but not to the end. It is assumed that such buffers may receive further reads, and so they are stored in dbuf cache. If a following reads reaches the end of the buffer, it is immediately evicted. Otherwise it will follow regular dbuf cache eviction. Since the dbuf layer does not know actual file sizes, this logic is not applied to the final buffer of a dnode. Since uncacheable buffers should no longer stay in the ARC for long, this patch also tries to optimize I/O by allocating ARC physical buffers as linear to allow buffer sharing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14243	2023-01-04 17:29:54 -07:00
Alexander Motin	c935fe2e92	arc_read()/arc_access() refactoring and cleanup ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14123	2022-12-22 12:10:24 -08:00
Ryan Moeller	dc8c2f6158	FreeBSD: Fix potential boot panic with bad label vdev_geom_read_pool_label() can leave NULL in configs. Check for it and skip consistently when generating rootconf. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #14291	2022-12-22 11:50:09 -08:00
Matthew Ahrens	018f26041d	deadlock between spa_errlog_lock and dp_config_rwlock There is a lock order inversion deadlock between `spa_errlog_lock` and `dp_config_rwlock`: A thread in `spa_delete_dataset_errlog()` is running from a sync task. It is holding the `dp_config_rwlock` for writer (see `dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`. A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock` (see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as reader). Note that this was introduced by #12812. This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held. Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many code paths didn't check if `*count` got to zero, instead continuing to overwrite past the beginning of the userspace buffer at `uaddr`. Tested by having some errors in the pool (via `zinject -t data /path/to/file`), one thread running `zpool iostat 0.001`, and another thread runs `zfs destroy` (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14239 Closes #14289	2022-12-22 11:48:49 -08:00
Doug Rabson	24502bd3a7	FreeBSD: Remove stray debug printf Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Doug Rabson <dfr@rabson.org> Closes #14286 Closes #14287	2022-12-13 17:35:07 -08:00
Richard Yao	f3f5263f8a	Zero end of embedded block buffer in dump_write_embedded() This fixes a kernel stack leak. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Nicholas Sherlock <n.sherlock@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13778 Closes #14255	2022-12-13 17:31:47 -08:00
Richard Yao	3236c0b891	Cache dbuf_hash() calculation We currently compute a 64-bit hash three times, which consumes 0.8% CPU time on ARC eviction heavy workloads. Caching the 64-bit value in the dbuf allows us to avoid that overhead. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Closes #14251	2022-12-13 17:29:21 -08:00
Allan Jude	dc95911d21	zfs list: Allow more fields in ZFS_ITER_SIMPLE mode If the fields to be listed and sorted by are constrained to those populated by dsl_dataset_fast_stat(), then zfs list is much faster, as it does not need to open each objset and reads its properties. A previous optimization by Pawel Dawidek (`0cee24064a`) took advantage of this to make listing snapshot names sorted only by name much faster. However, it was limited to `-o name -s name`, this work extends this optimization to work with: - name - guid - createtxg - numclones - inconsistent - redacted - origin and could be further extended to any other properties supported by dsl_dataset_fast_stat() or similar, that do not require extra locking or reading from disk. This was committed before (`9a9e2e343d`), but was reverted due to a regression when used with an older kernel. If the kernel does not populate zc->zc_objset_stats, we now fallback to getting the properties via the slower interface, to avoid problems with newer userland and older kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14110	2022-12-13 17:27:54 -08:00
Ameer Hamza	e3785718ba	Skip permission checks for extended attributes zfs_zaccess_trivial() calls the generic_permission() to read xattr attributes. This causes deadlock if called from zpl_xattr_set_dir() context as xattr and the dent locks are already held in this scenario. This commit skips the permissions checks for extended attributes since the Linux VFS stack already checks it before passing us the control. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14220	2022-12-12 10:21:37 -08:00
Allan Jude	f900279e6d	Restrict visibility of per-dataset kstats inside FreeBSD jails When inside a jail, visibility on datasets not "jailed" to the jail is restricted. However, it was possible to enumerate all datasets in the pool by looking at the kstats sysctl MIB. Only the kstats corresponding to datasets that the user has visibility on are accessible now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14254	2022-12-09 11:04:29 -08:00
Serapheim Dimitropoulos	7bf4c97a36	Bypass metaslab throttle for removal allocations Context: We recently had a scenario where a customer with 2x10TB disks at 95+% fragmentation and capacity, wanted to migrate their disks to a 2x20TB setup. So they added the 2 new disks and submitted the removal of the first 10TB disk. The removal took a lot more than expected (order of more than a week to 2 weeks vs a couple of days) and once it was done it generated a huge indirect mappign table in RAM (~16GB vs expected ~1GB). Root-Cause: The removal code calls `metaslab_alloc_dva()` to allocate a new block for each evacuating block in the removing device and it tries to batch them into 16MB segments. If it can't find such a segment it tries for 8MBs, 4MBs, all the way down to 512 bytes. In our scenario what would happen is that `metaslab_alloc_dva()` from the removal thread pick the new devices initially but wouldn't allocate from them because of throttling in their metaslab allocation queue's depth (see `metaslab_group_allocatable()`) as these devices are new and favored for most types of allocations because of their free space. So then the removal thread would look at the old fragmented disk for allocations and wouldn't find any contiguous space and finally retry with a smaller allocation size until it would to the low KB range. This caused a lot of small mappings to be generated blowing up the size of the indirect table. It also wasted a lot of CPU while the removal was active making everything slow. This patch: Make all allocations coming from the device removal thread bypass the throttle checks. These allocations are not even counted in the metaslab allocation queues anyway so why check them? Side-Fix: Allocations with METASLAB_DONT_THROTTLE in their flags would not be accounted at the throttle queues but they'd still abide by the throttling rules which seems wrong. This patch fixes this by checking for that flag in `metaslab_group_allocatable()`. I did a quick check to see where else this flag is used and it doesn't seem like this change would cause issues. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #14159	2022-12-09 10:48:33 -08:00
Richard Yao	242a5b748c	Fix dereference after null check in enqueue_range If the bp is NULL, we have a hole. However, when we build with assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When this happens on a hole, we will have a NULL pointer dereference. Reported-by: Coverity (CID-1524670) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14264	2022-12-08 14:15:21 -08:00
Richard Yao	f1100863f7	Linux: Cleanup unnecessary NULL check in __vdev_disk_physio() zio is never NULL when given to the vdev. Coverity complained saying: "Either the check against null is unnecessary, or there may be a null pointer dereference." Reported-by: Coverity (CID-1466174) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14263	2022-12-08 13:52:47 -08:00
Richard Yao	56c6f293c0	Remove duplicate statically allocated variable dsl_dataset_snapshot_sync_impl() declares `static zil_header_t zero_zil __maybe_unused;`, but this is also declared globally. This wastes memory. CodeQL's cpp/local-variable-hides-global-variable check caught this. Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14263	2022-12-08 13:52:42 -08:00
Richard Yao	59493b63c1	Micro-optimize fletcher4 calculations When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions. Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14247	2022-12-05 11:00:34 -08:00
Richard Yao	7b9a423076	FreeBSD: zfs_register_callbacks() must implement error check correctly I read the following article and noticed a couple of ZFS bugs mentioned: https://pvs-studio.com/en/blog/posts/cpp/0377/ I decided to search for them in the modern OpenZFS codebase and then found one that matched the description of the first one: V593 Consider reviewing the expression of the 'A = B != C' kind. The expression is calculated as following: 'A = (B != C)'. zfs_vfsops.c 498 The consequence of this is that the error value is replaced with `1` when there is an error. When there is no error, 0 is correctly passed. This is a very minor issue that is unlikely to cause any real problems. The incorrect error code would either be returned to the mount command on a failure or any of `zfs receive`, `zfs recv`, `zfs rollback` or `zfs upgrade`. The second one has already been fixed. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14261	2022-12-05 10:16:50 -08:00
George Wilson	ffd2e15d65	zio can deadlock during device removal When doing a device removal on a pool with gang blocks, the zio pipeline can deadlock when trying to free blocks from a device which is being removed with a stack similar to this: 0xffff8ab9a13a1740 UNINTERRUPTIBLE 4 __schedule+0x2e5 __schedule+0x2e5 schedule+0x33 schedule_preempt_disabled+0xe __mutex_lock.isra.12+0x2a7 __mutex_lock.isra.12+0x2a7 __mutex_lock_slowpath+0x13 mutex_lock+0x2c free_from_removing_vdev+0x61 metaslab_free_impl+0xd6 metaslab_free_dva+0x5e metaslab_free+0x196 zio_free_sync+0xe4 zio_free_gang+0x38 zio_gang_tree_issue+0x42 zio_gang_tree_issue+0xa2 zio_gang_issue+0x6d zio_execute+0x94 zio_execute+0x94 taskq_thread+0x23b kthread+0x120 ret_from_fork+0x1f Since there are gang blocks we have to read the gang members as part of the free. This can be seen with a zio dependency tree that looks like this: sdb> echo 0xffff900c24f8a700 \| zio -rc \| zio ADDRESS TYPE STAGE WAITER 0xffff900c24f8a700 NULL CHECKSUM_VERIFY 0xffff900ddfd31740 0xffff900c24f8c920 FREE GANG_ASSEMBLE - 0xffff900d93d435a0 READ DONE In the illustration above we are processing frees but because of gang block we have to read the constituents blocks. Once we finish the READ in the zio pipeline we will execute the parent. In this case the parent is a FREE but the zio taskq is a READ and we continue to process the pipeline leading to the stack above. In the stack above, we are blocked waiting for the svr_lock so as a result a READ interrupt taskq thread is now consumed. Eventually, all of the READ taskq threads end up blocked and we're unable to complete any read requests. In zio_notify_parent there is an optimization to continue to use the taskq thread to exectue the parent's pipeline. To resolve the deadlock above, we only allow this optimization if the parent's zio type matches the child which just completed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-80130 Closes #14236	2022-12-02 17:46:29 -08:00
George Wilson	d7cf06a25d	nopwrites on dmu_sync-ed blocks can result in a panic After a device has been removed, any nopwrites for blocks on that indirect vdev should be ignored and a new block should be allocated. The original code attempted to handle this but used the wrong block pointer when checking for indirect vdevs and failed to check all DVAs. This change corrects both of these issues and modifies the test case to ensure that it properly tests nopwrites with device removal. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #14235	2022-12-02 17:45:33 -08:00
Rob Wing	7a75f74cec	Bump checksum error counter before reporting to ZED The checksum error counter is incremented after reporting to ZED. This leads ZED to receiving a checksum error report with 0 checksum errors. To avoid this, bump the checksum error counter before reporting to ZED. Sponsored-by: Seagate Technology LLC Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Closes #14190	2022-12-02 17:42:22 -08:00
szubersk	fe975048da	Fix Clang 15 compilation errors - Clang 15 doesn't support `-fno-ipa-sra` anymore. Do a separate check for `-fno-ipa-sra` support by $KERNEL_CC. - Don't enable `-mgeneral-regs-only` for certain module files. Fix #13260 - Scope `GCC diagnostic ignored` statements to GCC only. Clang doesn't need them to compile the code. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13260 Closes #14150	2022-11-30 13:46:26 -08:00
szubersk	3c1e1933b6	Fix GCC 12 compilation errors Squelch false positives reported by GCC 12 with UBSan. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #14150	2022-11-30 13:45:53 -08:00
Yann Collet	776441152e	zstd: Refactor prefetching for the decoding loop Following facebook/zstd#2545, I noticed that one field in `seq_t` is optional, and only used in combination with prefetching. (This may have contributed to static analyzer failure to detect correct initialization). I then wondered if it would be possible to rewrite the code so that this optional part is handled directly by the prefetching code rather than delegated as an option into `ZSTD_decodeSequence()`. This resulted into this refactoring exercise where the prefetching responsibility is better isolated into its own function and `ZSTD_decodeSequence()` is streamlined to contain strictly Sequence decoding operations. Incidently, due to better code locality, it reduces the need to send information around, leading to simplified interface, and smaller state structures. Port of facebook/zstd@f5434663ea Reported-by: Coverity (CID 1462271) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Ported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14212	2022-11-29 10:05:30 -08:00
Nick Terrell	466cf54ecf	zstd: [superblock] Add defensive assert and bounds check The bound check condition should always be met because we selected `set_basic` as our encoding type. But that code is very far away, so assert it is true so if it is ever false we can catch it, and add a bounds check. Port of facebook/zstd@1047097dad Reported-by: Coverity (CID 1524446) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Ported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14212	2022-11-29 10:04:43 -08:00
Richard Yao	97fac0fb70	Fix NULL pointer dereference in dbuf_prefetch_indirect_done() When ZFS is built with assertions, a prefetch is done on a redacted blkptr and `dpa->dpa_dnode` is NULL, we will have a NULL pointer dereference in `dbuf_prefetch_indirect_done()`. Both Coverity and Clang's Static Analyzer caught this. Reported-by: Coverity (CID 1524671) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14210	2022-11-29 10:00:50 -08:00
Richard Yao	8532da5e20	Cleanup: Delete dead code from send_merge_thread() range is always deferenced before it reaches this check, such that the kmem_zalloc() call is never executed. A previously version of this had erronously also pruned the `range->eos_marker = B_TRUE` line, but it must be set whenever we encounter an error or are cancelled early. Coverity incorrectly complained about a potential NULL pointer dereference because of this. Reported-by: Coverity (CID 1524550) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14210	2022-11-29 09:59:53 -08:00
Alexander	b5459dd354	Fix the last two CFI callback prototype mismatches There was the series from me a year ago which fixed most of the callback vs implementation prototype mismatches. It was based on running the CFI-enabled kernel (in permissive mode -- warning instead of panic) and performing a full ZTS cycle, and then fixing all of the problems caught by CFI. Now, Clang 16-dev has new warning flag, -Wcast-function-type-strict, which detect such mismatches at compile-time. It allows to find the remaining issues missed by the first series. There are only two of them left: one for the secpolicy_vnode_setattr() callback and one for taskq_dispatch(). The fix is easy, since they are not used anywhere else. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14207	2022-11-29 09:56:16 -08:00
Richard Yao	587a39b729	Lua: Fix bad bitshift in lua_strx2number() The port of lua to OpenZFS modified lua to use int64_t for numbers instead of double. As part of this, a function for calculating exponentiation was replaced with a bit shift. Unfortunately, it did not handle negative values. Also, it only supported exponents numbers with 7 digits before before overflow. This supports exponents up to 15 digits before overflow. Clang's static analyzer reported this as "Result of operation is garbage or undefined" because the exponent was negative. Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14204	2022-11-29 09:53:33 -08:00
Alexander Motin	fd61b2eaba	Remove few pointer dereferences in dbuf_read() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14199	2022-11-29 09:49:02 -08:00
Mateusz Guzik	9aea88ba44	FreeBSD: stop using buffer cache-only routines on sync Both vop_fsync and vfs_stdsync are effectively just costly no-ops as they only act on ->v_bufobj.bo_dirty et al, which are unused by zfs. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14157	2022-11-29 09:35:25 -08:00
Alexander Motin	4df415aa86	Switch dnode stats to wmsums I've noticed that some of those counters are used in hot paths like dnode_hold_impl(), and results of this change is visible in profiler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14198	2022-11-29 09:33:45 -08:00
Alexander Motin	f0a76fbec1	Micro-optimize zrl_remove() atomic_dec_32() should be a bit lighter than atomic_dec_32_nv(). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14200	2022-11-29 09:26:03 -08:00
Ameer Hamza	e996c502e4	zed: unclean disk attachment faults the vdev If the attached disk already contains a vdev GUID, it means the disk is not clean. In such a scenario, the physical path would be a match that makes the disk faulted when trying to online it. So, we would only want to proceed if either GUID matches with the last attached disk or the disk is in a clean state. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14181	2022-11-29 09:24:10 -08:00
Richard Yao	303678350a	Convert some sprintf() calls to kmem_scnprintf() These `sprintf()` calls are used repeatedly to write to a buffer. There is no protection against overflow other than reviewers explicitly checking to see if the buffers are big enough. However, such issues are easily missed during review and when they are missed, we would rather stop printing rather than have a buffer overflow, so we convert these functions to use `kmem_scnprintf()`. The Linux kernel provides an entire page for module parameters, so we are safe to write up to PAGE_SIZE. Removing `sprintf()` from these functions removes the last instances of `sprintf()` usage in our platform-independent kernel code. This improves XNU kernel compatibility because the XNU kernel does not support (removed support for?) `sprintf()`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14209	2022-11-28 13:49:58 -08:00
Allan Jude	d27a00283f	Avoid a null pointer dereference in zfs_mount() on FreeBSD When mounting the root filesystem, vfs_t->mnt_vnodecovered is null This will cause zfsctl_is_node() to dereference a null pointer when mounting, or updating the mount flags, on the root filesystem, both of which happen during the boot process. Reported-by: Martin Matuska <mm@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14218	2022-11-28 13:40:49 -08:00
Alexander Motin	5f45e3f699	Remove atomics from zh_refcount It is protected by z_hold_locks, so we do not need more serialization, simple integer math should be fine. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14196	2022-11-28 11:36:53 -08:00
Ameer Hamza	3a74f488fc	zed: post a udev change event from spa_vdev_attach() In order for zed to process the removal event correctly, udev change event needs to be posted to sync the blkid information. spa_create() and spa_config_update() posts the event already through spa_write_cachefile(). Doing the same for spa_vdev_attach() that handles the case for vdev attachment and replacement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14172	2022-11-18 11:39:59 -08:00
George Amanakis	3226e0dc8e	Fix setting the large_block feature after receiving a snapshot We are not allowed to dirty a filesystem when done receiving a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will not be set on that filesystem since the filesystem is not on dp_dirty_datasets, and a subsequent encrypted raw send will fail. Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature needs to be activated and do so if appropriate. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13699 Closes #13782	2022-11-18 11:38:37 -08:00
Rich Ercolani	2163cde450	Handle and detect #13709 's unlock regression (#14161 ) In #13709, as in #11294 before it, it turns out that `63a26454` still had the same failure mode as when it was first landed as `d1d47691`, and fails to unlock certain datasets that formerly worked. Rather than reverting it again, let's add handling to just throw out the accounting metadata that failed to unlock when that happens, as well as a test with a pre-broken pool image to ensure that we never get bitten by this again. Fixes: #13709 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2022-11-15 14:44:12 -08:00
shodanshok	b445b25b27	Fix arc_p aggressive increase The original ARC paper called for an initial 50/50 MRU/MFU split and this is accounted in various places where arc_p = arc_c >> 1, with further adjustment based on ghost lists size/hit. However, in current code both arc_adapt() and arc_get_data_impl() aggressively grow arc_p until arc_c is reached, causing unneeded pressure on MFU and greatly reducing its scan-resistance until ghost list adjustments kick in. This patch restores the original behavior of initially having arc_p as 1/2 of total ARC, without preventing MRU to use up to 100% total ARC when MFU is empty. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #14137 Closes #14120	2022-11-11 10:41:36 -08:00
Richard Yao	b1eec00904	Cleanup: Suppress Coverity dereference before/after NULL check reports `f224eddf92` began dereferencing a NULL checked pointer in zpl_vap_init(), which made Coverity complain because either the dereference is unsafe or the NULL check is unnecessary. Upon inspection, this pointer is guaranteed to never be NULL because it is from the Linux kernel VFS. The calls into ZFS simply would not make sense if this pointer were NULL, so the NULL check is unnecessary. Reported-by: Coverity (CID 1527260) Reported-by: Coverity (CID 1527262) Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14170	2022-11-10 13:58:05 -08:00
Richard Yao	9e2be2dfbd	Fix potential NULL pointer dereference regression `945b407486` neglected to `NULL` check `tx->tx_objset`, which is already done in the function. This upset Coverity, which complained about a "dereference after null check". Upon inspection, it was found that whenever `dmu_tx_create_dd()` is called followed by `dmu_tx_assign()`, such as in `dsl_sync_task_common()`, `tx->tx_objset` will be `NULL`. Reported-by: Coverity (CID 1527261) Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14170	2022-11-10 13:56:28 -08:00
Mariusz Zaborski	16f0fdaddd	Allow to control failfast Linux defaults to setting "failfast" on BIOs, so that the OS will not retry IOs that fail, and instead report the error to ZFS. In some cases, such as errors reported by the HBA driver, not the device itself, we would wish to retry rather than generating vdev errors in ZFS. This new property allows that. This introduces a per vdev option to disable the failfast option. This also introduces a global module parameter to define the failfast mask value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Seagate Technology LLC Submitted-by: Klara, Inc. Closes #14056	2022-11-10 13:37:12 -08:00
Mariusz Zaborski	945b407486	quota: disable quota check for ZVOL The quota for ZVOLs is set to the size of the volume. When the quota reaches the maximum, there isn't an excellent way to check if the new writers are overwriting the data or if they are inserting a new one. Because of that, when we reach the maximum quota, we wait till txg is flushed. This is causing a significant fluctuation in bandwidth. In the case of ZVOL, the quota is enforced by the volsize, so we can omit it. This commit adds a sysctl thats allow to control if the quota mechanism should be enforced or not. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Zededa Inc. Sponsored-by: Klara Inc. Closes #13838	2022-11-08 12:40:22 -08:00
Alan Somers	e197bb24f1	Optionally skip zil_close during zvol_create_minor_impl If there were no zil entries to replay, skip zil_close. zil_close waits for a transaction to sync. That can take several seconds, for example during pool import of a resilvering pool. Skipping zil_close can cut the time for "zpool import" from 2 hours to 45 seconds on a resilvering pool with a thousand zvols. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Sponsored-by: Axcient Closes #13999 Closes #14015	2022-11-08 12:38:08 -08:00
youzhongyang	f224eddf92	Support idmapped mount in user namespace Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping infrastructure to support idmapped mounts of filesystems mounted with an idmapping". Update the OpenZFS accordingly to improve the idmapped mount support. This pull request contains the following changes: - xattr setter functions are fixed to take mnt_ns argument. Without this, cp -p would fail for an idmapped mount in a user namespace. - idmap_util is enhanced/fixed for its use in a user ns context. - One test case added to test idmapped mount in a user ns. Reviewed-by: Christian Brauner <christian@brauner.io> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14097	2022-11-08 10:28:56 -08:00
Damian Szuberski	109731cd73	dsl_prop_known_index(): check for invalid prop Resolve UBSAN array-index-out-of-bounds error in zprop_desc_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #14142 Closes #14147	2022-11-08 10:16:01 -08:00
Brooks Davis	20b867f5f7	freebsd: add ifdefs around legacy ioctl support Require that ZFS_LEGACY_SUPPORT be defined for legacy ioctl support to be built. For now, define it in zfs_ioctl_compat.h so support is always built. This will allow systems that need never support pre-openzfs tools a mechanism to remove support at build time. This code should be removed once the need for tool compatability is gone. No functional change at this time. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14127	2022-11-07 15:55:26 -08:00
Brooks Davis	6c89cffc2c	freebsd: remove no-op vn_renamepath() vn_renamepath() is a Solaris-ism that was defined away in the FreeBSD port. Now that the only use is in the FreeBSD zfs_vnops_os.c, drop it entierly. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14127	2022-11-07 15:55:20 -08:00
Ameer Hamza	c23738c70e	zed: Prevent special vdev to be replaced by hot spare Special vdevs should not be replaced by a hot spare. Log vdevs already support this, extending the functionality for special vdevs. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14129	2022-11-04 11:33:47 -07:00
Alexander Lobakin	73b8f700b6	icp: fix all !ENDBR objtool warnings in x86 Asm code Currently, only Blake3 x86 Asm code has signs of being ENDBR-aware. At least, under certain conditions it includes some header file and uses some custom macro from there. Linux has its own NOENDBR since several releases ago. It's defined in the same <asm/linkage.h>, so currently <sys/asm_linkage.h> already is provided with it. Let's unify those two into one %ENDBR macro. At first, check if it's present already. If so -- use Linux kernel version. Otherwise, try to go that second way and use %_CET_ENDBR from <cet.h> if available. If no, fall back to just empty definition. This fixes a couple more 'relocations to !ENDBR' across the module. And now that we always have the latest/actual ENDBR definition, use it at the entrance of the few corresponding functions that objtool still complains about. This matches the way how it's used in the upstream x86 core Asm code. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14035	2022-11-04 11:25:56 -07:00
Alexander Lobakin	61cca6fa05	icp: fix rodata being marked as text in x86 Asm code objtool properly complains that it can't decode some of the instructions from ICP x86 Asm code. As mentioned in the Makefile, where those object files were excluded from objtool check (but they can still be visible under IBT and LTO), those are just constants, not code. In that case, they must be placed in .rodata, so they won't be marked as "allocatable, executable" (ax) in EFL headers and this effectively prevents objtool from trying to decode this data. That reveals a whole bunch of other issues in ICP Asm code, as previously objtool was bailing out after that warning message. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14035	2022-11-04 11:25:51 -07:00
Alexander Lobakin	b844489ec0	icp: properly fix all RETs in x86_64 Asm code Commit `43569ee374` ("Fix objtool: missing int3 after ret warning") addressed replacing all `ret`s in x86 asm code to a macro in the Linux kernel in order to enable SLS. That was done by copying the upstream macro definitions and fixed objtool complaints. Since then, several more mitigations were introduced, including Rethunk. It requires to have a jump to one of the thunks in order to work, so the RET macro was changed again. And, as ZFS code didn't use the mainline defition, but copied it, this is currently missing. Objtool reminds about it time to time (Clang 16, CONFIG_RETHUNK=y): fs/zfs/lua/zlua.o: warning: objtool: setjmp+0x25: 'naked' return found in RETHUNK build fs/zfs/lua/zlua.o: warning: objtool: longjmp+0x27: 'naked' return found in RETHUNK build Do it the following way: * if we're building under Linux, unconditionally include <linux/linkage.h> in the related files. It is available in x86 sources since even pre-2.6 times, so doesn't need any conftests; * then, if RET macro is available, it will be used directly, so that we will always have the version actual to the kernel we build; * if there's no such macro, we define it as a simple `ret`, as it was on pre-SLS times. This ensures we always have the up-to-date definition with no need to update it manually, and at the same time is safe for the whole variety of kernels ZFS module supports. Then, there's a couple more "naked" rets left in the code, they're just defined as: .byte 0xf3,0xc3 In fact, this is just: rep ret `rep ret` instead of just `ret` seems to mitigate performance issues on some old AMD processors and most likely makes no sense as of today. Anyways, address those rets, so that they will be protected with Rethunk and SLS. Include <sys/asm_linkage.h> here which now always has RET definition and replace those constructs with just RET. This wipes the last couple of places with unpatched rets objtool's been complaining about. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14035	2022-11-04 11:24:09 -07:00
Richard Yao	993ee7a006	FreeBSD: Fix out of bounds read in zfs_ioctl_ozfs_to_legacy() There is an off by 1 error in the check. Fortunately, this function does not appear to be used in kernel space, despite being compiled as part of the kernel module. However, it is used in userspace. Callers of lzc_ioctl_fd() likely will crash if they attempt to use the unimplemented request number. This was reported by FreeBSD's coverity scan. Reported-by: Coverity (CID 1432059) Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14135	2022-11-04 11:06:14 -07:00
Serapheim Dimitropoulos	f66ffe6878	Expose zfs_vdev_open_timeout_ms as a tunable Some of our customers have been occasionally hitting zfs import failures in Linux because udevd doesn't create the by-id symbolic links in time for zpool import to use them. The main issue is that the systemd-udev-settle.service that zfs-import-cache.service and other services depend on is racy. There is also an openzfs issue filed (see https://github.com/openzfs/zfs/issues/10891) outlining the problem and potential solutions. With the proper solutions being significant in terms of complexity and the priority of the issue being low for the time being, this patch exposes `zfs_vdev_open_timeout_ms` as a tunable so people that are experiencing this issue often can increase it as a workaround. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #14133	2022-11-03 15:02:46 -07:00
Allan Jude	595d3ac2ed	Allow mounting snapshots in .zfs/snapshot as a regular user Rather than doing a terrible credential swapping hack, we just check that the thing being mounted is a snapshot, and the mountpoint is the zfsctl directory, then we allow it. If the mount attempt is from inside a jail, on an unjailed dataset (mounted from the host, not by the jail), the ability to mount the snapshot is controlled by a new per-jail parameter: zfs.mount_snapshot Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Modirum MDPay Sponsored-by: Klara Inc. Closes #13758	2022-11-03 11:53:24 -07:00
Richard Yao	11e3416ae7	Cleanup: Remove branches that always evaluate the same way Coverity reported that the ASSERT in taskq_create() is always true and the `*offp > MAXOFFSET_T` check in zfs_file_seek() is always false. We delete them as cleanup. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14130	2022-11-03 10:47:48 -07:00
Brooks Davis	1e1ce10e55	Remove an unused variable Clang-16 detects this set-but-unused variable which is assigned and incremented, but never referenced otherwise. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14125	2022-11-03 10:17:17 -07:00
Richard Yao	f47f6a055d	Address warnings about possible division by zero from clangsa * The complaint in ztest_replay_write() is only possible if something went horribly wrong. An assertion will silence this and if it goes off, we will know that something is wrong. * The complaint in spa_estimate_metaslabs_to_flush() is not impossible, but seems very unlikely. We resolve this by passing the value from the `MIN()` that does not go to infinity when the variable is zero. There was a third report from Clang's scan-build, but that was a definite false positive and disappeared when checked again through Clang's static analyzer with Z3 refution via CodeChecker. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14124	2022-11-03 09:58:14 -07:00
Attila Fülöp	211ec1b9fd	Deny receiving into encrypted datasets if the keys are not loaded Commit `68ddc06b61` introduced support for receiving unencrypted datasets as children of encrypted ones but unfortunately got the logic upside down. This resulted in failing to deny receives of incremental sends into encrypted datasets without their keys loaded. If receiving a filesystem, the receive was done into a newly created unencrypted child dataset of the target. In case of volumes the receive made the target volume undeletable since a dataset was created below it, which we obviously can't handle. Incremental streams with embedded blocks are affected as well. We fix the broken logic to properly deny receives in such cases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13598 Closes #14055 Closes #14119	2022-11-03 09:55:13 -07:00
Brooks Davis	84477e148d	lua: cast through uintptr_t when return a pointer Don't assume size_t can carry pointer provenance and use uintptr_t (identialy on all current platforms) instead. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14131	2022-11-03 09:52:28 -07:00
Brooks Davis	b9041e1f27	Use intptr_t when storing an integer in a pointer Cast the integer type to (u)intptr_t before casting to "void *". In CHERI C/C++ we warn on bare casts from integers to pointers to catch attempts to create pointers our of thin air. We allow the warning to be supressed with a suitable cast through (u)intptr_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14131	2022-11-03 09:52:23 -07:00
Brooks Davis	250b2bac78	zfs_onexit_add_cb: make action_handle point to a uintptr_t Avoid assuming than a uint64_t can hold a pointer and reduce the number of casts in the process. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14131	2022-11-03 09:52:12 -07:00
Brooks Davis	d96303cb07	acl: use uintptr_t for ace walker cookies Avoid assuming that a pointer can fit in a uint64_t and use uintptr_t instead. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14131	2022-11-03 09:51:34 -07:00
Ryan Moeller	748b9d5bda	zil: Relax assertion in zil_parse Rather than panic debug builds when we fail to parse a whole ZIL, let's instead improve the logging of errors and continue like in a release build. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #14116	2022-11-01 12:19:32 -07:00
Allan Jude	b37d495e04	Avoid null pointer dereference in dsl_fs_ss_limit_check() Check for cr == NULL before dereferencing it in dsl_enforce_ds_ss_limits() to lookup the zone/jail ID. Reported-by: Coverity (CID 1210459) Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14103	2022-10-29 13:08:54 -07:00
Richard Yao	97143b9d31	Introduce kmem_scnprintf() `snprintf()` is meant to protect against buffer overflows, but operating on the buffer using its return value, possibly by calling it again, can cause a buffer overflow, because it will return how many characters it would have written if it had enough space even when it did not. In a number of places, we repeatedly call snprintf() by successively incrementing a buffer offset and decrementing a buffer length, by its return value. This is a potentially unsafe usage of `snprintf()` whenever the buffer length is reached. CodeQL complained about this. To fix this, we introduce `kmem_scnprintf()`, which will return 0 when the buffer is zero or the number of written characters, minus 1 to exclude the NULL character, when the buffer was too small. In all other cases, it behaves like snprintf(). The name is inspired by the Linux and XNU kernels' `scnprintf()`. The implementation was written before I thought to look at `scnprintf()` and had a good name for it, but it turned out to have identical semantics to the Linux kernel version. That lead to the name, `kmem_scnprintf()`. CodeQL only catches this issue in loops, so repeated use of snprintf() outside of a loop was not caught. As a result, a thorough audit of the codebase was done to examine all instances of `snprintf()` usage for potential problems and a few were caught. Fixes for them are included in this patch. Unfortunately, ZED is one of the places where `snprintf()` is potentially used incorrectly. Since using `kmem_scnprintf()` in it would require changing how it is linked, we modify its usage to make it safe, no matter what buffer length is used. In addition, there was a bug in the use of the return value where the NULL format character was not being written by pwrite(). That has been fixed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14098	2022-10-29 13:05:11 -07:00
Richard Yao	d71d693261	Fix too few arguments to formatting function CodeQL reported that when the VERIFY3U condition is false, we do not pass enough arguments to `spl_panic()`. This is because the format string from `snprintf()` was concatenated into the format string for `spl_panic()`, which causes us to have an unexpected format specifier. A CodeQL developer suggested fixing the macro to have a `%s` format string that takes a stringified RIGHT argument, which would fix this. However, upon inspection, the VERIFY3U check was never necessary in the first place, so we remove it in favor of just calling `snprintf()`. Lastly, it is interesting that every other static analyzer run on the codebase did not catch this, including some that made an effort to catch such things. Presumably, all of them relied on header annotations, which we have not yet done on `spl_panic()`. CodeQL apparently is able to track the flow of arguments on their way to annotated functions, which llowed it to catch this when others did not. A future patch that I have in development should annotate `spl_panic()`, so the others will catch this too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14098	2022-10-29 13:04:52 -07:00
Brian Behlendorf	82ad2a06ac	Revert "Cleanup: Delete dead code from send_merge_thread()" This reverts commit `fb823de9f` due to a regression. It is in fact possible for the range->eos_marker to be false on error. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14042 Closes #14104	2022-10-28 13:25:37 -07:00
Mariusz Zaborski	8af08a69cd	quota: extend quota for dataset This patch relax the quota limitation for dataset by around 3%. What this means is that user can write more data then the quota is set to. However thanks to that we can get more stable bandwidth, in case when we are overwriting data in-place, and not consuming any additional space. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org> Sponsored-by: Zededa Inc. Sponsored-by: Klara Inc. Closes #13839	2022-10-28 11:44:18 -07:00
shodanshok	dc56c673e3	Fix ARC target collapse when zfs_arc_meta_limit_percent=100 Reclaim metadata when arc_available_memory < 0 even if meta_used is not bigger than arc_meta_limit. As described in https://github.com/openzfs/zfs/issues/14054 if zfs_arc_meta_limit_percent=100 then ARC target can collapse to arc_min due to arc_purge not freeing any metadata. This patch lets arc_prune to do its work when arc_available_memory is negative even if meta_used is not bigger than arc_meta_limit, avoiding ARC target collapse. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #14054 Closes #14093	2022-10-28 10:21:54 -07:00
vaclavskala	7822b50f54	Propagate extent_bytes change to autotrim thread The autotrim thread only reads zfs_trim_extent_bytes_min and zfs_trim_extent_bytes_max variable only on thread start. We should check for parameter changes during thread execution to allow parameter changes take effect without needing to disable then restart the autotrim. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Václav Skála <skala@vshosting.cz> Closes #14077	2022-10-28 10:16:31 -07:00
Aleksa Sarai	dbf6108b4d	zfs_rename: support RENAME_* flags Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we create two new transaction types for the two flags which need transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT). RENAME_NOREPLACE does not need any ZIL support because we know that if the operation succeeded before creating the ZIL entry, there was no file to be clobbered and thus it can be treated as a regular TX_RENAME. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Closes #12209 Closes #14070	2022-10-28 09:49:20 -07:00
Aleksa Sarai	e015d6cc0b	zfs_rename: restructure to have cleaner fallbacks This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support for ZoL, but the changes here allow for far nicer fallbacks than the previous implementation (the source and target are re-linked in case of the final link failing). In addition, a small cleanup was done for the "target exists but is a different type" codepath so that it's more understandable. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Closes #12209 Closes #14070	2022-10-28 09:48:58 -07:00
Pavel Snajdr	86db35c447	Remove zpl_revalidate: fix snapshot rollback Open files, which aren't present in the snapshot, which is being roll-backed to, need to disappear from the visible VFS image of the dataset. Kernel provides d_drop function to drop invalid entry from the dcache, but inode can be referenced by dentry multiple dentries. The introduced zpl_d_drop_aliases function walks and invalidates all aliases of an inode. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #9600 Closes #14070	2022-10-28 09:47:19 -07:00
Andrew Innes	e09fdda977	Fix multiplication converted to larger type This fixes the instances of the "Multiplication result converted to larger type" alert that codeQL scanning found. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Andrew Innes <andrew.c12@gmail.com> Closes #14094	2022-10-28 09:30:37 -07:00
Richard Yao	4938d01db7	Convert enum zio_flag to uint64_t We ran out of space in enum zio_flag for additional flags. Rather than introduce enum zio_flag2 and then modify a bunch of functions to take a second flags variable, we expand the type to 64 bits via `typedef uint64_t zio_flag_t`. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Richard Yao <richard.yao@klarasystems.com> Closes #14086	2022-10-27 09:54:54 -07:00
Andrew Innes	07de86923b	Aligned free for aligned alloc Windows port frees memory that was alloc'd aligned in a different way then alloc'd memory. So changing frees to be specific. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Innes <andrew.c12@gmail.com> Co-Authored-By: Jorgen Lundman <lundman@lundman.net> Closes #14059	2022-10-26 15:08:31 -07:00
Richard Yao	a06df8d7c1	Linux: Upgrade random_get_pseudo_bytes() to xoshiro256++ 1.0 The motivation for upgrading our PRNG is the recent buildbot failures in the ZTS' tests/functional/fault/decompress_fault test. The probability of a failure in that test is 0.8^256, which is ~1.6e-25 out of 1, yet we have observed multiple test failures in it. This suggests a problem with our random number generation. The xorshift128+ generator that we were using has been replaced by newer generators that have "better statistical properties". After doing some reading, it turns out that these generators have "low linear complexity of the lowest bits", which could explain the ZTS test failures. We do two things to try to fix this: 1. We upgrade from xorshift128+ to xoshiro256++ 1.0. 2. We tweak random_get_pseudo_bytes() to copy the higher order bytes first. It is hoped that this will fix the test failures in tests/functional/fault/decompress_fault, although I have not done simulations. I am skeptical that any simulations I do on a PRNG with a period of 2^256 - 1 would be meaningful. Since we have raised the minimum kernel version to 3.10 since this was first implemented, we have the option of using the Linux kernel's get_random_int(). However, I am not currently prepared to do performance tests to ensure that this would not be a regression (for the time being), so we opt for upgrading our PRNG to a newer one from Sebastiano Vigna. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13983	2022-10-20 14:14:42 -07:00
Alexander Motin	9dcdee7889	Optimize microzaps Microzap on-disk format does not include a hash tree, expecting one to be built in RAM during mzap_open(). The built tree is linked to DMU user buffer, freed when original DMU buffer is dropped from cache. I've found that workloads accessing many large directories and having active eviction from DMU cache spend significant amount of time building and then destroying the trees. I've also found that for each 64 byte mzap element additional 64 byte tree element is allocated, that is a waste of memory and CPU caches. Improve memory efficiency of the hash tree by switching from AVL-tree to B-tree. It allows to save 24 bytes per element just on pointers. Save 32 bits on mze_hash by storing only upper 32 bits since lower 32 bits are always zero for microzaps. Save 16 bits on mze_chunkid, since microzap can never have so many elements. Respectively with the 16 bits there can be no more than 16 bits of collision differentiators. As result, struct mzap_ent now drops from 48 (rounded to 64) to 8 bytes. Tune B-trees for small data. Reduce BTREE_CORE_ELEMS from 128 to 126 to allow struct zfs_btree_core in case of 8 byte elements to pack into 2KB instead of 4KB. Aside of the microzaps it should also help 32bit range trees. Allow custom B-tree leaf size to reduce memmove() time. Split zap_name_alloc() into zap_name_alloc() and zap_name_init_str(). It allows to not waste time allocating/freeing memory when processing multiple names in a loop during mzap_open(). Together on a pool with 10K directories of 1800 files each and DMU cache limited to 128MB this reduces time of `find . -name zzz` by 41% from 7.63s to 4.47s, and saves additional ~30% of CPU time on the DMU cache reclamation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14039	2022-10-20 11:57:15 -07:00
Richard Yao	411d327c67	Add defensive assertion to vdev_queue_aggregate() `a6ccb36b94` had been intended to include this to silence Coverity reports, but this one was missed by mistake. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14043	2022-10-19 17:11:06 -07:00
Richard Yao	d692e6c36e	abd_return_buf() should call zfs_refcount_remove_many() early Calling zfs_refcount_remove_many() after freeing memory means we pass a reference to freed memory as the holder. This is not believed to be able to cause a problem, but there is a bit of a tradition of fixing these issues when they appear so that they do not obscure more serious issues in static analyzer output, so we fix this one too. Clang's static analyzer found this with the help of CodeChecker's CTU analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14043	2022-10-19 17:11:01 -07:00
Richard Yao	c77d2d7415	crypto_get_ptrs() should always write to out_data_2 Callers will check if it has been set to NULL before trying to access it, but never initialize it themselves. Whenever "one block spans two iovecs", `crypto_get_ptrs()` will return, without ever setting `out_data_2 = NULL`. The caller will then do a NULL check against the uninitailized pointer and if it is not zero, pass it to `memcpy()`. The only reason this has not caused horrible runtime issues is because `memcpy()` should be told to copy zero bytes when this happens. That said, this is technically undefined behavior, so we should correct it so that future changes to the code cannot trigger it. Clang's static analyzer found this with the help of CodeChecker's CTU analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14043	2022-10-19 17:10:56 -07:00
Richard Yao	44f71818f8	Silence static analyzer warnings about spa_sync_props() Both Coverity and Clang's static analyzer complain about reading an uninitialized intval if the property is not passed as DATA_TYPE_UINT64 in the nvlist. This is impossible becuase spa_prop_validate() already checked this, but they are unlikely to be the last static analyzers to complain about this, so lets just refactor the code to suppress the warnings. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14043	2022-10-19 17:10:52 -07:00
Akash B	5405be0365	Add options to zfs redundant_metadata property Currently, additional/extra copies are created for metadata in addition to the redundancy provided by the pool(mirror/raidz/draid), due to this 2 times more space is utilized per inode and this decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, hence can reduce the space consumed by the additional metadata copies and increase the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #13680	2022-10-19 17:07:51 -07:00
samwyc	2be0a124af	Fix sequential resilver drive failure race condition This patch handles the race condition on simultaneous failure of 2 drives, which misses the vdev_rebuild_reset_wanted signal in vdev_rebuild_thread. We retry to catch this inside the vdev_rebuild_complete_sync function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Samuel Wycliffe J <samwyc@hpe.com> Closes #14041 Closes #14050	2022-10-19 15:48:13 -07:00
youzhongyang	2a068a1394	Support idmapped mount Adds support for idmapped mounts. Supported as of Linux 5.12 this functionality allows user and group IDs to be remapped without changing their state on disk. This can be useful for portable home directories and a variety of container related use cases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #12923 Closes #13671	2022-10-19 11:17:09 -07:00
Richard Yao	eaaed26ffb	Fix memory leaks in dmu_send()/dmu_send_obj() If we encounter an EXDEV error when using the redacted snapshots feature, the memory used by dspp.fromredactsnaps is leaked. Clang's static analyzer caught this during an experiment in which I had annotated various headers in an attempt to improve the results of static analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13973	2022-10-18 16:03:33 -07:00
Richard Yao	84243acb91	Cleanup: Remove NULL pointer check from dmu_send_impl() The pointer is to a structure member, so it is never NULL. Coverity complained about this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14042	2022-10-18 15:40:04 -07:00
Richard Yao	fb823de9fb	Cleanup: Delete dead code from send_merge_thread() range is always deferenced before it reaches this check, such that the kmem_zalloc() call is never executed. There is also no need to set `range->eos_marker = B_TRUE` because it is already set. Coverity incorrectly complained about a potential NULL pointer dereference because of this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14042	2022-10-18 15:39:56 -07:00
Richard Yao	717641ac09	Cleanup: zvol_add_clones() should not NULL check dp It is never NULL because we return early if dsl_pool_hold() fails. This caused Coverity to complain. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14042	2022-10-18 15:39:48 -07:00
Richard Yao	ef55679a75	Cleanup: metaslab_alloc_dva() should not NULL check mg->mg_next This is a circularly linked list. mg->mg_next can never be NULL. This caused 3 defect reports in Coverity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14042	2022-10-18 15:39:40 -07:00
Richard Yao	9a8039439a	Cleanup: Simplify userspace abd_free_chunks() Clang's static analyzer complained that we could use after free here if the inner loop ever iterated. That is a false positive, but upon inspection, the userland abd_alloc_chunks() function never will put multiple consecutive pages into a `struct scatterlist`, so there is no need to loop. We delete the inner loop. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14042	2022-10-18 15:39:21 -07:00
Richard Yao	6ae2f90888	Fix possible NULL pointer dereference in sha2_mac_init() If mechanism->cm_param is NULL, passing mechanism to PROV_SHA2_GET_DIGEST_LEN() will dereference a NULL pointer. Coverity reported this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-10-18 15:35:23 -07:00
Richard Yao	1bd02680c0	Fix NULL pointer dereference in spa_open_common() Calling spa_open() will pass a NULL pointer to spa_open_common()'s config parameter. Under the right circumstances, we will dereference the config parameter without doing a NULL check. Clang's static analyzer found this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-10-18 15:34:54 -07:00
Richard Yao	3146fc7edf	Fix NULL pointer passed to strlcpy from zap_lookup_impl() Clang's static analyzer pointed out that whenever zap_lookup_by_dnode() is called, we have the following stack where strlcpy() is passed a NULL pointer for realname from zap_lookup_by_dnode(): strlcpy() zap_lookup_impl() zap_lookup_norm_by_dnode() zap_lookup_by_dnode() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-10-18 15:34:44 -07:00
Richard Yao	711b35dc24	fm_fmri_hc_create() must call va_end() before returning clang-tidy caught this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-10-18 15:34:36 -07:00
Coleman Kane	ecb6a50819	Linux 6.1 compat: change order of sys/mutex.h includes After Linux 6.1-rc1 came out, the build started failing to build a couple of the files in the linux spl code due to the mutex_init redefinition. Moving the sys/mutex.h include to a lower position within these two files appears to fix the problem. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14040	2022-10-18 12:29:44 -07:00
Tino Reichardt	27218a32fc	Fix declarations of non-global variables This patch inserts the `static` keyword to non-global variables, which where found by the analysis tool smatch. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13970	2022-10-18 11:05:32 -07:00
Richard Yao	0aae8a449b	Fix theoretical array overflow in lua_typename() Out of the 12 defects in lua that coverity reports, 5 of them involve `lua_typename()` and out of the dozens of defects in ZFS that lua reports, 3 of them involve `lua_typename()` due to the ZCP code. Given all of the uses of `lua_typename()` in the ZCP code, I was surprised that there were not more. It appears that only 2 were reported because only 3 called `lua_type()`, which does a defective sanity check that allows invalid types to be passed. lua/lua@d4fb848be7 addressed this in upstream lua 5.3. Unfortunately, we did not get that fix since we use lua 5.2 and we do not have assertions enabled in lua, so the upstream solution would not do anything. While we could adopt the upstream solution and enable assertions, a simpler solution is to fix the issue by making `lua_typename()` return `internal_type_error` whenever it is called with an invalid type. This avoids the array overflow and if we ever see it appear somewhere, we will know there is a problem with the lua interpreter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13947	2022-10-14 13:41:56 -07:00
Richard Yao	6a42939fcd	Cleanup: Address Clang's static analyzer's unused code complaints These were categorized as the following: * Dead assignment 23 * Dead increment 4 * Dead initialization 6 * Dead nested assignment 18 Most of these are harmless, but since actual issues can hide among them, we correct them. That said, there were a few return values that were being ignored that appeared to merit some correction: * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from `destroy_batched()`. We handle it by returning -1 if there is an error. * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from `zfs_for_each()`. We handle it by doing a binary OR of the error value from the subsequent `zfs_for_each()` call to the existing value. This is how errors are mostly handled inside `zfs_for_each()`. The error value here is passed to exit from the zfs command, so doing a binary or on it is better than what we did previously. * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from `dsl_prop_get_ds()` when the property is not of type string. We return an error when it does. There is a small concern that the `zfs_get_temporary_prop()` call would handle things, but in the case that it does not, we would be pushing an uninitialized numval onto the lua stack. It is expected that `dsl_prop_get_ds()` will succeed anytime that `zfs_get_temporary_prop()` does, so that not giving it a chance to fix things is not a problem. * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used `nvlist_add_nvlist()` twice in ways in which errors are expected to be impossible, so we switch to `fnvlist_add_nvlist()`. A few notable ones did not merit use of the return value, so we suppressed it with `(void)`: * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error value from `describe_free()`. A look through the commit history revealed that this was intentional. * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the returned handle from `arc_hdr_realloc()` because it is already referenced in lists. * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly saying not to use the error from `vdev_label_init()` because whatever causes the error could be the reason why a detach is being done. Unfortunately, I am not presently able to analyze the kernel modules with Clang's static analyzer, so I could have missed some cases of this. In cases where reports were present in code that is duplicated between Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version too. After this commit is merged, regressions like `dee8934` should become extremely obvious with Clang's static analyzer since a regression would appear in the results as the only instance of unused code. That assumes that Coverity does not catch the issue first. My local branch with fixes from all of my outstanding non-draft pull requests shows 118 reports from Clang's static anlayzer after this patch. That is down by 51 from 169. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13986	2022-10-14 13:37:54 -07:00
Christian Schwarz	4d5aef3ba9	zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we leave zfsvfs != NULL. This will lead to execution of the error handling `if` statement at the `out` label, and hence to a call to dmu_objset_disown and zfsvfs_free. However, zfs_umount, which we call upon failure of zfs_root and d_make_root already does dmu_objset_disown and zfsvfs_free. I suppose this patch rather adds to the brittleness of this part of the code base, but I don't want to invest more time in this right now. To add a regression test, we'd need some kind of fault injection facility for zfs_root or d_make_root, which doesn't exist right now. And even then, I think that regression test would be too closely tied to the implementation. To repro the double-disown / double-free, do the following: 1. patch zfs_root to always return an error 2. mount a ZFS filesystem Here's the stack trace you would see then: VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000) PANIC at dsl_dataset.c:1003:dsl_dataset_disown() Showing stack for process 28332 CPU: 2 PID: 28332 Comm: zpool Tainted: G O 5.10.103-1.nutanix.el7.x86_64 #1 Call Trace: dump_stack+0x74/0x92 spl_dumpstack+0x29/0x2b [spl] spl_panic+0xd4/0xfc [spl] dsl_dataset_disown+0xe9/0x150 [zfs] dmu_objset_disown+0xd6/0x150 [zfs] zfs_domount+0x17b/0x4b0 [zfs] zpl_mount+0x174/0x220 [zfs] legacy_get_tree+0x2b/0x50 vfs_get_tree+0x2a/0xc0 path_mount+0x2fa/0xa70 do_mount+0x7c/0xa0 __x64_sys_mount+0x8b/0xe0 do_syscall_64+0x38/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Christian Schwarz <christian.schwarz@nutanix.com> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #14025	2022-10-14 11:46:47 -07:00
Richard Yao	ab8d9c1783	Cleanup: 64-bit kernel module parameters should use fixed width types Various module parameters such as `zfs_arc_max` were originally `uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long` for Linux compatibility because Linux's kernel default module parameter implementation did not support 64-bit types on 32-bit platforms. This caused problems when porting OpenZFS to Windows because its LLP64 memory model made `unsigned long` a 32-bit type on 64-bit, which created the undesireable situation that parameters that should accept 64-bit values could not on 64-bit Windows. Upon inspection, it turns out that the Linux kernel module parameter interface is extensible, such that we are allowed to define our own types. Rather than maintaining the original type change via hacks to to continue shrinking module parameters on 32-bit Linux, we implement support for 64-bit module parameters on Linux. After doing a review of all 64-bit kernel parameters (found via the man page and also proposed changes by Andrew Innes), the kernel module parameters fell into a few groups: Parameters that were originally 64-bit on Illumos: * dbuf_cache_max_bytes * dbuf_metadata_cache_max_bytes * l2arc_feed_min_ms * l2arc_feed_secs * l2arc_headroom * l2arc_headroom_boost * l2arc_write_boost * l2arc_write_max * metaslab_aliquot * metaslab_force_ganging * zfetch_array_rd_sz * zfs_arc_max * zfs_arc_meta_limit * zfs_arc_meta_min * zfs_arc_min * zfs_async_block_max_blocks * zfs_condense_max_obsolete_bytes * zfs_condense_min_mapping_bytes * zfs_deadman_checktime_ms * zfs_deadman_synctime_ms * zfs_initialize_chunk_size * zfs_initialize_value * zfs_lua_max_instrlimit * zfs_lua_max_memlimit * zil_slog_bulk Parameters that were originally 32-bit on Illumos: * zfs_per_txg_dirty_frees_percent Parameters that were originally `ssize_t` on Illumos: * zfs_immediate_write_sz Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It has been upgraded to 64-bit. Parameters that were `long`/`unsigned long` because of Linux/FreeBSD influence: * l2arc_rebuild_blocks_min_l2size * zfs_key_max_salt_uses * zfs_max_log_walking * zfs_max_logsm_summary_length * zfs_metaslab_max_size_cache_sec * zfs_min_metaslabs_to_flush * zfs_multihost_interval * zfs_unflushed_log_block_max * zfs_unflushed_log_block_min * zfs_unflushed_log_block_pct * zfs_unflushed_max_mem_amt * zfs_unflushed_max_mem_ppm New parameters that do not exist in Illumos: * l2arc_trim_ahead * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_arc_sys_free * zfs_deadman_ziotime_ms * zfs_delete_blocks * zfs_history_output_max * zfs_livelist_max_entries * zfs_max_async_dedup_frees * zfs_max_nvlist_src_size * zfs_rebuild_max_segment * zfs_rebuild_vdev_limit * zfs_unflushed_log_txg_max * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift * zfs_vnops_read_chunk_size * zvol_max_discard_blocks Rather than clutter the lists with commentary, the module parameters that need comments are repeated below. A few parameters were defined in Linux/FreeBSD specific code, where the use of ulong/long is not an issue for portability, so we leave them alone: * zfs_delete_blocks * zfs_key_max_salt_uses * zvol_max_discard_blocks The documentation for a few parameters was found to be incorrect: * zfs_deadman_checktime_ms - incorrectly documented as int * zfs_delete_blocks - not documented as Linux only * zfs_history_output_max - incorrectly documented as int * zfs_vnops_read_chunk_size - incorrectly documented as long * zvol_max_discard_blocks - incorrectly documented as ulong The documentation for these has been fixed, alongside the changes to document the switch to fixed width types. In addition, several kernel module parameters were percentages or held ashift values, so being 64-bit never made sense for them. They have been downgraded to 32-bit: * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_per_txg_dirty_frees_percent * zfs_unflushed_log_block_pct * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift Of special note are `zfs_vdev_max_auto_ashift` and `zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`, and passed to the kernel as `ulong`. This is inherently buggy on big endian 32-bit Linux, since the values would not be written to the correct locations. 32-bit FreeBSD was unaffected because its sysctl code correctly treated this as a `uint64_t`. Lastly, a code comment suggests that `zfs_arc_sys_free` is Linux-specific, but there is nothing to indicate to me that it is Linux-specific. Nothing was done about that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Original-patch-by: Andrew Innes <andrew.c12@gmail.com> Original-patch-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13984 Closes #14004	2022-10-13 10:03:29 -07:00
Richard Yao	a6ccb36b94	Add defensive assertions Coverity complains about possible bugs involving referencing NULL return values and division by zero. The division by zero bugs require that a block pointer be corrupt, either from in-memory corruption, or on-disk corruption. The NULL return value complaints are only bugs if assumptions that we make about the state of data structures are wrong. Some seem impossible to be wrong and thus are false positives, while others are hard to analyze. Rather than dismiss these as false positives by assuming we know better, we add defensive assertions to let us know when our assumptions are wrong. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13972	2022-10-12 11:25:18 -07:00
Mark Johnston	ed566bf1cd	FreeBSD: Fix a pair of bugs in zfs_fhtovp() - Add a zfs_exit() call in an error path, otherwise a lock is leaked. - Remove the fid_gen > 1 check. That appears to be Linux-specific: zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the snapshot directory is mounted. On FreeBSD it fails, making snapshot dirs inaccessible via NFS. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Fixes: `43dbf88178` ("FreeBSD: vfsops: use setgen for error case") Closes #14001 Closes #13974	2022-10-11 12:29:55 -07:00
Serapheim Dimitropoulos	4dcc2bde9c	Stop ganging due to past vdev write errors = Problem While examining a customer's system we noticed unreasonable space usage from a few snapshots due to gang blocks. Under some further analysis we discovered that the pool would create gang blocks because all its disks had non-zero write error counts and they'd be skipped for normal metaslab allocations due to the following if-clause in `metaslab_alloc_dva()`: ``` /* * Avoid writing single-copy data to a failing, * non-redundant vdev, unless we've already tried all * other vdevs. */ if ((vd->vdev_stat.vs_write_errors > 0 \|\| vd->vdev_state < VDEV_STATE_HEALTHY) && d == 0 && !try_hard && vd->vdev_children == 0) { metaslab_trace_add(zal, mg, NULL, psize, d, TRACE_VDEV_ERROR, allocator); goto next; } ``` = Proposed Solution Get rid of the predicate in the if-clause that checks the past write errors of the selected vdev. We still try to allocate from HEALTHY vdevs anyway by checking vdev_state so the past write errors doesn't seem to help us (quite the opposite - it can cause issues in long-lived pools like the one from our customer). = Testing I first created a pool with 3 vdevs: ``` $ zpool list -v volpool NAME SIZE ALLOC FREE volpool 22.5G 117M 22.4G xvdb 7.99G 40.2M 7.46G xvdc 7.99G 39.1M 7.46G xvdd 7.99G 37.8M 7.46G ``` And used `zinject` like so with each one of them: ``` $ sudo zinject -d xvdb -e io -T write -f 0.1 volpool ``` And got the vdevs to the following state: ``` $ zpool status volpool pool: volpool state: ONLINE status: One or more devices has experienced an unrecoverable error. ...<cropped>.. action: Determine if the device needs to be replaced, and clear the ...<cropped>.. config: NAME STATE READ WRITE CKSUM volpool ONLINE 0 0 0 xvdb ONLINE 0 1 0 xvdc ONLINE 0 1 0 xvdd ONLINE 0 4 0 ``` I also double-checked their write error counters with sdb: ``` sdb> spa volpool \| vdev \| member vdev_stat.vs_write_errors (uint64_t)0 # <---- this is the root vdev (uint64_t)2 (uint64_t)1 (uint64_t)1 ``` Then I checked that I the problem was reproduced in my VM as I the gang count was growing in zdb as I was writting more data: ``` $ sudo zdb volpool \| grep gang ganged count: 1384 $ sudo zdb volpool \| grep gang ganged count: 1393 $ sudo zdb volpool \| grep gang ganged count: 1402 $ sudo zdb volpool \| grep gang ganged count: 1414 ``` Then I updated my bits with this patch and the gang count stayed the same. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #14003	2022-10-11 12:27:41 -07:00
Richard Yao	70248b82e8	Fix uninitialized value read in vdev_prop_set() If no errors are encountered, we read an uninitialized error value. Clang's static analyzer complained about this. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14007	2022-10-11 12:24:36 -07:00
Jorgen Lundman	4b629d04a5	Avoid calling rw_destroy() on uninitialized rwlock First the function `memset(&key, 0, ...)` but any call to "goto error;" would call zio_crypt_key_destroy(key) which calls `rw_destroy()`. The `rw_init()` is moved up to be right after the memset. This way the rwlock can be released. The ctx does allocate memory, but that is handled by the memset to 0 and icp skips NULL ptrs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #13976	2022-10-05 17:07:50 -07:00
Finix1979	6694ca5539	Avoid unnecessary metaslab_check_free calling The metaslab_check_free() function only needs to be called in the GANG\|DEDUP\|etc case because zio_free_sync() will internally call metaslab_check_free(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Finix1979 <yancw@info2soft.com> Closes #13977	2022-10-04 10:55:35 -07:00
Richard Yao	a36b37d4de	Fix potential NULL pointer dereference in dsl_dataset_promote_check() If the `list_head()` returns NULL, we dereference it, right before we check to see if it returned NULL. We have defined two different pointers that both point to the same thing, which are `origin_head` and `origin_ds`. Almost everything uses `origin_ds`, so we switch them to use `origin_ds`. We also promote `origin_ds` to a const pointer so that the compiler verifies that nothing modifies it. Coverity complained about this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13967	2022-09-30 16:59:51 -07:00
Tino Reichardt	a2d5643f88	Fix double const qualifier declarations Some header files define structures like this one: typedef const struct zio_checksum_info { /* ... / const char ci_name; } zio_abd_checksum_func_t; So we can use `zio_abd_checksum_func_t` for const declarations now. It's not needed that we use the `const` qualifier again like this: `const zio_abd_checksum_func_t *varname;` This patch solves the double const qualifiers, which were found by smatch. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13961	2022-09-30 15:34:39 -07:00
Richard Yao	55d7afa4ad	Reduce false positives from Static Analyzers Both Clang's Static Analyzer and Synopsys' Coverity would ignore assertions. Following Clang's advice, we annotate our assertions: https://clang-analyzer.llvm.org/annotations.html#custom_assertions This makes both Clang's Static Analyzer and Coverity properly identify assertions. This change reduced Clang's reported defects from 246 to 180. It also reduced the false positives reported by Coverityi by 10, while enabling Coverity to find 9 more defects that previously were false negatives. A couple examples of this would be CID-1524417 and CID-1524423. After submitting a build to coverity with the modified assertions, CID-1524417 disappeared while the report for CID-1524423 no longer claimed that the assertion tripped. Coincidentally, it turns out that it is possible to more accurately annotate our headers than the Coverity modelling file permits in the case of format strings. Since we can do that and this patch annotates headers whenever `__coverity_panic__()` would have been used in the model file, we drop all models that use `__coverity_panic__()` from the model file. Upon seeing the success in eliminating false positives involving assertions, it occurred to me that we could also modify our headers to eliminate coverity's false positives involving byte swaps. We now have coverity specific byteswap macros, that do nothing, to disable Coverity's false positives when we do byte swaps. This allowed us to also drop the byteswap definitions from the model file. Lastly, a model file update has been done beyond the mentioned deletions: * The definitions of `umem_alloc_aligned()`, `umem_alloc()` andi `umem_zalloc()` were originally implemented in a way that was intended to inform coverity that when KM_SLEEP has been passed these functions, they do not return NULL. A small error in how this was done was found, so we correct it. * Definitions for umem_cache_alloc() and umem_cache_free() have been added. In practice, no false positives were avoided by making these changes, but in the interest of correctness from future coverity builds, we make them anyway. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13902	2022-09-30 15:30:12 -07:00
Serapheim Dimitropoulos	4acc36ed7c	Fix panic in dsl_process_sub_livelist for EINTR = Issue Recently we hit an assertion panic in `dsl_process_sub_livelist` while exporting the spa and interrupting `bpobj_iterate_nofree`. In that case `bpobj_iterate_nofree` stops mid-way returning an EINTR without clearing the intermediate AVL tree that keeps track of the livelist entries it has encountered so far. At that point the code has a VERIFY for the number of elements of the AVL expecting it to be zero (which is not the case for EINTR). = Fix Cleanup any intermediate state before destroying the AVL when encountering EINTR. Also added a comment documenting the scenario where the EINTR comes up. There is no need to do anything else for the calles of `dsl_process_sub_livelist` as they already handle the EINTR case. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #13939	2022-09-29 09:39:48 -07:00
Richard Yao	1b87195c3c	Fix unchecked return values `2a493a4c71` was intended to fix all instances of coverity reported unchecked return values, but unfortunately, two were missed by mistake. This commit fixes the unchecked return values that had been missed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13945	2022-09-29 09:02:57 -07:00
Ameer Hamza	55c12724d3	zed: mark disks as REMOVED when they are removed ZED does not take any action for disk removal events if there is no spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on removal event. This means that if you are running zed and remove a disk, it will be properly marked as REMOVED. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13797	2022-09-28 09:48:46 -07:00
Mateusz Guzik	eb9bec0a5d	Bring per_txg_dirty_frees_percent back to 30 The current value causes significant artificial slowdown during mass parallel file removal, which can be observed both on FreeBSD and Linux when running real workloads. Sample results from Linux doing make -j 96 clean after an allyesconfig modules build: before: 4.14s user 6.79s system 48% cpu 22.631 total after: 4.17s user 6.44s system 153% cpu 6.927 total FreeBSD results in the ticket. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13932 Closes #13938	2022-09-27 17:38:03 -07:00
Richard Yao	a51288aabb	Fix unsafe string operations Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`, `nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use of `strlcat()` in `nfs_init_tmpfile()`. Inspired by this, I did an audit of every single usage of `strcpy()` and `strcat()` in the code. If I could not prove that the usage was safe, I changed the code to use either `strlcpy()` or `strlcat()`, depending on which function was originally used. In some cases, `snprintf()` was used to replace multiple uses of `strcat` because it was cleaner. Whenever I changed a function, I preferred to use `sizeof(dst)` when the compiler is able to provide the string size via that. When it could not because the string was passed by a caller, I checked the entire call tree of the function to find out how big the buffer was and hard coded it. Hardcoding is less than ideal, but it is safe unless someone shrinks the buffer sizes being passed. Additionally, Coverity reported three more string related issues: * It caught a case where we do an overlapping memory copy in a call to `snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`. * It caught `sizeof (buf)` being used instead of `buflen` in `zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to `snprintf()`. We change that to pass `buflen`. * It caught a theoretical unterminated string passed to `strcmp()`. This one is likely a false positive, but we have the information needed to do this more safely, so we change this to silence the false positive not just in coverity, but potentially other static analysis tools too. We switch to `strncmp()`. * There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We suppress it by switching to `snprintf()` since other static analysis tools might complain about it too. Interestingly, there is a possible real bug there too, since it assumes that the passed directory path ends with '/'. We add a '/' to fix that potential bug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13913	2022-09-27 16:47:24 -07:00
Richard Yao	88b199c24e	Cleanup spa_export_common() Coverity complains about a possible NULL pointer dereference. This is impossible, but it suspects it because we do a NULL check against `spa->spa_root_vdev`. This NULL check was never necessary and makes the code harder to understand, so we drop it. In particular, we dereference `spa->spa_root_vdev` when `new_state != POOL_STATE_UNINITIALIZED && !hardforce`. The first is only true when spa_reset is called, which only occurs under fault injection. The second is true unless `zpool export -F $POOLNAME` is used. Therefore, we effectively always dereference the pointer. In the cases where we do not, there is no reason to think it is unsafe. Therefore this change is safe to make. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13905	2022-09-27 16:45:51 -07:00
Richard Yao	31b4e008f1	LUA: Fix CVE-2014-5461 Apply the fix from upstream. http://www.lua.org/bugs.html#5.2.2-1 https://www.opencve.io/cve/CVE-2014-5461 It should be noted that exploiting this requires the `SYS_CONFIG` privilege, and anyone with that privilege likely has other opportunities to do exploits, so it is unlikely that bad actors could exploit this unless system administrators are executing untrusted ZFS Channel Programs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13949	2022-09-27 16:44:13 -07:00
Richard Yao	fdc2d30371	Cleanup: Specify unsignedness on things that should not be signed In #13871, zfs_vdev_aggregation_limit_non_rotating and zfs_vdev_aggregation_limit being signed was pointed out as a possible reason not to eliminate an unnecessary MAX(unsigned, 0) since the unsigned value was assigned from them. There is no reason for these module parameters to be signed and upon inspection, it was found that there are a number of other module parameters that are signed, but should not be, so we make them unsigned. Making them unsigned made it clear that some other variables in the code should also be unsigned, so we also make those unsigned. This prevents users from setting negative values that could potentially cause bad behaviors. It also makes the code slightly easier to understand. Mostly module parameters that deal with timeouts, limits, bitshifts and percentages are made unsigned by this. Any that are boolean are left signed, since whether booleans should be considered signed or unsigned does not matter. Making zfs_arc_lotsfree_percent unsigned caused a `zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was removed. Removing the check was also necessary to prevent a compiler error from -Werror=type-limits. Several end of line comments had to be moved to their own lines because replacing int with uint_t caused us to exceed the 80 character limit enforced by cstyle.pl. The following were kept signed because they are passed to taskq_create(), which expects signed values and modifying the OpenSolaris/Illumos DDI is out of scope of this patch: * metaslab_load_pct * zfs_sync_taskq_batch_pct * zfs_zil_clean_taskq_nthr_pct * zfs_zil_clean_taskq_minalloc * zfs_zil_clean_taskq_maxalloc * zfs_arc_prune_task_threads Also, negative values in those parameters was found to be harmless. The following were left signed because either negative values make sense, or more analysis was needed to determine whether negative values should be disallowed: * zfs_metaslab_switch_threshold * zfs_pd_bytes_max * zfs_livelist_min_percent_shared zfs_multihost_history was made static to be consistent with other parameters. A number of module parameters were marked as signed, but in reality referenced unsigned variables. upgrade_errlog_limit is one of the numerous examples. In the case of zfs_vdev_async_read_max_active, it was already uint32_t, but zdb had an extern int declaration for it. Interestingly, the documentation in zfs.4 was right for upgrade_errlog_limit despite the module parameter being wrongly marked, while the documentation for zfs_vdev_async_read_max_active (and friends) was wrong. It was also wrong for zstd_abort_size, which was unsigned, but was documented as signed. Also, the documentation in zfs.4 incorrectly described the following parameters as ulong when they were int: * zfs_arc_meta_adjust_restarts * zfs_override_estimate_recordsize They are now uint_t as of this patch and thus the man page has been updated to describe them as uint. dbuf_state_index was left alone since it does nothing and perhaps should be removed in another patch. If any module parameters were missed, they were not found by `grep -r 'ZFS_MODULE_PARAM' \| grep ', INT'`. I did find a few that grep missed, but only because they were in files that had hits. This patch intentionally did not attempt to address whether some of these module parameters should be elevated to 64-bit parameters, because the length of a long on 32-bit is 32-bit. Lastly, it was pointed out during review that uint_t is a better match for these variables than uint32_t because FreeBSD kernel parameter definitions are designed for uint_t, whose bit width can change in future memory models. As a result, we change the existing parameters that are uint32_t to use uint_t. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13875	2022-09-27 16:42:41 -07:00
Richard Yao	7584fbe846	Cleanup: Switch to strlcpy from strncpy Coverity found a bug in `zfs_secpolicy_create_clone()` where it is possible for us to pass an unterminated string when `zfs_get_parent()` returns an error. Upon inspection, it is clear that using `strlcpy()` would have avoided this issue. Looking at the codebase, there are a number of other uses of `strncpy()` that are unsafe and even when it is used safely, switching to `strlcpy()` would make the code more readable. Therefore, we switch all instances where we use `strncpy()` to use `strlcpy()`. Unfortunately, we do not portably have access to `strlcpy()` in tests/zfs-tests/cmd/zfs_diff-socket.c because it does not link to libspl. Modifying the appropriate Makefile.am to try to link to it resulted in an error from the naming choice used in the file. Trying to disable the check on the file did not work on FreeBSD because Clang ignores `#undef` when a definition is provided by `-Dstrncpy(...)=...`. We workaround that by explictly including the C file from libspl into the test. This makes things build correctly everywhere. We add a deprecation warning to `config/Rules.am` and suppress it on the remaining `strncpy()` usage. `strlcpy()` is not portably avaliable in tests/zfs-tests/cmd/zfs_diff-socket.c, so we use `snprintf()` there as a substitute. This patch does not tackle the related problem of `strcpy()`, which is even less safe. Thankfully, a quick inspection found that it is used far more correctly than strncpy() was used. A quick inspection did not find any problems with `strcpy()` usage outside of zhack, but it should be said that I only checked around 90% of them. Lastly, some of the fields in kstat_t varied in size by 1 depending on whether they were in userspace or in the kernel. The origin of this discrepancy appears to be `04a479f706` where it was made for no apparent reason. It conflicts with the comment on KSTAT_STRLEN, so we shrink the kernel field sizes to match the userspace field sizes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13876	2022-09-27 16:35:29 -07:00
Jitendra Patidar	3ed9d6883b	Enforce "-F" flag on resuming recv of full/newfs on existing dataset When receiving full/newfs on existing dataset, then it should be done with "-F" flag. Its enforced for initial receive in checks done in zfs_receive_one function of libzfs. Similarly, on resuming full/newfs recv on existing dataset, it should be done with "-F" flag. When dataset doesn't exist, then full/new recv is done on newly created dataset and it's marked INCONSISTENT. But when receiving on existing dataset, recv is first done on %recv and its marked INCONSISTENT. Existing dataset is not marked INCONSISTENT. Resume of full/newfs receive with dataset not INCONSISTENT indicates that its resuming newfs on existing dataset. So, enforce "-F" flag in this case. Also return an error from dmu_recv_resume_begin_check() in zfs kernel, when its resuming full/newfs recv without force. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #13856 Closes #13857	2022-09-27 16:34:27 -07:00
Richard Yao	a2163a96ae	Fix bad free in skein code Clang's static analyzer found a bad free caused by skein_mac_atomic(). It will allocate a context on the stack and then pass it to skein_final(), which attempts to free it. Upon inspection, skein_digest_atomic() also has the same problem. These functions were created to match the OpenSolaris ICP API, so I was curious how we avoided this in other providers and looked at the SHA2 code. It appears that SHA2 has a SHA2Final() helper function that is called by the exported sha2_mac_final()/sha2_digest_final() as well as the sha2_mac_atomic() and sha2_digest_atomic() functions. The real work is done in SHA2Final() while some checks and the free are done in sha2_mac_final()/sha2_digest_final(). We fix the use after free in the skein code by taking inspiration from the SHA2 code. We introduce a skein_final_nofree() that does most of the work, and make skein_final() into a function that calls it and then frees the memory. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13954	2022-09-27 12:36:58 -07:00
Richard Yao	f7bda2de97	Fix userspace memory leaks found by Clang Static Analzyer Recently, I have been making a push to fix things that coverity found. However, I was curious what Clang's static analyzer reported, so I ran it and found things that coverity had missed. * contrib/pam_zfs_key/pam_zfs_key.c: If prop_mountpoint is passed more than once, we leak memory. * module/zfs/zcp_get.c: We leak memory on temporary properties in userspace. * tests/zfs-tests/cmd/draid.c: On error from vdev_draid_rand(), we leak memory if best_map had been allocated by a prior iteration. * tests/zfs-tests/cmd/mkfile.c: Memory used by the loop is not freed before program termination. Arguably, these are all minor issues, but if we ignore them, then they could obscure serious bugs, so we fix them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13955	2022-09-26 17:18:05 -07:00
Richard Yao	8ef15f9322	Cleanup: Remove ineffective unsigned comparisons against 0 Coverity found a number of places where we either do MAX(unsigned, 0) or do assertions that a unsigned variable is >= 0. These do nothing, so let us drop them all. It also found a spot where we do `if (unsigned >= 0 && ...)`. Let us also drop the unsigned >= 0 check. Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13871	2022-09-26 17:02:38 -07:00
Richard Yao	52afc3443d	Linux: Fix uninitialized variable usage in zio_do_crypt_data() Coverity complained about this. An error from `hkdf_sha512()` before uio initialization will cause pointers to uninitialized memory to be passed to `zio_crypt_destroy_uio()`. This is a regression that was introduced by `cf63739191`. Interestingly, this never affected FreeBSD, since the FreeBSD version never had that patch ported. Since moving uio initialization to the top of this function would slow down the qat_crypt() path, we only move the `memset()` calls to the top of the function. This is sufficient to fix this problem. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13944	2022-09-26 16:44:22 -07:00
Richard Yao	2a493a4c71	Fix unchecked return values and unused return values Coverity complained about unchecked return values and unused values that turned out to be unused return values. Different approaches were used to handle the different cases of unchecked return values: * cmd/zdb/zdb.c: VERIFY0 was used in one place since the existing code had no error handling. An error message was printed in another to match the rest of the code. * cmd/zed/agents/zfs_retire.c: We dismiss the return value with `(void)` because the value is expected to be potentially unset. * cmd/zpool_influxdb/zpool_influxdb.c: We dismiss the return value with `(void)` because the values are expected to be potentially unset. * cmd/ztest.c: VERIFY0 was used since we want failures if something goes wrong in ztest. * module/zfs/dsl_dir.c: We dismiss the return value with `(void)` because there is no guarantee that the zap entry will always be there. For example, old pools imported readonly would not have it and we do not want to fail here because of that. * module/zfs/zfs_fm.c: `fnvlist_add_()` was used since the allocations sleep and thus can never fail. module/zfs/zvol.c: We dismiss the return value with `(void)` because we do not need it. This matches what is already done in the analogous `zfs_replay_write2()`. * tests/zfs-tests/cmd/draid.c: We suppress one return value with `(void)` since the code handles errors already. The other return value is handled by switching to `fnvlist_lookup_uint8_array()`. * tests/zfs-tests/cmd/file/file_fadvise.c: We add error handling. * tests/zfs-tests/cmd/mmap_sync.c: We add error handling for munmap, but ignore failures on remove() with (void) since it is expected to be able to fail. * tests/zfs-tests/cmd/mmapwrite.c: We add error handling. As for unused return values, they were all in places where there was error handling, so logic was added to handle the return values. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13920	2022-09-23 16:52:03 -07:00
Brian Behlendorf	505df8d133	Dynamically size dbuf hash mutex array Incorrectly sizing the array of hash locks used to protect the dbuf hash table can lead to contention and reduce performance. We could unconditionally allocate a larger array for the locks but it's wasteful, particularly for a low-memory system. Instead, dynamically allocate the array of locks and scale it based on total system memory. Additionally, add a new `dbuf_mutex_cache_shift` module option which can be used to override the hash lock array size. This is disabled by default (dbuf_mutex_hash_shift=0) and can only be set at module load time. The minimum target array size is set to 8192, this matches the current constant value. Note that the count of the dbuf hash table and count of the mutex array were added to the /proc/spl/kstat/zfs/dbufstats kstat. Finally, this change removes the _KERNEL conditional checks. These were not required since for the user space build there is no difference between the kmem and vmem interfaces. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13928	2022-09-22 12:59:56 -07:00
Brian Behlendorf	223b04d23d	Revert "Reduce dbuf_find() lock contention" This reverts commit `34dbc618f5`. While this change resolved the lock contention observed for certain workloads, it inadventantly reduced the maximum hash inserts/removes per second. This appears to be due to the slightly higher acquisition cost of a rwlock vs a mutex. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2022-09-22 12:59:41 -07:00
Richard Yao	e506a0ce40	Cleanup: Change 1 used in bitshifts to 1ULL Coverity complains about this. It is not a bug as long as we never shift by more than 31, but it is not terrible to change the constants from 1 to 1ULL as clean up. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13914	2022-09-22 11:28:33 -07:00
youzhongyang	62e2a2881f	Fix minor issues in namespace delegation support get_user_ns() is only done once for each namespace, so put_user_ns() should be done once too. Fix two typos in user_namespace/user_namespace_002.ksh and user_namespace/user_namespace_003.ksh. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #13918	2022-09-20 15:25:21 -07:00
Mateusz Guzik	fbf874a4ac	FreeBSD: handle V_PCATCH See https://cgit.FreeBSD.org/src/commit/?id=a75d1ddd74312f5dd79bc1e965f7077679659f2e Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13910	2022-09-20 15:22:32 -07:00
Mateusz Guzik	3e5caef4c5	FreeBSD: catch up to 1400068 Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13909	2022-09-20 15:21:30 -07:00
Ameer Hamza	c50b3f14d3	Delay ZFS_PROP_SHARESMB property to handle it for encrypted raw receive For encrypted raw receive, objset creation is delayed until a call to dmu_recv_stream(). ZFS_PROP_SHARESMB property requires objset to be populated when calling zpl_earlier_version(). To correctly handle the ZFS_PROP_SHARESMB property for encrypted raw receive, this change delays setting the property. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13878	2022-09-20 15:19:05 -07:00
Richard Yao	3f400b0f58	FreeBSD: Cleanup zfs_readdir() The FreeBSD project's coverity scans found dead code in `zfs_readdir()`. Also, the comment above `zfs_readdir()` is out of date. I fixed the comment and deleted all of the dead code, plus additional dead code that was found upon review. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13924	2022-09-20 14:50:16 -07:00
Richard Yao	9276e202eb	FreeBSD: Fix uninitialized pointer read in spa_import_rootpool() The FreeBSD project's coverity scans found this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13923	2022-09-20 14:43:03 -07:00
Richard Yao	f272960d52	Fix usage of zed_log_msg() and zfs_panic_recover() Coverity complained about the format specifiers not matching variables. In one case, the variable is a constant, so we fix it. In another, we were missing an argument (about which coverity also complained). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13888	2022-09-19 17:32:18 -07:00
Richard Yao	891ac937be	Linux: Fix use-after-free in zfsvfs_create() Coverity reported that we pass a pointer to zfsvfs to `dmu_objset_disown()` after freeing zfsvfs in zfsvfs_create_impl() after a failure in zfsvfs_init(). We have nearly identical duplicate versions of this code for FreeBSD and Linux, but interestingly, the FreeBSD version of this code differs in such a way that it does not suffer from this bug. We remove the difference from the FreeBSD version to fix this bug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13883	2022-09-19 17:30:58 -07:00
Martin Matuška	042d43a1dd	FreeBSD: fix static module build broken in `7bb707ffa` param_set_arc_free_target(SYSCTL_HANDLER_ARGS) and param_set_arc_no_grow_shift(SYSCTL_HANDLER_ARGS) defined in sysctl_os.c must be made available to arc_os.c. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #13915	2022-09-19 17:21:45 -07:00
Mateusz Guzik	9a671fe7ec	FreeBSD: stop passing LK_INTERLOCK to VOP_LOCK There is an ongoing effort to eliminate this feature. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13908	2022-09-19 17:17:27 -07:00
Tino Reichardt	75e8b5ad84	Fix BLAKE3 tuneable and module loading on Linux and FreeBSD Apply similar options to BLAKE3 as it is done for zfs_fletcher_4_impl. The zfs module parameter on Linux changes from icp_blake3_impl to zfs_blake3_impl. You can check and set it on Linux via sysfs like this: ``` [bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl cycle [fastest] generic sse2 sse41 avx2 [bash]# echo sse2 > /sys/module/zfs/parameters/zfs_blake3_impl [bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl cycle fastest generic [sse2] sse41 avx2 ``` The modprobe module parameters may also be used now: ``` [bash]# modprobe zfs zfs_blake3_impl=sse41 [bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl cycle fastest generic sse2 [sse41] avx2 ``` On FreeBSD the BLAKE3 implementation can be set via sysctl like this: ``` [bsd]# sysctl vfs.zfs.blake3_impl vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 avx2 [bsd]# sysctl vfs.zfs.blake3_impl=sse2 vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 avx2 \ -> cycle fastest generic [sse2] sse41 avx2 ``` This commit changes also some Blake3 internals like these: - blake3_impl_ops_t was renamed to blake3_ops_t - all functions are named blake3_impl_NAME() now Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13725	2022-09-16 14:25:53 -07:00
Brian Behlendorf	7dee043af5	zfs_enter rework followup The zpl_fadvise() function was recently added and was not included in the initial patch. Update it accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13831	2022-09-16 14:25:53 -07:00
Ameer Hamza	577d41d3b2	zfs recv hangs if max recordsize is less than received recordsize - Some optimizations for bqueue enqueue/dequeue. - Added a fix to prevent deadlock when both bqueue_enqueue_impl() and bqueue_dequeue() waits for signal to be triggered. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13855	2022-09-16 13:52:25 -07:00
Chunwei Chen	768eacedef	zfs_enter rework Replace ZFS_ENTER and ZFS_VERIFY_ZP, which have hidden returns, with functions that return error code. The reason we want to do this is because hidden returns are not obvious and had caused some missing fail path unwinding. This patch changes the common, linux, and freebsd parts. Also fixes fail path unwinding in zfs_fsync, zpl_fsync, zpl_xattr_{list,get,set}, and zfs_lookup(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #13831	2022-09-16 13:36:47 -07:00
Richard Yao	b24d1c77f7	Add zfs_btree_verify_intensity kernel module parameter I see a few issues in the issue tracker that might be aided by being able to turn this on. We have no module parameter for it, so I would like to add one. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13874	2022-09-15 16:22:33 -07:00
Richard Yao	ddb1fd91c0	Fix incorrect size given to bqueue_enqueue() call in dmu_redact.c We pass sizeof (struct redact_record *) rather than sizeof (struct redact_record). Passing the pointer size is wrong. Coverity caught this in two places. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13885	2022-09-15 16:21:21 -07:00
Richard Yao	e949d36040	Fix assertions in crypto reference helpers The assertions are racy and the use of `membar_exit()` did nothing to fix that. The helpers use atomic functions, so we cleverly get values from the atomics that we can use to ensure that the assertions operate on the correct values. We also use `membar_producer()` prior to decrementing reference counts so that operations that happened prior to a decrement to 0 will be guaranteed to happen before the decrement on architectures that reorder atomics. This also slightly improves performance by eliminating unnecessary reads, although I doubt it would be measurable in any benchmark. Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13880	2022-09-15 13:24:00 -07:00
Richard Yao	fd8c3012b3	Fix use-after-free bugs in icp code These were reported by Coverity as "Read from pointer after free" bugs. Presumably, it did not report it as a use-after-free bug because it does not understand the inline assembly that implements the atomic instruction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13881	2022-09-15 11:46:42 -07:00
Richard Yao	ccec88f11a	FreeBSD: Fix integer conversion for vnlru_free{,_vfsops}() When reviewing #13875, I noticed that our FreeBSD code has an issue where it converts from `int64_t` to `int` when calling `vnlru_free{,_vfsops}()`. The result is that if the int64_t is `1 << 36`, the int will be 0, since the low bits are 0. Even when some low bits are set, a value such as `((1 << 36) + 1)` would truncate to 1, which is wrong. There is protection against this on 32-bit platforms, but on 64-bit platforms, there is no check to protect us, so we add a check. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13882	2022-09-14 12:51:55 -07:00
Richard Yao	4a6e8b99f5	Add assertion to dsl_dataset_set_compression_sync Coverity pointed out that if we somehow receive SPA_FEATURE_NONE, we will use a negative number as an array index. A defensive assertion seems appropriate. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13872	2022-09-14 12:50:03 -07:00
Richard Yao	d954ca19ba	Fix theoretical "use-after-free" in dbuf_prefetch_indirect_done() Coverity complains about a "use-after-free" bug in `dbuf_prefetch_indirect_done()` because we use a pointer value after freeing its buffer. The pointer is used for refcounting in ARC (as the reference holder). There is a theoretical situation where the pointer would be reused in a way that causes the refcounting to collide, so we change the order in which we call arc_buf_destroy() and dbuf_prefetch_fini() to match the rest of the function. This prevents the theoretical situation from being a possibility. Also, we have a few return statements with a value, despite this being a void function. We clean those up while we are making changes here. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13869	2022-09-13 17:58:29 -07:00
Richard Yao	cf66e7e594	Cleanup: Make memory barrier definitions consistent across kernels We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <kib@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13843	2022-09-13 16:59:33 -07:00
Richard Yao	d5d10f2aef	Cleanup dead spa_boot code Unused code detected by coverity. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13868	2022-09-13 16:40:10 -07:00
Richard Yao	e5327e7f97	vdev_draid_lookup_map() should not iterate outside draid_maps Coverity reported this as an out-of-bounds read. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13865	2022-09-12 12:51:17 -07:00
Richard Yao	13f2b8fb92	Fix use-after-free in btree code Coverty static analysis found these. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #10989 Closes #13861	2022-09-12 11:22:15 -07:00
Richard Yao	0e4c830bc1	Cleanup: Use OpenSolaris functions to call scheduler In our codebase, `cond_resched() and `schedule()` are Linux kernel functions that have replaced the OpenSolaris `kpreempt()` functions in the codebase to such an extent that `kpreempt()` in zfs_context.h was broken. Nobody noticed because we did not actually use it. The header had defined `kpreempt()` as `yield()`, which works on OpenSolaris and Illumos where `sched_yield()` is a wrapper for `yield()`, but that does not work on any other platform. The FreeBSD platform specific code implemented shims for these, but the shim for `schedule()` forced us to wait, which is different than merely rescheduling to another thread as the original Linux code does, while the shim for `cond_resched()` had the same definition as its kernel kpreempt() shim. After studying this, I have concluded that we should reintroduce the kpreempt() function in platform independent code with the following definitions: - In the Linux kernel: kpreempt(unused) -> cond_resched() - In the FreeBSD kernel: kpreempt(unused) -> kern_yield(PRI_USER) - In userspace: kpreempt(unused) -> sched_yield() In userspace, nothing changes from this cleanup. In the kernels, the function `fm_fini()` will now call `kern_yield(PRI_USER)` on FreeBSD and `cond_resched()` on Linux. This is instead of `pause("schedule", 1)` on FreeBSD and `schedule()` on Linux. This makes our behavior consistent across platforms. Note that Linux's SPL continues to use `cond_resched()` and `schedule()`. However, those functions have been removed from both the FreeBSD code and userspace code. This should have the benefit of making it slightly easier to port the code to new platforms by making how things should be mapped less confusing. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13845	2022-09-12 09:55:37 -07:00
Ryan Moeller	60d995727a	FreeBSD: Replace legacy make_dev() interface usage The function make_dev_s() was introduced to replace make_dev() in FreeBSD 11.0. It allows further specification of properties and flags and returns an error code on failure. Using this we can fail loading the module more gracefully than a panic in situations such as when a device named zfs already exists. We already use it for zvols. Use make_dev_s() for /dev/zfs. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #13854	2022-09-08 10:40:18 -07:00
Alexander Motin	37f6845c6f	Improve too large physical ashift handling When iterating through children physical ashifts for vdev, prefer ones above the maximum logical ashift, that we can actually use, but within the administrator defined maximum. When selecting top-level vdev ashift, do not set it to the defined maximum in case physical ashift is even higher, but just ignore one. Using the maximum does not prevent misaligned writes, but reduces space efficiency. Since ZFS tries to write data sequentially and aggregates the writes, in many cases large misanigned writes may be not as bad as the space penalty otherwise. Allow internal physical ashifts for vdevs higher than SHIFT_MAX. May be one day allocator or aggregation could benefit from that. Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB), so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB), but only if it really has to or explicitly told to, but not as an "optimization". There are some read-intensive NVMe SSDs that report Preferred Write Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a space inefficiency that can't be justified. Instead these changes make ZFS fall back to logical ashift of 12 (4KB) by default and only warn user that it may be suboptimal for performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #13798	2022-09-08 10:30:53 -07:00
Finix1979	320f0c6022	Add Linux posix_fadvise support The purpose of this PR is to accepts fadvise ioctl from userland to do read-ahead by demand. It could dramatically improve sequential read performance especially when primarycache is set to metadata or zfs_prefetch_disable is 1. If the file is mmaped, generic_fadvise is also called for page cache read-ahead besides dmu_prefetch. Only POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL are supported in this PR currently. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Finix Yan <yancw@info2soft.com> Closes #13694	2022-09-08 10:29:41 -07:00
Richard Yao	380b08098e	Linux SPL module init: Handle memory allocation failures correctly Upon inspection of our code, I noticed that we assume that __alloc_percpu() cannot fail, and while it probably never has failed in practice, technically, it can fail, so we should handle that. Additionally, we incorrectly assume that `taskq_create()` in spl_kmem_cache_init() cannot fail. The same remark applies to it. Lastly, `spl-init()` failures should always return negative error values, but in some places, we are returning positive 1, which is incorrect. We change those values to their correct error codes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13847	2022-09-08 10:28:20 -07:00
pkubaj	dff541f698	Fix build on FreeBSD/powerpc64* There's no VSX handler on FreeBSD for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Piotr Kubaj <pkubaj@FreeBSD.org> Closes #13848	2022-09-08 10:27:25 -07:00
Rob Wing	983096a1b4	FreeBSD: add kqfilter support for zvol cdev The only event hooked up is NOTE_ATTRIB, which is triggered when the device is resized. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Wing <rew@FreeBSD.org> Closes #13773	2022-09-06 09:49:33 -07:00
Rob Wing	9d0887402b	FreeBSD: add knlist_init_sx() for exclusive locks This will be used to implement kqfilter support for zvol cdevs. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Wing <rew@FreeBSD.org> Closes #13773	2022-09-06 09:48:57 -07:00
Richard Yao	11df48ab8b	Cleanup Raid-Z Typo fixes Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13834	2022-09-06 09:43:21 -07:00
Umer Saleem	59767479ac	Add DD_FIELD string for snapshots_changed property This commit adds DD_FIELD string used in extensified dsl_dir zap object for snapshots_changed property. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13819	2022-09-02 13:33:50 -07:00
Andriy Gapon	ee9f3bca55	Add zfs.sync.snapshot_rename Only the single snapshot rename is provided. The recursive or more complex rename can be scripted. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #13802	2022-09-02 13:31:19 -07:00
Ryan Moeller	7bb707ffaf	FreeBSD: Organize sysctls FreeBSD had a few platform-specific ARC tunables in the wrong place: - Move FreeBSD-specifc ARC tunables into the same vfs.zfs.arc node as the rest of the ARC tunables. - Move the handlers from arc_os.c to sysctl_os.c and add compat sysctls for the legacy names. While here, some additional clean up: - Most handlers are specific to a particular variable and don't need a pointer passed through the args. - Group blocks of related variables, handlers, and sysctl declarations into logical sections. - Match variable types for temporaries in handlers with the type of the global variable. - Remove leftover comments. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #13756	2022-09-02 13:26:24 -07:00
Alexander Motin	f933b3fd4d	Apply arc_shrink_shift to ARC above arc_c_min It makes sense to free memory in smaller chunks when approaching arc_c_min to let other kernel subsystems to free more, since after that point we can't free anything. This also matches behavior on Linux, where to shrinker reported only the size above arc_c_min. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #13794	2022-09-02 13:21:18 -07:00
Richard Yao	0b30dc484f	FreeBSD: Cleanup dead code from VFS The vfs_*_feature() macros turn anything that uses them into dead code, so we can delete all of it. As a side effect, zfs_set_fuid_feature() is now identical in module/os/freebsd/zfs/zfs_vnops_os.c and module/os/linux/zfs/zfs_vnops_os.c. A few other functions are identical too. Future cleanup could move these into a common file. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13832	2022-09-02 13:20:10 -07:00
Brian Behlendorf	9f346abbe8	Revert "Avoid panic with recordsize > 128k, raw sending and no large_blocks" This reverts commit `80a650b7bb`. This change inadvertently introduced a regression in ztest where one of the new ASSERTs is triggered in dsl_scan_visitbp(). Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #12275 Closes #13799	2022-08-25 13:33:32 -07:00
Umer Saleem	a582d52993	Updates for snapshots_changed property Currently, snapshots_changed property is stored in dd_props_zapobj, due to which the property is assumed to be local. This causes a difference in behavior with respect to other readonly properties. This commit stores the snapshots_changed property in dd_object. Source is not set to local in this case, which makes it consistent with other readonly properties. This commit also updates the date string format to include seconds. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13785	2022-08-24 14:20:43 -07:00
George Amanakis	0c4064d9a0	Fix zpool status in case of unloaded keys When scrubbing an encrypted filesystem with unloaded key still report an error in zpool status. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13675 Closes #13717	2022-08-22 17:42:01 -07:00
Paul Dagnelie	17e212652d	Prevent zevent list from consuming all of kernel memory There are a couple changes included here. The first is to introduce a cap on the size the ZED will grow the zevent list to. One million entries is more than enough for most use cases, and if you are overflowing that value, the problem needs to be addressed another way. The value is also tunable, for those who want the limit to be higher or lower. The other change is to add a kernel module parameter that allows snapshot creation/deletion to be exempted from the history logging; for most workloads, having these things logged is valuable, but for some workloads it produces large quantities of log spam and isn't especially helpful. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Issue #13374 Closes #13753	2022-08-22 12:36:22 -07:00
George Melikov	fbc210fab2	Enable relatime by default Linux sets relatime on mount by default for any file system, but relatime=off in ZFS disables it explicitly. Let's be consistent with other file systems on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #13614	2022-08-12 14:20:25 -07:00
Christian Schwarz	91983265b6	Add comment on acb_zio_dummy Thanks to George Wilson for clarifying this on Slack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #13698	2022-08-08 16:55:13 -07:00
Brian Behlendorf	c26045b435	Linux 5.20 compat: blk_cleanup_disk() As of the Linux 5.20 kernel blk_cleanup_disk() has been removed, all callers should use put_disk(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13728	2022-08-04 16:57:49 -07:00
Brian Behlendorf	bebdf52a16	Linux 5.20 compat: bdevname() As of the Linux 5.20 kernel bdevname() has been removed, all callers should use snprintf() and the "%pg" format specifier. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13728	2022-08-04 16:57:33 -07:00
Paul Dagnelie	673aa7e6cf	Don't double-zero buffers in fault management nvlists This is a small cleanup for a trivial problem which happened to be noticed while another issue was being investigated. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #13730	2022-08-04 16:53:47 -07:00
Umer Saleem	9681de4657	Add snapshots_changed as property Make dd_snap_cmtime property persistent across mount and unmount operations by storing in ZAP and restore the value from ZAP on hold into dd_snap_cmtime instead of updating it. Expose dd_snap_cmtime as 'snapshots_changed' property that provides a mechanism to quickly determine whether snapshot list for dataset has changed without having to mount a dataset or iterate the snapshot list. It specifies the time at which a snapshot for a dataset was last created or deleted. This allows us to be more efficient how often we query snapshots. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13635	2022-08-02 16:45:30 -07:00
Ryan Moeller	5ad44a0ce9	FreeBSD: Ignore symlink to i386 includes A symlink to i386 includes is created in the build dir on amd64 since freebsd/freebsd-src@d07600c563 Tell git to ignore it like the other include links. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #13719	2022-08-02 16:34:23 -07:00
Tino Reichardt	68aa3379ec	Skip checksum benchmarks on systems with slow cpu The checksum benchmarking on module load may take a really long time on embedded systems with a slow cpu. Avoid all benchmarks >= 1MiB on systems, where EdonR is slower then 300 MiB/s. This limit is currently hardcoded via the define LIMIT_PERF_MBS. This is the new benchmark output of a slow Intel Atom: ``` implementation 1k 4k 16k 64k 256k 1m 4m 16m edonr-generic 209 257 268 259 262 0 0 0 skein-generic 129 150 151 150 150 0 0 0 sha256-generic 50 55 56 56 56 0 0 0 sha512-generic 76 86 88 89 88 0 0 0 blake3-generic 63 62 62 62 61 0 0 0 blake3-sse2 114 292 301 307 309 0 0 0 ``` Reviewed-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13695	2022-08-01 09:51:45 -07:00
Alek P	e8cf3a4f76	Implement a new type of zfs receive: corrective receive (-c) This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file for example). With the provided send stream, corrective receive will read from disk blocks described by the WRITE records. When any of the reads come back with ECKSUM we use the data from the corresponding WRITE record to rewrite the corrupted block. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #9372	2022-07-28 15:52:46 -07:00
Tino Reichardt	5fae33e047	FreeBSD compile fix The file module/os/freebsd/zfs/zfs_ioctl_compat.c fails compiling because of this error: 'static' is not at beginning of declaration This commit fixes the three places within that file. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13702	2022-07-28 14:19:41 -07:00
Ameer Hamza	3a1ce49141	Add createtxg sort support for simple snapshot iterator - When iterating snapshots with name only, e.g., "-o name -s name", libzfs uses simple snapshot iterator and results are displayed in alphabetic order. This PR adds support for faster version of createtxg sort by avoiding nvlist parsing for properties. Flags "-o name -s createtxg" will enable createtxg sort while using simple snapshot iterator. - Added support to read createtxg property directly from zfs handle for filesystem, volume and snapshot types instead of parsing nvlist. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13577	2022-07-25 14:04:46 -07:00
ixhamza	fb087146de	Add support for per dataset zil stats and use wmsum counters ZIL kstats are reported in an inclusive way, i.e., same counters are shared to capture all the activities happening in zil. Added support to report zil stats for every datset individually by combining them with already exposed dataset kstats. Wmsum uses per cpu counters and provide less overhead as compared to atomic operations. Updated zil kstats to replace wmsum counters to avoid atomic operations. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13636	2022-07-20 17:14:06 -07:00
Alexander Motin	33dba8c792	Fix scrub resume from newly created hole It may happen that scan bookmark points to a block that was turned into a part of a big hole. In such case dsl_scan_visitbp() may skip it and dsl_scan_check_resume() will not be called for it. As result new scan suspend won't be possible until the end of the object, that may take hours if the object is a multi-terabyte ZVOL on a slow HDD pool, stretching TXG to all that time, creating all sorts of problems. This patch changes the resume condition to any greater or equal block, so even if we miss the bookmarked block, the next one we find will delete the bookmark, allowing new suspend. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13643	2022-07-20 17:02:36 -07:00
Tino Reichardt	97fd1ea42a	Fix memory allocation for the checksum benchmark Allocation via kmem_cache_alloc() is limited to less then 4m for some architectures. This commit limits the benchmarks with the linear abd cache to 1m on all architectures and adds 4m + 16m benchmarks via non-linear abd_alloc(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13669 Closes #13670	2022-07-20 17:01:32 -07:00
ixhamza	f371cc18f8	Expose ZFS dataset case sensitivity setting via sb_opts Makes the case sensitivity setting visible on Linux in /proc/mounts. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13607	2022-07-14 10:38:16 -07:00
Tino Reichardt	1d3ba0bf01	Replace dead opensolaris.org license link The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13619	2022-07-11 14:16:13 -07:00
Brian Behlendorf	e6489be347	Linux: Align MODULE_LICENSE macro text Specify the lua and zstd license text in the manor in which the kernel MODULE_LICENSE macro requires it. The now duplicate entries were merged and a comment added to make it clear what they apply to. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13641	2022-07-11 11:29:12 -07:00
Finix1979	cb01da6805	Call nvlist_free before return Fixes a small kernel memory leak which would occur if a pool failed to import because the `DMU_POOL_VDEV_ZAP_MAP` key can't be read from a presumably damaged MOS config. In the case of a missing key there was no leak. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Finix1979 <yancw@info2soft.com> Closes #13629	2022-07-07 11:43:58 -07:00
Alexander Motin	74230a5bc1	Avoid memory copy when verifying raidz/draid parity Before this change for every valid parity column raidz_parity_verify() allocated new buffer and copied there existing data, then recalculated the parity and compared the result with the copy. This patch removes the memory copy, simply swapping original buffer pointers with newly allocated empty ones for parity recalculation and comparison. Original buffers with potentially incorrect parity data are then just freed, while new recalculated ones are used for repair. On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this change reduces memory traffic during scrub by 17% and total unhalted CPU time by 25%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13613	2022-07-05 16:27:29 -07:00
Alexander Motin	1ac7d194e5	Avoid memory copies during mirror scrub Issuing several scrub reads for a block we may use the parent ZIO buffer for one of child ZIOs. If that read complete successfully, then we won't need to copy the data explicitly. If block has only one copy (typical for root vdev, which is also a mirror inside), then we never need to copy -- succeed or fail as-is. Previous code also copied data from buffer of every successfully completed child ZIO, but that just does not make any sense. On healthy N-wide mirror this saves all N+1 (or even more in case of ditto blocks) memory copies for each scrubbed block, allowing CPU to focus mostly on check-summing. For other vdev types it should save one memory copy per block copy at root vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13606	2022-07-05 16:26:20 -07:00
наб	6fca6195cd	Re-fix -Wwrite-strings on FreeBSD Follow up fix for `a926aab902`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348 Closes #13610	2022-06-30 11:31:09 -07:00
George Amanakis	34e5423f83	Fix dnode byteswapping If a dnode has a spill pointer, and we use DN_SLOTS_TO_BONUSLEN() then we will possibly include the spill pointer in the len calculation and it will be byteswapped. Then dnode_byteswap() will carry on and swap the spill pointer again. Fix this by using DN_MAX_BONUS_LEN() instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13002 Closes #13015	2022-06-29 17:06:16 -07:00
наб	60e389ca10	module: lua: ldo: fix pragma name /home/nabijaczleweli/store/code/zfs/module/lua/ldo.c:175:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas] 175 \| #pragma GCC diagnostic ignored "-Winfinite-recursion"a \| ^~~~~~~~~~~~~~~~~~~~~~ Fixes: `a6e8113fed` ("Silence -Winfinite-recursion warning in luaD_throw()") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:59 -07:00
наб	dd66857d92	Remaining {=> const} char\|void *tag Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:59 -07:00
наб	a926aab902	Enable -Wwrite-strings Also, fix leak from ztest_global_vars_to_zdb_args() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:54 -07:00
gaoyanping	e7d90362e5	Fix znode group permission different from acl mask Zp->z_mode is set at the same time inode->i_mode is being changed. This has the effect of keeping both in sync without relying on zfs_znode_update_vfs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: yanping.gao <yanping.gao@xtaotech.com> Closes #13581	2022-06-29 13:38:46 -07:00
Alexander Motin	827322991f	Fix and disable blocks statistics during scrub Block statistics calculation during scrub I/O issue in case of sorted scrub accounted ditto blocks several times. Embedded blocks on other side were not accounted at all. This change moves the accounting from issue to scan stage, that fixes both problems and also allows to avoid pool-wide locking and the lock contention it created. Since this statistics is quite specific and is not even exposed now anywhere, disable its calculation by default to not waste CPU time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13579	2022-06-28 11:23:31 -07:00
Brian Behlendorf	43569ee374	Fix objtool: missing int3 after ret warning Resolve straight-line speculation warnings reported by objtool for x86_64 assembly on Linux when CONFIG_SLS is set. See the following LWN article for the complete details. https://lwn.net/Articles/877845/ Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:43 -07:00
Brian Behlendorf	c175f5ebb2	Fix -Wuse-after-free warning in dbuf_destroy() Move the use of the db pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no reason we can't invert the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_destroy': module/zfs/dbuf.c:2953:17: error: pointer 'db' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:24 -07:00
Brian Behlendorf	9619bcdefb	Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done() Move the use of the private pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no harm in inverting the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done': module/zfs/dbuf.c:3204:17: error: pointer 'private' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:19 -07:00
Brian Behlendorf	ff7e405f83	Fix -Wattribute-warning in dsl layer The memcpy(), memmove(), and memset() functions have been annotated to perform bounds checking when using FORTIFY_SOURCE. A warning is now generted when writing beyond the end of the specified field. Alternately, the new struct_group() macro could be used to create an anonymous union member for use by memcpy(). However, since this is the only place the macro would be helpful it's preferable to restructure the code slights to avoid the need for additional compatibility code when the macro does not exist. https://lore.kernel.org/lkml/20211118183807.1283332-1-keescook@chromium.org/T/ Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:12 -07:00
Brian Behlendorf	18df6afdfc	Fix -Wattribute-warning in edonr The wrong union memory was being accessed in EdonRInit resulting in a write beyond size of field compiler warning. Reference the correct member to resolve the warning. The warning was correct and this in case the mistake was harmless. In function ‘fortify_memcpy_chk’, inlined from ‘EdonRInit’ at zfs/module/icp/algs/edonr/edonr.c:494:3: ./include/linux/fortify-string.h:344:25: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:04 -07:00
Brian Behlendorf	b0f7dd276c	Fix -Wattribute-warning in zfs_log_xvattr() Restructure the code in zfs_log_xvattr() to use a lr_attr_end structure when accessing lr_attr_t elements located after the variable sized array. This makes the code more understandable and resolves the accessing beyond the end of the field warnings. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:18:57 -07:00
Brian Behlendorf	a6e8113fed	Silence -Winfinite-recursion warning in luaD_throw() This code should be kept inline with the upstream lua version as much as possible. Therefore, we simply want to silence the warning. This check was enabled by default as part of -Wall in gcc 12.1. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:18:50 -07:00
George Amanakis	80a650b7bb	Avoid panic with recordsize > 128k, raw sending and no large_blocks The current codebase does not support raw sending buffers with block size > 128kB when large_blocks is not active. This can happen in the codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which calls back dmu_objset_write_done()->dsl_dataset_block_born(). If dsl_dataset_sync() completes its run before dsl_dataset_block_born() is called, we will end up not activating some of the necessary flags, while having blocks based on those flags written in the filesystem. A subsequent send will then panic. Fix this by directly deciding in dmu_objset_sync() whether these flags need to be activated later by dsl_dataset_sync(). Instead of panicking due to a NULL pointer dereference in dmu_dump_write() in case of a send, print out an error message. Also during scrub verify there are no contradicting filesystem flags. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12275 Closes #12438	2022-06-27 14:17:25 -07:00
Alexander Motin	1cd72b9c13	Avoid two 64-bit divisions per scanned block Change math to make it like the ARC, using multiplications instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13591	2022-06-27 11:08:21 -07:00
Alexander Motin	c0bf952c84	Several B-tree optimizations - Introduce first element offset within a leaf. It allows to reduce by ~50% average memmove() size when adding/removing elements. If the added/removed element is in the first half of the leaf, we may shift elements before it and adjust the bth_first instead of moving more elements after it. - Use memcpy() instead of memmove() when we know there is no overlap. - Switch from uint64_t to uint32_t. It does not limit anything, but 32-bit arches should appreciate it greatly in hot paths. - Store leaf capacity in struct btree to avoid 64-bit divisions. - Adjust zfs_btree_insert_into_leaf() to always result in balanced leaves after splitting, no matter where the new element was inserted. Not that we care about it much, but it should also allow B-trees with as little as two elements per leaf instead of 4 previously. When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this reduces amount of time spent in memmove() inside the scan thread from 13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes. It should also reduce spacemaps load time, but I haven't measured it. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13582	2022-06-24 13:55:58 -07:00
Alexander Motin	1c0c729ab4	Several sorted scrub optimizations - Reduce size and comparison complexity of q_exts_by_size B-tree. Previous code used two 64-bit divisions and many other operations to compare two B-tree elements. It created enormous overhead. This implementation moves the math to the upper level and stores the score in the B-tree elements themselves. Since all that we need to store in that B-tree is the extent score and offset, those can fit into single 8 byte value instead of 24 bytes of q_exts_by_addr element and can be compared with single operation. - Better decouple secondary tree logic from main range_tree by moving rt_btree_ops and related functions into dsl_scan.c as ext_size_ops. Those functions are very small to worry about the code duplication and range_tree does not need to know details such as rt_btree_compare. - Instead of accounting number of pending bytes per pool, that needs atomic on global variable per block, account the number of non-empty per-vdev queues, that change much more rarely. - When extent scan is interrupted by TXG end, continue it in the next TXG instead of selecting next best extent. It allows to avoid leaving one truncated (and so likely not the best any more) extent each TXG. On top of some other optimizations this saves about 1.5 minutes out of 10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13576	2022-06-24 09:50:37 -07:00
Brian Behlendorf	ad8b9f940c	Scrub mirror children without BPs When scrubbing a raidz/draid pool, which contains a replacing or sparing mirror with multiple online children, only one child will be read. This is not normally a serious concern because the DTL records are used to determine where a good copy of the data is. As long as the data can be read from one child the mirror vdev will use it to repair gaps in any of its children. Furthermore, even if the data which was read is corrupt the raidz code will detect this and issue its own repair I/O to correct the damage in the mirror vdev. However, in the scenario where the DTL is wrong due to silent data corruption (say due to overwriting one child) and the scrub happens to read from a child with good data, then the other damaged mirror child will not be detected nor repaired. While this is possible for both raidz and draid vdevs, it's most pronounced when using draid. This is because by default the zed will sequentially rebuild a draid pool to a distributed spare, and the distributed spare half of the mirror is always preferred since it delivers better performance. This means the damaged half of the mirror will go undetected even after scrubbing. For system administrations this behavior is non-intuitive and in a worst case scenario could result in the only good copy of the data being unknowingly detached from the mirror. This change resolves the issue by reading all replacing/sparing mirror children when scrubbing. When the BP isn't available for verification, then compare the data buffers from each child. They must all be identical, if not there's silent damage and an error is returned to prompt the top-level vdev to issue a repair I/O to rewrite the data on all of the mirror children. Since we can't tell which child was wrong a checksum error is logged against the replacing or sparing mirror vdev. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13555	2022-06-23 10:36:28 -07:00
Tino Reichardt	deb1213098	Fix memory allocation issue for BLAKE3 context The kmem_alloc(sizeof (*ctx), KM_NOSLEEP) call on FreeBSD can't be used in this code segment. Work around this by pre-allocating a percpu context array for later use. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13568	2022-06-21 14:32:09 -07:00
Alexander Motin	d51f4ea5f9	FreeBSD: Improve crypto_dispatch() handling Handle crypto_dispatch() return values same as crp->crp_etype errors. On FreeBSD 12 many drivers returned same errors both ways, and lack of proper handling for the first ended up in assertion panic later. It was changed in FreeBSD 13, but there is no reason to not be safe. While there, skip waiting for completion, including locking and wakeup() call, for sessions on synchronous crypto drivers, such as typical aesni and software. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13563	2022-06-17 15:38:51 -07:00
Andrew	f609739985	expose snapshot count via stat(2) of .zfs/snapshot (#13559 ) Increase nlinks in stat results of ./zfs/snapshot based on snapshot count. This provides quick and efficient method for administrators to get snapshot counts without having to use libzfs or list the snapdir contents. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #13559	2022-06-17 11:44:49 -07:00
Alexander Motin	dd8671459f	Reduce ZIO io_lock contention on sorted scrub During sorted scrub multiple threads (one per vdev) are issuing many ZIOs same time, all using the same scn->scn_zio_root ZIO as parent. It causes huge lock contention on the single global lock on that ZIO. Improve it by introducing per-queue null ZIOs, children to that one, and using them instead as proxy. For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this dramatically reduces lock contention and reduces scrub time from 21 minutes down to 12.5, while actual read stages (not scan) are about 3x faster, reaching 100K blocks per second per vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13553	2022-06-15 14:25:08 -07:00
crass	bc00d2c711	Add support for ARCH=um for x86 sub-architectures When building modules (as well as the kernel) with ARCH=um, the options -Dsetjmp=kernel_setjmp and -Dlongjmp=kernel_longjmp are passed to the C preprocessor for C files. This causes the setjmp and longjmp used in module/lua/ldo.c to be kernel_setjmp and kernel_longjmp respectively in the object file. However, the setjmp and longjmp that is intended to be called is defined in an architecture dependent assembly file under the directory module/lua/setjmp. Since it is an assembly and not a C file, the preprocessor define is not given and the names do not change. This becomes an issue when modpost is trying to create the Module.symvers and sees no defined symbol for kernel_setjmp and kernel_longjmp. To fix this, if the macro CONFIG_UML is defined, then setjmp and longjmp macros are undefined. When building with ARCH=um for x86 sub-architectures, CONFIG_X86 is not defined. Instead, CONFIG_UML_X86 is defined. Despite this, the UML x86 sub-architecture can use the same object files as the x86 architectures because the x86 sub-architecture UML kernel is running with the same instruction set as CONFIG_X86. So the modules/Kbuild build file is updated to add the same object files that CONFIG_X86 would add when CONFIG_UML_X86 is defined. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Glenn Washburn <development@efficientek.com> Closes #13547	2022-06-15 14:22:52 -07:00
Damian Szuberski	9884319666	Fix clang 13 compilation errors ``` os/linux/zfs/zvol_os.c:1111:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] add_disk(zv->zv_zso->zvo_disk); ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~ zpl_xattr.c:1579:1: warning: no previous prototype for function 'zpl_posix_acl_release_impl' [-Wmissing-prototypes] ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13551	2022-06-15 14:20:28 -07:00
Allan Jude	4ff7a8fa2f	Replace ZPROP_INVAL with ZPROP_USERPROP where it means a user property Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Klara Inc. Closes #12676	2022-06-14 11:27:53 -07:00
Ryan Moeller	9e605cf155	spl: Use a clearer name for the user namespace fd This fd has nothing to do with cleanup, that's just the name of the field in zfs_cmd_t that was used to pass it to the kernel. Call it what it is, an fd for a user namespace. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #13554	2022-06-14 08:14:19 -07:00
Alexander Motin	87b46d63b2	Improve sorted scan memory accounting Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should count 2x sizeof (range_seg_gap_t) per node. And since average B-tree memory efficiency is about 75%, we should increase it to 3x. Previous code under-counted up to 30% of the memory usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13537	2022-06-10 10:01:46 -07:00
Will Andrews	4ed5e25074	Add Linux namespace delegation support This allows ZFS datasets to be delegated to a user/mount namespace Within that namespace, only the delegated datasets are visible Works very similarly to Zones/Jailes on other ZFS OSes As a user: ``` $ unshare -Um $ zfs list no datasets available $ echo $$ 1234 ``` As root: ``` # zfs list NAME ZONED MOUNTPOINT containers off /containers containers/host off /containers/host containers/host/child off /containers/host/child containers/host/child/gchild off /containers/host/child/gchild containers/unpriv on /unpriv containers/unpriv/child on /unpriv/child containers/unpriv/child/gchild on /unpriv/child/gchild # zfs zone /proc/1234/ns/user containers/unpriv ``` Back to the user namespace: ``` $ zfs list NAME USED AVAIL REFER MOUNTPOINT containers 129M 47.8G 24K /containers containers/unpriv 128M 47.8G 24K /unpriv containers/unpriv/child 128M 47.8G 128M /unpriv/child ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Will Andrews <will.andrews@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Buddy <https://buddy.works> Closes #12263	2022-06-10 09:51:46 -07:00
Alexander Motin	fc5200aa9b	AVL: Remove obsolete branching optimizations Modern Clang and GCC can successfully implement simple conditions without branching with math and flag operations. Use of arrays for translation no longer helps as much as it was 14+ years ago. Disassemble of the code generated by Clang 13.0.0 on FreeBSD 13.1, Clang 14.0.4 on FreeBSD 14 and GCC 10.2.1 on Debian 11 with this change still shows no branching instructions. Profiling of CPU-bound scan stage of sorted scrub shows reproducible reduction of time spent inside avl_find() from 6.52% to 4.58%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13540	2022-06-09 15:27:36 -07:00
Tony Hutter	6f73d02168	zvol: Support blk-mq for better performance Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes. This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the older BIO interfaces. Note that you must set the `zvol_use_blk_mq` module param to enable the blk-mq API. It is disabled by default. In addition, this commit lets the zvol blk-mq layer process whole `struct request` IOs at a time, rather than breaking them down into their individual BIOs. This reduces dbuf lock contention and overhead versus the legacy zvol submit_bio() codepath. sequential dd to one zvol, 8k volblocksize, no O_DIRECT: legacy submit_bio() 292MB/s write 453MB/s read this commit 453MB/s write 885MB/s read It also introduces a new `zvol_blk_mq_chunks_per_thread` module parameter. This parameter represents how many volblocksize'd chunks to process per each zvol thread. It can be used to tune your zvols for better read vs write performance (higher values favor write, lower favor read). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #13148 Issue #12483	2022-06-09 08:10:38 -06:00
Tino Reichardt	985c33b132	Introduce BLAKE3 checksums as an OpenZFS feature This commit adds BLAKE3 checksums to OpenZFS, it has similar performance to Edon-R, but without the caveats around the latter. Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3 Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3 Short description of Wikipedia: BLAKE3 is a cryptographic hash function based on Bao and BLAKE2, created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real World Crypto. BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given enough input. The official Rust and C implementations are dual-licensed as public domain (CC0) and the Apache License. Along with adding the BLAKE3 hash into the OpenZFS infrastructure a new benchmarking file called chksum_bench was introduced. When read it reports the speed of the available checksum functions. On Linux: cat /proc/spl/kstat/zfs/chksum_bench On FreeBSD: sysctl kstat.zfs.misc.chksum_bench This is an example output of an i3-1005G1 test system with Debian 11: implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1196 1602 1761 1749 1762 1759 1751 skein-generic 546 591 608 615 619 612 616 sha256-generic 240 300 316 314 304 285 276 sha512-generic 353 441 467 476 472 467 426 blake3-generic 308 313 313 313 312 313 312 blake3-sse2 402 1289 1423 1446 1432 1458 1413 blake3-sse41 427 1470 1625 1704 1679 1607 1629 blake3-avx2 428 1920 3095 3343 3356 3318 3204 blake3-avx512 473 2687 4905 5836 5844 5643 5374 Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1840 2458 2665 2719 2711 2723 2693 skein-generic 870 966 996 992 1003 1005 1009 sha256-generic 415 442 453 455 457 457 457 sha512-generic 608 690 711 718 719 720 721 blake3-generic 301 313 311 309 309 310 310 blake3-sse2 343 1865 2124 2188 2180 2181 2186 blake3-sse41 364 2091 2396 2509 2463 2482 2488 blake3-avx2 365 2590 4399 4971 4915 4802 4764 Output on Debian 5.10.0-9-powerpc64le system: (POWER 9) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1213 1703 1889 1918 1957 1902 1907 skein-generic 434 492 520 522 511 525 525 sha256-generic 167 183 187 188 188 187 188 sha512-generic 186 216 222 221 225 224 224 blake3-generic 153 152 154 153 151 153 153 blake3-sse2 391 1170 1366 1406 1428 1426 1414 blake3-sse41 352 1049 1212 1174 1262 1258 1259 Output on Debian 5.10.0-11-arm64 system: (Pi400) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 487 603 629 639 643 641 641 skein-generic 271 299 303 308 309 309 307 sha256-generic 117 127 128 130 130 129 130 sha512-generic 145 165 170 172 173 174 175 blake3-generic 81 29 71 89 89 89 89 blake3-sse2 112 323 368 379 380 371 374 blake3-sse41 101 315 357 368 369 364 360 Structurally, the new code is mainly split into these parts: - 1x cross platform generic c variant: blake3_generic.c - 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512) - 2x assembly for ARMv8 (NEON converted from SSE2) - 2x assembly for PPC64-LE (POWER8 converted from SSE2) - one file for switching between the implementations Note the PPC64 assembly requires the VSX instruction set and the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly. Reviewed-by: Felix Dörre <felix@dogcraft.de> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Closes #10058 Closes #12918	2022-06-08 15:55:57 -07:00
Alexander Motin	42cf2ad0e4	Remove wrong assertion in log spacemap It is typical, but not generally true that if log summary has more blocks it must also have unflushed metaslabs. Normally with metaslabs flushed in order it works, but there are known exceptions, such as device removal or metaslab being loaded during its flush attempt. Before `600a02b884` if spa_flush_metaslabs() hit loading metaslab it usually stopped (unless memlimit is also exceeded), but now it may flush more metaslabs, just skipping that particular one. This increased chances of assertion to fire when the skipped metaslab is flushed on next iteration if all other metaslabs in that summary entry are already flushed out of order. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13486 Closes #13513	2022-06-01 09:54:35 -07:00
Allan Jude	2310dba9eb	Fix typo in zil_commit() comment block Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #13518	2022-05-31 15:37:46 -07:00
Brian Behlendorf	c2c2e7bb8b	Linux 5.19 compat: aops->read_folio() As of the Linux 5.19 kernel the readpage() address space operation has been replaced by read_folio(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:31 -07:00
Brian Behlendorf	a12a5cb5b8	Linux 5.19 compat: blkdev_issue_secure_erase() Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure erase functionality from the blkdev_issue_discard() function. The blkdev_issue_secure_erase() must now be issued to issue a secure erase. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:26 -07:00
Brian Behlendorf	e2c31f2bc7	Linux 5.19 compat: bdev_max_secure_erase_sectors() Linux 5.19 commit torvalds/linux@44abff2c0 removed the blk_queue_secure_erase() helper function. The preferred interface is to now use the bdev_max_secure_erase_sectors() function to check for discard support. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:22 -07:00
Brian Behlendorf	5e4aedaca7	Linux 5.19 compat: bdev_max_discard_sectors() Linux 5.19 commit torvalds/linux@70200574cc removed the blk_queue_discard() helper function. The preferred interface is to now use the bdev_max_discard_sectors() function to check for discard support. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:17 -07:00
Brian Behlendorf	5f264996f4	Linux 5.18 compat: bio_alloc() As for the Linux 5.18 kernel bio_alloc() expects a block_device struct as an argument. This removes the need for the bio_set_dev() compatibility code for 5.18 and newer kernels. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:03 -07:00
Kevin Jin	152d6fda54	Fix inflated quiesce time caused by lwb_tx during zil_commit() In current zil_commit() process, transaction lwb_tx is assigned in zil_lwb_write_issue(), and is committed in zil_lwb_flush_vdevs_done(). Thus, during lwb write out process, the txg is held in open or quiesing state, until zil_lwb_flush_vdevs_done() is called. If the zil's zio latency is high, it will cause txg_sync_thread() to starve. The goal here is to defer waiting for zil_lwb_flush_vdevs_done to the 'syncing' txg state. That is, in zil_sync(). In this patch, it achieves the goal without holding transaction. A new function zil_lwb_flush_wait_all() is introduced. It waits for the completion of all the zil_lwb_flush_vdevs_done() by given txg. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12321	2022-05-26 09:36:14 -07:00
Alexander Motin	6aa8c21a2a	More speculative prefetcher improvements - Make prefetch distance adaptive: up to 4MB prefetch doubles for every, hit same as before, but after that it grows by 1/8 every time the prefetch read does not complete in time to satisfy the demand. My tests show that 4MB is sufficient for wide NVMe pool to saturate single reader thread at 2.5GB/s, while new 64MB maximum allows the same thread to reach 1.5GB/s on wide HDD pool. Further distance increase may increase speed even more, but less dramatic and with higher latency. - Allow early reuse of inactive prefetch streams: streams that never saw hits can be reused immediately if there is a demand, while others can be reused after 1s of inactivity, starting with the oldest. After 2s of inactivity streams are deleted to free resources same as before. This allows by several times increase strided read performance on HDD pool in presence of simultaneous random reads, previously filling the zfetch_max_streams limit for seconds and so blocking most of prefetch. - Always issue intermediate indirect block reads with SYNC priority. Each of those reads if delayed for longer may delay up to 1024 other block prefetches, that may be not good for wide pools. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13452	2022-05-25 10:12:52 -07:00
Paul Dagnelie	7829b465a7	Cancel in-progress rebuilds when we finish removal This issue was discovered by zloop runs. When a mirror or other redundant top-level vdev has a disk failure, and the disk is replaced, the rebuild process occurs. A removal can happen while this is in progress. If the removal completes before the rebuild does, the removal process will try to free the vdev that is still in use. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #13498	2022-05-25 09:25:13 -07:00
Alexander Motin	84d0a03f3e	Refactor Log Size Limit Original Log Size Limit implementation blocked all writes in case of limit reached until the TXG is committed and the log is freed. It caused huge delays and following speed spikes in application writes. This implementation instead smoothly throttles writes, using exactly the same mechanism as used for dirty data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: jxdking <lostking2008@hotmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Issue #12284 Closes #13476	2022-05-24 09:46:35 -07:00
Rich Ercolani	f375b23c02	Tiered early abort, zstd edition It turns out that "do LZ4 and zstd-1 both fail" is a great heuristic for "don't even bother trying higher zstd tiers". By way of illustration: $ cat /incompress \| mbuffer \| zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_normal summary: 39.8 GiByte in 3min 40.2sec - average of 185 MiB/s $ echo 3 \| sudo tee /sys/module/zzstd/parameters/zstd_lz4_pass 3 $ cat /incompress \| mbuffer -m 4G \| zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_patched summary: 39.8 GiByte in 48.6sec - average of 839 MiB/s $ sudo zfs list -p -o name,used,lused,ratio evenfaster/lowcomp_1M_zstd12_normal evenfaster/lowcomp_1M_zstd12_patched NAME USED LUSED RATIO evenfaster/lowcomp_1M_zstd12_normal 39549931520 42721221632 1.08 evenfaster/lowcomp_1M_zstd12_patched 39626399744 42721217536 1.07 $ python3 -c "print(39626399744 - 39549931520)" 76468224 $ I'll take 76 MB out of 42 GB for > 4x speedup. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13244	2022-05-24 09:43:22 -07:00
Brian Behlendorf	2cd0f98f4a	Verify BPs in spa_load_verify_cb() and dsl_scan_visitbp() We want `zpool import` to be highly robust and never panic, even when encountering corrupt metadata. This is already handled in the arc_read() code path, which covers most cases, but spa_load_verify_cb() relies on zio_read() and is responsible for verifying the block pointer. During import it is also possible to encounter blocks pointers which contain ZIO_COMPRESS_INHERIT and ZIO_CHECKSUM_INHERIT values. Relax the verification function slightly to allow this. Futhermore, extend dsl_scan_recurse() to verify the block pointer contents of level zero blocks which are not of type DMU_OT_DNODE or DMU_OT_OBJSET. This is handled by arc_read() in the other cases. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13124 Closes #13360	2022-05-20 10:36:14 -07:00
Andrew	00ac77464e	Expose zpool guids through kstats There are times when end-users may wish to have a fast and convenient method to get zpool guid without having to use libzfs. This commit exposes the zpool guid via kstats in similar manner to the zpool state. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #13466	2022-05-18 10:25:33 -07:00
наб	de82164518	linux: spl: generic: ddi_strto: match solaris ddi_strto(9) Recognise initial whitespace, + in both cases, and - also in unsigneds Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:47 -07:00
наб	354a1bfb8e	linux: spl: generic: ddi_strtou##type: elide unused flag Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:44 -07:00
наб	c25b281378	Remove hw_serial, ddi_strtoul() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:31 -07:00
Rich Ercolani	bd88c036e6	Added a workaround for Linux KASAN builds Linux passes -Wframe-larger-than=1024, which breaks our build in a number of places with -Werror. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13450	2022-05-11 13:26:55 -07:00
szubersk	e0911f7b7f	autoconf: Fail when __copy_from_user_inatomic is a non-GPL symbol A followup to `849c14e048` Fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009242 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13389	2022-05-11 10:32:51 -07:00
наб	c8970f52ed	autoconf: use include directives instead of recursing down lib As a bonus, this also adds zfs-mount-generator (previously undescended down) and libzstd (not included) to CppCheck As a bonus bonus, abigail rules work out-of-tree, too Against current trunk: $ diff -U0 ./destdir.listing ~/store/code/zfs/destdir.listing -destdir/usr/local/include/libspl/sscanf.h $ diff --color -U0 ./zfs-2.1.99.tar.gz.listing ../oot/zfs-2.1.99.tar.gz.listing \| grep -v @@ \| grep -v /Makefile -zfs-2.1.99/config/Abigail.am -zfs-2.1.99/lib/libspl/include/util/ -zfs-2.1.99/lib/libspl/include/util/sscanf.h $ diff --color -U0 ./zfs-2.1.99.tar.gz.listing ../oot/zfs-2.1.99.tar.gz.listing \| grep -v @@ \| grep /Makefile -zfs-2.1.99/lib/libavl/Makefile.in -zfs-2.1.99/lib/libefi/Makefile.in -zfs-2.1.99/lib/libicp/Makefile.in -zfs-2.1.99/lib/libnvpair/Makefile.in -zfs-2.1.99/lib/libshare/Makefile.in -zfs-2.1.99/lib/libspl/include/Makefile.in -zfs-2.1.99/lib/libspl/include/os/freebsd/Makefile.am -zfs-2.1.99/lib/libspl/include/os/freebsd/Makefile.in -zfs-2.1.99/lib/libspl/include/os/freebsd/sys/Makefile.am -zfs-2.1.99/lib/libspl/include/os/freebsd/sys/Makefile.in -zfs-2.1.99/lib/libspl/include/os/linux/Makefile.am -zfs-2.1.99/lib/libspl/include/os/linux/Makefile.in -zfs-2.1.99/lib/libspl/include/os/linux/sys/Makefile.am -zfs-2.1.99/lib/libspl/include/os/linux/sys/Makefile.in -zfs-2.1.99/lib/libspl/include/os/Makefile.am -zfs-2.1.99/lib/libspl/include/os/Makefile.in -zfs-2.1.99/lib/libspl/include/rpc/Makefile.am -zfs-2.1.99/lib/libspl/include/rpc/Makefile.in -zfs-2.1.99/lib/libspl/include/sys/dktp/Makefile.am -zfs-2.1.99/lib/libspl/include/sys/dktp/Makefile.in -zfs-2.1.99/lib/libspl/include/sys/Makefile.am -zfs-2.1.99/lib/libspl/include/sys/Makefile.in -zfs-2.1.99/lib/libspl/include/util/Makefile.am -zfs-2.1.99/lib/libspl/include/util/Makefile.in -zfs-2.1.99/lib/libspl/Makefile.in -zfs-2.1.99/lib/libtpool/Makefile.in -zfs-2.1.99/lib/libunicode/Makefile.in -zfs-2.1.99/lib/libuutil/Makefile.in -zfs-2.1.99/lib/libzfsbootenv/Makefile.in -zfs-2.1.99/lib/libzfs_core/Makefile.in -zfs-2.1.99/lib/libzfs/Makefile.in -zfs-2.1.99/lib/libzpool/Makefile.in -zfs-2.1.99/lib/libzstd/Makefile.in -zfs-2.1.99/lib/libzutil/Makefile.in -zfs-2.1.99/lib/Makefile.in Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13316	2022-05-10 10:18:11 -07:00
наб	6fc34371e1	libzfs: pool: fix false-positives -Wmaybe-uninitialised As noted by gcc (Debian 10.2.1-6) 10.2.1 20210110 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13316	2022-05-10 10:18:06 -07:00
наб	be91239efa	module: Makefile: cppcheck: zfs_config.h lives in builddir Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13316	2022-05-10 10:18:02 -07:00
hping	a18d13c200	abd_os: remove redundant refcount creation for abd_children Refcount creation for abd_zero_scatter->abd_children is redundant in abd_alloc_zero_scatter, as it has been done in abd_init_struct. In addition, abd_children is undefined when ZFS_DEBUG is disabled, the reference of abd_children in abd_alloc_zero_scatter breaks build of libzpool when ZFS_DEBUG is disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Ping Huang <huangping@smartx.com> Closes #13429	2022-05-09 16:30:16 -07:00
Aidan Harris	493b6e5607	Fix functions without a prototype clang-15 emits the following error message for functions without a prototype: fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aidan Harris <me@aidanharr.is> Closes #13421	2022-05-06 11:57:37 -07:00
Rich Ercolani	7bf06f7262	Corrected edge case in uncompressed ARC->L2ARC handling I genuinely don't know why this didn't come up before, but adding the LZ4 early abort pointed out this flaw, in which we're allocating a buffer of one size, and then telling the compressor that we're handing it buffers of a different size, which may be Very Different - say, allocating 512b and then telling it the inputs are 128k. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13375	2022-05-04 11:59:30 -07:00
Mateusz Guzik	81b8b2d004	FreeBSD: use zero_region instead of allocating a dedicated page Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13406	2022-05-04 11:46:37 -07:00
Alexander Motin	c55b293287	Improve mg_aliquot math When calculating mg_aliquot alike to #12046 use number of unique data disks in the vdev, not the total number of children vdev. Increase default value of the tunable from 512KB to 1MB to compensate. Before this change each disk in striped pool was getting 512KB of sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB. After this change in all the cases each disk should get 1MB. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13388	2022-05-04 11:33:42 -07:00
Brian Behlendorf	34dbc618f5	Reduce dbuf_find() lock contention Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggregrated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somehwat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the siutation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13405	2022-05-04 11:17:29 -07:00
Shaan Nobee	411f4a018d	Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shaan Nobee <sniper111@gmail.com> Closes #12662 Closes #12790	2022-05-03 13:23:26 -07:00
Pawel Jakub Dawidek	a64d757aa4	FreeBSD: Clean up the use of ioflags - Prefer O_* flags over F* flags that mostly mirror O_* flags anyway, but O_* flags seem to be preferred. - Simplify the code as all the F*SYNC flags were defined as FFSYNC flag. - Don't define FRSYNC flag, so we don't generate unnecessary ZIL commits. - Remove EXCL define, FreeBSD ignores the excl argument for zfs_create() anyway. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13400	2022-05-02 16:26:28 -07:00
Jitendra Patidar	159c6fd154	Add missing replay entry in zvol_replay_vector for TX_SETSAXATTR Commit `361a7e8` (log xattr=sa create/remove/update to ZIL) introduced a TX_SETSAXATTR, but missed to add a corresponding entry in zvol_replay_vector. Adding a missing replay entry in zvol_replay_vector. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #13396 Closes #13395	2022-05-02 11:01:26 -07:00
Brian Behlendorf	1bf3abc634	Silence unused-but-set-variable warnings Clang 13.0.0 added support for `Wunused-but-set-parameter` and `-Wunused-but-set-variable` which correctly detects two unused variables in zstd resulting in a build failure. This commit annotates these instances accordingly. https://releases.llvm.org/13.0.1/tools/clang/docs/ReleaseNotes.html#id6 In FSE_createCTable(), malloc() is intentionally defined as NULL when compiled in the kernel so the variable is unused. zstd/lib/compress/fse_compress.c:307:12: error: variable 'size' set but not used [-Werror,-Wunused-but-set-variable] Additionally, in ZSTD_seqDecompressedSize() the assert is compiled out similarly resulting in an unused variable. zstd/lib/compress/zstd_compress_superblock.c:412:12: error: variable 'litLengthSum' set but not used [-Werror,-Wunused-but-set-variable] Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13382	2022-04-29 14:21:11 -07:00
Rich Ercolani	f2330bd156	Default zfs_max_recordsize to 16M Increase the default allowed maximum recordsize from 1M to 16M. As described in the zfs(4) man page, there are significant costs which need to be considered before using very large blocks. However, there are scenarios where they make good sense and it should no longer be necessary to artificially restrict their use behind a module option. Note that for 32-bit platforms we continue to leave this restriction in place due to the limited virtual address space available (256-512MB). On these systems only a handful of blocks could be cached at any one time severely impacting performance and potentially stability. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12830 Closes #13302	2022-04-28 15:12:24 -07:00
Brian Behlendorf	63b18e4097	Fix O_APPEND for Linux 3.15 and older kernels When using a Linux kernel which predates the iov_iter interface the O_APPEND flag should be applied in zpl_aio_write() via the call to generic_write_checks(). The updated pos variable was incorrectly ignored resulting in the current offset being used. This issue should only realistically impact the RHEL/CentOS 7.x kernels which are based on Linux 3.10. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13370 Closes #13377	2022-04-27 12:56:17 -07:00
Satadru Pramanik	7dde17e860	Linux 5.18 compat: replace __set_page_dirty_nobuffers Replace __set_page_dirty_nobuffers with filemap_dirty_folio. Upstream-commit: 6b1f86f8e9c7f9de7ca1cb987b2cf25e99b1ae3a ("Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache ") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Authored-by: Satadru Pramanik <satadru@gmail.com> Signed-off-by: Satadru Pramanik <satadru@gmail.com> Closes #13325 Closes #13380	2022-04-27 12:54:17 -07:00
Damian Szuberski	849c14e048	PPC get_user workaround Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic() inline helpers very indirectly include a reference to the GPL'd array mmu_feature_keys[] and fails to build. Workaround this by using copy_from_user() and throwing EFAULT for any calls to __copy_from_user_inatomic(). This is a workaround until a fix for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458 "powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #11958 Closes #12590 Closes #13367	2022-04-26 10:52:40 -07:00
Alexander Motin	600a02b884	Improve log spacemap load time Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes. This patch improves the situation from five sides: - By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number. - By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump. - By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool. - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times. - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes. As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12789	2022-04-26 10:44:21 -07:00
George Amanakis	0409d33273	Improve zpool status output, list all affected datasets Currently, determining which datasets are affected by corruption is a manual process. The primary difficulty in reporting the list of affected snapshots is that since the error was initially found, the snapshot where the error originally occurred in, may have been deleted. To solve this issue, we add the ID of the head dataset of the original snapshot which the error was detected in, to the stored error report. Then any time a filesystem is deleted, the errors associated with it are deleted as well. Any time a clone promote occurs, we modify reports associated with the original head to refer to the new head. The stored error reports are identified by this head ID, the birth time of the block which the error occurred in, as well as some information about the error itself are also stored. Once this information is stored, we can find the set of datasets affected by an error by walking back the list of snapshots in the given head until we find one with the appropriate birth txg, and then traverse through the snapshots of the clone family, terminating a branch if the block was replaced in a given snapshot. Then we report this information back to libzfs, and to the zpool status command, where it is displayed as follows: pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3 08:27:57 2021 config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 sdb ONLINE 0 0 1.58K errors: Permanent errors have been detected in the following files: test@1:/test.0.0 /test/test.0.0 /test/1clone/test.0.0 A new feature flag is introduced to mark the presence of this change, as well as promotion and backwards compatibility logic. This is an updated version of #9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain <tulsi.jain@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9175 Closes #12812	2022-04-25 17:25:42 -07:00

... 5 6 7 8 9 ...

4356 Commits