Archive-Team/zfs

Commit Graph

Author	SHA1	Message	Date
Alexander Motin	b01a8cc2c0	zil: Don't expect zio_shrink() to succeed. At least for RAIDZ zio_shrink() does not reduce zio size, but reduced wsz in that case likely results in writing uninitialized memory. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14853	2023-06-02 11:17:11 -07:00
Alexander Motin	a727848e05	Remove single parent assertion from zio_nowait(). We only need to know if ZIO has any parent there. We do not care if it has more than one, but use of zio_unique_parent() == NULL asserts that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14823	2023-06-02 11:17:11 -07:00
Alexander Motin	b2ede77bf9	Fix two abd_gang_add_gang() issues. - There is no reason to assert that added gang is not empty. It may be weird to add an empty gang, but it is legal. - When moving chain list from the added gang clear its size, or it will trigger assertion in abd_verify() when that gang is freed. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14816	2023-06-02 11:17:11 -07:00
Alexander Motin	c1b9dc735f	Mark TX_COMMIT transaction with TXG_NOTHROTTLE. TX_COMMIT has no on-disk representation and does not produce any more dirty data. It should not wait for anything, and even just skipping the checks if not waiting gives improvement noticeable in profiler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14798	2023-06-02 11:17:11 -07:00
Alexander Motin	e271cd7a65	Fix positive ABD size assertion in abd_verify(). Gang ABDs without children are legal, and they do have zero size. For other ABD types zero size doesn't make much sense and is likely not working correctly now. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14795	2023-06-02 11:17:11 -07:00
Mariusz Zaborski	7d26967d4e	Move zap_attribute_t to the heap in dsl_deadlist_merge In the case of a regular compilation, the compiler raises a warning for the dsl_deadlist_merge function that the stack size is too large. In debug build this can generate an error. Move large structures to heap. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #14524	2023-06-01 08:34:52 -07:00
Luís Henriques	671b1af1bc	Fix NULL pointer dereference when doing concurrent 'send' operations A NULL pointer dereference will occur when doing a 'zfs send -S' on a dataset that is still being received. The problem is that the new 'send' will rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will fail), but then dmu_send() will still do the dsl_dataset_disown(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Luís Henriques <henrix@camandro.org> Closes #14903 Closes #14890	2023-05-31 17:02:38 -07:00
Mateusz Guzik	07a2ba541d	FreeBSD: add missing vop_fplookup assignments It became illegal to not have them as of 5f6df177758b9dff88e4b6069aeb2359e8b0c493 ("vfs: validate that vop vectors provide all or none fplookup vops") upstream. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14788	2023-05-30 08:53:21 -07:00
rob-wing	f786232b2a	FreeBSD: don't verify recycled vnode for zfs control directory Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 #7 0xffffffff80669841 at vgone+0x31 #8 0xffffffff8065806d at vfs_hash_insert+0x26d #9 0xffffffff81a39069 at sfs_vgetx+0x149 #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #11 0xffffffff8065a28c at lookup+0x45c #12 0xffffffff806594b9 at namei+0x259 #13 0xffffffff80676a33 at kern_statat+0xf3 #14 0xffffffff8067712f at sys_fstatat+0x2f #15 0xffffffff808ae50c at amd64_syscall+0x10c #16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. 
After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d #8 0xffffffff81a39069 at sfs_vgetx+0x149 #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #10 0xffffffff80661c2c at lookup+0x45c #11 0xffffffff80660e59 at namei+0x259 #12 0xffffffff8067e3d3 at kern_statat+0xf3 #13 0xffffffff8067eacf at sys_fstatat+0x2f #14 0xffffffff808b5ecc at amd64_syscall+0x10c #15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes #14501	2023-05-30 08:53:21 -07:00
Brian Behlendorf	45c4b3e680	Fix checkstyle warning Resolve a missed checkstyle warning. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14799	2023-05-30 08:53:21 -07:00
Mateusz Guzik	092021ba39	FreeBSD: add missing vn state transition for .zfs Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14774	2023-05-30 08:53:21 -07:00
Mateusz Guzik	aef1324d59	FreeBSD: fix up EINVAL from getdirentries on .zfs Without the change: /.zfs /.zfs/snapshot find: /.zfs: Invalid argument Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14774	2023-05-30 08:53:21 -07:00
Dimitry Andric	d1e05c6856	FreeBSD: make zfs_vfs_held() definition consistent with declaration Noticed while attempting to change FreeBSD's boolean_t into an actual bool: in include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to return a boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is defined to return an int. Make the definition match the declaration. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #14776	2023-05-30 08:53:21 -07:00
Brian Behlendorf	6ec3abcb59	Use vmem_zalloc to silence allocation warning The kmem allocation in zfs_prune_aliases() will trigger a large allocation warning on systems with 64K pages. Resolve this by switching to vmem_alloc() which internally uses kvmalloc() so the right allocator will be used based on the allocation size. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8491 Closes #14694	2023-05-26 10:10:09 -07:00
Brian Behlendorf	e97637d484	Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12451 Closes #14873	2023-05-26 10:09:04 -07:00
Brian Behlendorf	e2176f12a9	Probe vdevs before marking removed Before allowing the ZED to mark a vdev as REMOVED due to a hotplug event confirm that it is non-responsive with probe. Any device which can be successfully probed should be left ONLINE to prevent a healthy pool from being incorrectly SUSPENDED. This may occur for at least the following two scenarios. 1) Drive expansion (zpool online -e) in VMware environments. If, during the partition resize operation, a partition is removed and re-created then udev will send a removed event. 2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan) may result in a udev remove and add event being delivered. Finally, update the ZED to only kick in a spare when the removal was successful. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14859 Closes #14861	2023-05-26 10:08:04 -07:00
Akash B	c2f0aaeb3c	Fix concurrent resilvers initiated at same time For draid vdevs it was possible to initiate both the sequential and healing resilver at same time. This fixes the following two scenarios. 1) There's a window where a sequential rebuild can be started via ZED even if a healing resilver has been scheduled. - This is fixed by adding additional check in spa_vdev_attach() for any scheduled resilver and return appropriate error code when a resilver is already in progress. 2) It was possible for zpool clear to start a healing resilver when it wasn't needed at all. This occurs because during a vdev_open() the device is presumed to be healthy until it is validated by vdev_validate() and set unavailable. However, by this point an async resilver will have already been requested if the DTL isn't empty. - This is fixed by cancelling the SPA_ASYNC_RESILVER request immediately at the end of vdev_reopen() when a resilver is unneeded. Finally, added a testcase in ZTS for verification. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #14881 Closes #14892	2023-05-26 10:07:19 -07:00
Brian Behlendorf	133faca275	Add dmu_tx_hold_append() interface Provides an interface which callers can use to declare a write when the exact starting offset is not yet known. Since the full range being updated is not available only the first L0 block at the provided offset will be prefetched. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14819	2023-05-11 09:08:08 -07:00
David Hedberg	9b17d5a37d	Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block. If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg. To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com> Sponsored by: Findity AB Closes #11893 Closes #14358	2023-05-09 12:57:56 -07:00
Ameer Hamza	75ec145710	zpool import -m also removing spare and cache when log device is missing spa_import() relies on a pool config fetched by spa_tryimport() for spare/cache devices. Import flags are not passed to spa_tryimport(), which makes it return early due to a missing log device, so it never retrieves the cache and spare devices. Passing ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct configuration regardless of the missing log device. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14794	2023-05-05 09:07:07 -07:00
Herb Wartens	33075e465f	Allow MMP to bypass waiting for other threads At our site we have seen cases when multi-modifier protection is enabled (multihost=on) on our pool and the pool gets suspended due to a single disk that is failing and responding very slowly. Our pools have 90 disks in them and we expect disks to fail. The current version of MMP requires that we wait for other writers before moving on. When a disk is responding very slowly, we observed that waiting here was bad enough to cause the pool to suspend. This change allows the MMP thread to bypass waiting for other threads and reduces the chances the pool gets suspended. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Herb Wartens <hawartens@gmail.com> Closes #14659	2023-04-24 12:55:07 -07:00
Brian Behlendorf	cdbe1d65c4	Increase default zfs_rebuild_vdev_limit to 64MB When testing distributed rebuild performance with more capable hardware it was observed that increasing the zfs_rebuild_vdev_limit to 64M reduced the rebuild time by 17%. Beyond 64MB there was some improvement (~2%) but it was not significant when weighed against the increased memory usage. Memory usage is capped at 1/4 of arc_c_max. Additionally, vr_bytes_inflight_max has been moved so it's updated per-metaslab to allow the size to be adjusted while a rebuild is running. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-04-24 12:55:07 -07:00
Brian Behlendorf	fa28e26e42	Increase default zfs_scan_vdev_limit to 16MB For HDD based pools the default zfs_scan_vdev_limit of 4M per-vdev can significantly limit the maximum scrub performance. Increasing the default to 16M can double the scrub speed from 80 MB/s per disk to 160 MB/s per disk. This does increase the memory footprint during scrub/resilver but given the performance win this is a reasonable trade off. Memory usage is capped at 1/4 of arc_c_max. Note that number of outstanding I/Os has not changed and is still limited by zfs_vdev_scrub_max_active. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-04-24 12:55:07 -07:00
Brian Behlendorf	9fe3da9364	Improve resilver ETAs When resilvering the estimated time remaining is calculated using the average issue rate over the current pass. Where the current pass starts when a scan was started, or restarted, if the pool was exported/imported. For dRAID pools in particular this can result in wildly optimistic estimates since the issue rate will be very high while scanning when non-degraded regions of the pool are scanned. Once repair I/O starts being issued performance drops to a realistic number but the estimated performance is still significantly skewed. To address this we redefine a pass such that it starts after a scanning phase completes so the issue rate is more reflective of recent performance. Additionally, the zfs_scan_report_txgs module option can be set to reset the pass statistics more often. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14410	2023-04-24 12:55:07 -07:00
Ameer Hamza	a68dfdb88c	Fix "Detach spare vdev in case if resilvering does not happen" Spare vdev should detach from the pool when a disk is reinserted. However, spare detachment depends on the completion of resilvering, and if resilver does not schedule, the spare vdev keeps attached to the pool until the next resilvering. When a zfs pool contains several disks (25+ mirror), resilvering does not always happen when a disk is reinserted. In this patch, spare vdev is manually detached from the pool when resilvering does not occur and it has been tested on both Linux and FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14722	2023-04-24 09:56:13 -07:00
Richard Yao	4a5950a129	Linux: zfs_fillpage() should handle partial pages from end of file After `89cd2197b9` was merged, Clang's static analyzer began complaining about a dead assignment in `zfs_fillpage()`. Upon inspection, I noticed that the dead assignment was because we are not using the calculated io_len that we should use to avoid asking the DMU to read past the end of a file. This should result in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`. This issue predates `89cd2197b9`, but its simplification of zfs_fillpage() eliminated the only use of the assignment to io_len, which made Clang's static analyzer complain about the issue. Also, as a precaution, we add an assertion that io_offset < i_size. If this ever fails, bad things will happen. Otherwise, we are blindly trusting the kernel not to give us invalid offsets. We continue to blindly trust it on non-debug kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14534	2023-04-21 13:12:35 -07:00
Brian Behlendorf	c7db374ac6	Fix buffered/direct/mmap I/O race When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13608 Closes #14498	2023-04-21 13:12:35 -07:00
Tony Hutter	a969b1b22c	Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()" This reverts commit `4b3133e671`. Users identified this commit as a possible source of data corruption: https://github.com/openzfs/zfs/issues/14753 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #14753 Closes #14761	2023-04-18 11:44:07 -07:00
Brian Behlendorf	164d184ed9	Additional limits on hole reporting Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external callers which do not take the rangelock, holes are not reported after the first sync. While not optimal this is always functionally correct. This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl(). Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14512 Closes #14641	2023-03-29 10:40:49 -07:00
Ameer Hamza	aa7258ced0	Update vdev state for spare vdev zfsd fetches new pool configuration through ZFS_IOC_POOL_STATS but it does not get an updated nvlist configuration for the spare vdev since the configuration is read by spa_spares->sav_config. This commit updates the vdev state for the spare vdev that is consumed by zfsd on spare disk hotplug. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14653	2023-03-27 11:46:24 -07:00
Ameer Hamza	dedd8243fc	zed: add hotplug support for spare vdevs This commit adds support for spare vdev hotplug. The spare vdev associated with all the pools will be marked as "Removed" when the drive is physically detached and will become "Available" when the drive is reattached. Currently, the spare vdev status does not change on the drive removal and the same is the case with reattachment. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14295	2023-03-27 11:32:09 -07:00
Ameer Hamza	43d63ab2d4	zed: post a udev change event from spa_vdev_attach() In order for zed to process the removal event correctly, a udev change event needs to be posted to sync the blkid information. spa_create() and spa_config_update() already post the event through spa_write_cachefile(). Do the same for spa_vdev_attach(), which handles vdev attachment and replacement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14172	2023-03-27 11:32:09 -07:00
Ameer Hamza	bd9a9a4e1a	zed: mark disks as REMOVED when they are removed ZED does not take any action for disk removal events if there is no spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on removal event. This means that if you are running zed and remove a disk, it will be properly marked as REMOVED. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>	2023-03-27 11:32:09 -07:00
Alexander Motin	5219a2691e	FreeBSD: Remove extra arc_reduce_target_size() call Remove arc_reduce_target_size() call from arc_prune_task(). The idea of arc_prune_task() is to remove external references on ARC metadata, such as vnodes. Since arc_prune_async() is called only from ARC itself, it makes no sense to create a parasitic loop between ARC eviction and the pruning, threatening to drop ARC to its minimum. I can't guess why it was added as part of FreeBSD to OpenZFS integration. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14639	2023-03-24 12:27:52 -05:00
Alexander Motin	48f376b0c5	Improve arc_read() error reporting While debugging a reported NULL dereference panic in dnode_hold_impl(), I found that for certain types of errors arc_read() may only return an error code, but not properly report it via the done and pio arguments. Lack of done calls may result in reference and/or memory leaks in higher level code. Lack of error reporting via pio may result in unnoticed errors there. For example, dbuf_read(), where dbuf_read_impl() ignores arc_read() return, relies completely on the pio mechanism and missed the errors. This patch makes arc_read() always call the done callback and always propagate errors to the parent zio, if either is provided. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.	2023-03-24 11:55:36 -05:00
naivekun	345f8beb58	QAT: Fix uninitialized seed in QAT compression CpaDcRqResults have to be initialized with checksum=1 for adler32. Otherwise when error CPA_DC_OVERFLOW occurs, the next compress operation will continue on previously part-compressed data, and write invalid checksum data. When zfs decompresses the compressed data, an invalid checksum will occur and lead to #14463 Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Weigang Li <weigang.li@intel.com> Reviewed-by: Chengfei Zhu <chengfeix.zhu@intel.com> Signed-off-by: naivekun <naivekun0817@gmail.com> Closes #14632 Closes #14463	2023-03-17 11:09:07 -07:00
Matthew Ahrens	4b3133e671	ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() `lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance. Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.) This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #13368 Issue #14594 Issue #14512 Issue #14009	2023-03-15 10:44:54 -07:00
Mateusz Piotrowski	576d34cb11	Turn default_bs and default_ibs into ZFS_MODULE_PARAMs The default_bs and default_ibs tunables control the default block size and indirect block size. So far, default_bs and default_ibs were tunable only on FreeBSD, e.g., sysctl vfs.zfs.default_ibs Remove the FreeBSD-specific sysctl code and expose default_bs and default_ibs as tunables on both Linux and FreeBSD using ZFS_MODULE_PARAM. One of the use cases for changing the values of those tunables is to lower the indirect block size, which may improve performance of large directories (as discussed during the OpenZFS Leadership Meeting on 2022-08-16). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14293	2023-03-07 13:58:36 -08:00
Richard Yao	6281b5c488	Add missing increment to dsl_deadlist_move_bpobj() `dc5c8006f6` was recently merged to prefetch up to 128 deadlists. Unfortunately, a loop was missing an increment, such that it will prefetch all deadlists. The performance properties of that patch probably should be re-evaluated. This was caught by CodeQL's cpp/constant-comparison check in an experimental branch where I am testing the security-and-extended queries. It complained about the `i < 128` part of the loop condition always evaluating to the same thing. The standard CodeQL configuration we use missed this because it does not include that check. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14573	2023-03-07 09:08:18 -08:00
George Amanakis	231a37c4c0	Optimize the is_l2cacheable functions by placing the most common use case (no special vdevs) first and avoid allocating new variables. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14494 Closes #14563	2023-03-07 09:07:58 -08:00
Alexander Motin	9d2e5c14b2	System-wide speculative prefetch limit. With some pathological access patterns it is possible to make ZFS accumulate almost unlimited amount of speculative prefetch ZIOs. Combined with linear ABD allocations in RAIDZ code, it appears to be possible to exhaust system KVA, triggering kernel panic. Address this by introducing a system-wide counter of active prefetch requests and blocking prefetch distance doubling per stream hits if the number of active requests is higher than ~6% of ARC size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14516	2023-03-02 14:37:07 -08:00
Alexander Motin	b644a45bd4	Prefetch on deadlists merge During snapshot deletion ZFS may issue several reads for each deadlist to merge them into next snapshot's or pool's bpobj. Number of the dead lists increases with number of snapshots. On HDD pools it may take significant time during which sync thread is blocked. This patch introduces prescient prefetch of required blocks for up to 128 deadlists ahead. Tests show reduction of time required to delete dataset with 720 snapshots with randomly overwritten file on wide HDD pool from 75-85 to 22-28 seconds. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Issue #14276 Closes #14402	2023-03-02 14:37:07 -08:00
Alexander Motin	fd0893cf1f	Introduce minimal ZIL block commit delay Despite all optimizations, tests on actual hardware show that FreeBSD kernel can't sleep for less than ~2us. Similar tests on Linux show ~50us delay at least from nanosleep() (haven't tested inside kernel). It means that on very fast log device ZIL may not be able to satisfy zfs_commit_timeout_pct block commit timeout, increasing log latency more than desired. Handle that by introduction of zil_min_commit_timeout parameter, specifying minimal timeout value where additional delays to aggregate writes may be skipped. Also skip delays if the LWB is more than 7/8 full, that often happens if I/O sizes are constant and match one of LWB sizes. Both things are applied only if there were no already outstanding log blocks, that may indicate single-threaded workload, that by definition can not benefit from the commit delays. While there, add short time moving average to zl_last_lwb_latency to make it more stable. Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of 1 there IOPS increase by 5.5% instead of expected 1%. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14418	2023-03-02 14:37:07 -08:00
Alexander Motin	edaa250bb3	Remove few pointer dereferences in dbuf_read() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14199	2023-03-02 14:37:07 -08:00
Alexander Motin	04a6ae0585	Switch dnode stats to wmsums I've noticed that some of those counters are used in hot paths like dnode_hold_impl(), and results of this change is visible in profiler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14198	2023-03-02 14:37:07 -08:00
Alexander Motin	eb68e3cd56	Micro-optimize zrl_remove() atomic_dec_32() should be a bit lighter than atomic_dec_32_nv(). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14200	2023-03-02 14:37:07 -08:00
Alexander Motin	2098a00318	Remove atomics from zh_refcount It is protected by z_hold_locks, so we do not need more serialization, simple integer math should be fine. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14196	2023-03-02 14:37:07 -08:00
Alexander Motin	82e3117095	Optimize microzaps Microzap on-disk format does not include a hash tree, expecting one to be built in RAM during mzap_open(). The built tree is linked to DMU user buffer, freed when original DMU buffer is dropped from cache. I've found that workloads accessing many large directories and having active eviction from DMU cache spend significant amount of time building and then destroying the trees. I've also found that for each 64 byte mzap element additional 64 byte tree element is allocated, that is a waste of memory and CPU caches. Improve memory efficiency of the hash tree by switching from AVL-tree to B-tree. It allows to save 24 bytes per element just on pointers. Save 32 bits on mze_hash by storing only upper 32 bits since lower 32 bits are always zero for microzaps. Save 16 bits on mze_chunkid, since microzap can never have so many elements. Respectively with the 16 bits there can be no more than 16 bits of collision differentiators. As result, struct mzap_ent now drops from 48 (rounded to 64) to 8 bytes. Tune B-trees for small data. Reduce BTREE_CORE_ELEMS from 128 to 126 to allow struct zfs_btree_core in case of 8 byte elements to pack into 2KB instead of 4KB. Aside of the microzaps it should also help 32bit range trees. Allow custom B-tree leaf size to reduce memmove() time. Split zap_name_alloc() into zap_name_alloc() and zap_name_init_str(). It allows to not waste time allocating/freeing memory when processing multiple names in a loop during mzap_open(). Together on a pool with 10K directories of 1800 files each and DMU cache limited to 128MB this reduces time of `find . -name zzz` by 41% from 7.63s to 4.47s, and saves additional ~30% of CPU time on the DMU cache reclamation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. 
Closes #14039 (cherry picked from commit `9dcdee7889`)	2023-03-02 14:37:07 -08:00
George Amanakis	7043742828	Revert zfeature_active() to static Signed-off-by: George Amanakis <gamanakis@gmail.com>	2023-03-01 09:36:19 -08:00
George Amanakis	aa256549d1	Move dmu_buf_rele() after dsl_dataset_sync_done() Otherwise the dataset may be freed after the last dmu_buf_rele() leading to a panic. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14522 Closes #14523	2023-03-01 09:36:19 -08:00
George Amanakis	083239575a	Partially revert `eee9362a7` With commit `34ce4c42f` applied, there is no need for `eee9362a7`. Revert that aside from the test. All tests introduced in those commits pass. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14502	2023-03-01 09:36:19 -08:00
George Amanakis	57159a519b	Fix a race condition in dsl_dataset_sync() when activating features The zio returned from arc_write() in dmu_objset_sync() uses zio_nowait(). However we may reach the end of dsl_dataset_sync() which checks if we need to activate features in the filesystem without knowing if that zio has even run through the ZIO pipeline yet. In that case we will flag features to be activated in dsl_dataset_block_born() but dsl_dataset_sync() has already completed its run and those features will not actually be activated. Mitigate this by moving the feature activation code in dsl_dataset_sync_done(). Also add new ASSERTs in dsl_scan_visitbp() checking if a block contradicts any filesystem flags. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13816	2023-03-01 09:36:19 -08:00
Allan Jude	e45a981f6d	Avoid a null pointer dereference in zfs_mount() on FreeBSD When mounting the root filesystem, vfs_t->mnt_vnodecovered is null This will cause zfsctl_is_node() to dereference a null pointer when mounting, or updating the mount flags, on the root filesystem, both of which happen during the boot process. Reported-by: Martin Matuska <mm@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14218	2023-02-06 10:40:16 -08:00
Allan Jude	5161e5d8a4	Allow mounting snapshots in .zfs/snapshot as a regular user Rather than doing a terrible credential swapping hack, we just check that the thing being mounted is a snapshot, and the mountpoint is the zfsctl directory, then we allow it. If the mount attempt is from inside a jail, on an unjailed dataset (mounted from the host, not by the jail), the ability to mount the snapshot is controlled by a new per-jail parameter: zfs.mount_snapshot Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Modirum MDPay Sponsored-by: Klara Inc. Closes #13758	2023-02-06 10:40:16 -08:00
Coleman Kane	232fc23c6e	linux 6.2 compat: zpl_set_acl arg2 is now struct dentry Linux 6.2 changes the second argument of the set_acl operation to be a "struct dentry *" rather than a "struct inode *". The inode* parameter is still available as dentry->d_inode, so adjust the call to the _impl function call to dereference and pass that pointer to it. Also document that the get_acl -> get_inode_acl member name change from commit `884a693` was an API change also introduced in Linux 6.2. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14415	2023-01-24 15:36:08 -08:00
Tony Hutter	11bdc5c8e8	Revert "ztest fails assertion in zio_write_gang_member_ready()" This reverts commit `0156253d29`. That commit was identified as causing IO errors on a user's encrypted dataset: https://github.com/openzfs/zfs/issues/14413 Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2023-01-24 15:35:24 -08:00
Richard Yao	7319a73921	Linux ppc64le ieee128 compat: Do not redefine __asm on external headers There is an external assembly declaration extension in GNU C that glibc uses when building with ieee128 floating point support on ppc64le. Marking that as volatile makes no sense, so the build breaks. It does not make sense to only mark this as volatile on Linux, since if we do not want the compiler reordering things on Linux, we do not want the compiler reordering things on any other platform, so we stop treating Linux specially and just manually inline the CPP macro so that we can eliminate it. This should fix the build on ppc64le. Tested-by: @gyakovlev Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14308 Closes #14384	2023-01-19 12:50:42 -08:00
George Amanakis	f806306ce0	Activate filesystem features only in syncing context When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14304 Closes #14252	2023-01-19 12:50:42 -08:00
Richard Yao	f33b298346	Illumos #15286: do_composition() needs sign awareness Authored by: Dan McDonald <danmcd@mnx.io> Reviewed by: Patrick Mooney <pmooney@pfmooney.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Illumos-issue: https://www.illumos.org/issues/15286 Illumos-commit: `f137b22e73` Porting Notes: The patch in illumos did not have much of a commit message, and did not provide attribution to the reporter, while original patch proposed to OpenZFS did, so I am listing the reporter (myself) and original patch author (also myself) below while including the original commit message with some minor corrections as part of the porting notes: In do_composition(), we have: size = u8_number_of_bytes[*p]; if (size <= 1 || (p + size) > oslast) break; There, we have type promotion from int8_t to size_t, which is unsigned. C will sign extend the value as part of the widening before treating the value as unsigned and the negative values we can encounter are error values from U8_ILLEGAL_CHAR and U8_OUT_OF_RANGE_CHAR, which are -1 and -2 respectively. The unsigned versions of these under two's complement are SIZE_MAX and SIZE_MAX-1 respectively. The bounds check is written under the assumption that `size <= 1` does a signed comparison. This is followed by a pointer comparison to see if the string has the correct length, which is fine. A little further down we have: for (i = 0; i < size; i++) tc[i] = *p++; When an error condition is encountered, this will attempt to iterate at least SIZE_MAX-1 times, which will massively overflow the buffer, which is not fine. The kernel will kill the loop as soon as it hits the kernel stack guard on Linux systems built with CONFIG_VMAP_STACK=y, which should be just about all of them. 
That prevents arbitrary code execution and just about any other bad thing that a black hat attacker might attempt with knowledge of this buffer overflow. Other systems' kernels have mitigations for unbounded in-kernel buffer overflows that will catch this too. Also, the patch in illumos-gate made an effort to fix C style issues that had been fixed in the OpenZFS/ZFSOnLinux repository. Those issues had been mentioned in the email that I originally sent them about this issue. One of the fixes had not been already done, so it is included. Another to collect_a_seq()'s arguments was handled differently in OpenZFS. For the sake of avoiding unnecessary differences, it has been adopted. This has the interesting effect that if you correct the paths in the illumos-gate patch to match the current OpenZFS repository, you can reverse apply it cleanly. Original-patch-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Co-authored-by: Dan McDonald <danmcd@mnx.io> Closes #14318 Closes #14342	2023-01-19 12:50:42 -08:00
Mateusz Guzik	be697f4339	FreeBSD: catch up to 1400077 Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14328	2023-01-19 12:50:42 -08:00
Doug Rabson	70b1b5bb98	FreeBSD: Remove stray debug printf Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Doug Rabson <dfr@rabson.org> Closes #14286 Closes #14287	2023-01-19 12:50:36 -08:00
Richard Yao	a2aabac123	Zero end of embedded block buffer in dump_write_embedded() This fixes a kernel stack leak. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Nicholas Sherlock <n.sherlock@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13778 Closes #14255	2023-01-19 12:50:36 -08:00
Allan Jude	6219190d7f	Restrict visibility of per-dataset kstats inside FreeBSD jails When inside a jail, visibility on datasets not "jailed" to the jail is restricted. However, it was possible to enumerate all datasets in the pool by looking at the kstats sysctl MIB. Only the kstats corresponding to datasets that the user has visibility on are accessible now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14254	2023-01-19 12:50:36 -08:00
Richard Yao	24a6d8316a	Fix dereference after null check in enqueue_range If the bp is NULL, we have a hole. However, when we build with assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When this happens on a hole, we will have a NULL pointer dereference. Reported-by: Coverity (CID-1524670) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14264	2023-01-19 12:50:36 -08:00
Richard Yao	572114d846	FreeBSD: zfs_register_callbacks() must implement error check correctly I read the following article and noticed a couple of ZFS bugs mentioned: https://pvs-studio.com/en/blog/posts/cpp/0377/ I decided to search for them in the modern OpenZFS codebase and then found one that matched the description of the first one: V593 Consider reviewing the expression of the 'A = B != C' kind. The expression is calculated as following: 'A = (B != C)'. zfs_vfsops.c 498 The consequence of this is that the error value is replaced with `1` when there is an error. When there is no error, 0 is correctly passed. This is a very minor issue that is unlikely to cause any real problems. The incorrect error code would either be returned to the mount command on a failure or any of `zfs receive`, `zfs recv`, `zfs rollback` or `zfs upgrade`. The second one has already been fixed. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14261	2023-01-19 12:50:36 -08:00
Matthew Ahrens	0156253d29	ztest fails assertion in zio_write_gang_member_ready() Encrypted blocks can have up to 2 DVA's, as the third DVA is reserved for the salt+IV. However, dmu_write_policy() allows non-encrypted blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and allocate 3 DVA's, since they don't need a salt+IV (they are merely authenticated). However, if such a block becomes a gang block, the gang code incorrectly limits the gang block header to 2 DVA's. This leads to a "NDVAs inversion", where a parent block (the gang block header) has less DVA's than its children (the gang members), causing an assertion failure in zio_write_gang_member_ready(). This commit addresses the problem by only restricting the gang block header to 2 DVA's if the block is actually encrypted (and thus its gang block members can have at most 2 DVA's). Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14250 Closes #14356	2023-01-10 08:44:55 -08:00
Coleman Kane	b586ea5d93	linux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2 Linux 6.2 renamed the get_acl() operation to get_inode_acl() in the inode_operations struct. This should fix Issue #14323. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14323 Closes #14331	2023-01-10 08:43:49 -08:00
Antonio Russo	138d2b29dd	Linux 6.1 compat: open inside tmpfile() commit `d27c81847b` upstream Linux 863f144 modified the .tmpfile interface to pass a struct file, rather than a struct dentry, and expect the tmpfile implementation to open inside of tmpfile(). This patch implements a configuration test that checks for this new API and appropriately sets a HAVE_TMPFILE_DENTRY flag that tracks this old API. Contingent on this flag, the appropriate API is implemented. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net> Closes #14301 Closes #14343	2023-01-09 17:15:22 -08:00
Ameer Hamza	75fbe7eb99	skip permission checks for extended attributes zfs_zaccess_trivial() calls the generic_permission() to read xattr attributes. This causes deadlock if called from zpl_xattr_set_dir() context as xattr and the dent locks are already held in this scenario. This commit skips the permissions checks for extended attributes since the Linux VFS stack already checks it before passing us the control. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>	2023-01-05 11:10:28 -08:00
Ameer Hamza	2f2d6bece8	zed: unclean disk attachment faults the vdev If the attached disk already contains a vdev GUID, it means the disk is not clean. In such a scenario, the physical path would be a match that makes the disk faulted when trying to online it. So, we would only want to proceed if either GUID matches with the last attached disk or the disk is in a clean state. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>	2023-01-05 11:09:36 -08:00
Ryan Moeller	fbbc375d43	FreeBSD: Fix potential boot panic with bad label vdev_geom_read_pool_label() can leave NULL in configs. Check for it and skip consistently when generating rootconf. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #14291 (cherry picked from commit `dc8c2f6158`)	2023-01-05 11:00:09 -08:00
szubersk	4767037bcf	vdev_raidz_math_aarch64_neonx2.c: suppress diagnostic only for GCC Signed-off-by: szubersk <szuberskidamian@gmail.com>	2022-12-09 12:07:38 -08:00
George Amanakis	c8d2ab05e1	Fix setting the large_block feature after receiving a snapshot We are not allowed to dirty a filesystem when done receiving a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will not be set on that filesystem since the filesystem is not on dp_dirty_datasets, and a subsequent encrypted raw send will fail. Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature needs to be activated and do so if appropriate. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13699 Closes #13782	2022-12-01 12:39:45 -08:00
Richard Yao	e48aaef89f	Fix NULL pointer dereference in dbuf_prefetch_indirect_done() When ZFS is built with assertions, a prefetch is done on a redacted blkptr and `dpa->dpa_dnode` is NULL, we will have a NULL pointer dereference in `dbuf_prefetch_indirect_done()`. Both Coverity and Clang's Static Analyzer caught this. Reported-by: Coverity (CID 1524671) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14210	2022-12-01 12:39:44 -08:00
Richard Yao	0e3abd2994	Lua: Fix bad bitshift in lua_strx2number() The port of lua to OpenZFS modified lua to use int64_t for numbers instead of double. As part of this, a function for calculating exponentiation was replaced with a bit shift. Unfortunately, it did not handle negative values. Also, it only supported exponent numbers with 7 digits before overflow. This supports exponents up to 15 digits before overflow. Clang's static analyzer reported this as "Result of operation is garbage or undefined" because the exponent was negative. Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14204	2022-12-01 12:39:44 -08:00
Damian Szuberski	3d1e808096	Fix clang 13 compilation errors ``` os/linux/zfs/zvol_os.c:1111:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] add_disk(zv->zv_zso->zvo_disk); ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~ zpl_xattr.c:1579:1: warning: no previous prototype for function 'zpl_posix_acl_release_impl' [-Wmissing-prototypes] ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13551 (cherry picked from commit `9884319666`)	2022-12-01 12:39:44 -08:00
наб	32f7499acf	module: zfs: vdev_removal: remove unused num_indirect Found with -Wunused-but-set-variable on Clang trunk Fixes: `a1d477c24c` ("OpenZFS 7614, 9064 - zfs device evacuation/removal") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13304	2022-12-01 12:39:44 -08:00
Rich Ercolani	fa7d572a8a	Handle and detect #13709's unlock regression (#14161) In #13709, as in #11294 before it, it turns out that `63a26454` still had the same failure mode as when it was first landed as `d1d47691`, and fails to unlock certain datasets that formerly worked. Rather than reverting it again, let's add handling to just throw out the accounting metadata that failed to unlock when that happens, as well as a test with a pre-broken pool image to ensure that we never get bitten by this again. Fixes: #13709 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2022-12-01 12:39:43 -08:00
shodanshok	d9de079a4b	Fix arc_p aggressive increase The original ARC paper called for an initial 50/50 MRU/MFU split and this is accounted in various places where arc_p = arc_c >> 1, with further adjustment based on ghost lists size/hit. However, in current code both arc_adapt() and arc_get_data_impl() aggressively grow arc_p until arc_c is reached, causing unneeded pressure on MFU and greatly reducing its scan-resistance until ghost list adjustments kick in. This patch restores the original behavior of initially having arc_p as 1/2 of total ARC, without preventing MRU to use up to 100% total ARC when MFU is empty. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #14137 Closes #14120	2022-12-01 12:39:43 -08:00
Richard Yao	957c3776f2	FreeBSD: Fix out of bounds read in zfs_ioctl_ozfs_to_legacy() There is an off by 1 error in the check. Fortunately, this function does not appear to be used in kernel space, despite being compiled as part of the kernel module. However, it is used in userspace. Callers of lzc_ioctl_fd() likely will crash if they attempt to use the unimplemented request number. This was reported by FreeBSD's coverity scan. Reported-by: Coverity (CID 1432059) Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14135	2022-12-01 12:39:43 -08:00
Serapheim Dimitropoulos	85537f77a3	Expose zfs_vdev_open_timeout_ms as a tunable Some of our customers have been occasionally hitting zfs import failures in Linux because udevd doesn't create the by-id symbolic links in time for zpool import to use them. The main issue is that the systemd-udev-settle.service that zfs-import-cache.service and other services depend on is racy. There is also an openzfs issue filed (see https://github.com/openzfs/zfs/issues/10891) outlining the problem and potential solutions. With the proper solutions being significant in terms of complexity and the priority of the issue being low for the time being, this patch exposes `zfs_vdev_open_timeout_ms` as a tunable so people that are experiencing this issue often can increase it as a workaround. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #14133	2022-12-01 12:39:43 -08:00
Brooks Davis	5f53a444b3	Remove an unused variable Clang-16 detects this set-but-unused variable which is assigned and incremented, but never referenced otherwise. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #14125	2022-12-01 12:39:43 -08:00
Richard Yao	256b74d0b0	Address warnings about possible division by zero from clangsa * The complaint in ztest_replay_write() is only possible if something went horribly wrong. An assertion will silence this and if it goes off, we will know that something is wrong. * The complaint in spa_estimate_metaslabs_to_flush() is not impossible, but seems very unlikely. We resolve this by passing the value from the `MIN()` that does not go to infinity when the variable is zero. There was a third report from Clang's scan-build, but that was a definite false positive and disappeared when checked again through Clang's static analyzer with Z3 refutation via CodeChecker. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14124	2022-12-01 12:39:43 -08:00
Allan Jude	ac01b876c9	Avoid null pointer dereference in dsl_fs_ss_limit_check() Check for cr == NULL before dereferencing it in dsl_enforce_ds_ss_limits() to lookup the zone/jail ID. Reported-by: Coverity (CID 1210459) Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14103	2022-12-01 12:39:43 -08:00
Richard Yao	e9a8fb17b5	Fix too few arguments to formatting function CodeQL reported that when the VERIFY3U condition is false, we do not pass enough arguments to `spl_panic()`. This is because the format string from `snprintf()` was concatenated into the format string for `spl_panic()`, which causes us to have an unexpected format specifier. A CodeQL developer suggested fixing the macro to have a `%s` format string that takes a stringified RIGHT argument, which would fix this. However, upon inspection, the VERIFY3U check was never necessary in the first place, so we remove it in favor of just calling `snprintf()`. Lastly, it is interesting that every other static analyzer run on the codebase did not catch this, including some that made an effort to catch such things. Presumably, all of them relied on header annotations, which we have not yet done on `spl_panic()`. CodeQL apparently is able to track the flow of arguments on their way to annotated functions, which allowed it to catch this when others did not. A future patch that I have in development should annotate `spl_panic()`, so the others will catch this too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14098	2022-12-01 12:39:43 -08:00
Pavel Snajdr	52e658edd7	Remove zpl_revalidate: fix snapshot rollback Open files, which aren't present in the snapshot, which is being rolled back to, need to disappear from the visible VFS image of the dataset. Kernel provides d_drop function to drop invalid entry from the dcache, but inode can be referenced by multiple dentries. The introduced zpl_d_drop_aliases function walks and invalidates all aliases of an inode. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #9600 Closes #14070	2022-12-01 12:39:42 -08:00
Richard Yao	3830858c5c	Fix memory leaks in dmu_send()/dmu_send_obj() If we encounter an EXDEV error when using the redacted snapshots feature, the memory used by dspp.fromredactsnaps is leaked. Clang's static analyzer caught this during an experiment in which I had annotated various headers in an attempt to improve the results of static analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13973	2022-12-01 12:39:42 -08:00
Richard Yao	af2e53f62c	Fix possible NULL pointer dereference in sha2_mac_init() If mechanism->cm_param is NULL, passing mechanism to PROV_SHA2_GET_DIGEST_LEN() will dereference a NULL pointer. Coverity reported this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-12-01 12:39:42 -08:00
Richard Yao	409c99a1d3	Fix NULL pointer dereference in spa_open_common() Calling spa_open() will pass a NULL pointer to spa_open_common()'s config parameter. Under the right circumstances, we will dereference the config parameter without doing a NULL check. Clang's static analyzer found this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-12-01 12:39:42 -08:00
Richard Yao	bbec0e60a8	Fix NULL pointer passed to strlcpy from zap_lookup_impl() Clang's static analyzer pointed out that whenever zap_lookup_by_dnode() is called, we have the following stack where strlcpy() is passed a NULL pointer for realname from zap_lookup_by_dnode(): strlcpy() zap_lookup_impl() zap_lookup_norm_by_dnode() zap_lookup_by_dnode() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-12-01 12:39:42 -08:00
Richard Yao	a5f17a94d3	fm_fmri_hc_create() must call va_end() before returning clang-tidy caught this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14044	2022-12-01 12:39:42 -08:00
Richard Yao	2453f90350	Fix theoretical array overflow in lua_typename() Out of the 12 defects in lua that coverity reports, 5 of them involve `lua_typename()` and out of the dozens of defects in ZFS that lua reports, 3 of them involve `lua_typename()` due to the ZCP code. Given all of the uses of `lua_typename()` in the ZCP code, I was surprised that there were not more. It appears that only 2 were reported because only 3 called `lua_type()`, which does a defective sanity check that allows invalid types to be passed. lua/lua@d4fb848be7 addressed this in upstream lua 5.3. Unfortunately, we did not get that fix since we use lua 5.2 and we do not have assertions enabled in lua, so the upstream solution would not do anything. While we could adopt the upstream solution and enable assertions, a simpler solution is to fix the issue by making `lua_typename()` return `internal_type_error` whenever it is called with an invalid type. This avoids the array overflow and if we ever see it appear somewhere, we will know there is a problem with the lua interpreter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13947	2022-12-01 12:39:41 -08:00
Richard Yao	c562bbefc0	Fix potential NULL pointer dereference in dsl_dataset_promote_check() If the `list_head()` returns NULL, we dereference it, right before we check to see if it returned NULL. We have defined two different pointers that both point to the same thing, which are `origin_head` and `origin_ds`. Almost everything uses `origin_ds`, so we switch them to use `origin_ds`. We also promote `origin_ds` to a const pointer so that the compiler verifies that nothing modifies it. Coverity complained about this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13967	2022-12-01 12:39:41 -08:00
Richard Yao	c6d93d0a80	FreeBSD: Fix uninitialized pointer read in spa_import_rootpool() The FreeBSD project's coverity scans found this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13923	2022-12-01 12:39:40 -08:00
Richard Yao	9f1691a964	Linux: Fix use-after-free in zfsvfs_create() Coverity reported that we pass a pointer to zfsvfs to `dmu_objset_disown()` after freeing zfsvfs in zfsvfs_create_impl() after a failure in zfsvfs_init(). We have nearly identical duplicate versions of this code for FreeBSD and Linux, but interestingly, the FreeBSD version of this code differs in such a way that it does not suffer from this bug. We remove the difference from the FreeBSD version to fix this bug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13883	2022-12-01 12:39:40 -08:00
Richard Yao	1d5e569a69	Fix use-after-free bugs in icp code These were reported by Coverity as "Read from pointer after free" bugs. Presumably, it did not report it as a use-after-free bug because it does not understand the inline assembly that implements the atomic instruction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13881	2022-12-01 12:39:40 -08:00
Richard Yao	b247d47be1	Cleanup: Make memory barrier definitions consistent across kernels We inherited membar_consumer() and membar_producer() from OpenSolaris, but we had replaced membar_consumer() with Linux's smp_rmb() in zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the Linux-only smp_rmb(). We reinstate membar_consumer() in platform independent code and fix the FreeBSD SPL to implement membar_consumer() in a way analogous to Linux. Reviewed-by: Konstantin Belousov <kib@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13843	2022-12-01 12:39:40 -08:00
Alexander Lobakin	ab22031d79	icp: fix all !ENDBR objtool warnings in x86 Asm code Currently, only Blake3 x86 Asm code has signs of being ENDBR-aware. At least, under certain conditions it includes some header file and uses some custom macro from there. Linux has its own NOENDBR since several releases ago. It's defined in the same <asm/linkage.h>, so currently <sys/asm_linkage.h> already is provided with it. Let's unify those two into one %ENDBR macro. At first, check if it's present already. If so -- use Linux kernel version. Otherwise, try to go that second way and use %_CET_ENDBR from <cet.h> if available. If no, fall back to just empty definition. This fixes a couple more 'relocations to !ENDBR' across the module. And now that we always have the latest/actual ENDBR definition, use it at the entrance of the few corresponding functions that objtool still complains about. This matches the way how it's used in the upstream x86 core Asm code. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14035	2022-12-01 12:39:39 -08:00
Alexander Lobakin	33bc03dea7	icp: fix rodata being marked as text in x86 Asm code objtool properly complains that it can't decode some of the instructions from ICP x86 Asm code. As mentioned in the Makefile, where those object files were excluded from objtool check (but they can still be visible under IBT and LTO), those are just constants, not code. In that case, they must be placed in .rodata, so they won't be marked as "allocatable, executable" (ax) in ELF headers and this effectively prevents objtool from trying to decode this data. That reveals a whole bunch of other issues in ICP Asm code, as previously objtool was bailing out after that warning message. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14035 Conflicts: module/Kbuild.in	2022-11-30 10:15:58 -08:00
Alexander Lobakin	ee93cbc9d4	icp: properly fix all RETs in x86_64 Asm code Commit `43569ee374` ("Fix objtool: missing int3 after ret warning") addressed replacing all `ret`s in x86 asm code to a macro in the Linux kernel in order to enable SLS. That was done by copying the upstream macro definitions and fixed objtool complaints. Since then, several more mitigations were introduced, including Rethunk. It requires to have a jump to one of the thunks in order to work, so the RET macro was changed again. And, as ZFS code didn't use the mainline definition, but copied it, this is currently missing. Objtool reminds about it time to time (Clang 16, CONFIG_RETHUNK=y): fs/zfs/lua/zlua.o: warning: objtool: setjmp+0x25: 'naked' return found in RETHUNK build fs/zfs/lua/zlua.o: warning: objtool: longjmp+0x27: 'naked' return found in RETHUNK build Do it the following way: * if we're building under Linux, unconditionally include <linux/linkage.h> in the related files. It is available in x86 sources since even pre-2.6 times, so doesn't need any conftests; * then, if RET macro is available, it will be used directly, so that we will always have the version actual to the kernel we build; * if there's no such macro, we define it as a simple `ret`, as it was on pre-SLS times. This ensures we always have the up-to-date definition with no need to update it manually, and at the same time is safe for the whole variety of kernels ZFS module supports. Then, there's a couple more "naked" rets left in the code, they're just defined as: .byte 0xf3,0xc3 In fact, this is just: rep ret `rep ret` instead of just `ret` seems to mitigate performance issues on some old AMD processors and most likely makes no sense as of today. Anyways, address those rets, so that they will be protected with Rethunk and SLS. Include <sys/asm_linkage.h> here which now always has RET definition and replace those constructs with just RET. This wipes the last couple of places with unpatched rets objtool's been complaining about. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14035	2022-11-30 10:15:58 -08:00
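
Commit 2453f90350 in the list above describes making `lua_typename()` return `internal_type_error` for an invalid type instead of indexing past the end of the name table. The following is a minimal sketch of that idea only, not the actual patch: the table contents and the names `sketch_typenames`, `SKETCH_NUMTYPES`, and `sketch_typename()` are hypothetical, and the real Lua table and tag numbering differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name table; the real Lua type-name table differs. */
static const char *const sketch_typenames[] = {
	"nil", "boolean", "number", "string", "table",
};

/* Number of entries in the hypothetical table. */
#define	SKETCH_NUMTYPES \
	((int)(sizeof (sketch_typenames) / sizeof (sketch_typenames[0])))

/*
 * Return the name for type tag t. A tag outside the table yields a
 * sentinel string rather than an out-of-bounds array read.
 */
static const char *
sketch_typename(int t)
{
	if (t < 0 || t >= SKETCH_NUMTYPES)
		return ("internal_type_error");
	return (sketch_typenames[t]);
}
```

With this shape, a caller that passes a bad tag gets a recognizable string back, so the problem would show up in output rather than as a silent out-of-bounds read.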

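Commit a5f17a94d3 in the list above notes that `fm_fmri_hc_create()` must call `va_end()` before returning. As a rough illustration of that general rule (not the actual ZFS code), the sketch below routes every exit of a variadic function through one label that calls `va_end()`. The function `sketch_sum_nonneg()` and its behavior are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>

/*
 * Hypothetical helper: sum n int arguments. If any argument is
 * negative, stop early and return -1. Both the early exit and the
 * normal exit pass through the single va_end() call at "out".
 */
static int
sketch_sum_nonneg(int n, ...)
{
	va_list ap;
	int sum = 0;
	int ret;

	va_start(ap, n);
	for (int i = 0; i < n; i++) {
		int v = va_arg(ap, int);
		if (v < 0) {
			ret = -1;
			goto out;	/* early exit still reaches va_end() */
		}
		sum += v;
	}
	ret = sum;
out:
	va_end(ap);
	return (ret);
}
```

A single exit label is one common way to keep `va_start()` and `va_end()` paired; calling `va_end()` before each individual `return` would work equally well.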

3792 Commits