Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
youzhongyang	04305bbd18	Linux 6.3 compat: writepage_t first arg struct folio* The type def of writepage_t in kernel 6.3 is changed to take struct folio* as the first argument. We need to detect this change and pass correct function to write_cache_pages(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14699	2023-06-05 10:59:02 -07:00
Low-power	35d43ba8ea	Workaround for Linux PowerPC GPL-only cpu_has_feature() Linux since 4.7 makes interface 'cpu_has_feature' to use jump labels on powerpc if CONFIG_JUMP_LABEL_FEATURE_CHECKS is enabled, in this case however the inline function references GPL-only symbol 'cpu_feature_keys'. ZFS currently uses 'cpu_has_feature' either directly or indirectly from several places; while it is unknown how this issue didn't break ZFS on 64-bit little-endian powerpc, it is known to break ZFS with many Linux versions on both 32-bit and 64-bit big-endian powerpc. Until this issue is fixed in Linux, we have to workaround it by overriding affected inline functions without depending on 'cpu_feature_keys'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #14590	2023-06-05 10:59:02 -07:00
Brian Behlendorf	3ad6c1692f	Linux: use filemap_range_has_page() As of the 4.13 kernel filemap_range_has_page() can be used to check if there is a page mapped in a given file range. When available this interface should be used which eliminates the need for the zp->z_is_mapped boolean. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14493	2023-06-05 10:59:02 -07:00
Shaan Nobee	9e5a297de6	Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shaan Nobee <sniper111@gmail.com> Closes #12662 Closes #12790	2023-06-05 10:59:02 -07:00
Alexander Motin	8a315a30ab	ZIL: Allow to replay blocks of any size. There seems to be no reason for ZIL blocks to be limited by 128KB other than replay code is written in such a way. This change does not increase the limit yet, just removes the artificial limitation. Avoided extra memcpy() may save us a second during replay. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.	2023-06-02 17:26:41 -07:00
Alexander Motin	b01a8cc2c0	zil: Don't expect zio_shrink() to succeed. At least for RAIDZ zio_shrink() does not reduce zio size, but reduced wsz in that case likely results in writing uninitialized memory. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14853	2023-06-02 11:17:11 -07:00
Alexander Motin	a727848e05	Remove single parent assertion from zio_nowait(). We only need to know if ZIO has any parent there. We do not care if it has more than one, but use of zio_unique_parent() == NULL asserts that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14823	2023-06-02 11:17:11 -07:00
Alexander Motin	b2ede77bf9	Fix two abd_gang_add_gang() issues. - There is no reason to assert that added gang is not empty. It may be weird to add an empty gang, but it is legal. - When moving chain list from the added gang clear its size, or it will trigger assertion in abd_verify() when that gang is freed. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14816	2023-06-02 11:17:11 -07:00
Alexander Motin	c1b9dc735f	Mark TX_COMMIT transaction with TXG_NOTHROTTLE. TX_COMMIT has no on-disk representation and does not produce any more dirty data. It should not wait for anything, and even just skipping the checks if not waiting gives improvement noticeable in profiler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14798	2023-06-02 11:17:11 -07:00
Alexander Motin	e271cd7a65	Fix positive ABD size assertion in abd_verify(). Gang ABDs without childred are legal, and they do have zero size. For other ABD types zero size doesn't have much sense and likely not working correctly now. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14795	2023-06-02 11:17:11 -07:00
Mariusz Zaborski	7d26967d4e	Move zap_attribute_t to the heap in dsl_deadlist_merge In the case of a regular compilation, the compiler raises a warning for a dsl_deadlist_merge function, that the stack size is to large. In debug build this can generate an error. Move large structures to heap. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #14524	2023-06-01 08:34:52 -07:00
Rich Ercolani	93a99c6daa	Revert "initramfs: use `mount.zfs` instead of `mount`" This broke mounting of snapshots on / for users. See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14908	2023-06-01 08:30:54 -07:00
Luís Henriques	671b1af1bc	Fix NULL pointer dereference when doing concurrent 'send' operations A NULL pointer will occur when doing a 'zfs send -S' on a dataset that is still being received. The problem is that the new 'send' will rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will fail), but then dmu_send() will still do the dsl_dataset_disown(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Luís Henriques <henrix@camandro.org> Closes #14903 Closes #14890	2023-05-31 17:02:38 -07:00
Brian Behlendorf	a836cc6ddd	ZTS: threadsappend_001_pos Correct exception path used in zts-report.py.in. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2023-05-30 15:12:44 -07:00
наб	435407eb5c	Fix test-runner on FreeBSD CLOCK_MONOTONIC_RAW is only a thing on Linux and macOS. I'm not actually sure why the previous hardcoding of a constant didn't error out, but when we removed it, it sure does now. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12995	2023-05-30 08:53:21 -07:00
Mateusz Guzik	07a2ba541d	FreeBSD: add missing vop_fplookup assignments It became illegal to not have them as of 5f6df177758b9dff88e4b6069aeb2359e8b0c493 ("vfs: validate that vop vectors provide all or none fplookup vops") upstream. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14788	2023-05-30 08:53:21 -07:00
rob-wing	f786232b2a	FreeBSD: don't verify recycled vnode for zfs control directory Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 #7 0xffffffff80669841 at vgone+0x31 #8 0xffffffff8065806d at vfs_hash_insert+0x26d #9 0xffffffff81a39069 at sfs_vgetx+0x149 #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #11 0xffffffff8065a28c at lookup+0x45c #12 0xffffffff806594b9 at namei+0x259 #13 0xffffffff80676a33 at kern_statat+0xf3 #14 0xffffffff8067712f at sys_fstatat+0x2f #15 0xffffffff808ae50c at amd64_syscall+0x10c #16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d #8 0xffffffff81a39069 at sfs_vgetx+0x149 #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #10 0xffffffff80661c2c at lookup+0x45c #11 0xffffffff80660e59 at namei+0x259 #12 0xffffffff8067e3d3 at kern_statat+0xf3 #13 0xffffffff8067eacf at sys_fstatat+0x2f #14 0xffffffff808b5ecc at amd64_syscall+0x10c #15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes #14501	2023-05-30 08:53:21 -07:00
Brian Behlendorf	45c4b3e680	Fix checkstyle warning Resolve a missed checkstyle warning. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14799	2023-05-30 08:53:21 -07:00
Mateusz Guzik	092021ba39	FreeBSD: add missing vn state transition for .zfs Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14774	2023-05-30 08:53:21 -07:00
Mateusz Guzik	aef1324d59	FreeBSD: fix up EINVAL from getdirentries on .zfs Without the change: /.zfs /.zfs/snapshot find: /.zfs: Invalid argument Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #14774	2023-05-30 08:53:21 -07:00
Dimitry Andric	d1e05c6856	FreeBSD: make zfs_vfs_held() definition consistent with declaration Noticed while attempting to change FreeBSD's boolean_t into an actual bool: in include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to return a boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is defined to return an int. Make the definition match the declaration. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #14776	2023-05-30 08:53:21 -07:00
Don Brady	30dcddaec7	Refine special_small_blocks property validation When the special_small_blocks property is being set during a pool create it enforces a limit of 128KiB even if the pool's record size is larger. If the recordsize property is being set during a pool create, then use that value instead of the default SPA_OLD_MAXBLOCKSIZE value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com> Closes #13815 Closes #14811	2023-05-27 18:23:33 -07:00
Brian Behlendorf	c094b9a928	ZTS: Add zpool_resilver_concurrent exception The zpool_resilver_concurrent test case requires the ZED which is not used on FreeBSD. Add this test to the known list of skipped tested for FreeBSD. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14904	2023-05-27 18:23:02 -07:00
Brian Behlendorf	4e24df0e81	ZTS: Add auto_replace_001_pos to exceptions The auto_replace_001_pos test case does not reliably pass on Fedora 37 and newer. Until the test case can be updated to make it reliable add it to the list of "maybe" exceptions on Linux. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14851 Closes #14852	2023-05-27 18:21:49 -07:00
Brian Behlendorf	848c4b2300	ZTS: Annotate additonal flaky test cases Update several flaky test cases in zts-report.py.in until they can be made entirely reliable. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14392	2023-05-27 18:21:49 -07:00
Brian Behlendorf	c6f6958fc0	ZTS: add snapshot/snapshot_002_pos exception Add snapshot_002_pos to the known list of occasional failures for FreeBSD until it can be made entirely reliable. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14831 Closes #14832	2023-05-27 18:21:49 -07:00
Paul Dagnelie	e1b3ab5f51	ZTS: send-c_volume is flaky We use block_device_wait to wait for the zvol block device to actually appear, and we log the result of the dd calls by using an intermediate file. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #14767	2023-05-27 18:21:49 -07:00
Paul Dagnelie	e2a96aa4eb	Storage device expansion "silently" fails on degraded vdev When a vdev is degraded or faulted, we refuse to expand it when doing online -e. However, we also don't actually cause the online command to fail, even though the disk didn't expand. This is confusing and misleading, and can result in violated expectations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes 14145	2023-05-26 10:12:57 -07:00
Brian Behlendorf	6ec3abcb59	Use vmem_zalloc to silence allocation warning The kmem allocation in zfs_prune_aliases() will trigger a large allocation warning on systems with 64K pages. Resolve this by switching to vmem_alloc() which internally uses kvmalloc() so the right allocator will be used based on the allocation size. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8491 Closes #14694	2023-05-26 10:10:09 -07:00
Brian Behlendorf	e97637d484	Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12451 Closes #14873	2023-05-26 10:09:04 -07:00
Brian Behlendorf	e2176f12a9	Probe vdevs before marking removed Before allowing the ZED to mark a vdev as REMOVED due to a hotplug event confirm that it is non-responsive with probe. Any device which can be successfully probed should be left ONLINE to prevent a healthy pool from being incorrectly SUSPENDED. This may occur for at least the following two scenarios. 1) Drive expansion (zpool online -e) in VMware environments. If, during the partition resize operation, a partition is removed and re-created then udev will send a removed event. 2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan) may result in a udev remove and add event being delivered. Finally, update the ZED to only kick in a spare when the removal was successful. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14859 Closes #14861	2023-05-26 10:08:04 -07:00
Akash B	c2f0aaeb3c	Fix concurrent resilvers initiated at same time For draid vdevs it was possible to initiate both the sequential and healing resilver at same time. This fixes the following two scenarios. 1) There's a window where a sequential rebuild can be started via ZED even if a healing resilver has been scheduled. - This is fixed by adding additional check in spa_vdev_attach() for any scheduled resilver and return appropriate error code when a resilver is already in progress. 2) It was possible for zpool clear to start a healing resilver when it wasn't needed at all. This occurs because during a vdev_open() the device is presumed to be healthy not until the device is validated by vdev_validate() and it's set unavailable. However, by this point an async resilver will have already been requested if the DTL isn't empty. - This is fixed by cancelling the SPA_ASYNC_RESILVER request immediately at the end of vdev_reopen() when a resilver is unneeded. Finally, added a testcase in ZTS for verification. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #14881 Closes #14892	2023-05-26 10:07:19 -07:00
Brian Behlendorf	b17e472983	zdb: consistent xattr output When using zdb to output the value of an xattr only interpret it as printable characters if the entire byte array is printable. Additionally, if the --parseable option is set always output the buffer contents as octal for easy parsing. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14830	2023-05-11 09:09:14 -07:00
Brian Behlendorf	133faca275	Add dmu_tx_hold_append() interface Provides an interface which callers can use to declare a write when the exact starting offset in not yet known. Since the full range being updated is not available only the first L0 block at the provided offset will be prefetched. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14819	2023-05-11 09:08:08 -07:00
Tony Hutter	7c555fee9d	pam: Fix "buffer overflow" in pam ZTS tests on F38 The pam ZTS tests were reporting a buffer overflow on F38, possibly due to F38 now setting _FORTIFY_SOURCE=3 by default. gdb and valgrind narrowed this down to a snprintf() buffer overflow in zfs_key_config_modify_session_counter(). I'm not clear why this particular snprintf() was being flagged as an overflow, but when I replaced it with an asprintf(), the test passed reliably. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #14802 Closes #14842	2023-05-11 09:07:34 -07:00
Brian Behlendorf	ecaf3ea3f2	ZTS: Minor fixes Backport two minor ZTS test case fixes from `63652e15` to resolve a few spurious failures. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2023-05-10 10:40:24 -07:00
David Hedberg	9b17d5a37d	Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block. If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg. To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com> Sponsored by: Findity AB Closes #11893 Closes #14358	2023-05-09 12:57:56 -07:00
Ameer Hamza	75ec145710	zpool import -m also removing spare and cache when log device is missing spa_import() relies on a pool config fetched by spa_try_import() for spare/cache devices. Import flags are not passed to spa_tryimport(), which makes it return early due to a missing log device and missing retrieving the cache device and spare eventually. Passing ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct configuration regardless of the missing log device. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14794	2023-05-05 09:07:07 -07:00
Herb Wartens	33075e465f	Allow MMP to bypass waiting for other threads At our site we have seen cases when multi-modifier protection is enabled (multihost=on) on our pool and the pool gets suspended due to a single disk that is failing and responding very slowly. Our pools have 90 disks in them and we expect disks to fail. The current version of MMP requires that we wait for other writers before moving on. When a disk is responding very slowly, we observed that waiting here was bad enough to cause the pool to suspend. This change allows the MMP thread to bypass waiting for other threads and reduces the chances the pool gets suspended. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Herb Wartens <hawartens@gmail.com> Closes #14659	2023-04-24 12:55:07 -07:00
Brian Behlendorf	cdbe1d65c4	Increase default zfs_rebuild_vdev_limit to 64MB When testing distributed rebuild performance with more capable hardware it was observed than increasing the zfs_rebuild_vdev_limit to 64M reduced the rebuild time by 17%. Beyond 64MB there was some improvement (~2%) but it was not significant when weighed against the increased memory usage. Memory usage is capped at 1/4 of arc_c_max. Additionally, vr_bytes_inflight_max has been moved so it's updated per-metaslab to allow the size to be adjust while a rebuild is running. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-04-24 12:55:07 -07:00
Brian Behlendorf	fa28e26e42	Increase default zfs_scan_vdev_limit to 16MB For HDD based pools the default zfs_scan_vdev_limit of 4M per-vdev can significantly limit the maximum scrub performance. Increasing the default to 16M can double the scrub speed from 80 MB/s per disk to 160 MB/s per disk. This does increase the memory footprint during scrub/resilver but given the performance win this is a reasonable trade off. Memory usage is capped at 1/4 of arc_c_max. Note that number of outstanding I/Os has not changed and is still limited by zfs_vdev_scrub_max_active. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-04-24 12:55:07 -07:00
Brian Behlendorf	9fe3da9364	Improve resilver ETAs When resilvering the estimated time remaining is calculated using the average issue rate over the current pass. Where the current pass starts when a scan was started, or restarted, if the pool was exported/imported. For dRAID pools in particular this can result in wildly optimistic estimates since the issue rate will be very high while scanning when non-degraded regions of the pool are scanned. Once repair I/O starts being issued performance drops to a realistic number but the estimated performance is still significantly skewed. To address this we redefine a pass such that it starts after a scanning phase completes so the issue rate is more reflective of recent performance. Additionally, the zfs_scan_report_txgs module option can be set to reset the pass statistics more often. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14410	2023-04-24 12:55:07 -07:00
Ameer Hamza	a68dfdb88c	Fix "Detach spare vdev in case if resilvering does not happen" Spare vdev should detach from the pool when a disk is reinserted. However, spare detachment depends on the completion of resilvering, and if resilver does not schedule, the spare vdev keeps attached to the pool until the next resilvering. When a zfs pool contains several disks (25+ mirror), resilvering does not always happen when a disk is reinserted. In this patch, spare vdev is manually detached from the pool when resilvering does not occur and it has been tested on both Linux and FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14722	2023-04-24 09:56:13 -07:00
Richard Yao	4a5950a129	Linux: zfs_fillpage() should handle partial pages from end of file After `89cd2197b9` was merged, Clang's static analyzer began complaining about a dead assignment in `zfs_fillpage()`. Upon inspection, I noticed that the dead assignment was because we are not using the calculated io_len that we should use to avoid asking the DMU to read past the end of a file. This should result in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`. This issue predates `89cd2197b9`, but its simplification of zfs_fillpage() eliminated the only use of the assignment to io_len, which made Clang's static analyzer complain about the issue. Also, as a precaution, we add an assertion that io_offset < i_size. If this ever fails, bad things will happen. Otherwise, we are blindly trusting the kernel not to give us invalid offsets. We continue to blindly trust it on non-debug kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14534	2023-04-21 13:12:35 -07:00
Brian Behlendorf	c7db374ac6	Fix buffered/direct/mmap I/O race When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13608 Closes #14498	2023-04-21 13:12:35 -07:00
Tony Hutter	e25f9131d6	Tag zfs-2.1.11 META file and changelog updated. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2023-04-18 11:44:34 -07:00
Low-power	4e49d8ed27	Values printed by zpool-iostat(8) should be right-aligned This inappropriate left-alignment was introduced in `7bb7b1f`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #14751	2023-04-18 11:44:07 -07:00
Tony Hutter	a969b1b22c	Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()" This reverts commit `4b3133e671`. Users identified this commit as a possible source of data corruption: https://github.com/openzfs/zfs/issues/14753 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #14753 Closes #14761	2023-04-18 11:44:07 -07:00
наб	18edf7a3ba	contrib: dracut: fix race with root=zfs:dset when necessities required This had always worked in my testing, but a user on hardware reported this to happen 100%, and I reproduced it once with cold VM host caches. dracut-zfs-generator runs as a systemd generator, i.e. at Some Relatively Early Time; if root= is a fixed dataset, it tries to "solve [necessities] statically at generation time". If by that point zfs-import.target hasn't popped (because the import is taking a non-negligible amount of time for whatever reason), it'll see no children for the root datase, and as such generate no mounts. This has never had any right to work. No-one caught this earlier because it's just that much more convenient to have root=zfs:AUTO, which orders itself properly. To fix this, always run zfs-nonroot-necessities.service; this additionally simplifies the implementation by: * making BOOTFS from zfs-env-bootfs.service be the real, canonical, root dataset name, not just "whatever the first bootfs is", and only set it if we're ZFS-booting * zfs-{rollback,snapshot}-bootfs.service can use this instead of re-implementing it * having zfs-env-bootfs.service also set BOOTFSFLAGS * this means the sysroot.mount drop-in can be fixed text * zfs-nonroot-necessities.service can also be constant and always enabled, because it's conditioned on BOOTFS being set There is no longer any code generated at run-time (the sysroot.mount drop-in is an unavoidable gratuitous cp). The flow of BOOTFS{,FLAGS} from zfs-env-bootfs.service to sysroot.mount is not noted explicitly in dracut.zfs(7), because (a) at some point it's just visual noise and (b) it's already ordered via d-p-m.s from z-i.t. Backport-of: `3399a30ee0` Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>	2023-04-18 10:10:45 -07:00
Tony Hutter	33003ecc93	Tag zfs-2.1.10 META file and changelog updated. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2023-04-13 15:59:45 -07:00

1 2 3 4 5 ...

7581 Commits All Branches Search

7581 Commits

All Branches