Archive-Team/zfs - zfs

Commit Graph

Author	SHA1	Message	Date
Paul Dagnelie	7eabb0af37	Try to clarify wording to reduce zpool add incidents Try to clarify wording to reduce zpool add incidents. Add an attach example. Reviewed-by: Rich Ercolani <Rincebrain@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #15179	2023-08-27 08:25:42 -07:00
Rich Ercolani	c65aaa8387	Avoid save/restoring AMX registers to avoid a SPR erratum Intel SPR erratum SPR4 says that if you trip into a vmexit while doing FPU save/restore, your AMX register state might misbehave... and by misbehave, I mean save all zeroes incorrectly, leading to explosions if you restore it. Since we're not using AMX for anything, the simple way to avoid this is to just not save/restore those when we do anything, since we're killing preemption of any sort across our save/restores. If we ever decide to use AMX, it's not clear that we have any way to mitigate this, on Linux...but I am not an expert. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14989 Closes #15168	2023-08-27 08:25:42 -07:00
Brian Behlendorf	e99e684b33	zed: update zed.d/statechange-slot_off.sh The statechange-slot_off.sh zedlet which was added in #15200 needed to be installed so it's included by the packages. Additional testing has also shown that multiple retries are often needed for the script to operate reliably. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15210	2023-08-27 08:25:42 -07:00
наб	1b696429c1	Make zoned/jailed zfsprops(7) make more sense. - Distribute zfs-[un]jail.8 on FreeBSD and zfs-[un]zone.8 on Linux - zfsprops.7: mirror zoned/jailed, only available on respective platforms Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #15161	2023-08-27 08:25:42 -07:00
Rob N	084ff4abd2	tests/block_cloning: rename and document get_same_blocks helper `get_same_blocks` is a helper to compare two files and return a list of the blocks that are clones of each other. It's essential for the block cloning tests. Previously it was incorrectly called `unique_blocks`, which is the _inverse_ of what it does (an early version did list unique blocks; it was changed but the name was not). So if nothing else, it should be called `duplicate_blocks`. But keeping the details of a clone operation in your head is actually quite difficult, even without the additional overhead of wondering how the tools work. So I've renamed it to better describe what it does, added a usage note, and changed it to return block indexes from 0 instead of 1, to match how L0 blocks are normally counted. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15181	2023-08-26 11:18:11 -07:00
Serapheim Dimitropoulos	ab999406fe	Update outdated assertion from zio_write_compress As part of some internal gang block testing within Delphix, we hit the assertion removed by this patch. The assertion was triggered by a ZIO that had two copies and was a gang block, which made the expression `MIN(zp->zp_copies + BP_IS_GANG(bp), spa_max_replication(spa))` equal to 3, while the assertion expected it to equal `BP_GET_NDVAS(bp)`. The assertion has not been valid since commit `14872aaa4f` ("EIO caused by encryption + recursive gang", Author: Matthew Ahrens <matthew.ahrens@delphix.com>, Date: Mon Feb 6 09:37:06 2023 -0800). That commit changed gang block headers so they can't have more than 2 copies, but the assertion in question was never updated. Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #15180	2023-08-26 11:18:11 -07:00
Tony Hutter	d19304ffee	zed: Add zedlet to power off slot when drive is faulted If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_FAULT is enabled in zed.rc, then power off the drive's slot in the enclosure if it becomes FAULTED. This can help silence misbehaving drives. This assumes your drive enclosure fully supports slot power control via sysfs. Reviewed-by: @AllKind Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15200	2023-08-25 13:33:40 -07:00
Rob N	92f095a903	copy_file_range: fix fallback when source created on same txg In `019dea0a5` we removed the conversion from EAGAIN->EXDEV inside zfs_clone_range(), but forgot to add a test for EAGAIN to the copy_file_range() entry points to trigger a fallback to a content copy. This commit fixes that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15170 Closes #15172	2023-08-25 13:33:40 -07:00
Umer Saleem	645a7e4d95	Move zinject from openzfs-zfs-test to openzfs-zfsutils For native Debian packaging, the zinject binary and man page are packaged in the ZFS test package. zinject is not directly related to ZTS and should be packaged with the other utilities, as it is in the zfs_<ver>.rpm/deb packages. This commit moves the zinject binary and man page from openzfs-zfs-test to the openzfs-zfsutils package. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15160	2023-08-25 13:33:40 -07:00
Rafael Kitover	95649854ba	dracut: support mountpoint=legacy for root dataset Support mountpoint=legacy for the root dataset in the dracut zfs support scripts. mountpoint=/ or mountpoint=/sysroot also works. Change zfs-env-bootfs.service to add zfsutil to BOOTFSFLAGS only for root datasets with mountpoint != legacy. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Rafael Kitover <rkitover@gmail.com> Closes #15149	2023-08-25 13:33:40 -07:00
oromenahar	895cb689d3	zfs_clone_range should return descriptive error codes Return more descriptive error codes instead of `EXDEV` when the parameters don't match the requirements of the clone function. Updated the comments in `brt.c` accordingly. The first three errors are just invalid parameters, which ZFS cannot handle. The fourth error indicates that the block to be cloned was created and then cloned or modified in the same transaction group (`txg`). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15148	2023-08-25 13:33:40 -07:00
наб	6bdc7259d1	libzfs: sendrecv: send_progress_thread: handle SIGINFO/SIGUSR1 POSIX timers target the process, not the thread (as does SIGINFO), so we need to block it in the main thread which will die if interrupted. Ref: https://101010.pl/@ed1conf@bsd.network/110731819189629373 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #15113	2023-08-25 13:33:40 -07:00
Ryan Lahfa	1e488eec60	linux/spl/kmem_cache: undefine `kmem_cache_alloc` before defining it When compiling a kernel with bcachefs and zfs, the two macros will collide, making it impossible to have both filesystems. It is sufficient to just undefine the macro before calling it. As for why this should be in ZFS rather than bcachefs: bcachefs is currently not an in-tree filesystem, but it has a reasonably high chance of getting included soon. This avoids the breakage in ZFS early; this patch may be distributed downstream in NixOS and is already used there. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz> Closes #15144	2023-08-25 13:33:40 -07:00
Mateusz Piotrowski	c418edf1d3	Fix some typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Closes #15141	2023-08-25 13:33:40 -07:00
Alexander Motin	df8c9f351d	ZIL: Second attempt to reduce scope of zl_issuer_lock. The previous patch #14841 appeared to have a significant flaw, causing deadlocks if the zl_get_data callback got blocked waiting for TXG sync. I already handled some such cases in the original patch, but issue #14982 showed cases that were impossible to solve in that design. This patch fixes the problem by postponing log block allocation until the very end, just before the zios are issued, leaving nothing blocking after that point to cause deadlocks. Before that point, though, any sleeps are now allowed without blocking the sync thread. This requires a slightly more complicated lwb state machine to allocate blocks and issue zios in the proper order. But with the removal of the special early-issue workarounds, the new code is much cleaner now and should even be more efficient. Since this patch uses null zios between writes, I've found that null zios do not wait for the logical children's ready status in zio_ready(), which makes the parent write proceed prematurely, producing incorrect log blocks. Adding ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15122	2023-08-25 11:58:44 -07:00
Alexander Motin	bb31ded68b	ZIL: Replay blocks without next block pointer. If we get a next-block allocation error during a log write, we trigger a transaction commit. But the block we have just completed is still written, and the transactions it covers will be acknowledged normally. If after that we ignore the block during replay just because it is the last in the chain, we may not replay some transactions that we have acknowledged as synced, which is not right. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15132	2023-08-25 11:58:44 -07:00
Alexander Motin	c1801cbe59	ZIL: Avoid dbuf_read() before dmu_sync(). In most cases dmu_sync() works with dirty records directly and does not need actual data. The only exception is dmu_sync_late_arrival(). To save some CPU time, use dmu_buf_hold_noread() in z_get_data() and explicitly call dbuf_read() in dmu_sync_late_arrival(). There is also a chance that by that time the TXG will already be synced and we won't have to do it at all. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15153	2023-08-25 11:58:44 -07:00
Alexander Motin	ffaedf0a44	Remove fastwrite mechanism. Fastwrite was introduced many years ago to improve the spread of ZIL writes between multiple top-level vdevs by tracking the number of allocated but not yet written blocks and choosing the vdev with the smaller count. It was supposed to reduce the ZIL's knowledge about allocation, but it actually made the ZIL report its allocations to the allocation code even more actively, complicating both the ZIL and metaslab code. On top of that, it seems the ZIO_FLAG_FASTWRITE setting in dmu_sync() was lost many years ago, and that was one of the declared benefits. Plus, the introduction of the embedded log metaslab class solved another problem with the allocation rotor accounting for both normal and log allocations, since in most cases those are now in different metaslab classes. After all that, I'd prefer to simplify the already too complicated ZIL, ZIO and metaslab code if the benefit of the complexity is not obvious. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15107	2023-08-25 11:58:44 -07:00
Alexander Motin	02ce9030e6	Avoid waiting in dmu_sync_late_arrival(). The transaction there does not produce any dirty data or log blocks, so it should not be throttled. All other cases wait for TXG sync, by which time the log block we are writing will be obsolete, so we can skip waiting and just return an error here instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15096	2023-08-25 11:58:44 -07:00
Serapheim Dimitropoulos	0ae7bfc0a4	zpool_vdev_remove() should handle EALREADY error return When the vdev properties feature was merged, an extra check was added in `spa_vdev_remove_top_check()` that checks whether the vdev we want to remove is already being removed and, if so, returns an EALREADY error: `static int spa_vdev_remove_top_check(vdev_t *vd) { ... /* This device is already being removed */ if (vd->vdev_removing) return (SET_ERROR(EALREADY)); ... }`. Before that change we would still fail with an error, but a more generic one; this is the check that failed later in the same function: `/* There can not be a removal in progress. */ if (spa->spa_removing_phys.sr_state == DSS_SCANNING) return (SET_ERROR(EBUSY));`. Changing the error code returned from that function changed the behavior of the removal's library interface exposed to userland: `zpool_vdev_remove()` now returns `EZFS_UNKNOWN` instead of the `EZFS_EBUSY` it returned before. This patch makes `zpool_vdev_remove()` aware of the new EALREADY code and has it propagate `EZFS_EBUSY`, reverting to the previously established semantics of that function. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #15013 Closes #15129	2023-08-02 08:54:09 -07:00
наб	bd1eab16eb	linux: zfs: ctldir: set [amc]time to snapshot's creation property If looking up a snapdir inode failed, hold pool config – hold the snapshot – get its creation property – release it – release it, then use that as the [amc]time in the allocated inode. If that fails then fall back to current time. No performance impact since this is only done when allocating a new snapdir inode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #15110 Closes #15117	2023-08-02 08:53:45 -07:00
Zach Dykstra	b3c1807d77	readmmap.c: fix building with MUSL libc glibc includes sys/types.h from stdlib.h. This is not the case for MUSL, so explicitly include it. Fixes usage of uint_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Zach Dykstra <dykstra.zachary@gmail.com> Closes #15130	2023-08-02 08:53:06 -07:00
oromenahar	b5e2456333	Check the return value in clonefile test Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15128	2023-08-02 08:52:40 -07:00
Rob N	c47f0f4417	linux/copy_file_range: properly request a fallback copy on Linux <5.3 Before Linux 5.3, the filesystem's copy_file_range handler had to signal back to the kernel that we can't fulfill the request and it should fall back to a content copy. This is done by returning -EOPNOTSUPP. This commit converts the EXDEV return from zfs_clone_range to EOPNOTSUPP, to force the kernel to fall back for all the valid reasons it might be unable to clone. Without it, the copy_file_range() syscall will return EXDEV to userspace, breaking its semantics. Add test for copy_file_range fallbacks. copy_file_range should always fall back to a content copy whenever ZFS can't service the request with cloning. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15131	2023-08-02 08:52:40 -07:00
Rob N	12f2b1f65e	zdb: include cloned blocks in block statistics This gives `zdb -b` support for clone blocks. Previously, it didn't know what clones were, so would count their space allocation multiple times and then report leaked space (or, in debug, would assert trying to claim blocks a second time). This commit fixes those bugs, and reports the number of clones and the space "used" (saved) by them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15123	2023-08-02 08:52:40 -07:00
Brian Behlendorf	4a104ac047	Tag 2.2.0-rc3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2023-07-27 16:15:44 -07:00
oromenahar	c24a480631	BRT should return EOPNOTSUPP Return the more descriptive EOPNOTSUPP instead of EXDEV when the storage pool doesn't support block cloning. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15097	2023-07-27 16:11:54 -07:00
Rob Norris	36d1a3ef4e	zts: block cloning tests Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050 Closes #405 Closes #13349	2023-07-26 08:46:58 -07:00
Rob Norris	2768dc04cc	linux: implement filesystem-side copy/clone functions for EL7 Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	3366ceaf3a	linux: implement filesystem-side clone ioctls Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the appropriate VFS op. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	5d12545da8	linux: implement filesystem-side copy/clone functions This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	a3ea8c8ee6	dbuf_sync_leaf: check DB_READ in state assertions Block cloning introduced a new state transition from DB_NOFILL to DB_READ. This occurs when a block is cloned and then read on the current txg. In this case, the clone will move the dbuf to DB_NOFILL, and then the read will be issued for the overridden block pointer. If that read is still outstanding when it comes time to write, the dbuf will be in DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus tripping the assertions. This updates those checks to allow DB_READ as a valid state iff the dirty record is for a BRT write and there is an override block pointer. This is a safe situation because the block already exists, so there's nothing that could change from underneath the read. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Original-patch-by: Kay Pedersen <mail@mkwg.de> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	0426e13271	dmu_buf_will_clone: only check that current txg is clean dbuf_undirty() will (correctly) only remove dirty records for the given (open) txg. If there is a dirty record for an earlier closed txg that has not been synced out yet, then db_dirty_records will still have entries on it, tripping the assertion. Instead, change the assertion to only consider the current txg. To some extent this is redundant, as it's really just saying "did dbuf_undirty() work?", but it doesn't hurt and accurately expresses our expectations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Original-patch-by: Kay Pedersen <mail@mkwg.de> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	8aa4f0f0fc	brt_vdev_realloc: use vmem_alloc for large allocation bv_entcount can be a relatively large allocation (see comment for BRT_RANGESIZE), so get it from the big allocator. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	7698503dca	zfs_clone_range: use vmem_alloc for large allocation Just silencing the warning about large allocations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Brian Behlendorf	b9aa32ff39	zed: Reduce log noise for large JBODs For large JBODs the log message "zfs_iter_vdev: no match" can account for the bulk of the log messages (over 70%). Since this message is purely informational and not that useful we remove it. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15086 Closes #15094	2023-07-26 08:46:58 -07:00
Brian Behlendorf	571762b290	Linux 6.4 compat: META Update the META file to reflect compatibility with the 6.4 kernel. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15095	2023-07-26 08:46:58 -07:00
Alexander Motin	991834f5dc	Remove zl_issuer_lock from zil_suspend(). This locking was recently added as part of #14979. But it appears to be illegal to take zl_issuer_lock while holding dp_config_rwlock, taken by dsl_pool_hold(). It causes a deadlock with the sync thread in spa_sync_upgrades(). On second thought, we should not need this locking, since zil_commit_impl(), which we call below, takes zl_issuer_lock, which should sufficiently protect zl_suspend reads, combined with other logic from #14979. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15103	2023-07-25 13:54:02 -07:00
Alexander Motin	41a0f66279	ZIL: Fix config lock deadlock. When we have some LWBs closed and their ZIOs ready to be issued, we cannot afford to sleep on the config lock if somebody else tries to lock it as writer, or it will cause a deadlock. To solve it, move spa_config_enter() from zil_lwb_write_issue() to zil_lwb_write_close() under zl_issuer_lock to enforce lock ordering with other threads. Now, if we can't immediately lock config, issue all previously closed LWBs so that they can drop their config locks after completion, and only then allow sleeping on our lock. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15078 Closes #15080	2023-07-25 13:54:02 -07:00
Umer Saleem	c79d1bae75	Update changelog for OpenZFS 2.2.0 release This commit updates changelog for native Debian packages for OpenZFS 2.2.0 release. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15104	2023-07-25 09:01:27 -07:00
Brian Behlendorf	70232483b4	Tag 2.2.0-rc2 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2023-07-21 16:36:34 -07:00
Rob N	c5273e0c31	shellcheck: disable "unreachable command" check [SC2317] This new check in 0.9.0 appears to have some issues with various forms of "early return", like trap, exit and return. This is tripping up (at least): cmd/zed/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zfs-functions It's not obvious what it's complaining about or what the remedy is, so it seems sensible to disable this check for now. See also: https://www.shellcheck.net/wiki/SC2317 https://github.com/koalaman/shellcheck/issues/2542 https://github.com/koalaman/shellcheck/issues/2613 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15089	2023-07-21 16:35:12 -07:00
Rob N	685ae4429f	metaslab: tuneable to better control force ganging metaslab_force_ganging isn't enough to actually force ganging, because it still only forces 3% of the time. This adds metaslab_force_ganging_pct so we can configure how often to force ganging. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15088	2023-07-21 16:35:12 -07:00
Alexander Motin	81be809a25	Adjust prefetch parameters. - Reduce the maximum prefetch distance for 32-bit platforms to 8MB, as it was previously. Those systems probably didn't grow much, so it is better to stay conservative there. - Retire the array_rd_sz tunable, which blocked prefetch for large requests. We should not penalize applications trying to be more efficient. The speculative prefetcher by itself has reasonable distance limits, and 1MB is not much at all these days. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15072	2023-07-21 16:35:12 -07:00
Alexander Motin	8a6fde8213	Add explicit prefetches to bpobj_iterate(). To simplify error handling, bpobj_iterate_blkptrs() iterates through the list of block pointers backwards. Unfortunately, the speculative prefetcher is currently unable to detect such patterns, which makes each block read there synchronous and very slow on HDD pools. According to my tests, the added explicit prefetch reduces the time needed to asynchronously delete 8 snapshots of 4 million blocks each from 20 seconds to less than one, which should free the sync thread for other useful work, such as async writes, scrub, etc. While there, plug one memory leak in case of a bpobj_open() error and harmonize some variable names. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15071	2023-07-21 16:35:12 -07:00
Alan Somers	b6f618f8ff	Don't emit cksum_{actual,expected} in ereport.fs.zfs.checksum events With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes #14717 for mirrored pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #14717 Closes #15052	2023-07-21 16:35:12 -07:00
Alan Somers	51a2b59767	Don't emit checksum histograms in ereport.fs.zfs.checksum events The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15052	2023-07-21 16:35:12 -07:00
Tony Hutter	8c81c0b05d	zed: Fix zed ASSERT on slot power cycle We would see zed assert on one of our systems if we powered off a slot. Further examination showed zfs_retire_recv() was reporting a GUID of 0, which in turn would return a NULL nvlist. Add in a check for a zero GUID. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15084	2023-07-21 16:35:12 -07:00
Chunwei Chen	b221f43943	Fix zpl_test_super race with zfs_umount We cannot call zpl_enter in zpl_test_super, because zpl_test_super runs under a spinlock, so we can't sleep, and also because zpl_test_super is called without sb->s_umount taken, so it's possible we would race with zfs_umount and call zpl_enter on a freed zfsvfs. Here's a stack trace when this happens: [ 2379.114837] VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.114845] PANIC at spl-condvar.c:497:__cv_broadcast() [ 2379.114854] Kernel panic - not syncing: VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.115012] Call Trace: [ 2379.115019] dump_stack+0x74/0x96 [ 2379.115024] panic+0x114/0x2f6 [ 2379.115035] spl_panic+0xcf/0xfc [spl] [ 2379.115477] __cv_broadcast+0x68/0xa0 [spl] [ 2379.115585] rrw_exit+0xb8/0x310 [zfs] [ 2379.115696] rrm_exit+0x4a/0x80 [zfs] [ 2379.115808] zpl_test_super+0xa9/0xd0 [zfs] [ 2379.115920] sget+0xd1/0x230 [ 2379.116033] zpl_mount+0xdc/0x230 [zfs] [ 2379.116037] legacy_get_tree+0x28/0x50 [ 2379.116039] vfs_get_tree+0x27/0xc0 [ 2379.116045] path_mount+0x2fe/0xa70 [ 2379.116048] do_mount+0x80/0xa0 [ 2379.116050] __x64_sys_mount+0x8b/0xe0 [ 2379.116052] do_syscall_64+0x35/0x50 [ 2379.116054] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 2379.116057] RIP: 0033:0x7f9912e8b26a Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #15077	2023-07-21 16:35:12 -07:00
Ameer Hamza	e037327bfe	spa_min_alloc should be GCD, not min Since, unlike ashifts, spa_min_alloc may not be a power of 2 in the case of dRAID, we should not select the minimum value among several vdevs: rounding to a multiple of it is unlikely to work for other vdevs. Instead, using the greatest common divisor produces smaller yet more reasonable results. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15067	2023-07-21 16:35:12 -07:00

8982 Commits