Archive-Team/zfs

Commit Graph

Author	SHA1	Message	Date
the-Chain-Warden-thresh	d22bf6a9bd	LUA: Backport CVE-2020-24370's patch CVE-2020-24370 is a security vulnerability in lua. Although the CVE description in CVE-2020-24370 said that this CVE only affected lua 5.4.0, according to lua this CVE actually existed since lua 5.2. The root cause of this CVE is the negation overflow that occurs when you try to take the negative of 0x80000000. Thus, this CVE also exists in openzfs. Try to backport the fix to the lua in openzfs since the original fix is for 5.4 and several functions have been changed. https://github.com/advisories/GHSA-gfr4-c37g-mm3v https://nvd.nist.gov/vuln/detail/CVE-2020-24370 https://www.lua.org/bugs.html#5.4.0-11 https://github.com/lua/lua/commit/a585eae6e7ada1ca9271607a4f48dfb1786 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: ChenHao Lu <18302010006@fudan.edu.cn> Closes #15847	2024-02-08 15:22:16 -08:00
Umer Saleem	08fd5ccc38	Improve performance for zpool trim on linux On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue trim command which is synchronous. This commit updates vdev_disk_io_trim to use __blkdev_issue_discard, which is asynchronous. Unfortunately there isn't any asynchronous version for blkdev_issue_secure_erase, so performance of secure trim will still suffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15843	2024-02-06 12:58:55 -08:00
Tony Hutter	00d85a98ea	BRT: Fix FICLONE/FICLONERANGE shortened copy On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never\|auto\|always` works correctly. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15728 Closes #15842	2024-02-06 10:01:15 -08:00
Rob Norris	09e6724e1e	Linux 6.8 compat: replace MAX_ORDER define MAX_ORDER has been renamed to MAX_PAGE_ORDER. Rather than just redefining it, instead define our own name and set it consistently from the start. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805	2024-01-29 14:53:29 -08:00
Rob Norris	7466e09a49	Linux 6.8 compat: implement strlcpy fallback Linux has removed strlcpy in favour of strscpy. This implements a fallback implementation of strlcpy for this case. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805	2024-01-29 14:53:29 -08:00
Rob Norris	ce782d0804	Linux 6.8 compat: update for new bdev access functions blkdev_get_by_path() and blkdev_put() have been replaced by bdev_open_by_path() and bdev_release(), which return a "handle" object with the bdev object itself inside. This adds detection for the new functions, and macros to handle the old and new forms consistently. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805	2024-01-29 14:53:29 -08:00
Paul Dagnelie	ab653603f8	Don't assert mg_initialized due to device addition race During device removal stress tests, we noticed that we were tripping the assertion that mg_initialized was true. After investigation, it was determined that the mg in question was the embedded log metaslab group for a newly added vdev; the normal mg had been initialized (by metaslab_sync_reassess, via vdev_sync_done). However, because the spa config alloc lock is not held as writer across both calls to metaslab_sync_reassess, it is possible for an allocation to happen between the two metaslab_groups being initialized. Because the metaslab code doesn't check the group in question, just the vdev's main mg, it is possible to get past the initial check in vdev_allocatable and later fail due to the assertion. We simply remove the assertions. We could also consider locking the ALLOC lock around the reassess calls in vdev_sync_done, but that risks deadlocks. We could check the actual target mg in vdev_allocatable, but that risks racing with a passivation that comes in after that check but before the assertion. We still won't be able to actually allocate from the metaslab group if no metaslabs are ready, so this change shouldn't break anything. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #15818	2024-01-29 14:53:29 -08:00
Ameer Hamza	dd3a0a2715	Update vdev devid and physpath if changed between imports If devid or physpath for a vdev changes between imports, ensure it is updated to the new value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15816	2024-01-29 14:53:29 -08:00
Tino Reichardt	276be5357c	linux spl: fix typo in top comment of spl-condvar.c Credential Implementation -> Condition Variables Implementation Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #15782	2024-01-29 14:53:29 -08:00
youzhongyang	6b64acc157	Make spl_kmem_cache size check consistent On Linux x86_64, kmem cache can have size up to 4M, however increasing spl_kmem_cache_slab_limit can lead to crash due to the size check inconsistency. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #15757	2024-01-29 14:53:29 -08:00
Ameer Hamza	eb4a36bcef	Extend aux label to add path information Pool import logic uses vdev paths, so it makes sense to add path information on AUX vdev as well. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15737	2024-01-29 14:53:29 -08:00
Ameer Hamza	52cee9a3eb	fix: Uber block label not always found for aux vdevs When spare or l2cache (aux) vdev is added during pool creation, spa->spa_uberblock is not dumped until that point. Subsequently, the aux label is never synchronized after its initial creation, resulting in the uberblock label remaining undumped. The uberblock is crucial for lib_blkid in identifying the ZFS partition type. To address this issue, we now ensure sync of the uberblock label once if it's not dumped initially. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15737	2024-01-29 14:53:29 -08:00
Mark Johnston	8b1c6db3d2	Fix a potential use-after-free in zfs_setsecattr() In general, VOPs must not load the "z_log" field until having called zfs_enter_verify_zp(). Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15752	2024-01-29 14:53:29 -08:00
Mark Johnston	22e4f08c30	Linux: Defer loading the object set in zfs_setattr() We need to wait until after having done a zfs_enter() to load some fields from the zfsvfs structure. Otherwise a use-after-free is possible in the face of a concurrent rollback. Other functions in this file are careful to avoid this bug, I believe this is the only instance. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15752	2024-01-29 14:53:29 -08:00
Pawel Jakub Dawidek	3425484eb9	Fix file descriptor leak on pool import. Descriptor leak can be easily reproduced by doing: # zpool import tank # sysctl kern.openfiles # zpool export tank; zpool import tank # sysctl kern.openfiles We were leaking four file descriptors on every import. Similar leak most likely existed when using file-based VDEVs. External-issue: https://reviews.freebsd.org/D43529 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15630	2024-01-26 13:38:25 -08:00
Pawel Jakub Dawidek	ef527958c6	Fix cloning into mmaped and cached file. If the destination file is mmaped and the mmaped region was already read, so it is cached, we need to update mmaped pages after successful clone using update_pages(). Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Pointed out by: Ka Ho Ng <khng@freebsd.org> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15772	2024-01-19 12:28:02 -08:00
Kevin Jin	07cf973fe9	Autotrim High Load Average Fix Switch from cv_wait() to cv_wait_idle() in vdev_autotrim_wait_kick(), which should mitigate the high load average while waiting. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #15781	2024-01-18 11:33:29 -08:00
Rob N	2ecc2dfe42	Linux 6.7 compat: zfs_setattr fix atime update In `db4fc559c` I messed up and changed this bit of code to set the inode atime to an uninitialised value, when actually it was just supposed to be loading the atime from the inode to be stored in the SA. This changes it to what it should have been. Ensure times change by the right amount Previously, we only checked if the times changed at all, which missed a bug where the atime was being set to an undefined value. Now ensure the times change by two seconds (or thereabouts), ensuring we catch cases where we set the time to something bonkers. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15762 Closes #15773	2024-01-17 08:59:28 -08:00
Mark Johnston	a00231a3fc	spa: Let spa_taskq_param_get()'s addition of a newline be optional For FreeBSD sysctls, we don't want the extra newline, since the sysctl(8) utility will format strings appropriately. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reported-by: Peter Holm <pho@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15719	2024-01-16 11:32:19 -08:00
Mark Johnston	9181e94f0b	spa: Fix FreeBSD sysctl handlers sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by sbuf_new_for_sysctl(). In particular, this code triggers an assertion failure in sbuf_clear(). Simplify by just using sysctl_handle_string() for both reading and setting the tunable. Fixes: `6930ecbb7` ("spa: make read/write queues configurable") Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reported-by: Peter Holm <pho@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15719	2024-01-16 11:32:19 -08:00
Alexander Motin	ac592318b8	Fix livelist assertions for dedup and cloning Two block pointers in livelist pointing to the same location may be caused not only by dedup, but also by block cloning. We should not assert D bit set in them. Two block pointers in livelist pointing to the same location may have different logical birth time in case of dedup or cloning. We should assert identical physical birth time instead. Assert identical physical block size between pointers in addition to checksum, since that is what checksums are calculated on. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15732	2024-01-12 12:53:00 -08:00
Alexander Motin	152a775eac	Improve block sizes checks during cloning - Fail if source block is smaller than destination. We can only grow blocks, not shrink them. - Fail if we do not have full znode range lock. In that case grow is not even called. We should improve zfs_rangelock_cb() somehow to know when cloning needs to grow the block size unlike write. - Fail if we tried to resize, but failed. There are many reasons for it to fail that we can not predict at this level, so be ready for them. Unlike write, that may proceed after growth failure, block cloning can't and must return error. This fixes assertion inside dmu_brt_clone() when it sees different number of blocks held in destination than it got block pointers. Builds without ZFS_DEBUG returned EXDEV, so are not affected much. Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15724 Closes #15735	2024-01-12 12:53:00 -08:00
chrisperedun	f71c16a661	Don't panic on unencrypted block in encrypted dataset While `763ca47` closes the situation of block cloning creating unencrypted records in encrypted datasets, existing data still causes panic on read. Setting zfs_recover bypasses this but at the cost of potentially ignoring more serious issues. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Peredun <chris.peredun@ixsystems.com> Closes #15677	2024-01-08 16:11:39 -08:00
Alexander Motin	9c40ae0219	dbuf: Set dr_data when unoverriding after clone Block cloning normally creates dirty record without dr_data. But if the block is read after cloning, it is moved into DB_CACHED state and receives the data buffer. If after that we call dbuf_unoverride() to convert the dirty record into normal write, we should give it the data buffer from dbuf and release one. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15654 Closes #15656	2024-01-08 16:11:39 -08:00
Alexander Motin	a701548eb4	dbuf: Handle arcbuf assignment after block cloning In some cases dbuf_assign_arcbuf() may be called on a block that was recently cloned. If it happened in current TXG we must undo the block cloning first, since the only one dirty record per TXG can't and shouldn't mean both cloning and overwrite same time. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15653	2024-01-08 16:11:39 -08:00
Alexander Motin	b13c91bb29	DMU: Fix lock leak on dbuf_hold() error dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in case dbuf_hold() failed. I don't have reproduction for this, but it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15644	2024-01-08 16:11:39 -08:00
Alexander Motin	e09356fa05	BRT: Limit brt_vdev_dump() to only one vdev Without this patch on pool of 60 vdevs with ZFS_DEBUG enabled clone takes much more time than copy, while heavily trashing dbgmsg for no good reason, repeatedly dumping all vdevs BRTs again and again, even unmodified ones. I am generally not sure this dumping is not excessive, but decided to keep it for now, just restricting its scope to more reasonable. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15625	2024-01-08 16:11:39 -08:00
Alexander Motin	1e1d748cae	ZIL: Remove 128K into 2x68K LWB split optimization To improve 128KB block write performance in case of multiple VDEVs ZIL used to split those writes into two 64KB ones. Unfortunately it was found to cause LWB buffer overflow, trying to write maximum-sized 128KB TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer, since unlike TX_WRITE ZIL code can't split it. This is a minimally-invasive temporary block cloning fix until the following more invasive prediction code refactoring. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15634	2024-01-08 16:11:39 -08:00
oromenahar	121924575e	Allow block cloning across encrypted datasets When two datasets share the same master encryption key, it is safe to clone encrypted blocks. Currently only snapshots and clones of a dataset share with it the same encryption key. Added a test for: - Clone from encrypted sibling to encrypted sibling with non encrypted parent - Clone from encrypted parent to inherited encrypted child - Clone from child to sibling with encrypted parent - Clone from snapshot to the original datasets - Clone from foreign snapshot to a foreign dataset - Cloning from non-encrypted to encrypted datasets - Cloning from encrypted to non-encrypted datasets Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15544	2024-01-08 16:11:39 -08:00
Alexander Motin	e11b3eb1c6	ZIL: Do not clone blocks from the future ZIL claim can not handle block pointers cloned from the future, since they are not yet allocated at that point. It may happen either if the block was just written when it was cloned, or if the pool was frozen or somehow else rewound on import. Handle it from two sides: prevent cloning of blocks with physical birth time from not yet synced or frozen TXG, and abort ZIL claim if we still detect such blocks due to rewind or something else. While there, assert that any cloned blocks we claim are really allocated by calling metaslab_check_free(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15617	2024-01-08 16:11:39 -08:00
Alexander Motin	3b8f227362	ZIL: Remove TX_CLONE_RANGE replay for ZVOLs. zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means we do not need to do anything (drop the references) for not implemented yet TX_CLONE_RANGE replay for ZVOLs. This is a logical follow up to #15603. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15612	2024-01-08 16:11:39 -08:00
Alexander Motin	e48195c816	ZIO: Add overflow checks for linear buffers Since we use a limited set of kmem caches, quite often we have unused memory after the end of the buffer. Put there up to a 512-byte canary when built with debug to detect buffer overflows at the free time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15553	2024-01-08 16:11:39 -08:00
Alexander Motin	ad47eca195	ZIL: Assert record sizes in different places This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15517	2024-01-08 16:11:39 -08:00
Alexander Motin	2e259c6f00	L2ARC: Restrict write size to 1/4 of the device PR #15457 exposed weird logic in L2ARC write sizing. If it appeared bigger than device size, instead of limiting write it reset all the system-wide tunables to their default. Aside of being excessive, it did not actually help with the problem, still allowing infinite loop to happen. This patch removes the tunables reverting logic, but instead limits L2ARC writes (or at least eviction/trim) to 1/4 of the capacity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15519	2024-01-08 16:11:39 -08:00
Alexander Motin	a8c29a79df	Linux: Reclaim unused spl_kmem_cache_reclaim It is unused for 3 years since #10576. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15507	2024-01-08 16:11:39 -08:00
Alexander Motin	f13593619b	FreeBSD: Optimize large kstat outputs - Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl output. - Use much faster sbuf_cat() instead of sbuf_printf("%s"). Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes to seconds, making dbufstat almost usable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15495	2024-01-08 16:11:39 -08:00
Alan Somers	c34fe8dcbc	Update the kstat dataset_name when renaming a zvol Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15482 Closes #15486	2024-01-08 16:11:39 -08:00
Alexander Motin	2a59b6bfa9	ABD: Be more assertive in iterators Once we verified the ABDs and asserted the sizes we should never see premature ABDs ends. Assert that and remove extra branches from production builds. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15428	2024-01-08 16:11:39 -08:00
Rob Norris	db2db50e37	spa: make read/write queues configurable We are finding that as customers get larger and faster machines (hundreds of cores, large NVMe-backed pools) they keep hitting relatively low performance ceilings. Our profiling work almost always finds that they're running into bottlenecks on the SPA IO taskqs. Unfortunately there's often little we can advise at that point, because there are very few ways to change behaviour without patching. This commit adds two load-time parameters `zio_taskq_read` and `zio_taskq_write` that can configure the READ and WRITE IO taskqs directly. This achieves two goals: it gives operators (and those that support them) a way to tune things without requiring a custom build of OpenZFS, which is often not possible, and it lets us easily try different config variations in a variety of environments to inform the development of better defaults for these kinds of systems. Because tuning the IO taskqs really requires a fairly deep understanding of how IO in ZFS works, and generally isn't needed without a pretty serious workload and an ability to identify bottlenecks, only minimal documentation is provided. It's expected that anyone using this is going to have the source code there as well. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.	2023-12-22 13:25:07 -08:00
Brian Behlendorf	d530d5d8a5	Linux 6.5 compat: check BLK_OPEN_EXCL is defined On some systems we already have blkdev_get_by_path() with 4 args but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined. The vdev_bdev_mode() function was added to handle this case but there was no generic way to specify exclusive access. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15692	2023-12-21 16:19:48 -08:00
Rob Norris	03b84099d9	linux 6.7 compat: rework shrinker setup for heap allocations 6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accommodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn	2023-12-21 11:03:08 -08:00
Rob Norris	18a9185165	linux 6.7 compat: handle superblock shrinker member change In 6.7 the superblock shrinker member s_shrink has changed from being an embedded struct to a pointer. Detect this, and don't take a reference if it already is one. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn	2023-12-21 11:03:08 -08:00
Rob Norris	3c13601a12	linux 6.7 compat: use inode atime/mtime accessors 6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in `b37f29341` to atime and mtime as well. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn	2023-12-21 11:03:08 -08:00
rmacklem	522414da3b	FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are NFS visible Call vfs_exjail_clone() for mounts created under .zfs/snapshot to fill in the mnt_exjail field for the mount. If this is not done, the snapshots under .zfs/snapshot will not be accessible over NFS. This version has the argument name in vfs.h fixed to match that of the name in spl_vfs.c, although it really does not matter. External-issue: https://reviews.freebsd.org/D42672 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #15563	2023-11-29 14:08:46 -08:00
Alexander Motin	a8c256046b	ZIL: Call brt_pending_add() replaying TX_CLONE_RANGE zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means on actual replay we must take additional references, which would stay in BRT. Without this blocks could be freed prematurely when either original file or its clone are destroyed. I've observed BRT being emptied and the feature being deactivated after ZIL replay completion, which should not have happened. With the patch I see expected stats. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15603	2023-11-29 13:08:25 -08:00
Alan Somers	349fb77f11	FreeBSD: Fix the build on FreeBSD 12 It was broken for several reasons: * VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in. * The location of the zlib header moved in 13.0 and 12.1. We can drop support for building on 12.0, which is EoL. * knlist_init lost an argument in 13.0. OpenZFS change `9d0887402b` assumed 13.0 or later. * FreeBSD 13.0 added copy_file_range, and OpenZFS change `67a1b03791` assumed 13.0 or later. Sponsored-by: Axcient Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #15551	2023-11-28 15:19:07 -08:00
Rob N	2a953e0ac9	dmu_buf_will_clone: fix race in transition back to NOFILL Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point it's too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15566 Closes #15526	2023-11-28 12:59:00 -08:00
Rob Norris	d702f86eaf	brt: lift internal definitions into _impl header So that zdb (and others!) can get at the BRT on-disk structures. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15541	2023-11-28 12:56:43 -08:00
Alexander Motin	56a2a0981e	ZIL: Do not encrypt block pointers in lr_clone_range_t In case of crash cloned blocks need to be claimed on pool import. It is only possible if they (lr_bps) and their count (lr_nbps) are not encrypted but only authenticated, similar to block pointer in lr_write_t. Few other fields can be and are still encrypted. This should fix panic on ZIL claim after crash when block cloning is actively used. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513	2023-11-28 11:17:52 -08:00
Rob N	9b9b09f452	dnode_is_dirty: check dnode and its data for dirtiness Over its history the dirty dnode test has been changed between checking for a dnode being on `os_dirty_dnodes` (`dn_dirty_link`) and `dn_dirty_record`. `de198f2d9` Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency `2531ce372` Revert "Report holes when there are only metadata changes" `ec4f9b8f3` Report holes when there are only metadata changes `454365bba` Fix dirty check in dmu_offset_next() `66aca2473` SEEK_HOLE should not block on txg_wait_synced() Also illumos/illumos-gate@c543ec060d illumos/illumos-gate@2bcf0248e9 It turns out both are actually required. In the case of appending data to a newly created file, the dnode proper is dirtied (at least to change the blocksize) and dirty records are added. Thus, a single logical operation is represented by separate dirty indicators, and must not be separated. The incorrect dirty check becomes a problem when the first block of a file is being appended to while another process is calling lseek to skip holes. There is a small window where the dnode part is undirtied while there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)` would not know that the file is dirty, and would go to `dnode_next_offset()`. Since the object has no data blocks yet, it returns `ESRCH`, indicating no data found, which results in `ENXIO` being returned to `lseek()`'s caller. Since coreutils 9.2, `cp` performs sparse copies by default, that is, it uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to replicate the holes in the target. When it hits the bug, its initial search for data fails, and it goes on to call `fallocate()` to create a hole over the entire destination file. This has come up more recently as users upgrade their systems, getting OpenZFS 2.2 as well as a newer coreutils. However, this problem has been reproduced against 2.1, as well as on FreeBSD 13 and 14. This change simply updates the dirty check to check both types of dirty. If there's anything dirty at all, we immediately go to the "wait for sync" stage. It doesn't really matter after that; both changes are on disk, so the dirty fields should be correct. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15571 Closes #15526	2023-11-28 09:15:48 -08:00
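
Commit `7466e09a49` in the table above adds a fallback `strlcpy` for kernels that have dropped it. For readers unfamiliar with the semantics, here is a minimal sketch of what such a fallback usually has to do: copy at most `size - 1` bytes, always NUL-terminate when `size` is non-zero, and return the full length of the source so callers can detect truncation. The name `example_strlcpy` is hypothetical and this is not the code from the OpenZFS commit.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustrative strlcpy-style copy. The function name is an assumption,
 * chosen to avoid clashing with a strlcpy provided by libc.
 */
static size_t
example_strlcpy(char *dst, const char *src, size_t size)
{
	size_t srclen = strlen(src);

	if (size > 0) {
		/* Leave room for the terminating NUL. */
		size_t n = (srclen >= size) ? size - 1 : srclen;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}

	/* A return value >= size tells the caller the copy was truncated. */
	return (srclen);
}
```

With a 4-byte destination and the source `"hello"`, the buffer ends up holding `"hel"` and the function returns 5, which is how a caller can tell that truncation happened.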

4346 Commits