Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Rob Norris	4429ad9276	libzpool: set thread names Arrange for the thread/task name to be set when new threads are created. This makes them visible in the process table etc. pthread_setname_np() is generally available in glibc, musl and FreeBSD, so no test is required. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16140	2024-05-01 10:51:44 -07:00
Don Brady	c3f2f1aa2d	vdev probe to slow disk can stall mmp write checker Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #15839	2024-04-29 14:35:53 -07:00
George Wilson	c183d164aa	Parallel pool import This commit allow spa_load() to drop the spa_namespace_lock so that imports can happen concurrently. Prior to dropping the spa_namespace_lock, the import logic will set the spa_load_thread value to track the thread which is doing the import. Consumers of spa_lookup() retain the same behavior by blocking when either a thread is holding the spa_namespace_lock or the spa_load_thread value is set. This will ensure that critical concurrent operations cannot take place while a pool is being imported. The zpool command is also enhanced to provide multi-threaded support when invoking zpool import -a. Lastly, zinject provides a mechanism to insert artificial delays when importing a pool and new zfs tests are added to verify parallel import functionality. Contributions-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #16093	2024-04-22 09:42:38 -07:00
Rob Norris	d7605ae77b	zio: rename ZIO_TYPE_IOCTL to ZIO_TYPE_FLUSH The only possible ioctl is a flush, and any other kind of meta-operation introduced in the future is likely to have different semantics (much like trim did). So, lets just call it what it is. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:17:23 -07:00
Rob Norris	b613709c46	dkio: remove kernel dkio.h compatibility header Without DKIOCFLUSHWRITECACHE, we no longer need the compat header. Note that we're keeping the userspace SPL compat header, which is used by libefi. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:17:18 -07:00
Rob Norris	c9c838aa1f	zio: remove io_cmd and DKIOCFLUSHWRITECACHE There's no other options, so we can just always assume its a flush. Includes some light refactoring where a switch statement was doing control flow that no longer works. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:17:11 -07:00
Rob Norris	cac416f106	zio: remove zio_ioctl() It only had one user, zio_flush(), and there are no other vdev ioctls anyway. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:16:46 -07:00
Alexander Motin	997f85b4d3	L2ARC: Relax locking during write Previous code held ARC state sublist lock throughout all L2ARC write process, which included number of allocations and even ZIO issues. Being blocked in any of those places the code could also block ARC eviction, that could cause OOM activation or even dead- lock if system is low on memory or one is too fragmented. Fix it by dropping the lock as soon as we see a block eligible for L2ARC writing and pick it up later using earlier inserted marker. While there, also reduce scope of hash lock, moving ZIO allocation and other operations not requiring header access out of it. All operations requiring header access move under hash lock, since L2_WRITING flag does not prevent header eviction only transition to arc_l2c_only state with L1 header. To be able to manipulate sublist lock and marker as needed add few more multilist functions and modify one. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16040	2024-04-09 16:23:19 -07:00
Alexander Motin	5e5fd0a178	Speculative prefetch for reordered requests Before this change speculative prefetcher was able to detect a stream only if all of its accesses are perfectly sequential. It was easy to implement and is perfectly fine for single-threaded applications. Unfortunately multi-threaded network servers, such as iSCSI, SMB or NFS usually have plenty of threads and may often reorder requests, preventing successful speculation and prefetch. This change allows speculative prefetcher to detect streams even if requests are reordered by introducing a list of 9 non-contiguous ranges up to 16MB ahead of current stream position and filling the gaps as more requests arrive. It also allows stream to proceed even with holes up to a certain configurable threshold (25%). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16022	2024-04-08 15:13:27 -07:00
Rob N	76d1dde94c	zinject: inject device errors into ioctls Adds 'ioctl' as a valid IO type for device error injection, so we can simulate a flush error (which OpenZFS currently ignores, but that's by the by). To support this, adding ZIO_STAGE_VDEV_IO_DONE to ZIO_IOCTL_PIPELINE, since that's where device error injection happens. This needs a small exclusion to avoid the vdev_queue, since flushes are not queued, and I'm assuming that the various failure responses are still reasonable for flush failures (probes, media change, etc). This seems reasonable to me, as a flush failure is not unlike a write failure in this regard, however this may be too aggressive or subtle to assume in just this change. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16061	2024-04-08 11:59:04 -07:00
Alek P	ea2862cdda	vdev props comment and manpage should include zfsd and FreeBSD mentions Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #15968	2024-04-03 17:56:34 -07:00
Rob N	b21b967bd5	zap_leaf: make l_hash[] variable length to silence UBSAN When UBSAN is active and OpenZFS is a debug build, the l_hash assert at the bottom of zap_open_leaf() causes UBSAN to complain. This follows the example in `786641dcf` to shut it up. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15964	2024-04-03 16:38:18 -07:00
George Wilson	b1e46f869e	Add ashift validation when adding devices to a pool Currently, zpool add allows users to add top-level vdevs that have different ashifts but doing so prevents users from being able to perform a top-level vdev removal. Often times consumers may not realize that they have mismatched ashifts until the top-level removal fails. This feature adds ashift validation to the zpool add command and will fail the operation if the sector size of the specified vdev does not match the existing pool. This behavior can be disabled by using the -f flag. In addition, new flags have been added to provide fine-grained control to disable specific checks. These flags are: --allow-in-use --allow-ashift-mismatch --allow-replicaton-mismatch The force flag will disable all of these checks. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mmaybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #15509	2024-03-29 13:15:56 -06:00
Rob Norris	f3b85d706b	vdev_disk: rename existing functions to vdev_classic_* This is just renaming the existing functions we're about to replace and grouping them together to make the next commits easier to follow. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588	2024-03-25 16:50:47 -07:00
Rob Norris	390b448726	abd: add page iterator The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds adb_iterate_page_func, which yields unmapped struct page instead. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588	2024-03-25 16:50:35 -07:00
George Wilson	493fcce9be	Provide macros for setting and getting blkptr birth times There exist a couple of macros that are used to update the blkptr birth times but they can often be confusing. For example, the BP_PHYSICAL_BIRTH() macro will provide either the physical birth time if it is set or else return back the logical birth time. The complement to this macro is BP_SET_BIRTH() which will set the logical birth time and set the physical birth time if they are not the same. Consumers may get confused when they are trying to get the physical birth time and use the BP_PHYSICAL_BIRTH() macro only to find out that the logical birth time is what is actually returned. This change cleans up these macros and makes them symmetrical. The same functionally is preserved but the name is changed. Instead of calling BP_PHYSICAL_BIRTH(), consumer can now call BP_GET_BIRTH(). In additional to cleaning up this naming conventions, two new sets of macros are introduced -- BP_[SET\|GET]_LOGICAL_BIRTH() and BP_[SET\|GET]_PHYSICAL_BIRTH. These new macros allow the consumer to get and set the specific birth time. As part of the cleanup, the unused GRID macros have been removed and that portion of the blkptr are currently unused. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #15962	2024-03-25 15:01:54 -07:00
Alexander Motin	80cc516295	ZAP: Massively switch to _by_dnode() interfaces Before this change ZAP called dnode_hold() for almost every block access, that was clearly visible in profiler under heavy load, such as BRT. This patch makes it always hold the dnode reference between zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode operations between those. It also adds several new _by_dnode() APIs to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode() variant and uses it in the ZAP code. After this there remains only one call to dmu_buf_dnode_enter(), which seems to be unneeded. So remove the call and the functions. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15951	2024-03-25 14:58:50 -07:00
Alexander Motin	c28f94f32e	ZAP: Some cleanups/micro-optimizations - Remove custom zap_memset(), use regular memset(). - Use PANIC() instead of opaque cmn_err(CE_PANIC). - Provide entry parameter to zap_leaf_rehash_entry(). - Reduce branching in zap_leaf_array_create() inner loop. - Remove signedness where it should not be. Should be no function changes. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15976	2024-03-21 16:43:53 -07:00
Rob N	5c4a4f82c8	zio: update ZIO type x stage documentation - add column for TRIM ZIOs - remove R from ZIO_STAGE_ISSUE_ASYNC, never happened - remove I from ZIO_STAGE_VDEV_IO_DONE, never happened Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15959	2024-03-21 12:10:04 -07:00
Rob Norris	8f2f6cd2ac	ddt: reduce DDT_NAMELEN This is the buffer size passed to ddt_object_name(), to expand the DMU_POOL_DDT format. That format inserts the table checksum, class and type names, which as I write this are max 6, 9 and 3, respectively. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15908	2024-02-26 12:24:22 -08:00
Rob Norris	5720b00632	ddt: document the theory and the key data structures Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:46:00 -08:00
Rob Norris	406562c563	ddt: simplify entry load and flags Only a single bit is needed to track entry state, and definitely not two whole bytes. Some light refactoring in ddt_lookup() is needed to support this, but it reads a lot better now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:50 -08:00
Rob Norris	2cffddd405	ddt: remove ddt_node Nothing uses it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:44 -08:00
Rob Norris	9029278dde	ddt: rework ops interface in terms of keys and values Store objects store keys and values, so have them take those types and nothing more. This way, they don't need to be concerned about the "kind" of entry being operated on; the dispatch layer can take care of the appropriate conversions. This adds a "contains" op to see if a particular entry exists without loading it, which makes a couple of things easier to do; in particular, it allows us to avoid an allocation in ddt_class_contains(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:38 -08:00
Rob Norris	3bad70040a	ddt: remove struct names and forward declarations Things get confused when there's more than one name for a thing. Note that we don't do this for ddt_object_t, ddt_histogram_t and ddt_stat_t because they're part of the public ZFS interface. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:24 -08:00
Rob Norris	c8f694fe39	ddt: typedef ddt_type and ddt_class Mostly for consistency, so the reader is less likely to wonder why these things look different. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:19 -08:00
Rob Norris	8e414fcdf4	ddt: split internal DDT API into separate header Just to make it easier to know which bits to pay attention to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:15 -08:00
Rob Norris	909006049f	ddt: remove DDE_GET_NDVAS macro It was a weird and confusing name, because it wasn't actually returning the number of DVAs in the entry (as in, in the value/phys part) but the maximum number of possible DVAs in a BP generated from the entry, based on the encrypt bit in the key. This is unlike the similarly named BP_GET_NDVAS, which really does return the number of DVAs. Since its only used in this one place, and for a specific purpose, it seemed more sensible to just write it in-place and remove the name. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:10 -08:00
Rob Norris	5973854153	ddt: lift dedup stats out to separate file We want to add other kinds of dedup-related objects and keep stats for them. This makes those functions easier to use from outside ddt.c. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:05 -08:00
Rob Norris	0cb1ef60ae	ddt: compare keys, not entries We're about to have different kinds of things that we'll compare on key, so generalise this function to support that. (It actually worked fine because of the way the casts work out, but it requires the key to be at the start of the object so the cast through ddt_entry_t works, and even then it reads strangely for anything that's not a ddt_entry_t). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:45:00 -08:00
Rob Norris	86e91c030c	ddt: move entry compression into ddt_zap I think I can say with some confidence that anyone making a new storage type in 2023 is doing their own thing with compression, not this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15887	2024-02-15 11:44:47 -08:00
Don Brady	cbe882298e	Add slow disk diagnosis to ZED Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, like is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties: VDEV_PROP_SLOW_IO_N VDEV_PROP_SLOW_IO_T If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade(). Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #15469	2024-02-08 09:19:52 -08:00
Brian Behlendorf	6dccdf501e	BRT: Fix FICLONE/FICLONERANGE shortened copy On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never\|auto\|always` works correctly. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15728 Closes #15842	2024-02-05 16:44:45 -08:00
Ameer Hamza	d9885b3776	fix: Uber block label not always found for aux vdevs When spare or l2cache (aux) vdev is added during pool creation, spa->spa_uberblock is not dumped until that point. Subsequently, the aux label is never synchronized after its initial creation, resulting in the uberblock label remaining undumped. The uberblock is crucial for lib_blkid in identifying the ZFS partition type. To address this issue, we now ensure sync of the uberblock label once if it's not dumped initially. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15737	2024-01-16 13:17:14 -08:00
Alexander Motin	eff77a802d	ZIL: Improve next log block size prediction Track history in context of bursts, not individual log blocks. It allows to not blow away all the history by single large burst of many block, and same time allows optimizations covering multiple blocks in a burst and even predicted following burst. For each burst account its optimal block size and minimal first block size. Use that statistics from the last 8 bursts to predict first block size of the next burst. Remove predefined set of block sizes. Allocate any size we see fit, multiple of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise space allocator any more. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15635	2023-12-21 10:54:44 -08:00
Alexander Motin	9b1677fb5a	dmu: Allow buffer fills to fail When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance. This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665	2023-12-15 09:51:41 -08:00
Chunwei Chen	a9b937e066	For db_marker inherit the db pointer for AVL comparision. While evicting dbufs of a dnode, a marker node is added to the AVL. The marker node should be inserted in AVL tree ahead of the dbuf its trying to delete. The blkid and level is used to ensure this. However, this could go wrong there's another dbufs with the same blkid and level in DB_EVICTING state but not yet removed from AVL tree. dbuf_compare() could fail to give the right location or could cause confusion and trigger ASSERTs. To ensure that the marker is inserted before the deleting dbuf, use the pointer value of the original dbuf for comparision. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sanjeev Bagewadi <sanjeev.bagewadi@nutanix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #12482 Closes #15643	2023-12-11 14:42:06 -08:00
Don Brady	687e4d7f9c	Extend import_progress kstat with a notes field Detail the import progress of log spacemaps as they can take a very long time. Also grab the spa_note() messages to, as they provide insight into what is happening Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Closes #15539	2023-12-05 14:27:56 -08:00
oromenahar	c7b6119268	Allow block cloning across encrypted datasets When two datasets share the same master encryption key, it is safe to clone encrypted blocks. Currently only snapshots and clones of a dataset share with it the same encryption key. Added a test for: - Clone from encrypted sibling to encrypted sibling with non encrypted parent - Clone from encrypted parent to inherited encrypted child - Clone from child to sibling with encrypted parent - Clone from snapshot to the original datasets - Clone from foreign snapshot to a foreign dataset - Cloning from non-encrypted to encrypted datasets - Cloning from encrypted to non-encrypted datasets Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15544	2023-12-05 11:03:48 -08:00
Alexander Motin	a03ebd9bee	ZIL: Call brt_pending_add() replaying TX_CLONE_RANGE zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means on actual replay we must take additional references, which would stay in BRT. Without this blocks could be freed prematurely when either original file or its clone are destroyed. I've observed BRT being emptied and the feature being deactivated after ZIL replay completion, which should not have happened. With the patch I see expected stats. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15603	2023-11-29 10:51:34 -08:00
Rob Norris	803a9c12c9	brt: lift internal definitions into _impl header So that zdb (and others!) can get at the BRT on-disk structures. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15541	2023-11-27 13:34:43 -08:00
Rob N	92dc4ad83d	Consider `dnode_t` allocations in dbuf cache size accounting Entries in the dbuf cache contribute only the size of the dbuf data to the cache size. Attached "user" data is not counted. This can lead to the data currently "owned" by the cache consuming more memory accounting appears to show. In some cases (eg a metadnode data block with all child dnode_t slots allocated), the actual size can be as much as 3x as what the cache believes it to be. This is arguably correct behaviour, as the cache is only tracking the size of the dbuf data, not even the overhead of the dbuf_t. On the other hand, in the above case of dnodes, evicting cached metadnode dbufs is the only current way to reclaim the dnode objects, and can lead to the situation where the dbuf cache appears to be comfortably within its target memory window and yet is holding enormous amounts of slab memory that cannot be reclaimed. This commit adds a facility for a dbuf user to artificially inflate the apparent size of the dbuf for caching purposes. This at least allows for cache tuning to be adjusted to match something closer to the real memory overhead. metadnode dbufs carry a >1KiB allocation per dnode in their user data. This informs the dbuf cache machinery of that fact, allowing it to make better decisions when evicting dbufs. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15511	2023-11-17 13:25:53 -08:00
Tony Hutter	786641dcf9	Workaround UBSAN errors for variable arrays This gets around UBSAN errors when using arrays at the end of structs. It converts some zero-length arrays to variable length arrays and disables UBSAN checking on certain modules. It is based off of the patch from #15460. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Co-authored-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #15145 Closes #15510	2023-11-12 16:26:07 -08:00
Don Brady	5caeef02fa	RAID-Z expansion feature This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks). == Initiating expansion == A new device (disk) can be attached to an existing RAIDZ vdev, by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes. The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software. == During expansion == The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device). The expansion progress can be monitored with `zpool status`. Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete). The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off. == After expansion == When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc). Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion). A RAIDZ vdev can be expanded multiple times. After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools. Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Authored-by: Matthew Ahrens <mahrens@delphix.com> Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com> Contributions-by: Stuart Maybee <stuart.maybee@comcast.net> Contributions-by: Thorsten Behrens <tbehrens@outlook.com> Contributions-by: Fmstrat <nospam@nowsci.com> Contributions-by: Don Brady <dev.fs.zfs@gmail.com> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com> Closes #15022	2023-11-08 10:19:41 -08:00
Alan Somers	e36ff84c33	Update the kstat dataset_name when renaming a zvol Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15482 Closes #15486	2023-11-07 11:34:50 -08:00
MigeljanImeri	2a154b8484	Fix accounting error for pending sync IO ops in zpool iostat Currently vdev_queue_class_length is responsible for checking how long the queue length is, however, it doesn't check the length when a list is used, rather it just returns whether it is empty or not. To fix this I added a counter variable to vdev_queue_class to keep track of the sync IO ops, and changed vdev_queue_class_length to reference this variable instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com> Closes #15478	2023-11-07 09:06:14 -08:00
ednadolski-ix	3bd4df3841	Improve ZFS objset sync parallelism As part of transaction group commit, dsl_pool_sync() sequentially calls dsl_dataset_sync() for each dirty dataset, which subsequently calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU cores to run sync_dnodes_task() in taskq threads to sync the dirty dnodes (files). There are two problems: 1. Each ZVOL in a pool is a separate dataset/objset having a single dnode. This means the objsets are synchronized serially, which leads to a bottleneck of ~330K blocks written per second per pool. 2. In the case of multiple dirty dnodes/files on a dataset/objset on a big system they will be sync'd in parallel taskq threads. However, it is inefficient to to use 75% of CPU cores of a big system to do that, because of (a) bottlenecks on a single write issue taskq, and (b) allocation throttling. In addition, if not for the allocation throttling sorting write requests by bookmarks (logical address), writes for different files may reach space allocators interleaved, leading to unwanted fragmentation. The solution to both problems is to always sync no more and (if possible) no fewer dnodes at the same time than there are allocators the pool. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Closes #15197	2023-11-06 10:38:42 -08:00
Ameer Hamza	9ccdb8becd	zvol: fix delayed update to block device ro entry The change in the zvol readonly property does not update the block device readonly entry until the first IO to the ZVOL. This patch addresses the issue by updating the block device readonly property from the set property IOCTL call. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:50:38 -07:00
Ameer Hamza	60387facd2	zvol: Implement zvol threading as a Property Currently, zvol threading can be switched through the zvol_request_sync module parameter system-wide. By making it a zvol property, zvol threading can be switched per zvol. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:50:32 -07:00
Ameer Hamza	dbe839a9ca	zvol: Cleanup set property zvol_set_volmode() and zvol_set_snapdev() share a common code path. Merge this shared code path into zvol_set_common(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:49:32 -07:00
Alexander Motin	799e09f75a	Unify arc_prune_async() code There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15456	2023-10-30 16:56:04 -07:00
Alexander Motin	05a7348a7e	RAIDZ: Use cache blocking during parity math RAIDZ parity is calculated by adding data one column at a time. It works OK for small blocks, but for large blocks results of previous addition may already be evicted from CPU caches to main memory, and in addition to extra memory write require extra read to get it back. This patch splits large parity operations into 64KB chunks, that should in most cases fit into CPU L2 caches from the last decade. I haven't touched more complicated cases of data reconstruction to not over complicate the code. Those should be relatively rare. My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/ RAIDZ2 blocks of ~4MB and up. Older CPUs with 256KB of L2 cache should see the effect even on smaller blocks. Wider vdevs may need bigger blocks to be affected. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15448	2023-10-30 14:54:27 -07:00
Alexander Motin	c3773de168	ZIL: Cleanup sync and commit handling ZVOL: - Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only one object, it makes no sense to maintain async queue and on each commit merge it into sync. Single sync queue is just cheaper, while it changes nothing until actual commit request arrives. - Remove zsd_sync_cnt and the zil_async_to_sync() calls since we are no longer switching between sync and async queues. ZFS: - Mark write transactions as sync based only on number of sync opens (z_sync_cnt). We can not randomly jump between sync and async unless we want data corruptions due to writes reordering. - When file first opened with O_SYNC (z_sync_cnt incremented to 1) call zil_async_to_sync() for it to preserve correct ordering between past and future writes. - Drop zfs_fsyncer_key logic. Looks like it was an optimization for workloads heavily intermixing async writes with tons of fsyncs. But first it was broken 8 years ago due to Linux tsd implementation not allowing data storage between syscalls, and second, I doubt it is safe to switch from async to sync so often and without calling zil_async_to_sync(). - Rename sync argument of *_log_write() into commit, now only signalling caller's intent to call zil_commit() soon after. It allows WR_COPIED optimizations without extra other meanings. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15366	2023-10-30 14:51:56 -07:00
Thomas Bertschinger	97a0b5be50	Add mutex_enter_interruptible() for interruptible sleeping IOCTLs Many long-running ZFS ioctls lock the spa_namespace_lock, forcing concurrent ioctls to sleep for the mutex. Previously, the only option is to call mutex_enter() which sleeps uninterruptibly. This is a usability issue for sysadmins, for example, if the admin runs `zpool status` while a slow `zpool import` is ongoing, the admin's shell will be locked in uninterruptible sleep for a long time. This patch resolves this admin usability issue by introducing mutex_enter_interruptible() which sleeps interruptibly while waiting to acquire a lock. It is implemented for both Linux and FreeBSD. The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to use this new macro so that the command can be interrupted if it is issued during a concurrent `zpool import` (or other long-running operation). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Thomas Bertschinger <bertschinger@lanl.gov> Closes #15360	2023-10-26 09:17:40 -07:00
Alexander Motin	3afdc97d91	ZIO: Remove READY pipeline stage from root ZIOs zio_root() has no arguments for ready callback or parent ZIO. Except one recent case in ZIL code if root ZIOs ever have a parent it is also a root ZIO. It means we do not need READY pipeline stage for them, which takes some time to process, but even more time to wait for the children and be woken by them, and both for no good reason. The most visible effect of this change is that it avoids one taskq wakeup per ZIL block written, previously used to run zio_ready() for lwb_root_zio and skipped now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15398	2023-10-25 15:22:25 -07:00
Alexander Motin	252f46be7d	ZIL: Detect single-threaded workloads ... by checking that previous block is fully written and flushed. It allows to skip commit delays since we can give up on aggregation in that case. This removes zil_min_commit_timeout parameter, since for single-threaded workloads it is not needed at all, while on very fast devices even some multi-threaded workloads may get detected as single-threaded and still bypass the wait. To give multi-threaded workloads more aggregation chances increase zfs_commit_timeout_pct from 5 to 10%, as they should suffer less from additional latency. Also single-threaded workloads detection allows in perspective better prediction of the next block size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15381	2023-10-24 14:35:25 -07:00
Brian Behlendorf	07345ac252	Add prefetch property ZFS prefetch is currently governed by the zfs_prefetch_disable tunable. However, this is a module-wide settings - if a specific dataset benefits from prefetch, while others have issue with it, an optimal solution does not exists. This commit introduce the "prefetch" tri-state property, which enable granular control (at dataset/volume level) for prefetching. This patch does not remove the zfs_prefetch_disable, which remains a system-wide switch for enable/disable prefetch. However, to avoid duplication, it would be preferable to deprecate and then remove the module tunable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Co-authored-by: Gionatan Danti <g.danti@assyoma.it> Closes #15237 Closes #15436	2023-10-24 11:00:07 -07:00
Alexander Motin	b29e98fa8d	Properly pad struct tx_cpu to cache line We already use ____cacheline_aligned in many places, so add one more instead of seems arbitrary char tc_pad[8]. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15402	2023-10-20 11:54:05 -07:00
Alexander Motin	008baa091f	FreeBSD: Reduce divergence from in-tree sources This includes random small tweaks, primarily a build fixes, required when ZFS is built as part of FreeBSD base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15368	2023-10-09 13:27:18 -07:00
Alexander Motin	342357cd9e	Reduce number of metaslab preload taskq threads. Before this change ZFS created threads for 50% of CPUs for each top- level vdev. Plus it created the same number of threads for embedded log groups (that have only one metaslab and don't need any preload). As result, on system with 80 CPUs and pool of 60 vdevs this resulted in 4800 metaslab preload threads, that is absolutely insane. This patch changes the preload threads to 50% of CPUs in one taskq per pool, so on the mentioned system it will be only 40 threads. Among other things this fixes zdb on the mentioned system and pool on FreeBSD, that failed to create so many threads in one process. Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15319	2023-10-06 09:04:00 -07:00
Alexander Motin	96b9cf42e0	ARC: Remove b_bufcnt/b_ebufcnt from ARC headers In most cases we do not care about exact number of buffers linked to the header, we just need to know if it is zero, non-zero or one. That can easily be checked just looking on b_buf pointer or in some cases derefencing it. b_ebufcnt is read only once, and in that case we already traverse the list as part of arc_buf_remove(), so second traverse should not be expensive. This reduces L1 ARC header size by 8 bytes and full crypto header by 16 bytes, down to 176 and 232 bytes on FreeBSD respectively. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15350	2023-10-06 08:56:17 -07:00
Alexander Motin	2a6c62109c	ARC: Remove b_cv from struct l1arc_buf_hdr Earlier as part of #14123 I've removed one use of b_cv. This patch reuses the same approach to remove the other one from much more rare code path. This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from 200 to 184 bytes) and seems even more on Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15340	2023-10-04 14:45:00 -07:00
ednadolski-ix	95f71c019d	Selectable block allocators ZFS historically has had several space allocators that were dynamically selectable. While these have been retained in OpenZFS, only a single allocator has been statically compiled in. This patch compiles all allocators for OpenZFS and provides a module parameter to allow for manual selection between them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Closes #15218	2023-09-01 18:00:30 -07:00
Paul Dagnelie	bee9cfb813	Increase limit of redaction list by using spill block Currently redaction bookmarks and their associated redaction lists have a relatively low limit of 36 redaction snapshots. This is imposed by the number of snapshot GUIDs that fit in the bonus buffer of the redaction list object. While this is more than enough for most use cases, there are some limited cases where larger numbers would be useful to support. We tweak the redaction list creation code to use a spill block if the number of redaction snapshots is above the amount that would fit in the bonus buffer. We also make a small change to allow spill blocks to be use for types of data besides SA. In order to fully leverage this logic, we also change the redaction code to use vmem_alloc, to handle extremely large allocations if needed. Finally, small tweaks were made to the zfs commands and the test suite. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #15018	2023-08-26 11:34:43 -07:00
Alexander Motin	eda3fcd56f	ZIL: Second attempt to reduce scope of zl_issuer_lock. The previous patch #14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue #14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15122	2023-08-24 17:08:49 -07:00
Alexander Motin	bdb7df4245	ZIL: Avoid dbuf_read() before dmu_sync(). In most cases dmu_sync() works with dirty records directly and does not need actual data. The only exception is dmu_sync_late_arrival(). To save some CPU time use dmu_buf_hold_noread() in z_get_data() and explicitly call dbuf_read() in dmu_sync_late_arrival(). There is also a chance that by that time TXG will already be synced and we won't have to do it at all. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15153	2023-08-11 09:04:08 -07:00
Alexander Motin	6c94e64963	Refactor dmu_prefetch(). - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15076	2023-08-07 13:54:41 -07:00
Mateusz Piotrowski	a97b8fc2dd	Fix some typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Closes #15141	2023-08-07 13:53:59 -07:00
Rob N	114a39964f	zdb: include cloned blocks in block statistics This gives `zdb -b` support for clone blocks. Previously, it didn't know what clones were, so would count their space allocation multiple times and then report leaked space (or, in debug, would assert trying to claim blocks a second time). This commit fixes those bugs, and reports the number of clones and the space "used" (saved) by them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15123	2023-08-01 08:56:30 -07:00
Alexander Motin	b22bab2547	Remove fastwrite mechanism. Fastwrite was introduced many years ago to improve ZIL writes spread between multiple top-level vdevs by tracking number of allocated but not written blocks and choosing vdev with smaller count. It suposed to reduce ZIL knowledge about allocation, but actually made ZIL to even more actively report allocation code about the allocations, complicating both ZIL and metaslabs code. On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync() was lost many years ago, that was one of the declared benefits. Plus introduction of embedded log metaslab class solved another problem with allocation rotor accounting both normal and log allocations, since in most cases those are now in different metaslab classes. After all that, I'd prefer to simplify already too complicated ZIL, ZIO and metaslab code if the benefit of complexity is not obvious. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15107	2023-07-28 13:30:33 -07:00
Alexander Motin	34b3d498a9	Adjust prefetch parameters. - Reduce maximum prefetch distance for 32bit platforms to 8MB as it was previously. Those systems didn't grow much probably, so better stay conservative there. - Retire array_rd_sz tunable, blocking prefetch for large requests. We should not penalize applications trying to be more efficient. The speculative prefetcher by itself has reasonable distance limits, and 1MB is not much at all these days. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15072	2023-07-21 11:51:47 -07:00
Alexander Motin	28430b51e3	Add explicit prefetches to bpobj_iterate(). To simplify error handling bpobj_iterate_blkptrs() iterates through the list of block pointers backwards. Unfortunately speculative prefetcher is currently unable to detect such patterns, that makes each block read there synchronous and very slow on HDD pools. According to my tests, added explicit prefetch reduces time needed to asynchronously delete 8 snapshots of 4 million blocks each from 20 seconds to less than one, that should free sync thread for other useful work, such as async writes, scrub, etc. While there, plug one memory leak in case of bpobj_open() error and harmonize some variable names. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15071	2023-07-21 11:50:48 -07:00
Alan Somers	6fd87e1d8d	Don't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes #14717 for mirrored pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #14717 Closes #15052	2023-07-21 11:49:26 -07:00
Alan Somers	cf2a225b24	Don't emit checksum histograms in ereport.fs.zfs.checksum events The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15052	2023-07-21 11:48:17 -07:00
Ameer Hamza	d9bb583c25	spa_min_alloc should be GCD, not min Since spa_min_alloc may not be a power of 2, unlike ashifts, in the case of DRAID, we should not select the minimal value among several vdevs. Rounding to a multiple of it is unlikely to work for other vdevs. Instead, using the greatest common divisor produces smaller yet more reasonable results. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15067	2023-07-20 10:23:52 -07:00
Ameer Hamza	4d2dad04aa	Ignore pool ashift property during vdev attachment Ashift can be set for a vdev only during its creation, and the top-level vdev does not change when a vdev is attached or replaced. The ashift property should not be used during attachment, as it does not allow attaching/replacing a vdev if the pool's ashift property is increased after the existing vdev was created. Instead, we should be able to attach the vdev if the attached vdev can satisfy the ashift requirement with its parent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15061	2023-07-20 09:57:16 -07:00
Yuri Pankov	8beabfd3bf	set autotrim default to 'off' everywhere As it turns out having autotrim default to 'on' on FreeBSD never really worked due to mess with defines where userland and kernel module were getting different default values (userland was defaulting to 'off', module was thinking it's 'on'). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15079	2023-07-20 09:06:55 -07:00
Brian Behlendorf	ac8ae18d22	Revert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE" This reverts commit `77a3bb1f47`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2023-06-30 10:03:46 -07:00
Alexander Motin	b4a0873092	Some ZIO micro-optimizations. - Pack struct zio_prop by 4 bytes from 84 to 80. - Skip new child ZIO locking while linking to parent. The newly allocated ZIO is not externally visible yet, so nobody should care. - Skip io_bp_copy writes when not used (write && non-debug). Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14985	2023-06-30 08:54:00 -07:00
Alexander Motin	fa7b2390d4	Do not report bytes skipped by scan as issued. Scan process may skip blocks based on their birth time, DVA, etc. Traditionally those blocks were accounted as issued, that caused reporting of hugely over-inflated numbers, having nothing to do with actual disk I/O. This change utilizes never used field in struct dsl_scan_phys to account such skipped bytes, allowing to report how much data were actually scrubbed/resilvered and what is the actual I/O speed. While formally it is an on-disk format change, it should be compatible both ways, so should not need a feature flag. This should partially address the same issue as `c85ac731a0`, but from a different perspective, complementing it. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15007	2023-06-30 08:47:13 -07:00
Yuri Pankov	77a3bb1f47	spa.h: use IN_BASE instead of IN_FREEBSD_BASE Consistently get the proper default value for autotrim. Currently, only the kernel module is built with IN_FREEBSD_BASE, and libzfs get the wrong default value, leading to confusion and incorrect output when autotrim value was not set explicitly. Reviewed-by: Warner Losh <imp@bsdimp.com> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15016	2023-06-29 11:50:52 -07:00
Alexander Motin	8469b5aac0	Another set of vdev queue optimizations. Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add io_queue_state field to struct zio. To not check number of queued I/Os for each priority add vq_cqueued bitmap to struct vdev_queue. Update it when adding/removing I/Os. Make vq_cactive a separate array instead of struct vdev_queue_class member. Together those allow to avoid lots of cache misses when looking for work in vdev_queue_class_to_issue(). Introduce deadline of ~0.5s for LBA-sorted queues. Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to slightly more complicate the comparison function, but since it uses all the same cache lines the difference is minimal. For a sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses more simple avl_first(), falling back to avl_find() and avl_nearest() only when needed. Arrange members in struct zio to access only one cache line when searching through vdev queues. While there, remove io_alloc_node, reusing the io_queue_node instead. Those two are never used same time. Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4 years since implemented, while still wasted time maintaining the offset-sorted tree of TRIM requests. Just remove the tree. Remove locking from txg_all_lists_empty(). It is racy by design, while 2 pair of locks/unlocks take noticeable time under the vdev queue lock. With these changes in my tests with volblocksize=4KB I measure vdev queue lock spin time reduction by 50% on read and 75% on write. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14925	2023-06-27 09:09:48 -07:00
Alexander Motin	ccec7fbe1c	Remove ARC/ZIO physdone callbacks. Those callbacks were introduced many years ago as part of a bigger patch to smoothen the write throttling within a txg. They allow to account completion of individual physical writes within a logical one, improving cases when some of physical writes complete much sooner than others, gradually opening the write throttle. Few years after that ZFS got allocation throttling, working on a level of logical writes and limiting number of writes queued to vdevs at any point, and so limiting latency distribution between the physical writes and especially writes of multiple copies. The addition of scheduling deadline I proposed in #14925 should further reduce the latency distribution. Grown memory sizes over the past 10 years should also reduce importance of the smoothing. While the use of physdone callback may still in theory provide some smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by pool-wide lock, in case of 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention. My tests of this patch show radical reduction of the lock spinning time on workloads when smaller blocks are written to RAIDZ pools, when each of the disks receives 8-16KB chunks, but the total rate reaching 100K+ blocks per second. Same time attempts to measure any write time fluctuations didn't show anything noticeable. While there, remove also io_child_count/io_parent_count counters. They are used only for couple assertions that can be avoided. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14948	2023-06-15 10:49:03 -07:00
Alexander Motin	d057807ede	Switch refcount tracking from lists to AVL-trees. With large number of tracked references list searches under the lock become too expensive, creating enormous lock contention. On my tests with ZFS_DEBUG enabled this increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14970	2023-06-14 08:02:27 -07:00
Alexander Motin	70ea484e3e	Finally drop long disabled vdev cache. It was a vdev level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to make more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, I/O scheduler, I/O aggregation etc. Besides just the dead code removal this removes one extra mutex lock/unlock per write inside vdev_cache_write(), not otherwise disabled and trying to do some work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14953	2023-06-09 12:40:55 -07:00
Rob Norris	2b9f8ba673	znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14642	2023-06-05 11:53:44 -07:00
Alexander Motin	5ba4025a8d	Introduce zfs_refcount_(add\|remove)_few(). There are two places where we need to add/remove several references with semantics of zfs_refcount_(add\|remove). But when debug/tracing is disabled, it is a crime to run multiple atomic_inc() in a loop, especially under congested pool-wide allocator lock. Introduced new functions implement the same semantics as the loop, but without overhead in production builds. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14934	2023-06-05 11:51:44 -07:00
Richard Yao	0f03a41161	Use __attribute__((malloc)) on memory allocation functions This informs the C compiler that pointers returned from these functions do not alias other functions, which allows it to do better code optimization and should make the compiled code smaller. References: https://stackoverflow.com/a/53654773 https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute https://clang.llvm.org/docs/AttributeReference.html#malloc Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14827	2023-05-26 15:47:52 -07:00
Richard Yao	677c6f8457	btree: Implement faster binary search algorithm This implements a binary search algorithm for B-Trees that reduces branching to the absolute minimum necessary for a binary search algorithm. It also enables the compiler to inline the comparator to ensure that the only slowdown when doing binary search is from waiting for memory accesses. Additionally, it instructs the compiler to unroll the loop, which gives an additional 40% improve with Clang and 8% improvement with GCC. Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm. Micro-benchmarks suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14866	2023-05-26 10:03:12 -07:00
Alexander Motin	b6fbe61fa6	zil: Add some more statistics. In addition to a number of actual log bytes written, account also a total written bytes including padding and total allocated bytes (bytes <= write <= alloc). It should allow to monitor zil traffic and space efficiency. Add dtrace probe for zil block size selection. Make zilstat report more information and fit it into less width. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14863	2023-05-25 13:51:53 -07:00
Alexander Motin	f63811f072	ZIL: Reduce scope of per-dataset zl_issuer_lock. Before this change ZIL copied all log data while holding the lock. It caused huge lock contention on workloads with many big parallel writes. This change splits the process into two parts: first, zil_lwb_assign() estimates the log space needed for all transactions, and zil_lwb_write_close() allocates blocks and zios while holding the lock, then, after the lock in dropped, zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues the I/Os. Also while there slightly reduce scope of zl_lock. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14841	2023-05-25 09:48:43 -07:00
George Amanakis	482eeef804	Teach zpool scrub to scrub only blocks in error log Added a flag '-e' in zpool scrub to scrub only blocks in error log. A user can pause, resume and cancel the error scrub by passing additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. Error scrubbing is executed in multiple txg to make sure pool performance is not affected. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain tulsi.jain@delphix.com Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #8995 Closes #12355	2023-05-18 11:59:42 -07:00
Brian Behlendorf	e34e15ed6d	Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12451 Closes #14873	2023-05-18 10:02:20 -07:00
Pawel Jakub Dawidek	d0d91f185e	Don't use dmu_buf_is_dirty() for unassigned transaction. The dmu_buf_is_dirty() call doesn't make sense here for two reasons: 1. txg is 0 for unassigned tx, so it was a no-op. 2. It is equivalent of checking if we have dirty records and we are doing this few lines earlier. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:06:57 -07:00
Pawel Jakub Dawidek	bd8c6bd66f	Deny block cloning is dbuf size doesn't match BP size. I don't know an easy way to shrink down dbuf size, so just deny block cloning into dbufs that don't match our BP's size. This fixes the following situation: 1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB. 2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB. 3. Truncate the large file to 0. Its dbuf will remain 2kB. 4. Clone the small file into the large file. Small file's BP lsize is 1kB, but the large file's dbuf is 2kB. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:06:52 -07:00
Pawel Jakub Dawidek	555ef90c5c	Additional block cloning fixes. Reimplement some of the block cloning vs dbuf logic, mostly to fix situation where we clone a block and in the same transaction group we want to partially overwrite the clone. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:06:36 -07:00
Brian Behlendorf	903c3613d4	Add dmu_tx_hold_append() interface Provides an interface which callers can use to declare a write when the exact starting offset in not yet known. Since the full range being updated is not available only the first L0 block at the provided offset will be prefetched. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14819	2023-05-09 09:03:10 -07:00
George Amanakis	6839ec6f10	Enable the head_errlog feature to remove errors In case check_filesystem() does not error out and does not report an error, remove that error block from error lists and logs without requiring a scrub. This can happen when the original file and all snapshots/clones referencing it have been removed. Otherwise zpool status will still report that "Permanent errors have been detected..." without actually reporting any of them. To implement this change the functions introduced in corrective receive were modified to take into account the head_errlog feature. Before this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: Permanent errors have been detected in the following files: ============================= After this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: No known data errors ============================= Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14813	2023-05-09 08:53:27 -07:00
Matthew Ahrens	3095ca91c2	Verify block pointers before writing them out If a block pointer is corrupted (but the block containing it checksums correctly, e.g. due to a bug that overwrites random memory), we can often detect it before the block is read, with the `zfs_blkptr_verify()` function, which is used in `arc_read()`, `zio_free()`, etc. However, such corruption is not typically recoverable. To recover from it we would need to detect the memory error before the block pointer is written to disk. This PR verifies BP's that are contained in indirect blocks and dnodes before they are written to disk, in `dbuf_write_ready()`. This way, we'll get a panic before the on-disk data is corrupted. This will help us to diagnose what's causing the corruption, as well as being much easier to recover from. To minimize performance impact, only checks that can be done without holding the spa_config_lock are performed. Additionally, when corruption is detected, the raw words of the block pointer are logged. (Note that `dprintf_bp()` is a no-op by default, but if enabled it is not safe to use with invalid block pointers.) Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14817	2023-05-08 11:20:23 -07:00
Brian Behlendorf	0e8a42bbee	Revert "Fix data race between zil_commit() and zil_suspend()" This reverts commit `4c856fb333` to resolve a newly introduced deadlock which in practice in more disruptive that the issue this commit intended to address. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14775 Closes #14790	2023-04-25 16:40:55 -07:00

1 2 3 4 5 ...

1323 Commits