Commit Graph

3658 Commits

Don Brady f2d25f56f5 vdev probe to slow disk can stall mmp write checker
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.

Also allow zpool clear if no evidence of another host
is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15839
2024-04-30 16:48:01 +10:00
George Wilson dbf9aed3cf Parallel pool import
This commit allows spa_load() to drop the spa_namespace_lock so
that imports can happen concurrently. Prior to dropping the
spa_namespace_lock, the import logic will set the spa_load_thread
value to track the thread which is doing the import.

Consumers of spa_lookup() retain the same behavior by blocking
when either a thread is holding the spa_namespace_lock or the
spa_load_thread value is set. This will ensure that critical
concurrent operations cannot take place while a pool is being
imported.

The zpool command is also enhanced to provide multi-threaded support
when invoking zpool import -a.

Lastly, zinject provides a mechanism to insert artificial delays
when importing a pool, and new zfs tests are added to verify parallel
import functionality.

Contributions-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #16093
2024-04-30 16:47:59 +10:00
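
As a rough illustration of the gating described above, here is a minimal
sketch; spa_load_thread is the field named in the commit, while the
wrapper name and locking details are assumptions, not the actual
spa_misc.c code.

    /* Hedged sketch: block while another thread is importing this pool. */
    static spa_t *
    spa_lookup_blocking(const char *name)
    {
        spa_t *spa;

        mutex_enter(&spa_namespace_lock);
        while ((spa = spa_lookup(name)) != NULL &&
            spa->spa_load_thread != NULL &&
            spa->spa_load_thread != curthread) {
            /* an import owns this spa; wait for it to finish */
            cv_wait(&spa_namespace_cv, &spa_namespace_lock);
        }
        mutex_exit(&spa_namespace_lock);
        return (spa);
    }
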
Don Brady 4296626ab0 Extend import_progress kstat with a notes field
Detail the import progress of log spacemaps as they can take a very
long time.  Also grab the spa_note() messages too, as they provide
insight into what is happening.

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes #15539
2024-04-30 16:33:14 +10:00
Rob Norris 0ab4172e4c import: require force when cachefile hostid doesn't match on-disk
Previously, if a cachefile is passed to zpool import, the cached config
is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and
the results are taken as the canonical pool config and handed back to
ZFS_IOC_POOL_IMPORT.

In the course of its operation, spa_load() will inspect the pool and
build a new config from what it finds on disk. However, it then
regenerates a new config ready to import, and so rightly sets the hostid
and hostname for the local host in the config it returns.

Because of this, the "require force" checks always decide the pool is
exported and last touched by the local host, even if this is not true,
which is possible in a HA environment when MMP is not enabled. The pool
may be imported on another head, but the import checks still pass here,
so the pool ends up imported on both.

(This doesn't happen when a cachefile isn't used, because the pool
config is discovered in userspace in zpool_find_import(), and that does
find the on-disk hostid and hostname correctly).

Since the systemd zfs-import-cache.service unit uses cachefile imports,
this can lead to a system returning after a crash with a "valid"
cachefile on disk and automatically, quietly, importing a pool that has
already been taken up by a secondary head.

This commit causes the on-disk hostid and hostname to be included in the
ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the
"force" checks for zpool import to use them if present.

This method should give no change in behaviour for old userspace on new
kernels (they won't know to look for the new config items) and for new
userspace on old kernels (they won't find the new config items).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15290
(cherry picked from commit 54b1b1d893)
2024-04-03 10:23:13 +11:00
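
A hedged sketch of the nvlist change described above; the helper is
hypothetical and the exact keys placed under ZPOOL_CONFIG_LOAD_INFO may
differ from these.

    /* Hedged sketch: surface the on-disk host identity to userspace. */
    static void
    spa_add_ondisk_host_info(nvlist_t *load_info, uint64_t ondisk_hostid,
        const char *ondisk_hostname)
    {
        fnvlist_add_uint64(load_info, ZPOOL_CONFIG_HOSTID, ondisk_hostid);
        fnvlist_add_string(load_info, ZPOOL_CONFIG_HOSTNAME, ondisk_hostname);
    }
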
Rob Norris 831fdad1b0 slack: require module param before enabling slack compression
Just an extra explicit opt-in, since the compression method it switches
to isn't upstreamed yet.
2024-04-03 10:23:13 +11:00
Rob Norris 014ff864d8 slack: fix decompression
Turns out decompression never worked at all; likely an oversight
converting the original "transparent" versions to a true compression
option. This makes it work.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris 4e9cbd04a4 spa: add zio_taskq_trylock tuneable
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris 7d74fc7a68 spa_taskq_dispatch_ent: look for an unlocked taskq before waiting
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Rob Norris c85d31a88b taskq: add taskq_try_dispatch_ent()
Non-blocking form of taskq_dispatch_ent(), returns false if it can't
acquire the taskq lock.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
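
A minimal sketch of the non-blocking form described above;
taskq_dispatch_ent_locked() is a hypothetical stand-in for the existing
dispatch body, and the real taskq lock type may differ.

    /* Hedged sketch: fail fast instead of sleeping on a contended taskq. */
    boolean_t
    taskq_try_dispatch_ent(taskq_t *tq, task_func_t func, void *arg,
        uint_t flags, taskq_ent_t *t)
    {
        if (!mutex_tryenter(&tq->tq_lock))
            return (B_FALSE);    /* caller may try another taskq or block */

        taskq_dispatch_ent_locked(tq, func, arg, flags, t);
        mutex_exit(&tq->tq_lock);
        return (B_TRUE);
    }
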
Rob Norris 2fec57bf03 trace: spa_taskqs_ent trace class, dispatch and dispatched tracepoints
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-04-03 10:23:13 +11:00
Mark Johnston 79d092987b spa: Let spa_taskq_param_get()'s addition of a newline be optional
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
(cherry picked from commit 4cf943769d3decd73b379eb069cebe0c148e3f39)
2024-04-03 10:23:13 +11:00
Mark Johnston 92063ad7d4 spa: Fix FreeBSD sysctl handlers
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl().  In particular, this code triggers an assertion
failure in sbuf_clear().

Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.

Fixes: 6930ecbb7 ("spa: make read/write queues configurable")
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
(cherry picked from commit e5230fb9d2c3670feda07b8a0759ac446749c247)
2024-04-03 10:23:13 +11:00
Rob Norris 1339b7e0ac spa: make read/write queues configurable
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there are very few ways to change behaviour without patching.

This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.

This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kinds of systems.

Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. It's expected that anyone using this is going
to have the source code there as well.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 12a031a3f5)
2024-04-03 10:23:13 +11:00
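
For illustration only, here is one generic way a load-time string
parameter can be exposed on Linux; the actual OpenZFS plumbing, and the
format of the zio_taskq_read/zio_taskq_write strings, live in
module/zfs/spa.c and are not reproduced here.

    /* Hedged illustration: generic module parameter, not the ZFS wiring. */
    #include <linux/module.h>

    static char zio_taskq_read[256];    /* taskq config string, parsed at load */
    module_param_string(zio_taskq_read, zio_taskq_read,
        sizeof (zio_taskq_read), 0444);
    MODULE_PARM_DESC(zio_taskq_read, "Configure the READ IO taskqs");
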
Rob Norris 657c1f75ad vdev_disk: don't touch vbio after it's handed off to the kernel
After IO is unplugged, it may complete immediately and vbio_completion
may be called in interrupt context. That may interrupt or deschedule our
task. If it's the last bio, the vbio will be freed. Then, we get
rescheduled, and try to write to freed memory through vbio->.

This patch just removes the cleanup and the corresponding assert.
These were leftovers from a previous iteration of vbio_submit() and were
always "belt and suspenders" ops anyway, never strictly required.

Reported-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 34f662ad22206af6852020fd923ceccd836a855f)
2024-04-03 10:22:54 +11:00
Rob Norris 9e1afd0d91 abd_iter_page: don't use compound heads on Linux <4.5
Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
in a compound page were refcounted separately. This means that using the
head page without taking a reference to it could see it cleaned up later
before we're finished with it. Specifically, bio_add_page() would take a
reference, and drop its reference after the bio completion callback
returns.

If the zio is executed immediately from the completion callback, this is
usually ok, as any data is referenced through the tail page referenced
by the ABD, and so becomes "live" that way. If there's a delay in zio
execution (high load, error injection), then the head page can be freed,
along with any dirty flags or other indicators that the underlying
memory is used. Later, when the zio completes and that memory is
accessed, it's either unmapped and an unhandled fault takes down the
entire system, or it is mapped and we end up messing around in someone
else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head
page when we use it, and release it when we're done. There's not really
a sensible way under our current structure to do this; the "best" would
be to keep a list of head page references in the ABD, and release them
when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where
head and tail pages share refcounts, I've opted to simply not use the
compound head in ABD page iteration there. This is theoretically less
efficient (though cleaning up head page references would add overhead),
but it's safe, and we still get the other benefits of not mapping pages
before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we
can match on to discover this behaviour in configure (and the mm/page
APIs have changed a lot since then anyway), so I've gone with a simple
version check.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit c6be6ce175)
2024-04-03 10:15:02 +11:00
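
A hedged sketch of the version check described above; the helper name is
illustrative.

    #include <linux/version.h>
    #include <linux/mm.h>

    /* Hedged sketch: only use the compound head where head/tail share refcounts. */
    static inline struct page *
    abd_bio_page(struct page *pg)
    {
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 5, 0)
        return (compound_head(pg));
    #else
        return (pg);    /* pre-4.5: using the head risks a use-after-free */
    #endif
    }
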
Rob Norris ce5719554f vdev_disk: use bio_chain() to submit multiple BIOs
Simplifies our code a lot, since we no longer have to wait for each BIO
and reassemble them.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 72fd834c47)
2024-04-03 10:12:51 +11:00
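
A hedged sketch of the stock bio_chain() pattern this moves to: each BIO
is chained to the next, so only the final completion fires once every
BIO is done. This is illustrative, not the vdev_disk code.

    /* Hedged sketch: bios[n-1]'s end_io runs only after all n bios complete. */
    static void
    submit_bios_chained(struct bio **bios, int n)
    {
        for (int i = 0; i < n - 1; i++)
            bio_chain(bios[i], bios[i + 1]);    /* bios[i+1] waits on bios[i] */
        for (int i = 0; i < n; i++)
            submit_bio(bios[i]);
    }
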
Rob Norris ba6d0fb060 vdev_disk: add module parameter to select BIO submission method
This makes the submission method selectable at module load time via the
`zfs_vdev_disk_classic` parameter, allowing this change to be backported
to 2.2 safely, and disabled in favour of the "classic" submission method
if new problems come up.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 2382fdb0a83a5a3c6cf3860695d3f29281773170)
2024-04-03 10:10:20 +11:00
Rob Norris 1b16b90ae1 vdev_disk: rewrite BIO filling machinery to avoid split pages
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.

### BIO segment limits are set incorrectly

The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`), and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.

Leaving aside the fact that the page count method is wrong (see below),
not limiting to the device segment max means that the device driver will
need to split the BIO in half. This alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.

### BIOs are filled inefficiently

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
It can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This also
further contributes to the problem caused by the incorrect segment limit
calculation, as it's much easier to go past the device limit and so
require a split.

Again, this is not a problem on its own.

### Incomplete pages are submitted to BIOs

The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.

In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:

     -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |        |
          :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).

---

In ideal circumstances, these shortcomings don't cause any particular
problems. It's when they start to interact with other ZFS features that
things get interesting.

### Aggregation

Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.

Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up having either a
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.

### Gang blocks

If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.

Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, it's a single 512-byte segment.

### Aggregation with gang blocks

Pulling all this together, consider a large aggregated write of gang
blocks. This results in a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.

### Configure maximum segment size for device

`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.

### ABDs checked up front for page count and alignment

We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point without creating a misaligned BIO. Along the
way, we determine how many BIO segments we'll need to handle the entire
ABD, accounting for BIO fill limits (including segment and byte limits).

If the pages in the BIO are not usable (as in any of the above
situations), the ABD is linearised, and then checked again. This is the
same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's
variable page size and allocator quirks.

In the end, a count of segments is produced, which is then used to
determine how many BIOs will be allocated.

### Virtual block IO object

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 588a6a2d38f20cd6e0d458042feda1831b302207)
2024-04-03 10:10:12 +11:00
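
As a rough illustration of the "checked up front" step described above,
a hedged sketch follows; the chunk type is hypothetical and stands in
for the (page, offset, length) runs produced by scanning the ABD.

    /* Hedged sketch: interior chunks must cover whole, page-aligned pages. */
    typedef struct page_chunk {
        struct page *pc_page;
        size_t      pc_off;
        size_t      pc_len;
    } page_chunk_t;

    static boolean_t
    chunks_can_fill_bios(const page_chunk_t *pc, int nchunks)
    {
        for (int i = 0; i < nchunks; i++) {
            boolean_t first = (i == 0);
            boolean_t last = (i == nchunks - 1);

            if (!first && pc[i].pc_off != 0)
                return (B_FALSE);    /* starts mid-page: a split could misalign */
            if (!last && (pc[i].pc_off + pc[i].pc_len) % PAGE_SIZE != 0)
                return (B_FALSE);    /* ends mid-page: a split could misalign */
        }
        return (B_TRUE);    /* safe to load as-is; caller linearises on failure */
    }
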
Rob Norris 21f6808780 vdev_disk: make read/write IO function configurable
This is just setting up for the next couple of commits, which will add a
new IO function and a parameter to select it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 7ee83696cffac172eea89844ccc5e6b6899781ac)
2024-04-03 10:10:08 +11:00
Rob Norris 1feddf1ed6 vdev_disk: reorganise vdev_disk_io_start
Light reshuffle to make it a bit more linear to read and get rid of a
bunch of args that aren't needed in all cases.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit ad847ff6acb77fbba0f3ab2e864784225fd41007)
2024-04-03 10:10:04 +11:00
Rob Norris d00ab549d4 vdev_disk: rename existing functions to vdev_classic_*
This is just renaming the existing functions we're about to replace and
grouping them together to make the next commits easier to follow.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 9bf6a7c8c3bdcc4e5975fa5baf6e9ff6f279a553)
2024-04-03 10:09:57 +11:00
Rob Norris edf4cf0ce7 abd: add page iterator
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.

This adds abd_iterate_page_func, which yields unmapped struct page
instead.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 930b785c61e9724f0a3a0e09571032ed397f368c)
2024-04-03 10:09:53 +11:00
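
A hedged sketch of a page-level callback, here counting BIO segment
candidates without mapping anything; the exact callback signature used
by abd_iterate_page_func is assumed.

    /* Hedged sketch: count page runs via the unmapped-page iterator. */
    static int
    abd_count_pages_cb(struct page *pg, size_t off, size_t len, void *priv)
    {
        (*(unsigned int *)priv)++;    /* one candidate BIO segment per run */
        return (0);                   /* non-zero would stop the iteration */
    }
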
Rob N 40b2b48fa5 Consider `dnode_t` allocations in dbuf cache size accounting
Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory accounting
appears to show. In some cases (eg a metadnode data block with all child
dnode_t slots allocated), the actual size can be as much as 3x as what
the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t. On the other
hand, in the above case of dnodes, evicting cached metadnode dbufs is
the only current way to reclaim the dnode objects, and can lead to the
situation where the dbuf cache appears to be comfortably within its
target memory window and yet is holding enormous amounts of slab memory
that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.

Metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This informs the dbuf cache machinery of that fact, allowing it to make
better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15511
(cherry picked from commit 92dc4ad83d)
2024-04-03 09:58:25 +11:00
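
A hedged sketch of how a dbuf user might report its footprint;
dmu_buf_add_user_size()/dmu_buf_sub_user_size() follow the upstream
change referenced in the cherry-pick, but treat the exact signatures as
assumptions.

    /* Hedged sketch: tell the dbuf cache about attached dnode_t allocations. */
    static void
    dnode_buf_user_size_adjust(dmu_buf_t *db, int nslots, boolean_t attach)
    {
        uint64_t size = (uint64_t)nslots * sizeof (dnode_t);

        if (attach)
            dmu_buf_add_user_size(db, size);
        else
            dmu_buf_sub_user_size(db, size);
    }
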
Rob N 66586d604a dnode_is_dirty: check dnode and its data for dirtiness
Over its history, the dirty dnode test has been changed between
checking for a dnode being on `os_dirty_dnodes` (`dn_dirty_link`) and
`dn_dirty_record`.

  de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency
  2531ce372 Revert "Report holes when there are only metadata changes"
  ec4f9b8f3 Report holes when there are only metadata changes
  454365bba Fix dirty check in dmu_offset_next()
  66aca2473 SEEK_HOLE should not block on txg_wait_synced()

Also illumos/illumos-gate@c543ec060d illumos/illumos-gate@2bcf0248e9

It turns out both are actually required.

In the case of appending data to a newly created file, the dnode proper
is dirtied (at least to change the blocksize) and dirty records are
added.  Thus, a single logical operation is represented by separate
dirty indicators, and must not be separated.

The incorrect dirty check becomes a problem when the first block of a
file is being appended to while another process is calling lseek to skip
holes. There is a small window where the dnode part is undirtied while
there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)`
would not know that the file is dirty, and would go to
`dnode_next_offset()`. Since the object has no data blocks yet, it
returns `ESRCH`, indicating no data found, which results in `ENXIO`
being returned to `lseek()`'s caller.

Since coreutils 9.2, `cp` performs sparse copies by default, that is, it
uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to
replicate the holes in the target. When it hits the bug, its initial
search for data fails, and it goes on to call `fallocate()` to create a
hole over the entire destination file.

This has come up more recently as users upgrade their systems, getting
OpenZFS 2.2 as well as a newer coreutils. However, this problem has been
reproduced against 2.1, as well as on FreeBSD 13 and 14.

This change simply updates the dirty check to check both types of dirty.
If there's anything dirty at all, we immediately go to the "wait for
sync" stage, It doesn't really matter after that; both changes are on
disk, so the dirty fields should be correct.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15571
Closes #15526
(cherry picked from commit 30d581121b)
2023-11-29 22:19:29 +00:00
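
Per the description above, the corrected check looks along these lines
(a sketch; field names follow the commit text and may differ slightly
from the tree):

    boolean_t
    dnode_is_dirty(dnode_t *dn)
    {
        mutex_enter(&dn->dn_mtx);
        for (int i = 0; i < TXG_SIZE; i++) {
            /* dirty if the dnode is on a dirty list OR has dirty records */
            if (multilist_link_active(&dn->dn_dirty_link[i]) ||
                !list_is_empty(&dn->dn_dirty_records[i])) {
                mutex_exit(&dn->dn_mtx);
                return (B_TRUE);
            }
        }
        mutex_exit(&dn->dn_mtx);
        return (B_FALSE);
    }
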
Alexander Motin 976c01223a Improve log spacemap load time
The previous flushing algorithm limited only the total number of log
blocks to the minimum of 256K and 4x the number of metaslabs in the
pool.  As a result, a system with 1500 disks, each with 1000 metaslabs,
touching several new metaslabs each TXG, could grow the spacemap log to
a huge size without much benefit.  We've observed one such system take
about 45 minutes to import a pool.

This patch improves the situation in five ways:
 - By limiting the maximum period for each metaslab to be flushed to 1000
TXGs, which effectively limits the maximum number of per-TXG spacemap
logs to load to the same number.
 - By making flushing smoother, accounting the number of metaslabs that
were touched after the last flush and actually need another flush, not
just an ms_unflushed_txg bump.
 - By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
 - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making the log spacemap load process for wide HDD pools
CPU-bound, accelerating it by many times.
 - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
the inherently single-threaded log processing time from ~10 to ~5
minutes.

As a further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change requiring a new pool feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12789
(cherry picked from commit cbfe5cb849518dd8fb65bf94a72fd88a15093a67)
2023-10-23 21:41:18 +00:00
Alexander Motin fc1bd532b7 Add more control/visibility to spa_load_verify().
Use error thresholds from the policy to control whether to scrub data
and/or metadata.  If a threshold is set to UINT64_MAX, then the caller
probably does not care about the result and we may skip that part.

By default, import neither sets the data error threshold nor reads
the error counter, so skip the data scrub for a faster import.
Metadata are still scrubbed, and the import fails if even a single
error is found.

While here, just for symmetry, return the number of metadata errors in
case the threshold is not set to zero and we haven't reached it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13022
(cherry picked from commit f2c5bc150e)
2023-09-27 20:30:22 +00:00
Rob Norris dd5e7171dc zil: only flush leaf vdevs that were actually written to
Enables vdev traces for ZIL writes, and then only issues flushes to
things that were written to.

This simplifies a few things. We no longer have to extract the toplevel
vdevs to flush from the block pointer; instead we just look at what was
written. The vdev tree remains as a means to defer flushes to the next
lwb, which means a bit more copying of trees, but also means we no longer
have to lock the tree.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
Rob Norris ca18c876e5 zio: function to issue flushes by trace tree
If you have a trace tree from, say, a write, hand it directly to
zio_flush_traced() to flush only the leaf vdevs that were involved in
the write.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
Rob Norris 4b07fb9a98 zio: expose trace node alloc/free/compare
Meant for external callers to be able to build trace trees that can
later be submitted back to zio for work. It's hardly necessary, but it
saves needing to double up on the kmem cache and comparison function.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
Rob Norris efeeeec2a4 zio: add vdev tracing machinery
The idea here is that you can add a flag to a zio, and every vdev that
contributed to the successful completion of that zio will be referenced
on the "trace tree". You can poke around in here from your _done handler
to do any per-vdev followup work.

The actual use case is to track the vdevs that were actually written to,
in order to have a list of vdevs that we should flush. That's why it
looks like the ZIL vdev flush tracker - the only difference is that it
will also list interior and leaf vdevs, not just toplevel vdevs.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:12 +00:00
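
A hedged sketch of what a trace-tree node and its comparator might look
like; the type and field names here are assumptions, not necessarily
what the patch uses.

    /* Hedged sketch: one AVL node per contributing vdev, keyed by guid. */
    typedef struct zio_vdev_trace {
        uint64_t    zvt_guid;    /* guid of a vdev that completed part of the zio */
        avl_node_t  zvt_node;
    } zio_vdev_trace_t;

    static int
    zio_vdev_trace_compare(const void *a, const void *b)
    {
        const zio_vdev_trace_t *za = a;
        const zio_vdev_trace_t *zb = b;

        return (TREE_CMP(za->zvt_guid, zb->zvt_guid));
    }
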
Richard Yao 609550bc5f Convert enum zio_flag to uint64_t
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086
2023-09-07 18:54:12 +00:00
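
In sketch form, the change amounts to widening the type and defining the
flags as 64-bit shifts (only the first couple of existing flag names are
shown):

    typedef uint64_t zio_flag_t;    /* was enum zio_flag, capped at 32 bits */

    #define ZIO_FLAG_DONT_AGGREGATE    (1ULL << 0)
    #define ZIO_FLAG_IO_REPAIR         (1ULL << 1)
    /* ... remaining flags continue as (1ULL << n), with room beyond 32 ... */
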
Rob Norris 1a160e3b08 zio: refactor zio_flush()
zio_flush() is the only user of zio_ioctl(), and zio_ioctl()'s structure
and flag use is fairly specific to flushing. So here we bring the guts of
zio_ioctl() into zio_flush(), allowing some light reorganising (mostly
around how zio_nowait() is called) and a better signature.

This will help in the future when changing the way flush works, as it's
clear where the change should be made, with no wondering whether
zio_ioctl() is being used somewhere else.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:11 +00:00
Rob Norris 97b5e0bbbd zil_fail: handle failure when previous failure not cleaned up yet
After the ZIL is reopened, itxs created before the failure are held on
the failure itxg until the cleaner thread comes through and cleans them
up by calling zil_clean(). That's an asynchronous job, so it may not run
immediately.

Previously, if the ZIL failed again while there were still itxs on the
fail list, it would trip assertions in debug mode, and in production,
the itx list would be leaked and the previous outstanding fsync() calls
would be lost.

This commit makes it so that if the ZIL fails before the previous
failure has been cleaned up, it will be immediately cleaned up before
being filled with currently outstanding itxs.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:11 +00:00
Rob Norris 27e39dd59a zil_fail: handle the case where no itxs are found
This is possible if spa_sync() completes before the ZIL write/flush
does, which then errors. At this point all itxs are in the past, leaving
us with nothing to wait for.

In a perfect world we would not fail the ZIL, but at this point we've
already locked out the itxgs early, so we have to see it through.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2023-09-07 18:54:11 +00:00
Rob Norris 5b98ced63c zil: refactor zil_commit_waiter()
The previous change added a check to fall back to waiting forever if the
ZIL failed. This check was inverted; it actually caused it to always
enter a timed wait when it failed. When combined with the fact that the
last lwb issued likely would have failed quickly and so had a very small
latency, this caused effectively an infinite loop.

I initially fixed the check, but on further study I decided that this
loop doesn't need to exist. The way the whole logic falls out of the
original code in 2.1.5 is that if the lwb is OPENED, wait then issue it,
and if not (or post issue), wait forever. The loop will never see more
than two iterations, one for each half of the OPENED check, and it will
stop as soon as the waiter is signaled (zcw_done true), so it can be far
more simply expressed as a linear sequence:

    if (!issued) {
        wait a few
        if (done)
            return
        issue IO
    }
    if (!done)
        wait forever

This still holds when the ZIL fails, because zil_commit_waiter_timeout()
will check for failure under zl_issuer_lock, which zil_fail() will wait
for, and in turn, zil_fail() will wait on zcw_lock and then signal the
waiter before it releases zl_issuer_lock. Taken together, that means
that zil_commit_waiter_timeout() will do all it can under the
circumstances, and waiting forever for the waiter to complete is all we
can do past that point.

(cherry picked from commit c57c2ddd6f803f429da1e2b53abab277d781a5a3)
2023-08-10 19:10:59 +00:00
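
Rendered as a hedged C sketch (locking detail and error paths elided;
the zcw/lwb field names follow the existing code), the linear sequence
above is roughly:

    static void
    zil_commit_waiter_sketch(zilog_t *zilog, zil_commit_waiter_t *zcw)
    {
        mutex_enter(&zcw->zcw_lock);

        lwb_t *lwb = zcw->zcw_lwb;
        if (lwb != NULL && lwb->lwb_state == LWB_STATE_OPENED) {
            /* "wait a few"; on timeout this path issues the lwb itself */
            zil_commit_waiter_timeout(zilog, zcw);
        }

        /* issued (or past issue): wait forever until signalled */
        while (!zcw->zcw_done)
            cv_wait(&zcw->zcw_cv, &zcw->zcw_lock);

        mutex_exit(&zcw->zcw_lock);
    }
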
Rob Norris d6d343bb87 Revert debug code from bio-seg-limit
This patch reverts parts of bf08bc108dce6f1ecd0820c8b5a67b6fb7962c7e.
2023-08-10 19:10:59 +00:00
Rob Norris bf18873541 bio-seg-limit debug
Various bits of output for catching broken bios.

(cherry picked from commit b1a5bc49acce3cbec56f3bf0638539f836aa2208)
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-31 15:05:56 +00:00
Rob Norris 74e8091130 abd_bio_map_off: avoid splitting scatter pages across bios
This is the same change as the previous commit, but for scatter abds.

It's less clear whether this change is needed. Since scatter abds are
only ever added a page at a time, both sides of the split should always be
added in consecutive segments.

Intuitively though, it may be possible for a partially-filled bio to be
used, or a bio with an odd number of iovecs, and that then could lead to
a misaligned bio. While I've not been able to reproduce this at all, it
seems to make sense to protect against it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cbdf21fd1a32a5e696a22cad497d9211221fa309)
2023-07-31 15:05:56 +00:00
Rob Norris 1bd475d6c6 bio_map: avoid splitting abd pages across bios
If we encounter a split page, we add two iovecs to the bio, one for the
fragment of the buffer on each side of the split. In order to do this
safely, we must be sure that we always have room for both fragments.

It's possible for a linear abd to have multiple pages, in which case we
want to add the "left" fragment, then a run of proper 4K pages, then the
"right" fragment. In this way we can keep whole pages together as much
as possible.

This change handles both cases by noticing a split page. If we don't
have at least two iovecs remaining in the bio, then we abort outright
(allowing the caller to allocate a new bio and try again). We add the
"left" fragment, and note how big we expect the right fragment to be.
Then we load in as many full pages as are available.

When we reach the last iovec, we close out the bio by taking as much as
is necessary to restore alignment.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 173cafcc3d8b6c94c61844c705d7a410f412a18e)
2023-07-31 15:05:56 +00:00
Rob Norris 98bcc390e8 vdev_disk: rework bio max segment calculation
A single "page" in an ABD does not necessarily correspond to one segment
in a bio, because of how ZFS does ABD allocations and how it breaks them
up when adding them to a bio. Because of this, simply dividing the ABD
size by the page size can only ever give a minimum number of segments
required, rather than the correct number.

Until we can fix that, we'll just make each bio as large as it can be,
with room for as many segments as the device queue will permit without
needing to split the bio. This is a little wasteful if we don't intend to
put that many segments in the bio, but it's not a lot of memory and it's
only lost until the bio is completed.

This also adds a tuneable, vdev_disk_max_segs, to allow this value to
be set by the operator. This is very useful for debugging.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit a3a438d1bedb0626417cd73ba10b1479a06bef7f)
2023-07-31 15:05:56 +00:00
Rob Norris f615207ee6 metaslab: add tuneables to better control when to force ganging
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit cf152e9d4da8fc82c940355ba444de915a00d754)
2023-07-31 15:05:56 +00:00
Brian Behlendorf 564e07e5c8 Silence -Winfinite-recursion warning in luaD_throw()
This code should be kept in line with the upstream lua version as much
as possible.  Therefore, we simply want to silence the warning.  This
check was enabled by default as part of -Wall in gcc 12.1.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13528
Closes #13575
(cherry picked from commit 4d0c1f14e77cf83d06de7c730de7f93f8a85c2eb)
2023-07-31 15:05:56 +00:00
Rob Norris 7121e678a2 zil_fail: recheck ZIL fail state before issuing IO
The ZIL may be failed any time we don't have the issuer lock, which
means we can end up working through zil_commit_impl() even when the ZIL
already failed.

So, each time we gain the issuer lock, recheck the fail state, and do
not issue IO if it has failed. The waiter will eventually go to sleep,
waiting to be signalled when the ZIL reopens in zil_sync()->zil_clean().

(cherry picked from commit 49fa92c8db389f257a15029d643fb026fa5b6dc2)
2023-07-31 15:05:56 +00:00
Rob Norris 88bb9a3add zil_fail: set fail state as early as possible
zil_failed() is called unlocked, and itxg_failed is only checked with
itxg_lock held. This was making the ZIL appear to be not failed even as
zil_fail() was in progress, scanning the itx lists. With zil_failed()
returning false, zil_commit() would continue to zil_commit_impl() and
also start processing itx lists, racing each other for locks.

So instead, set the fail state as early as possible, before we start
processing the itx lists. This won't stop new itxs arriving on the itxgs
proper, but it will avoid additional commit itxs being created and will
stop any attempts to collect and commit them.

(cherry picked from commit 17579a79a2b481e746879d5a033626754931441e)
2023-07-31 15:05:56 +00:00
Rob Norris 43f45f8df0 zil_fail: don't take the namespace lock
Early on I thought it would be necessary to stop the world making
progress under us, and taking the namespace lock was my initial idea of
how to do that. The right way is a bit more nuanced than that, but as it
turns out, we don't even need it.

To fail the ZIL is effectively to stop it in its tracks and hold onto
all itxs stored within until the operations they represent are
committed to the pool by some other means (ie the regular txg sync).

It doesn't matter if the pool makes progress while we're doing this. If
the pool does progress, then zil_clean() will be called to process any
itxs now completed. That means taking the itxg_lock, processing and
destroying the itxs, and releasing the lock, leaving the itxg empty.

If zil_fail() is running at the same time, then either zil_clean() will
have cleaned up the itxg and zil_fail() will find it empty, or
zil_fail() will get there first and empty it onto the fail itxg.

(cherry picked from commit 83ce694898f5a89bd382dda0ba09bb8a04ac5666)
2023-07-31 15:05:56 +00:00
Rob Norris 3cb140863f zil_fail: fix infinite loop in commit itx search
(cherry picked from commit ba1f888858f8599e10211aa9957d942e7bcc36ce)
2023-07-31 15:05:56 +00:00
Rob N 6c36d72e71 vdev: expose zfs_vdev_max_ms_shift as a module parameter
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719

(cherry picked from commit ff73574cd8)

The type of the sysctls had to be changed from uint_t to int to match
other sysctls in OpenZFS 2.1.5.
2023-07-31 15:05:56 +00:00
Fred Weigel b037880efb JSON spares state readout
Spares can be in multiple pools, even if in use. This means that the
AVAIL/INUSE status check is a bit tricky. spa_add_spares does not
need to be called, but we do need to do the equivalent, which is
now done directly.
2023-07-05 13:27:31 +00:00
Fred Weigel f7574ccff3 Update to json output.
Adds the missing proper state output for spare devices. Everything else
should be OK.
2023-07-05 13:27:31 +00:00
Rob Norris 802c258fc1 compress: add "slack" compression options
Signed-off-by: Allan Jude <allan@klarasystems.com>
2023-07-05 13:27:31 +00:00