Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Rich Ercolani	6385f4e70e	Fix unfortunate NULL in spa_update_dspace After `1325434b`, we can in certain circumstances end up calling spa_update_dspace with vd->vdev_mg NULL, which ends poorly during vdev removal. So let's not do that further space adjustment when we can't. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12380 Closes #12428	2021-09-14 12:41:10 -07:00
Brian Behlendorf	2f073cc9c6	Linux 5.14 compat: blk_alloc_disk() In Linux 5.14, blk_alloc_queue is no longer exported, and its usage has been superseded by blk_alloc_disk, which returns a gendisk struct from which we can still retrieve the struct request_queue* that is needed in the one place where it is used. This also replaces the call to alloc_disk(minors), and minors is now set via struct member assignment. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12362 Closes #12409	2021-09-14 12:40:45 -07:00
Alexander Motin	93e11e257b	FreeBSD: Ignore make_dev_s() errors Since errors returned by zvol_create_minor_impl() are ignored by the common code, it is more convenient to ignore make_dev_s() errors there. It allows, for example, to get device created for the zvol after later rename instead of having it further stuck in half-created state. zvol_rename_minor() already ignores those errors. While there, switch from MAXPHYS to maxphys in FreeBSD 13+. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12375	2021-09-14 12:40:45 -07:00
Alexander Motin	32c0b6468c	Optimize allocation throttling Remove mc_lock use from metaslab_class_throttle_(). The math there is based on refcounts and so atomic, so the only race possible there is between zfs_refcount_count() and zfs_refcount_add(). But in most cases metaslab_class_throttle_reserve() is called with the allocator lock held, which covers the race. In cases where the lock is not held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not use zfs_refcount_count(). And even if we assume some other non-existing scenario, the worst that may happen from this race is few more I/Os get to allocation earlier, that is not a problem. Move locks and data of different allocators into different cache lines to avoid false sharing. Group spa_alloc_ arrays together into single array of aligned struct spa_alloc spa_allocs. Align struct metaslab_class_allocator. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12314	2021-09-14 12:40:15 -07:00
Alexander Motin	6a49948c73	Minor ARC optimizations Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros of real state variables addresses (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special handling in inner loop of ARC reclamation. Respectively change bytes argument of arc_evict_state() from int64_t to uint64_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12348	2021-09-14 12:39:48 -07:00
Jorgen Lundman	4dfb698aac	dmu_redact.c does not call bqueue_destroy Ensure all calls to bqueue_init() has a corresponding call to bqueue_destroy() Reviewed-by: Paul Dagnelie <pcd@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12118	2021-09-14 12:39:48 -07:00
Alexander	4affa09f3e	A few fixes of callback typecasting (for the upcoming ClangCFI) * zio: avoid callback typecasting * zil: avoid zil_itxg_clean() callback typecasting * zpl: decouple zpl_readpage() into two separate callbacks * nvpair: explicitly declare callbacks for xdr_array() * linux/zfs_nvops: don't use external iput() as a callback * zcp_synctask: don't use fnvlist_free() as a callback * zvol: don't use ops->zv_free() as a callback for taskq_dispatch() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12260	2021-09-14 12:39:48 -07:00
Ryan Moeller	0ca9558561	Remove unused fields from zvol_task_t We don't use or need the pool name or value source in the zvol tasks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12361	2021-09-14 12:39:17 -07:00
Alexander Motin	c2c4d05700	FreeBSD: Switch from MAXPHYS to maxphys on FreeBSD 13+ Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12378	2021-09-14 12:39:17 -07:00
Alexander Motin	ba76bb30a6	Introduce dsl_dir_diduse_transfer_space() Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12300	2021-09-14 12:38:51 -07:00
Rich Ercolani	960a5a557b	Tinker with slop space accounting with dedup * Tinker with slop space accounting with dedup Do not include the deduplicated space usage in the slop space reservation, it leads to surprising outcomes. * Update spa_dedup_dspace sometimes Sometimes, we get into spa_get_slop_space() with spa_dedup_dspace=~0ULL, AKA "unset", while spa_dspace is correctly set. So call the code to update it before we use it if we hit that case. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12271	2021-09-14 12:38:05 -07:00
Alexander Motin	45305a067f	Fix ARC ghost states eviction accounting arc_evict_hdr() returns number of evicted bytes in scope of specific state. For ghost states it does not mean the amount of really freed memory, but the logical buffer size. It is correct for the eviction process, but not for waking up threads waiting for ARC size reduction, as added in "Revise ARC shrinker algorithm" commit, causing premature wakeups while ARC is still overflowed, allowing even bigger overflow, plus processing overhead when next allocation will also get blocked, probably also for too short time. To fix that make arc_evict_hdr() also return the amount of really freed memory, which for the ghost states is only the header, and use it to update arc_evict_count instead. Originally I was thinking to not return it at all, since arc_get_data_impl() does not account for the headers, but decided that some slow allocation progress is better than long waits, reaching on my tests up to 100ms. To reduce negative latency effects of long time periods when reclaim thread can free little real memory, start reclamation process earlier, before we actually reached the overflow threshold, when we have to throttle new allocations. We can also do it without taking global arc_evict_lock, reducing the contention. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12279	2021-09-14 12:38:05 -07:00
George Wilson	8415c3c170	file reference counts can get corrupted Callers of zfs_file_get and zfs_file_put can corrupt the reference counts for the file structure resulting in a panic or a soft lockup. When zfs send/recv runs, it will add a reference count to the open file, and begin to send or recv the stream. If the file descriptor is closed, then when dmu_recv_stream() or dmu_send() return we will call zfs_file_put to remove the reference we placed on the file structure. Unfortunately, because zfs_file_put() uses the file descriptor to lookup the file structure, it may end up finding that the file descriptor table no longer contains the file struct, thus leaking the file structure. Or it might end up finding a file descriptor for a different file and blindly updating its reference counts. Other failure modes probably exists. This change reworks the zfs_file_[get\|put] interface to not rely on the file descriptor but instead pass the zfs_file_t pointer around. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-76119 Closes #12299	2021-09-14 12:37:38 -07:00
Alexander Motin	c84670950a	FreeBSD: Use unmapped I/O for scattered/gang ABD buffers Many FreeBSD disk drivers support "unmapped" I/O mode, when data buffer represented not with a virtually contiguous KVA-mapped address range, but with a list of physical memory pages. Originally it was designed to do I/O from buffers without KVA mapping (unmapped). But moving virtual addresses out of equation allows us to operate even non-contiguous data buffers with one condition: all buffer discon- tinuities must be aligned to memory page borders. Doing I/O to capable GEOM device this patch traverses through non- linear ABD buffers, validating the chunks borders. If the condition is met, it supplies GEOM with the list of original physical memory pages instead of copying the data into temporary contiguous buffer. On capable hardware on pools with ashift=12 and default ABD chunk of 4KB it should handle all the I/O without additional memory copying. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12320	2021-09-14 12:37:02 -07:00
Alexander Motin	49bb454120	FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE It makes no sense to set it below PAGE_SIZE, since it increases all overheads and makes returning memory to OS problematic. It makes no sense to set it above PAGE_SIZE, since such allocations and especially frees are too expensive and cause KVA fragmentation to benefit from fewer chunks. After that it makes no sense to keep more complicated math here. What may have sense though is just a tunable border between linear and scatter ABDs, previously also controlled by this tunable. Retain that functionality by taking abd_scatter_min_size tunable from Linux, just with different default value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12328	2021-09-14 12:36:44 -07:00
Alexander Motin	41b33dce44	Move gethrtime() calls out of vdev queue lock This dramatically reduces the lock contention on systems with slower (non-TSC) timecounters. With TSC the difference is minimal, but since this lock is pretty congested, any improvement counts. Plus I don't see any reason to do it under the lock other than the latency of the lock itself, which this change actually reduces. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12281	2021-09-14 12:35:53 -07:00
Alexander Motin	15177c1aac	Compact dbuf/buf hashes and lock arrays With default dbuf cache size of 1/32 of ARC, it makes no sense to have hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming higher I/O rate via dbuf cache than via ARC. Remove padding from ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would have sense if there would be a limited number of very busy locks. But since we have no limit on the number, using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level. Reduce number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but reduced memory size improves cache hit rate for mutex_tryenter() in ARC eviction thread, saving about 1% of the thread time. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12289	2021-09-14 12:22:46 -07:00
Jorgen Lundman	035219ee10	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves perofrmance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-09-14 12:22:28 -07:00
Jorgen Lundman	2334bc4efa	Upstream: dmu_zfetch_stream_fini leaks refcount dmu_zfetch_stream_fini() is missing calls to destroy the refcounts, leaking them and the mutex inside. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12294	2021-09-14 12:21:55 -07:00
Alexander Motin	f3969ea78b	Optimize small random numbers generation In all places except two spa_get_random() is used for small values, and the consumers do not require well seeded high quality values. Switch those two exceptions directly to random_get_pseudo_bytes() and optimize spa_get_random(), renaming it to random_in_range(), since it is not related to SPA or ZFS in general. On FreeBSD directly map random_in_range() to new prng32_bounded() KPI added in FreeBSD 13. On Linux and in user-space just reduce the type used to uint32_t to avoid more expensive 64bit division. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12183	2021-09-14 12:10:17 -07:00
Ryan Moeller	6fe6192796	FreeBSD: Implement xattr=sa FreeBSD historically has not cared about the xattr property; it was always treated as xattr=on. With xattr=on, xattrs are stored as files in a hidden xattr directory. With xattr=sa, xattrs are stored as system attributes and get cached in nvlists during xattr operations. This makes SA xattrs simpler and more efficient to manipulate. FreeBSD needs to implement the SA xattr operations for feature parity with Linux and to ensure that SA xattrs are accessible when migrated or replicated from Linux. Following the example set by Linux, refactor our existing extattr vnops to split off the parts handling dir style xattrs, and add the corresponding SA handling parts. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11997	2021-09-14 12:09:35 -07:00
Ryan Moeller	1826068523	FreeBSD: Clean up ASSERT/VERIFY use in module Convert use of ASSERT() to ASSERT0(), ASSERT3U(), ASSERT3S(), ASSERT3P(), and likewise for VERIFY(). In some cases it ended up making more sense to change the code, such as VERIFY on nvlist operations that I have converted to use fnvlist instead. In one place I changed an internal struct member from int to boolean_t to match its use. Some asserts that combined multiple checks with && in a single assert have been split to separate asserts, to make it apparent which check fails. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11971	2021-09-14 12:02:23 -07:00
Alexander Motin	ea47857090	Avoid 64bit division in multilist index functions The number of sublists in a multilist is relatively small. We dont need 64 bits to calculate an index. 32 bits is sufficient and makes the code more efficient. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12288	2021-06-29 13:15:04 -07:00
Rich Ercolani	57ce66d293	Fix build with KASAN The stock zstd code expects some helpers from ASAN if present. This works fine in userland, but in kernel, KASAN also gets detected, and lacks those helpers. So let's make some empty substitutes for that case. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12232	2021-06-29 13:14:42 -07:00
Alexander Motin	da6c288dfc	Help compiller optimize out abd_verify() While abd_verify() does nothing when built without debug, compiler can't optimize it out by itself due to calls to external list_*() and abd_verify_scatter(). This commit makes it explicit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12280	2021-06-29 13:14:20 -07:00
Brian Behlendorf	aee26af277	Update cache file when setting compatibility property Unlike most other properties the 'compatibility' property is stored in the pool config object and not the DMU_OT_POOL_PROPS object. This had the advantage that the compatibility information is available without needing to fully import the pool (it can be read with zdb). However, this means we need to make sure to update both the copy of the config in the MOS and the cache file. This wasn't being done. This commit adds a call to spa_async_request() to ensure the copy of the config in the cache file gets updated as well as the one stored in the pool. This same change is made for the 'comment' property which suffers from the same inconsistency. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12261 Closes #12276	2021-06-24 14:33:51 -07:00
jumbi77	5d7f47d828	zfs_metaslab_mem_limit should be 25 instead of 75 According to current zfs man page zfs_metaslab_mem_limit should be 25 instead of 75. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: jumbi77@users.noreply.github.com Closes #12273	2021-06-24 13:13:56 -07:00
Attila Fülöp	088712793e	gcc 11 cleanup Compiling with gcc 11.1.0 produces three new warnings. Change the code slightly to avoid them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #12130 Closes #12188 Closes #12237	2021-06-24 13:13:40 -07:00
Rich Ercolani	5e89181544	Annotated dprintf as printf-like ZFS loves using %llu for uint64_t, but that requires a cast to not be noisy - which is even done in many, though not all, places. Also a couple places used %u for uint64_t, which were promoted to %llu. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12233	2021-06-24 13:12:36 -07:00
Antonio Russo	dfbda2465f	Revert Consolidate arc_buf allocation checks This reverts commit `13fac09868`. Per the discussion in #11531, the reverted commit---which intended only to be a cleanup commit---introduced a subtle, unintended change in behavior. Care was taken to partially revert and then reapply `10b3c7f5e4` which would otherwise have caused a conflict. These changes were squashed in to this commit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Suggested-by: @chrisrd Suggested-by: robn@despairlabs.com Signed-off-by: Antonio Russo <aerusso@aerusso.net> Closes #11531 Closes #12227	2021-06-24 13:12:25 -07:00
Alexander Motin	6b239d1757	Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172 ) wmsum was designed exactly for cases like these with many updates and rare reads. It allows to completely avoid atomic operations on congested global variables. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12172	2021-06-24 13:10:59 -07:00
George Amanakis	3de7aeb68a	Avoid deadlock when removing L2ARC devices under I/O In case we have I/O and try to remove an L2ARC device a deadlock might occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for SCL_VDEV to be dropped while holding the hash_lock. However, spa_l2cache_load() holds SCL_ALL and waits for the hash_lock in l2arc_evict(). Fix this by moving zfs_blkptr_verify() to the top top arc_read() before the hash_lock is taken. Verify the block pointer and return a checksum error if damaged rather than halting the system, by using BLK_VERIFY_LOG instead of BLK_VERIFY_HALT. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12054	2021-06-17 13:35:29 -07:00
Matthew Ahrens	7e2212990b	vdev_draid_min_asize() ignores reserved space vdev_draid_min_asize() returns the minimum size of a child vdev. This is used when determining if a disk is big enough to replace a child. It's also used by zdb to determine how big of a child to make to test replacement. vdev_draid_min_asize() says that the child’s asize has to be at least 1/Nth of the entire draid’s asize, which is the same logic as raidz. However, this contradicts the code in vdev_draid_open(), which calculates the draid’s asize based on a reduced child size: An additional 32MB of scratch space is reserved at the end of each child for use by the dRAID expansion feature So the problem is that you can replace a draid disk with one that’s vdev_draid_min_asize(), but it actually needs to be larger to accommodate the additional 32MB. The replacement is allowed and everything works at first (since the reserved space is at the end, and we don’t try to use it yet), but when you try to close and reopen the pool, vdev_draid_open() calculates a smaller asize for the draid, because of the smaller leaf, which is not allowed. I think the confusion is that vdev_draid_min_asize() is correctly returning the amount of required allocatable space in a leaf, but the actual size of the leaf needs to be at least 32MB more than that. ztest_vdev_attach_detach() assumes that it can attach that size of device, and it actually can (the kernel/libzpool accepts it), but it then later causes zdb to not be able to open the pool. This commit changes vdev_draid_min_asize() to return the required size of the leaf, not the size that draid will make available to the metaslab allocator. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11459 Closes #12221	2021-06-15 16:56:42 -07:00
Paul Zuchowski	bd83c1e0c6	Do not hash unlinked inodes In zfs_znode_alloc we always hash inodes. If the znode is unlinked, we do not need to hash it. This fixes the problem where zfs_suspend_fs is doing zrele (iput) in an async fashion, and zfs_resume_fs unlinked drain processing will try to hash an inode that could still be hashed, resulting in a panic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9741 Closes #11223 Closes #11648 Closes #12210	2021-06-15 16:56:19 -07:00
Alexander Motin	57196f8ae9	Re-embed multilist_t storage This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcount's to different cache lines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-by: iXsystems, Inc. Closes #12158	2021-06-10 10:50:16 -07:00
Alexander Motin	efdfb14fc8	Remove pool io kstats This mostly reverts "3537 want pool io kstats" commit of 8 years ago. From one side this code using pool-wide locks became pretty bad for performance, creating significant lock contention in I/O pipeline. From another, there are more efficient ways now to obtain detailed statistics, while this statistics is illumos-specific and much less usable on Linux and FreeBSD, reported only via procfs/sysctls. This commit does not remove KSTAT_TYPE_IO implementation, that may be removed later together with already unused KSTAT_TYPE_INTR and KSTAT_TYPE_TIMER. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12212	2021-06-10 10:50:16 -07:00
Alan Somers	2c3d7283b4	libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat `getfsstat(2)` is used to retrieve the list of mounted file systems, which libzfs uses when fetching properties like mountpoint, atime, setuid, etc. The `mode` parameter may be `MNT_NOWAIT`, which uses information in the VFS's cache, or `MNT_WAIT`, which effectively does a `statfs` on every single mounted file system in order to fetch the most up-to-date information. As far as I can tell, the only fields that libzfs cares about are the filesystem's name, mountpoint, fstypename, and mount flags. Those things are always updated on mount and unmount, so they will always be accurate in the VFS's mount cache except in two circumstances: 1) When a file system is busy unmounting 2) When a ZFS file system changes the value of a mount-overridable property like atime or setuid, but doesn't remount the file system. Right now that only happens when the property is changed by an unprivileged user who has delegated authority to change the property but not to mount the dataset. But perhaps libzfs could choose to do it for other reasons in the future. Switching to `MNT_NOWAIT` will greatly improve speed with no downside, as long as we explicitly update the mount cache whenever we change a mount-overridable property. For comparison, Illumos gets this information using the native `getmntany` and `getmntent` functions, which also use cached information. The illumos function that would refresh the cache, `resetmnttab`, is never called by libzfs. And on GNU/Linux, `getmntany` and `getmntent` don't even communicate with the kernel directly. They simply parse the file they are given, which is usually /etc/mtab or /proc/mounts. Perhaps the implementation of /proc/mounts is synchronous, ala MNT_WAIT; I don't know. Sponsored-by: Axcient Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes: #12091	2021-06-09 13:05:34 -07:00
Serapheim Dimitropoulos	a377bde727	Livelist logic should handle dedup blkptrs Update the logic to handle the dedup-case of consecutive FREEs in the livelist code. The logic still ensures that all the FREE entries are matched up with a respective ALLOC by keeping a refcount for each FREE blkptr that we encounter and ensuring that this refcount gets to zero by the time we are done processing the livelist. zdb -y no longer panics when encountering double frees Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #11480 Closes #12177	2021-06-09 13:05:34 -07:00
Alexander Motin	e76373de7b	More aggsum optimizations - Avoid atomic_add() when updating as_lower_bound/as_upper_bound. Previous code was excessively strong on 64bit systems while not strong enough on 32bit ones. Instead introduce and use real atomic_load() and atomic_store() operations, just an assignments on 64bit machines, but using proper atomics on 32bit ones to avoid torn reads/writes. - Reduce number of buckets on large systems. Extra buckets not as much improve add speed, as hurt reads. Unlike wmsum for aggsum reads are still important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12145	2021-06-09 13:05:34 -07:00
jharmening	d3dddbaa20	FreeBSD: incorporate changes to the VFS_QUOTACTL(9) KPI VFS_QUOTACTL(9) has been updated to allow each filesystem to indicate whether it has changed the busy state of the mount. The filesystem may still assume that its .vfs_quotactl entrypoint is always called with the mount busied, but only needs to unbusy the mount (and clear *mp_busy) if it does something that actually requires the mount to be unbusied. It no longer needs to blindly copy-paste the UFS protocol for calling vfs_unbusy(9) for the Q_QUOTAOFF and Q_QUOTAON commands. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jason Harmening <jason.harmening@gmail.com> Closes #12052	2021-06-09 13:05:34 -07:00
Brian Behlendorf	af80160f48	Linux: Set spl_kmem_cache_slab_limit when page size !4K For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects. This logic was effectively disabled for all architectures using a non-4K page size which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per-slab does not take in to account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64k). To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to 16K for all architectures. This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12152 Closes #11429 Closes #11574 Closes #12150	2021-06-09 13:05:34 -07:00
наб	2fe8060cee	spl-module-parameters.5: remove spl_kmem_cache_{expire,obj_per_slab_min} Both were removed in `4fbdb10c7b` ("remove kmem_cache module parameter KMC_EXPIRE_AGE") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12157	2021-06-09 13:05:34 -07:00
Rich Ercolani	a0c055cfd3	Remove iov_iter_advance() for iter_write The additional iter advance is incorrect, as copy_from_iter() has already done the right thing. This will result in the following warning being printed to the console as of the 5.12 kernel. Attempted to advance past end of bvec iter This change should have been included with #11378 when a similar change was made on the read side. Suggested-by: @siebenmann Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Issue #11378 Closes #12041 Closes #12155	2021-06-09 13:05:34 -07:00
Alexander Motin	85c43508f3	Introduce write-mostly sums wmsum counters are a reduced version of aggsum counters, optimized for write-mostly scenarios. They do not provide optimized read functions, but instead allow much cheaper add function. The primary usage is infrequently read statistic counters, not requiring exact precision. The Linux implementation is directly mapped into percpu_counter KPI. The FreeBSD implementation is directly mapped into counter(9) KPI. In user-space due to lack of better implementation mapped to aggsum. Unfortunately neither Linux percpu_counter nor FreeBSD counter(9) provide sufficient functionality to completelly replace aggsum, so it still remains to be used for several hot counters. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12114	2021-06-09 13:05:34 -07:00
Alexander Motin	09c0a8fd1a	Improve scrub maxinflight_bytes math. Previously, ZFS scaled maxinflight_bytes based on total number of disks in the pool. A 3-wide mirror was receiving a queue depth of 3 disks, which it should not, since it reads from all the disks inside. For wide raidz the situation was slightly better, but still a 3-wide raidz1 received a depth of 3 disks instead of 2. The new code counts only unique data disks, i.e. 1 disk for mirrors and non-parity disks for raidz/draid. For draid the math is still imperfect, since vdev_get_nparity() returns number of parity disks per group, not per vdev, but still some better than it was. This should slightly reduce scrub influence on payload for some pool topologies by avoiding excessive queuing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closing #12046	2021-06-08 14:50:57 -07:00
Rich Ercolani	57e3b9c3cc	Bend zpl_set_acl to permit the new userns* parameter Just like #12087, the set_acl signature changed with all the bolted-on *userns parameters, which disabled set_acl usage, and caused #12076. Turn zpl_set_acl into zpl_set_acl and zpl_set_acl_impl, and add a new configure test for the new version. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12076 Closes #12093	2021-05-27 22:31:57 -07:00
наб	6316086b72	Various Linux kABI cosmetics Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12103	2021-05-27 22:31:57 -07:00
наб	7cdd4dd33b	linux: don't fall through to 3-arg vfs_getattr Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12103	2021-05-27 22:31:57 -07:00
Alexander Motin	a7bb2dab4e	FreeBSD: Update dataset_kstats for zvols in dev mode Previous commit added accounting for geom mode, but not for dev. In geom mode we actually have GEOM statistics, while in dev mode additional accounting actually makes more sense. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12097	2021-05-27 22:31:57 -07:00
Alexander Motin	aa4a84e616	FreeBSD: avoid memory allocation in arc_prune_async Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12049	2021-05-27 22:31:57 -07:00
Alexander Motin	cc1c7b0171	FreeBSD: Retry OCF ENOMEM errors. ZFS does not expect transient errors from crypto. For read they are counted as checksum errors, while for write end up in panic. To not panic on random low memory conditions retry ENOMEM errors in the OCF wrapper function. While there remove unneeded timeout and priority from msleep(). External-issue: https://reviews.freebsd.org/D30339 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12077	2021-05-27 22:31:57 -07:00
Rich Ercolani	fa7ee48e10	Add note for printing all dbgmsg entries on FreeBSD I looked for a bit, and couldn't find any documentation on how to print all logged dbgmsg entries, just messages since the DTrace probe started, until @allanjude kindly pointed me toward the sysctl. So let's add that note where the DTrace probe is mentioned for FreeBSD, so other people can find it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12113	2021-05-27 22:31:57 -07:00
vermavipinkumar	d6bedbbc44	Propagate vdev state due to invalid label corruption Propagate vdev child state to parents on invalid label Add VDEV_AUX_BAD_LABEL to print_import_config() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com> Signed-off-by: Vipin Kumar Verma <vipin.verma@hpe.com> Closes #12088	2021-05-27 22:31:57 -07:00
Rich Ercolani	ab6717cba6	Update tmpfile() existence detection Linux changed the tmpfile() signature again in torvalds/linux@6521f89, which in turn broke our HAVE_TMPFILE detection in configure. Update that macro to include the new case, and change the signature of zpl_tmpfile as appropriate. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes: #12060 Closes: #12087	2021-05-27 22:31:56 -07:00
Brian Behlendorf	33a06f27e6	Fix dRAID sequential resilver silent damage handling This change addresses two distinct scenarios which are possible when performing a sequential resilver to a dRAID pool with vdevs that contain silent unknown damage. Which in this circumstance took the form of the devices being intentionally overwritten with zeros. However, it could also result from a device returning incorrect data while a sequential resilver was in progress. Scenario 1) A sequential resilver is performed while all of the dRAID vdevs are ONLINE and there is silent damage present on the vdev being resilvered. In this case, nothing will be repaired by vdev_raidz_io_done_reconstruct_known_missing() because rc->rc_error isn't set on any of the raid columns. To address this vdev_draid_io_start_read() has been updated to always mark the resilvering column as ESTALE for sequential resilver IO. Scenario 2) Multiple columns contain silent damage for the same block and a sequential resilver is performed. In this case it's impossible to generate the correct data from parity unless all of the damaged columns are being sequentially resilvered (and thus only good data is used to generate parity). This is as expected and there's nothing which can be done about it. However, we need to be careful not to make to situation worse. Since we can't verify the data is actually good without a checksum, we must only repair the devices which are being sequentially resilvered. Otherwise, an incorrect repair to a device which previously contained good data could effectively lock in the damage and make reconstruction impossible. A check for this was added to vdev_raidz_io_done_verified() along with a new test case. Lastly, this change updates the redundancy_draid_spare1 and redundancy_draid_spare3 test cases to be more representative of normal dRAID replacement operation. Specifically, what we care about is that the scrub run after a sequential resilver does not find additional blocks which need repair. This would indicate the sequential resilver failed to rebuild a section of one of the devices. Note also the tests were switched to using the verify_pool() function which still checks for checksum errors. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12061	2021-05-27 22:31:56 -07:00
Rich Ercolani	272b178d52	Simple change to fix building in recent environments Renamed _fini too for symmetry. Suggested-by: @ensch Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12059 Closes: #11987 Closes: #12056	2021-05-27 22:31:56 -07:00
Alexander Motin	8f3584292f	Scale worker threads and taskqs with number of CPUs While use of dynamic taskqs allows to reduce number of idle threads, hardcoded 8 taskqs of each kind is a big overkill for small systems, complicating CPU scheduling, increasing I/O reorder, etc, while providing no real locking benefits, just not needed there. On another side, 12*8 worker threads per kind are able to overload almost any system nowadays. For example, pool of several fast SSDs with SHA256 checksum makes system barely responsive during scrub, or with dedup enabled barely responsive during large file deletion. To address both problems this patch introduces ZTI_SCALE macro, alike to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs, to be used in places where lock scalability is needed, while request ordering is not so much. The code is made to create new taskq for ~6 worker threads (less for small systems, but more for very large) up to 80% of CPU cores (previous 75% was not good for rounding down). Both number of threads and threads per taskq are now tunable in case somebody really wants to use all of system power for ZFS. While obviously some benchmarks show small peak performance reduction (not so big really, especially on systems with SMT, where use of the second threads does not give as much performance as the first ones), they also show dramatic latency reduction and much more smooth user- space operation in case of high CPU usage by ZFS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11966	2021-05-27 22:31:56 -07:00
Paul Zuchowski	ec75f80a50	Fix dmu_recv_stream test for resumable Use dsl_dataset_has_resume_receive_state() not dsl_dataset_is_zapified() to check if stream is resumable. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #12034	2021-05-27 22:12:40 -07:00
Ryan Moeller	c2c02e490f	FreeBSD: Use SET_ERROR to trace xattr name errors Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11997	2021-05-27 22:12:26 -07:00
Brian Behlendorf	faa5673982	Revert "Fix raw sends on encrypted datasets when copying back snapshots" Commit `d1d4769` takes into account the encryption key version to decide if the local_mac could be zeroed out. However, this could lead to failure mounting encrypted datasets created with intermediate versions of ZFS encryption available in master between major releases. In order to prevent this situation revert `d1d4769` pending a more comprehensive fix which addresses the mount failure case. Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11294 Issue #12025 Issue #12300 Closes #12033	2021-05-27 22:10:13 -07:00
Coleman Kane	17351a79e2	linux 5.13 compat: bdevops->revalidate_disk() removed Linux kernel commit 0f00b82e5413571ed225ddbccad6882d7ea60bc7 removes the revalidate_disk() handler from struct block_device_operations. This caused a regression, and this commit eliminates the call to it and the assignment in the block_device_operations static handler assignment code, when configure identifies that the kernel doesn't support that API handler. Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #11967 Closes #11977	2021-05-27 22:09:26 -07:00
наб	1cb517aebd	module/zfs: remove zfs_zevent_console and zfs_zevent_cols zfs_zevent_console committed multiple printk()s per line without properly continuing them ‒ a single event could easily be fragmented across over thirty lines, making it useless for direct application zfs_zevent_cols exists purely to wrap the output from zfs_zevent_console The niche this was supposed to fill can be better served by something akin to the all-syslog ZEDLET Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #7082 Closes #11996	2021-05-27 22:09:19 -07:00
Brian Behlendorf	2085a5f992	Fix dRAID self-healing short columns When dRAID performs a normal read operation only the data columns in the raid map are read from disk. This is enough information to calculate the checksum, verify it, and return the needed data to the application. It's only in the event of a checksum failure that the additional parity and any empty columns must be read since they are required for parity reconstruction. Reading these additional columns is handled by vdev_raidz_read_all() which calls vdev_draid_map_alloc_empty() to expand the raid_map_t and submit IOs for the missing columns. This all works correctly, but it fails to account for any "short" columns. These are data columns which are padded with a empty skip sector at the end. Since that empty sector is not needed for a normal read it's not read when columns is first read from disk. However, like the parity and empty columns the skip sector is needed to perform reconstruction. The fix is to mark any "short" columns as never being read by clearing the rc_tried flag when expanding the raid_map_t. This will cause the entire column to re-read from disk in the event of a checksum failure allowing the self-healing functionality to repair the block. Note that this only effects the self-healing feature because when scrubbing a pool the parity, data, and empty columns are all read initially to verify their contents. Furthermore, only blocks which contain "short" columns would be effected, and only when the memory backing the skip sector wasn't already zeroed out. This change extends the existing redundancy_raidz.ksh test case to verify self-healing (as well as resilver and scrub). Then applies the same test case to dRAID with a slightly modified version of the test script called redundancy_draid.ksh. The unused variable combrec was also removed from both test cases. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12010	2021-05-10 12:18:36 -07:00
наб	b1dd6351bb	Replace ZoL with OpenZFS where applicable Afterward, git grep ZoL matches: * README.md: * [ZoL Site](https://zfsonlinux.org) - Correct * etc/default/zfs.in:# ZoL userland configuration. - Changing this would induce a needless upgrade-check, if the user has modified the configuration; this can be updated the next time the defaults change * module/zfs/dmu_send.c: * ZoL < 0.7 does not handle [...] - Before 0.7 is ZoL, so fair enough Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #11956	2021-05-10 12:16:46 -07:00
Ryan Moeller	cb18cf6b0a	FreeBSD: Remove !FreeBSD ifdef'd code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11994	2021-05-10 12:16:39 -07:00
Ryan Moeller	6c25218c7e	Clean up use of zfs_log_create in zfs_dir zfs_log_create returns void, so there is no reason to cast its return value to void at the call site. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11994	2021-05-10 12:16:32 -07:00
Alyssa Ross	f15ec889a9	Return required size when encode_fh size too small Quoting <linux/exportfs.h>: > encode_fh() should return the fileid_type on success and on error > returns 255 (if the space needed to encode fh is greater than > @max_len*4 bytes). On error @max_len contains the minimum size (in 4 > byte unit) needed to encode the file handle. ZFS was not setting max_len in the case where the handle was too small. As a result of this, the `t_name_to_handle_at.c' example in name_to_handle_at(2) did not work on ZFS. zfsctl_fid() will itself set max_len if called with a fid that is too small, so if we give zfs_fid() that behavior as well, the fix is quite easy: if the handle is too small, just use a zero-size fid instead of the handle. Tested by running t_name_to_handle_at on a normal file, a directory, a .zfs directory, and a snapshot. Thanks-to: Puck Meerburg <puck@puckipedia.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Alyssa Ross <hi@alyssa.is> Closes #11995	2021-05-10 12:13:45 -07:00
Alexander Motin	0db64e9036	Simplify/fix dnode_move() for dn_zfetch Previous code tried to keep prefetch streams while moving dnode. But it was at least not updating per-stream zs_fetchback pointers, causing use-after-free on next access. Instead of that I see much easier and cleaner to just drop old prefetch state and start new from scratch. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11936 Closes #11998	2021-05-10 12:13:29 -07:00
Ryan Moeller	85071b2ff5	FreeBSD: Initialize/destroy zp->z_lock zp->z_lock is used in shared code for protecting projid and scantime. We don't exercise these paths much if at all on FreeBSD, so have been lucky enough not to have issues with the uninitialized locks so far. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #12003	2021-05-10 12:13:12 -07:00
Ryan Moeller	5701e393b7	FreeBSD: Prune some unneeded definitions IS_XATTRDIR is never used. v_count is only used in two places, one immediately followed by the use of the real name, v_usecount. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #11973	2021-05-10 12:09:34 -07:00
Nathaniel Wesley Filardo	4ac37f8b2e	vdev_mirror: don't scrub/resilver devices that can't be read This ensures that we don't accumulate checksum errors against offline or unavailable devices but, more importantly, means that we don't needlessly create DTL entries for offline devices that are already up-to-date. Consider a 3-way mirror, with disk A always online (and so always with an empty DTL) and B and C only occasionally online. When A & B resilver with C offline, B's DTL will effectively be appended to C's due to these spurious ZIOs even as the resilver empties B's DTL: * These ZIOs land in vdev_mirror_scrub_done() and flag an error * That flagged error causes vdev_mirror_io_done() to see unexpected_errors, so it issues a ZIO_TYPE_WRITE repair ZIO, which inherits ZIO_FLAG_SCAN_THREAD because zio_vdev_child_io() includes that flag in ZIO_VDEV_CHILD_FLAGS. * That ZIO fails, too, and eventually zio_done() gets its hands on it and calls vdev_stat_update(). * vdev_stat_update() sees the error and this zio... * is not speculative, * is not due to EIO (but rather ENXIO, since the device is closed) * has an ->io_vd != NULL (specifically, the offline leaf device) * is a write * is for a txg != 0 (but rather the read block's physical birth txg) * has ZIO_FLAG_SCAN_THREAD asserted * So: vdev_stat_update() calls vdev_dtl_dirty() on the offline vdev. Then, when A & C resilver with B offline, that story gets replayed and C's DTL will be appended to B's. In fact, one does not need this permanently-broken-mirror scenario to induce badness: breaking a mirror with no DTLs and then scrubbing will create DTLs for all offline devices. These DTLs will persist until the entire mirror is reassembled for the duration of the resilver, which, incidentally, will not consider the devices with good data to be sources of good data in the case of a read failure. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nathaniel Wesley Filardo <nwfilardo@gmail.com> Closes #11930	2021-05-10 12:08:07 -07:00
Martin Matuška	463d7e1a61	Drop "All rights reserved" from files by trasz@FreeBSD.org This obeys the change in freebsd/freebsd-src@bce7ee9d4 External-issue: https://reviews.freebsd.org/D26980 Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #11947	2021-05-10 12:06:21 -07:00
Mateusz Guzik	1b7d883eaa	FreeBSD: damage control racing .. lookups in face of mkdir/rmdir External-issue: https://reviews.freebsd.org/D29769 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11926	2021-05-10 12:05:49 -07:00
Romain Dolbeau	e5c01296ff	Fix AVX512BW Fletcher code on AVX512-but-not-BW machines Introduce a specific valid function for avx512f+avx512bw (instead of checking only for avx512f). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: Romain Dolbeau <romain@dolbeau.org> Closes #11937 Closes #11938	2021-05-10 12:05:36 -07:00
Attila Fülöp	cdc27fd061	ICP: Silence objtool "stack pointer realignment" warnings Objtool requires the use of a DRAP register while aligning the stack. Since a DRAP register is a gcc concept and we are notoriously low on registers in the crypto code, it's not worth the effort to mimic gcc generated stack realignment. We simply silence the warning by adding the offending object files to OBJECT_FILES_NON_STANDARD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #6950 Closes #11914	2021-04-19 15:22:58 -07:00
Mateusz Guzik	14a1980b35	Combine zio caches if possible This deduplicates 2 sets of caches which use the same allocation size. Memory savings fluctuate a lot, one sample result is FreeBSD running "make buildworld" saving ~180MB RAM in reduced page count associated with zio caches. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11877	2021-04-19 15:22:58 -07:00
Paul Zuchowski	d938c6ee27	Fix crash in zio_done error reporting Fix NULL pointer dereference when reporting checksum error for gang block in zio_done. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #11872 Closes #11896	2021-04-19 15:22:57 -07:00
наб	bdcf0cca10	linux/spl: proc: use global table_{min,max} values instead of local ones Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #11879	2021-04-19 15:22:57 -07:00
наб	e5c4f86e7a	linux/spl: base proc_dohostid() on proc_dostring() This fixes /proc/sys/kernel/spl/hostid on kernels with mainline commit 32927393dc1ccd60fb2bdc05b9e8e88753761469 ("sysctl: pass kernel pointers to ->proc_handler") ‒ 5.7-rc1 and up The access_ok() check in copy_to_user() in proc_copyout_string() would always fail, so all userspace reads and writes would fail with EINVAL proc_dostring() strips only the final new-line, but simple_strtoul() doesn't actually need a back-trimmed string ‒ writing "012345678 \n" is still allowed, as is "012345678zupsko", &c. This alters what happens when an invalid value is written ‒ previously it'd get set to what-ever simple_strtoul() returned (probably 0, thereby resetting it to default), now it does nothing Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #11878 Closes #11879	2021-04-19 15:22:57 -07:00
Jitendra Patidar	4c925936e3	ZFS traverse_visitbp optimization to limit prefetch Traversal code, traverse_visitbp() does visit blocks recursively. Indirect (Non L0) Block of size 128k could contain, 1024 block pointers of 128 bytes. In case of full traverse OR incremental traverse, where all blocks were modified, it could traverse large number of blocks pointed by indirect. Traversal code does issue prefetch of blocks traversed below indirect. This could result into large number of async reads queued on vdev queue. So, account for prefetch issued for blocks pointed by indirect and limit max prefetch in one go. Module Param: zfs_traverse_indirect_prefetch_limit: Limit of prefetch while traversing an indirect block. Local counters: prefetched: Local counter to account for number prefetch done. pidx: Index for which next prefetch to be issued. ptidx: Index at which next prefetch to be triggered. Keep "ptidx" somewhere in the middle of blocks prefetched, so that blocks prefetch read gets the enough time window before their demand read is issued. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #11802 Closes #11803	2021-04-19 15:22:57 -07:00
Paul Dagnelie	d682e20ba4	Add SIGSTOP and SIGTSTP handling to issig This change adds SIGSTOP and SIGTSTP handling to the issig function; this mirrors its behavior on Solaris. This way, long running kernel tasks can be stopped with the appropriate signals. Note that doing so with ctrl-z on the command line doesn't return control of the tty to the shell, because tty handling is done separately from stopping the process. That can be future work, if people feel that it is a necessary addition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Issue #810 Issue #10843 Closes #11801	2021-04-19 15:12:33 -07:00
Mateusz Guzik	2aa0d643fd	FreeBSD: use vnlru_free_vfsops if available Fixes issues when zfs is used along with other filesystems. External-issue: https://cgit.freebsd.org/src/commit/?id=e9272225e6bed840b00eef1c817b188c172338ee Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11881	2021-04-14 13:23:08 -07:00
Mateusz Guzik	a6b82cc0bb	FreeBSD: add missing seqc write begin/end around zfs_acl_chown_setattr It happens to trip over an assert but does not matter for correctness at this time. Done for future proofing. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11884	2021-04-14 13:23:08 -07:00
Mateusz Guzik	4568b5cfba	FreeBSD: add support for lockless symlink lookup Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11883	2021-04-14 13:23:08 -07:00
Brian Behlendorf	7de1797cee	ZTS: fix removal_condense_export test case It's been observed in the CI that the required 25% of obsolete bytes in the mapping can be to high a threshold for this test resulting in condensing never being triggered and a test failure. To prevent these failures make the existing zfs_condense_indirect_obsolete_pct tuning available so the obsolete percentage can be reduced from 25% to 5% during this test. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11869	2021-04-14 13:23:08 -07:00
pstef	dfd3015048	Balance parentheses in parameter descriptions Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Piotr Paweł Stefaniak <pstef@freebsd.org> Closes #11882	2021-04-14 13:23:07 -07:00
TerraTech	30f5b2fbe2	zpl_inode.c: Fix SMACK interoperability SMACK needs to have the ZFS dentry security field setup before SMACK's d_instantiate() hook is called as it requires functioning '__vfs_getxattr()' calls to properly set the labels. Fxes: 1) file instantiation properly setting the object label to the subject's label 2) proper file labeling in a transmutable directory Functions Updated: 1) zpl_create() 2) zpl_mknod() 3) zpl_mkdir() 4) zpl_symlink() External-issue: https://github.com/cschaufler/smack-next/issues/1 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: TerraTech <TerraTech@users.noreply.github.com> Closes #11646 Closes #11839	2021-04-14 13:19:50 -07:00
Brian Behlendorf	f17d146ca6	Use dsl_scan_setup_check() to setup a scrub When a rebuild completes it will automatically schedule a follow up scrub to verify all of the block checksums. Before setting up the scrub execute the counterpart dsl_scan_setup_check() function to confirm the scrub can be started. Prior to this change we'd only check vdev_rebuild_active() which isn't as comprehensive, and using the check function keeps all of this logic in one place. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11849	2021-04-14 13:19:49 -07:00
Tino Reichardt	a35cadcf14	Fix double sha1/sha1.o line in module/icp/Makefile.in Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #11852	2021-04-14 13:19:49 -07:00
Ryan Moeller	7822c01eb6	Ratelimit deadman zevents as with delay zevents Just as delay zevents can flood the zevent pipe when a vdev becomes unresponsive, so do the deadman zevents. Ratelimit deadman zevents according to the same tunable as for delay zevents. Enable deadman tests on FreeBSD and add a test for deadman event ratelimiting. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11786	2021-04-14 13:19:49 -07:00
Andrea Gelmini	ca7af7f675	Fix various typos Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11774	2021-04-07 13:27:11 -07:00
Ryan Moeller	003f2d04b6	FreeBSD: Fix stable/12 after AT_BENEATH removal Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11827	2021-04-07 13:25:20 -07:00
Ryan Moeller	895d39aa83	Allow pool names that look like Solaris disk names Nothing bad happens if a prefix of your pool name matches a disk name. This is a bit of a silly restriction at this point. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #11781 Closes #11813	2021-04-07 13:24:46 -07:00
Ryan Moeller	7f789e150f	Don't scale zfs_zevent_len_max by CPU count The lower bound for this scaling to too low and the upper bound is too high. Use a fixed default length of 512 instead, which is a reasonable value on any system. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11822	2021-04-07 13:24:38 -07:00
Ryan Moeller	917f4b334c	Atomically check and set dropped zevent count ratelimit_dropped isn't protected by a lock and is expected to be updated atomically. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11822	2021-04-07 13:24:33 -07:00
Matthew Ahrens	2b56a63457	Use a helper function to clarify gang block size For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the gang DVA including its children BP's. The space allocated at each DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`. This commit makes this relationship more clear by using a helper function, `vdev_gang_header_asize()`, for the space allocated at the gang block's vdev/offset. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11744	2021-03-26 11:19:35 -07:00
Luis Henriques	2037edbdaa	Fix error code on __zpl_ioctl_setflags() Other (all?) Linux filesystems seem to return -EPERM instead of -EACCESS when trying to set FS_APPEND_FL or FS_IMMUTABLE_FL without the CAP_LINUX_IMMUTABLE capability. This was detected by generic/545 test in the fstest suite. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Luis Henriques <henrix@camandro.org> Closes #11791	2021-03-26 10:46:45 -07:00
Andrea Gelmini	8a915ba1f6	Removed duplicated includes Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11775	2021-03-22 12:34:58 -07:00
Alexander Motin	891568c990	Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652	2021-03-19 22:56:11 -07:00
Chunwei Chen	296a4a369b	Fix zfs_get_data access to files with wrong generation If TX_WRITE is create on a file, and the file is later deleted and a new directory is created on the same object id, it is possible that when zil_commit happens, zfs_get_data will be called on the new directory. This may result in panic as it tries to do range lock. This patch fixes this issue by record the generation number during zfs_log_write, so zfs_get_data can check if the object is valid. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #10593 Closes #11682	2021-03-19 22:53:31 -07:00

1 2 3 4 5 ...

3482 Commits