Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	6575defc52	Verify dRAID empty sectors Verify that all empty sectors are zero filled before using them to calculate parity. Failure to do so can result in incorrect parity columns being generated and written to disk if the contents of an empty sector are non-zero. This was possible because the checksum only protects the data portions of the buffer, not the empty sector padding. This issue has been addressed by updating raidz_parity_verify() to check that all dRAID empty sectors are zero filled. Any sectors which are non-zero will be fixed, repair IO issued, and a checksum error logged. They can then be safely used to verify the parity. This specific type of damage is unlikely to occur since it requires a disk to have silently returned bad data, for an empty sector, while performing a scrub. However, if a pool were to have been damaged in this way, scrubbing the pool with this change applied will repair both the empty sector and parity columns as long as the data checksum is valid. Checksum errors will be reported in the `zpool status` output for any repairs which are made. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12857	2022-02-03 15:28:01 -08:00
Damian Szuberski	64e88992b6	Update `checkstyle` workflow env to ubuntu-20.04 - `checkstyle` workflow uses ubuntu-20.04 environment - improved `mancheck.sh` readability Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12713	2021-12-08 13:27:56 -08:00
Brian Behlendorf	664d487a5d	Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency When using lseek(2) to report data/holes memory mapped regions of the file were ignored. This could result in incorrect results. To handle this zfs_holey_common() was updated to asynchronously writeback any dirty mmap(2) regions prior to reporting holes. Additionally, while not strictly required, the dn_struct_rwlock is now held over the dirty check to prevent the dnode structure from changing. This ensures that a clean dnode can't be dirtied before the data/hole is located. The range lock is now also taken to ensure the call cannot race with zfs_write(). Furthermore, the code was refactored to provide a dnode_is_dirty() helper function which checks the dnode for any dirty records to determine its dirtiness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12724	2021-11-05 08:08:55 -07:00
Martin Matuška	b7ecb4ff0d	FreeBSD: fix compilation of FreeBSD world after `29274c9f6` prng32_bounded() is available to kernel only on FreeBSD 13+. Call inline random_get_pseudo_bytes() with correct pointer type. To be consistent, apply to Linux as well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #12282	2021-11-02 13:35:47 -07:00
Brian Behlendorf	143476ce8d	Use fallthrough macro As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12441	2021-11-02 09:50:30 -07:00
Rich Ercolani	72a989cf60	Fix cross-endian interoperability of zstd It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12008 Closes #12022	2021-09-14 15:05:55 -07:00
Alexander Motin	a4862125b8	Remove b_pabd/b_rabd allocation from arc_hdr_alloc() When a header is allocated for full overwrite it is a waste of time to allocate b_pabd/b_rabd for it, since arc_write() will free them without ever being touched. If it is a read or a partial overwrite then arc_read() and arc_hdr_decrypt() allocate them explicitly. Reduced memory allocation in user threads also reduces ARC eviction throttling there, proportionally increasing it in ZIO threads, that is not good. To minimize or even avoid it introduce ARC allocation reserve, allowing certain arc_get_data_abd() callers to allocate a bit longer in situations where user threads will already throttle. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12398	2021-09-14 14:31:50 -07:00
Alexander Motin	40e02f49e9	Fix/improve dbuf hits accounting Instead of clearing stats inside arc_buf_alloc_impl() do it inside arc_hdr_alloc() and arc_release(). It fixes statistics being wiped every time a new dbuf is filled from the ARC. Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits. Since the hits are accounted under hash lock, replace atomics with simple increments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12422	2021-09-14 14:31:22 -07:00
Alexander Motin	5afc35b698	Use more atomics in refcounts Use atomic_load_64() for zfs_refcount_count() to prevent torn reads on 32-bit platforms. On 64-bit ones it should not change anything. When built with ZFS_DEBUG but running without tracking enabled use atomics instead of mutexes same as for builds without ZFS_DEBUG. Since rc_tracked can't change live we can check it without lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12420	2021-09-14 14:31:01 -07:00
Allan Jude	24e51e3749	Restore FreeBSD sysctl processing for arc.min and arc.max Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max to a disallowed value would return an error. Since the switch, it instead only generates WARN_IF_TUNING_IGNORED Keep the ability to set the sysctl's specifically to 0, even though that is less than the minimum, because some tests depend on this. Also lost, was the ability to set vfs.zfs.arc_max to a value less than the default vfs.zfs.arc_min at boot time. Restore this as well. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #12161	2021-09-14 14:31:01 -07:00
Tony Nguyen	477edd642c	Run arc_evict thread at higher priority Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397	2021-09-14 14:30:13 -07:00
Alexander Motin	32c0b6468c	Optimize allocation throttling Remove mc_lock use from metaslab_class_throttle_(). The math there is based on refcounts and so atomic, so the only race possible there is between zfs_refcount_count() and zfs_refcount_add(). But in most cases metaslab_class_throttle_reserve() is called with the allocator lock held, which covers the race. In cases where the lock is not held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not use zfs_refcount_count(). And even if we assume some other non-existing scenario, the worst that may happen from this race is few more I/Os get to allocation earlier, that is not a problem. Move locks and data of different allocators into different cache lines to avoid false sharing. Group spa_alloc_ arrays together into single array of aligned struct spa_alloc spa_allocs. Align struct metaslab_class_allocator. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12314	2021-09-14 12:40:15 -07:00
Alexander Motin	6a49948c73	Minor ARC optimizations Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros of real state variables addresses (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special handling in inner loop of ARC reclamation. Respectively change bytes argument of arc_evict_state() from int64_t to uint64_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12348	2021-09-14 12:39:48 -07:00
Alexander	4affa09f3e	A few fixes of callback typecasting (for the upcoming ClangCFI) * zio: avoid callback typecasting * zil: avoid zil_itxg_clean() callback typecasting * zpl: decouple zpl_readpage() into two separate callbacks * nvpair: explicitly declare callbacks for xdr_array() * linux/zfs_nvops: don't use external iput() as a callback * zcp_synctask: don't use fnvlist_free() as a callback * zvol: don't use ops->zv_free() as a callback for taskq_dispatch() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12260	2021-09-14 12:39:48 -07:00
Alexander Motin	ba76bb30a6	Introduce dsl_dir_diduse_transfer_space() Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12300	2021-09-14 12:38:51 -07:00
Alexander Motin	45305a067f	Fix ARC ghost states eviction accounting arc_evict_hdr() returns number of evicted bytes in scope of specific state. For ghost states it does not mean the amount of really freed memory, but the logical buffer size. It is correct for the eviction process, but not for waking up threads waiting for ARC size reduction, as added in "Revise ARC shrinker algorithm" commit, causing premature wakeups while ARC is still overflowed, allowing even bigger overflow, plus processing overhead when next allocation will also get blocked, probably also for too short time. To fix that make arc_evict_hdr() also return the amount of really freed memory, which for the ghost states is only the header, and use it to update arc_evict_count instead. Originally I was thinking to not return it at all, since arc_get_data_impl() does not account for the headers, but decided that some slow allocation progress is better than long waits, reaching on my tests up to 100ms. To reduce negative latency effects of long time periods when reclaim thread can free little real memory, start reclamation process earlier, before we actually reached the overflow threshold, when we have to throttle new allocations. We can also do it without taking global arc_evict_lock, reducing the contention. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12279	2021-09-14 12:38:05 -07:00
George Wilson	8415c3c170	file reference counts can get corrupted Callers of zfs_file_get and zfs_file_put can corrupt the reference counts for the file structure resulting in a panic or a soft lockup. When zfs send/recv runs, it will add a reference count to the open file, and begin to send or recv the stream. If the file descriptor is closed, then when dmu_recv_stream() or dmu_send() return we will call zfs_file_put to remove the reference we placed on the file structure. Unfortunately, because zfs_file_put() uses the file descriptor to lookup the file structure, it may end up finding that the file descriptor table no longer contains the file struct, thus leaking the file structure. Or it might end up finding a file descriptor for a different file and blindly updating its reference counts. Other failure modes probably exists. This change reworks the zfs_file_[get\|put] interface to not rely on the file descriptor but instead pass the zfs_file_t pointer around. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-76119 Closes #12299	2021-09-14 12:37:38 -07:00
Jorgen Lundman	04ebe29188	dprintf_dnode: strcpy -> strlcpy Missed a couple of strcpy() in earlier commit, this is only used with --enable-debug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12311	2021-09-14 12:37:38 -07:00
Alexander Motin	49bb454120	FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE It makes no sense to set it below PAGE_SIZE, since it increases all overheads and makes returning memory to OS problematic. It makes no sense to set it above PAGE_SIZE, since such allocations and especially frees are too expensive and cause KVA fragmentation to benefit from fewer chunks. After that it makes no sense to keep more complicated math here. What may have sense though is just a tunable border between linear and scatter ABDs, previously also controlled by this tunable. Retain that functionality by taking abd_scatter_min_size tunable from Linux, just with different default value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12328	2021-09-14 12:36:44 -07:00
Alexander Motin	15177c1aac	Compact dbuf/buf hashes and lock arrays With default dbuf cache size of 1/32 of ARC, it makes no sense to have hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming higher I/O rate via dbuf cache than via ARC. Remove padding from ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would have sense if there would be a limited number of very busy locks. But since we have no limit on the number, using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level. Reduce number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but reduced memory size improves cache hit rate for mutex_tryenter() in ARC eviction thread, saving about 1% of the thread time. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12289	2021-09-14 12:22:46 -07:00
Jorgen Lundman	035219ee10	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves perofrmance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-09-14 12:22:28 -07:00
Alexander Motin	f3969ea78b	Optimize small random numbers generation In all places except two spa_get_random() is used for small values, and the consumers do not require well seeded high quality values. Switch those two exceptions directly to random_get_pseudo_bytes() and optimize spa_get_random(), renaming it to random_in_range(), since it is not related to SPA or ZFS in general. On FreeBSD directly map random_in_range() to new prng32_bounded() KPI added in FreeBSD 13. On Linux and in user-space just reduce the type used to uint32_t to avoid more expensive 64bit division. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12183	2021-09-14 12:10:17 -07:00
Attila Fülöp	088712793e	gcc 11 cleanup Compiling with gcc 11.1.0 produces three new warnings. Change the code slightly to avoid them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #12130 Closes #12188 Closes #12237	2021-06-24 13:13:40 -07:00
Rich Ercolani	5e89181544	Annotated dprintf as printf-like ZFS loves using %llu for uint64_t, but that requires a cast to not be noisy - which is even done in many, though not all, places. Also a couple places used %u for uint64_t, which were promoted to %llu. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12233	2021-06-24 13:12:36 -07:00
Alexander Motin	6b239d1757	Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172 ) wmsum was designed exactly for cases like these with many updates and rare reads. It allows to completely avoid atomic operations on congested global variables. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12172	2021-06-24 13:10:59 -07:00
Alexander Motin	57196f8ae9	Re-embed multilist_t storage This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcount's to different cache lines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-by: iXsystems, Inc. Closes #12158	2021-06-10 10:50:16 -07:00
Alexander Motin	efdfb14fc8	Remove pool io kstats This mostly reverts "3537 want pool io kstats" commit of 8 years ago. From one side this code using pool-wide locks became pretty bad for performance, creating significant lock contention in I/O pipeline. From another, there are more efficient ways now to obtain detailed statistics, while this statistics is illumos-specific and much less usable on Linux and FreeBSD, reported only via procfs/sysctls. This commit does not remove KSTAT_TYPE_IO implementation, that may be removed later together with already unused KSTAT_TYPE_INTR and KSTAT_TYPE_TIMER. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12212	2021-06-10 10:50:16 -07:00
Alan Somers	2c3d7283b4	libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat `getfsstat(2)` is used to retrieve the list of mounted file systems, which libzfs uses when fetching properties like mountpoint, atime, setuid, etc. The `mode` parameter may be `MNT_NOWAIT`, which uses information in the VFS's cache, or `MNT_WAIT`, which effectively does a `statfs` on every single mounted file system in order to fetch the most up-to-date information. As far as I can tell, the only fields that libzfs cares about are the filesystem's name, mountpoint, fstypename, and mount flags. Those things are always updated on mount and unmount, so they will always be accurate in the VFS's mount cache except in two circumstances: 1) When a file system is busy unmounting 2) When a ZFS file system changes the value of a mount-overridable property like atime or setuid, but doesn't remount the file system. Right now that only happens when the property is changed by an unprivileged user who has delegated authority to change the property but not to mount the dataset. But perhaps libzfs could choose to do it for other reasons in the future. Switching to `MNT_NOWAIT` will greatly improve speed with no downside, as long as we explicitly update the mount cache whenever we change a mount-overridable property. For comparison, Illumos gets this information using the native `getmntany` and `getmntent` functions, which also use cached information. The illumos function that would refresh the cache, `resetmnttab`, is never called by libzfs. And on GNU/Linux, `getmntany` and `getmntent` don't even communicate with the kernel directly. They simply parse the file they are given, which is usually /etc/mtab or /proc/mounts. Perhaps the implementation of /proc/mounts is synchronous, ala MNT_WAIT; I don't know. Sponsored-by: Axcient Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes: #12091	2021-06-09 13:05:34 -07:00
Alexander Motin	e76373de7b	More aggsum optimizations - Avoid atomic_add() when updating as_lower_bound/as_upper_bound. Previous code was excessively strong on 64bit systems while not strong enough on 32bit ones. Instead introduce and use real atomic_load() and atomic_store() operations, just an assignments on 64bit machines, but using proper atomics on 32bit ones to avoid torn reads/writes. - Reduce number of buckets on large systems. Extra buckets not as much improve add speed, as hurt reads. Unlike wmsum for aggsum reads are still important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12145	2021-06-09 13:05:34 -07:00
Alexander Motin	85c43508f3	Introduce write-mostly sums wmsum counters are a reduced version of aggsum counters, optimized for write-mostly scenarios. They do not provide optimized read functions, but instead allow much cheaper add function. The primary usage is infrequently read statistic counters, not requiring exact precision. The Linux implementation is directly mapped into percpu_counter KPI. The FreeBSD implementation is directly mapped into counter(9) KPI. In user-space due to lack of better implementation mapped to aggsum. Unfortunately neither Linux percpu_counter nor FreeBSD counter(9) provide sufficient functionality to completelly replace aggsum, so it still remains to be used for several hot counters. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12114	2021-06-09 13:05:34 -07:00
Brian Behlendorf	33a06f27e6	Fix dRAID sequential resilver silent damage handling This change addresses two distinct scenarios which are possible when performing a sequential resilver to a dRAID pool with vdevs that contain silent unknown damage. Which in this circumstance took the form of the devices being intentionally overwritten with zeros. However, it could also result from a device returning incorrect data while a sequential resilver was in progress. Scenario 1) A sequential resilver is performed while all of the dRAID vdevs are ONLINE and there is silent damage present on the vdev being resilvered. In this case, nothing will be repaired by vdev_raidz_io_done_reconstruct_known_missing() because rc->rc_error isn't set on any of the raid columns. To address this vdev_draid_io_start_read() has been updated to always mark the resilvering column as ESTALE for sequential resilver IO. Scenario 2) Multiple columns contain silent damage for the same block and a sequential resilver is performed. In this case it's impossible to generate the correct data from parity unless all of the damaged columns are being sequentially resilvered (and thus only good data is used to generate parity). This is as expected and there's nothing which can be done about it. However, we need to be careful not to make to situation worse. Since we can't verify the data is actually good without a checksum, we must only repair the devices which are being sequentially resilvered. Otherwise, an incorrect repair to a device which previously contained good data could effectively lock in the damage and make reconstruction impossible. A check for this was added to vdev_raidz_io_done_verified() along with a new test case. Lastly, this change updates the redundancy_draid_spare1 and redundancy_draid_spare3 test cases to be more representative of normal dRAID replacement operation. Specifically, what we care about is that the scrub run after a sequential resilver does not find additional blocks which need repair. This would indicate the sequential resilver failed to rebuild a section of one of the devices. Note also the tests were switched to using the verify_pool() function which still checks for checksum errors. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12061	2021-05-27 22:31:56 -07:00
наб	1cb517aebd	module/zfs: remove zfs_zevent_console and zfs_zevent_cols zfs_zevent_console committed multiple printk()s per line without properly continuing them ‒ a single event could easily be fragmented across over thirty lines, making it useless for direct application zfs_zevent_cols exists purely to wrap the output from zfs_zevent_console The niche this was supposed to fill can be better served by something akin to the all-syslog ZEDLET Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #7082 Closes #11996	2021-05-27 22:09:19 -07:00
наб	2453d0263d	lib/: set O_CLOEXEC on all fds As found by git grep -E '(open\|setmntent\|pipe2?)\(' \| grep -vE '((zfs\|zpool)_\|fd\|dl\|lzc_re\|pidfile_\|g_)open\(' FreeBSD's pidfile_open() says nothing about the flags of the files it opens, but we can't do anything about it anyway; the implementation does open all files with O_CLOEXEC Consider this output with zpool.d/media appended with "pid=$$; (ls -l /proc/$pid/fd > /dev/tty)": $ /sbin/zpool iostat -vc media lrwx------ 0 -> /dev/pts/0 l-wx------ 1 -> 'pipe:[3278500]' l-wx------ 2 -> /dev/null lrwx------ 3 -> /dev/zfs lr-x------ 4 -> /proc/31895/mounts lrwx------ 5 -> /dev/zfs lr-x------ 10 -> /usr/lib/zfs-linux/zpool.d/media vs $ ./zpool iostat -vc vendor,upath,iostat,media lrwx------ 0 -> /dev/pts/0 l-wx------ 1 -> 'pipe:[3279887]' l-wx------ 2 -> /dev/null lr-x------ 10 -> /usr/lib/zfs-linux/zpool.d/media Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #11866	2021-04-14 13:23:07 -07:00
Brian Behlendorf	f17d146ca6	Use dsl_scan_setup_check() to setup a scrub When a rebuild completes it will automatically schedule a follow up scrub to verify all of the block checksums. Before setting up the scrub execute the counterpart dsl_scan_setup_check() function to confirm the scrub can be started. Prior to this change we'd only check vdev_rebuild_active() which isn't as comprehensive, and using the check function keeps all of this logic in one place. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11849	2021-04-14 13:19:49 -07:00
Ryan Moeller	7822c01eb6	Ratelimit deadman zevents as with delay zevents Just as delay zevents can flood the zevent pipe when a vdev becomes unresponsive, so do the deadman zevents. Ratelimit deadman zevents according to the same tunable as for delay zevents. Enable deadman tests on FreeBSD and add a test for deadman event ratelimiting. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11786	2021-04-14 13:19:49 -07:00
Andrea Gelmini	ca7af7f675	Fix various typos Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11774	2021-04-07 13:27:11 -07:00
Matthew Ahrens	2b56a63457	Use a helper function to clarify gang block size For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the gang DVA including its children BP's. The space allocated at each DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`. This commit makes this relationship more clear by using a helper function, `vdev_gang_header_asize()`, for the space allocated at the gang block's vdev/offset. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11744	2021-03-26 11:19:35 -07:00
Andrea Gelmini	8a915ba1f6	Removed duplicated includes Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11775	2021-03-22 12:34:58 -07:00
Alexander Motin	891568c990	Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652	2021-03-19 22:56:11 -07:00
Chunwei Chen	296a4a369b	Fix zfs_get_data access to files with wrong generation If TX_WRITE is create on a file, and the file is later deleted and a new directory is created on the same object id, it is possible that when zil_commit happens, zfs_get_data will be called on the new directory. This may result in panic as it tries to do range lock. This patch fixes this issue by record the generation number during zfs_log_write, so zfs_get_data can check if the object is valid. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #10593 Closes #11682	2021-03-19 22:53:31 -07:00
Matthew Ahrens	330c6c0523	Clean up RAIDZ/DRAID ereport code The RAIDZ and DRAID code is responsible for reporting checksum errors on their child vdevs. Checksum errors represent events where a disk returned data or parity that should have been correct, but was not. In other words, these are instances of silent data corruption. The checksum errors show up in the vdev stats (and thus `zpool status`'s CKSUM column), and in the event log (`zpool events`). Note, this is in contrast with the more common "noisy" errors where a disk goes offline, in which case ZFS knows that the disk is bad and doesn't try to read it, or the device returns an error on the requested read or write operation. RAIDZ/DRAID generate checksum errors via three code paths: 1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are reported on any children whose data was not used during the reconstruction. This is handled in `raidz_reconstruct()`. This is the most common type of RAIDZ/DRAID checksum error. 2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that means that the data has been lost. The zio fails and an error is returned to the consumer (e.g. the read(2) system call). This would happen if, for example, three different disks in a RAIDZ2 group are silently damaged. Since the damage is silent, it isn't possible to know which three disks are damaged, so a checksum error is reported against every child that returned data or parity for this read. (For DRAID, typically only one "group" of children is involved in each io.) This case is handled in `vdev_raidz_cksum_finish()`. This is the next most common type of RAIDZ/DRAID checksum error. 3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in case 2), but there happens to be additional copies of this block due to "ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those copies is good, then RAIDZ/DRAID compares each sector of the data or parity that it retrieved with the good data from the other DVA, and if they differ then it reports a checksum error on this child. This differs from case 2 in that the checksum error is reported on only the subset of children that actually have bad data or parity. This case happens very rarely, since normally only metadata has ditto blocks. If the silent damage is extensive, there will be many instances of case 2, and the pool will likely be unrecoverable. The code for handling case 3 is considerably more complicated than the other cases, for two reasons: 1. It needs to run after the main raidz read logic has completed. The data RAIDZ read needs to be preserved until after the alternate DVA has been read, which necessitates refcounts and callbacks managed by the non-raidz-specific zio layer. 2. It's nontrivial to map the sections of data read by RAIDZ to the correct data. For example, the correct data does not include the parity information, so the parity must be recalculated based on the correct data, and then compared to the parity that was read from the RAIDZ children. Due to the complexity of case 3, the rareness of hitting it, and the minimal benefit it provides above case 2, this commit removes the code for case 3. These types of errors will now be handled the same as case 2, i.e. the checksum error will be reported against all children that returned data or parity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11735	2021-03-19 16:22:10 -07:00
Matthew Ahrens	46df6e98aa	Remove unused rr_code The `rr_code` field in `raidz_row_t` is unused. This commit removes the field, as well as the code that's used to set it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11736	2021-03-17 21:57:09 -07:00
Ryan Moeller	f845b2dd1c	FreeBSD: Clean up zfsdev_close to match Linux Resolve some oddities in zfsdev_close() which could result in a panic and were not present in the equivalent function for Linux. - Remove unused definition ZFS_MIN_MINOR - FreeBSD: Simplify zfsdev state destruction - Assert zs_minor is valid in zfsdev_close - Make locking around zfsdev state match Linux Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11720	2021-03-12 16:09:15 -08:00
Brian Behlendorf	e7a06356c1	Suppress cppcheck invalidSyntax warninigs For some reason cppcheck 1.90 is generating an invalidSyntax warning when the BF64_SET macro is used in the zstream source. The same warning is not reported by cppcheck 2.3, nor is their any evident problem with the expanded macro. This appears to be an issue with this version of cppcheck. This commit annotates the source to suppress the warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11700	2021-03-05 17:56:35 -08:00
Prakash Surya	f01eaed455	Add upper bound for slop space calculation This change modifies the behavior of how we determine how much slop space to use in the pool, such that now it has an upper limit. The default upper limit is 128G, but is configurable via a tunable. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #11023	2021-02-24 09:52:43 -08:00
Brian Atkinson	c0801bf35a	Cleaning up uio headers Making uio_impl.h the common header interface between Linux and FreeBSD so both OS's can share a common header file. This also helps reduce code duplication for zfs_uio_t for each OS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11622	2021-02-20 20:16:50 -08:00
Christian Schwarz	edc508ac0b	libzpool: set_global_var: refactor to not modify 'arg' Also fixes leak of the dlopen handle in the error case. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11602	2021-02-19 22:45:04 -08:00
Ryan Moeller	64e0fe14ff	Restore FreeBSD resource usage accounting Add zfs_racct_* interfaces for platform-dependent read/write accounting. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11613	2021-02-19 22:34:33 -08:00
Don Brady	03e02e5b56	Checksum errors may not be counted Fix regression seen in issue #11545 where checksum errors where not being counted or showing up in a zpool event. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #11609	2021-02-19 22:33:15 -08:00
Colm	658fb8020f	Add "compatibility" property for zpool feature sets Property to allow sets of features to be specified; for compatibility with specific versions / releases / external systems. Influences the behavior of 'zpool upgrade' and 'zpool create'. Initial man page changes and test cases included. Brief synopsis: zpool create -o compatibility=off\|legacy\|file[,file...] pool vdev... compatibility = off : disable compatibility mode (enable all features) compatibility = legacy : request that no features be enabled compatibility = file[,file...] : read features from specified files. Only features present in all files will be enabled on the resulting pool. Filenames may be absolute, or relative to /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc checked first). Only affects zpool create, zpool upgrade and zpool status. ABI changes in libzfs: * New function "zpool_load_compat" to load and parse compat sets. * Add "zpool_compat_status_t" typedef for compatibility parse status. * Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum * Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum An initial set of base compatibility sets are included in cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is modified to install these in $pkgdatadir/compatibility.d and to create symbolic links to a reasonable set of aliases. Reviewed-by: ericloewe Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colm Buckley <colm@tuatha.org> Closes #11468	2021-02-17 21:30:45 -08:00
Ryan Moeller	436ab35a53	Make inline ABD predicates compatible with C++ FreeBSD's zfsd fails to build after `e2af2acce3` due to strict type checking errors from the implicit conversion between bool and boolean_t in the inline predicate definitions in abd.h. Use conditionals to return the correct value type from these functions. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #11592	2021-02-15 10:15:50 -08:00
khng300	fc273894d2	Rename zfs_inode_update to zfs_znode_update_vfs zfs_znode_update_vfs is a more platform-agnostic name than zfs_inode_update. Besides that, the function's prototype is moved to include/sys/zfs_znode.h as the function is also used in common code. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ka Ho Ng <khng300@gmail.com> Sponsored by: The FreeBSD Foundation Closes #11580	2021-02-09 11:17:29 -08:00
Matthew Ahrens	2d4bbd14fc	The abd child/parent relationship does not need to be tracked ABD's currently track their parent/child relationship. This applies to `abd_get_offset()` and `abd_borrow_buf()`. However, nothing depends on knowing this relationship, it's only used for consistency checks to verify that we are not destroying an ABD that's still in use. When we are creating/destroying ABD's frequently, the performance impact of maintaining these data structures (in particular the atomic increment/decrement operations) can be measurable. This commit removes this verification code on production builds, but keeps it when ZFS_DEBUG is set. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11535	2021-01-30 10:04:42 -08:00
Alan Somers	cf0977ad72	Parallelize vdev_validate The runtime of vdev_validate is dominated by the disk accesses in vdev_label_read_config. Speed it up by validating all vdevs in parallel using a taskq. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #11470	2021-01-26 19:36:51 -08:00
Alan Somers	a0e01997ec	Parallelize vdev_load metaslab_init is the slowest part of importing a mature pool, and it must be repeated hundreds of times for each top-level vdev. But its speed is dominated by a few serialized disk accesses. That can lead to import times of > 1 hour for pools with many top-level vdevs on spinny disks. Speed up the import by using a taskqueue to parallelize vdev_load across all top-level vdevs. This also requires adding mutex protection to metaslab_class_t.mc_historgram. The mc_histogram fields were unprotected when that code was first written in "Illumos 4976-4984 - metaslab improvements" (OpenZFS `f3a7f6610f`). The lock wasn't added until `3dfb57a35e`, though it's unclear exactly which fields it's supposed to protect. In any case, it wasn't until vdev_load was parallelized that any code attempted concurrent access to those fields. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #11470	2021-01-26 19:35:59 -08:00
Matthew Ahrens	aa755b3549	Set aside a metaslab for ZIL blocks Mixing ZIL and normal allocations has several problems: 1. The ZIL allocations are allocated, written to disk, and then a few seconds later freed. This leaves behind holes (free segments) where the ZIL blocks used to be, which increases fragmentation, which negatively impacts performance. 2. When under moderate load, ZIL allocations are of 128KB. If the pool is fairly fragmented, there may not be many free chunks of that size. This causes ZFS to load more metaslabs to locate free segments of 128KB or more. The loading happens synchronously (from zil_commit()), and can take around a second even if the metaslab's spacemap is cached in the ARC. All concurrent synchronous operations on this filesystem must wait while the metaslab is loading. This can cause a significant performance impact. 3. If the pool is very fragmented, there may be zero free chunks of 128KB or more. In this case, the ZIL falls back to txg_wait_synced(), which has an enormous performance impact. These problems can be eliminated by using a dedicated log device ("slog"), even one with the same performance characteristics as the normal devices. This change sets aside one metaslab from each top-level vdev that is preferentially used for ZIL allocations (vdev_log_mg, spa_embedded_log_class). From an allocation perspective, this is similar to having a dedicated log device, and it eliminates the above-mentioned performance problems. Log (ZIL) blocks can be allocated from the following locations. Each one is tried in order until the allocation succeeds: 1. dedicated log vdevs, aka "slog" (spa_log_class) 2. embedded slog metaslabs (spa_embedded_log_class) 3. other metaslabs in normal vdevs (spa_normal_class) The space required for the embedded slog metaslabs is usually between 0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop" space that is not available for user data. On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity, and recordsize=8k, testing shows a ~50% performance increase on random 8k sync writes. On even more fragmented systems (which hit problem #3 above and call txg_wait_synced()), the performance improvement can be arbitrarily large (>100x). Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11389	2021-01-21 15:12:54 -08:00
Brian Atkinson	d0cd9a5cc6	Extending FreeBSD UIO Struct In FreeBSD the struct uio was just a typedef to uio_t. In order to extend this struct, outside of the definition for the struct uio, the struct uio has been embedded inside of a uio_t struct. Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear this is a ZFS interface. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11438	2021-01-20 21:27:30 -08:00
Matthew Ahrens	e2af2acce3	allow callers to allocate and provide the abd_t struct The `abd_get_offset_()` routines create an abd_t that references another abd_t, and doesn't allocate any pages/buffers of its own. In some workloads, these routines may be called frequently, to create many abd_t's representing small pieces of a single large abd_t. In particular, the upcoming RAIDZ Expansion project makes heavy use of these routines. This commit adds the ability for the caller to allocate and provide the abd_t struct to a variant of `abd_get_offset_()`. This eliminates the cost of allocating the abd_t and performing the accounting associated with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses this for the `rc_abd`, which references the zio's abd. The upcoming RAIDZ Expansion project will leverage this infrastructure to increase performance of reads post-expansion by around 50%. Additionally, some of the interfaces around creating and destroying abd_t's are cleaned up. Most significantly, the distinction between `abd_put()` and `abd_free()` is eliminated; all types of abd_t's are now disposed of with `abd_free()`. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #8853 Closes #11439	2021-01-20 11:24:37 -08:00
Matthew Ahrens	2ac90457f5	record ioctl elapsed time in zpool history Each zfs ioctl that changes on-disk state (e.g. set property, create snapshot, destroy filesystem) is recorded in the zpool history, and is printed by `zpool history -i`. For performance diagnostic purposes, it would be useful to know how long each of these ioctls took to run. This commit adds that functionality, with a new `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist. Additionally, the time recorded in this history log is currently the time that the history record is written to disk. But in many cases (CLI args logging and ioctl logging), this happens asynchronously, potentially many seconds after the operation completed. This commit changes the timestamp to reflect when the history event was created, rather than when it was written to disk. Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11440	2021-01-11 09:29:25 -08:00
Toomas Soome	40ab927ae8	implicit conversion from 'boolean_t' to 'ds_hold_flags_t' Build error on illumos with gcc 10 did reveal: In function 'dmu_objset_refresh_ownership': ../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'} [-Werror=enum-conversion] 857 \| dsl_dataset_disown(ds, decrypt, tag); \| ^~~~~~~ cc1: all warnings being treated as errors libzfs_input_check.c: In function 'zfs_ioc_input_tests': libzfs_input_check.c:754:28: error: implicit conversion from 'enum dmu_objset_type' to 'enum lzc_dataset_type' [-Werror=enum-conversion] 754 \| err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0); \| ^~~~~~~~~~~ cc1: all warnings being treated as errors The same issue is present in openzfs, and also the same issue about ds_hold_flags_t, which currently defines exactly one valid value. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #11406	2020-12-27 16:31:02 -08:00
Matthew Ahrens	be5c6d9653	Only examine best metaslabs on each vdev On a system with very high fragmentation, we may need to do lots of gang allocations (e.g. most indirect block allocations (~50KB) may need to gang). Before failing a "normal" allocation and resorting to ganging, we try every metaslab. This has the impact of loading every metaslab (not a huge deal since we now typically keep all metaslabs loaded), and also iterating over every metaslab for every failing allocation. If there are many metaslabs (more than the typical ~200, e.g. due to vdev expansion or very large vdevs), the CPU cost of this iteration can be very impactful. This iteration is done with the mg_lock held, creating long hold times and high lock contention for concurrent allocations, ultimately causing long txg sync times and poor application performance. To address this, this commit changes the behavior of "normal" (not try_hard, not ZIL) allocations. These will now only examine the 100 best metaslabs (as determined by their ms_weight). If none of these have a large enough free segment, then the allocation will fail and we'll fall back on ganging. To accomplish this, we will now (normally) gang before doing a `try_hard` allocation. Non-try_hard allocations will only examine the 100 best metaslabs of each vdev. In summary, we will first try normal allocation. If that fails then we will do a gang allocation. If that fails then we will do a "try hard" gang allocation. If that fails then we will have a multi-layer gang block. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11327	2020-12-16 14:40:05 -08:00
Alexander Motin	f8020c9363	Make metaslab class rotor and aliquot per-allocator. Metaslab rotor and aliquot are used to distribute workload between vdevs while keeping some locality for logically adjacent blocks. Once multiple allocators were introduced to separate allocation of different objects it does not make much sense for different allocators to write into different metaslabs of the same metaslab group (vdev) same time, competing for its resources. This change makes each allocator choose metaslab group independently, colliding with others only sporadically. Test including simultaneous write into 4 files with recordsize of 4KB on a striped pool of 30 disks on a system with 40 logical cores show reduction of vdev queue lock contention from 54 to 27% due to better load distribution. Unfortunately it won't help much ZVOLs yet since only one dataset/ZVOL is synced at a time, and so for the most part only one allocator is used, but it may improve later. While there, to reduce the number of pointer dereferences change per-allocator storage for metaslab classes and groups from several separate malloc()'s to variable length arrays at the ends of the original class and group structures. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #11288	2020-12-15 10:55:44 -08:00
Ryan Libby	956f94010f	spa: avoid type narrowing warning Building the spa module for i386 caused gcc to emit -Wint-to-pointer-cast "cast to pointer from integer of different size" because spa.spa_did was uint64_t but pthread_join (via thread_join in spa_deactivate) takes a pointer (32-bit on i386). Define spa_did to be pointer-size instead. For now spa_did is in fact never non-zero and the thread_join could instead be ifdef'd out, but changing the size of spa_did may be more useful for the future. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Libby <rlibby@FreeBSD.org> Closes #11336	2020-12-15 09:20:06 -08:00
Matthew Macy	923d730329	dmu_zfetch: fix memory leak The last change caused the read completion callback to not be called if the IO was still in progress. This change restores allocation of the arc buf callback, but in the callback path checks the new acb_nobuf field to know to skip buffer allocation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11324	2020-12-12 16:00:00 -08:00
Matthew Ahrens	ba67d82142	Improve zfs receive performance with lightweight write The performance of `zfs receive` can be bottlenecked on the CPU consumed by the `receive_writer` thread, especially when receiving streams with small compressed block sizes. Much of the CPU is spent creating and destroying dbuf's and arc buf's, one for each `WRITE` record in the send stream. This commit introduces the concept of "lightweight writes", which allows `zfs receive` to write to the DMU by providing an ABD, and instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this "dirty leaf block" are not instantiated. Because there is no dbuf with the dirty data, this mechanism doesn't support reading from "lightweight-dirty" blocks (they would see the on-disk state rather than the dirty data). Since the dedup-receive code has been removed, `zfs receive` is write-only, so this works fine. Because there are no arc bufs for the received data, the received data is no longer cached in the ARC. Testing a receive of a stream with average compressed block size of 4KB, this commit improves performance by 50%, while also reducing CPU usage by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was. Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35% New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0% The code is also restructured in a few ways: Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies some existing code that no longer needs `DB_DNODE_ENTER()` and related routines. The new field is needed by the lightweight-type dirty record. To ensure that the `dr_dnode` field remains valid until the dirty record is freed, we have to ensure that the `dnode_move()` doesn't relocate the dnode_t. To do this we keep a hold on the dnode until it's zio's have completed. This is already done by the user-accounting code (`userquota_updates_task()`), this commit extends that so that it always keeps the dnode hold until zio completion (see `dnode_rele_task()`). `dn_dirty_txg` was previously zeroed when the dnode was synced. This was not necessary, since its meaning can be "when was this dnode last dirtied". This change simplifies the new `dnode_rele_task()` code. Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11105	2020-12-11 10:26:02 -08:00
Paul Dagnelie	60a4c7d2a2	Implement memory and CPU hotplug ZFS currently doesn't react to hotplugging cpu or memory into the system in any way. This patch changes that by adding logic to the ARC that allows the system to take advantage of new memory that is added for caching purposes. It also adds logic to the taskq infrastructure to support dynamically expanding the number of threads allocated to a taskq. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #11212	2020-12-10 14:09:23 -08:00
Matthew Macy	1e4732cbda	Decouple arc_read_done callback from arc buf instantiation Add ARC_FLAG_NO_BUF to indicate that a buffer need not be instantiated. This fixes a ~20% performance regression on cached reads due to zfetch changes. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11220 Closes #11232	2020-12-09 15:05:06 -08:00
Alexander Motin	6f5aac3ca0	Reduce latency effects of non-interactive I/O Investigating influence of scrub (especially sequential) on random read latency I've noticed that on some HDDs single 4KB read may take up to 4 seconds! Deeper investigation shown that many HDDs heavily prioritize sequential reads even when those are submitted with queue depth of 1. This patch addresses the latency from two sides: - by using _min_active queue depths for non-interactive requests while the interactive request(s) are active and few requests after; - by throttling it further if no interactive requests has completed while configured amount of non-interactive did. While there, I've also modified vdev_queue_class_to_issue() to give more chances to schedule at least _min_active requests to the lowest priorities. It should reduce starvation if several non-interactive processes are running same time with some interactive and I think should make possible setting of zfs_vdev_max_active to as low as 1. I've benchmarked this change with 4KB random reads from ZVOL with 16KB block size on newly written non-fragmented pool. On fragmented pool I also saw improvements, but not so dramatic. Below are log2 histograms of the random read latency in milliseconds for different devices: 4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before: 0, 0, 2, 1, 12, 21, 19, 18, 10, 15, 17, 21 after: 0, 0, 0, 24, 101, 195, 419, 250, 47, 4, 0, 0 , that means maximum latency reduction from 2s to 500ms. 4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before: 0, 0, 2, 31, 38, 28, 18, 12, 17, 20, 24, 10, 3 after: 0, 0, 55, 247, 455, 470, 412, 181, 36, 0, 0, 0, 0 , i.e. from 4s to 250ms. 1 SAS HDD SEAGATE ST14000NM0048 before: 0, 0, 29, 70, 107, 45, 27, 1, 0, 0, 1, 4, 19 after: 1, 29, 681, 1261, 676, 1633, 67, 1, 0, 0, 0, 0, 0 , i.e. from 4s to 125ms. 1 SAS SSD SEAGATE XS3840TE70014 before (microseconds): 0, 0, 0, 0, 0, 0, 0, 0, 70, 18343, 82548, 618 after: 0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844, 90 I've also measured scrub time during the test and on idle pools. On idle fragmented pool I've measured scrub getting few percent faster due to use of QD3 instead of QD2 before. On idle non-fragmented pool I've measured no difference. On busy non-fragmented pool I've measured scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11166	2020-11-24 09:26:42 -08:00
Matthew Macy	0ca45cb310	Fix problems in zvol_set_volmode_impl - Don't leave fstrans set when passed a snapshot - Don't remove minor if volmode already matches new value - (FreeBSD) Wait for GEOM ops to complete before trying remove (at create time GEOM will be "tasting" in parallel) - (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL - (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11199	2020-11-17 09:50:52 -08:00
loli10K	4072f465bc	Fix 'zfs userspace' for received datasets in encrypted root For encrypted receives, where user accounting is initially disabled on creation, both 'zfs userspace' and 'zfs groupspace' fails with EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets to set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a successful dmu_objset_space_upgrade(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9501 Closes #9596	2020-11-16 09:10:29 -08:00
Matthew Ahrens	d66aab7c08	Assertion failure when logging large output of channel program The output of ZFS channel programs is logged on-disk in the zpool history, and printed by `zpool history -i`. Channel programs can use 10MB of memory by default, and up to 100MB by using the `zfs program -m` flag. Therefore their output can be up to some fraction of 100MB. In addition to being somewhat wasteful of the limited space reserved for the pool history (which for large pools is 1GB), in extreme cases this can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in `dmu_buf_hold_array_by_dnode()`. This commit limits the output size that will be logged to 1MB. Larger outputs will not be logged, instead a entry will be logged indicating the size of the omitted output. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11194	2020-11-14 10:17:16 -08:00
Brian Behlendorf	b2255edcc0	Distributed Spare (dRAID) Feature This patch adds a new top-level vdev type called dRAID, which stands for Distributed parity RAID. This pool configuration allows all dRAID vdevs to participate when rebuilding to a distributed hot spare device. This can substantially reduce the total time required to restore full parity to pool with a failed device. A dRAID pool can be created using the new top-level `draid` type. Like `raidz`, the desired redundancy is specified after the type: `draid[1,2,3]`. No additional information is required to create the pool and reasonable default values will be chosen based on the number of child vdevs in the dRAID vdev. zpool create <pool> draid[1,2,3] <vdevs...> Unlike raidz, additional optional dRAID configuration values can be provided as part of the draid type as colon separated values. This allows administrators to fully specify a layout for either performance or capacity reasons. The supported options include: zpool create <pool> \ draid[<parity>][:<data>d][:<children>c][:<spares>s] \ <vdevs...> - draid[parity] - Parity level (default 1) - draid[:<data>d] - Data devices per group (default 8) - draid[:<children>c] - Expected number of child vdevs - draid[:<spares>s] - Distributed hot spares (default 0) Abbreviated example `zpool status` output for a 68 disk dRAID pool with two distributed spares using special allocation classes. ``` pool: tank state: ONLINE config: NAME STATE READ WRITE CKSUM slag7 ONLINE 0 0 0 draid2:8d:68c:2s-0 ONLINE 0 0 0 L0 ONLINE 0 0 0 L1 ONLINE 0 0 0 ... U25 ONLINE 0 0 0 U26 ONLINE 0 0 0 spare-53 ONLINE 0 0 0 U27 ONLINE 0 0 0 draid2-0-0 ONLINE 0 0 0 U28 ONLINE 0 0 0 U29 ONLINE 0 0 0 ... U42 ONLINE 0 0 0 U43 ONLINE 0 0 0 special mirror-1 ONLINE 0 0 0 L5 ONLINE 0 0 0 U5 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 L6 ONLINE 0 0 0 U6 ONLINE 0 0 0 spares draid2-0-0 INUSE currently in use draid2-0-1 AVAIL ``` When adding test coverage for the new dRAID vdev type the following options were added to the ztest command. These options are leverages by zloop.sh to test a wide range of dRAID configurations. -K draid\|raidz\|random - kind of RAID to test -D <value> - dRAID data drives per group -S <value> - dRAID distributed hot spares -R <value> - RAID parity (raidz or dRAID) The zpool_create, zpool_import, redundancy, replacement and fault test groups have all been updated provide test coverage for the dRAID feature. Co-authored-by: Isaac Huang <he.huang@intel.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Co-authored-by: Don Brady <don.brady@delphix.com> Co-authored-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10102	2020-11-13 13:51:51 -08:00
Mateusz Guzik	1a0b4f566c	G/C struct znode -> z_moved The field is yet another leftover from unsupported zfs_znode_move. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11186	2020-11-10 12:42:47 -08:00
Ryan Moeller	8a9634e2f3	Remove redundant oid parameter to update_pages The oid comes from the znode we are already passing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11176	2020-11-10 10:54:30 -08:00
Brian Behlendorf	8c7d604c62	Linux 5.10 compat: frame.h renamed objtool.h In Linux 5.10 the linux/frame.h header was renamed linux/objtool.h. Add a configure check to detect and use the correctly named header. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11085	2020-11-02 22:01:10 +00:00
Christian Schwarz	ab8c935ea6	zfs_vnops: make zfs_get_data OS-independent Move zfs_get_data() in to platform-independent code. The only platform-specific aspect of it is the way we release an inode (Linux) / vnode_t (FreeBSD). I am not aware of a platform that could be supported by ZFS that couldn't implement zfs_rele_async itself. It's sibling zvol_get_data already is platform-independent. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #10979	2020-11-02 12:07:07 -08:00
Mateusz Guzik	09eb36ce3d	Introduce CPU_SEQID_UNSTABLE Current CPU_SEQID users don't care about possibly changing CPU ID, but enclose it within kpreempt disable/enable in order to fend off warnings from Linux's CONFIG_DEBUG_PREEMPT. There is no need to do it. The expected way to get CPU ID while allowing for migration is to use raw_smp_processor_id. In order to make this future-proof this patch keeps CPU_SEQID as is and introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers explicitly want this behavior. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11142	2020-11-02 11:51:12 -08:00
Matthew Macy	8583540c6e	Consolidate zfs_holey and zfs_access The zfs_holey() and zfs_access() functions can be made common to both FreeBSD and Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11125	2020-10-31 09:40:08 -07:00
Mateusz Guzik	c4ede65bdf	zstd: track allocator statistics Note that this only tracks sizes as requested by the caller. Actual allocated space will almost always be bigger (e.g., rounded up to the next power of 2 or page size). Additionally the allocated buffer may be holding other areas hostage. Nonetheless, this is a starting point for tracking memory usage in zstd. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11129	2020-10-30 15:26:10 -07:00
Matthew Macy	5fa356ea44	Remove UIO_ZEROCOPY functions structures The original xuio zero copy functionality has always been unused on Linux and FreeBSD. Remove this disabled code to avoid any confusion and improve readability. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11124	2020-10-30 10:00:33 -07:00
Ryan Moeller	76d04993a6	Update references to nonexistent man pages in code Refer to the correct section or alternative for FreeBSD and Linux. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11132	2020-10-30 08:55:59 -07:00
Matthew Macy	e53d678d4a	Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD The zfs_fsync, zfs_read, and zfs_write function are almost identical between Linux and FreeBSD. With a little refactoring they can be moved to the common code which is what is done by this commit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11078	2020-10-21 14:08:06 -07:00
Ryan Moeller	485b50bb9e	Cross-platform acltype The acltype property is currently hidden on FreeBSD and does not reflect the NFSv4 style ZFS ACLs used on the platform. This makes it difficult to observe that a pool imported from FreeBSD on Linux has a different type of ACL that is being ignored, and vice versa. Add an nfsv4 acltype and expose the property on FreeBSD. Make the default acltype nfsv4 on FreeBSD. Setting acltype to an unhanded style is treated the same as setting it to off. The ACLs will not be removed, but they will be ignored. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10520	2020-10-13 21:25:48 -07:00
Warner Losh	b302185a92	FreeBSD: make adjustments for the standalone environment In FreeBSD, there are three compile environments that are supported: user land, the kernel and the bootloader / standalone. Adjust the headers to compile in the standalone environment. Limit kernel-only items from view when _STANDALONE is defined. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Warner Losh <imp@FreeBSD.org> Closes #10998	2020-10-13 21:05:49 -07:00
Christian Schwarz	701f656b97	dmu.h: remove stale declaration dmu_objset_snapshot_tmp Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11047	2020-10-13 16:46:00 -07:00
Christian Schwarz	61868bb14d	zil_parse: make callback parameters const Code cleanup, a follow up commit to `4d55ea81`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Ryan Moeller <ryan@freqlabs.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11020	2020-10-09 09:34:54 -07:00
Brian Behlendorf	d0249a4bd0	Replace ZFS on Linux references with OpenZFS This change updates the documentation to refer to the project as OpenZFS instead ZFS on Linux. Web links have been updated to refer to https://github.com/openzfs/zfs. The extraneous zfsonlinux.org web links in the ZED and SPL sources have been dropped. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11007	2020-10-08 20:10:13 -07:00
Ryan Moeller	4d55ea811d	Throw const on some strings In C, const indicates to the reader that mutation will not occur. It can also serve as a hint about ownership. Add const in a few places where it makes sense. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10997	2020-10-02 17:44:10 -07:00
Sebastian Gottschall	8a171ccd92	do a cyclic seek for unused memory objects in pool In non regular use cases allocated memory might stay persistent in memory pool. This small patch checks every minute if there are old objects which can be released from memory pool. Right now with regular use, the pool is checked for old objects on each allocation attempt from this pool. so basically polling by its use. Now consider what happens if someone writes a lot of files and stops use of the volume or even unmounts it. So the code will no longer check if objects can be released from the pool. Already allocated objects will still stay in pool cache. this is no big issue for common use. But someone discovered this issue while doing tests. personally i know this behavior and I'm aware of it. Its no big issue. just a enhancement Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Closes #10938 Closes #10969	2020-09-30 13:22:34 -07:00
Brian Behlendorf	96951e0327	Fix objtool configure check The m4 objtool configure check can incorrectly fail because of a missing header in the test. This appears to be the result of a recent kernel change and was observed on the Fedora 5.8.11-200 kernel. In file included from /home/fedora/zfs/build/objtool/objtool.c:75: ./arch/x86/include/asm/frame.h💯57: error: 'struct pt_regs' declared inside parameter list will not be visible outside of this definition or declaration [-Werror] The consequence of this is that the "stack_frame_non_standard" check is never run and HAVE_STACK_FRAME_NON_STANDARD is set incorrectly which results in a build failure. This change adds the appropriate header to the "objtool" check so it now behaves as intended. Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10990	2020-09-28 16:40:50 -07:00
Matthew Macy	af20b97078	zfetch: Don't issue new streams when old have not completed The current dmu_zfetch code implicitly assumes that I/Os complete within min_sec_reap seconds. With async dmu and a readonly workload (and thus no exponential backoff in operations from the "write throttle") such as L2ARC rebuild it is possible to saturate the drives with I/O requests. These are then effectively compounded with prefetch requests. This change reference counts streams and prevents them from being recycled after their min_sec_reap timeout if they still have outstanding I/Os. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10900	2020-09-27 17:08:38 -07:00
Matthew Macy	7b8363d7f0	FreeBSD: Add support for procfs_list The procfs_list interface is required by several kstats. Implement this functionality for FreeBSD to provide access to these kstats. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10890	2020-09-23 16:43:51 -07:00
Ryan Moeller	5f8a9e6a02	FreeBSD: Reduce stack usage of Lua Use the same reduced buffer size for lauxlib that is used on Linux. Fixes panic on HEAD in lua gsub test designed to exhaust stack space. With this we can remove the special case to reserve more stack space on FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kyle Evans <kevans@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10959	2020-09-22 16:03:11 -07:00
George Wilson	c494aa7f57	vdev_ashift should only be set once == Motivation and Context The new vdev ashift optimization prevents the removal of devices when a zfs configuration is comprised of disks which have different logical and physical block sizes. This is caused because we set 'spa_min_ashift' in vdev_open and then later call 'vdev_ashift_optimize'. This would result in an inconsistency between spa's ashift calculations and that of the top-level vdev. In addition, the optimization logical ignores the overridden ashift value that would be provided by '-o ashift=<val>'. == Description This change reworks the vdev ashift optimization so that it's only set the first time the device is configured. It still allows the physical and logical ahsift values to be set every time the device is opened but those values are only consulted on first open. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-Issue: DLPX-71831 Closes #10932	2020-09-18 12:13:47 -07:00
Ryan Moeller	7ead2be3d2	Rename acltype=posixacl to acltype=posix Prefer acltype=off\|posix, retaining the old names as aliases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10918	2020-09-16 12:26:06 -07:00
Toomas Soome	1db9e6e4e4	zfs label bootenv should store data as nvlist nvlist does allow us to support different data types and systems. To encapsulate user data to/from nvlist, the libzfsbootenv library is provided. Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10774	2020-09-15 15:42:27 -07:00
George Amanakis	085321621e	Add L2ARC arcstats for MFU/MRU buffers and buffer content type Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their content type is unknown. Knowing this information may prove beneficial in adjusting the L2ARC caching policy. This commit adds L2ARC arcstats that display the aligned size (in bytes) of L2ARC buffers according to their content type (data/metadata) and according to their ARC state (MRU/MFU or prefetch). It also expands the existing evict_l2_eligible arcstat to differentiate between MFU and MRU buffers. L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a buffer, its ARC state (MRU/MFU) is stored in the L2 header (b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size (in bytes) of L2ARC buffers according to their ARC state (based on b_arcs_state). We also account for the case where an L2ARC and ARC cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize reflects the alinged size (in bytes) of L2ARC buffers that were cached while they had the prefetch flag set in ARC. This is dynamically updated as the prefetch flag of L2ARC buffers changes. When buffers are evicted from ARC, if they are determined to be L2ARC eligible then their logical size is recorded in evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon eviction. Persistent L2ARC: When committing an L2ARC buffer to a log block (L2ARC metadata) its b_arcs_state and prefetch flag is also stored. If the buffer changes its arcstate or prefetch flag this is reflected in the above arcstats. However, the L2ARC metadata cannot currently be updated to reflect this change. Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count this as an MRU buffer. The buffer transitions to MFU. The arcstats are updated to reflect this. Upon pool re-import or on/offlining the L2ARC device the arcstats are cleared and the buffer will now be counted as an MRU buffer, as the L2ARC metadata were not updated. Bug fix: - If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of an ARC buffer. However, prefetches may be issued in a way that arc_read_done() is bypassed. Instead, move the related code in l2arc_write_eligible() to account for those cases too. Also add a test and update manpages for l2arc_mfuonly module parameter, and update the manpages and code comments for l2arc_noprefetch. Move persist_l2arc tests to l2arc. Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10743	2020-09-14 10:10:44 -07:00
Ryan Moeller	ebc4b52369	Avoid possibility of division by zero When hz > 1000, msec / (1000 / hz) results in division by zero. I found somewhere in FreeBSD using howmany(msec * hz, 1000) to convert ms to ticks, avoiding the potential for a zero in the divisor. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10894	2020-09-08 11:39:16 -07:00
Don Brady	4f07282786	Avoid posting duplicate zpool events Duplicate io and checksum ereport events can misrepresent that things are worse than they seem. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated in a simple example. With a single bad block in a datafile and just 5 reads of the file we end up with a degraded vdev, even though there is only one unique error in the pool. The proposed solution to the above issue, is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error. Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #10861	2020-09-04 10:34:28 -07:00
Matthew Ahrens	3808032489	nowait synctask must succeed If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with `dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC. However, there is no way to notice or communicate this failure, so it's extremely difficult to use this functionality correctly, and in fact almost all callers use `ZFS_SPACE_CHECK_NONE`. This commit removes the `zfs_space_check_t` argument from `dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10855	2020-09-04 10:29:39 -07:00

1 2 3 4 5 ...

1040 Commits