Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Ameer Hamza	ca3a675c74	zed: Prevent special vdev to be replaced by hot spare Special vdevs should not be replaced by a hot spare. Log vdevs already support this, extending the functionality for special vdevs. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14129	2022-11-07 13:36:57 -08:00
Attila Fülöp	cd1f023846	Deny receiving into encrypted datasets if the keys are not loaded (#14139 ) Commit `68ddc06b61` introduced support for receiving unencrypted datasets as children of encrypted ones but unfortunately got the logic upside down. This resulted in failing to deny receives of incremental sends into encrypted datasets without their keys loaded. If receiving a filesystem, the receive was done into a newly created unencrypted child dataset of the target. In case of volumes the receive made the target volume undeletable since a dataset was created below it, which we obviously can't handle. Incremental streams with embedded blocks are affected as well. We fix the broken logic to properly deny receives in such cases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13598 Closes #14055 Closes #14119	2022-11-04 11:07:29 -07:00
Ryan Moeller	b27c7a1457	zil: Relax assertion in zil_parse Rather than panic debug builds when we fail to parse a whole ZIL, let's instead improve the logging of errors and continue like in a release build. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #14116	2022-11-01 12:49:14 -07:00
Mariusz Zaborski	186e39f336	quota: extend quota for dataset This patch relax the quota limitation for dataset by around 3%. What this means is that user can write more data then the quota is set to. However thanks to that we can get more stable bandwidth, in case when we are overwriting data in-place, and not consuming any additional space. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org> Sponsored-by: Zededa Inc. Sponsored-by: Klara Inc. Closes #13839	2022-11-01 12:48:37 -07:00
shodanshok	1d2b0563f7	Fix ARC target collapse when zfs_arc_meta_limit_percent=100 Reclaim metadata when arc_available_memory < 0 even if meta_used is not bigger than arc_meta_limit. As described in https://github.com/openzfs/zfs/issues/14054 if zfs_arc_meta_limit_percent=100 then ARC target can collapse to arc_min due to arc_purge not freeing any metadata. This patch lets arc_prune to do its work when arc_available_memory is negative even if meta_used is not bigger than arc_meta_limit, avoiding ARC target collapse. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #14054 Closes #14093	2022-11-01 12:48:30 -07:00
vaclavskala	8929355b4c	Propagate extent_bytes change to autotrim thread The autotrim thread only reads zfs_trim_extent_bytes_min and zfs_trim_extent_bytes_max variable only on thread start. We should check for parameter changes during thread execution to allow parameter changes take effect without needing to disable then restart the autotrim. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Václav Skála <skala@vshosting.cz> Closes #14077	2022-11-01 12:48:23 -07:00
Coleman Kane	212ba9bd97	Linux 6.1 compat: change order of sys/mutex.h includes After Linux 6.1-rc1 came out, the build started failing to build a couple of the files in the linux spl code due to the mutex_init redefinition. Moving the sys/mutex.h include to a lower position within these two files appears to fix the problem. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14040	2022-11-01 12:44:56 -07:00
Christian Schwarz	df000276b8	zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we leave zfsvfs != NULL. This will lead to execution of the error handling `if` statement at the `out` label, and hence to a call to dmu_objset_disown and zfsvfs_free. However, zfs_umount, which we call upon failure of zfs_root and d_make_root already does dmu_objset_disown and zfsvfs_free. I suppose this patch rather adds to the brittleness of this part of the code base, but I don't want to invest more time in this right now. To add a regression test, we'd need some kind of fault injection facility for zfs_root or d_make_root, which doesn't exist right now. And even then, I think that regression test would be too closely tied to the implementation. To repro the double-disown / double-free, do the following: 1. patch zfs_root to always return an error 2. mount a ZFS filesystem Here's the stack trace you would see then: VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000) PANIC at dsl_dataset.c:1003:dsl_dataset_disown() Showing stack for process 28332 CPU: 2 PID: 28332 Comm: zpool Tainted: G O 5.10.103-1.nutanix.el7.x86_64 #1 Call Trace: dump_stack+0x74/0x92 spl_dumpstack+0x29/0x2b [spl] spl_panic+0xd4/0xfc [spl] dsl_dataset_disown+0xe9/0x150 [zfs] dmu_objset_disown+0xd6/0x150 [zfs] zfs_domount+0x17b/0x4b0 [zfs] zpl_mount+0x174/0x220 [zfs] legacy_get_tree+0x2b/0x50 vfs_get_tree+0x2a/0xc0 path_mount+0x2fa/0xa70 do_mount+0x7c/0xa0 __x64_sys_mount+0x8b/0xe0 do_syscall_64+0x38/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Christian Schwarz <christian.schwarz@nutanix.com> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #14025	2022-11-01 12:42:32 -07:00
Serapheim Dimitropoulos	37d5a3e04b	Stop ganging due to past vdev write errors = Problem While examining a customer's system we noticed unreasonable space usage from a few snapshots due to gang blocks. Under some further analysis we discovered that the pool would create gang blocks because all its disks had non-zero write error counts and they'd be skipped for normal metaslab allocations due to the following if-clause in `metaslab_alloc_dva()`: ``` /* * Avoid writing single-copy data to a failing, * non-redundant vdev, unless we've already tried all * other vdevs. */ if ((vd->vdev_stat.vs_write_errors > 0 \|\| vd->vdev_state < VDEV_STATE_HEALTHY) && d == 0 && !try_hard && vd->vdev_children == 0) { metaslab_trace_add(zal, mg, NULL, psize, d, TRACE_VDEV_ERROR, allocator); goto next; } ``` = Proposed Solution Get rid of the predicate in the if-clause that checks the past write errors of the selected vdev. We still try to allocate from HEALTHY vdevs anyway by checking vdev_state so the past write errors doesn't seem to help us (quite the opposite - it can cause issues in long-lived pools like the one from our customer). = Testing I first created a pool with 3 vdevs: ``` $ zpool list -v volpool NAME SIZE ALLOC FREE volpool 22.5G 117M 22.4G xvdb 7.99G 40.2M 7.46G xvdc 7.99G 39.1M 7.46G xvdd 7.99G 37.8M 7.46G ``` And used `zinject` like so with each one of them: ``` $ sudo zinject -d xvdb -e io -T write -f 0.1 volpool ``` And got the vdevs to the following state: ``` $ zpool status volpool pool: volpool state: ONLINE status: One or more devices has experienced an unrecoverable error. ...<cropped>.. action: Determine if the device needs to be replaced, and clear the ...<cropped>.. config: NAME STATE READ WRITE CKSUM volpool ONLINE 0 0 0 xvdb ONLINE 0 1 0 xvdc ONLINE 0 1 0 xvdd ONLINE 0 4 0 ``` I also double-checked their write error counters with sdb: ``` sdb> spa volpool \| vdev \| member vdev_stat.vs_write_errors (uint64_t)0 # <---- this is the root vdev (uint64_t)2 (uint64_t)1 (uint64_t)1 ``` Then I checked that I the problem was reproduced in my VM as I the gang count was growing in zdb as I was writting more data: ``` $ sudo zdb volpool \| grep gang ganged count: 1384 $ sudo zdb volpool \| grep gang ganged count: 1393 $ sudo zdb volpool \| grep gang ganged count: 1402 $ sudo zdb volpool \| grep gang ganged count: 1414 ``` Then I updated my bits with this patch and the gang count stayed the same. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #14003	2022-11-01 12:36:25 -07:00
Serapheim Dimitropoulos	37763ea2a6	Fix panic in dsl_process_sub_livelist for EINTR = Issue Recently we hit an assertion panic in `dsl_process_sub_livelist` while exporting the spa and interrupting `bpobj_iterate_nofree`. In that case `bpobj_iterate_nofree` stops mid-way returning an EINTR without clearing the intermediate AVL tree that keeps track of the livelist entries it has encountered so far. At that point the code has a VERIFY for the number of elements of the AVL expecting it to be zero (which is not the case for EINTR). = Fix Cleanup any intermediate state before destroying the AVL when encountering EINTR. Also added a comment documenting the scenario where the EINTR comes up. There is no need to do anything else for the calles of `dsl_process_sub_livelist` as they already handle the EINTR case. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #13939	2022-11-01 12:34:08 -07:00
Mateusz Guzik	c8d6a91a99	Bring per_txg_dirty_frees_percent back to 30 The current value causes significant artificial slowdown during mass parallel file removal, which can be observed both on FreeBSD and Linux when running real workloads. Sample results from Linux doing make -j 96 clean after an allyesconfig modules build: before: 4.14s user 6.79s system 48% cpu 22.631 total after: 4.17s user 6.44s system 153% cpu 6.927 total FreeBSD results in the ticket. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13932 Closes #13938	2022-11-01 12:32:40 -07:00
Akash B	7ac732b8d6	Add options to zfs redundant_metadata property Currently, additional/extra copies are created for metadata in addition to the redundancy provided by the pool(mirror/raidz/draid), due to this 2 times more space is utilized per inode and this decreases the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, hence can reduce the space consumed by the additional metadata copies and increase the total number of inodes that can be created in the filesystem. Additionally, this can improve file create performance due to the reduced amount of metadata which needs to be written. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #13680	2022-11-01 12:25:58 -07:00
Mark Johnston	4e3fecbdfd	FreeBSD: Fix a pair of bugs in zfs_fhtovp() - Add a zfs_exit() call in an error path, otherwise a lock is leaked. - Remove the fid_gen > 1 check. That appears to be Linux-specific: zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the snapshot directory is mounted. On FreeBSD it fails, making snapshot dirs inaccessible via NFS. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Fixes: `43dbf88178` ("FreeBSD: vfsops: use setgen for error case") Closes #14001 Closes #13974 (cherry picked from commit `ed566bf1cd`)	2022-10-26 14:59:25 -07:00
samwyc	fc1c0053f9	Fix sequential resilver drive failure race condition This patch handles the race condition on simultaneous failure of 2 drives, which misses the vdev_rebuild_reset_wanted signal in vdev_rebuild_thread. We retry to catch this inside the vdev_rebuild_complete_sync function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Samuel Wycliffe J <samwyc@hpe.com> Closes #14041 Closes #14050	2022-10-21 14:05:06 -07:00
Richard Yao	b0bc882395	kcfpool_alloc() should have its argument list marked void This error occurred when building on Gentoo with debugging enabled: zfs-kmod-2.1.6/work/zfs-2.1.6/module/icp/core/kcf_sched.c:1277:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] kcfpool_alloc() ^ void 1 error generated. This function is not present in master. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14023	2022-10-12 15:47:39 -07:00
Richard Yao	566e908fa0	Fix bad free in skein code Clang's static analyzer found a bad free caused by skein_mac_atomic(). It will allocate a context on the stack and then pass it to skein_final(), which attempts to free it. Upon inspection, skein_digest_atomic() also has the same problem. These functions were created to match the OpenSolaris ICP API, so I was curious how we avoided this in other providers and looked at the SHA2 code. It appears that SHA2 has a SHA2Final() helper function that is called by the exported sha2_mac_final()/sha2_digest_final() as well as the sha2_mac_atomic() and sha2_digest_atomic() functions. The real work is done in SHA2Final() while some checks and the free are done in sha2_mac_final()/sha2_digest_final(). We fix the use after free in the skein code by taking inspiration from the SHA2 code. We introduce a skein_final_nofree() that does most of the work, and make skein_final() into a function that calls it and then frees the memory. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13954	2022-09-28 17:25:10 -07:00
Mateusz Guzik	63d4838b4a	FreeBSD: handle V_PCATCH See https://cgit.FreeBSD.org/src/commit/?id=a75d1ddd74312f5dd79bc1e965f7077679659f2e Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13910	2022-09-28 10:35:13 -07:00
Mateusz Guzik	eec942cc54	FreeBSD: catch up to 1400068 Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13909	2022-09-28 10:35:13 -07:00
Mateusz Guzik	2c8e3e4b28	FreeBSD: stop passing LK_INTERLOCK to VOP_LOCK There is an ongoing effort to eliminate this feature. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13908	2022-09-28 10:35:13 -07:00
Richard Yao	55816c64da	FreeBSD: Fix integer conversion for vnlru_free{,_vfsops}() When reviewing #13875, I noticed that our FreeBSD code has an issue where it converts from `int64_t` to `int` when calling `vnlru_free{,_vfsops}()`. The result is that if the int64_t is `1 << 36`, the int will be 0, since the low bits are 0. Even when some low bits are set, a value such as `((1 << 36) + 1)` would truncate to 1, which is wrong. There is protection against this on 32-bit platforms, but on 64-bit platforms, there is no check to protect us, so we add a check. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13882	2022-09-28 10:35:13 -07:00
Ryan Moeller	8dcd6af623	FreeBSD: Ignore symlink to i386 includes A symlink to i386 includes is created in the build dir on amd64 since freebsd/freebsd-src@d07600c563 Tell git to ignore it like the other include links. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #13719	2022-09-28 10:35:13 -07:00
Richard Yao	c973929b29	LUA: Fix CVE-2014-5461 Apply the fix from upstream. http://www.lua.org/bugs.html#5.2.2-1 https://www.opencve.io/cve/CVE-2014-5461 It should be noted that exploiting this requires the `SYS_CONFIG` privilege, and anyone with that privilege likely has other opportunities to do exploits, so it is unlikely that bad actors could exploit this unless system administrators are executing untrusted ZFS Channel Programs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13949	2022-09-27 16:49:02 -07:00
Richard Yao	835e03682c	Linux: Fix uninitialized variable usage in zio_do_crypt_data() Coverity complained about this. An error from `hkdf_sha512()` before uio initialization will cause pointers to uninitialized memory to be passed to `zio_crypt_destroy_uio()`. This is a regression that was introduced by `cf63739191`. Interestingly, this never affected FreeBSD, since the FreeBSD version never had that patch ported. Since moving uio initialization to the top of this function would slow down the qat_crypt() path, we only move the `memset()` calls to the top of the function. This is sufficient to fix this problem. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13944	2022-09-27 15:43:26 -07:00
Alexander Motin	33223cbc3c	Refactor Log Size Limit Original Log Size Limit implementation blocked all writes in case of limit reached until the TXG is committed and the log is freed. It caused huge delays and following speed spikes in application writes. This implementation instead smoothly throttles writes, using exactly the same mechanism as used for dirty data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: jxdking <lostking2008@hotmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Issue #12284 Closes #13476	2022-09-26 14:55:27 -07:00
Brian Behlendorf	91e02156dd	Revert "Reduce dbuf_find() lock contention" This reverts commit `34dbc618f5`. While this change resolved the lock contention observed for certain workloads, it inadventantly reduced the maximum hash inserts/removes per second. This appears to be due to the slightly higher acquisition cost of a rwlock vs a mutex. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2022-09-21 13:15:51 -07:00
Richard Yao	b66f8d3c2b	Add zfs_btree_verify_intensity kernel module parameter I see a few issues in the issue tracker that might be aided by being able to turn this on. We have no module parameter for it, so I would like to add one. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13874	2022-09-21 13:15:51 -07:00
Richard Yao	5096ed31c8	Fix incorrect size given to bqueue_enqueue() call in dmu_redact.c We pass sizeof (struct redact_record *) rather than sizeof (struct redact_record). Passing the pointer size is wrong. Coverity caught this in two places. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13885	2022-09-21 13:15:51 -07:00
Ameer Hamza	035e52f591	Delay ZFS_PROP_SHARESMB property to handle it for encrypted raw receive For encrypted raw receive, objset creation is delayed until a call to dmu_recv_stream(). ZFS_PROP_SHARESMB property requires objset to be populated when calling zpl_earlier_version(). To correctly handle the ZFS_PROP_SHARESMB property for encrypted raw receive, this change delays setting the property. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13878	2022-09-21 13:15:26 -07:00
Alexander Motin	44cec45f72	Improve too large physical ashift handling When iterating through children physical ashifts for vdev, prefer ones above the maximum logical ashift, that we can actually use, but within the administrator defined maximum. When selecting top-level vdev ashift, do not set it to the defined maximum in case physical ashift is even higher, but just ignore one. Using the maximum does not prevent misaligned writes, but reduces space efficiency. Since ZFS tries to write data sequentially and aggregates the writes, in many cases large misanigned writes may be not as bad as the space penalty otherwise. Allow internal physical ashifts for vdevs higher than SHIFT_MAX. May be one day allocator or aggregation could benefit from that. Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB), so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB), but only if it really has to or explicitly told to, but not as an "optimization". There are some read-intensive NVMe SSDs that report Preferred Write Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a space inefficiency that can't be justified. Instead these changes make ZFS fall back to logical ashift of 12 (4KB) by default and only warn user that it may be suboptimal for performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #13798	2022-09-21 13:15:15 -07:00
Kevin Jin	d05f3039f7	Add Module Parameter Regarding Log Size Limit zfs_wrlog_data_max The upper limit of TX_WRITE log data. Once it is reached, write operation is blocked, until log data is cleared out after txg sync. It only counts TX_WRITE log with WR_COPIED or WR_NEED_COPY. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12284	2022-09-21 16:12:14 -07:00
Kevin Jin	999830a021	Optimize txg_kick() process (#12274 ) Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12274	2022-09-21 16:12:14 -07:00
Ameer Hamza	a5b0d42540	zfs recv hangs if max recordsize is less than received recordsize - Some optimizations for bqueue enqueue/dequeue. - Added a fix to prevent deadlock when both bqueue_enqueue_impl() and bqueue_dequeue() waits for signal to be triggered. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13855	2022-09-19 09:39:07 -07:00
Richard Yao	3f7c174b50	vdev_draid_lookup_map() should not iterate outside draid_maps Coverity reported this as an out-of-bounds read. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13865	2022-09-15 16:58:35 -07:00
Akash B	03fa3ef264	Add physical device size to SIZE column in 'zpool list -v' Add physical device size/capacity only for physical devices in 'zpool list -v' instead of displaying "-" in the SIZE column. This would make it easier to see the individual device capacity and to determine which spares are large enough to replace which devices. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #12561 Closes #13106	2022-09-15 10:23:01 -07:00
George Amanakis	8bd3dca9bf	Introduce a tunable to exclude special class buffers from L2ARC Special allocation class or dedup vdevs may have roughly the same performance as L2ARC vdevs. Introduce a new tunable to exclude those buffers from being cacheable on L2ARC. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11761 Closes #12285	2022-09-14 11:27:00 -07:00
Alexander Motin	b6ebf270eb	Apply arc_shrink_shift to ARC above arc_c_min It makes sense to free memory in smaller chunks when approaching arc_c_min to let other kernel subsystems to free more, since after that point we can't free anything. This also matches behavior on Linux, where to shrinker reported only the size above arc_c_min. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #13794	2022-09-13 17:59:10 -07:00
Richard Yao	8131a96544	Fix use-after-free in btree code Coverty static analysis found these. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #10989 Closes #13861	2022-09-13 16:15:38 -07:00
Brian Behlendorf	4063d7b6b4	Linux 5.20 compat: blk_cleanup_disk() As of the Linux 5.20 kernel blk_cleanup_disk() has been removed, all callers should use put_disk(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13728	2022-08-09 09:41:06 -07:00
Brian Behlendorf	58571ba447	Linux 5.20 compat: bdevname() As of the Linux 5.20 kernel bdevname() has been removed, all callers should use snprintf() and the "%pg" format specifier. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13728	2022-08-09 09:41:06 -07:00
Rich Ercolani	035ee628cf	Revert behavior of `59eab109` on not-Linux It turns out that short-circuiting the EFAULT behavior on a short read breaks things on FreeBSD. So until there's a nicer solution, let's just revert the behavior for not-Linux. Reference: https://reviews.freebsd.org/R10:70f51f0e474ffe1fb74cb427423a2fba3637544d Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12698	2022-08-02 10:05:14 -07:00
Rich Ercolani	5c56591b57	Handle partial reads in zfs_read Currently, dmu_read_uio_dnode can read 64K of a requested 1M in one loop, get EFAULT back from zfs_uiomove() (because the iovec only holds 64k), and return EFAULT, which turns into EAGAIN on the way out. EAGAIN gets interpreted as "I didn't read anything", the caller tries again without consuming the 64k we already read, and we're stuck. This apparently works on newer kernels because the caller which breaks on older Linux kernels by happily passing along a 1M read request and a 64k iovec just requests 64k at a time. With this, we now won't return EFAULT if we got a partial read. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12370 Closes #12509 Closes #12516	2022-08-02 10:05:14 -07:00
наб	17512aba0c	module: lua: ldo: fix pragma name /home/nabijaczleweli/store/code/zfs/module/lua/ldo.c:175:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas] 175 \| #pragma GCC diagnostic ignored "-Winfinite-recursion"a \| ^~~~~~~~~~~~~~~~~~~~~~ Fixes: `a6e8113fed` ("Silence -Winfinite-recursion warning in luaD_throw()") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-07-28 14:17:38 -07:00
Brian Behlendorf	69ad0bd769	Fix objtool: missing int3 after ret warning Resolve straight-line speculation warnings reported by objtool for x86_64 assembly on Linux when CONFIG_SLS is set. See the following LWN article for the complete details. https://lwn.net/Articles/877845/ Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
Attila Fülöp	b9d862f2db	ICP: Add missing stack frame info to SHA asm files Since the assembly routines calculating SHA checksums don't use a standard stack layout, CFI directives are needed to unroll the stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #11733	2022-07-27 13:38:56 -07:00
Brian Behlendorf	60f2cfd24f	Fix -Wuse-after-free warning in dbuf_destroy() Move the use of the db pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no reason we can't invert the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_destroy': module/zfs/dbuf.c:2953:17: error: pointer 'db' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
Brian Behlendorf	6a81173026	Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done() Move the use of the private pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no harm in inverting the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done': module/zfs/dbuf.c:3204:17: error: pointer 'private' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
Brian Behlendorf	087f5dedd5	Fix -Wattribute-warning in dsl layer The memcpy(), memmove(), and memset() functions have been annotated to perform bounds checking when using FORTIFY_SOURCE. A warning is now generted when writing beyond the end of the specified field. Alternately, the new struct_group() macro could be used to create an anonymous union member for use by memcpy(). However, since this is the only place the macro would be helpful it's preferable to restructure the code slights to avoid the need for additional compatibility code when the macro does not exist. https://lore.kernel.org/lkml/20211118183807.1283332-1-keescook@chromium.org/T/ Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
Brian Behlendorf	c771583f23	Fix -Wattribute-warning in edonr The wrong union memory was being accessed in EdonRInit resulting in a write beyond size of field compiler warning. Reference the correct member to resolve the warning. The warning was correct and this in case the mistake was harmless. In function ‘fortify_memcpy_chk’, inlined from ‘EdonRInit’ at zfs/module/icp/algs/edonr/edonr.c:494:3: ./include/linux/fortify-string.h:344:25: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
Brian Behlendorf	ef0e506f46	Fix -Wattribute-warning in zfs_log_xvattr() Restructure the code in zfs_log_xvattr() to use a lr_attr_end structure when accessing lr_attr_t elements located after the variable sized array. This makes the code more understandable and resolves the accessing beyond the end of the field warnings. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
Brian Behlendorf	d7a8c573cf	Silence -Winfinite-recursion warning in luaD_throw() This code should be kept inline with the upstream lua version as much as possible. Therefore, we simply want to silence the warning. This check was enabled by default as part of -Wall in gcc 12.1. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-07-27 13:38:56 -07:00
наб	37430e8211	libtpool: -Wno-clobbered Also remove -Wno-unused-but-set-variable Upstream-bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61118 Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13110	2022-07-27 13:38:56 -07:00
Tino Reichardt	4b0977027b	Remove sha1 hashing from OpenZFS, it's not used anywhere. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12895 Closes #12902 Signed-off-by: Rich Ercolani <rincebrain@gmail.com>	2022-07-26 10:12:44 -07:00
Alexander Motin	15868d3ecb	Fix scrub resume from newly created hole. It may happen that scan bookmark points to a block that was turned into a part of a big hole. In such case dsl_scan_visitbp() may skip it and dsl_scan_check_resume() will not be called for it. As result new scan suspend won't be possible until the end of the object, that may take hours if the object is a multi-terabyte ZVOL on a slow HDD pool, stretching TXG to all that time, creating all sorts of problems. This patch changes the resume condition to any greater or equal block, so even if we miss the bookmarked block, the next one we find will delete the bookmark, allowing new suspend. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc.	2022-07-26 10:10:37 -07:00
Alexander Motin	bbb50e6129	Avoid memory copy when verifying raidz/draid parity Before this change for every valid parity column raidz_parity_verify() allocated new buffer and copied there existing data, then recalculated the parity and compared the result with the copy. This patch removes the memory copy, simply swapping original buffer pointers with newly allocated empty ones for parity recalculation and comparison. Original buffers with potentially incorrect parity data are then just freed, while new recalculated ones are used for repair. On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this change reduces memory traffic during scrub by 17% and total unhalted CPU time by 25%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13613	2022-07-26 10:10:37 -07:00
Alexander Motin	03e33b2bb8	Avoid memory copies during mirror scrub Issuing several scrub reads for a block we may use the parent ZIO buffer for one of child ZIOs. If that read complete successfully, then we won't need to copy the data explicitly. If block has only one copy (typical for root vdev, which is also a mirror inside), then we never need to copy -- succeed or fail as-is. Previous code also copied data from buffer of every successfully completed child ZIO, but that just does not make any sense. On healthy N-wide mirror this saves all N+1 (or even more in case of ditto blocks) memory copies for each scrubbed block, allowing CPU to focus mostly on check-summing. For other vdev types it should save one memory copy per block copy at root vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13606	2022-07-26 10:10:37 -07:00
Alexander Motin	4b8f16072d	Fix and disable blocks statistics during scrub Block statistics calculation during scrub I/O issue in case of sorted scrub accounted ditto blocks several times. Embedded blocks on other side were not accounted at all. This change moves the accounting from issue to scan stage, that fixes both problems and also allows to avoid pool-wide locking and the lock contention it created. Since this statistics is quite specific and is not even exposed now anywhere, disable its calculation by default to not waste CPU time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13579	2022-07-26 10:10:37 -07:00
Alexander Motin	5e06805d8e	Avoid two 64-bit divisions per scanned block Change math to make it like the ARC, using multiplications instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13591	2022-07-26 10:10:37 -07:00
Alexander Motin	dc91a6a660	Several B-tree optimizations - Introduce first element offset within a leaf. It allows to reduce by ~50% average memmove() size when adding/removing elements. If the added/removed element is in the first half of the leaf, we may shift elements before it and adjust the bth_first instead of moving more elements after it. - Use memcpy() instead of memmove() when we know there is no overlap. - Switch from uint64_t to uint32_t. It does not limit anything, but 32-bit arches should appreciate it greatly in hot paths. - Store leaf capacity in struct btree to avoid 64-bit divisions. - Adjust zfs_btree_insert_into_leaf() to always result in balanced leaves after splitting, no matter where the new element was inserted. Not that we care about it much, but it should also allow B-trees with as little as two elements per leaf instead of 4 previously. When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this reduces amount of time spent in memmove() inside the scan thread from 13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes. It should also reduce spacemaps load time, but I haven't measured it. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13582	2022-07-26 10:10:37 -07:00
Alexander Motin	a861aa2b9e	Several sorted scrub optimizations - Reduce size and comparison complexity of q_exts_by_size B-tree. Previous code used two 64-bit divisions and many other operations to compare two B-tree elements. It created enormous overhead. This implementation moves the math to the upper level and stores the score in the B-tree elements themselves. Since all that we need to store in that B-tree is the extent score and offset, those can fit into single 8 byte value instead of 24 bytes of q_exts_by_addr element and can be compared with single operation. - Better decouple secondary tree logic from main range_tree by moving rt_btree_ops and related functions into dsl_scan.c as ext_size_ops. Those functions are very small to worry about the code duplication and range_tree does not need to know details such as rt_btree_compare. - Instead of accounting number of pending bytes per pool, that needs atomic on global variable per block, account the number of non-empty per-vdev queues, that change much more rarely. - When extent scan is interrupted by TXG end, continue it in the next TXG instead of selecting next best extent. It allows to avoid leaving one truncated (and so likely not the best any more) extent each TXG. On top of some other optimizations this saves about 1.5 minutes out of 10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13576	2022-07-26 10:10:37 -07:00
Alexander Motin	881249de6f	FreeBSD: Improve crypto_dispatch() handling Handle crypto_dispatch() return values same as crp->crp_etype errors. On FreeBSD 12 many drivers returned same errors both ways, and lack of proper handling for the first ended up in assertion panic later. It was changed in FreeBSD 13, but there is no reason to not be safe. While there, skip waiting for completion, including locking and wakeup() call, for sessions on synchronous crypto drivers, such as typical aesni and software. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13563	2022-07-26 10:10:37 -07:00
Alexander Motin	916d9de158	Reduce ZIO io_lock contention on sorted scrub During sorted scrub multiple threads (one per vdev) are issuing many ZIOs same time, all using the same scn->scn_zio_root ZIO as parent. It causes huge lock contention on the single global lock on that ZIO. Improve it by introducing per-queue null ZIOs, children to that one, and using them instead as proxy. For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this dramatically reduces lock contention and reduces scrub time from 21 minutes down to 12.5, while actual read stages (not scan) are about 3x faster, reaching 100K blocks per second per vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13553	2022-07-26 10:10:37 -07:00
Alexander Motin	813e15f28c	AVL: Remove obsolete branching optimizations Modern Clang and GCC can successfully implement simple conditions without branching with math and flag operations. Use of arrays for translation no longer helps as much as it was 14+ years ago. Disassemble of the code generated by Clang 13.0.0 on FreeBSD 13.1, Clang 14.0.4 on FreeBSD 14 and GCC 10.2.1 on Debian 11 with this change still shows no branching instructions. Profiling of CPU-bound scan stage of sorted scrub shows reproducible reduction of time spent inside avl_find() from 6.52% to 4.58%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13540	2022-07-26 10:10:37 -07:00
Alexander Motin	884364ea85	More speculative prefetcher improvements - Make prefetch distance adaptive: up to 4MB prefetch doubles for every, hit same as before, but after that it grows by 1/8 every time the prefetch read does not complete in time to satisfy the demand. My tests show that 4MB is sufficient for wide NVMe pool to saturate single reader thread at 2.5GB/s, while new 64MB maximum allows the same thread to reach 1.5GB/s on wide HDD pool. Further distance increase may increase speed even more, but less dramatic and with higher latency. - Allow early reuse of inactive prefetch streams: streams that never saw hits can be reused immediately if there is a demand, while others can be reused after 1s of inactivity, starting with the oldest. After 2s of inactivity streams are deleted to free resources same as before. This allows by several times increase strided read performance on HDD pool in presence of simultaneous random reads, previously filling the zfetch_max_streams limit for seconds and so blocking most of prefetch. - Always issue intermediate indirect block reads with SYNC priority. Each of those reads if delayed for longer may delay up to 1024 other block prefetches, that may be not good for wide pools. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13452	2022-07-26 10:10:37 -07:00
Alexander Motin	6e1e90d64c	Improve mg_aliquot math When calculating mg_aliquot alike to #12046 use number of unique data disks in the vdev, not the total number of children vdev. Increase default value of the tunable from 512KB to 1MB to compensate. Before this change each disk in striped pool was getting 512KB of sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB. After this change in all the cases each disk should get 1MB. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13388	2022-07-26 10:10:37 -07:00
Alexander Motin	dd9c110ab5	Improve log spacemap load time Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes. This patch improves the situation from five sides: - By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number. - By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump. - By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool. - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times. - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes. As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12789	2022-07-26 10:10:37 -07:00
Alexander Motin	fdb80a2301	Add more control/visibility to spa_load_verify(). Use error thresholds from policy to control whether to scrub data and/or metadata. If threshold is set to UINT64_MAX, then caller probably does not care about result and we may skip that part. By default import neither set the data error threshold nor read the error counter, so skip the data scrub for faster import. Metadata are still scrubbed and fail if even single error found. While there just for symmetry return number of metadata errors in case threshold is not set to zero and we haven't reached it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #13022	2022-07-26 10:10:37 -07:00
Allan Jude	72a4709a59	spa.c: Replace VERIFY(nvlist_(...) == 0) with fnvlist_ (#12678 ) The fnvlist versions of the functions are fatal if they fail, saving each call from having to include checking the result. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Allan Jude <allan@klarasystems.com>	2022-07-26 10:10:37 -07:00
Alexander Motin	415882d228	Avoid small buffer copying on write It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to decide whether to reallocate/copy the buffer, because the answer is OS-specific and depends on the buffer size. Instead of that use abd_size_alloc_linear(), moved into public header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12425	2022-07-26 10:10:37 -07:00
Alexander Motin	5b860ae1fb	Remove refcount from spa_config_() The only reason for spa_config_() to use refcount instead of simple non-atomic (thanks to scl_lock) variable for scl_count is tracking, hard disabled for the last 8 years. Switch to simple int scl_count reduces the lock hold time by avoiding atomic, plus makes structure fit into single cache line, reducing the locks contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12287	2022-07-26 10:10:37 -07:00
Brian Behlendorf	3920d7f325	Scrub mirror children without BPs When scrubbing a raidz/draid pool, which contains a replacing or sparing mirror with multiple online children, only one child will be read. This is not normally a serious concern because the DTL records are used to determine where a good copy of the data is. As long as the data can be read from one child the mirror vdev will use it to repair gaps in any of its children. Furthermore, even if the data which was read is corrupt the raidz code will detect this and issue its own repair I/O to correct the damage in the mirror vdev. However, in the scenario where the DTL is wrong due to silent data corruption (say due to overwriting one child) and the scrub happens to read from a child with good data, then the other damaged mirror child will not be detected nor repaired. While this is possible for both raidz and draid vdevs, it's most pronounced when using draid. This is because by default the zed will sequentially rebuild a draid pool to a distributed spare, and the distributed spare half of the mirror is always preferred since it delivers better performance. This means the damaged half of the mirror will go undetected even after scrubbing. For system administrations this behavior is non-intuitive and in a worst case scenario could result in the only good copy of the data being unknowingly detached from the mirror. This change resolves the issue by reading all replacing/sparing mirror children when scrubbing. When the BP isn't available for verification, then compare the data buffers from each child. They must all be identical, if not there's silent damage and an error is returned to prompt the top-level vdev to issue a repair I/O to rewrite the data on all of the mirror children. Since we can't tell which child was wrong a checksum error is logged against the replacing or sparing mirror vdev. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13555	2022-07-14 10:21:29 -07:00
Ryan Moeller	403d4bc66e	FreeBSD: Silence clang unused-but-set-variable Quick and dirty build fix for warnings being treated as errors. Signed-off-by: Ryan Moeller <ryan@iXsystems.com>	2022-06-15 11:27:28 -07:00
Alexander Motin	6ff89fe126	Improve sorted scan memory accounting Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should count 2x sizeof (range_seg_gap_t) per node. And since average B-tree memory efficiency is about 75%, we should increase it to 3x. Previous code under-counted up to 30% of the memory usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13537	2022-06-15 11:23:49 -07:00
Rich Ercolani	cc565f557b	Corrected edge case in uncompressed ARC->L2ARC handling I genuinely don't know why this didn't come up before, but adding the LZ4 early abort pointed out this flaw, in which we're allocating a buffer of one size, and then telling the compressor that we're handing it buffers of a different size, which may be Very Different - say, allocating 512b and then telling it the inputs are 128k. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13375	2022-06-14 18:10:21 -07:00
Alexander Motin	338188562b	Remove wrong assertion in log spacemap It is typical, but not generally true that if log summary has more blocks it must also have unflushed metaslabs. Normally with metaslabs flushed in order it works, but there are known exceptions, such as device removal or metaslab being loaded during its flush attempt. Before `600a02b884` if spa_flush_metaslabs() hit loading metaslab it usually stopped (unless memlimit is also exceeded), but now it may flush more metaslabs, just skipping that particular one. This increased chances of assertion to fire when the skipped metaslab is flushed on next iteration if all other metaslabs in that summary entry are already flushed out of order. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13486 Closes #13513	2022-06-06 16:57:56 -07:00
Brian Behlendorf	fec407fb69	Linux 5.19 compat: aops->read_folio() As of the Linux 5.19 kernel the readpage() address space operation has been replaced by read_folio(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-06-01 14:24:49 -07:00
Brian Behlendorf	7ae5ea8864	Linux 5.19 compat: blkdev_issue_secure_erase() Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure erase functionality from the blkdev_issue_discard() function. The blkdev_issue_secure_erase() must now be issued to issue a secure erase. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-06-01 14:24:49 -07:00
Brian Behlendorf	048301b6dc	Linux 5.19 compat: bdev_max_secure_erase_sectors() Linux 5.19 commit torvalds/linux@44abff2c0 removed the blk_queue_secure_erase() helper function. The preferred interface is to now use the bdev_max_secure_erase_sectors() function to check for discard support. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-06-01 14:24:49 -07:00
Brian Behlendorf	9ce5eb18ef	Linux 5.19 compat: bdev_max_discard_sectors() Linux 5.19 commit torvalds/linux@70200574cc removed the blk_queue_discard() helper function. The preferred interface is to now use the bdev_max_discard_sectors() function to check for discard support. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-06-01 14:24:49 -07:00
Brian Behlendorf	5a639f0802	Linux 5.18 compat: bio_alloc() As for the Linux 5.18 kernel bio_alloc() expects a block_device struct as an argument. This removes the need for the bio_set_dev() compatibility code for 5.18 and newer kernels. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-06-01 14:24:49 -07:00
hping	b28c0c4bf8	abd_os: remove redundant refcount creation for abd_children Refcount creation for abd_zero_scatter->abd_children is redundant in abd_alloc_zero_scatter, as it has been done in abd_init_struct. In addition, abd_children is undefined when ZFS_DEBUG is disabled, the reference of abd_children in abd_alloc_zero_scatter breaks build of libzpool when ZFS_DEBUG is disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Ping Huang <huangping@smartx.com> Closes #13429	2022-05-20 10:33:24 -07:00
Aidan Harris	eee389ba2e	Fix functions without a prototype clang-15 emits the following error message for functions without a prototype: fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aidan Harris <me@aidanharr.is> Closes #13421	2022-05-20 10:33:24 -07:00
Mateusz Guzik	2c5c8bb0a6	FreeBSD: use zero_region instead of allocating a dedicated page Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13406	2022-05-20 10:33:24 -07:00
szubersk	756c3e085b	autoconf: Fail when __copy_from_user_inatomic is a non-GPL symbol A followup to `849c14e048` Fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009242 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13389	2022-05-20 10:33:24 -07:00
Damian Szuberski	13b1f336d3	PPC get_user workaround Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic() inline helpers very indirectly include a reference to the GPL'd array mmu_feature_keys[] and fails to build. Workaround this by using copy_from_user() and throwing EFAULT for any calls to __copy_from_user_inatomic(). This is a workaround until a fix for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458 "powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #11958 Closes #12590 Closes #13367	2022-05-20 10:33:24 -07:00
Brian Atkinson	60fc173251	Adding ZERO_PAGE detection On some architectures ZERO_PAGE is unavailable because it references a GPL exported symbol of empty_zero_page. Originally `e08b993` removed the call to PAGE_ZERO(0) for assignment to the abd_zero_page. However, a simple check can be done to avoid a kernel allocation and free for the abd_zero_page if ZERO_PAGE is available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #13199	2022-05-20 10:33:24 -07:00
Ka Ho Ng	1f31889046	FreeBSD: Implement hole-punching support This adds supports for hole-punching facilities in the FreeBSD kernel starting from __FreeBSD_version 1400032. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ka Ho Ng <khng@FreeBSD.org> Sponsored-by: The FreeBSD Foundation Closes #12458	2022-05-17 11:15:29 -07:00
наб	1467a1bb33	module: zstd: check we don't leak symbols; regenerate symbol map Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12988 Closes #13209 (cherry picked from commit `6ef00196db`)	2022-05-16 15:48:21 -07:00
Brian Behlendorf	bb29f1eb38	Reduce dbuf_find() lock contention Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggregrated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somehwat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the siutation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13405	2022-05-06 12:02:45 -07:00
Brian Behlendorf	4c9c96aba4	Silence unused-but-set-variable warnings Clang 13.0.0 added support for `Wunused-but-set-parameter` and `-Wunused-but-set-variable` which correctly detects two unused variables in zstd resulting in a build failure. This commit annotates these instances accordingly. https://releases.llvm.org/13.0.1/tools/clang/docs/ReleaseNotes.html#id6 In FSE_createCTable(), malloc() is intentionally defined as NULL when compiled in the kernel so the variable is unused. zstd/lib/compress/fse_compress.c:307:12: error: variable 'size' set but not used [-Werror,-Wunused-but-set-variable] Additionally, in ZSTD_seqDecompressedSize() the assert is compiled out similarly resulting in an unused variable. zstd/lib/compress/zstd_compress_superblock.c:412:12: error: variable 'litLengthSum' set but not used [-Werror,-Wunused-but-set-variable] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2022-05-02 15:42:58 -07:00
наб	ecec151c14	module: zfs: freebsd: fix unused, remove argsused Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2022-05-02 15:42:58 -07:00
наб	a4f582f0b6	FreeBSD: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12899	2022-05-02 15:42:58 -07:00
наб	9e68b734b3	zvol: remove unused variable Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12917	2022-05-02 15:42:58 -07:00
наб	a175fe82e6	fm: remove unused variables Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12917	2022-05-02 15:42:58 -07:00
наб	b8e1366ee6	zvol: remove unused variable Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12917	2022-05-02 15:42:58 -07:00
наб	7536ad35ca	module/zfs: vdev_removal: spa_vdev_remove_thread: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2022-05-02 15:42:58 -07:00
наб	986d64ccca	module/zfs: vdev_indirect: vdev_indirect_repair: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2022-05-02 15:42:58 -07:00
наб	18e9268087	module/zfs: dbuf: dbuf_read_impl: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2022-05-02 15:42:58 -07:00
наб	4149e19dfc	module/zfs: arc: arc_hdr_realloc_crypt: remove unused variables Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2022-05-02 15:42:58 -07:00
Brian Behlendorf	ce8d41ef75	Skip spacemaps reading in case of pool readonly import The only zdb utility require to read metaslab-related data during read-only pool import because of spacemaps validation. Add global variable which will allow zdb read spacemaps in case of readonly import mode. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9095 Closes #12687	2022-04-28 16:47:12 -07:00
Brian Behlendorf	49c1346c10	Linux 5.18 compat: replace __set_page_dirty_nobuffers Replace __set_page_dirty_nobuffers with filemap_dirty_folio. Upstream-commit: 6b1f86f8e9c7f9de7ca1cb987b2cf25e99b1ae3a ("Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache ") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Authored-by: Satadru Pramanik <satadru@gmail.com> Signed-off-by: Satadru Pramanik <satadru@gmail.com> Closes #13325 Closes #13380	2022-04-28 15:17:38 -07:00
Brian Behlendorf	71cd3726c0	Fix O_APPEND for Linux 3.15 and older kernels When using a Linux kernel which predates the iov_iter interface the O_APPEND flag should be applied in zpl_aio_write() via the call to generic_write_checks(). The updated pos variable was incorrectly ignored resulting in the current offset being used. This issue should only realistically impact the RHEL/CentOS 7.x kernels which are based on Linux 3.10. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13370 Closes #13377	2022-04-28 15:15:28 -07:00
наб	642426095a	Linux 5.18 compat: kobj_type.default_attrs replaced with default_groups Upstream-commit: cdb4f26a63c391317e335e6e683a614358e70aeb ("kobject: kobj_type: remove default_attrs") Upstream-commit: `0cdda2edb3` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13357	2022-04-25 10:00:09 -07:00
Alexander Motin	972637dc06	FreeBSD: Fix translation from ABD to physical pages. In hypothetical case of non-linear ABD with single segment, multiple to page size but not aligned to it, vdev_geom_fill_unmap_cb() could fill one page less into bio_ma array. I am not sure it is expoitable, but better to be safe than sorry. Reported-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> (cherry picked from commit `5352f85cdd`)	2022-04-21 16:59:09 -07:00
Rich Ercolani	c220771a47	Corrected oversight in ZERO_RANGE behavior It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do have differing semantics in some ways - in particular, one requires KEEP_SIZE, and the other does not. Also added a zero-range test to catch this, corrected a flaw that made the punch-hole test succeed vacuously, and a typo in file_write. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13329 Closes #13338	2022-04-21 16:58:07 -07:00
Brian Behlendorf	aa1c3c1d1d	Linux 5.17 compat: GENHD_FL_EXT_DEVT / GENHD_FL_NO_PART_SCAN As of the 5.17 kernel the GENHD_FL_EXT_DEVT flag has been removed and the GENHD_FL_NO_PART_SCAN flag renamed GENHD_FL_NO_PART. Update zvol_alloc() to set GENHD_FL_NO_PART for the newer kernels which is sufficient. The behavior for prior kernels remains unchanged. 1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT") 46e7eac6 ("block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13294 Closes #13297	2022-04-20 13:44:19 -07:00
Mark Johnston	b7546f92ea	FreeBSD: Return Mach error codes from VOP_(GET\|PUT)PAGES FreeBSD's memory management system uses its own error numbers and gets confused when these VOPs return EIO. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reported-by: Peter Holm <pho@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #13311	2022-04-19 10:42:54 -07:00
Mark Johnston	e9cd90f6e5	FreeBSD: Parameterize ZFS_ENTER/ZFS_VERIFY_VP with an error code For legacy reasons, a couple of VOPs have to return error numbers that don't come from the usual errno namespace. To handle the cases where ZFS_ENTER or ZFS_VERIFY_ZP fail, we need to be able to override the default error return value of EIO. Extend the macros to permit this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #13311	2022-04-19 10:42:54 -07:00
Riccardo Schirone	35ddd8ee2e	Linux 5.18 compat: use address_space_operations->readahead ->readpages was removed and replaced by ->readahead. Define zpl_readahead for kernels that don't have ->readpages. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Riccardo Schirone <rschirone91@gmail.com> Closes #13278	2022-04-06 13:15:27 -07:00
Riccardo Schirone	10a9f5fc47	Linux 5.18 compat: blkg_tryget is moved to private headers Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Riccardo Schirone <rschirone91@gmail.com> Closes #13278	2022-04-06 13:15:27 -07:00
наб	9f7f704507	Linux 5.18 compat: replace genhd.h with blkdev.h includes blkdev.h includes genhd.h since dawn of upstream git, so this is globally safe Upstream-commit: 322cbb50de711814c42fb088f6d31901502c711a ("block: remove genhd.h") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13251	2022-04-06 13:15:27 -07:00
наб	215a8255a9	Linux 5.18 compat: 4-argument bio_alloc() bio_alloc(gfp_t gfp_mask, unsigned short nr_iovecs) became bio_alloc(struct block_device *bdev, unsigned short nr_vecs, unsigned int opf, gfp_t gfp_mask) passing NULL/0 continues previous behaviour Upstream-commit: 07888c665b405b1cd3577ddebfeb74f4717a84c4 ("block: pass a block_device and opf to bio_alloc") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13251	2022-04-06 13:15:27 -07:00
Ryan Moeller	a5a28723bd	FreeBSD: Use NDFREE_PNBUF if available NDF_ONLY_PNBUF has been removed from FreeBSD in favor of NDFREE_PNBUF. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #13277	2022-04-06 10:29:53 -07:00
Brian Behlendorf	5a9994f5ae	Export minimal zfs_refcount interfaces Lustre makes light use of the zfs_refcount interfaces which isn't a problem when using a non-debug build of OpenZFS. However, when debugging is enabled the required symbols are not exported. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12613	2022-04-06 10:29:00 -07:00
Brian Behlendorf	9f6943504a	Default to zfs_dmu_offset_next_sync=1 Strict hole reporting was previously disabled by default as a performance optimization. However, this has lead to confusion over the expected behavior and a variety of workarounds being adopted by consumers of ZFS. Change the default behavior to always report holes and force the TXG sync. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Upstream-commit: `05b3eb6d23` Ref: #13261 Closes #12746	2022-04-01 09:59:47 -07:00
Brian Behlendorf	847d03060f	Fix ACL checks for NFS kernel server This PR changes ZFS ACL checks to evaluate fsuid / fsgid rather than euid / egid to avoid accidentally granting elevated permissions to NFS clients. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Andrew Walker <awalker@ixsystems.com> Co-authored-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #13221	2022-03-20 21:21:18 -07:00
Kyle Evans	421750672b	module: freebsd: avoid a taking a destroyed lock in zfs_zevent bits At shutdown time, we drain all of the zevents and set the ZEVENT_SHUTDOWN flag. On FreeBSD, we may end up calling zfs_zevent_destroy() after the zevent_lock has been destroyed while the sysevent thread is winding down; we observe ESHUTDOWN, then back out. Events have already been drained, so just inline the kmem_free call in sysevent_worker() to avoid the race, and document the assumption that zfs_zevent_destroy doesn't do anything else useful at that point. This fixes a panic that can occur at module unload time. Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kyle Evans <kevans@FreeBSD.org> Closes #13220	2022-03-18 17:11:43 -07:00
Mateusz Guzik	275c756730	FreeBSD: add missing replay check to an assert in zfs_xvattr_set Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13219	2022-03-18 17:11:43 -07:00
Ryan Moeller	7b215d93bc	Fix module build with -Werror This is a direct commit to zfs-2.1-release to fix release builds that error out on an unused variable. The issue is avoided on master by a huge series of commits that change how the ASSERT macros work, but that is not feasible to backport. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #13194 Closes #13196	2022-03-17 10:18:23 -07:00
Brian Behlendorf	145af480d3	Fix ENOSPC when unlinking multiple files from full pool When unlinking multiple files from a pool at 100% capacity, it was possible for ENOSPC to be returned after the first unlink. e.g. rm -f /mnt/fs/test1.0.0 /mnt/fs/test1.1.0 /mnt/fs/test1.2.0 rm: cannot remove '/mnt/fs/test1.1.0': No space left on device rm: cannot remove '/mnt/fs/test1.2.0': No space left on device After waiting for the pending deferred frees from the first unlink to be processed the remaining files can then be unlinked. This is caused by the quota limit in dsl_dir_tempreserve_impl() being temporarily decreased to the allocatable pool capacity less any deferred free space. This is resolved using the existing mechanism of returning ERESTART when over quota as long as we know enough space will shortly be available after processing the pending deferred frees. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13172	2022-03-08 11:46:03 -08:00
Mark Johnston	b3427b18b1	zfs: Fix a deadlock between page busy and the teardown lock When rolling back a dataset, ZFS has to purge file data resident in the system page cache. To do this, it loops over all vnodes for the mountpoint and calls vn_pages_remove() to purge pages associated with the vnode's VM object. Each page is thus exclusively busied while the dataset's teardown write lock is held. When handling a page fault on a mapped ZFS file, FreeBSD's page fault handler busies newly allocated pages and then uses VOP_GETPAGES to fill them. The ZFS getpages VOP acquires the teardown read lock with vnode pages already busied. This represents a lock order reversal which can lead to deadlock. To break the deadlock, observe that zfs_rezget() need only purge those pages marked valid, and that pages busied by the page fault handler are, by definition, invalid. Furthermore, ZFS pages always transition from invalid to valid with the teardown lock held, and ZFS never creates partially valid pages. Thus, zfs_rezget() can use the new vn_pages_remove_valid() to skip over pages busied by the fault handler. PR: 258208 Tested by: pho Reviewed by: avg, sef, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32931 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #12828	2022-03-04 15:37:41 -08:00
Alexander Motin	0e2bb1a3ee	Really zero the zero page While switching abd_zero_buf allocation KPI I've missed the fact that kmem_zalloc() zeroed the allocation, while kmem_cache_alloc() does not. Add explicit bzero() after it. I don't think it should have caused real problems, but leaking one memory page content all over the pool is not good. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12569	2022-03-04 15:37:33 -08:00
Paul Dagnelie	7bd292e59b	Fix cpu hotplug atomic sleep issue We move the spinlock unlock before the thread creation. This should be safe because the thread creation code doesn't actually manipulate any taskq data structures; that's done by the thread once it's created. We also remove the assertion that the maxthreads is the current threads plus one; that assertion could fail if multiple hotplug events come in quick succession, and the first new taskq thread hasn't had a chance to start processing yet. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> eviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12714	2022-02-15 21:52:45 -08:00
George Amanakis	bcddb18bae	Enable encrypted raw sending to pools with greater ashift Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted with ashift=12 results to failure when mounting pool2/encrypted (Input/Output error). Notably, the opposite, raw sending from a greater ashift to a lower one does not fail. This happens because zio_compress_write() falsely checks only ZIO_FLAG_RAW_COMPRESS and not ZIO_FLAG_RAW_ENCRYPT which is also set in encrypted raw send streams. In this case it rounds up the psize and if not equal to the zio->io_size it modifies the block by zeroing out the extra bytes. Because this happens in a SA attr. registration object (type=46), the decryption fails upon mounting the filesystem, and zpool status falsely reports an error. Fix this by checking both ZIO_FLAG_RAW_COMPRESS and ZIO_FLAG_RAW_ENCRYPT before deciding whether to zero-pad a block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13067 Closes #13074	2022-02-23 16:47:37 -08:00
George Amanakis	6c6153e5b8	Avoid dirtying the final TXGs when exporting a pool There are two codepaths than can dirty final TXGs: 1) If calling spa_export_common()->spa_unload()-> spa_unload_log_sm_flush_all() after the spa_final_txg is set, then spa_sync()->spa_flush_metaslabs() may end up dirtying the final TXGs. Then we have the following panic: Call Trace: <TASK> dump_stack_lvl+0x46/0x62 spl_panic+0xea/0x102 [spl] dbuf_dirty+0xcd6/0x11b0 [zfs] zap_lockdir_impl+0x321/0x590 [zfs] zap_lockdir+0xed/0x150 [zfs] zap_update+0x69/0x250 [zfs] feature_sync+0x5f/0x190 [zfs] space_map_alloc+0x83/0xc0 [zfs] spa_generate_syncing_log_sm+0x10b/0x2f0 [zfs] spa_flush_metaslabs+0xb2/0x350 [zfs] spa_sync_iterate_to_convergence+0x15a/0x320 [zfs] spa_sync+0x2e0/0x840 [zfs] txg_sync_thread+0x2b1/0x3f0 [zfs] thread_generic_wrapper+0x62/0xa0 [spl] kthread+0x127/0x150 ret_from_fork+0x22/0x30 </TASK> 2) Calling vdev_*_stop_all() for a second time in spa_unload() after spa_export_common() unnecessarily delays the final TXGs beyond what spa_final_txg is set at. Fix this by performing the check and call for spa_unload_log_sm_flush_all() before the spa_final_txg is set in spa_export_common(). Also check if the spa_final_txg has already been set in spa_unload() and skip those calls in this case. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> External-issue: https://www.illumos.org/issues/9081 Closes #13048 Closes #13098	2022-02-23 16:47:33 -08:00
Tomohiro Kusumi	02309af096	Remove unneeded "extern inline" function declarations All of these externs are already #included as static inline functions via corresponding headers. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #13073	2022-02-16 17:58:56 -08:00
наб	94a4b7ec3d	module: zfs: fix unused, remove argsused Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2022-02-16 17:58:56 -08:00
drowfx	bc99c809d5	Add dataset_kstats_update.. to mmap read/write paths This allows reads/writes caused by accesses to mmap files to be accounted correctly in the per-dataset kstats for both Linux and FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Matthias Blankertz <matthias@blankertz.org> Closes #12994 Closes #13044	2022-02-16 17:58:56 -08:00
Attila Fülöp	5c19af07d4	Receive checks should allow unencrypted child datasets dmu_recv_begin_check() unconditionally sets the DS_HOLD_FLAG_DECRYPT flag before calling dsl_dataset_hold_flags(). If the key on the receiving side isn't loaded or the send stream contains embedded blocks, the receive check fails for a stream which is perfectly valid and could be received without any problem. This seems like a remnant of the initial design, where unencrypted datasets below encrypted ones weren't allowed. Add a condition to set `DS_HOLD_FLAG_DECRYPT` only for encrypted datasets, modify an existing test to detect this regression and add a test for raw replication streams. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Co-authored-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13033 Closes #13076	2022-02-16 17:58:55 -08:00
Peter Levine	c7fcf00917	Add support for $KERNEL_{CC,LD,LLVM} variables Currently, $(CC), $(LD), and $(LLVM) variables aren't passed to kbuild while building modules. This causes modules to build with the default GNU GCC toolchain and prevents experimenting with other toolchains such as CLANG/LLVM. It can also lead to build failure if the CFLAGS/LDFLAGS passed are incompatible with gcc/ld. Pass $KERNEL_CC, $KERNEL_LD, and $KERNEL_LLVM as $(CC), $(LD), and $(LLVM), respectively, to kbuild for each that is defined in the environment. This should take care of the majority of alternative toolchain use cases. Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Peter Levine <plevine457@gmail.com> Closes #13046	2022-02-16 17:58:55 -08:00
наб	52aae04c6a	module: Makefile: simplify clean and install jobs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12979	2022-02-16 17:58:55 -08:00
наб	77ae804f9e	module: Makefile: flatten subdir loop, use $PWD instead of `pwd` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12899	2022-02-16 17:58:55 -08:00
Attila Fülöp	3b52ccd7d7	Linux 5.16 compat: don't use XSTATE_XSAVE to save FPU state Linux 5.16 moved XSTATE_XSAVE and XSTATE_XRESTORE out of our reach, so add our own XSAVE{,OPT,S} code and use it for Linux 5.16. Please note that this differs from previous behavior in that it won't handle exceptions created by XSAVE an XRSTOR. This is sensible for three reasons. - Exceptions during XSAVE and XRSTOR can only occur if the feature is not supported or enabled or the memory operand isn't aligned on a 64 byte boundary. If this happens something else went terribly wrong, and it may be better to stop execution. - Previously we just printed a warning and didn't handle the fault, this is arguable for the above reason. - All other *SAVE instruction also don't handle exceptions, so this at least aligns behavior. Finally add a test to catch such a regression in the future. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13042 Closes #13059	2022-02-16 17:58:55 -08:00
Christian Schwarz	a61915e086	dsl_dir_tempreserve_impl: remove unused `deferred` variable The following commit moved the users of `deferred` into function dsl_pool_unreserved_space: commit `d2734cce68` Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Date: Fri Dec 16 14:11:29 2016 -0800 OpenZFS 9166 - zfs storage pool checkpoint Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #13056	2022-02-16 17:58:55 -08:00
Pawel Jakub Dawidek	3e27b589cf	Fix clearing set-uid and set-gid bits on a file when replying a write POSIX requires that set-uid and set-gid bits to be removed when an unprivileged user writes to a file and ZFS does that during normal operation. The problem arrises when the write is stored in the ZIL and replayed. During replay we have no access to original credentials of the process doing the write, so zfs_write() will be performed with the root credentials. When root is doing the write set-uid and set-gid bits are not removed from the file. To correct that, log a separate TX_SETATTR entry that removed those bits on first write to such file. Idea from: Christian Schwarz Add test for ZIL replay of setuid/setgid clearing. Improve various edge cases when clearing setid bits: - The setid bits can be readded during a single write, so make sure to check for them on every chunk write. - Log TX_SETATTR record at most once per transaction group (if the setid bits are keep coming back). - Move zfs_log_setattr() outside of zp->z_acl_lock. Reviewed-by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Christian Schwarz <me@cschwarz.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13027	2022-02-16 17:58:55 -08:00
George Amanakis	72a82f312f	Report dnodes with faulty bonuslen In files created/modified before `4254acb` there may be a corruption of xattrs which is not reported during scrub and normal send/receive. It manifests only as an error when raw sending/receiving. This happens because currently only the raw receive path checks for discrepancies between the dnode bonus length and the spill pointer flag. In case we encounter a dnode whose bonus length is greater than the predicted one, we should report an error. Modify in this regard dnode_sync() with an assertion at the end, dump_dnode() to error out, dsl_scan_recurse() to report errors during a scrub, and zstream to report a warning when dumping. Also added a test to verify spill blocks are sent correctly in a raw send. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12720 Closes #13014	2022-02-16 17:58:55 -08:00
наб	9cbc2ed20f	libzfs: add keylocation=https://, backed by fetch(3) or libcurl Add support for http and https to the keylocation properly to allow encryption keys to be fetched from the specified URL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #9543 Closes #9947 Closes #11956	2022-02-16 17:58:37 -08:00
Rich Ercolani	2e3b3e3a2e	Workaround Debian's fake System.map behavior Debian ships fake System.map files by default, leading to the invocation of depmod with them to flood you with errors about missing symbols. Let's notice and not do that. Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12862	2022-02-10 11:18:38 -08:00
José Luis Salvador Rufo	a35125e3d5	Proper support for DESTDIR and INSTALL_MOD_PATH The environment variables DESTDIR and INSTALL_MOD_PATH must be mutually exclusive. https://www.gnu.org/prep/standards/html_node/DESTDIR.html https://www.kernel.org/doc/Documentation/kbuild/modules.txt This issue was discussed in this Buildroot thread: https://lists.buildroot.org/pipermail/buildroot/2021-August/621350.html I saw this behavior in other different projects, as: - Yocto Project: https://www.yoctoproject.org/pipermail/meta-freescale/2013-August/004307.html - Google IA Coral: https://coral.googlesource.com/linux-imx-debian/+/refs/heads/master/debian/rules For the above reasons, INSTALL_MOD_PATH will be set as DESTDIR by default. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: José Luis Salvador Rufo <salvador.joseluis@gmail.com> Signed-off-by: Romain Naour <romain.naour@gmail.com> Closes #12577	2022-02-10 11:18:29 -08:00
Jorgen Lundman	f31b45176c	Upstream: Add snapshot and zvol events For kernel to send snapshot mount/unmount events to zed. For kernel to send symlink creates/removes on zvol plumbing. (/dev/run/dsk/zvol/$pool/$zvol -> /dev/diskX) If zed misses the ENODEV, all errors after are EINVAL. Treat any error as kernel module failure. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12416	2022-02-10 11:04:06 -08:00
George Amanakis	e257bd481b	Introduce a flag to skip comparing the local mac when raw sending Raw receiving a snapshot back to the originating dataset is currently impossible because of user accounting being present in the originating dataset. One solution would be resetting user accounting when raw receiving on the receiving dataset. However, to recalculate it we would have to dirty all dnodes, which may not be preferable on big datasets. Instead, we rely on the os_phys flag OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is incomplete when raw receiving. Thus, on the next mount of the receiving dataset the local mac protecting user accounting is zeroed out. The flag is then cleared when user accounting of the raw received snapshot is calculated. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12981 Closes #10523 Closes #11221 Closes #11294 Closes #12594 Issue #11300	2022-02-04 16:14:56 -08:00
Finix1979	1009e60992	Linux <4.8 compat: submit_bio() rw arg When using the two argument version of submit_bio() in kernel's prior to 4.8 the first argument should be specified. It's used by block dump to report the bio direction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Finix Yan <yancw@info2soft.com> Closes #13006	2022-02-04 08:33:52 -08:00
наб	4f6599416a	Linux 5.17 compat: PDE_DATA() renamed to pde_data() Upstream commit 359745d78351c6f5442435f81549f0207ece28aa ("proc: remove PDE_DATA() completely") Link: https://lore.kernel.org/all/20211124081956.87711-2-songmuchun@bytedance.com/T/#u Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13004 Closes #12989	2022-02-04 08:33:52 -08:00
наб	f42c126029	Linux 5.17 compat: dequeue_signal() takes a 4th argument Linux 5.17's dequeue_signal() takes an additional enum pid_type * output argument Upstream commit 5768d8906bc23d512b1a736c1e198aa833a6daa4 ("signal: Requeue signals in the appropriate queue") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12989	2022-02-04 08:33:52 -08:00
наб	2ce06d93a8	Linux 5.17 compat: detect complete_and_exit() rename Linux 5.17 sees a rename from complete_and_exit() to kthread complete_and_exit() Upstream commit cead18552660702a4a46f58e65188fe5f36e9dfe ("exit: Rename complete_and_exit to kthread_complete_and_exit") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12989	2022-02-04 08:33:52 -08:00
Rich Ercolani	8ef01afbfc	Add support for FALLOC_FL_ZERO_RANGE For us, I think it's always just FALLOC_FL_PUNCH_HOLE with a fake mustache on. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12975	2022-02-04 08:33:52 -08:00
Rich Ercolani	c31c1146b6	Linux 5.16 compat: Added add_disk check for return add_disk went from void to must-check int return. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12975	2022-02-04 08:33:52 -08:00
Mark Johnston	0da15f9194	Fix handling of errors from dmu_write_uio_dbuf() on FreeBSD FreeBSD's implementation of zfs_uio_fault_move() returns EFAULT when a page fault occurs while copying data in or out of user buffers. The VFS treats such errors specially and will retry the I/O operation (which may have made some partial progress). When the FreeBSD and Linux implementations of zfs_write() were merged, the handling of errors from dmu_write_uio_dbuf() changed such that EFAULT is not handled as a partial write. For example, when appending to a file, the z_size field of the znode is not updated after a partial write resulting in EFAULT. Restore the old handling of errors from dmu_write_uio_dbuf() to fix this. This should have no impact on Linux, which has special handling for EFAULT already. Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12964	2022-02-03 15:30:52 -08:00
Mark Johnston	5303fc4c95	Avoid memory allocations in the ARC eviction thread When the eviction thread goes to shrink an ARC state, it allocates a set of marker buffers used to hold its place in the state's sublists. This can be problematic in low memory conditions, since 1) the allocation can be substantial, as we allocate NCPU markers; 2) on at least FreeBSD, page reclamation can block in arc_wait_for_eviction() In particular, in stress tests it's possible to hit a deadlock on FreeBSD when the number of free pages is very low, wherein the system is waiting for the page daemon to reclaim memory, the page daemon is waiting for the ARC eviction thread to finish, and the ARC eviction thread is blocked waiting for more memory. Try to reduce the likelihood of such deadlocks by pre-allocating markers for the eviction thread at ARC initialization time. When evicting buffers from an ARC state, check to see if the current thread is the ARC eviction thread, and use the pre-allocated markers for that purpose rather than dynamically allocating them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12985	2022-02-03 15:30:52 -08:00
Ryan Moeller	af1630c883	FreeBSD: Fix zvol_cdev_open locking First open locking changes were correctly applied to zvol_geom_open but incorrectly applied to zvol_cdev_open, causing spa_namespace_lock to be held indefinitely. Make the first open locking in zvol_cdev_open match zvol_geom_open. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #13016	2022-02-03 15:28:01 -08:00
Ryan Moeller	1828b68a0b	FreeBSD: Fix zvol_*_open() locking These are the changes for FreeBSD corresponding to the changes made for Linux in #12863, see that PR for details. Changes from #12863 are applied for zvol_geom_open and zvol_cdev_open on FreeBSD. This also adds a check for the zvol dying which we had in zvol_geom_open but was missing in zvol_cdev_open. The check causes the open to fail early with ENXIO when we are in the middle of changing volmode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12934	2022-02-03 15:28:01 -08:00
наб	36a91d6cef	FreeBSD: vfsops: use setgen for error case Fix from https://github.com/openzfs/zfs/pull/12844#discussion_r774179413 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12905	2022-02-03 15:28:01 -08:00
chrisrd	1259dc6e6a	zfs_prune: reset sc.nr_to_scan sc.nr_to_scan is an input to super_cache_clean (via shrinker->scan_objects), used to set the number of objects to scan in the various caches. However super_cache_scan also modifies sc.nr_to_scan, so when used in a loop we need to reset sc.nr_to_scan back to our desired nr_to_scan for the next iteration. Issue discovered and solution suggested by Tenzin Lhakhang @tlhakhan. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #12433 Closes #12908	2022-02-03 15:28:01 -08:00
Brian Behlendorf	6575defc52	Verify dRAID empty sectors Verify that all empty sectors are zero filled before using them to calculate parity. Failure to do so can result in incorrect parity columns being generated and written to disk if the contents of an empty sector are non-zero. This was possible because the checksum only protects the data portions of the buffer, not the empty sector padding. This issue has been addressed by updating raidz_parity_verify() to check that all dRAID empty sectors are zero filled. Any sectors which are non-zero will be fixed, repair IO issued, and a checksum error logged. They can then be safely used to verify the parity. This specific type of damage is unlikely to occur since it requires a disk to have silently returned bad data, for an empty sector, while performing a scrub. However, if a pool were to have been damaged in this way, scrubbing the pool with this change applied will repair both the empty sector and parity columns as long as the data checksum is valid. Checksum errors will be reported in the `zpool status` output for any repairs which are made. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12857	2022-02-03 15:28:01 -08:00
наб	5d8c081193	FreeBSD: fix unpropagated error When performing I/O on FreeBSD using a file based vdev ensure all errors encountered when reading/writing are propagated through the zio pipeline. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12904	2022-02-03 15:28:01 -08:00
Brian Behlendorf	9ec630ff2c	Fix zvol_open() lock inversion When restructuring the zvol_open() logic for the Linux 5.13 kernel a lock inversion was accidentally introduced. In the updated code the spa_namespace_lock is now taken before the zv_suspend_lock allowing the following scenario to occur: down_read <=== waiting for zv_suspend_lock zvol_open <=== holds spa_namespace_lock __blkdev_get blkdev_get_by_dev blkdev_open ... mutex_lock <== waiting for spa_namespace_lock spa_open_common spa_open dsl_pool_hold dmu_objset_hold_flags dmu_objset_hold dsl_prop_get dsl_prop_get_integer zvol_create_minor dmu_recv_end zfs_ioc_recv_impl <=== holds zv_suspend_lock via zvol_suspend() zfs_ioc_recv ... This commit resolves the issue by moving the acquisition of the spa_namespace_lock back to after the zv_suspend_lock which restores the original ordering. Additionally, as part of this change the error exit paths were simplified where possible. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12863	2022-02-03 15:28:01 -08:00
Alan Somers	4b2bac5fe9	FreeBSD: Update argument types for VOP_READDIR A recent commit to FreeBSD changed the type of vop_readdir_args.a_cookies to a uint64_t**. There is no functional impact to ZFS because ZFS only uses 32-bit cookies, which will be zero-extended to 64-bits by the existing code. `b214fcceac` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #12874	2022-02-03 15:28:01 -08:00
Alexander Motin	786abf5321	Reduce number of arc_prune threads On FreeBSD vnode reclamation is single-threaded, protected by single global lock. Linux seems to be able to use a thread per mount point, but at this time it creates more harm than good. Reduce number of threads to 1, adding tunable in case somebody wants to try more. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12896 Issue #9966	2022-02-03 15:28:01 -08:00
Ryan Moeller	913ae45218	FreeBSD: Provide correct file generation number va_seq was actually a thin veil over va_gen, so z_gen is a more appropriate value than z_seq to populate the field with. Drop the unnecessary compat obfuscation and provide the correct file generation number. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@freebsd.org> Closes #12851	2022-02-03 15:28:01 -08:00
Ryan Moeller	def73c0735	FreeBSD: Add vop_standard_writecount_nomsync https://cgit.freebsd.org/src/commit?id=3ffcfa599e29686cf2b3c1a6087408c37acaed78 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #12828	2021-12-13 13:23:07 -08:00
Ryan Moeller	effe984148	FreeBSD: Catch up with more VFS changes Unused thread argument was removed from NDINIT* https://cgit.freebsd.org/src/commit?id=7e1d3eefd410ca0fbae5a217422821244c3eeee4 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #12828	2021-12-13 13:23:01 -08:00
Mark Johnston	19337332cc	Fix several bugs in the FreeBSD rename VOP implementation - To avoid a use-after-free, zfsvfs->z_log needs to be loaded after the teardown lock is acquired with ZFS_ENTER(). - Avoid leaking vnode locks in zfs_rename_relock() and zfs_rename_() when the ZFS_ENTER() macros forces an early return. Refactor the rename implementation so that ZFS_ENTER() can be used safely. As a bonus, this lets us use the ZFS_VERIFY_ZP() macro instead of open-coding its implementation. Reported-by: Peter Holm <pho@FreeBSD.org> Tested-by: Peter Holm <pho@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation Closes #12717	2021-12-13 13:22:54 -08:00
Pawel Jakub Dawidek	b96737b83e	Remove (now unused) td argument from zfs_lookup() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #12748	2021-12-13 13:22:47 -08:00
Mark Johnston	4b7bfcf8a0	Exit the teardown section later in rename on FreeBSD We have to hold the teardown lock while dereferencing zfsvfs->z_os and, I believe, when committing to the ZIL. Note that jumping to the "out" label, "error" is always non-zero. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12704	2021-12-13 13:22:41 -08:00
Mark Johnston	07165ce540	Fix potential use-after-frees in FreeBSD getpages and setattr VOPs The objset object is reallocated during certain dataset operations, such as rollbacks, so the objset pointer must be loaded after acquiring the teardown lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12704	2021-12-13 13:22:34 -08:00
Damian Szuberski	64e88992b6	Update `checkstyle` workflow env to ubuntu-20.04 - `checkstyle` workflow uses ubuntu-20.04 environment - improved `mancheck.sh` readability Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12713	2021-12-08 13:27:56 -08:00
Paul Dagnelie	57f6a050e6	ZFS send/recv with ashift 9->12 leads to data corruption Improve the ability of zfs send to determine if a block is compressed or not by using information contained in the blkptr. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Matthew Ahrens <matthew.ahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12770	2021-12-07 17:04:34 -08:00
Coleman Kane	b3b293c9fc	Linux 5.16: Resolve ZSTD_isError symbol collision in Linux kernel Newer zstd code introduced in the main kernel tree now creates a symbol collision with ZSTD_isError in our ZSTD code. This change relabels our implementation with a ZFS-specific symbol name, and undoes some macro-based micro-optimizations that conflict with the attempt to rename our internal-use version. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12819	2021-12-07 13:14:24 -08:00
Coleman Kane	bef7c02c81	Linux 5.16: The blk-cgroup.h header is where struct blkcg_gq is defined The definition of struct blkcg_gq was moved into blk-cgroup.h, which is a header that's been in Linux since 2015. This is used by vdev_blkg_tryget() in module/os/linux/zfs/vdev_disk.c. Since the kernel for CentOS 7 and similar-generation releases doesn't have this header, its inclusion is guarded by a configure test. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12819	2021-12-07 13:14:23 -08:00
Coleman Kane	ea61e07413	Linux 5.16: bio_set_dev is no longer a helper macro This change adds a confiugre check to determine if bio_set_dev is a helper macro or not. If not, then the attempt to override its internal call to bio_associate_blkg(), with a macro definition to our own version, is no longer possible, as the compiler won't use it when compiling the new inline function replacement implemented in the header. This change also creates a new vdev_bio_set_dev() function that performs the same work, and also performs the work implemented in vdev_bio_associate_blkg(), as it is the only thing calling that function in our code. Our custom vdev_bio_associate_blkg() is now only compiled if the bio_set_dev() is a macro in the Linux headers. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12819	2021-12-07 13:14:23 -08:00
Coleman Kane	9519fe1ff8	Linux 5.16: type member of iov_iter renamed iter_type The iov_iter->type member was renamed iov_iter->iter_type. However, while looking into this, realized that in 2018 a iov_iter_type(*iov) accessor function was introduced. So if that is present, use it, otherwise fall back to trying the existing behavior of directly accessing type from iov_iter. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12819	2021-12-07 13:14:23 -08:00
Coleman Kane	0c40ff56f2	Linux 5.16: block_device_operations->submit_bio now returns void The return type for the submit_bio member of struct block_device_operations was changed to no longer return a value. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12819	2021-12-07 13:14:23 -08:00
Brian Behlendorf	16da688f25	Linux 5.13 compat: retry zvol_open() when contended Due to a possible lock inversion the zvol open call path on Linux needs to be able to retry in the case where the spa_namespace_lock cannot be acquired. For Linux 5.12 an older kernel this was accomplished by returning -ERESTARTSYS from zvol_open() to request that blkdev_get() drop the bdev->bd_mutex lock, reaquire it, then call the open callback again. However, as of the 5.13 kernel this behavior was removed. Therefore, for 5.12 and older kernels we preserved the existing retry logic, but for 5.13 and newer kernels we retry internally in zvol_open(). This should always succeed except in the case where a pool's vdev are layed on zvols, in which case it may fail. To handle this case vdev_disk_open() has been updated to retry when opening a device when -ERESTARTSYS is returned. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #12301 Closes #12759	2021-12-06 12:22:57 -08:00
Coleman Kane	12d27e7134	Linux 5.16: wait_on_page_bit() no longer available to modules Instead, linux/pagemap.h offers a number of folio-specific functions to be called instead. In this case, module/os/linux/zfs/zfs_vnops_os.c wants to call wait_on_page_bit(pp, PG_writeback). This gets replaced with folio_wait_bit(folio_page(pp), PG_writeback). This change modifies the code to conditionally compile that if configure identifies th presence of the folio_wait_bit() function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12800	2021-12-06 12:22:38 -08:00
Jorgen Lundman	a1a29bf8fc	Iterate encrypted clones at zvol_create_minor Userland figures out which encryption-root keys are required to load, and issues ZFS_IOC_LOAD_KEY. The tail section of spa_keystore_load_wkey() will call zvol_create_minors() on the encryption-root object. Any clones of the encrypted zvol will not be plumbed. This commits adds additional logic to detect if zvol has clones, and is encrypted, then adds these to the list of zvols to call zvol_create_minors() on. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12471	2021-12-06 12:22:32 -08:00
Brian Behlendorf	d7e640cf95	Restore dirty dnode detection logic In addition to flushing memory mapped regions when checking holes, commit `de198f2d95` modified the dirty dnode detection logic to check the dn->dn_dirty_records instead of the dn->dn_dirty_link. Relying on the dirty record has not be reliable, switch back to the previous method. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12745	2021-11-05 09:45:04 -07:00
Brian Behlendorf	664d487a5d	Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency When using lseek(2) to report data/holes memory mapped regions of the file were ignored. This could result in incorrect results. To handle this zfs_holey_common() was updated to asynchronously writeback any dirty mmap(2) regions prior to reporting holes. Additionally, while not strictly required, the dn_struct_rwlock is now held over the dirty check to prevent the dnode structure from changing. This ensures that a clean dnode can't be dirtied before the data/hole is located. The range lock is now also taken to ensure the call cannot race with zfs_write(). Furthermore, the code was refactored to provide a dnode_is_dirty() helper function which checks the dnode for any dirty records to determine its dirtiness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12724	2021-11-05 08:08:55 -07:00
Brian Behlendorf	22b0891dbb	Linux 5.16 compat: submit_bio() The submit_bio() prototype has changed again. The version is 5.16 still only expects a single argument but the return type has changed to void. Since we never used the returned value before update the configure check to detect both single arg versions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12725	2021-11-05 07:51:21 -07:00
Tony Hutter	586b5d366e	Rescan enclosure sysfs path on import When you create a pool, zfs writes vd->vdev_enc_sysfs_path with the enclosure sysfs path to the fault LEDs, like: vdev_enc_sysfs_path = /sys/class/enclosure/0:0:1:0/SLOT8 However, this enclosure path doesn't get updated on successive imports even if enclosure path to the disk changes. This patch fixes the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #11950 Closes #12095	2021-11-02 16:31:05 -07:00
Ryan Moeller	27d9c6ae2b	FreeBSD: Catch up with recent VFS changes cn_thread is always curthread. https://cgit.freebsd.org/src/commit?id=b4a58fbf640409a1e507d9f7b411c83a3f83a2f3 https://cgit.freebsd.org/src/commit?id=2b68eb8e1dbbdaf6a0df1c83b26f5403ca52d4c3 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #12668	2021-11-02 13:48:54 -07:00
Brian Behlendorf	143476ce8d	Use fallthrough macro As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12441	2021-11-02 09:50:30 -07:00
Arun KV	bb80b4649a	Fixed data integrity issue when underlying disk returns error Errors in zil_lwb_write_done() are not propagated to zil_lwb_flush_vdevs_done() which can result in zil_commit_impl() not returning an error to applications even when zfs was not able to write data to the disk. Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to allow errors to propagate and consolidate the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers). Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Arun KV <arun.kv@datacore.com> Closes #12391 Closes #12443	2021-09-14 15:45:30 -07:00
Brian Behlendorf	9183321501	Verify embedded blkptr's in arc_read() The block pointer verification check in arc_read() should also cover embedded block pointers. While highly unlikely, accessing a damaged block pointer can result in panic. To further harden the code extend the existing check to include embedded block pointers and add a comment explaining the rational for this sanity check. Lastly, correct a flaw in zfs_blkptr_verify() so the error count is checked even when checking a untrusted config to verify the non-pool-specific portions of a block pointer. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12535	2021-09-14 15:43:18 -07:00
Brian Behlendorf	32512acbc0	Linux 5.15 compat: get_acl() Kernel commits 332f606b32b6 ovl: enable RCU'd ->get_acl() 0cad6246621b vfs: add rcu argument to ->get_acl() callback Added compatibility code to detect the new ->get_acl() interface and correctly handle the case where the new rcu argument is set. Reviewed-by: Coleman Kane <ckane@colemankane.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12548	2021-09-14 15:42:59 -07:00
Allan Jude	cea0752f8d	Allow sending corrupt snapshots even if metadata is corrupted When zfs_send_corrupt_data is set, use the TRAVERSE_HARD flag, so traverse_visitbp() will not fail with ECKSUM if a blockpointer cannot be read, but rather will continue and send the objects it can. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-By: Klara Inc. Sponsored-By: WHC Online Solutions Inc. Closes #12541	2021-09-14 15:42:49 -07:00
Rich Ercolani	7d70f1e099	arc: Drop an incorrect assert Unfortunately, there was an overzealous assertion that was (in pretty specific circumstances) false, causing failure. This assertion was added in error, so we're removing it. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #9897 Closes #12020 Closes #12246	2021-09-14 15:42:33 -07:00
Paul Dagnelie	fd92825445	Compressed receive with different ashift can result in incorrect PSIZE on disk We round up the psize to the nearest multiple of the asize or to the lsize, whichever is smaller. Once that's done, we allocate a new buffer of the appropriate size, zero the tail, and copy the data into it. This adds a small performance cost to these kinds of writes, but fixes the bookkeeping problems. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12522 Closes #8462	2021-09-14 15:42:17 -07:00
Ryan Moeller	81611683c8	FreeBSD: Don't remove SA xattr if not SA znode We attempt to remove an existing SA xattr when setting a dir xattr, but this only makes sense if the znode has been upgraded to the SA format. Otherwise, we will hit an assert in zfs_sa_get_xattr. Make sure this is an SA znode before attempting to remove the SA xattr. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12514	2021-09-14 15:11:56 -07:00
Rich Ercolani	72a989cf60	Fix cross-endian interoperability of zstd It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12008 Closes #12022	2021-09-14 15:05:55 -07:00
Mark Johnston	2016d7fb9c	Initialize parity blocks before RAID-Z reconstruction benchmarking benchmark_raidz() allocates a row to benchmark parity calculation and reconstruction. In the latter case, the parity blocks are left uninitialized, leading to reports from KMSAN. Initialize parity blocks to 0xAA as we do for the data earlier in the function. This does not affect the selected RAID-Z implementation on any of several systems tested. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12473	2021-09-14 14:32:16 -07:00
Richard Yao	1655ce5619	Linux 4.11 compat: statx support Linux 4.11 added a new statx system call that allows us to expose crtime as btime. We do this by caching crtime in the znode to match how atime, ctime and mtime are cached in the inode. statx also introduced a new way of reporting whether the immutable, append and nodump bits have been set. It adds support for reporting compression and encryption, but the semantics on other filesystems is not just to report compression/encryption, but to allow it to be turned on/off at the file level. We do not support that. We could implement semantics where we refuse to allow user modification of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc() to find out encryption/compression information. That would introduce locking that will have a minor (although unmeasured) performance cost. It also would be inferior to zdb, which reports far more detailed information. We therefore omit reporting of encryption/compression through statx in favor of recommending that users interested in such information use zdb. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #8507	2021-09-14 14:31:50 -07:00
Alexander Motin	a4862125b8	Remove b_pabd/b_rabd allocation from arc_hdr_alloc() When a header is allocated for full overwrite it is a waste of time to allocate b_pabd/b_rabd for it, since arc_write() will free them without ever being touched. If it is a read or a partial overwrite then arc_read() and arc_hdr_decrypt() allocate them explicitly. Reduced memory allocation in user threads also reduces ARC eviction throttling there, proportionally increasing it in ZIO threads, that is not good. To minimize or even avoid it introduce ARC allocation reserve, allowing certain arc_get_data_abd() callers to allocate a bit longer in situations where user threads will already throttle. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12398	2021-09-14 14:31:50 -07:00
Alexander Motin	61773f41b8	Optimize arc_l2c_only lists assertions It is very expensive and not informative to call multilist_is_empty() for each arc_change_state() on debug builds to check for impossible. Instead implement special index function for arc_l2c_only->arcs_list, multilists, panicking on any attempt to use it. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12421	2021-09-14 14:31:22 -07:00
Alexander Motin	40e02f49e9	Fix/improve dbuf hits accounting Instead of clearing stats inside arc_buf_alloc_impl() do it inside arc_hdr_alloc() and arc_release(). It fixes statistics being wiped every time a new dbuf is filled from the ARC. Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits. Since the hits are accounted under hash lock, replace atomics with simple increments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12422	2021-09-14 14:31:22 -07:00
Alexander Motin	c600f0687f	Avoid vq_lock drop in vdev_queue_aggregate() vq_lock is already too congested for two more operations per I/O. Instead of dropping and reacquiring it inside vdev_queue_aggregate() delegate the zio_vdev_io_bypass() and zio_execute() calls for parent I/Os to callers, that drop the lock any way to execute the new I/O. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12297	2021-09-14 14:31:22 -07:00
Alexander Motin	5afc35b698	Use more atomics in refcounts Use atomic_load_64() for zfs_refcount_count() to prevent torn reads on 32-bit platforms. On 64-bit ones it should not change anything. When built with ZFS_DEBUG but running without tracking enabled use atomics instead of mutexes same as for builds without ZFS_DEBUG. Since rc_tracked can't change live we can check it without lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12420	2021-09-14 14:31:01 -07:00
Allan Jude	24e51e3749	Restore FreeBSD sysctl processing for arc.min and arc.max Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max to a disallowed value would return an error. Since the switch, it instead only generates WARN_IF_TUNING_IGNORED Keep the ability to set the sysctl's specifically to 0, even though that is less than the minimum, because some tests depend on this. Also lost, was the ability to set vfs.zfs.arc_max to a value less than the default vfs.zfs.arc_min at boot time. Restore this as well. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #12161	2021-09-14 14:31:01 -07:00
Ryan Moeller	744f3009fc	zfs: add missed dependency of zfs module on zlib Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Martin Matuska <mm@FreeBSD.org> Co-authored-by: Konstantin Belousov <kib@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> External-issue: https://reviews.freebsd.org/D31207 Closes #12442	2021-09-14 14:30:39 -07:00
Tony Nguyen	477edd642c	Run arc_evict thread at higher priority Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397	2021-09-14 14:30:13 -07:00
Brian Behlendorf	32a971e749	Enable /proc/diskstats for zvols The /proc/diskstats accounting needs to be explicitly enabled for block devices which do not use multi-queue. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12440 Closes #12066	2021-09-14 14:30:13 -07:00
hedongzhang	ddb732e2c8	Modify checksum obtain method of QAT CpaDcGeneratefooter function that obtain the checksum code does not support the CPA_DC_STATELESS mode. So we get the adler32 chencksum of the end of the zlib from dc_results. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chengfei Zhu <chengfeix.zhu@intel.com> Signed-off-by: hedong.zhang <h_d_zhang@163.com> Closes #12343	2021-09-14 14:29:46 -07:00

... 2 3 4 5 6 ...

3791 Commits