Archive-Team/zfs

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	a6e8113fed	Silence -Winfinite-recursion warning in luaD_throw() This code should be kept inline with the upstream lua version as much as possible. Therefore, we simply want to silence the warning. This check was enabled by default as part of -Wall in gcc 12.1. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:18:50 -07:00
George Amanakis	80a650b7bb	Avoid panic with recordsize > 128k, raw sending and no large_blocks The current codebase does not support raw sending buffers with block size > 128kB when large_blocks is not active. This can happen in the codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which calls back dmu_objset_write_done()->dsl_dataset_block_born(). If dsl_dataset_sync() completes its run before dsl_dataset_block_born() is called, we will end up not activating some of the necessary flags, while having blocks based on those flags written in the filesystem. A subsequent send will then panic. Fix this by directly deciding in dmu_objset_sync() whether these flags need to be activated later by dsl_dataset_sync(). Instead of panicking due to a NULL pointer dereference in dmu_dump_write() in case of a send, print out an error message. Also during scrub verify there are no contradicting filesystem flags. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12275 Closes #12438	2022-06-27 14:17:25 -07:00
Alexander Motin	1cd72b9c13	Avoid two 64-bit divisions per scanned block Change math to make it like the ARC, using multiplications instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13591	2022-06-27 11:08:21 -07:00
Alexander Motin	c0bf952c84	Several B-tree optimizations - Introduce first element offset within a leaf. It allows to reduce by ~50% average memmove() size when adding/removing elements. If the added/removed element is in the first half of the leaf, we may shift elements before it and adjust the bth_first instead of moving more elements after it. - Use memcpy() instead of memmove() when we know there is no overlap. - Switch from uint64_t to uint32_t. It does not limit anything, but 32-bit arches should appreciate it greatly in hot paths. - Store leaf capacity in struct btree to avoid 64-bit divisions. - Adjust zfs_btree_insert_into_leaf() to always result in balanced leaves after splitting, no matter where the new element was inserted. Not that we care about it much, but it should also allow B-trees with as little as two elements per leaf instead of 4 previously. When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this reduces amount of time spent in memmove() inside the scan thread from 13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes. It should also reduce spacemaps load time, but I haven't measured it. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13582	2022-06-24 13:55:58 -07:00
Alexander Motin	1c0c729ab4	Several sorted scrub optimizations - Reduce size and comparison complexity of q_exts_by_size B-tree. Previous code used two 64-bit divisions and many other operations to compare two B-tree elements. It created enormous overhead. This implementation moves the math to the upper level and stores the score in the B-tree elements themselves. Since all that we need to store in that B-tree is the extent score and offset, those can fit into single 8 byte value instead of 24 bytes of q_exts_by_addr element and can be compared with single operation. - Better decouple secondary tree logic from main range_tree by moving rt_btree_ops and related functions into dsl_scan.c as ext_size_ops. Those functions are very small to worry about the code duplication and range_tree does not need to know details such as rt_btree_compare. - Instead of accounting number of pending bytes per pool, that needs atomic on global variable per block, account the number of non-empty per-vdev queues, that change much more rarely. - When extent scan is interrupted by TXG end, continue it in the next TXG instead of selecting next best extent. It allows to avoid leaving one truncated (and so likely not the best any more) extent each TXG. On top of some other optimizations this saves about 1.5 minutes out of 10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13576	2022-06-24 09:50:37 -07:00
Brian Behlendorf	ad8b9f940c	Scrub mirror children without BPs When scrubbing a raidz/draid pool, which contains a replacing or sparing mirror with multiple online children, only one child will be read. This is not normally a serious concern because the DTL records are used to determine where a good copy of the data is. As long as the data can be read from one child the mirror vdev will use it to repair gaps in any of its children. Furthermore, even if the data which was read is corrupt the raidz code will detect this and issue its own repair I/O to correct the damage in the mirror vdev. However, in the scenario where the DTL is wrong due to silent data corruption (say due to overwriting one child) and the scrub happens to read from a child with good data, then the other damaged mirror child will not be detected nor repaired. While this is possible for both raidz and draid vdevs, it's most pronounced when using draid. This is because by default the zed will sequentially rebuild a draid pool to a distributed spare, and the distributed spare half of the mirror is always preferred since it delivers better performance. This means the damaged half of the mirror will go undetected even after scrubbing. For system administrators this behavior is non-intuitive and in a worst case scenario could result in the only good copy of the data being unknowingly detached from the mirror. This change resolves the issue by reading all replacing/sparing mirror children when scrubbing. When the BP isn't available for verification, then compare the data buffers from each child. They must all be identical, if not there's silent damage and an error is returned to prompt the top-level vdev to issue a repair I/O to rewrite the data on all of the mirror children. Since we can't tell which child was wrong a checksum error is logged against the replacing or sparing mirror vdev. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13555	2022-06-23 10:36:28 -07:00
Tino Reichardt	deb1213098	Fix memory allocation issue for BLAKE3 context The kmem_alloc(sizeof (*ctx), KM_NOSLEEP) call on FreeBSD can't be used in this code segment. Work around this by pre-allocating a percpu context array for later use. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13568	2022-06-21 14:32:09 -07:00
Alexander Motin	d51f4ea5f9	FreeBSD: Improve crypto_dispatch() handling Handle crypto_dispatch() return values same as crp->crp_etype errors. On FreeBSD 12 many drivers returned same errors both ways, and lack of proper handling for the first ended up in assertion panic later. It was changed in FreeBSD 13, but there is no reason to not be safe. While there, skip waiting for completion, including locking and wakeup() call, for sessions on synchronous crypto drivers, such as typical aesni and software. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13563	2022-06-17 15:38:51 -07:00
Andrew	f609739985	expose snapshot count via stat(2) of .zfs/snapshot (#13559) Increase nlinks in stat results of .zfs/snapshot based on snapshot count. This provides a quick and efficient method for administrators to get snapshot counts without having to use libzfs or list the snapdir contents. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #13559	2022-06-17 11:44:49 -07:00
Alexander Motin	dd8671459f	Reduce ZIO io_lock contention on sorted scrub During sorted scrub multiple threads (one per vdev) are issuing many ZIOs same time, all using the same scn->scn_zio_root ZIO as parent. It causes huge lock contention on the single global lock on that ZIO. Improve it by introducing per-queue null ZIOs, children to that one, and using them instead as proxy. For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this dramatically reduces lock contention and reduces scrub time from 21 minutes down to 12.5, while actual read stages (not scan) are about 3x faster, reaching 100K blocks per second per vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13553	2022-06-15 14:25:08 -07:00
crass	bc00d2c711	Add support for ARCH=um for x86 sub-architectures When building modules (as well as the kernel) with ARCH=um, the options -Dsetjmp=kernel_setjmp and -Dlongjmp=kernel_longjmp are passed to the C preprocessor for C files. This causes the setjmp and longjmp used in module/lua/ldo.c to be kernel_setjmp and kernel_longjmp respectively in the object file. However, the setjmp and longjmp that is intended to be called is defined in an architecture dependent assembly file under the directory module/lua/setjmp. Since it is an assembly and not a C file, the preprocessor define is not given and the names do not change. This becomes an issue when modpost is trying to create the Module.symvers and sees no defined symbol for kernel_setjmp and kernel_longjmp. To fix this, if the macro CONFIG_UML is defined, then setjmp and longjmp macros are undefined. When building with ARCH=um for x86 sub-architectures, CONFIG_X86 is not defined. Instead, CONFIG_UML_X86 is defined. Despite this, the UML x86 sub-architecture can use the same object files as the x86 architectures because the x86 sub-architecture UML kernel is running with the same instruction set as CONFIG_X86. So the modules/Kbuild build file is updated to add the same object files that CONFIG_X86 would add when CONFIG_UML_X86 is defined. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Glenn Washburn <development@efficientek.com> Closes #13547	2022-06-15 14:22:52 -07:00
Damian Szuberski	9884319666	Fix clang 13 compilation errors ``` os/linux/zfs/zvol_os.c:1111:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] add_disk(zv->zv_zso->zvo_disk); ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~ zpl_xattr.c:1579:1: warning: no previous prototype for function 'zpl_posix_acl_release_impl' [-Wmissing-prototypes] ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13551	2022-06-15 14:20:28 -07:00
Allan Jude	4ff7a8fa2f	Replace ZPROP_INVAL with ZPROP_USERPROP where it means a user property Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Klara Inc. Closes #12676	2022-06-14 11:27:53 -07:00
Ryan Moeller	9e605cf155	spl: Use a clearer name for the user namespace fd This fd has nothing to do with cleanup, that's just the name of the field in zfs_cmd_t that was used to pass it to the kernel. Call it what it is, an fd for a user namespace. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #13554	2022-06-14 08:14:19 -07:00
Alexander Motin	87b46d63b2	Improve sorted scan memory accounting Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should count 2x sizeof (range_seg_gap_t) per node. And since average B-tree memory efficiency is about 75%, we should increase it to 3x. Previous code under-counted up to 30% of the memory usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13537	2022-06-10 10:01:46 -07:00
Will Andrews	4ed5e25074	Add Linux namespace delegation support This allows ZFS datasets to be delegated to a user/mount namespace. Within that namespace, only the delegated datasets are visible. Works very similarly to Zones/Jails on other ZFS OSes. As a user: ``` $ unshare -Um $ zfs list no datasets available $ echo $$ 1234 ``` As root: ``` # zfs list NAME ZONED MOUNTPOINT containers off /containers containers/host off /containers/host containers/host/child off /containers/host/child containers/host/child/gchild off /containers/host/child/gchild containers/unpriv on /unpriv containers/unpriv/child on /unpriv/child containers/unpriv/child/gchild on /unpriv/child/gchild # zfs zone /proc/1234/ns/user containers/unpriv ``` Back to the user namespace: ``` $ zfs list NAME USED AVAIL REFER MOUNTPOINT containers 129M 47.8G 24K /containers containers/unpriv 128M 47.8G 24K /unpriv containers/unpriv/child 128M 47.8G 128M /unpriv/child ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Will Andrews <will.andrews@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Buddy <https://buddy.works> Closes #12263	2022-06-10 09:51:46 -07:00
Alexander Motin	fc5200aa9b	AVL: Remove obsolete branching optimizations Modern Clang and GCC can successfully implement simple conditions without branching with math and flag operations. Use of arrays for translation no longer helps as much as it did 14+ years ago. Disassembly of the code generated by Clang 13.0.0 on FreeBSD 13.1, Clang 14.0.4 on FreeBSD 14 and GCC 10.2.1 on Debian 11 with this change still shows no branching instructions. Profiling of CPU-bound scan stage of sorted scrub shows reproducible reduction of time spent inside avl_find() from 6.52% to 4.58%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13540	2022-06-09 15:27:36 -07:00
Tony Hutter	6f73d02168	zvol: Support blk-mq for better performance Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes. This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the older BIO interfaces. Note that you must set the `zvol_use_blk_mq` module param to enable the blk-mq API. It is disabled by default. In addition, this commit lets the zvol blk-mq layer process whole `struct request` IOs at a time, rather than breaking them down into their individual BIOs. This reduces dbuf lock contention and overhead versus the legacy zvol submit_bio() codepath. sequential dd to one zvol, 8k volblocksize, no O_DIRECT: legacy submit_bio() 292MB/s write 453MB/s read this commit 453MB/s write 885MB/s read It also introduces a new `zvol_blk_mq_chunks_per_thread` module parameter. This parameter represents how many volblocksize'd chunks to process per each zvol thread. It can be used to tune your zvols for better read vs write performance (higher values favor write, lower favor read). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #13148 Issue #12483	2022-06-09 08:10:38 -06:00
Tino Reichardt	985c33b132	Introduce BLAKE3 checksums as an OpenZFS feature This commit adds BLAKE3 checksums to OpenZFS, it has similar performance to Edon-R, but without the caveats around the latter. Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3 Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3 Short description of Wikipedia: BLAKE3 is a cryptographic hash function based on Bao and BLAKE2, created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real World Crypto. BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given enough input. The official Rust and C implementations are dual-licensed as public domain (CC0) and the Apache License. Along with adding the BLAKE3 hash into the OpenZFS infrastructure a new benchmarking file called chksum_bench was introduced. When read it reports the speed of the available checksum functions. On Linux: cat /proc/spl/kstat/zfs/chksum_bench On FreeBSD: sysctl kstat.zfs.misc.chksum_bench This is an example output of an i3-1005G1 test system with Debian 11: implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1196 1602 1761 1749 1762 1759 1751 skein-generic 546 591 608 615 619 612 616 sha256-generic 240 300 316 314 304 285 276 sha512-generic 353 441 467 476 472 467 426 blake3-generic 308 313 313 313 312 313 312 blake3-sse2 402 1289 1423 1446 1432 1458 1413 blake3-sse41 427 1470 1625 1704 1679 1607 1629 blake3-avx2 428 1920 3095 3343 3356 3318 3204 blake3-avx512 473 2687 4905 5836 5844 5643 5374 Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1840 2458 2665 2719 2711 2723 2693 skein-generic 870 966 996 992 1003 1005 1009 sha256-generic 415 442 453 455 457 457 457 sha512-generic 608 690 711 718 719 720 721 blake3-generic 301 313 311 309 309 310 310 blake3-sse2 343 1865 2124 2188 2180 2181 2186 blake3-sse41 364 2091 2396 2509 2463 2482 2488 blake3-avx2 365 2590 4399 4971 4915 4802 4764 Output on Debian 5.10.0-9-powerpc64le system: (POWER 9) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1213 1703 1889 1918 1957 1902 1907 skein-generic 434 492 520 522 511 525 525 sha256-generic 167 183 187 188 188 187 188 sha512-generic 186 216 222 221 225 224 224 blake3-generic 153 152 154 153 151 153 153 blake3-sse2 391 1170 1366 1406 1428 1426 1414 blake3-sse41 352 1049 1212 1174 1262 1258 1259 Output on Debian 5.10.0-11-arm64 system: (Pi400) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 487 603 629 639 643 641 641 skein-generic 271 299 303 308 309 309 307 sha256-generic 117 127 128 130 130 129 130 sha512-generic 145 165 170 172 173 174 175 blake3-generic 81 29 71 89 89 89 89 blake3-sse2 112 323 368 379 380 371 374 blake3-sse41 101 315 357 368 369 364 360 Structurally, the new code is mainly split into these parts: - 1x cross platform generic c variant: blake3_generic.c - 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512) - 2x assembly for ARMv8 (NEON converted from SSE2) - 2x assembly for PPC64-LE (POWER8 converted from SSE2) - one file for switching between the implementations Note the PPC64 assembly requires the VSX instruction set and the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly. Reviewed-by: Felix Dörre <felix@dogcraft.de> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Closes #10058 Closes #12918	2022-06-08 15:55:57 -07:00
Alexander Motin	42cf2ad0e4	Remove wrong assertion in log spacemap It is typical, but not generally true that if log summary has more blocks it must also have unflushed metaslabs. Normally with metaslabs flushed in order it works, but there are known exceptions, such as device removal or metaslab being loaded during its flush attempt. Before `600a02b884` if spa_flush_metaslabs() hit loading metaslab it usually stopped (unless memlimit is also exceeded), but now it may flush more metaslabs, just skipping that particular one. This increased chances of assertion to fire when the skipped metaslab is flushed on next iteration if all other metaslabs in that summary entry are already flushed out of order. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13486 Closes #13513	2022-06-01 09:54:35 -07:00
Allan Jude	2310dba9eb	Fix typo in zil_commit() comment block Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #13518	2022-05-31 15:37:46 -07:00
Brian Behlendorf	c2c2e7bb8b	Linux 5.19 compat: aops->read_folio() As of the Linux 5.19 kernel the readpage() address space operation has been replaced by read_folio(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:31 -07:00
Brian Behlendorf	a12a5cb5b8	Linux 5.19 compat: blkdev_issue_secure_erase() Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure erase functionality from the blkdev_issue_discard() function. The blkdev_issue_secure_erase() must now be issued to issue a secure erase. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:26 -07:00
Brian Behlendorf	e2c31f2bc7	Linux 5.19 compat: bdev_max_secure_erase_sectors() Linux 5.19 commit torvalds/linux@44abff2c0 removed the blk_queue_secure_erase() helper function. The preferred interface is to now use the bdev_max_secure_erase_sectors() function to check for discard support. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:22 -07:00
Brian Behlendorf	5e4aedaca7	Linux 5.19 compat: bdev_max_discard_sectors() Linux 5.19 commit torvalds/linux@70200574cc removed the blk_queue_discard() helper function. The preferred interface is to now use the bdev_max_discard_sectors() function to check for discard support. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:17 -07:00
Brian Behlendorf	5f264996f4	Linux 5.18 compat: bio_alloc() As of the Linux 5.18 kernel bio_alloc() expects a block_device struct as an argument. This removes the need for the bio_set_dev() compatibility code for 5.18 and newer kernels. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13515	2022-05-31 12:04:03 -07:00
Kevin Jin	152d6fda54	Fix inflated quiesce time caused by lwb_tx during zil_commit() In current zil_commit() process, transaction lwb_tx is assigned in zil_lwb_write_issue(), and is committed in zil_lwb_flush_vdevs_done(). Thus, during lwb write out process, the txg is held in open or quiescing state, until zil_lwb_flush_vdevs_done() is called. If the zil's zio latency is high, it will cause txg_sync_thread() to starve. The goal here is to defer waiting for zil_lwb_flush_vdevs_done to the 'syncing' txg state. That is, in zil_sync(). In this patch, it achieves the goal without holding transaction. A new function zil_lwb_flush_wait_all() is introduced. It waits for the completion of all the zil_lwb_flush_vdevs_done() by given txg. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12321	2022-05-26 09:36:14 -07:00
Alexander Motin	6aa8c21a2a	More speculative prefetcher improvements - Make prefetch distance adaptive: up to 4MB prefetch doubles for every hit, same as before, but after that it grows by 1/8 every time the prefetch read does not complete in time to satisfy the demand. My tests show that 4MB is sufficient for wide NVMe pool to saturate single reader thread at 2.5GB/s, while new 64MB maximum allows the same thread to reach 1.5GB/s on wide HDD pool. Further distance increase may increase speed even more, but less dramatically and with higher latency. - Allow early reuse of inactive prefetch streams: streams that never saw hits can be reused immediately if there is a demand, while others can be reused after 1s of inactivity, starting with the oldest. After 2s of inactivity streams are deleted to free resources same as before. This allows increasing strided read performance by several times on HDD pool in presence of simultaneous random reads, previously filling the zfetch_max_streams limit for seconds and so blocking most of prefetch. - Always issue intermediate indirect block reads with SYNC priority. Each of those reads if delayed for longer may delay up to 1024 other block prefetches, that may be not good for wide pools. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13452	2022-05-25 10:12:52 -07:00
Paul Dagnelie	7829b465a7	Cancel in-progress rebuilds when we finish removal This issue was discovered by zloop runs. When a mirror or other redundant top-level vdev has a disk failure, and the disk is replaced, the rebuild process occurs. A removal can happen while this is in progress. If the removal completes before the rebuild does, the removal process will try to free the vdev that is still in use. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #13498	2022-05-25 09:25:13 -07:00
Alexander Motin	84d0a03f3e	Refactor Log Size Limit Original Log Size Limit implementation blocked all writes in case of limit reached until the TXG is committed and the log is freed. It caused huge delays and following speed spikes in application writes. This implementation instead smoothly throttles writes, using exactly the same mechanism as used for dirty data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: jxdking <lostking2008@hotmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Issue #12284 Closes #13476	2022-05-24 09:46:35 -07:00
Rich Ercolani	f375b23c02	Tiered early abort, zstd edition It turns out that "do LZ4 and zstd-1 both fail" is a great heuristic for "don't even bother trying higher zstd tiers". By way of illustration: $ cat /incompress | mbuffer | zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_normal summary: 39.8 GiByte in 3min 40.2sec - average of 185 MiB/s $ echo 3 | sudo tee /sys/module/zzstd/parameters/zstd_lz4_pass 3 $ cat /incompress | mbuffer -m 4G | zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_patched summary: 39.8 GiByte in 48.6sec - average of 839 MiB/s $ sudo zfs list -p -o name,used,lused,ratio evenfaster/lowcomp_1M_zstd12_normal evenfaster/lowcomp_1M_zstd12_patched NAME USED LUSED RATIO evenfaster/lowcomp_1M_zstd12_normal 39549931520 42721221632 1.08 evenfaster/lowcomp_1M_zstd12_patched 39626399744 42721217536 1.07 $ python3 -c "print(39626399744 - 39549931520)" 76468224 $ I'll take 76 MB out of 42 GB for > 4x speedup. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13244	2022-05-24 09:43:22 -07:00
Brian Behlendorf	2cd0f98f4a	Verify BPs in spa_load_verify_cb() and dsl_scan_visitbp() We want `zpool import` to be highly robust and never panic, even when encountering corrupt metadata. This is already handled in the arc_read() code path, which covers most cases, but spa_load_verify_cb() relies on zio_read() and is responsible for verifying the block pointer. During import it is also possible to encounter block pointers which contain ZIO_COMPRESS_INHERIT and ZIO_CHECKSUM_INHERIT values. Relax the verification function slightly to allow this. Furthermore, extend dsl_scan_recurse() to verify the block pointer contents of level zero blocks which are not of type DMU_OT_DNODE or DMU_OT_OBJSET. This is handled by arc_read() in the other cases. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13124 Closes #13360	2022-05-20 10:36:14 -07:00
Andrew	00ac77464e	Expose zpool guids through kstats There are times when end-users may wish to have a fast and convenient method to get zpool guid without having to use libzfs. This commit exposes the zpool guid via kstats in similar manner to the zpool state. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #13466	2022-05-18 10:25:33 -07:00
наб	de82164518	linux: spl: generic: ddi_strto: match solaris ddi_strto(9) Recognise initial whitespace, + in both cases, and - also in unsigneds Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:47 -07:00
наб	354a1bfb8e	linux: spl: generic: ddi_strtou##type: elide unused flag Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:44 -07:00
наб	c25b281378	Remove hw_serial, ddi_strtoul() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:31 -07:00
Rich Ercolani	bd88c036e6	Added a workaround for Linux KASAN builds Linux passes -Wframe-larger-than=1024, which breaks our build in a number of places with -Werror. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13450	2022-05-11 13:26:55 -07:00
szubersk	e0911f7b7f	autoconf: Fail when __copy_from_user_inatomic is a non-GPL symbol A followup to `849c14e048` Fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009242 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13389	2022-05-11 10:32:51 -07:00
наб	c8970f52ed	autoconf: use include directives instead of recursing down lib As a bonus, this also adds zfs-mount-generator (previously undescended down) and libzstd (not included) to CppCheck As a bonus bonus, abigail rules work out-of-tree, too Against current trunk: $ diff -U0 ./destdir.listing ~/store/code/zfs/destdir.listing -destdir/usr/local/include/libspl/sscanf.h $ diff --color -U0 ./zfs-2.1.99.tar.gz.listing ../oot/zfs-2.1.99.tar.gz.listing | grep -v @@ | grep -v /Makefile -zfs-2.1.99/config/Abigail.am -zfs-2.1.99/lib/libspl/include/util/ -zfs-2.1.99/lib/libspl/include/util/sscanf.h $ diff --color -U0 ./zfs-2.1.99.tar.gz.listing ../oot/zfs-2.1.99.tar.gz.listing | grep -v @@ | grep /Makefile -zfs-2.1.99/lib/libavl/Makefile.in -zfs-2.1.99/lib/libefi/Makefile.in -zfs-2.1.99/lib/libicp/Makefile.in -zfs-2.1.99/lib/libnvpair/Makefile.in -zfs-2.1.99/lib/libshare/Makefile.in -zfs-2.1.99/lib/libspl/include/Makefile.in -zfs-2.1.99/lib/libspl/include/os/freebsd/Makefile.am -zfs-2.1.99/lib/libspl/include/os/freebsd/Makefile.in -zfs-2.1.99/lib/libspl/include/os/freebsd/sys/Makefile.am -zfs-2.1.99/lib/libspl/include/os/freebsd/sys/Makefile.in -zfs-2.1.99/lib/libspl/include/os/linux/Makefile.am -zfs-2.1.99/lib/libspl/include/os/linux/Makefile.in -zfs-2.1.99/lib/libspl/include/os/linux/sys/Makefile.am -zfs-2.1.99/lib/libspl/include/os/linux/sys/Makefile.in -zfs-2.1.99/lib/libspl/include/os/Makefile.am -zfs-2.1.99/lib/libspl/include/os/Makefile.in -zfs-2.1.99/lib/libspl/include/rpc/Makefile.am -zfs-2.1.99/lib/libspl/include/rpc/Makefile.in -zfs-2.1.99/lib/libspl/include/sys/dktp/Makefile.am -zfs-2.1.99/lib/libspl/include/sys/dktp/Makefile.in -zfs-2.1.99/lib/libspl/include/sys/Makefile.am -zfs-2.1.99/lib/libspl/include/sys/Makefile.in -zfs-2.1.99/lib/libspl/include/util/Makefile.am -zfs-2.1.99/lib/libspl/include/util/Makefile.in -zfs-2.1.99/lib/libspl/Makefile.in -zfs-2.1.99/lib/libtpool/Makefile.in -zfs-2.1.99/lib/libunicode/Makefile.in -zfs-2.1.99/lib/libuutil/Makefile.in -zfs-2.1.99/lib/libzfsbootenv/Makefile.in -zfs-2.1.99/lib/libzfs_core/Makefile.in -zfs-2.1.99/lib/libzfs/Makefile.in -zfs-2.1.99/lib/libzpool/Makefile.in -zfs-2.1.99/lib/libzstd/Makefile.in -zfs-2.1.99/lib/libzutil/Makefile.in -zfs-2.1.99/lib/Makefile.in Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13316	2022-05-10 10:18:11 -07:00
наб	6fc34371e1	libzfs: pool: fix false-positives -Wmaybe-uninitialised As noted by gcc (Debian 10.2.1-6) 10.2.1 20210110 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13316	2022-05-10 10:18:06 -07:00
наб	be91239efa	module: Makefile: cppcheck: zfs_config.h lives in builddir Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13316	2022-05-10 10:18:02 -07:00
hping	a18d13c200	abd_os: remove redundant refcount creation for abd_children Refcount creation for abd_zero_scatter->abd_children is redundant in abd_alloc_zero_scatter, as it has been done in abd_init_struct. In addition, abd_children is undefined when ZFS_DEBUG is disabled, the reference of abd_children in abd_alloc_zero_scatter breaks build of libzpool when ZFS_DEBUG is disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Ping Huang <huangping@smartx.com> Closes #13429	2022-05-09 16:30:16 -07:00
Aidan Harris	493b6e5607	Fix functions without a prototype clang-15 emits the following error message for functions without a prototype: fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aidan Harris <me@aidanharr.is> Closes #13421	2022-05-06 11:57:37 -07:00
Rich Ercolani	7bf06f7262	Corrected edge case in uncompressed ARC->L2ARC handling I genuinely don't know why this didn't come up before, but adding the LZ4 early abort pointed out this flaw, in which we're allocating a buffer of one size, and then telling the compressor that we're handing it buffers of a different size, which may be Very Different - say, allocating 512b and then telling it the inputs are 128k. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13375	2022-05-04 11:59:30 -07:00
Mateusz Guzik	81b8b2d004	FreeBSD: use zero_region instead of allocating a dedicated page Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #13406	2022-05-04 11:46:37 -07:00
Alexander Motin	c55b293287	Improve mg_aliquot math When calculating mg_aliquot, as in #12046, use the number of unique data disks in the vdev, not the total number of child vdevs. Increase default value of the tunable from 512KB to 1MB to compensate. Before this change each disk in striped pool was getting 512KB of sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB. After this change in all the cases each disk should get 1MB. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13388	2022-05-04 11:33:42 -07:00
Brian Behlendorf	34dbc618f5	Reduce dbuf_find() lock contention Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggravated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somewhat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the situation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13405	2022-05-04 11:17:29 -07:00
Shaan Nobee	411f4a018d	Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shaan Nobee <sniper111@gmail.com> Closes #12662 Closes #12790	2022-05-03 13:23:26 -07:00
Pawel Jakub Dawidek	a64d757aa4	FreeBSD: Clean up the use of ioflags - Prefer O_* flags over F* flags that mostly mirror O_* flags anyway, but O_* flags seem to be preferred. - Simplify the code as all the F*SYNC flags were defined as FFSYNC flag. - Don't define FRSYNC flag, so we don't generate unnecessary ZIL commits. - Remove EXCL define, FreeBSD ignores the excl argument for zfs_create() anyway. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13400	2022-05-02 16:26:28 -07:00
Jitendra Patidar	159c6fd154	Add missing replay entry in zvol_replay_vector for TX_SETSAXATTR Commit `361a7e8` (log xattr=sa create/remove/update to ZIL) introduced TX_SETSAXATTR, but did not add a corresponding entry in zvol_replay_vector. Add the missing replay entry in zvol_replay_vector. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #13396 Closes #13395	2022-05-02 11:01:26 -07:00
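
Note on c55b293287 (Improve mg_aliquot math): the per-disk figures quoted in that commit message can be reproduced with a little arithmetic. The following is only a sketch of that arithmetic, not the ZFS implementation; the function names and the simplified vdev model (a child count, a parity count, and a mirror flag) are assumptions made for illustration. It assumes the old math scaled the tunable by the total child count, and the new math scales it by the number of unique data disks.

```python
KIB = 1024
MIB = 1024 * KIB


def unique_data_disks(children, nparity=0, mirror=False):
    """Hypothetical helper: disks in a top-level vdev that hold unique data.

    A mirror stores the same data on every child, so it counts as one.
    RAIDZ spends nparity children's worth of space on parity.
    """
    if mirror:
        return 1
    return max(1, children - nparity)


def sequential_bytes_per_disk(tunable, children, nparity=0, mirror=False,
                              new_math=False):
    """Sequential data each disk receives before allocation moves on.

    Old math (assumed): aliquot = tunable * total children.
    New math (per the commit message): aliquot = tunable * unique data disks.
    """
    ndata = unique_data_disks(children, nparity, mirror)
    width = ndata if new_math else max(1, children)
    aliquot = tunable * width
    return aliquot // ndata


layouts = {
    "single-disk stripe": dict(children=1),
    "2-wide mirror": dict(children=2, mirror=True),
    "3-wide RAIDZ1": dict(children=3, nparity=1),
}

for name, layout in layouts.items():
    old = sequential_bytes_per_disk(512 * KIB, **layout)
    new = sequential_bytes_per_disk(1 * MIB, new_math=True, **layout)
    print(f"{name}: old {old // KIB} KiB, new {new // KIB} KiB")
```

Under these assumptions the old 512KB default yields 512KB, 1MB and 768KB per disk for the three layouts, and the new 1MB default yields 1MB for all of them, matching the numbers in the commit message.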

3763 Commits