Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Atkinson	72f674a22b	Updating based on PR Feedback(5) 1. Added a new module parameter, zfs_dio_enabled, which allows all reads and writes to be forced through the ARC. This module parameter can be set to 0 by default in the OpenZFS 2.3 release if necessary. 2. Updated ZTS direct tests to account for the new zfs_dio_enabled module parameter. 3. Updated libzfs.abi to account for changes. Signed-off-by: Brian Atkinson <batkinson@lanl.gov>	2024-09-04 17:08:56 -06:00
Brian Atkinson	71ce314930	Updating based on PR Feedback(4) 1. When testing out installing a VM with virtual manager on Linux and a dataset with direct=always, there was an ASSERT failure in abd_alloc_from_pages(). Originally zfs_setup_direct() did an alignment check of the UIO using SPA_MINBLOCKSIZE with zfs_uio_aligned(). The idea behind this was that the page alignment restriction might be changed to use ashift as the alignment check in the future. However, this idea never came to be. The alignment restrictions for Direct I/O are based on PAGE_SIZE. Updating the check in zfs_setup_direct() for the UIO to use PAGE_SIZE fixed the issue. 2. Updated the other alignment check in dmu_read_impl() to also use PAGE_SIZE. 3. As a consequence of updating the UIO alignment checks, the ZTS test case dio_unaligned_filesize began to fail. This is because there was no way to detect reading past the end of the file before issuing EINVAL in the ZPL and VOPs layers in FreeBSD. This was resolved by moving zfs_setup_direct() into zfs_write() and zfs_read(). This allows for other error checking to take place before checking any Direct I/O limitations. Updating the call site of zfs_setup_direct() did require some changes to the logic in that function. In particular, Direct I/O can just be avoided altogether depending on the checks in zfs_setup_direct() and there is no reason to return EAGAIN at all. 4. After moving zfs_setup_direct() into zfs_write() and zfs_read(), there was no reason to call zfs_check_direct_enabled() in the ZPL layer in Linux or in the VNOPS layer of FreeBSD. This function was completely removed. This allowed much of the code in both those layers to return to its original form. 5. Updated the checksum verify module parameter for Direct I/O writes to only be a boolean and return EIO in the event a checksum verify failure occurs. By default, this module parameter is set to 1 for Linux and 0 for FreeBSD. The module parameter has been changed to zfs_vdev_direct_write_verify.
There are still counters on the top-level VDEV for checksum verify failures, but these could be removed. It would still be good to leave the ZED event dio_verify for checksum failures as a notification that an application was manipulating the contents of a buffer after issuing that buffer for I/O using Direct I/O. As part of this change, man pages were updated, the ZTS test case dio_write_verify was updated, and all comments relating to the module parameter were updated as well. 6. Updated comments in dio_property ZTS test to properly reflect that stride_dd is being called with check_write and check_read. Signed-off-by: Brian Atkinson <batkinson@lanl.gov>	2024-09-04 17:08:56 -06:00
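The PAGE_SIZE alignment requirement described in this commit can be illustrated with a minimal user-space sketch. It is not the OpenZFS code: the `demo_` function names are hypothetical, and it assumes that both the base address and the length of every I/O segment must be a multiple of the page size, which the caller passes in rather than reading from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: true when addr is a multiple of page_size.
 * The pointer is cast to uintptr_t before doing integer math on it.
 */
bool
demo_addr_page_aligned(const void *addr, size_t page_size)
{
    return (((uintptr_t)addr % page_size) == 0);
}

/*
 * Hypothetical helper: true when every segment's base address and
 * length are page aligned (an assumption about what "aligned UIO" means).
 */
bool
demo_iov_page_aligned(const struct iovec *iov, int iovcnt, size_t page_size)
{
    for (int i = 0; i < iovcnt; i++) {
        if (!demo_addr_page_aligned(iov[i].iov_base, page_size))
            return (false);
        if (iov[i].iov_len % page_size != 0)
            return (false);
    }
    return (true);
}
```

An application that wants its requests to qualify would typically allocate buffers with `posix_memalign()` using the page size reported by `sysconf(_SC_PAGESIZE)`.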
Brian Atkinson	6e0ffaf627	Updating based on PR Feedback(3) 1. Unified the block cloning and Direct I/O code paths further. As part of this unification, it is important to note that Direct I/O writes transition the db_state to DB_UNCACHED. This is used so that dbuf_unoverride() is called when dbuf_undirty() is called. This is needed to clean up space accounting in a TXG. When a dbuf is redirtied through dbuf_redirty(), then dbuf_unoverride() is also called to clean up space accounting. This is a bit of a different approach than block cloning, which always calls dbuf_undirty(). 2. As part of unifying the two, Direct I/O also performs the same check in dmu_buf_will_fill() so that on failure the previous contents of the dbuf are set correctly. 3. General code cleanup, removing checks that are no longer necessary. Signed-off-by: Brian Atkinson <batkinson@lanl.gov>	2024-09-04 17:08:56 -06:00
Brian Atkinson	b1ee363675	Updating based on PR Feedback(2) Updating code based on PR code comments. I adjusted the following parts of the code based on the comments: 1. Updated zfs_check_direct_enabled() so it now just returns an error. This removed the need for the added enum and cleaned up the code. 2. Moved acquiring the rangelock from zfs_fillpage() out to zfs_getpage(). This cleans up the code and gets rid of the need to pass a boolean into zfs_fillpage() to conditionally grab the rangelock. 3. Cleaned up the code in both zfs_uio_get_dio_pages() and zfs_uio_get_dio_pages_iov(). There was no need to have wanted and maxsize as they were the same thing. Also, since the previous commit cleaned up the call to zfs_uio_iov_step(), the code is much cleaner overall. 4. Removed the dbuf_get_dirty_direct() function. 5. Unified dbuf_read() to account for both block clones and direct I/O writes. This removes redundant code from dbuf_read_impl() for grabbing the BP. 6. Removed zfs_map_page() and zfs_unmap_page() declarations from Linux headers as those were never called. Signed-off-by: Brian Atkinson <batkinson@lanl.gov>	2024-09-04 17:08:56 -06:00
Brian Atkinson	506bc54006	Updating based on PR Feedback(1) Updating code based on PR code comments. I adjusted the following parts of code based on comments: 1. Reverted dbuf_undirty() to its original logic and got rid of an unnecessary code change. 2. Cleanup in abd_impl.h. 3. Cleanup in abd.h. 4. Got rid of duplicate declaration of dmu_buf_hold_noread() in dmu.h. 5. Cleaned up comment for db_mtx in dmu_imp.h. 6. Updated zfsprop man page to state correct ZFS version. 7. Updated to correct cast in zfs_uio_page_aligned() calls to use uintptr_t. 8. Cleaned up comment in FreeBSD uio code. 9. Removed unnecessary format changes in comments in Linux abd code. 10. Updated ZFS VFS hook for direct_IO to use PANIC(). 11. Updated comment above dbuf_undirty to use double space again. 12. Converted the module parameter zfs_vdev_direct_write_verify_pct to be OS independent, which removed the unnecessary bounds check. 13. Updated the cast in zfs_dio_page_aligned to uintptr_t and added a kernel guard. 14. Updated zfs_dio_size_aligned() to use modulo math because dn->dn_datablksz is not required to be a power of 2. 15. Removed abd scatter stats update calls from all ABD_FLAG_FROM_PAGES. 16. Updated check in abd_alloc_from_pages() for the linear page. This way a single page that is even 4K can be represented as an ABD_FLAG_LINEAR_PAGE. 17. Fixing types for UIO code. In FreeBSD the vm code expects and returns ints for values. In Linux the interfaces return a long value in get_user_pages_unlocked() and the rest of the IOV interfaces return ints. Stuck with the worst case and used long for npages in Linux. Updated the uio npage struct to correspond to the correct types so that type checking is consistent in the UIO code. 18. Updated comments about what zfs_uio_get_pages_alloc() is doing. 19. Updated error handling in zfs_uio_get_dio_pages_alloc() for Linux. Signed-off-by: Brian Atkinson <batkinson@lanl.gov>	2024-09-04 17:08:56 -06:00
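Item 14 above switches to modulo math because the data block size need not be a power of 2. The sketch below shows why a bit-mask test is unsafe in that case. The `demo_` function names are hypothetical and this is not the body of zfs_dio_size_aligned(); it only contrasts the two techniques.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask-based test: only gives correct answers when blksz is a power of 2. */
bool
demo_size_aligned_mask(uint64_t size, uint64_t blksz)
{
    assert(blksz != 0);
    return ((size & (blksz - 1)) == 0);
}

/* Modulo-based test: correct for any non-zero block size. */
bool
demo_size_aligned_mod(uint64_t size, uint64_t blksz)
{
    assert(blksz != 0);
    return ((size % blksz) == 0);
}
```

With a 1536-byte block, the mask test reports 3072 as unaligned even though it is exactly two blocks, and reports 2048 as aligned even though it is not a whole number of blocks. The modulo test gets both cases right.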
Brian Atkinson	7c8b7fe0f3	Fixing race condition with rangelocks There existed a race condition between the completion of a Direct I/O write and the issuing of a sync operation. This was due to the fact that a Direct I/O write would sleep waiting on previous TXGs to sync out their dirty records associated with a dbuf if there was an ARC buffer associated with the dbuf. This was necessary to safely destroy the ARC buffer in case a previous dirty record's dr_data pointed at the db_buf. The main issue with this approach is that a Direct I/O write holds the rangelock across the entire block, so when a sync on that same block was issued and tried to grab the rangelock as reader, it would be blocked indefinitely because the Direct I/O that was now sleeping was holding that same rangelock as writer. This led to a complete deadlock. This commit fixes this issue and removes the wait in dmu_write_direct_done(). The way this is now handled is that the ARC buffer is destroyed, if there is one associated with the dbuf, before ever issuing the Direct I/O write. This implementation heavily borrows from the block cloning implementation. A new function dmu_buf_will_clone_or_dio() is called in both dmu_write_direct() and dmu_brt_clone() that does the following: 1. Undirties a dirty record for that db if there is one currently associated with the current TXG. 2. Destroys the ARC buffer if the previous dirty record dr_data does not point at the dbuf's ARC buffer (db_buf). 3. Sets the dbuf's data pointers to NULL. 4. Redirties the dbuf using db_state = DB_NOFILL. As part of this commit, the dmu_write_direct_done() function was also cleaned up. Now dmu_sync_done() is called before undirtying the dbuf dirty record associated with a failed Direct I/O write. This is correct logic and how it always should have been. An additional benefit of these modifications is that there is no longer a stall in a Direct I/O write if the user is mixing buffered and O_DIRECT I/O.
It also unifies the block cloning and Direct I/O write paths, as they both need to call dbuf_fix_old_data() before destroying the ARC buffer. As part of this commit, there is also general code cleanup. Various dbuf stats were removed because they are no longer necessary. Additionally, useless functions were removed to make the code paths cleaner for Direct I/O. Below is the race condition stack trace that was being consistently observed in the CI runs for the dio_random test case that prompted these changes: trace: [ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [ 9954.770512] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.773848] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [ 9954.775512] Call Trace: [ 9954.776406] __schedule+0x2d1/0x870 [ 9954.777386] ? free_one_page+0x204/0x530 [ 9954.778466] schedule+0x55/0xf0 [ 9954.779355] cv_wait_common+0x16d/0x280 [spl] [ 9954.780491] ? finish_wait+0x80/0x80 [ 9954.781450] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [ 9954.782889] dmu_write_direct_done+0x90/0x3b0 [zfs] [ 9954.784255] zio_done+0x373/0x1d50 [zfs] [ 9954.785410] zio_execute+0xee/0x210 [zfs] [ 9954.786588] taskq_thread+0x205/0x3f0 [spl] [ 9954.787673] ? wake_up_q+0x60/0x60 [ 9954.788571] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 9954.790079] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 9954.791199] kthread+0x134/0x150 [ 9954.792082] ? set_kthread_struct+0x50/0x50 [ 9954.793189] ret_from_fork+0x35/0x40 [ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [ 9954.795535] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.798669] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [ 9954.800267] Call Trace: [ 9954.801096] __schedule+0x2d1/0x870 [ 9954.801972] ? __wake_up_common+0x7a/0x190 [ 9954.802963] schedule+0x55/0xf0 [ 9954.803884] schedule_timeout+0x19f/0x320 [ 9954.804837] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.805932] ? taskq_dispatch+0xab/0x280 [spl] [ 9954.806959] io_schedule_timeout+0x19/0x40 [ 9954.807989] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.809110] ? finish_wait+0x80/0x80 [ 9954.810068] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.811103] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.812255] dsl_pool_sync+0xcb/0x6c0 [zfs] [ 9954.813442] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [ 9954.814648] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [ 9954.816023] spa_sync+0x362/0x8f0 [zfs] [ 9954.817110] txg_sync_thread+0x27a/0x3b0 [zfs] [ 9954.818267] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 9954.819510] ? spl_assert.constprop.0+0x20/0x20 [spl] [ 9954.820643] thread_generic_wrapper+0x63/0x90 [spl] [ 9954.821709] kthread+0x134/0x150 [ 9954.822590] ? set_kthread_struct+0x50/0x50 [ 9954.823584] ret_from_fork+0x35/0x40 [ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds. [ 9954.825781] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.828871] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [ 9954.830463] Call Trace: [ 9954.831280] __schedule+0x2d1/0x870 [ 9954.832159] ? dbuf_hold_copy+0xec/0x230 [zfs] [ 9954.833396] schedule+0x55/0xf0 [ 9954.834286] cv_wait_common+0x16d/0x280 [spl] [ 9954.835291] ? finish_wait+0x80/0x80 [ 9954.836235] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [ 9954.837543] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [ 9954.838838] zfs_get_data+0x566/0x810 [zfs] [ 9954.840034] zil_lwb_commit+0x194/0x3f0 [zfs] [ 9954.841154] zil_lwb_write_issue+0x68/0xb90 [zfs] [ 9954.842367] ? 
__list_add+0x12/0x30 [zfs] [ 9954.843496] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.844665] ? zil_alloc_lwb+0x217/0x360 [zfs] [ 9954.845852] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [ 9954.847203] zil_commit_waiter+0x1d2/0x3b0 [zfs] [ 9954.848380] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.849550] zfs_fsync+0x66/0x90 [zfs] [ 9954.850640] zpl_fsync+0xe5/0x140 [zfs] [ 9954.851729] do_fsync+0x38/0x70 [ 9954.852585] __x64_sys_fsync+0x10/0x20 [ 9954.853486] do_syscall_64+0x5b/0x1b0 [ 9954.854416] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.855466] RIP: 0033:0x7eff236bb057 [ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds. [ 9954.867490] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.870571] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [ 9954.872162] Call Trace: [ 9954.872947] __schedule+0x2d1/0x870 [ 9954.873844] schedule+0x55/0xf0 [ 9954.874716] schedule_timeout+0x19f/0x320 [ 9954.875645] ? __next_timer_interrupt+0xf0/0xf0 [ 9954.876722] io_schedule_timeout+0x19/0x40 [ 9954.877677] __cv_timedwait_common+0x19e/0x2c0 [spl] [ 9954.878822] ? 
finish_wait+0x80/0x80 [ 9954.879694] __cv_timedwait_io+0x15/0x20 [spl] [ 9954.880763] zio_wait+0x1ad/0x4f0 [zfs] [ 9954.881865] dmu_write_abd+0x174/0x1c0 [zfs] [ 9954.883074] dmu_write_uio_direct+0x79/0x100 [zfs] [ 9954.884285] dmu_write_uio_dnode+0xb2/0x320 [zfs] [ 9954.885507] dmu_write_uio_dbuf+0x47/0x60 [zfs] [ 9954.886687] zfs_write+0x581/0xe20 [zfs] [ 9954.887822] ? iov_iter_get_pages+0xe9/0x390 [ 9954.888862] ? trylock_page+0xd/0x20 [zfs] [ 9954.890005] ? __raw_spin_unlock+0x5/0x10 [zfs] [ 9954.891217] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [ 9954.892391] zpl_iter_write_direct+0xd4/0x170 [zfs] [ 9954.893663] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.894764] zpl_iter_write+0xd5/0x110 [zfs] [ 9954.895911] new_sync_write+0x112/0x160 [ 9954.896881] vfs_write+0xa5/0x1b0 [ 9954.897701] ksys_write+0x4f/0xb0 [ 9954.898569] do_syscall_64+0x5b/0x1b0 [ 9954.899417] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.900515] RIP: 0033:0x7eff236baa47 [ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 [ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds. [ 9954.912381] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.915434] task:fio state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080 [ 9954.917082] Call Trace: [ 9954.917773] __schedule+0x2d1/0x870 [ 9954.918648] ? 
zilog_dirty+0x4f/0xc0 [zfs] [ 9954.919831] schedule+0x55/0xf0 [ 9954.920717] cv_wait_common+0x16d/0x280 [spl] [ 9954.921704] ? finish_wait+0x80/0x80 [ 9954.922639] zfs_rangelock_enter_writer+0x46/0x1c0 [zfs] [ 9954.923940] zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs] [ 9954.925306] zfs_write+0x703/0xe20 [zfs] [ 9954.926406] zpl_iter_write_buffered+0xb2/0x120 [zfs] [ 9954.927687] ? rrw_exit+0xc6/0x200 [zfs] [ 9954.928821] zpl_iter_write+0xbe/0x110 [zfs] [ 9954.930028] new_sync_write+0x112/0x160 [ 9954.930913] vfs_write+0xa5/0x1b0 [ 9954.931758] ksys_write+0x4f/0xb0 [ 9954.932666] do_syscall_64+0x5b/0x1b0 [ 9954.933544] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.934689] RIP: 0033:0x7fcaee8f0a47 [ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d. [ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47 [ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006 [ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000 [ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000 [ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8 [ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds. [ 9954.946819] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 9954.949959] task:fio state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080 [ 9954.951653] Call Trace: [ 9954.952417] __schedule+0x2d1/0x870 [ 9954.953393] ? finish_wait+0x3e/0x80 [ 9954.954315] schedule+0x55/0xf0 [ 9954.955212] cv_wait_common+0x16d/0x280 [spl] [ 9954.956211] ? 
finish_wait+0x80/0x80 [ 9954.957159] zil_commit_waiter+0xfa/0x3b0 [zfs] [ 9954.958343] zil_commit_impl+0x6d/0xd0 [zfs] [ 9954.959524] zfs_fsync+0x66/0x90 [zfs] [ 9954.960626] zpl_fsync+0xe5/0x140 [zfs] [ 9954.961763] do_fsync+0x38/0x70 [ 9954.962638] __x64_sys_fsync+0x10/0x20 [ 9954.963520] do_syscall_64+0x5b/0x1b0 [ 9954.964470] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 9954.965567] RIP: 0033:0x7fcaee8f1057 [ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d. [ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057 [ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005 [ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000 [ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003 [ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8 [10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds. [10077.649541] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.652782] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080 [10077.654420] Call Trace: [10077.655267] __schedule+0x2d1/0x870 [10077.656179] ? free_one_page+0x204/0x530 [10077.657192] schedule+0x55/0xf0 [10077.658004] cv_wait_common+0x16d/0x280 [spl] [10077.659018] ? finish_wait+0x80/0x80 [10077.660013] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs] [10077.661396] dmu_write_direct_done+0x90/0x3b0 [zfs] [10077.662617] zio_done+0x373/0x1d50 [zfs] [10077.663783] zio_execute+0xee/0x210 [zfs] [10077.664921] taskq_thread+0x205/0x3f0 [spl] [10077.665982] ? wake_up_q+0x60/0x60 [10077.666842] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [10077.668295] ? taskq_lowest_id+0xc0/0xc0 [spl] [10077.669360] kthread+0x134/0x150 [10077.670191] ? 
set_kthread_struct+0x50/0x50 [10077.671209] ret_from_fork+0x35/0x40 [10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds. [10077.673467] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.676612] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080 [10077.678288] Call Trace: [10077.679024] __schedule+0x2d1/0x870 [10077.679948] ? __wake_up_common+0x7a/0x190 [10077.681042] schedule+0x55/0xf0 [10077.681899] schedule_timeout+0x19f/0x320 [10077.682951] ? __next_timer_interrupt+0xf0/0xf0 [10077.684005] ? taskq_dispatch+0xab/0x280 [spl] [10077.685085] io_schedule_timeout+0x19/0x40 [10077.686080] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.687227] ? finish_wait+0x80/0x80 [10077.688123] __cv_timedwait_io+0x15/0x20 [spl] [10077.689206] zio_wait+0x1ad/0x4f0 [zfs] [10077.690300] dsl_pool_sync+0xcb/0x6c0 [zfs] [10077.691435] ? spa_errlog_sync+0x2f0/0x3d0 [zfs] [10077.692636] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs] [10077.693997] spa_sync+0x362/0x8f0 [zfs] [10077.695112] txg_sync_thread+0x27a/0x3b0 [zfs] [10077.696239] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [10077.697512] ? spl_assert.constprop.0+0x20/0x20 [spl] [10077.698639] thread_generic_wrapper+0x63/0x90 [spl] [10077.699687] kthread+0x134/0x150 [10077.700567] ? set_kthread_struct+0x50/0x50 [10077.701502] ret_from_fork+0x35/0x40 [10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds. [10077.703697] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.706780] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080 [10077.708479] Call Trace: [10077.709231] __schedule+0x2d1/0x870 [10077.710190] ? dbuf_hold_copy+0xec/0x230 [zfs] [10077.711368] schedule+0x55/0xf0 [10077.712286] cv_wait_common+0x16d/0x280 [spl] [10077.713316] ? 
finish_wait+0x80/0x80 [10077.714262] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs] [10077.715566] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs] [10077.716878] zfs_get_data+0x566/0x810 [zfs] [10077.718032] zil_lwb_commit+0x194/0x3f0 [zfs] [10077.719234] zil_lwb_write_issue+0x68/0xb90 [zfs] [10077.720413] ? __list_add+0x12/0x30 [zfs] [10077.721525] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.722708] ? zil_alloc_lwb+0x217/0x360 [zfs] [10077.723931] zil_commit_waiter_timeout+0x1f3/0x570 [zfs] [10077.725273] zil_commit_waiter+0x1d2/0x3b0 [zfs] [10077.726438] zil_commit_impl+0x6d/0xd0 [zfs] [10077.727586] zfs_fsync+0x66/0x90 [zfs] [10077.728675] zpl_fsync+0xe5/0x140 [zfs] [10077.729755] do_fsync+0x38/0x70 [10077.730607] __x64_sys_fsync+0x10/0x20 [10077.731482] do_syscall_64+0x5b/0x1b0 [10077.732415] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.733487] RIP: 0033:0x7eff236bb057 [10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d. [10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057 [10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006 [10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000 [10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003 [10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8 [10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds. [10077.745505] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1 [10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10077.748642] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080 [10077.750233] Call Trace: [10077.751011] __schedule+0x2d1/0x870 [10077.751915] schedule+0x55/0xf0 [10077.752811] schedule_timeout+0x19f/0x320 [10077.753762] ? 
__next_timer_interrupt+0xf0/0xf0 [10077.754824] io_schedule_timeout+0x19/0x40 [10077.755782] __cv_timedwait_common+0x19e/0x2c0 [spl] [10077.756922] ? finish_wait+0x80/0x80 [10077.757788] __cv_timedwait_io+0x15/0x20 [spl] [10077.758845] zio_wait+0x1ad/0x4f0 [zfs] [10077.759941] dmu_write_abd+0x174/0x1c0 [zfs] [10077.761144] dmu_write_uio_direct+0x79/0x100 [zfs] [10077.762327] dmu_write_uio_dnode+0xb2/0x320 [zfs] [10077.763523] dmu_write_uio_dbuf+0x47/0x60 [zfs] [10077.764749] zfs_write+0x581/0xe20 [zfs] [10077.765825] ? iov_iter_get_pages+0xe9/0x390 [10077.766842] ? trylock_page+0xd/0x20 [zfs] [10077.767956] ? __raw_spin_unlock+0x5/0x10 [zfs] [10077.769189] ? zfs_setup_direct+0x7e/0x1b0 [zfs] [10077.770343] zpl_iter_write_direct+0xd4/0x170 [zfs] [10077.771570] ? rrw_exit+0xc6/0x200 [zfs] [10077.772674] zpl_iter_write+0xd5/0x110 [zfs] [10077.773834] new_sync_write+0x112/0x160 [10077.774805] vfs_write+0xa5/0x1b0 [10077.775634] ksys_write+0x4f/0xb0 [10077.776526] do_syscall_64+0x5b/0x1b0 [10077.777386] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [10077.778488] RIP: 0033:0x7eff236baa47 [10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d. [10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47 [10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005 [10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000 [10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000 [10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8 Signed-off-by: Brian Atkinson <batkinson@lanl.gov>	2024-09-04 17:08:56 -06:00
Brian Atkinson	d93a3febba	Adding Direct IO Support Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other file systems, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write requests, the offset and request size must, at a minimum, be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that the request is block aligned and a cached copy of the buffer exists in the ARC, it will be discarded from the ARC, forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restriction is PAGE_SIZE alignment. In the event that the requested data is buffered (in the ARC), it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request is not PAGE_SIZE aligned, however, EINVAL will be returned as always, regardless of whether the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: checksum, compression, dedup, encryption, and erasure coding. There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OSes supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection, so any data in the user buffers that is written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity cannot be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls whether a checksum verify is run on O_DIRECT writes to a top-level VDEV before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issued as `dio_verify` when a checksum verification error occurs. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts the O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met; otherwise it will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. Setting this module parameter to 0 mimics setting the direct dataset property to disabled.
Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>	2024-09-04 17:06:56 -06:00
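The path selection rules in this commit message can be summarized as a small decision function. This is a sketch built only from the description above, not from zfs_setup_direct() or any other OpenZFS function. Every type and function name here is hypothetical. It assumes "block aligned" means that both offset and size are multiples of recordsize, and it leaves out the mmap fallback and the ARC eviction step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; these are not OpenZFS definitions. */
typedef enum { DEMO_DIRECT_DISABLED, DEMO_DIRECT_STANDARD, DEMO_DIRECT_ALWAYS } demo_direct_t;
typedef enum { DEMO_PATH_BUFFERED, DEMO_PATH_DIRECT, DEMO_PATH_EINVAL } demo_path_t;

typedef struct demo_req {
    uint64_t offset;    /* file offset of the request */
    uint64_t size;      /* request length in bytes */
    bool is_write;      /* write (true) or read (false) */
    bool o_direct;      /* file was opened with O_DIRECT */
} demo_req_t;

demo_path_t
demo_classify(const demo_req_t *req, demo_direct_t prop, bool dio_enabled,
    uint64_t page_size, uint64_t recordsize)
{
    /* zfs_dio_enabled=0 behaves like direct=disabled: always buffered. */
    if (!dio_enabled || prop == DEMO_DIRECT_DISABLED)
        return (DEMO_PATH_BUFFERED);

    /* direct=standard only considers requests that passed O_DIRECT. */
    if (prop == DEMO_DIRECT_STANDARD && !req->o_direct)
        return (DEMO_PATH_BUFFERED);

    /* Offset and size must be page aligned; direct=always never fails. */
    if (req->offset % page_size != 0 || req->size % page_size != 0)
        return (prop == DEMO_DIRECT_ALWAYS ?
            DEMO_PATH_BUFFERED : DEMO_PATH_EINVAL);

    /* Writes must also be block aligned or they take the buffered path. */
    if (req->is_write &&
        (req->offset % recordsize != 0 || req->size % recordsize != 0))
        return (DEMO_PATH_BUFFERED);

    return (DEMO_PATH_DIRECT);
}
```

For example, with a 4K page and a 128K recordsize, a 128K O_DIRECT write at offset 0 is classified as direct, while a 4K O_DIRECT write falls back to the buffered path and a 100-byte O_DIRECT read yields EINVAL under direct=standard.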
Don Brady	d4d79451cb	Add DDT prune command Requires the new 'flat' physical data which has the start time for a class entry. The amount to prune can be based on a target percentage of the unique entries or based on the age (i.e., every entry older than N days). Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16277	2024-09-04 14:17:02 -07:00
Rob Norris	4a4f7b019f	zdb: rework dedup accounting for log, quota and prune The simplest thing first: add the FDT and log objects to the list of objects to be considered when checking for leaks. The rest is based on a conceptual change in all of this patch stack: a block on disk with a 'D' bit is not necessarily in the DDT at all (pruned), or in the DDT ZAPs (still on the log). As such, walking the DDT up front is difficult (for all the reasons that walking an unflushed log is difficult) and not really useful, since it's not a reflection of what's on disk anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #16277	2024-09-04 14:16:42 -07:00
Seth Hoffert	bf8c61f489	Remove unused sysctl node PR #14953 removed vdev-level read cache but accidentally left this sysctl node behind. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Seth Hoffert <seth.hoffert@gmail.com> Closes #16493	2024-09-03 17:52:33 -07:00
Rob Norris	b3b7491615	build: rename FORCEDEBUG_CPPFLAGS to LIBZPOOL_CPPFLAGS This is just a very small attempt to make it more obvious that these flags aren't optional for libzpool-using programs, by not making it seem like there's an option to say "well, I don't _want_ to force debugging". Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Issue #16476 Closes #16477	2024-08-27 12:53:27 -07:00
Rob Norris	92fca1c2d0	zstream: build with debug to fix stack overruns abd_t differs in size depending on whether or not ZFS_DEBUG is set. It turns out that libzpool is built with FORCEDEBUG_CPPFLAGS, which sets -DZFS_DEBUG, and so it always has a larger abd_t with extra debug fields, regardless of whether or not --enable-debug is set. zdb, ztest and zhack are also all built with FORCEDEBUG_CPPFLAGS, so had the same idea of the size of abd_t, but zstream was not, and used the "smaller" abd_t. In practice this didn't matter because it never used abd_t directly. This changed in `b4d81b1a6`, when zstream was switched to use stack ABDs for compression. When built with --enable-debug, zstream implicitly gets ZFS_DEBUG, and everything is fine. Production builds without that flag end up with the smaller abd_t, which is now mismatched with libzpool, and causes stack overruns in zstream recompress. The simplest fix for now is to compile zstream with FORCEDEBUG_CPPFLAGS like the other binaries. This commit does that. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Issue #16476 Closes #16477	2024-08-27 12:52:23 -07:00
Rob Norris	50b32cb925	fm: pass io_flags through events & zed as uint64_t In `4938d01db` (#14086) zio_flag_t was converted from an enum (generally signed 32-bit) to a uint64_t. The corresponding change wasn't made to the error reporting subsystem, limiting the error flags being delivered to zed to 32 bits. This bumps the whole pipeline to use uint64s. A tiny bit of compatibility is added for newer zed working against an older kernel module, because it's easy to do and misdetecting scrub/resilver errors and taking action is potentially dangerous. Making it work for new kernel modules against older zed seems to be far more invasive for far less benefit, so I have not. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16469	2024-08-26 17:39:13 -07:00
Jitendra Patidar	73866cf346	Fix issig() to check signal_pending after dequeue SIGSTOP/SIGTSTP When a process gets SIGSTOP/SIGTSTP, issig() dequeues them and returns 0. But the process could still have another signal pending after the dequeue. So, after the dequeue, check signal_pending and return 1 if a signal is pending. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #16464	2024-08-26 17:36:49 -07:00
Mateusz Piotrowski	6be8bf5552	zpool: Provide GUID to zpool-reguid(8) with -g (#16239) This commit extends the zpool-reguid(8) command with a -g flag, which allows the user to specify the GUID to set. This change also adds some general tests for zpool-reguid(8). Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-26 09:27:24 -07:00
Rob Norris	2420ee6e12	spl-taskq: fix task counts for delayed and cancelled tasks Dispatched delayed tasks were not added to tasks_total, and cancelled tasks were not removed. This notably could make tasks_total go to UINT64_MAX, but just generally meant the count could be wrong. So let's not! Sponsored-by: Klara, Inc. Sponsored-by: Syneto Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16473	2024-08-23 10:40:45 -07:00
Low-power	34118eac06	Make mount.zfs(8) call zfs_mount_at for legacy mounts as well Commit `329e2ffa4b` made mount.zfs(8) call the libzfs function 'zfs_mount_at', in order to propagate dataset properties into mount options. This fix, however, is limited to a special use case where mount.zfs(8) is used in initrd with option '-o zfsutil'. If either initrd or the user needs to use mount.zfs(8) to mount a file system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and the original issue #7947 will still happen. Since the existing code already excludes the possibility of calling 'zfs_mount_at' when it is invoked as a helper program from zfs(8), by checking the 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to avoid calling 'zfs_mount_at' without '-o zfsutil'. An exception, however, is when mount.zfs(8) is invoked with '-o remount' to update the mount options for an existing mount point. In this case call mount(2) directly without modifying the mount options passed from the command line. Furthermore, don't run the mount.zfs(8) helper for automounting snapshots. The above change to make mount.zfs(8) call 'zfs_mount_at' apparently caused it to trigger an automount for the snapshot directory. When the helper is invoked as a result of a snapshot automount, an infinite recursion will occur. Since the only reason for invoking user-mode mount(8) for automounting was to work around 'vfs_kern_mount' being GPL-only, just run mount(8) without the mount.zfs(8) helper by adding option '-i'. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <whr@rivoreo.one> Closes #16393	2024-08-23 10:39:09 -07:00
Rob Norris	cb36f4f352	zstream recompress: fix zero recompressed buffer and output If compression happened, any garbage past the compressed size was not zeroed out. If compression didn't happen, then the payload size was still set to the rounded-up return from zio_compress_data(), which is dependent on the input, which is not necessarily the logical size. So that's all fixed too, mostly from stealing the math from zio.c. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	a537d90734	zstream decompress: fix decompress size and output This was incorrectly using the compressed length for the size of the decompress buffer, and quietly doing nothing if the decompressor refused to decompress the block because there wasn't enough space. After that, it wasn't correctly rewriting the record to indicate "not compressed". So that's fixed now. Sigh. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	a9c94bea9f	zio_compress_data: limit dest length to ABD size Some callers (eg `do_corrective_recv()`) pass in a dest buffer much smaller than the wanted 87.5% of the source buffer, because the incoming abd is larger than the source data and they "know" what the decompressed size will be. However, `abd_borrow_buf()` rightly asserts if we try to borrow more than is available, so these callers fail. Previously when all we had was a dest buffer, we didn't know how big it was, so we couldn't do anything. Now we have a dest abd, with a size, so we can clamp dest size to the abd size. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	f62e6e1f98	compress: change zio_compress API to use ABDs This commit changes the frontend zio_compress_data and zio_decompress_data APIs to take ABD pointers instead of buffer pointers. All callers are updated to match. Any that already have an appropriate ABD nearby now use it directly, while at the rest we create one. Internally, the ABDs are passed through to the provider directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	d3c12383c9	compress: change compression providers API to use ABDs This commit changes the provider compress and decompress API to take ABD pointers instead of buffer pointers for both data source and destination. It then updates all providers to match. This doesn't actually change the providers to do chunked compression, just changes the API to allow such an update in the future. Helper macros are added to easily adapt the ABD functions to their buffer-based implementations. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	522816498c	compress: standardise names of compression functions This is mostly to make searching easier. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	dd0c08f9c6	compress: remove unused abd compress prototype Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	e119483a95	compress: remove zio_decompress_data_buf Nothing uses it anymore! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	b4d81b1a6a	zstream: use zio_compress calls for compression This is updating zstream to use the zio_compress calls rather than using its own dispatch. Since that was fairly entangled, some refactoring included. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	5eede0d5fd	compress: rework callers to always use the zio_compress calls This will make future refactoring easier. There are two we can't change for the moment, because zio_compress_data does hole detection & collapsing which zio_decompress_data does not actually know how to handle. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	ba2209ec9e	abd_get_from_buf_struct: wrap existing buf with ABD stored on stack This allows a simple "wrapping" ABD for an existing linear buffer to be allocated on the stack, avoiding an allocation. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Tony Hutter	9e15877dfb	Linux 6.10 compat: META Update the META file to reflect compatibility with the 6.10 kernel. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16466	2024-08-21 17:38:06 -07:00
Rob Norris	b69bebb535	libzpool/abd_os: iovec-based scatter abd This is intended to be a simple userspace scatter abd based on struct iovec. It's not very sophisticated as-is, but sets a base for something much more interesting. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:25 -07:00
Rob Norris	5b9e695392	abd_os: break out platform-specific header parts Removing the platform #ifdefs from shared headers in favour of per-platform headers. Makes abd_t much leaner, among other things. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:18 -07:00
Rob Norris	7a5b4355e2	abd_os: split userspace and Linux kernel code The Linux abd_os.c serves double-duty as the userspace scatter abd implementation, by carrying an emulation of kernel scatterlists. This commit lifts common and userspace-specific parts out into a separate abd_os.c for libzpool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:13 -07:00
Rob Norris	2b7d9a7863	zio: no alloc canary in userspace Makes it harder to use memory debuggers like valgrind directly, because they can't see canary overruns. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:07 -07:00
Rob Norris	b3f4e4e1ec	abd: remove ABD_FLAG_ZEROS Nothing ever checks it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:36:24 -07:00
shodanshok	bbe8512a93	Ignore zfs_arc_shrinker_limit in direct reclaim mode zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse due to excessive memory reclaim. However, when the kernel is in direct reclaim mode (ie: low on memory), limiting ARC reclaim increases OOM risk. This is especially true on systems without (or with inadequate) swap. This patch ignores zfs_arc_shrinker_limit when the kernel is in direct reclaim mode, avoiding most OOMs. It also restores the ability of "echo 3 > /proc/sys/vm/drop_caches" to correctly drop (almost) all ARC. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #16313	2024-08-21 10:00:33 -07:00
Ameer Hamza	a2c4e95cfd	linux/zvol_os.c: cleanup limits for non-blk mq case Rob Norris suggested that we could clean up redundant limits for the non-blk mq case. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-20 17:16:08 -07:00
Ameer Hamza	8e6a9aabb1	linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernel In kernels 6.8 and later, the zvol block device is allocated with qlimits passed during initialization. However, the zvol driver does not set `max_hw_discard_sectors`, which is necessary to properly initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting `max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolves the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-20 17:14:44 -07:00
Rob Norris	816d2b2bfc	spl-proc: remove old taskq stats These had minimal useful information for the admin, didn't work properly in some places, and knew far too much about taskq internals. With the new stats available, these should never be needed anymore. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:50:45 -07:00
Rob Norris	3f8fd3cae0	spl-taskq: summary stats for all taskqs This adds /proc/spl/kstat/taskq/summary, which attempts to show a useful subset of stats for all taskqs in the system. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:50:41 -07:00
Rob Norris	db40fe4cf6	spl-taskq: per-taskq kstats This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq, one file per taskq, named for the taskq name.instance. These include a small amount of info about the taskq config, the current state of the threads and queues, and various counters for thread and queue activity since the taskq was created. To assist with decrementing queue size counters, the list an entry is on is encoded in spare bits in the entry flags. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:50:35 -07:00
Rob Norris	f0ad031cd9	spl-generic: bring up kstats subsystem before taskq For spl-taskq to use the kstats infrastructure, it has to be available first. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:49:28 -07:00
Chunwei Chen	06a7b123ac	Skip ro check for snaps when multi-mount Skip ro check for snapshots since they are always ro regardless if ro flag is passed by mount or not. This allows multi-mounting snapshots without requiring to specify ro flag. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #16299	2024-08-19 09:42:17 -07:00
shodanshok	77a797a382	Enable L2 cache of all (MRU+MFU) metadata but MFU data only `l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU data and metadata. However it can be useful to cache as much metadata as possible while, at the same time, restricting data cache to MFU buffers only. This patch allows for such behavior by setting `l2arc_mfuonly` to 2 (or higher). The list of possible values is the following: 0: cache both MRU and MFU for both data and metadata; 1: cache only MFU for both data and metadata; 2: cache both MRU and MFU for metadata, but only MFU for data. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #16343 Closes #16402	2024-08-16 13:34:07 -07:00
Allan Jude	a60e15d6b9	Man page updates for dmu_ddt_copies Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #15895	2024-08-16 12:04:04 -07:00
Rob Norris	0d2707815d	ddt: lookup and log stats Adds per-DDT stats counting lookups and where they were serviced from (either log or backing zap), number of log entries in memory, and flow rates. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:51 -07:00
Rob Norris	a1902f4950	ddt: block scan until log is flushed, and flush aggressively The dedup log does not have a stable cursor, so it's not possible to persist our current scan location within it across pool reloads. Because of this, when walking (scanning), we can't treat it like just another source of dedup entries. Instead, when a scan is wanted, we switch to an aggressive flushing mode, pushing out entries older than the scan start txg as fast as we can, before starting the scan proper. Entries after the scan start txg will be handled via other methods; the DDT ZAPs and logs will be written as normal, and blocks not seen yet will be offered to the scan machinery as normal. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:43 -07:00
Rob Norris	cd69ba3d49	ddt: dedup log Adds a log/journal to dedup. At the end of txg, instead of writing the entry directly to the ZAP, it is added to an in-memory tree and appended to an on-disk object. The on-disk object is only read at import, to reload the in-memory tree. Lookups first go to the log tree before going to the ZAP, so recently-used entries will remain close by in memory. This vastly reduces overhead from dedup IO, as it will not have to do so many read/update/write cycles on ZAP leaf nodes. A flushing facility is added at end of txg, to push logged entries out to the ZAP. There are actually two separate "logs" (in-memory tree and on-disk object), one active (receiving updated entries) and one flushing (writing out to disk). These are swapped (ie flushing begins) based on memory used by the in-memory log trees and time since we last flushed something. The flushing facility monitors the number of entries coming in and being flushed out, and calibrates itself to try to flush enough each txg to keep up with the ingest rate without competing too much with other IO. Multiple tuneables are provided to control the flushing facility. All the histograms and stats are updated to accommodate the log as a separate entry store. zdb gains knowledge of how to count them and dump them. Documentation included! Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:35 -07:00
Rob Norris	cbb9ef0a4c	ddt: tuneable to override copies= on dedup metadata objects All objects stored in the MOS get copies=3. For a large dedup table, this requires significant extra IO and disk space, when it's not really necessary - the dedup table itself isn't needed to read or write data, only to keep data usage down. Losing the dedup table does not render the pool unusable, it just messes up the accounting somewhat. This adds a dmu_ddt_copies tuneable. When set to 0, the existing behaviour is used. When set higher, dedup table blocks (ZAP and log) will have this many copies rather than the usual 3, while indirect blocks will have one more again. This is a tuneable for now mostly for testing. Losing a dedup table can cause blocks to be leaked, and we currently have no facilities to repair that. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:27 -07:00
Rob Norris	592f38900d	ddt: compare keys 64-bits at a time, trying to match ZAP order This yields substantial performance improvements when we only write out some small % of entries at a time, as it will cause entries that will go into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and so touch fewer blocks. Without this, the distribution is an even spread, so we touch a lot more ZAP leaf nodes for any given number of entries. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:19 -07:00
Rob Norris	27e9cb5f80	ddt: cleanup the stats & histogram code Both the API and the code were kinda mangled and I was really struggling to follow it. The worst offender was the old ddt_stat_add(); after fixing it up the rest of the changes are mostly knock-on effects and targets of opportunity. Note that the old ddt_stat_add() was safe against overflows - it could produce crazy numbers, but the compiler wouldn't do anything stupid. The assertions in ddt_stat_sub() go a lot of the way to protecting against this; getting in a position where overflows are a problem is definitely a programming error. Also expanding ddt_stat_add() and ddt_histogram_empty() produces less efficient assembly. I'm not bothered about this right now though; these should not be hot functions, and if they are we'll optimise them later. If we have to go back to the old form, we'll comment it like crazy. Finally, I've removed the assertion that the bucket will never be negative, as it will soon be possible to have entries with zero refcounts: an entry for a block that is no longer on the pool, but is on the log waiting to be synced out. It might be better to have a separate bucket for these, since they're still using real space on disk, but ultimately these stats are driving UI, and for now I've chosen to keep them matching how they've looked in the past, as well as match the operator's mental model - pool usage is managed elsewhere. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:02:56 -07:00
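
The three `l2arc_mfuonly` values listed in commit 77a797a382 above can be read as a small decision table. The following is only an illustrative sketch of that table in Python, not the kernel implementation: the function name `l2arc_eligible` and its boolean arguments are hypothetical, and only the meaning of the values 0, 1 and 2 (or higher) comes from the commit message.

```python
def l2arc_eligible(mfuonly: int, is_mfu: bool, is_metadata: bool) -> bool:
    """Say whether a buffer would be a candidate for L2ARC caching.

    Hypothetical helper that mirrors the value list in the commit message:
      0          -> MRU and MFU, for both data and metadata
      1          -> MFU only, for both data and metadata
      2 or above -> MRU and MFU for metadata, MFU only for data
    """
    if mfuonly < 0:
        # Assumption: negative settings are treated as invalid in this sketch.
        raise ValueError("l2arc_mfuonly must be non-negative")
    if mfuonly == 0:
        return True
    if mfuonly == 1:
        return is_mfu
    # 2 or higher: metadata is always eligible, data only when it is MFU.
    return is_metadata or is_mfu


if __name__ == "__main__":
    # Print the full decision table for the three documented settings.
    for setting in (0, 1, 2):
        for is_metadata in (False, True):
            for is_mfu in (False, True):
                kind = "metadata" if is_metadata else "data"
                state = "MFU" if is_mfu else "MRU"
                verdict = l2arc_eligible(setting, is_mfu, is_metadata)
                print(f"l2arc_mfuonly={setting} {state} {kind}: {verdict}")
```

The only row that differs between settings 1 and 2 is MRU metadata, which is the case the commit set out to enable.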

9366 Commits All Branches Search