Commit Graph

2388 Commits

Author SHA1 Message Date
Brian Atkinson 71ce314930 Updating based on PR Feedback(4)
1. When testing out installing a VM with virtual manager on Linux and a
   dataset with direct=always, there an ASSERT failure in
   abd_alloc_from_pages(). Originally zfs_setup_direct() did an
   alignment check of the UIO using SPA_MINBLOCKSIZE with
   zfs_uio_aligned(). The idea behind this was maybe the page alignment
   restriction could be changed to use ashift as the alignment check in
   the future. Howver, this diea never came to be. The alignment
   restrictions for Direct I/O are based on PAGE_SIZE. Updating the
   check zfs_setup_direct() for the UIO to use PAGE_SIZE fixed the
   issue.
2. Updated other alignment check in dmu_read_impl() to also use
   PAGE_SIZE.
3. As a consequence of updating the UIO alignment checks the ZTS test
   case dio_unaligned_filesize began to fail. This is because there was
   no way to detect reading past the end of the file before issue EINVAL
   in the ZPL and VOPs layers in FreeBSD. This was resolved by moving
   zfs_setup_direct() into zfs_write() and zfs_read(). This allows for
   other error checking to take place before checking any Direct I/O
   limitations. Updating the call site of zfs_setup_direct() did require
   a bit of changes to the logic in that function. In particular Direct
   I/O can just be avoid altogether depending on the checks in
   zfs_setup_direct() and there is no reason to return EAGAIN at all.
4. After moving zfs_setup_direct() into zfs_write() and zfs_read(),
   there was no reason to call zfs_check_direct_enabled() in the ZPL
   layer in Linux or in the VNOPS layer of FreeBSD. This function was
   completely removed. This allowed for much of the code in both those
   layers to return to their original code.
5. Upated the checksum verify module parameter for Direct I/O writes to
   only be a boolean and return EIO in the event a checksum verify
   failure occurs. By default, this module parameter is set to 1 for
   Linux and 0 for FreeBSD. The module parameter has been changed to
   zfs_vdev_direct_write_verify. There are still counters on the
   top-level VDEV for checksum verify failures, but this could be
   removed. It would still be good to to leave the ZED event dio_verify
   for checksum failures as a notification that an application was
   manipulating the contents of a buffer after issuing that buffer with
   for I/O using Direct I/O. As part of this cahnge, man pages were
   updated, the ZTS test case dio_writy_verify was updated, and all
   comments relating to the module parameter were udpated as well.
6. Updated comments in dio_property ZTS test to properly reflect that
   stride_dd is being called with check_write and check_read.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
2024-09-04 17:08:56 -06:00
Brian Atkinson 6e0ffaf627 Updating based on PR Feedback(3)
1. Unified the block cloning and Direct I/O code paths further. As part
   of this unification, it is important to outline that Direct I/O
   writes transition the db_state to DB_UNCACHED. This is used so that
   dbuf_unoverride() is called when dbuf_undirty() is called. This is
   needed to cleanup space accounting in a TXG. When a dbuf is redirtied
   through dbuf_redirty(), then dbuf_unoverride() is also called to
   clean up space accounting. This is a bit of a different approach that
   block cloning, which always calls dbuf_undirty().
2. As part of uniying the two, Direct I/O also performs the same check
   in dmu_buf_will_fill() so that on failure the previous contents of
   the dbuf are set correctly.
3. General just code cleanup removing checks that are no longer
   necessary.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
2024-09-04 17:08:56 -06:00
Brian Atkinson b1ee363675 Updating based on PR Feedback(2)
Updating code base on PR code comments. I adjusted the following parts
of the code base on the comments:
1. Updated zfs_check_direct_enabled() so it now just returns an error.
   This removed the need for the added enum and cleaned up the code.
2. Moved acquiring the rangelock from zfs_fillpage() out to
   zfs_getpage(). This cleans up the code and gets rid of the need to
   pass a boolean into zfs_fillpage() to conditionally gra the
   rangelock.
3. Cleaned up the code in both zfs_uio_get_dio_pages() and
   zfs_uio_get_dio_pages_iov(). There was no need to have wanted and
   maxsize as they were the same thing. Also, since the previous
   commit cleaned up the call to zfs_uio_iov_step() the code is much
   cleaner over all.
4. Removed dbuf_get_dirty_direct() function.
5. Unified dbuf_read() to account for both block clones and direct I/O
   writes. This removes redundant code from dbuf_read_impl() for
   grabbingthe BP.
6. Removed zfs_map_page() and zfs_unmap_page() declarations from Linux
   headers as those were never called.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
2024-09-04 17:08:56 -06:00
Brian Atkinson 506bc54006 Updating based on PR Feedback(1)
Updating code based on PR code comments. I adjusted the following parts
of code based on comments:
1.  Revert dbuf_undirty() to original logic and got rid of uncessary
    code change.
2.  Cleanup in abd_impl.h
3.  Cleanup in abd.h
4.  Got rid of duplicate declaration of dmu_buf_hold_noread() in dmu.h.
5.  Cleaned up comment for db_mtx in dmu_imp.h
6.  Updated zfsprop man page to state correct ZFS version
7.  Updated to correct cast in zfs_uio_page_aligned() calls to use
    uintptr_t.
8.  Cleaned up comment in FreeBSD uio code.
9.  Removed unnecessary format changes in comments in Linux abd code.
10.  Updated ZFS VFS hook for direct_IO to use PANIC().
11. Updated comment above dbuf_undirty to use double space again.
12. Converted module paramter zfs_vdev_direct_write_verify_pct OS
    indedepent and in doing so this removed the uneccessary check for
    bounds.
13. Updated to casting in zfs_dio_page_aligned to uniptr_t and added
    kernel guard.
14. Updated zfs_dio_size_aligned() to use modulo math because
    dn->dn_datablksz is not required to be a power of 2.
15. Removed abd scatter stats update calls from all
    ABD_FLAG_FROM_PAGES.
16. Updated check in abd_alloc_from_pages() for the linear page. This
    way a single page that is even 4K can represented as an
    ABD_FLAG_LINEAR_PAGE.
17. Fixing types for UIO code. In FreeBSD the vm code expects and
    returns int's for values. In linux the interfaces return long value
    in get_user_pages_unlocked() and rest of the IOV interfaces return
    int's. Stuck with the worse case and used long for npages in Linux.
    Updated the uio npage struct to correspond to the correct types and
    that type checking is consistent in the UIO code.
18. Updated comments about what zfs_uio_get_pages_alloc() is doing.
19. Updated error handeling in zfs_uio_get_dio_pages_alloc() for Linux.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
2024-09-04 17:08:56 -06:00
Brian Atkinson 7c8b7fe0f3 Fixing race condition with rangelocks
There existed a race condition between when a Direct I/O write could
complete and if a sync operation was issued. This was due to the fact
that a Direct I/O would sleep waiting on previous TXG's to sync out
their dirty records assosciated with a dbuf if there was an ARC buffer
associated with the dbuf. This was necessay to safely destroy the ARC
buffer in case previous dirty records dr_data as pointed at that the
db_buf. The main issue with this approach is a Direct I/o write holds
the rangelock across the entire block, so when a sync on that same block
was issued and tried to grab the rangelock as reader, it would be
blocked indefinitely because the Direct I/O that was now sleeping was
holding that same rangelock as writer. This led to a complete deadlock.

This commit fixes this issue and removes the wait in
dmu_write_direct_done().

The way this is now handled is the ARC buffer is destroyed, if there an
associated one with dbuf, before ever issuing the Direct I/O write.
This implemenation heavily borrows from the block cloning
implementation.

A new function dmu_buf_wil_clone_or_dio() is called in both
dmu_write_direct() and dmu_brt_clone() that does the following:
1. Undirties a dirty record for that db if there one currently
   associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record dr_data does not
   point at the dbufs ARC buffer (db_buf).
3. Sets the dbufs data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.

As part of this commit, the dmu_write_direct_done() function was also
cleaned up. Now dmu_sync_done() is called before undirtying the dbuf
dirty record associated with a failed Direct I/O write. This is correct
logic and how it always should have been.

The additional benefits of these modifications is there is no longer a
stall in a Direct I/O write if the user is mixing bufferd and O_DIRECT
together. Also it unifies the block cloning and Direct I/O write path as
they both need to call dbuf_fix_old_data() before destroying the ARC
buffer.

As part of this commit, there is also just general code cleanup. Various
dbuf stats were removed because they are not necesary any longer.
Additionally, useless functions were removed to make the code paths
cleaner for Direct I/O.

Below is the race condtion stack trace that was being consistently
observed in the CI runs for the dio_random test case that prompted
these changes:
trace:
[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[ 9954.770512]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.773848] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406]  __schedule+0x2d1/0x870
[ 9954.777386]  ? free_one_page+0x204/0x530
[ 9954.778466]  schedule+0x55/0xf0
[ 9954.779355]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491]  ? finish_wait+0x80/0x80
[ 9954.781450]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255]  zio_done+0x373/0x1d50 [zfs]
[ 9954.785410]  zio_execute+0xee/0x210 [zfs]
[ 9954.786588]  taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673]  ? wake_up_q+0x60/0x60
[ 9954.788571]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[ 9954.790079]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199]  kthread+0x134/0x150
[ 9954.792082]  ? set_kthread_struct+0x50/0x50
[ 9954.793189]  ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[ 9954.795535]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.798669] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096]  __schedule+0x2d1/0x870
[ 9954.801972]  ? __wake_up_common+0x7a/0x190
[ 9954.802963]  schedule+0x55/0xf0
[ 9954.803884]  schedule_timeout+0x19f/0x320
[ 9954.804837]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932]  ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959]  io_schedule_timeout+0x19/0x40
[ 9954.807989]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110]  ? finish_wait+0x80/0x80
[ 9954.810068]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023]  spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110]  txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643]  thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709]  kthread+0x134/0x150
[ 9954.822590]  ? set_kthread_struct+0x50/0x50
[ 9954.823584]  ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120
seconds.
[ 9954.825781]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.828871] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280]  __schedule+0x2d1/0x870
[ 9954.832159]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396]  schedule+0x55/0xf0
[ 9954.834286]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291]  ? finish_wait+0x80/0x80
[ 9954.836235]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838]  zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034]  zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367]  ? __list_add+0x12/0x30 [zfs]
[ 9954.843496]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729]  do_fsync+0x38/0x70
[ 9954.852585]  __x64_sys_fsync+0x10/0x20
[ 9954.853486]  do_syscall_64+0x5b/0x1b0
[ 9954.854416]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120
seconds.
[ 9954.867490]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.870571] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947]  __schedule+0x2d1/0x870
[ 9954.873844]  schedule+0x55/0xf0
[ 9954.874716]  schedule_timeout+0x19f/0x320
[ 9954.875645]  ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722]  io_schedule_timeout+0x19/0x40
[ 9954.877677]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822]  ? finish_wait+0x80/0x80
[ 9954.879694]  __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763]  zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865]  dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074]  dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687]  zfs_write+0x581/0xe20 [zfs]
[ 9954.887822]  ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862]  ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764]  zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911]  new_sync_write+0x112/0x160
[ 9954.896881]  vfs_write+0xa5/0x1b0
[ 9954.897701]  ksys_write+0x4f/0xb0
[ 9954.898569]  do_syscall_64+0x5b/0x1b0
[ 9954.899417]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120
seconds.
[ 9954.912381]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.915434] task:fio             state:D stack:0
pid:1055504
ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773]  __schedule+0x2d1/0x870
[ 9954.918648]  ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831]  schedule+0x55/0xf0
[ 9954.920717]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704]  ? finish_wait+0x80/0x80
[ 9954.922639]  zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940]  zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306]  zfs_write+0x703/0xe20 [zfs]
[ 9954.926406]  zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687]  ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821]  zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028]  new_sync_write+0x112/0x160
[ 9954.930913]  vfs_write+0xa5/0x1b0
[ 9954.931758]  ksys_write+0x4f/0xb0
[ 9954.932666]  do_syscall_64+0x5b/0x1b0
[ 9954.933544]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP
0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI:
0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09:
0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12:
000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15:
0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120
seconds.
[ 9954.946819]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 9954.949959] task:fio             state:D stack:0
pid:1055505
ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417]  __schedule+0x2d1/0x870
[ 9954.953393]  ? finish_wait+0x3e/0x80
[ 9954.954315]  schedule+0x55/0xf0
[ 9954.955212]  cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211]  ? finish_wait+0x80/0x80
[ 9954.957159]  zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343]  zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524]  zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626]  zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763]  do_fsync+0x38/0x70
[ 9954.962638]  __x64_sys_fsync+0x10/0x20
[ 9954.963520]  do_syscall_64+0x5b/0x1b0
[ 9954.964470]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP
0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI:
0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09:
0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12:
0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15:
0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than
120
seconds.
[10077.649541]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.652782] task:z_wr_int        state:D stack:0
pid:1051869
ppid:2      flags:0x80004080
[10077.654420] Call Trace:
[10077.655267]  __schedule+0x2d1/0x870
[10077.656179]  ? free_one_page+0x204/0x530
[10077.657192]  schedule+0x55/0xf0
[10077.658004]  cv_wait_common+0x16d/0x280 [spl]
[10077.659018]  ? finish_wait+0x80/0x80
[10077.660013]  dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396]  dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617]  zio_done+0x373/0x1d50 [zfs]
[10077.663783]  zio_execute+0xee/0x210 [zfs]
[10077.664921]  taskq_thread+0x205/0x3f0 [spl]
[10077.665982]  ? wake_up_q+0x60/0x60
[10077.666842]  ? zio_execute_stack_check.constprop.1+0x10/0x10
[zfs]
[10077.668295]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360]  kthread+0x134/0x150
[10077.670191]  ? set_kthread_struct+0x50/0x50
[10077.671209]  ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than
120
seconds.
[10077.673467]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.676612] task:txg_sync        state:D stack:0
pid:1051894
ppid:2      flags:0x80004080
[10077.678288] Call Trace:
[10077.679024]  __schedule+0x2d1/0x870
[10077.679948]  ? __wake_up_common+0x7a/0x190
[10077.681042]  schedule+0x55/0xf0
[10077.681899]  schedule_timeout+0x19f/0x320
[10077.682951]  ? __next_timer_interrupt+0xf0/0xf0
[10077.684005]  ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085]  io_schedule_timeout+0x19/0x40
[10077.686080]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227]  ? finish_wait+0x80/0x80
[10077.688123]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300]  dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435]  ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997]  spa_sync+0x362/0x8f0 [zfs]
[10077.695112]  txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639]  thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687]  kthread+0x134/0x150
[10077.700567]  ? set_kthread_struct+0x50/0x50
[10077.701502]  ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120
seconds.
[10077.703697]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.706780] task:fio             state:D stack:0
pid:1055501
ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231]  __schedule+0x2d1/0x870
[10077.710190]  ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368]  schedule+0x55/0xf0
[10077.712286]  cv_wait_common+0x16d/0x280 [spl]
[10077.713316]  ? finish_wait+0x80/0x80
[10077.714262]  zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566]  zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878]  zfs_get_data+0x566/0x810 [zfs]
[10077.718032]  zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234]  zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413]  ? __list_add+0x12/0x30 [zfs]
[10077.721525]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708]  ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931]  zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273]  zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438]  zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586]  zfs_fsync+0x66/0x90 [zfs]
[10077.728675]  zpl_fsync+0xe5/0x140 [zfs]
[10077.729755]  do_fsync+0x38/0x70
[10077.730607]  __x64_sys_fsync+0x10/0x20
[10077.731482]  do_syscall_64+0x5b/0x1b0
[10077.732415]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP
0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293
ORIG_RAX:
000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX:
00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI:
0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09:
0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12:
0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15:
000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120
seconds.
[10077.745505]       Tainted: P           OE     -------- -  -
4.18.0-553.5.1.el8_10.x86_64 #1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[10077.748642] task:fio             state:D stack:0
pid:1055502
ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011]  __schedule+0x2d1/0x870
[10077.751915]  schedule+0x55/0xf0
[10077.752811]  schedule_timeout+0x19f/0x320
[10077.753762]  ? __next_timer_interrupt+0xf0/0xf0
[10077.754824]  io_schedule_timeout+0x19/0x40
[10077.755782]  __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922]  ? finish_wait+0x80/0x80
[10077.757788]  __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845]  zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941]  dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144]  dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327]  dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749]  zfs_write+0x581/0xe20 [zfs]
[10077.765825]  ? iov_iter_get_pages+0xe9/0x390
[10077.766842]  ? trylock_page+0xd/0x20 [zfs]
[10077.767956]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189]  ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343]  zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570]  ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674]  zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834]  new_sync_write+0x112/0x160
[10077.774805]  vfs_write+0xa5/0x1b0
[10077.775634]  ksys_write+0x4f/0xb0
[10077.776526]  do_syscall_64+0x5b/0x1b0
[10077.777386]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP
0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293
ORIG_RAX:
0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI:
0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09:
0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12:
00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15:
000055e4d1f13ae8

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
2024-09-04 17:08:56 -06:00
Brian Atkinson d93a3febba Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated  TXG is synced.
For both O_DIRECT read and write request the offset and requeset sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).

For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.

For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.

For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.

Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
          any data in the user buffers and written directly down to the
	  VDEV disks is guaranteed to not change. There is no concern
	  with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
        protection. Because of this, if the user decides to manipulate
	the page contents while the write operation is occurring, data
	integrity can not be guaranteed. However, there is a module
	parameter `zfs_vdev_direct_write_verify` that contols the
	if a O_DIRECT writes that can occur to a top-level VDEV before
	a checksum verify is run before the contents of the I/O buffer
        are committed to disk. In the event of a checksum verification
	failure the write will return EIO. The number of O_DIRECT write
	checksum verification errors can be observed by doing
	`zpool status -d`, which will list all verification errors that
	have occurred on a top-level VDEV. Along with `zpool status`, a
	ZED event will be issues as `dio_verify` when a checksum
	verification error occurs.

A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
	   the request as a buffered IO request.
standard - Follows the alignment restrictions  outlined above for
	   write/read IO requests when the O_DIRECT flag is used.
always   - Treats every write/read IO request as though it passed
           O_DIRECT and will do O_DIRECT if the alignment restrictions
	   are met otherwise will redirect through the ARC. This
	   property will not allow a request to fail.

There is also a module paramter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
paramter to 0, it mimics as if the  direct dataset property is set to
disabled.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
2024-09-04 17:06:56 -06:00
Don Brady d4d79451cb Add DDT prune command
Requires the new 'flat' physical data which has the start
time for a class entry.

The amount to prune can be based on a target percentage of
the unique entries or based on the age (i.e., every entry
older than N days).

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16277
2024-09-04 14:17:02 -07:00
Mateusz Piotrowski 6be8bf5552
zpool: Provide GUID to zpool-reguid(8) with -g (#16239)
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.

This change also adds some general tests for zpool-reguid(8).

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-08-26 09:27:24 -07:00
Rob Norris f62e6e1f98 compress: change zio_compress API to use ABDs
This commit changes the frontend zio_compress_data and
zio_decompress_data APIs to take ABD points instead of buffer pointers.

All callers are updated to match. Any that already have an appropriate
ABD nearby now use it directly, while at the rest we create an one.

Internally, the ABDs are passed through to the provider directly.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris d3c12383c9 compress: change compression providers API to use ABDs
This commit changes the provider compress and decompress API to take ABD
pointers instead of buffer pointers for both data source and
destination. It then updates all providers to match.

This doesn't actually change the providers to do chunked compression,
just changes the API to allow such an update in the future. Helper
macros are added to easily adapt the ABD functions to their buffer-based
implementations.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris 522816498c compress: standardise names of compression functions
This is mostly to make searching easier.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris dd0c08f9c6 compress: remove unused abd compress prototype
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris e119483a95 compress: remove zio_decompress_data_buf
Nothing uses it anymore!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris ba2209ec9e abd_get_from_buf_struct: wrap existing buf with ABD stored on stack
This allows a simple "wrapping" ABD for an existing linear buffer to be
allocated on the stack, avoiding an allocation.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris 5b9e695392 abd_os: break out platform-specific header parts
Removing the platform #ifdefs from shared headers in favour of
per-platform headers. Makes abd_t much leaner, among other things.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:18 -07:00
Rob Norris 7a5b4355e2 abd_os: split userspace and Linux kernel code
The Linux abd_os.c serves double-duty as the userspace scatter abd
implementation, by carrying an emulation of kernel scatterlists. This
commit lifts common and userspace-specific parts out into a separate
abd_os.c for libzpool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:13 -07:00
Rob Norris b3f4e4e1ec abd: remove ABD_FLAG_ZEROS
Nothing ever checks it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:36:24 -07:00
Rob Norris db40fe4cf6 spl-taskq: per-taskq kstats
This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.

These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.

To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes #16171
2024-08-19 09:50:35 -07:00
Rob Norris 0d2707815d ddt: lookup and log stats
Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:51 -07:00
Rob Norris a1902f4950 ddt: block scan until log is flushed, and flush aggressively
The dedup log does not have a stable cursor, so its not possible to
persist our current scan location within it across pool reloads.
Beccause of this, when walking (scanning), we can't treat it like just
another source of dedup entries.

Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.

Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:43 -07:00
Rob Norris cd69ba3d49 ddt: dedup log
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.

The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.

All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:35 -07:00
Rob Norris 27e9cb5f80 ddt: cleanup the stats & histogram code
Both the API and the code were kinda mangled and I was really struggling
to follow it. The worst offender was the old ddt_stat_add(); after
fixing it up the rest of the changes are mostly knock-on effects and
targets of opportunity.

Note that the old ddt_stat_add() was safe against overflows - it could
produce crazy numbers, but the compiler wouldn't do anything stupid. The
assertions in ddt_stat_sub() go a lot of the way to protecting against
this; getting in a position where overflows are a problem is definitely
a programming error.

Also expanding ddt_stat_add() and ddt_histogram_empty() produces less
efficient assembly. I'm not bothered about this right now though; these
should not be hot functions, and if they are we'll optimise them later.
If we have to go back to the old form, we'll comment it like crazy.

Finally, I've removed the assertion that the bucket will never be
negative, as it will soon be possible to have entries with zero
refcounts: an entry for a block that is no longer on the pool, but is on
the log waiting to be synced out. It might be better to have a separate
bucket for these, since they're still using real space on disk, but
ultimately these stats are driving UI, and for now I've chosen to keep
them matching how they've looked in the past, as well as match the
operators mental model - pool usage is managed elsewhere.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:02:56 -07:00
Rob Norris f4aeb23f52 ddt: add "flat phys" feature
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:39 -07:00
Rob Norris 0ba5f503c5 ddt: slim down ddt_entry_t
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:31 -07:00
Rob Norris 4d686c3da5 ddt: introduce lightweight entry
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:22 -07:00
Rob Norris d17ab631a9 ddt: rework access to phys array slots
The "flat phys" feature will use only a single phys slot for all
entries, which means the old "single", "double" etc naming now makes no
sense, and more importantly, means that choosing the right slot for a
given block pointer will depend on how many slots are in use for a given
DDT.

This removes the old names, and adds accessor macros to decouple
specific phys array indexes from any particular meaning.

(These macros look strange in isolation, mainly in the way they take the
ddt_t* as an arg but don't use it. This is mostly a separate commit to
introduce the concept to the reader before the "flat phys" commit
extends it).

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:02 -07:00
Rob Norris d63f5d7e50 zdb: rework DDT block count and leak check to just count the blocks
The upcoming dedup features break the long held assumption that all
blocks on disk with a 'D' dedup bit will always be present in the DDT,
or will have the same set of DVA allocations on disk as in the DDT.

If the DDT is no longer a complete picture of all the dedup blocks that
will be and should be on disk, then it does us no good to walk and prime
it up front, since it won't necessarily match up with every block we'll
see anyway.

Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.

The DDT and BRT are moved ahead of the space accounting. This will
become important for the "flat" feature, which may need to count a
modified version of the block.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15892
2024-08-16 12:01:41 -07:00
Rob Norris db2b1fdb79 ddt: add FDT feature and support for legacy and new on-disk formats
This is the supporting infrastructure for the upcoming dedup features.

Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.

The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.

This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. The is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.

This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.

For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.

ddt_configure() is expected to determine the version and flags features
currently in operation based on whether or not the fast_dedup feature is
enabled, and from what's available on disk. In this way, its possible to
support both old and new tables.

This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later and allows the
possibility that newer features could be applied to existing dedup
tables.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15892
2024-08-16 11:58:59 -07:00
Rob Norris 3abffc8781 Linux 6.11: add compat macro for page_mapping()
Since the change to folios it has just been a wrapper anyway. Linux has
removed their wrapper, so we add one.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400
2024-08-13 17:47:18 -07:00
Rob Norris 7e98d30f46 Linux 6.11: get backing_dev_info through queue gendisk
It's no longer available directly on the request queue, but its easy to
get from the attached disk.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400
2024-08-13 17:46:49 -07:00
Rob Norris e95b732e49 Linux 6.11: enable queue flush through queue limits
In 6.11 struct queue_limits gains a 'features' field, where, among other
things, flush and write-cache are enabled. Detect it and use it.

Along the way, the blk_queue_set_write_cache() compat wrapper gets a
little cleanup. Since both flags are alway set together, its now a
single bool. Also the very very ancient version that sets q->flush_flags
directly couldn't actually turn it off, so I've fixed that. Not that we
use it, but still.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400
2024-08-13 17:46:41 -07:00
Rob Norris b0bf14cdb5
abd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero()
It's now the caller's responsibility do special handling for holes if
that's something it wants.

This also makes zio_compress_data() and zio_decompress_data() properly
the inverse of each other.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Jason Lee <jasonlee@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16326
2024-08-09 14:30:26 -07:00
Mateusz Piotrowski 24e6585e76 libzfs.h: Set ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN to ZAP_MAXVALUELEN
So far, the values of ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN were equal to
MAXPATHLEN, which is 1024 on FreeBSD and 4096 on Linux. This wasn't
ideal. Some of the surprising outcomes of this implementation are:

1. When creating a pool user property with zpool-set(8), libzfs makes
   sure that the length of the property's value is less than
   ZFS_MAXPROPLEN. However, the ZFS kernel module does not do that.
   Instead, it checks the length against ZAP_MAXVALUELEN. As a result,
   it is possible to create a property the length of which is going to
   be larger than zpool(8) is ready to read.
2. A pool user property created on Linux is too big to be read on
   FreeBSD.

This change sets both ZFS_MAXPROPLEN and ZPOOL_MAXPROPLEN to
ZAP_MAXVALUELEN, which is 8192 at the moment.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16248
2024-08-08 15:23:40 -07:00
Ameer Hamza 5536c0dee2
Sync AUX label during pool import
Spare and l2cache vdev labels are not updated during import. Therefore,
if disk paths are updated between pool export and import, the AUX label
still shows the old paths. This patch syncs the AUX label
during import to show the correct path information.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15817
2024-08-08 15:16:46 -07:00
Tony Hutter 02a9f7fed7 JSON: Fix class values for mirrored special vdevs
This fixes things so mirrored special vdevs report themselves as
"class=special" rather than "class=normal".

This happens due to the way the vdev nvlists are constructed:

mirrored special devices - The 'mirror' vdev has allocation bias as
"special" and it's leaf vdevs are "normal"

single or RAID0 special devices - Leaf vdevs have allocation bias as
"special".

This commit adds in code to check if a leaf's parent is a "special"
vdev to see if it should also report "special".

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16217
2024-08-06 12:47:58 -07:00
Umer Saleem 959e963c81 JSON output support for zpool status
This commit adds support for zpool status command to displpay status
of ZFS pools in JSON format using '-j' option. Status information is
collected in nvlist which is later dumped on stdout in JSON format.
Existing options for zpool status work with '-j' flag. man page for
zpool status is updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217
2024-08-06 12:47:10 -07:00
Umer Saleem eb2b824bde JSON output support for zpool get
This commit adds support for zpool get command to output the list of
properties for ZFS Pools and VDEVS in JSON format using '-j' option.
Man page for zpool get is updated to include '-j' option.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217
2024-08-06 12:47:00 -07:00
Umer Saleem aa15b60e58 JSON output support for zfs version and zfs get
This commit adds support for JSON output for zfs version and zfs get
commands. '-j' flag can be used to get output in JSON format.

Information is collected in nvlist objects which is later printed in
JSON format. Existing options that work for zfs get and zfs version
also work with '-j' flag.

man pages for zfs get and zfs version are updated accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16217
2024-08-06 12:45:45 -07:00
Rob Norris 670147be53 zvol: ensure device minors are properly cleaned up
Currently, if a minor is in use when we try to remove it, we'll skip it
and never come back to it again. Since the zvol state is hung off the
minor in the kernel, this can get us into weird situations if something
tries to use it after the removal fails. It's even worse at pool export,
as there's now a vestigial zvol state with no pool under it. It's
weirder again if the pool is subsequently reimported, as the zvol code
(reasonably) assumes the zvol state has been properly setup, when it's
actually left over from the previous import of the pool.

This commit attempts to tackle that by setting a flag on the zvol if its
minor can't be removed, and then checking that flag when a request is
made and rejecting it, thus stopping new work coming in.

The flag also causes a condvar to be signaled when the last client
finishes. For the case where a single minor is being removed (eg
changing volmode), it will wait for this signal before proceeding.
Meanwhile, when removing all minors, a background task is created for
each minor that couldn't be removed on the spot, and those tasks then
wake and clean up.

Since any new tasks are queued on to the pool's spa_zvol_taskq,
spa_export_common() will continue to wait at export until all minors are
removed.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14872
Closes #16364
2024-08-06 12:08:14 -07:00
Rob Norris 9b9a3934ad zvol_impl: document and tidy flags
ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16364
2024-08-06 12:06:46 -07:00
Rob Norris 6c82951d11
FreeBSD: remove support for FreeBSD < 13.0-RELEASE (#16372)
This includes the last 12.x release (now EOL) and 13.0 development
versions (<1300139).

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-08-05 16:56:45 -07:00
Jitendra Patidar d60debbf59
Fix sa_add_projid to lookup and update SA_ZPL_DXATTR (avoid DXATTR loss) (#16288)
sa_add_projid() gets called via zfs_setattr() for setting project id
on old file/dir, which were created before upgrading to project quota
feature. This function does lookup for all possible SA and update them
all together along with project ID at needed fixed offset. But its
missing lookup and update of SA_ZPL_DXATTR, effectively it losses
SA_ZPL_DXATTR.

Closes #16287
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
2024-07-31 18:41:49 -07:00
Alexander Motin d4b5517ef9
Linux: Report reclaimable memory to kernel as such (#16385)
Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to
mark memory allocations that can be freed via shinker calls.  It
should allow kernel to tune and group such allocations for lower
memory fragmentation and better reclamation under pressure.

This patch marks as reclaimable most of ARC memory, directly
evictable via ZFS shrinker, plus also dnode/znode/sa memory,
indirectly evictable via kernel's superblock shrinker.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
2024-07-30 11:40:47 -07:00
Rob Norris d54d0fff39 dnode: allow storage class to be overridden by object type
spa_preferred_class() selects a storage class based on (among other
things) the DMU object type. This only works for old-style object types
that match only one specific kind of thing. For DMU_OTN_ types we need
another way to signal the storage class.

This commit allows the object type to be overridden in the IO policy for
the purposes of choosing a storage class. It then adds the ability to
set the storage type on a dnode hold, such that all writes generated
under that hold will get it.

This method has two shortcomings:

- it would be better if we could "name" a set of storage class
  preferences rather than it being implied by the object type.
- it would be better if this info were stored in the dnode on disk.

In the absence of those things, this seems like the smallest possible
change.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15894
2024-07-29 17:05:41 -07:00
Rob Norris e26b3771ee spa_preferred_class: pass the entire zio
Rather than picking out specific values out of the properties, just pass
the entire zio in, to make it easier in the future to use more of that
info to decide on the storage class.

I would have rathered just pass io_prop in, but having spa.h include
zio.h gets a bit tricky.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15894
2024-07-29 17:05:08 -07:00
Alexander Motin ed87d456e4 Skip dnode handles use when not needed
Neither FreeBSD nor Linux currently implement kmem_cache_set_move(),
which means dnode_move() is never called.  In such situation use of
dnode handles with respective locking to access dnode from dbuf is
a waste of time for no benefit.

This patch implements optional simplified code for such platforms,
saving at least 3 dnode lock/dereference/unlock per dbuf life cycle.
Originally I hoped to drop the handles completely to save memory,
but they are still used in dnodes allocation code, so left for now.

Before this change in CPU profiles of some workloads I saw 4-20% of
CPU time spent in zrl_add_impl()/zrl_remove(), which are gone now.

Reviewed-by: Rob Wing <rob.wing@klarasystems.com
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #16374
2024-07-29 14:48:12 -07:00
Allan Jude 62e7d3c89e
ddt: add support for prefetching tables into the ARC
This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Closes #15890
2024-07-26 09:16:18 -07:00
Rob Norris 7ddc1f737f
zil: add stats for commit failure/fallback (#16315)
There's no good way to tell when a ZIL commit fails and falls back to a
transaction sync, other than perhaps a throughput drop. This adds
counters so we can see when it happens and why.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-07-25 16:53:59 -07:00
Alexander Motin 55427add3c
Several improvements to ARC shrinking (#16197)
- When receiving memory pressure signal from OS be more strict
trying to free some memory.  Otherwise kernel may come again and
request much more.  Return as result how much arc_c was actually
reduced due to this request, that may be less than requested.
 - On Linux when receiving direct reclaim from some file system
(that may be ZFS) instead of ignoring request completely, just
shrink the ARC, but do not wait for eviction.  Waiting there may
cause deadlock.  Ignoring it as before may put extra pressure on
other caches and/or swap, and cause OOM if nothing help.  While
not waiting may result in more ARC evicted later, and may be too
late if OOM killer activate right now, but I hope it to be better
than doing nothing at all.
 - On Linux set arc_no_grow before waiting for reclaim, not after,
or it may grow back while we are waiting.
 - On Linux add new parameter zfs_arc_shrinker_seeks to balance
ARC eviction cost, relative to page cache and other subsystems.
 - Slightly update Linux arc_set_sys_free() math for new kernels.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-07-25 10:31:14 -07:00
Allan Jude c7ada64bb6
ddt: dedup table quota enforcement
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
2024-07-25 09:47:36 -07:00