1. When testing out installing a VM with virtual manager on Linux and a
dataset with direct=always, there was an ASSERT failure in
abd_alloc_from_pages(). Originally zfs_setup_direct() did an
alignment check of the UIO using SPA_MINBLOCKSIZE with
zfs_uio_aligned(). The idea behind this was that the page alignment
restriction might be changed to use ashift as the alignment check in
the future. However, this idea never came to be. The alignment
restrictions for Direct I/O are based on PAGE_SIZE. Updating the
check in zfs_setup_direct() for the UIO to use PAGE_SIZE fixed the
issue.
2. Updated the other alignment check in dmu_read_impl() to also use
PAGE_SIZE.
3. As a consequence of updating the UIO alignment checks, the ZTS test
case dio_unaligned_filesize began to fail. This is because there was
no way to detect reading past the end of the file before issuing
EINVAL in the ZPL and VNOPS layers in FreeBSD. This was resolved by
moving zfs_setup_direct() into zfs_write() and zfs_read(). This allows
other error checking to take place before checking any Direct I/O
limitations. Updating the call site of zfs_setup_direct() did require
a few changes to the logic in that function. In particular, Direct
I/O can just be avoided altogether depending on the checks in
zfs_setup_direct(), and there is no reason to return EAGAIN at all.
4. After moving zfs_setup_direct() into zfs_write() and zfs_read(),
there was no reason to call zfs_check_direct_enabled() in the ZPL
layer in Linux or in the VNOPS layer of FreeBSD. This function was
completely removed. This allowed much of the code in both of those
layers to return to its original form.
5. Updated the checksum verify module parameter for Direct I/O writes to
only be a boolean and return EIO in the event a checksum verify
failure occurs. By default, this module parameter is set to 1 for
Linux and 0 for FreeBSD. The module parameter has been renamed to
zfs_vdev_direct_write_verify. There are still counters on the
top-level VDEV for checksum verify failures, but these could be
removed. It would still be good to leave the ZED event dio_verify
for checksum failures as a notification that an application was
manipulating the contents of a buffer after issuing that buffer
for I/O using Direct I/O. As part of this change, man pages were
updated, the ZTS test case dio_write_verify was updated, and all
comments relating to the module parameter were updated as well.
6. Updated comments in dio_property ZTS test to properly reflect that
stride_dd is being called with check_write and check_read.
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write requests the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
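As a rough illustration of the PAGE_SIZE rule above, the sketch below
checks that both the offset and the size of a request are multiples of
the page size. The helper names and the explicit page_size argument are
hypothetical and chosen for this example only; they are not the
functions used in the ZFS code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when x is a multiple of align.
 * Assumes align is a non-zero power of two.
 */
static bool
sketch_is_aligned(uint64_t x, uint64_t align)
{
	return ((x & (align - 1)) == 0);
}

/*
 * Hypothetical helper: a request qualifies for O_DIRECT only when
 * both its file offset and its size are page aligned.
 */
static bool
sketch_dio_page_aligned(uint64_t offset, uint64_t size, uint64_t page_size)
{
	return (sketch_is_aligned(offset, page_size) &&
	    sketch_is_aligned(size, page_size));
}
```

With a 4096 byte page, a request at offset 8192 of size 4096 passes,
while a request at offset 512 does not.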
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
the request is block aligned and a cached copy of the buffer is in the
ARC, then it will be discarded from the ARC, forcing all further reads
to retrieve the data from disk.
For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. In the event
that the requested data is buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request, however, is not PAGE_SIZE aligned, EINVAL will
be returned as always, regardless of whether the file's contents are
mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls whether
a checksum verify is run on O_DIRECT writes to a top-level VDEV
before the contents of the I/O buffer are committed to disk.
In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issued as `dio_verify` when a checksum
verification error occurs.
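The idea behind the write verify can be sketched as follows: a
checksum is taken when the buffer is issued, recomputed before the
data is committed, and a mismatch is reported as EIO. This is only a
minimal sketch. The function names are hypothetical, and the running
sum used here is a stand-in, not the checksum ZFS computes in the ZIO
pipeline.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in checksum for illustration only (NOT the ZFS checksum):
 * a pair of running sums folded into one 64-bit value.
 */
static uint64_t
sketch_checksum(const uint8_t *buf, size_t len)
{
	uint64_t a = 0, b = 0;

	for (size_t i = 0; i < len; i++) {
		a += buf[i];
		b += a;
	}
	return ((b << 32) ^ a);
}

/*
 * Hypothetical verify step: returns 0 when the buffer still matches
 * the checksum taken at issue time, EIO when it was modified since.
 */
static int
sketch_write_verify(const uint8_t *buf, size_t len, uint64_t issued_cksum)
{
	return (sketch_checksum(buf, len) == issued_cksum ? 0 : EIO);
}
```

An application that changes a byte of the buffer between the two
checksum computations would cause the verify step in this sketch to
return EIO.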
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met, otherwise will redirect through the ARC. This
property value will not allow a request to fail.
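Putting the alignment rules and the three property values together,
the path selection described in this message can be sketched roughly
as below. All type and function names are hypothetical and exist only
for this example; the actual checks live in the ZFS read and write
paths and are not reproduced here. The sketch assumes non-zero
page_size and recordsize.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names mirroring the values of the `direct` property. */
typedef enum {
	SK_DIRECT_DISABLED,
	SK_DIRECT_STANDARD,
	SK_DIRECT_ALWAYS
} sk_direct_t;

/* Hypothetical outcome of the path selection. */
typedef enum {
	SK_PATH_BUFFERED,
	SK_PATH_DIRECT,
	SK_PATH_EINVAL
} sk_path_t;

/* Hypothetical description of a single read or write request. */
typedef struct {
	uint64_t offset;
	uint64_t size;
	bool o_direct;	/* caller passed O_DIRECT */
	bool is_write;
	bool mmapped;	/* file contents are mmap'ed */
} sk_req_t;

static sk_path_t
sk_choose_path(sk_direct_t prop, const sk_req_t *r, uint64_t page_size,
    uint64_t recordsize)
{
	bool page_ok = (r->offset % page_size == 0) &&
	    (r->size % page_size == 0);
	bool block_ok = (r->offset % recordsize == 0) &&
	    (r->size % recordsize == 0);

	/* disabled: O_DIRECT is accepted but silently ignored. */
	if (prop == SK_DIRECT_DISABLED)
		return (SK_PATH_BUFFERED);
	/* standard: only requests that passed O_DIRECT are considered. */
	if (prop == SK_DIRECT_STANDARD && !r->o_direct)
		return (SK_PATH_BUFFERED);
	/* Not page aligned: EINVAL, except that always never fails. */
	if (!page_ok) {
		return (prop == SK_DIRECT_ALWAYS ?
		    SK_PATH_BUFFERED : SK_PATH_EINVAL);
	}
	/* mmap'ed files fall back to the buffered paths. */
	if (r->mmapped)
		return (SK_PATH_BUFFERED);
	/* Writes must also be recordsize aligned to go direct. */
	if (r->is_write && !block_ok)
		return (SK_PATH_BUFFERED);
	return (SK_PATH_DIRECT);
}
```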
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. Setting this module
parameter to 0 behaves as if the direct dataset property were set to
disabled.
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
This fixes the instances of the "Multiplication result converted to
larger type" alert that CodeQL scanning found.
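For readers unfamiliar with this alert, the pattern it flags looks
roughly like the first function below: the product is computed in the
narrower type and can wrap before it is widened. The usual remedy,
shown in the second function, is to widen one operand before
multiplying. Both function names are hypothetical and are not taken
from the changed code; the sketch assumes a platform where int is
32 bits wide.

```c
#include <stdint.h>

/*
 * Flagged pattern: a * b is evaluated in 32 bits, so the result can
 * wrap around before being converted to the 64-bit return type.
 */
static uint64_t
sketch_mul_narrow(uint32_t a, uint32_t b)
{
	return (a * b);
}

/*
 * Typical fix: cast one operand first so that the multiplication
 * itself is carried out in 64 bits.
 */
static uint64_t
sketch_mul_wide(uint32_t a, uint32_t b)
{
	return ((uint64_t)a * b);
}
```

For a = b = 65536 the narrow version wraps to 0, while the wide
version returns 4294967296.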
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes #14094
This confers a >10x speedup on t/z-t/cmd builds (12s -> 1.1s),
gets rid of 23 redundant identical automake specs and gitignores,
and groups the binaries with their common headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13259