1. When testing out installing a VM with virtual manager on Linux and a
dataset with direct=always, there an ASSERT failure in
abd_alloc_from_pages(). Originally zfs_setup_direct() did an
alignment check of the UIO using SPA_MINBLOCKSIZE with
zfs_uio_aligned(). The idea behind this was maybe the page alignment
restriction could be changed to use ashift as the alignment check in
the future. Howver, this diea never came to be. The alignment
restrictions for Direct I/O are based on PAGE_SIZE. Updating the
check zfs_setup_direct() for the UIO to use PAGE_SIZE fixed the
issue.
2. Updated other alignment check in dmu_read_impl() to also use
PAGE_SIZE.
3. As a consequence of updating the UIO alignment checks the ZTS test
case dio_unaligned_filesize began to fail. This is because there was
no way to detect reading past the end of the file before issue EINVAL
in the ZPL and VOPs layers in FreeBSD. This was resolved by moving
zfs_setup_direct() into zfs_write() and zfs_read(). This allows for
other error checking to take place before checking any Direct I/O
limitations. Updating the call site of zfs_setup_direct() did require
a bit of changes to the logic in that function. In particular Direct
I/O can just be avoid altogether depending on the checks in
zfs_setup_direct() and there is no reason to return EAGAIN at all.
4. After moving zfs_setup_direct() into zfs_write() and zfs_read(),
there was no reason to call zfs_check_direct_enabled() in the ZPL
layer in Linux or in the VNOPS layer of FreeBSD. This function was
completely removed. This allowed for much of the code in both those
layers to return to their original code.
5. Upated the checksum verify module parameter for Direct I/O writes to
only be a boolean and return EIO in the event a checksum verify
failure occurs. By default, this module parameter is set to 1 for
Linux and 0 for FreeBSD. The module parameter has been changed to
zfs_vdev_direct_write_verify. There are still counters on the
top-level VDEV for checksum verify failures, but this could be
removed. It would still be good to to leave the ZED event dio_verify
for checksum failures as a notification that an application was
manipulating the contents of a buffer after issuing that buffer with
for I/O using Direct I/O. As part of this cahnge, man pages were
updated, the ZTS test case dio_writy_verify was updated, and all
comments relating to the module parameter were udpated as well.
6. Updated comments in dio_property ZTS test to properly reflect that
stride_dd is being called with check_write and check_read.
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Updating code base on PR code comments. I adjusted the following parts
of the code base on the comments:
1. Updated zfs_check_direct_enabled() so it now just returns an error.
This removed the need for the added enum and cleaned up the code.
2. Moved acquiring the rangelock from zfs_fillpage() out to
zfs_getpage(). This cleans up the code and gets rid of the need to
pass a boolean into zfs_fillpage() to conditionally gra the
rangelock.
3. Cleaned up the code in both zfs_uio_get_dio_pages() and
zfs_uio_get_dio_pages_iov(). There was no need to have wanted and
maxsize as they were the same thing. Also, since the previous
commit cleaned up the call to zfs_uio_iov_step() the code is much
cleaner over all.
4. Removed dbuf_get_dirty_direct() function.
5. Unified dbuf_read() to account for both block clones and direct I/O
writes. This removes redundant code from dbuf_read_impl() for
grabbingthe BP.
6. Removed zfs_map_page() and zfs_unmap_page() declarations from Linux
headers as those were never called.
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and requeset sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that contols the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module paramter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
paramter to 0, it mimics as if the direct dataset property is set to
disabled.
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error. The range may be for an entire file. While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.
As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.
Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning. This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned. This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller. However, the destination file range is left in an undefined
state.
A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15728Closes#15842
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).
The whole design of block cloning is documented in module/zfs/brt.c.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#13392
In FreeBSD the struct uio was just a typedef to uio_t. In order to
extend this struct, outside of the definition for the struct uio, the
struct uio has been embedded inside of a uio_t struct.
Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11438
The oid comes from the znode we are already passing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11176
Move zfs_get_data() in to platform-independent code. The only
platform-specific aspect of it is the way we release an inode
(Linux) / vnode_t (FreeBSD). I am not aware of a platform that
could be supported by ZFS that couldn't implement zfs_rele_async
itself. It's sibling zvol_get_data already is platform-independent.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#10979
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11125
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD. With a little refactoring they can be
moved to the common code which is what is done by this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11078
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.
As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context. No functional change, the source has been
refactored and the relevant #include's updated.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9198
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.
Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:
* Requires that callers intending to use this extended interface
set the FMODE_KABI_ITERATE flag on the file structure when
opening the directory.
* Adds the fops.iterate() method to the end of the structure,
without removing fops.readdir().
This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected. Instead fallback to
the fops.readdir() interface which will be available.
Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7460Closes#7463
Several functions were renamed when ZFS was originally ported to
Linux. Revert the code to the original names to minimize the
delta with upstream OpenZFS.
zfs_sb_teardown -> zfsvfs_teardown
zfs_sb_create -> zfsvfs_create
zfs_sb_setup -> zfsvfs_setup
zfs_sb_free -> zfsvfs_free
get_zfs_sb -> getzfsvfs
zfs_sb_hold -> zfsvfs_hold
zfs_sb_rele -> zfsvfs_rele
zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only)
zfs_sb_prune -> zfs_prune (Linux only)
Align the zfs_vnops.h and zfs_vfsops.h with upstream as much
as possible. Several prototypes were removed and those that
remain were reordered.
Move the EXPORT_SYMBOL lines to the end of the source files
for consistency with the other source files.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on
supported filesystem. It's basically doing open(2) and unlink(2) atomically.
The filesystem support is added through i_op->tmpfile. We basically copy the
create operation except we get rid of the link and name related stuff and add
the new node to unlinked set.
We also add support for linkat(2) to link tmpfile. However, since all previous
file operation will skip ZIL, we force a txg_wait_synced to make sure we are
sync safe.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and
pn_free() functions must be implemented. The existing illumos
implementation were used for this purpose.
The `flags` argument which was used in places wrapped by the
HAVE_PN_UTILS condition has beed added back to zfs_remove() and
zfs_link() functions. This removes a small point of divergence
between the ZoL code and upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4522
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks. When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.
To handle this the code was reworked to use the new ->iterate
interface. Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.
Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions. Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.
Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function. The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names. Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1653
Issue #1591
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.
Tested with xfstests 285 and 286. All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.
Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.
An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1384
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed. That change
accidentally resulted in unexpected ctime updates which upset tools
like git. That was always a horrible hack and I'm happy it will
never make it in to a tagged release.
The right fix is something I initially resisted doing because I
was worried about the additional overhead. However, in hindsight
the overhead isn't as bad as I feared.
This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied. We leverage this
callback to keep the znode SAs strictly in sync with the inode.
However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime. This will cover the callpath of most concern to us.
->filemap_page_mkwrite->file_update_time->update_time->
mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764Closes#1140
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct. In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache. This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.
Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds. However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.
This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean. When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction. The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.
This process is normally entirely asynchronous and will be repeated
for every dirty page. This may initially sound inefficient but most
of these pages will end up in a few txgs. That means when they are
eventually written to disk they should be nicely batched. However,
there is room for improvement. It may still be desirable to batch
up the pages in to larger writes for the dmu. This would reduce
the number of callbacks and small 4k buffer required by the ARC.
Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper. However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date. Unfortunately, this isn't
the case for the blksize, blocks, and atime fields. At the
moment the authoritative values are still stored in the znode.
This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode. At some
latter date we should be able to strictly use the inode values
and further improve performance.
The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex. This overhead is
unavoidable until the inode is kept strictly up to date. The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation. However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.
=================== Performance Tests ========================
This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop. The test results
show the zfs stat(2) performance is now only 22% slower than
ext4. This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.
filesystem | test-1 test-2 test-3 | average | times-ext4
--------------+-------------------------+---------+-----------
ext4 | 7.785s 7.899s 7.284s | 7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat | 9.224s 9.398s 9.485s | 9.369s | 1.223x
The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files. The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache. As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.
A little surprisingly the zfs cold cache performance is better
than ext4. This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times. By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.
filesystem | cold | hot
--------------+---------+--------
ext4 | 13.318s | 1.040s
zfs-0.6.0-rc4 | 4.982s | 1.762s
zfs-faststat | 4.933s | 1.345s
Under Linux the VFS handles virtually all of the mmap() access
checks. Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.
However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored. Note, currently the code
to modify these attributes has not been implemented under Linux.
* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
attributes are set a file may not be mmaped with write access.
* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
read or exec access.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity. This was functional but not 100%
correct with the expected ZFS sematics. Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.
1) The ZFS_APPENDONLY file attribute is once again honored. While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.
2) Minimal virus scan file attribute hooks were added. Once again
this support in disabled but the infrastructure is back in place.
3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log. Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go. One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible. They
would be replaced with their Linux equivalents. This would not only
be good for performance, but for the general readability and health of
the code. The Solaris and Linux VFS are different beasts and should
be treated as such. Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.
This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t. There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR. But it didn't look that hard to come
back soon and replace it all with a native Linux type.
However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.
Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things. This commit reverts many of
my previous commits which removed xvattr related code. It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.
The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working. However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types. For now that's
a price I'm willing to pay. Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.
Closes#111
With the removal of the minimal xvattr support from the spl this
support needs to be replaced in the zfs package. This is fairly
easily accomplished by directly adding portions of the sys/vnode.h
header from OpenSolaris. These xvattr additions have been placed
in the sys/xvattr.h header file and included as needed where simply
a sys/vnode.h was included before.
In additon to the xvattr types and helper macros two functions
were also included. The xva_init() and xva_getxoptattr() functions
were included as static inline functions in xvattr.h. They are
simple enough and it was simpler to place them here rather than
in their own .c file.
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it. As such I'll just summerize the intent
of the changes which are all (or partly) in this commit. Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants. More specifically:
1) Replace all instances of zfsvfs_t with zfs_sb_t. While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this. While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.
2) Replace vnode_t with struct inode. The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one. There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date. The code has been updated
to remove all need for this type.
3) Replace all vtype_t's with umode types. Along with this shift
all uses of types to mode bits. The Solaris code would pass a
vtype which is redundant with the Linux mode. Just update all the
code to use the Linux mode macros and remove this redundancy.
4) Remove using of vn_* helpers and replace where needed with
inode helpers. The big example here is creating iput_aync to
replace vn_rele_async. Other vn helpers will be addressed as
needed but they should be be emulated. They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.
5) Update znode alloc/free code. Under Linux it's common to
embed the inode specific data with the inode itself. This removes
the need for an extra memory allocation. In zfs this information
is called a znode and it now embeds the inode with it. Allocators
have been updated accordingly.
6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.
This will be the first and last of these to large to review commits.