Commit Graph

7921 Commits

Author SHA1 Message Date
Brian Behlendorf f4ea75d492 Conserve stack in zfs_setattr()
Move 'tmpxvattr' from the stack to the heap to minimize stack
space usage.  This is enough to get us below the 1024 byte stack
frame warning.  That however is still a large stack frame and it
should be further reduced by moving the 'bulk' and 'xattr_bulk'
sa_bulk_attr_t variables to the heap in a future patch.
2011-03-02 14:18:58 -08:00
Brian Behlendorf 5484965ab6 Drop HAVE_XVATTR macros
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go.  One of these things was to eliminate
as many Solaris specific types from the ZPL layer as possible.  They
would be replaced with their Linux equivalents.  This would not only
be good for performance, but for the general readability and health of
the code.  The Solaris and Linux VFS are different beasts and should
be treated as such.  Most of the code remains common for constructing
transactions and such, but there are subtle and important differences
which need to be respected.

This policy went quite far for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t.  There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR.  But it didn't look that hard to come
back soon and replace it all with a native Linux type.

However, after going down this path with xvattr some distance it
became clear that this code was woven into the ZPL more deeply than I
thought.  In particular its hooks went very deep into the ZPL replay
code and replacing it would not be as easy as I originally thought.

Rather than continue pursuing replacing and removing this code I've
taken a step back and reevaluated things.  This commit reverts many of
my previous commits which removed xvattr related code.  It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.

The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working.  However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types.  For now that's
a price I'm willing to pay.  Once everything is completely functional
we can revisit the issue of removing the vattr_t/xvattr_t types.

Closes #111
2011-03-02 11:44:34 -08:00
Brian Behlendorf 321a498b95 Add xvattr support
With the removal of the minimal xvattr support from the spl this
support needs to be replaced in the zfs package.  This is fairly
easily accomplished by directly adding portions of the sys/vnode.h
header from OpenSolaris.  These xvattr additions have been placed
in the sys/xvattr.h header file and included as needed where simply
a sys/vnode.h was included before.

In addition to the xvattr types and helper macros two functions
were also included.  The xva_init() and xva_getxoptattr() functions
were included as static inline functions in xvattr.h.  They are
simple enough and it was simpler to place them here rather than
in their own .c file.
2011-03-02 11:43:50 -08:00
Brian Behlendorf 9623f736d9 Remove caller_context_t
Remove the remaining callers of caller_context_t.  This type has
been removed because it is not needed for the Linux port.
2011-03-02 11:35:35 -08:00
Brian Behlendorf 47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to provide the full xvattr
support.  By removing it from the spl we ensure no conflict
with the higher level packages.

This just leaves minimal vnode support for basic manipulation
of files.  This code does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removes the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf a4a1e1ecb4 Add TIMESPEC_OVERFLOW helper
Add the TIMESPEC_OVERFLOW helper macro to allow easy checking
of timespec overflow.
2011-03-02 11:34:43 -08:00
Darik Horn a23cc0a443 Add the zpool and filesystem versions
Print the supported zpool and filesystem versions at module load
time.  This change removes an ambiguity and adds information that
system administrators care about.  The phrase "ZFS pool version %s"
is the same as zpool upgrade -v so that the operator is familiar
with the message.

  ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5
  ZFS: Unloaded module v0.6.0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-28 09:46:23 -08:00
Brian Behlendorf 19c1eb829d Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress().  The test case simply takes
a 128k buffer, it compresses the buffer, it then uncompresses the
buffer, and finally it compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
2011-02-25 16:56:46 -08:00
Brian Behlendorf 5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() were in place, in reality the current implementation
was non-functional; it was merely compilable.

The critical missing component was to set up a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids the need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additionally ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf fdcd952b4d Fix set block scheduler warnings
There were two cases when attempting to set the vdev block device
scheduler which would cause console warnings.

The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler.  In
these cases attempting to set a scheduler is pointless and can
be safely skipped.

The second case is slightly more troubling.  We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible.  To handle that case we silently retry
up to three times before reporting the warning.

In all of the above cases the warning is harmless and at worst you
may see slightly different performance characteristics from one
or more of your vdevs.
2011-02-25 11:37:11 -08:00
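The retry behavior described in "Fix set block scheduler warnings" can be sketched in user space as follows.  This is only an illustration: the callback type, the function names, and the reading of "up to three times" as three attempts in total are assumptions, not code from the commit.

```c
#include <errno.h>

/* Hypothetical callback type: returns 0 on success or a negative errno. */
typedef int (*ex_set_sched_fn)(const char *device, const char *scheduler);

/* Assumed from "retry up to three times": three attempts in total. */
#define	EX_SCHED_MAX_TRIES	3

/*
 * Call set_fn until it succeeds, fails with something other than the
 * transient -EFAULT, or the attempt budget is used up.  Returns the
 * result of the last attempt so the caller can decide whether to warn.
 */
int
ex_set_scheduler_with_retry(ex_set_sched_fn set_fn, const char *device,
    const char *scheduler)
{
	int error = -EFAULT;
	int tries;

	for (tries = 0; tries < EX_SCHED_MAX_TRIES && error == -EFAULT; tries++)
		error = set_fn(device, scheduler);

	return (error);
}

/* Demonstration stub: fails with -EFAULT twice, then succeeds. */
int ex_fake_calls;

int
ex_fake_flaky_set(const char *device, const char *scheduler)
{
	(void) device;
	(void) scheduler;
	return (++ex_fake_calls < 3 ? -EFAULT : 0);
}
```

With the stub above, the third attempt succeeds and no warning would be reported; any other error code is returned immediately without retrying.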
Fajar A. Nugraha 4c0d8e50b9 Use udev to create /dev/zvol/[dataset_name] links
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.

Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
  (include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
  (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
  /dev/[dataset_name] symlink. For partitions on zvol, it will create
  /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
  cmd/zvol_id).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-25 09:43:19 -08:00
Darik Horn 61da501f9d Add the new blkdev_compat.h header to the DIST target.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-02-24 09:40:06 -08:00
Brian Behlendorf dc1d7665c5 Remove rdev packing
Remove custom code to pack/unpack dev_t's.  Under Linux all dev_t's
are an unsigned 32-bit value even on 64-bit platforms.  The lower
20 bits are used for the minor number and the upper 12 for the major
number.

This means if you're importing a pool from Solaris you may get strange
major/minor numbers.  But it doesn't really matter because even if
we add compatibility code to translate the encoded Solaris major/minor
they won't do you any good under Linux.  You will still need to
recreate the dev_t with a major/minor which maps to reserved major
numbers used under Linux.

Dropping this code also resolves 32-bit builds by removing the
offending 32-bit compatibility code.
2011-02-23 15:13:03 -08:00
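The dev_t layout described in "Remove rdev packing" (lower 20 bits for the minor number, upper 12 bits for the major number) can be sketched as below.  The ex_* names are hypothetical user-space stand-ins for illustration; they are not the kernel's own macros.

```c
#include <stdint.h>

/* Layout taken from the commit text: 20 minor bits, 12 major bits. */
#define	EX_MINORBITS	20
#define	EX_MINORMASK	((1U << EX_MINORBITS) - 1)

/* Combine a major and minor number into one 32-bit value. */
uint32_t
ex_mkdev(uint32_t major, uint32_t minor)
{
	return ((major << EX_MINORBITS) | (minor & EX_MINORMASK));
}

/* Extract the major number (upper 12 bits). */
uint32_t
ex_major(uint32_t dev)
{
	return (dev >> EX_MINORBITS);
}

/* Extract the minor number (lower 20 bits). */
uint32_t
ex_minor(uint32_t dev)
{
	return (dev & EX_MINORMASK);
}
```

For example, major 8 and minor 1 combine to 0x800001, and splitting that value returns 8 and 1 again.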
Brian Behlendorf 99c564bc48 Use correct ASSERT3* variant
ASSERT3P should be used instead of ASSERT3U when comparing
pointers.  Using ASSERT3U with the cast causes a compiler
warning for 32-bit builds which is fatal with --enable-debug.
2011-02-23 15:03:30 -08:00
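A rough illustration of the distinction made in "Use correct ASSERT3* variant".  The two macros below are simplified stand-ins written for this example, not the real SPL definitions; the assumption is only that the unsigned variant compares 64-bit integers while the pointer variant compares pointer-sized integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins (assumed shapes, not the SPL macros). */
#define	EX_ASSERT3U(l, op, r)	assert((uint64_t)(l) op (uint64_t)(r))
#define	EX_ASSERT3P(l, op, r)	assert((uintptr_t)(l) op (uintptr_t)(r))

/*
 * Pointers are checked with the pointer variant.  Forcing a pointer
 * through a 64-bit integer cast on a 32-bit build is what makes GCC
 * warn about a cast from pointer to integer of different size.
 */
int
ex_same_buffer(const void *a, const void *b)
{
	EX_ASSERT3P(a, !=, NULL);
	EX_ASSERT3P(b, !=, NULL);
	return (a == b);
}

/* Integer quantities are checked with the unsigned variant. */
int
ex_fits(size_t used, size_t capacity)
{
	EX_ASSERT3U(capacity, >, 0);
	return (used <= capacity);
}
```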
Brian Behlendorf 5a52a782a0 Use Linux flock struct
Rather than defining our own structure, which will conflict with
Linux's version when building 32-bit, simply set up a typedef
to always use the correct Linux version for both 32 and 64-bit
builds.
2011-02-23 14:32:15 -08:00
Brian Behlendorf 05ff35c602 Increase fragment size to block size
The underlying storage pool actually uses multiple block
sizes.  Under Solaris frsize (fragment size) is reported as
the smallest block size we support, and bsize (block size)
as the filesystem's maximum block size.  Unfortunately,
under Linux the fragment size and block size are often used
interchangeably.  Thus we are forced to report both of them
as the filesystem's maximum block size.

Closes #112
2011-02-23 14:00:06 -08:00
Brian Behlendorf f6dcdf13f8 Fix 'statement with no effect' warning
Because the secpolicy_* macros are all currently defined to (0),
and because the caller of this function does not check the return
code, the compiler complains that this statement has no effect,
which is correct and OK.  To suppress the warning explicitly cast
the result to (void).
2011-02-23 13:03:19 -08:00
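The situation in "Fix 'statement with no effect' warning" can be reproduced with a small sketch.  The macro and function names here are hypothetical; the only thing taken from the commit is that the macro expands to (0) and that the fix is a cast to (void).

```c
/* Hypothetical stand-in for a secpolicy_* macro defined to (0). */
#define	ex_secpolicy_check(cr)	(0)

int
ex_do_operation(void *cr)
{
	/*
	 * Written as a bare statement, "ex_secpolicy_check(cr);" would
	 * expand to "(0);", which the compiler reports as a statement
	 * with no effect.  The cast states that the result is being
	 * discarded on purpose.
	 */
	(void) ex_secpolicy_check(cr);

	return (0);
}
```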
Brian Behlendorf 718d77f622 Fix uninitialized variable
It was possible for rc to be uninitialized in the parse_options()
function which triggered a compiler warning.  Ensure rc is always
initialized.
2011-02-23 12:57:25 -08:00
Brian Behlendorf a31a70bbd1 Fix enum compiler warning
Generally it's a good idea to use enums for switch statements,
but in this case it causes warnings because the enum is really a
set of flags.  These flags are OR'ed together in some cases
resulting in values which are not part of the original enum.
This causes compiler warnings such as this about invalid cases.

  error: case value ‘33’ not in enumerated type ‘zprop_source_t’

To handle this we simply cast the enum to an int for the switch
statement.  This leaves all other enum type checking in place
and effectively disables these warnings.
2011-02-23 12:52:51 -08:00
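A minimal sketch of the cast described in "Fix enum compiler warning".  The enum, its values, and the function are hypothetical and much smaller than zprop_source_t; they only show the pattern of switching on an int when the cases include OR'ed flag combinations.

```c
/* Hypothetical flag-style enum; each value is a single bit. */
typedef enum {
	EX_SRC_NONE = 0x1,
	EX_SRC_LOCAL = 0x2,
	EX_SRC_RECEIVED = 0x4
} ex_source_t;

const char *
ex_source_name(ex_source_t src)
{
	/*
	 * Switching on the enum type itself would draw a warning for the
	 * combined case below, since 0x6 is not an enumerator.  Casting
	 * to int keeps the flag combinations legal case labels.
	 */
	switch ((int)src) {
	case EX_SRC_NONE:
		return ("none");
	case EX_SRC_LOCAL:
		return ("local");
	case EX_SRC_RECEIVED:
		return ("received");
	case EX_SRC_LOCAL | EX_SRC_RECEIVED:
		return ("local+received");
	default:
		return ("unknown");
	}
}
```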
Brian Behlendorf 914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available in a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf 45066d1f20 Linux 2.6.38 compat, blkdev_get_by_path()
The open_bdev_exclusive() function has been replaced (again) by the
more generic blkdev_get_by_path() function.  Additionally, the
counterpart function close_bdev_exclusive() has been replaced by
blkdev_put().  Because these functions are more generic versions
of the functions they replaced the compatibility macro must add
the FMODE_EXCL mask to ensure they are exclusive.

Closes #114
2011-02-23 12:29:38 -08:00
Brian Behlendorf 61e909608d Linux 2.6.x compat, blkdev_compat.h
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers.  While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header.  This change moves this block device
Linux compatibility code into the linux/blkdev_compat.h header
and updates all the correct #include locations.  This is not a
functional change or bug fix, it is just code cleanup.
2011-02-23 12:29:38 -08:00
Brian Behlendorf bf665d4075 Prep spl-0.6.0-rc1 tag
Create the first 0.6.0 release candidate tag (rc1).
2011-02-18 09:35:55 -08:00
Brian Behlendorf 075cf6cb72 Prep zfs-0.6.0-rc1 tag
Create the first 0.6.0 release candidate tag (rc1).  The Posix
layer is now functional and passes fstest and several other
test suites cleanly.  We now need this release candidate tag
to broaden the test coverage before we can release the official
zfs-0.6.0.
2011-02-18 09:33:12 -08:00
Brian Behlendorf 5d0265c0dd Merge branch 'zpl' 2011-02-18 09:31:25 -08:00
Brian Behlendorf 037849f854 Use provided uid/gid for setattr
When changing the uid/gid of a file via zfs_setattr() use the
Posix id passed in iattr->ia_uid/gid.  While the zfs_fuid_create()
code already had the fuid support disabled for Linux it was
returning the uid/gid from the credential.  With this change
the 'chown' command which relies on setattr is now working
properly.

Also remove a little stray white space which was in front of
zfs_update_inode() call at the end of zfs_setattr().
2011-02-17 14:23:48 -08:00
Brian Behlendorf efd1832bc6 Fix symlink(2) inode reference count
Under Linux sys_symlink(2) should result in an inode being created
with one reference for the inode itself, and a second reference on
the inode which is held by the new dentry.  Under Solaris this
appears not to be the case.  Their zfs_symlink() handler drops
the inode reference before returning.

The result of this under Linux is that the reference count for
symlinks is always one smaller than it should have been. This
results in a BUG() when the symlink is unlinked.  To handle this
the Linux port now keeps the inode reference which differs from
the Solaris behavior.  This results in correct reference counts.

Closes #96
2011-02-17 11:34:47 -08:00
Brian Behlendorf 5095000169 Use -zfs_readlink() error
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
2011-02-17 09:48:06 -08:00
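The sign convention behind "Use -zfs_readlink() error" can be sketched as follows.  Both functions are hypothetical stand-ins (the real zfs_readlink() takes different arguments); the sketch only shows a positive Solaris-style error being negated for a Linux-style caller.

```c
#include <errno.h>

/* Hypothetical Solaris-style helper: returns 0 or a positive errno. */
int
ex_solaris_style_readlink(int link_exists)
{
	return (link_exists ? 0 : ENOENT);
}

/*
 * Linux-facing wrapper: negate the result so that callers see either
 * 0 on success or a negative errno on failure.
 */
int
ex_linux_style_readlink(int link_exists)
{
	return (-ex_solaris_style_readlink(link_exists));
}
```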
Brian Behlendorf f03e41e8da Improve 'zpool import' safety
There are three improvements here to 'zpool import' proposed by Fajar
in Github issue #98.  They are all good so I'm committing all three.

1) Add descriptions for "hpet" and "core" blacklist entries.

2) Add "core" to the blacklist, as described in the issue accessing
this device will crash Xen dom0.

3) Refine probing behavior to use fstatat64().  This allows us to
determine if a device is a block device or a regular file without
having to open it.  This is the safest approach when probing /dev/
because the simple act of opening a device may have unexpected
consequences.

Closes #98
2011-02-17 09:35:43 -08:00
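Point 3 of "Improve 'zpool import' safety" can be illustrated with a small user-space sketch.  Assumptions: this uses the portable fstatat() rather than the fstatat64() named in the commit, and the function name and return codes are invented for the example.

```c
#ifndef _POSIX_C_SOURCE
#define	_POSIX_C_SOURCE	200809L
#endif

#include <fcntl.h>	/* AT_FDCWD */
#include <sys/stat.h>	/* fstatat(), S_ISBLK(), S_ISREG() */

/* Hypothetical result codes for this sketch. */
#define	EX_PROBE_ERROR	(-1)
#define	EX_PROBE_OTHER	0
#define	EX_PROBE_BLOCK	1
#define	EX_PROBE_FILE	2

/*
 * Classify a path by its inode type only.  The path is never opened,
 * so no device open handler is triggered by the probe.
 */
int
ex_probe_path(const char *path)
{
	struct stat st;

	if (fstatat(AT_FDCWD, path, &st, 0) != 0)
		return (EX_PROBE_ERROR);
	if (S_ISBLK(st.st_mode))
		return (EX_PROBE_BLOCK);
	if (S_ISREG(st.st_mode))
		return (EX_PROBE_FILE);

	return (EX_PROBE_OTHER);
}
```

A caller scanning a directory would skip anything that is not reported as a block device or regular file before attempting to read a label from it.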
Brian Behlendorf 8b4f9a2d55 Fix readlink(2)
This patch addresses three issues related to symlinks.

1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink().  The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t.  The
caller zpl_follow_link() was also updated accordingly.  This
change was reverted because it was slightly gratuitous.

2) Update zpl_follow_link() to use local variables for the
link buffer.  I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can no longer
be used.  We need our own private copy of the link pointer.

3) Allocate MAXPATHLEN instead of MAXPATHLEN+1.  By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page.  That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
2011-02-16 15:54:55 -08:00
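The page arithmetic in point 3 of "Fix readlink(2)" is easy to check with a short sketch.  The helper name is hypothetical and the 4096-byte page size is an assumption matching the figure quoted in the commit; a real kernel allocator may round differently.

```c
#include <stddef.h>

/* Number of whole pages needed to hold the given number of bytes. */
size_t
ex_pages_needed(size_t bytes, size_t page_size)
{
	return ((bytes + page_size - 1) / page_size);
}
```

With a 4096-byte page, a 4096-byte buffer needs one page, while 4097 bytes (MAXPATHLEN+1) spills into a second page.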
Ricardo M. Correia 54a179e7b8 Add API to wait for pending commit callbacks
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing.  This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may not yet have
been called, which leads to non-zero reference count errors.  See
lustre.org bug 23931.
2011-02-16 11:20:06 -08:00
Brian Behlendorf b9f6a49025 Update 'zfs.sh -u' to umount all zfs filesystems
Before it is safe to unload the zfs module stack all mounted
zfs filesystems must be unmounted.  If they are not unmounted,
there will be references held on the modules and the stack cannot
be removed.  To handle this have 'zfs.sh -u' which is used by all
of the test scripts umount all zfs filesystems before attempting
to unload the module stack.
2011-02-16 11:10:31 -08:00
Brian Behlendorf 07bd86718b Suppress share error on mount
Until code is added to support automatically sharing datasets
we should return success instead of failure.  This prevents the
command line tools from returning a non-zero error code.  While
a user likely won't notice this, test scripts like zconfig.sh
do and correctly fail because of it.
2011-02-16 11:05:55 -08:00
Brian Behlendorf a6695d83b7 Add get/setattr, get/setxattr hooks
While the attr/xattr hooks were already in place for regular
files these hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported.  This patch registers these additional callbacks
for both directory and special inode types.
2011-02-16 09:55:53 -08:00
Brian Behlendorf d8fd10545b Fix FIFO and socket handling
Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA.  This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.

This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.
2011-02-16 09:51:44 -08:00
Brian Behlendorf d567444809 Create minors for all zvols
It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().

Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset.  Since we want to ensure that the callback is run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb().  This callback function
is solely used for this purpose so suppressing the error is safe.

Closes #96
2011-02-16 09:50:06 -08:00
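The iteration behavior behind "Create minors for all zvols" can be sketched with a toy walker.  Everything here is hypothetical (the walker, the callbacks, and the '!' marker for a failing child); it only mirrors the described rule that a non-zero return from the callback stops processing of the remaining children.

```c
typedef int (*ex_child_cb)(const char *name, void *arg);

/* Toy walker: a non-zero return from the callback aborts the walk. */
int
ex_for_each_child(const char *const *names, int count, ex_child_cb cb,
    void *arg)
{
	int i, error;

	for (i = 0; i < count; i++) {
		error = cb(names[i], arg);
		if (error != 0)
			return (error);
	}

	return (0);
}

/* Propagates failures, so children after a bad one are never visited. */
int
ex_create_minor_strict(const char *name, void *arg)
{
	int *visited = arg;

	(*visited)++;
	return (name[0] == '!' ? -1 : 0);	/* '!' marks a failing child */
}

/* Unconditionally returns zero so every child is visited. */
int
ex_create_minor_lenient(const char *name, void *arg)
{
	(void) ex_create_minor_strict(name, arg);
	return (0);
}
```

Given the children "a", "!b", "c", the strict callback stops after two visits while the lenient one reaches all three, which is the effect the commit is after.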
Brian Behlendorf 2c395def27 Linux 2.6.36 compat, sops->evict_inode()
The new preferred interface for evicting an inode from the inode cache
is the ->evict_inode() callback.  It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
2011-02-11 13:47:51 -08:00
Brian Behlendorf f9637c6c8b Linux 2.6.33 compat, get/set xattr callbacks
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods.  The result of
this was the inode type was changed to a dentry, and both the get()
and set() hooks had a handler_flags argument added.  The list()
callback was similarly affected but no autoconf check was added
because we do not use the list() callback.
2011-02-11 10:41:00 -08:00
Brian Behlendorf 7268e1bec8 Linux 2.6.35 compat, fops->fsync()
The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
2011-02-11 09:05:51 -08:00
Brian Behlendorf 777d4af891 Linux 2.6.35 compat, const struct xattr_handler
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure.  To handle this we define an
appropriate xattr_handler_t typedef which can be used.  This was
the preferred solution because it keeps the code clean and readable.
2011-02-10 16:29:00 -08:00
Brian Behlendorf 1b94c25ceb Prefer /lib/modules/$(uname -r)/ links
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links.  Only if neither of these
links exist fall back to alternate methods for deducing which
kernel to build with.  This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
2011-02-10 14:54:33 -08:00
Brian Behlendorf 22ccfaa8b5 Prefer /lib/modules/$(uname -r)/ links
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links.  Only if neither of these
links exist fall back to alternate methods for deducing which
kernel to build with.  This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
2011-02-10 14:47:08 -08:00
Brian Behlendorf afffb5cd10 MS_DIRSYNC and MS_REC compat
It turns out that older versions of the glibc headers do not
properly define MS_DIRSYNC despite it being explicitly mentioned
in the man pages.  They instead call it S_WRITE, so for systems
where this is not correctly defined map MS_DIRSYNC to S_WRITE.
At the time of this commit both Ubuntu Lucid and Debian Squeeze
use the out of date glibc headers.

As for MS_REC this field is also not available in the older headers.
Since there is no obvious mapping in this case we simply disable
the recursive mount option which used it.
2011-02-10 12:14:57 -08:00
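The fallback described in "MS_DIRSYNC and MS_REC compat" might look roughly like this on a Linux system.  The guard structure and the ex_* helpers are assumptions for illustration; only the mapping of MS_DIRSYNC to S_WRITE and the disabling of the recursive option when MS_REC is missing come from the commit text.

```c
#include <sys/mount.h>

/* Older glibc headers expose the same flag under the name S_WRITE. */
#ifndef MS_DIRSYNC
#define	MS_DIRSYNC	S_WRITE
#endif

/* Returns the mount flag value used for directory-synchronous mounts. */
unsigned long
ex_dirsync_flag(void)
{
	return ((unsigned long)MS_DIRSYNC);
}

/*
 * MS_REC has no obvious substitute, so the recursive mount option is
 * simply reported as unavailable when the header lacks it.
 */
int
ex_have_recursive_mount(void)
{
#ifdef MS_REC
	return (1);
#else
	return (0);
#endif
}
```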
Brian Behlendorf 1ac0ea38a5 Add missing -ldl linker option
The inclusion of dlsym(), dlopen(), and dlclose() symbols requires
us to link against the dl library.  Be careful to add the flag to
both the libzfs library and the commands which depend on the library.
2011-02-10 11:05:44 -08:00
Brian Behlendorf 6c9e06f14d Update AUTHORS file
This file has gotten stale and needed to be updated.  There are
individuals who deserve to be recognized for their contributions
to the project.  I've done my best to assemble names from the
commit logs of those who have submitted patches.  This list may
not be comprehensive, if you feel I've overlooked your contribution
please let me know and we can get your name added.
2011-02-10 09:27:22 -08:00
Brian Behlendorf 6839eed23e Use 'noop' IO Scheduler
Initial testing has shown that the right IO scheduler to use under Linux
is noop.  This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization, while allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device.  This yields the largest possible requests for the
device with the lowest total overhead.

While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option.  You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory.  In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
2011-02-10 09:27:22 -08:00
Brian Behlendorf 4db77a74a6 Suppress large kmem_alloc() warning
The following warning was observed under normal operation.  It's
not fatal but it's something to be addressed long term.  Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.

SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P           ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
 [<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
 [<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
 [<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
 [<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
 [<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
 [<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
 [<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
 [<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
 [<ffffffff8116d905>] vfs_read+0xb5/0x1a0
 [<ffffffff8116da41>] sys_read+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
2011-02-10 09:27:22 -08:00
Brian Behlendorf f44b46a632 Update META to 0.6.0
Roll the version forward to 0.6.0, the addition of the Posix
layer warrants updating the major version number.
2011-02-10 09:27:22 -08:00
Brian Behlendorf ceb43b935d Invalidate dcache and inode cache
When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache.  If we don't there will be
stale cache entries which when accessed will result in EIOs.
2011-02-10 09:27:22 -08:00
Brian Behlendorf b3b4f547f9 Remove useless libefi warnings
These two warnings in libefi serve no real purpose.  When running
without DEBUG they are already suppressed, and even when DEBUG is
enabled all they indicate is the device doesn't already have an
EFI label.  For a Linux machine this is probably the common case.
2011-02-10 09:27:22 -08:00