Commit Graph

658 Commits

Author SHA1 Message Date
Brian Behlendorf 0da7869690 Fix the configure CONFIG_* option detection
The latest kernels no longer define AUTOCONF_INCLUDED which was
being used to detect the new style autoconf.h kernel configure
options.  This results in the CONFIG_* checks always failing
incorrectly for newer kernels.

The fix for this is a simplification of the testing method.
Rather than attempting to explicitly include to renamed config
header.  It is simpler to unconditionally include <linux/module.h>
which must pick up the correctly named header.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #320
2011-07-22 15:07:16 -07:00
Brian Behlendorf 22872ff5da Use zfs_mknode() to create dataset root
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies.  This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.

Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.

This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code.  There have been minor
modifications to accomidate the Linux VFS but that is all.

This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory.  This ensures it is created properly with
support for system attributes.  Previously it wasn't which
is why it behaved differently that all other directories
when modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #116
2011-07-20 19:52:26 -07:00
Brian Behlendorf 9fd91daeef Honor setgit bit on directories
Newly created files were always being created with the fsuid/fsgid
in the current users credentials.  This is correct except in the
case when the parent directory sets the 'setgit' bit.  In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory.  Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.

Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #262
2011-07-20 14:07:13 -07:00
Brian Behlendorf fe0ed8f910 Fix 'make install' overly broad 'rm'
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels.  This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against.  This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.

The fix for this issue is to only remove extraneous build products
when DESTDIR is set.  This almost exclusively indicates we are
building packages and installed the build products in to a temporary
staging location.  Additionally, limit the removal the unneeded
build products to the target kernel version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
2011-07-20 09:38:51 -07:00
Brian Behlendorf cfc9a5c88f Fix zpl_writepage() deadlock
Disable the normal reclaim path for zpl_putpage().  This ensures that
all memory allocations under this call path will never enter direct
reclaim.  If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.

This sitution is typically handled in Linux by marking each offending
allocation GFP_NOFS.  However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.

The following example of the above described deadlock was triggered
by test 074 in the xfstest suite.

Call Trace:
 [<ffffffff814dcdb2>] down_write+0x32/0x40
 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
 [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
 [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
 [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
 [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50
 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
 [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
 [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
 [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
 [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
 [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
 [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
 [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
 [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
 [<ffffffff81122521>] do_writepages+0x21/0x40
 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #327
2011-07-19 16:17:01 -07:00
Brian Behlendorf abd39a8289 Fix zio_execute() deadlock
To avoid deadlocking the system it is crucial that all memory
allocations performed in the zio_execute() call path are marked
KM_PUSHPAGE (GFP_NOFS).  This ensures that while a z_wr_iss
thread is processing the syncing transaction group it does
not re-enter the filesystem code and deadlock on itself.

Call Trace:
 [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl]
 [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs]
 [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs]
 [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs]
 [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff81159932>] kmem_getpages+0x62/0x170
 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160
 [<ffffffff8115b059>] __kmalloc+0x189/0x220
 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl]
 [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs]
 [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs]
 [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs]
 [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs]
 [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs]
 [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs]
 [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs]
 [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs]
 [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl]
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #291
2011-07-19 11:55:42 -07:00
Brian Behlendorf a140dc5469 Fix mmap(2)/write(2)/read(2) deadlock
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages.  Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.

The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case.  The individual pages will still be locked
thus serializing access.  Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do.  This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.

This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks.  It's up to each filesystem
to perform its own file locking.  Under Linux the VFS provides many
of these services.

It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
2011-07-19 11:55:42 -07:00
Brian Behlendorf 7f9d9946f8 Update 'zpool import' man page
The following supported options were missing from the zpool.8
man page.  The OpenSolaris man pages originally used were simply
out of date with the code.

zpool import
  -F Recovery mode
  -m Allow missing log devices
  -N Import but don't mount
  -n Determine if recoverable but don't do it

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-19 11:24:20 -07:00
Brian Behlendorf 61f218b090 Fix send/recv 'dataset is busy' errors
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility chanages.  As part of these changes
instead of holding an active reference on the namepsace (which is
no longer posible) a reference is taken on the super block.  This
reference ensures the super block remains valid while it is in use.

To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference.  This indicates that the filesystem is
active and in use.

Unfortunately, in the 'zfs recv' case this is not the case.  The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.

The most straight forward fix for this is to simply update the
code to always take the reference even when it's zero.  This
may expose us to very very unlikely concurrent umount/send/recv
case but the consequences of that are minor.

Closes #319
2011-07-15 16:37:19 -07:00
Kyle Fuller 615ab66d18 Provide a rc.d script for archlinux
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d.  This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #322
2011-07-11 14:12:23 -07:00
Brian Behlendorf 057e8eee35 Improve fstat(2) performance
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper.  However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date.  Unfortunately, this isn't
the case for the blksize, blocks, and atime fields.  At the
moment the authoritative values are still stored in the znode.

This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode.  At some
latter date we should be able to strictly use the inode values
and further improve performance.

The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex.  This overhead is
unavoidable until the inode is kept strictly up to date.  The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros.  These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation.  However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.

=================== Performance Tests ========================

This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop.  The test results
show the zfs stat(2) performance is now only 22% slower than
ext4.  This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.

filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x

The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files.  The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache.  As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.

A little surprisingly the zfs cold cache performance is better
than ext4.  This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times.  By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.

filesystem    | cold    | hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
2011-07-11 09:11:22 -07:00
Brian Behlendorf abd8610cd5 Add L2ARC tunables
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:

     l2arc_write_max         max write bytes per interval
     l2arc_write_boost       extra write bytes during device warmup
     l2arc_noprefetch        skip caching prefetched buffers
     l2arc_headroom          number of max device writes to precache
     l2arc_feed_secs         seconds between L2ARC writing
     l2arc_feed_min_ms       min feed interval in milliseconds
     l2arc_feed_again        turbo L2ARC warmup
     l2arc_norw              no reads during writes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #316
2011-07-08 12:44:11 -07:00
Brian Behlendorf e0f86c9862 Update 'zfs send' documentation
The -D and -p options were missing from the manpage.  This commit
adds documentation for these features.

Closes #311
2011-07-08 12:16:09 -07:00
Fajar A. Nugraha 1fa3bb750d Remove zfs service only on uninstall, not on upgrade
This caused problems on upgrade using RPM:

* The new version will run chkconfig --add, which has no effect
  since the service was already added.

* The old version will run chkconfig --del, which caused zfs
  service removal.

Only run "chkconfig --del" on complete uninstall, by checking
the value of "$1" to %preun, which will be "0" on uninstall,
and "1" on upgrade.

  http://www.rpm.org/max-rpm/s1-rpm-inside-scripts.html

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #314
2011-07-08 11:44:17 -07:00
Fajar A. Nugraha 3af2ce4d68 Check for "udevadm settle" vs "udevsettle"
RHEL5 does not have udevadm, so fix initscript accordingly

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #315
2011-07-08 11:43:16 -07:00
Brian Behlendorf 341b5f1d4c Update ztest paths
Unfortunately, ztest is hard coded to export the zdb utility to
be installed in a certain location.  When the packaging was updated
to install zdb in /sbin/ ztest was broken.  To fix this I'm updating
ztest to check both common install paths.
2011-07-06 12:30:09 -07:00
Brian Behlendorf b1c932d318 Add proper library versioning
The zfs libraries were never properly versioned.  Since the API has
remained static for quite some time this we never an issue.  However,
going forward they should be versioned.  This commit versions all
of the libraries to 1.0.0.  From here on out this version must be
updated to reflect changes to the library.
2011-07-06 09:20:28 -07:00
Gunnar Beutner 8b0cf399ff Updated init scripts to enable automatic sharing of ZFS datasets.
The relevant init scripts were updated so as to automatically share
ZFS datasets using "zfs share -a" at boot time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Gunnar Beutner 3c9609b322 Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE.
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.

On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Gunnar Beutner 52e7c3a2e5 Link libshare directly to libzfs
Drop usage of dlopen/dlsym for libshare.  There is no need to do
this because the zfs packages provide libshare.  Unlike on Solaris
we are guaranteed it will be available.

This avoids possible problems with hardcoding the libshare path in
the code (e.g. when users specify a different install path via
configure options).  It additionally simplifies the code which is
good for maintainability.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Gunnar Beutner 46e18b3f0f Implemented sharing datasets via NFS using libshare.
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Zachary Bedell dc2a4a9136 Document initramfs process
Add documentation for Dracut and the initramfs process.  This includes
detailing the basic boot process and options available.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Zachary Bedell fde4ce992d Update for Dracut-010
Update Dracut module for Dracut-010 and fix race conditions that
caused boot to fail on MP systems.  Add support for zfs_force flag
and parsing of spl_hostid from kernel command line.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Zachary Bedell e93ced4847 Update zfs.gentoo/zfs.lsb init script
* Update paths to zpool/zfs tools,
* Log less for non-error conditions,
* Don't be fatal if umount fails at shutdown -- final init remount
  will take care of it if /usr or / are in use

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:14 -07:00
Gunnar Beutner c8082367cf Removed erroneous backticks in the zfs.lunar init script.
The backticks would cause the output of the zfs commands
to be evaluated as input for the if construct rather than
their exit status.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-05 11:25:48 -07:00
Gunnar Beutner 0f4524cca4 Fixed indentation in the zfs.lunar init script.
One of the blocks in the init script wasn't indented properly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-05 11:25:48 -07:00
Prasad Joshi 5a52105925 Use consistent error message in zpool sub-command
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present.  This change unifies the default behavior
as follows:

  root@prasad:~# ./zpool list 1 2
  no pools available
  no pools available

  root@prasad:~# ./zpool iostat  1 2
  no pools available
  no pools available

  root@prasad:~# ./zpool status 1 2
  no pools available
  no pools available

  root@prasad:~# ./zpool list tan 1 2
  cannot open 'tan': no such pool

  root@prasad:~# ./zpool iostat tan 1 2
  cannot open 'tan': no such pool

  root@prasad:~# ./zpool status tan 1 2
  cannot open 'tan': no such pool

Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Closes #306
2011-07-04 19:57:01 -07:00
Andrew Tselischev b59322a0d8 Fix 'rc_parallel="YES"' error
If rc_parallel="YES" zfs starts before localmount, which leads
to "No such file or directory" error on systems with /usr on a
separate partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-04 13:54:59 -07:00
Brian Behlendorf b2f25e00ec Prep zfs-0.6.0-rc5 tag
Create the fifth 0.6.0 release candidate tag (rc5).
2011-07-01 15:24:34 -07:00
Brian Behlendorf 285226eff3 Always allow non-user xattrs
Under Linux you may only disable USER xattrs.  The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem.  The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.

Under Solaris there is only a single xattr namespace which
is managed globally.
2011-07-01 13:39:48 -07:00
Brian Behlendorf f2cfee80e3 Fix implicit declaration of 'mkdirp'
The lib/libspl/include/libgen.h header file was being mistakenly
left out of the 'make dist' tarball.  It just happens this doesn't
cause a build failure when creating packages because the system
libgen/h is included instead.  This simply results in the following
warning due to the missing forward declaration of mkdirp().

  ../../lib/libzfs/libzfs_mount.c:417:3: warning: implicit declaration
  of function 'mkdirp' [-Wimplicit-function-declaration]
2011-07-01 13:39:47 -07:00
Rohan Puri a89c3e0bd5 Support mandatory locks (nbmand)
The Linux kernel already has support for mandatory locking.  This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants.  In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS.  However, for now we'll leave it in place even
if it is redundant just in case we missed something.

The original patch to update the code to support mandatory locking
was done by Rohan Puri.  This patch is an updated version which is
compatible with the previous mount option handling changes.

Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
Closes #253
2011-07-01 13:39:40 -07:00
Brian Behlendorf 2cf7f52bc4 Linux compat 2.6.39: mount_nodev()
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure.  When using the new
interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers.  This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use.  It provides our only entry point
in to the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly.  It also allowed me
to keep more of the original Solaris code unmodified.  Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do.  However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem.  Thus keeping a back
reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and
continue as before.  I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.

This commit updates the code to remove this vfsmount back
reference entirely.  All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.

Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code.  This
change which fairly involved has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
2011-07-01 13:36:39 -07:00
Brian Behlendorf 5c03efc379 Linux compat 2.6.39: security_inode_init_security()
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.

Closes #246
Closes #217
Closes #187
2011-07-01 12:40:08 -07:00
Brian Behlendorf bd2f5ac97f Avoid 'rpm -q' bug for 'make pkg'
RPM version 4.9.0 has been observed to generate extra debug
messages in certain cases.  These debug messages prevent us
from cleanly acquiring the architecture.  This is clearly
an upstream RPM bug which will get fixed.  But until then
a safe solution is to pipe the result through 'tail -1'
to just grab the architecture bit we care about.

Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output:

Freeing read locks for locker 0x166: 28031/47480843735008
Freeing read locks for locker 0x168: 28031/47480843735008
x86_64
2011-07-01 12:39:25 -07:00
Brian Behlendorf e2e7aa2df8 Add ZFS specific mmap() checks
Under Linux the VFS handles virtually all of the mmap() access
checks.  Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.

However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored.  Note, currently the code
to modify these attributes has not been implemented under Linux.

* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
  attributes are set a file may not be mmaped with write access.

* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
  read or exec access.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-01 12:23:46 -07:00
Brian Behlendorf f0b2486034 Remove unused MMAP functions
The following functions were required for the OpenSolaris mmap
implementation.  Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.

  * zfs_null_putapage()
  * zfs_frlock()
  * zfs_no_putpage()

Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
2011-07-01 12:22:57 -07:00
Prasad Joshi dde471ef5a MMAP Optimization
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.

ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.

The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
2011-07-01 12:22:52 -07:00
Brian Behlendorf 2a005961a4 Ensure all block devices are available
These days most disk drivers will probe for devices asynchronously.
This means it's possible that when you zfs init script runs all the
required block devices may not yet have been discovered.  The result
is the pool may fail to cleanly import at boot time.  This is
particularly common when you have a large number of devices.

The fix is for the init script to block until udev settles and we
are no longer detecting new devices.  Once the system has settled
the zfs modules can be loaded and the pool with be automatically
imported.
2011-06-30 14:45:33 -07:00
Prasad Joshi 218b8eafbd Use truncate_setsize in zfs_setattr
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().

zfs_setattr uses zfs_freesp to free the disk space belonging to the
file.  As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.

Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:52 -07:00
Prasad Joshi b312979252 Tear down and flush the mmap region
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.

The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:19 -07:00
Ned A. Bass 560bcf9d14 Multipath device manageability improvements
Update udev helper scripts to deal with device-mapper devices created
by multipathd.  These enhancements are targeted at a particular
storage network topology under evaluation at LLNL consisting of two
SAS switches providing redundant connectivity between multiple server
nodes and disk enclosures.

The key to making these systems manageable is to create shortnames for
each disk that conveys its physical location in a drawer.  In a
direct-attached topology we infer a disk's enclosure from the PCI bus
number and HBA port number in the by-path name provided by udev.  In a
switched topology, however, multiple drawers are accessed via a single
HBA port.  We therefore resort to assigning drawer identifiers based
on which switch port a drive's enclosure is connected to.  This
information is available from sysfs.

Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
symbolic links in /dev/disk/by-id of the form
<label>-<UUID>-switch-port:<X>-slot:<Y>.  <label> is a string that
depends on the subsystem that created the link and defaults to
"dm-uuid-mpath" (this prefix is used by multipathd).  <UUID> is a
unique identifier for the disk typically obtained from the scsi_id
program, and <X> and <Y> denote the switch port and disk slot numbers,
respectively.

Add a callout script sas_switch_id for use by multipathd to help
create symlinks of the form described above.  Update zpool_id and the
udev zpool rules file to handle both multipath devices and
conventional drives.
2011-06-23 10:46:06 -07:00
Brian Behlendorf 7e7baecaa3 Linux 3.0 compat, shrinker compatibility
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated.  Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API.  See spl commit a55bcaad18.

This commit updates the ZFS code to use the slightly modified
API.  You must use the latest SPL if your building ZFS.
2011-06-21 14:36:39 -07:00
Gunnar Beutner b00131d43c Fix unlink/xattr deadlock
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:

1. File is created.
2. xattr is created for that file (behind the scenes a xattr
   directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
   inode from that directory are evicted by prune_icache();
   prune_icache() acquires a lock on both inodes before it
   calls ->evict() on the inodes

When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.

This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches

This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain().  Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.

To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained.  This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #266
2011-06-20 13:47:03 -07:00
Gunnar Beutner 6f0cf71e0d Removed erroneous zfs_inode_destroy() calls from zfs_rmnode().
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.

The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.

Fixes #282
2011-06-20 10:30:17 -07:00
Christian Kohlschütter df30f56639 Add "ashift" property to zpool create
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly.  This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes.  The drive may behave this way
for compatibility reasons.  For example, the WDC WD20EARS disks are
known to exhibit this behavior.

When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).

This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time.  This can significantly improve
performance in certain cases.  However, it will have an impact on your
total pool capacity.  See the updated ashift property description
in the zpool.8 man page for additional details.

Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior.  The most common ashift values are 9 and 12.

  Example:
  zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd

Closes #280

Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-06-17 16:35:49 -07:00
Brian Behlendorf 96801d2906 Linux 2.6.37 compat, WRITE_FLUSH_FUA
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER.  This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.

This change simply updates the bio's to use the new kernel API
which should be absolutely safe.  However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.

Closes #281
2011-06-17 14:37:26 -07:00
Brian Behlendorf db97f88646 Update rpm/deb packages to be FHS compliant
This change is the first step towards updating the default
rpm/deb packages to be FHS compliant.  It accomplishes this
by passing the following options to ./configure to ensure the
zfs build products are installed in FHS compliant locations.

  ./configure --prefix=/ --bindir=/lib/udev \
    --libexecdir=/usr/libexec --datadir=/usr/share

The core zfs utilities (zfs, zpool, zdb) are now be installed
in /sbin, the core libraries in /lib, and the udev helpers
(zpool_id, zvol_id) are in /lib/udev with the other udev
helpers.

The remaining files in the zfs package remain in their
previous locations under /usr.
2011-06-17 13:36:16 -07:00
Darik Horn b772aedeec Autogen refresh.
Run autogen.sh using the same autotools versions as upstream:

 * autoconf-2.63
 * automake-1.11.1
 * libtool-2.2.6b
2011-06-17 13:24:44 -07:00
Brian Behlendorf 47a2455fbc Use datadir not datarootdir for dracut
The zfs dracut modules should be installed under the --datadir
not --datarootdir path.  This was just an oversight in the
original Makefile.am.

After this change %{_datadir} can now be set safely in the
zfs.spec file.  The 'make install' location is now consistent
with the location expected by the spec file.
2011-06-17 13:22:19 -07:00