This fixes a bug that can effect first reboot after install
using Dracut. The Dracut module didn't check the return
value from several calls to z* functions. This resulted in
"Using no pools available as root" on boot if the ZFS module
didn't auto-import the pools. It's most likely to happen on
initial restart after a fresh install & requires juggling in
the Dracut emergency holographic shell to fix.
This patch checks return codes & output from zpool list and
related functions and correctly falls into the explicit zpool
import code branch if the module didn't import the pool at load.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency with the upstream sources and the rest of the
project use hard instead of soft tabs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE.
Because these functions are called from txg_sync_thread we
must ensure they don't reenter the zfs filesystem code via
the .writepage callback. This would result in a deadlock.
This deadlock is rare and has only been observed once under
an abusive mmap() write workload.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The bootfs example in the dracut documentation was sightly incorrect
because it lacked the trailing required pool argument. Add it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The latest kernels no longer define AUTOCONF_INCLUDED which was
being used to detect the new style autoconf.h kernel configure
options. This results in the CONFIG_* checks always failing
incorrectly for newer kernels.
The fix for this is a simplification of the testing method.
Rather than attempting to explicitly include to renamed config
header. It is simpler to unconditionally include <linux/module.h>
which must pick up the correctly named header.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#320
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies. This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.
Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.
This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code. There have been minor
modifications to accomidate the Linux VFS but that is all.
This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory. This ensures it is created properly with
support for system attributes. Previously it wasn't which
is why it behaved differently that all other directories
when modified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#116
Newly created files were always being created with the fsuid/fsgid
in the current users credentials. This is correct except in the
case when the parent directory sets the 'setgit' bit. In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory. Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.
Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#262
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels. This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against. This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.
The fix for this issue is to only remove extraneous build products
when DESTDIR is set. This almost exclusively indicates we are
building packages and installed the build products in to a temporary
staging location. Additionally, limit the removal the unneeded
build products to the target kernel version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#328
Disable the normal reclaim path for zpl_putpage(). This ensures that
all memory allocations under this call path will never enter direct
reclaim. If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.
This sitution is typically handled in Linux by marking each offending
allocation GFP_NOFS. However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.
The following example of the above described deadlock was triggered
by test 074 in the xfstest suite.
Call Trace:
[<ffffffff814dcdb2>] down_write+0x32/0x40
[<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
[<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
[<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
[<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
[<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
[<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
[<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
[<ffffffff8115f907>] writeout+0xa7/0xd0
[<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
[<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
[<ffffffff811559ab>] compact_zone+0x4fb/0x780
[<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
[<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
[<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
[<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
[<ffffffff8111e36e>] __get_free_pages+0xe/0x50
[<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
[<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
[<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
[<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
[<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
[<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
[<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
[<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
[<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
[<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
[<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
[<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
[<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
[<ffffffff81122521>] do_writepages+0x21/0x40
[<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
[<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
[<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
[<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
[<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
[<ffffffff8108ddf6>] kthread+0x96/0xa0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#327
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages. Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.
The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case. The individual pages will still be locked
thus serializing access. Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do. This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.
This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks. It's up to each filesystem
to perform its own file locking. Under Linux the VFS provides many
of these services.
It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
The following supported options were missing from the zpool.8
man page. The OpenSolaris man pages originally used were simply
out of date with the code.
zpool import
-F Recovery mode
-m Allow missing log devices
-N Import but don't mount
-n Determine if recoverable but don't do it
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility chanages. As part of these changes
instead of holding an active reference on the namepsace (which is
no longer posible) a reference is taken on the super block. This
reference ensures the super block remains valid while it is in use.
To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference. This indicates that the filesystem is
active and in use.
Unfortunately, in the 'zfs recv' case this is not the case. The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.
The most straight forward fix for this is to simply update the
code to always take the reference even when it's zero. This
may expose us to very very unlikely concurrent umount/send/recv
case but the consequences of that are minor.
Closes#319
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper. However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date. Unfortunately, this isn't
the case for the blksize, blocks, and atime fields. At the
moment the authoritative values are still stored in the znode.
This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode. At some
latter date we should be able to strictly use the inode values
and further improve performance.
The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex. This overhead is
unavoidable until the inode is kept strictly up to date. The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation. However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.
=================== Performance Tests ========================
This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop. The test results
show the zfs stat(2) performance is now only 22% slower than
ext4. This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.
filesystem | test-1 test-2 test-3 | average | times-ext4
--------------+-------------------------+---------+-----------
ext4 | 7.785s 7.899s 7.284s | 7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat | 9.224s 9.398s 9.485s | 9.369s | 1.223x
The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files. The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache. As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.
A little surprisingly the zfs cold cache performance is better
than ext4. This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times. By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.
filesystem | cold | hot
--------------+---------+--------
ext4 | 13.318s | 1.040s
zfs-0.6.0-rc4 | 4.982s | 1.762s
zfs-faststat | 4.933s | 1.345s
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:
l2arc_write_max max write bytes per interval
l2arc_write_boost extra write bytes during device warmup
l2arc_noprefetch skip caching prefetched buffers
l2arc_headroom number of max device writes to precache
l2arc_feed_secs seconds between L2ARC writing
l2arc_feed_min_ms min feed interval in milliseconds
l2arc_feed_again turbo L2ARC warmup
l2arc_norw no reads during writes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#316
This caused problems on upgrade using RPM:
* The new version will run chkconfig --add, which has no effect
since the service was already added.
* The old version will run chkconfig --del, which caused zfs
service removal.
Only run "chkconfig --del" on complete uninstall, by checking
the value of "$1" to %preun, which will be "0" on uninstall,
and "1" on upgrade.
http://www.rpm.org/max-rpm/s1-rpm-inside-scripts.html
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#314
Unfortunately, ztest is hard coded to export the zdb utility to
be installed in a certain location. When the packaging was updated
to install zdb in /sbin/ ztest was broken. To fix this I'm updating
ztest to check both common install paths.
The zfs libraries were never properly versioned. Since the API has
remained static for quite some time this we never an issue. However,
going forward they should be versioned. This commit versions all
of the libraries to 1.0.0. From here on out this version must be
updated to reflect changes to the library.
The relevant init scripts were updated so as to automatically share
ZFS datasets using "zfs share -a" at boot time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.
On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Drop usage of dlopen/dlsym for libshare. There is no need to do
this because the zfs packages provide libshare. Unlike on Solaris
we are guaranteed it will be available.
This avoids possible problems with hardcoding the libshare path in
the code (e.g. when users specify a different install path via
configure options). It additionally simplifies the code which is
good for maintainability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add documentation for Dracut and the initramfs process. This includes
detailing the basic boot process and options available.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update Dracut module for Dracut-010 and fix race conditions that
caused boot to fail on MP systems. Add support for zfs_force flag
and parsing of spl_hostid from kernel command line.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
* Update paths to zpool/zfs tools,
* Log less for non-error conditions,
* Don't be fatal if umount fails at shutdown -- final init remount
will take care of it if /usr or / are in use
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The backticks would cause the output of the zfs commands
to be evaluated as input for the if construct rather than
their exit status.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present. This change unifies the default behavior
as follows:
root@prasad:~# ./zpool list 1 2
no pools available
no pools available
root@prasad:~# ./zpool iostat 1 2
no pools available
no pools available
root@prasad:~# ./zpool status 1 2
no pools available
no pools available
root@prasad:~# ./zpool list tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool iostat tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool status tan 1 2
cannot open 'tan': no such pool
Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306
If rc_parallel="YES" zfs starts before localmount, which leads
to "No such file or directory" error on systems with /usr on a
separate partition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux you may only disable USER xattrs. The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem. The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.
Under Solaris there is only a single xattr namespace which
is managed globally.
The lib/libspl/include/libgen.h header file was being mistakenly
left out of the 'make dist' tarball. It just happens this doesn't
cause a build failure when creating packages because the system
libgen/h is included instead. This simply results in the following
warning due to the missing forward declaration of mkdirp().
../../lib/libzfs/libzfs_mount.c:417:3: warning: implicit declaration
of function 'mkdirp' [-Wimplicit-function-declaration]
The Linux kernel already has support for mandatory locking. This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants. In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS. However, for now we'll leave it in place even
if it is redundant just in case we missed something.
The original patch to update the code to support mandatory locking
was done by Rohan Puri. This patch is an updated version which is
compatible with the previous mount option handling changes.
Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#222Closes#253
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
RPM version 4.9.0 has been observed to generate extra debug
messages in certain cases. These debug messages prevent us
from cleanly acquiring the architecture. This is clearly
an upstream RPM bug which will get fixed. But until then
a safe solution is to pipe the result through 'tail -1'
to just grab the architecture bit we care about.
Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output:
Freeing read locks for locker 0x166: 28031/47480843735008
Freeing read locks for locker 0x168: 28031/47480843735008
x86_64
Under Linux the VFS handles virtually all of the mmap() access
checks. Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.
However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored. Note, currently the code
to modify these attributes has not been implemented under Linux.
* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
attributes are set a file may not be mmaped with write access.
* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
read or exec access.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following functions were required for the OpenSolaris mmap
implementation. Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.
* zfs_null_putapage()
* zfs_frlock()
* zfs_no_putpage()
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
These days most disk drivers will probe for devices asynchronously.
This means it's possible that when you zfs init script runs all the
required block devices may not yet have been discovered. The result
is the pool may fail to cleanly import at boot time. This is
particularly common when you have a large number of devices.
The fix is for the init script to block until udev settles and we
are no longer detecting new devices. Once the system has settled
the zfs modules can be loaded and the pool with be automatically
imported.
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().
zfs_setattr uses zfs_freesp to free the disk space belonging to the
file. As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.
Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Update udev helper scripts to deal with device-mapper devices created
by multipathd. These enhancements are targeted at a particular
storage network topology under evaluation at LLNL consisting of two
SAS switches providing redundant connectivity between multiple server
nodes and disk enclosures.
The key to making these systems manageable is to create shortnames for
each disk that conveys its physical location in a drawer. In a
direct-attached topology we infer a disk's enclosure from the PCI bus
number and HBA port number in the by-path name provided by udev. In a
switched topology, however, multiple drawers are accessed via a single
HBA port. We therefore resort to assigning drawer identifiers based
on which switch port a drive's enclosure is connected to. This
information is available from sysfs.
Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
symbolic links in /dev/disk/by-id of the form
<label>-<UUID>-switch-port:<X>-slot:<Y>. <label> is a string that
depends on the subsystem that created the link and defaults to
"dm-uuid-mpath" (this prefix is used by multipathd). <UUID> is a
unique identifier for the disk typically obtained from the scsi_id
program, and <X> and <Y> denote the switch port and disk slot numbers,
respectively.
Add a callout script sas_switch_id for use by multipathd to help
create symlinks of the form described above. Update zpool_id and the
udev zpool rules file to handle both multipath devices and
conventional drives.
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated. Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API. See spl commit a55bcaad18.
This commit updates the ZFS code to use the slightly modified
API. You must use the latest SPL if your building ZFS.
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:
1. File is created.
2. xattr is created for that file (behind the scenes a xattr
directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
inode from that directory are evicted by prune_icache();
prune_icache() acquires a lock on both inodes before it
calls ->evict() on the inodes
When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.
This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches
This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.
To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained. This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#266