When vdev_disk.c was implemented for Linux we failed to handle the
reopen case. According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set. Under Linux
we would always close and open the device.
This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing. The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links. The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.
This patch adds the missing functionality in a similiar fashion to
to the Illumos code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
`zpool status -x` should only flag errors or where the pool is
unavailable. If it imported fine but isn't using the latest features
available in the code, that's not an error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1319
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.
Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.
As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.
This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
Thanks to Richard Kojedzinszky for catching this nasty bug.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1318
As a matter of fact, we're already using -Werror for most tests because
of a bug in kernel-bio-empty-barrier.m4 which sets -Werror without
reverting it afterwards. This meant that all tests which ran after this
one was using -Werror.
This patch simply makes it clear that we're using -Werror and makes
the code more readable and more predictable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1317
The zfs_arc_memory_throttle_disable module option was introduced
by commit 0c5493d470 to resolve a
memory miscalculation which could result in the txg_sync thread
spinning.
When this was first introduced the default behavior was left
unchanged until enough real world usage confirmed there were no
unexpected issues. We've now reached that point. Linux's
direct reclaim is working as expected so we're enabling this
behavior by default.
This helps pave the way to retire the spl_kmem_availrmem()
functionality in the SPL layer. This was the only caller.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #938
Rather then setting _prefix=/ and having to override all the
default install locations. It's cleaner, and more understandable,
to leave prefix=/usr and only override _sbindir and _libdir. This
fixes three issues:
* The commands no longer get built with an incorrect rpath for
the libraries. This is good because fixing this sort of
thing is required by the Fedora packaging guidelines.
http://fedoraproject.org/wiki/Packaging:Guidelines#Beware_of_Rpath
* The various AUTHORS, COPYRIGHT, etc files are now correctly
installed under /usr/share/doc instead of /share/doc.
* _libexecdir is now handled properly for each distribution.
Fedora/RHEL=/usr/libexec, OpenSUSE/SLES=/usr/lib, Debian=/usr/lib/rpm
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1058
A couple of assertions in spa.c were designed to prevent the use of
invalid pool versions. They were written under the assumption
that all valid pools are less than SPA_VERSION. Since feature flags
jumped from 28 to 5000, any numbers in the range 28 to 5000
non-inclusive will fail to trigger them. We switch to the new
SPA_VERSION_IS_SUPPORTED macro to correct this.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1282
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN. It does however
appear to correctly handle MAXPATHLEN for us.
The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function. Several in-tree filesystems
handle this case the same way.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1279
As of 0.6.0-rc11 using ZFS volumes as Linux swap devices is
supported. Swapping to files in ZFS filesystems is not.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1189
Two more locations where KM_SLEEP was used in a call which must
use KM_PUSHPAGE were found while using the zpool upgrade command.
See commit b8d06fc for additional details.
Also make a small correction to the comment block above
dsl_dir_open_spa().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1268
Long ago infrastructure was added to the SPL to keep an internal
debug log of the last few seconds of activity. This was helpful
during the early development, but these days it is no longer
needed. I haven't had to resort to this debug buffer to resolve
an issue for several years now.
Today better more generic tools like systemtap and ftrace have
evolved to the point where they can be used for this purpose.
Along with the stack trace dumped to the system console, and in
rare cases a crash dump we almost always have the debug we need.
Therefore, I'm disabling the code which automatically dumps
this log to disk during an assertion except for the case where
spl_debug_panic_on_bug is set (disabled by default).
This should be viewed as a first step towards either.
a) Retiring this infrastructure and complexity entirely, or
b) Integrating this logging more properly with ftrace.
As part of this change I'm also removing from the packages the
undocumented spl utility which is used to decode the binary logs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The HAVE_MUTEX_OWNER_TASK_STRUCT fails on PPC64 with the following
error:
error: 'current' undeclared (first use in this function)
We include linux/sched.h to ensure that current is available.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The machelf.h header is never included by anything in the zfs
build process. It is all effectively dead code which can be
safely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1265
binutils 2.23.1 fails in situations that generate function relocations
on PowerPC and possibly other architectures. This causes linking of
libzpool to fail because it depends on libnvpair. We add a dependency on
libnvpair to lib/libzpool/Makefile.am to correct that.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1267
The SPL_AC_ATOMIC_SPINLOCK, SPL_AC_TYPE_ATOMIC64_CMPXCHG, and
SPL_AC_TYPE_ATOMIC64_XCHG were all directly including the
'asm/atomic.h' header. As of Linux 3.4 this header was removed
which results in a build failure.
The right thing to do is include 'linux/atomic.h' however we
can't safely do this because it doesn't exist in 2.6.26 kernels.
Therefore, we include 'linux/fs.h' which in turn includes the
correct atomic header regardless of the kernel version.
When these incorrect APIs are used in ZFS the following build
failure results.
arc.c:791:80: warning: '__ret' may be used uninitialized
in this function [-Wuninitialized]
arc.c:791:1875: error: call to '__cmpxchg_wrong_size'
declared with attribute error: Bad argument size for cmpxchg
Since this is all Linux 2.6.24 compatibility code there's
an argument to be made that it should be removed because
kernels this old are not supported. However, because we're
so close to a release I'm going to leave it in place for now.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#814Closeszfsonlinux/zfs#1254
Explicitly case this value to an unsigned long long for 32-bit
systems to inform the compiler that a long type should not be
used. Otherwise we get the following compiler error:
dmu_send.c:376: error: integer constant is too large for
‘long’ type
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The zpool-features(5) man page should reference the Linux zfs(8)
and zpool(8) man pages. The 1M convention isn't used on Linux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1184
The zpool-features(5) man page was accidentally omitted from the
build target when feature flags was merged. As a result it doesn't
get installed as part of 'make install' so none of the packages
include this man page.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1262
The way in which virtual box ab(uses) memory can throw off the
free memory calculation in arc_memory_throttle(). The result is
the txg_sync thread will effectively spin waiting for memory to
be released even though there's lots of memory on the system.
To handle this case I'm adding a zfs_arc_memory_throttle_disable
module option largely for virtual box users. Setting this option
disables free memory checks which allows the txg_sync thread to
make progress.
By default this option is disabled to preserve the current
behavior. However, because Linux supports direct memory reclaim
it's doubtful throttling due to perceived memory pressure is ever
a good idea. We should enable this option by default once we've
done enough real world testing to convince ourselve there aren't
any unexpected side effects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#938
Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable.
It should have been made available as a module option in the
original patch but was overlooked.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The tsd_exit() and tsd_destroy() functions remove entries from
hash bins without taking the hash bin lock. They do take the
table lock, but tsd_get() and tsd_set() only take the hash bin
lock to allow for maximum concurency.
The result is that while tsd_get() and tsd_set() are traversing
the hash bin list it can be modified by another thread in which
happens to hash to the same value. To avoid this add the needed
locking to tsd_exit() and tsd_destroy().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
This is a minor nit, but the second line of the 'action' message
when you need to upgrade your pool to support feature flags exceeds
the standard 80 character limit. Fix it by moving the word
'feature' on to the third line.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When a system attribute layout is created an inconsistency may occur
between the system attribute header (sa_hdr_phys_t) size and the
variable-sized attribute count stored in the layout. The inconsistency
results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT
returns false:
SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab())
ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr,
tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) &&
hdr->sa_layout_info == 0)) failed
The bug originates in this snippet from sa_find_sizes().
if (is_var_sz && var_size > 1) {
if (P2ROUNDUP(hdrsize + sizeof (uint16_t),
*total < full_space) {
hdrsize += sizeof (uint16_t);
This assumes that the current variable-sized attribute will be stored in
the current buffer and accounts for the space needed to store its size
in the sa_hdr_phys_t. However if the next attribute spills over we need
to store a blkptr_t at the end of the bonus buffer to point to the spill
block. If the current attribute is in the way of the blkptr_t then it
too will be relocated into the spill block. But since we've already
accounted for it in the header size we get the inconsistency described
above.
To avoid this, record the index of the last variable-sized attribute
that prompted a hdrsize increase, and reverse the increase if we later
determine that that attribute will be relocated to the spill block.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1250
A rounding discrepancy exists between how sa_build_layouts() and
sa_find_sizes() calculate when the spill block needs to be kicked in.
This results in a narrow size range where sa_build_layouts() believes
there must be a spill block allocated but due to the discrepancy there
isn't. A panic then occurs when the hdl->sa_spill NULL pointer is
dereferenced.
The following reproducer for this bug was isolated:
truncate -s 128m /tmp/tank
zpool create tank /tmp/tank
zfs create -o xattr=sa tank/fish
ln -s `perl -e 'print "z" x 41'` /tank/fish/z
setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z
This test results in roughly the following system attribute (SA)
layout:
176 bytes - "standard" SA's
41 bytes - name of symbolic link target
100 bytes - XDR encoded nvlist for xattr
---
317 bytes - total
Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes()
decides no spill block is needed. But sa_build_layouts() rounds 41 up
to 48 when computing the space requirements so it tries to switch to
the spill block.
Note that we were only able to reproduce this bug using a combination
of symbolic links and the Linux-specific xattr=sa dataset property.
So while this issue is not technically Linux-specific, it may be
difficult or impossible to hit the narrow size range needed to
reproduce it on other platforms.
To fix the discrepancy, round the running total in sa_find_sizes() up
to an 8-byte boundary before accounting for each SA, since this is how
they will be stored in the bonus and (possibly) spill buffers.
To make the intent of the code more clear, explicitly assert key
assumptions about expected alignment of data and whether spill-over
will occur.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1240
Check at ./configure time that the kernel was built with kallsyms
support. If the kernel doesn't have CONFIG_KALLSYMS defined the
modules will still compile cleanly but will not be loadable. So
we really want to catch this early during ./configure. Note that
we do not require CONFIG_KALLSYMS_ALL but it may be safely defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6
In the interest of maintaining only one udev helper to give vdevs
user friendly names, the zpool_id and zpool_layout infrastructure
is being retired. They are superseded by vdev_id which incorporates
all the previous functionality.
Documentation for the new vdev_id(8) helper and its configuration
file, vdev_id.conf(5), can be found in their respective man pages.
Several useful example files are installed under /etc/zfs/.
/etc/zfs/vdev_id.conf.alias.example
/etc/zfs/vdev_id.conf.multipath.example
/etc/zfs/vdev_id.conf.sas_direct.example
/etc/zfs/vdev_id.conf.sas_switch.example
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#981
Commit 4b2f65b253 increased the user
space stack by 4x to resolve certain stack overflows. As such it
no longer makes sense to worry about a single extra page which
might or might not be part of the process stack. There is now
ample headroom for normal usage.
By eliminating this configure check we are also resolving the
following segfault which intentionally occurs at configure time
and may be logged in dmesg.
conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe
sp 00007fbf18a4be50 error 6 in conftest[400000+1000]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The new lz4 compression algorithm, zfsonlinux/zfs@9759c60, requires
the generic BE_IN16 and BE_IN32 functions. These are added to the SPL
for other consumers to take advantage of.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3035 LZ4 compression support in ZFS and GRUB
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>
References:
illumos/illumos-gate@a6f561b4aehttps://www.illumos.org/issues/3035http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS
This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux. Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.
Support for GRUB has been dropped from this patch. That code
is available but those changes will need to made to the upstream
GRUB package.
Lastly, several hunks of dead code were dropped for clarity. They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.
Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1217
The -q option should quiet the mkfs.ext2 output but certain
versions of e2fsprogs appear to ignore it. This can result in
an extra 'done' message in the test output. To keep this noise
from distracting just direct stdout to /dev/null.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
It's doubtful many people were impacted by this but commit 6c28567
accidentally broke ZFS builds for 2.6.26 and earlier kernels. This
commit depends on the lookup_bdev() function which exists in 2.6.26
but wasn't exported until 2.6.27.
The availability of the function isn't critical so a wrapper is
introduced which returns ERR_PTR(-ENOTSUP) when the function isn't
defined. This will have the effect of causing zvol_is_zvol() to
always fail for 2.6.26 kernels. This in turn means vdevs will
always get opened concurrently which is good for normal usage.
This will only become an issue if your using a zvol as a vdev in
another pool. In which case you really should be using a newer
kernel anyway.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1205
Test 5, 6, 7, and 7 in zconfig.sh use /bin/ as a source of random
directories and files for their test. This has lead to unexpected
tests failures because the total size of /bin/ on the test system
isn't checked and it is entirely possible for it to be larger than
the target filesystem.
To resolve this issue we create a somewhat random collection of
files and directories in /var/tmp to use. On average we expect
about 5MB of data with the worst case being 20MB. This is large
enough to be interesting and small enough to always fit in the
default test datasets.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1113
The differ() function used strerror_r() instead of strerror() because
it allowed the error message to be directly copied in to a buffer.
This causes two issues under Linux.
* There are two versions of strerror_r() available an XSI-compliant
version which returns an 'int' error code. And a GNU-specific
version which return a 'char *' to the resulting error string.
int strerror_r(int errnum, char *buf, size_t buflen); /* XSI */
char *strerror_r(int errnum, char *buf, size_t buflen); /* GNU */
* The most recent versions of strerror_r() are annotated with the
warn_unused_result attribute. This causes the following warning
since the upstream implementation casts the result to void.
warning: ignoring return value of 'strerror_r', declared with
attribute warn_unused_result [-Wunused-result]
The cleanest way to resolve both of these problems is just to use
strerror() and make a copy of the result in to the buffer. This
resolves both issues and this is the only instance of strerror_r()
in the code base.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1231
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior. The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache. On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.
This behavior is now configurable through the 'spl_kmem_cache_expire'
module option. The value is a bit mask with the following meaning.
0x1 - Solaris style cache aging eviction is enabled.
0x2 - Linux style low memory eviction is enabled.
Both methods may be safely enabled simultaneously, but by default
both are disabled. It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good. It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes#210
Explicitly set acl details to zero to silence gcc (zfs_acl_node_read
can't be sure zfs_acl_znode_info will set acl_count and aclsize).
Normally suppressing these warnings by setting this to zero at
declaration time is a bad idea but in this instance it's hard to
avoid and should be fairly safe.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1244
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation. There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.
https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1238
The vdev_id udev helper strictly requires configuration file keywords
to always be anchored at the beginning of the line and to be followed
by a space character. However, users may prefer to use indentation or
tab delimitation. Improve flexibility by simply requiring a keyword
to be the first field on the line.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1239
Parted version 3.0 doesn't allow us to specify the start and end
percentages as 50% and 100% respectively. This results in:
Error: The location 100% is outside the device /dev/zd0
Therefore we change the syntax to 51% and -1 for end of device.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The 'exit $?' command in the INT TERM EXIT trap was overwritting
the expected error code with the error code from mv. Fix the
issue by removing the 'exit $?'. It's important the we preserve
the original error code so failures are easily noticed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table. This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1230
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success. All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not. This may block writers or cause VERIFY failures when the
lock is freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1215Closeszfsonlinux/spl#143Closeszfsonlinux/spl#97
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter. In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.
Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1226
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1223
Commit 65d56083b4 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection". By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1220
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
This reverts commit 53c7411919
effectively reinstating the asynchronous xattr cleanup code.
These Linux changes were reverted because after testing
and careful contemplation I was convinced that due to the
89260a1c8851ce05ea04b23606ba438b271d890 commit they were no
longer required.
Unfortunately, the deadlock described in #1176 was a case
which wasn't considered. At mount zfs_unlinked_drain() can
occur which will unlink a list of znodes in effectively a
random order which isn't safe. The only reason it was safe
to originally revert this change was the we could guarantee
that the VFS would always prune the xattr leaves before the
parents.
Therefore, until we can cleanly resolve this deadlock for
all cases we need to keep this change in spite of the xattr
unlink performance penalty associated with it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1176
Issue #457
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL. The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.
Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back. Then a new inode with the same inode number would
be create and collide with the existing cached inode. Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.
Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.
The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly. If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.
This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled. The inode
will then only be accessible from the cached dentries. Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated. Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.
Special care is taken for negative dentries since they do not
reference any inode. These dentires will be invalidate based
on when they were added to the dentry cache. Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.
Two nice side effects of this fix are:
* Removes the dependency on spl_invalidate_inodes(), it can now
be safely removed from the SPL when we choose to do so.
* zfs_znode_alloc() no longer requires a dentry to be passed.
This effectively reverts this portition of the code to its
upstream counterpart. The dentry is not instantiated more
correctly in the Linux ZPL layer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#795