Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table. This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1230
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success. All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not. This may block writers or cause VERIFY failures when the
lock is freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1215
Closes zfsonlinux/spl#143
Closes zfsonlinux/spl#97
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter. In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.
Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.
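The NULL check described above can be sketched in user space as follows. This is only an illustration: struct nameidata_sketch and revalidate_flags() are hypothetical stand-ins, not the kernel's struct nameidata or the actual zpl_revalidate() code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct nameidata. */
struct nameidata_sketch {
	unsigned int flags;
};

/*
 * Return the lookup flags carried by the nameidata, or 0 when the
 * caller passed no nameidata at all (as lookup_one_len() may do).
 */
static unsigned int
revalidate_flags(const struct nameidata_sketch *nd)
{
	return (nd != NULL ? nd->flags : 0);
}
```

The rest of the revalidate logic can then use the returned flags without caring whether a nameidata was supplied.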
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1226
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1223
Commit 65d56083b4 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
through dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection". By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1220
This reverts commit 53c7411919
effectively reinstating the asynchronous xattr cleanup code.
These Linux changes were reverted because after testing
and careful contemplation I was convinced that due to the
89260a1c8851ce05ea04b23606ba438b271d890 commit they were no
longer required.
Unfortunately, the deadlock described in #1176 was a case
which wasn't considered. At mount zfs_unlinked_drain() can
occur which will unlink a list of znodes in effectively a
random order which isn't safe. The only reason it was safe
to originally revert this change was that we could guarantee
that the VFS would always prune the xattr leaves before the
parents.
Therefore, until we can cleanly resolve this deadlock for
all cases we need to keep this change in spite of the xattr
unlink performance penalty associated with it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1176
Issue #457
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL. The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.
Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back. Then a new inode with the same inode number would
be created and collide with the existing cached inode. Ideally
this would trigger a VERIFY() but in practice the error wasn't
handled and it would result in a NULL dereference.
Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.
The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly. If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.
This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled. The inode
will then only be accessible from the cached dentries. Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated. Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.
Special care is taken for negative dentries since they do not
reference any inode. These dentries will be invalidated based
on when they were added to the dentry cache. Entries added
before the last rollback will be invalidated to prevent them
from masking real files in the dataset.
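The rule for negative dentries can be sketched as a simple timestamp comparison. The type dentry_sketch_t, its fields, and negative_dentry_is_valid() are hypothetical names chosen for illustration, and treating an entry added at exactly the rollback time as still valid is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached dentry. */
typedef struct dentry_sketch {
	bool d_has_inode;	/* false for a negative dentry */
	uint64_t d_time;	/* when the entry was added to the cache */
} dentry_sketch_t;

/*
 * A negative dentry added before the last rollback may mask a file
 * that exists again after the rollback, so it must be invalidated.
 * Positive dentries are not judged here; they are handled through
 * the stale inode check instead.
 */
static bool
negative_dentry_is_valid(const dentry_sketch_t *d, uint64_t rollback_time)
{
	if (d->d_has_inode)
		return (true);
	return (d->d_time >= rollback_time);
}
```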
Two nice side effects of this fix are:
* Removes the dependency on spl_invalidate_inodes(), it can now
be safely removed from the SPL when we choose to do so.
* zfs_znode_alloc() no longer requires a dentry to be passed.
This effectively reverts this portion of the code to its
upstream counterpart. The dentry is now instantiated more
correctly in the Linux ZPL layer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #795
Lookups in the snapshot control directory for an existing snapshot
fail with ENOENT if an earlier lookup failed before the snapshot was
created. This is because the earlier lookup causes a negative dentry
to be cached which is never invalidated.
The bug can be reproduced as follows (the second ls should succeed):
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
$ zfs snap tank@s
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
To remedy this, always invalidate cached dentries in the snapshot
control directory. Since these entries never exist on disk there is
no significant performance penalty for the extra lookups.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1192
A misplaced single quote caused the umount command to fail with a
syntax error when unmounting snapshots under the .zfs/snapshot
control directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1210
In the stream_bytes() library function used by `zfs diff`, explicitly
cast each byte in the input string to an unsigned character so that the
Linux fprintf() correctly escapes to octal and does not mangle the output.
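A minimal sketch of the idea, assuming a platform where plain char is signed: without the cast, a byte such as 0xE9 is promoted to a negative int and the octal conversion prints far more than three digits. The helper escape_bytes() below is hypothetical and writes to a buffer rather than a FILE stream; it is not the actual stream_bytes() code.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Copy `in` to `out`, keeping printable ASCII other than a backslash
 * and writing every other byte as a three digit octal escape.
 */
static void
escape_bytes(const char *in, char *out, size_t outlen)
{
	size_t used = 0;

	if (outlen == 0)
		return;
	out[0] = '\0';

	/* Each step writes at most four characters plus the NUL. */
	for (; *in != '\0' && used + 5 <= outlen; in++) {
		/* The cast keeps bytes above 0x7f from going negative. */
		unsigned char c = (unsigned char)*in;

		if (c > ' ' && c != '\\' && c < 0x7f)
			used += snprintf(out + used, outlen - used,
			    "%c", c);
		else
			used += snprintf(out + used, outlen - used,
			    "\\%03o", (unsigned int)c);
	}
}
```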
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1172
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@8f0b538d1d
changeset: 13818:e9ad0a945d45
https://www.illumos.org/issues/3189
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
1337 `zpool status -D' should tell if there are no DDT entries
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
illumos/illumos-gate@ce72e614c1
illumos changeset: 13432:d1ad8d106d64
https://www.illumos.org/issues/1337
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2618 arc.c mistypes in the comments
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@fc98fea58e
illumos changeset: 13721:5b51a16a186f
https://www.illumos.org/issues/2618
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Certain versions of gcc generate an 'unrecognized command
line option' error message when -Wunused-but-set-variable
is used unconditionally. This in turn can cause several
of the autoconf tests to misdetect an interface.
Now, the use of -Wunused-but-set-variable in the autoconf
tests was introduced by commit b9c59ec8 to address a gcc
4.6 compatibility problem. So we really only need to pass
this option for versions of gcc which are known to support it.
Therefore, the tests have been updated to use the result of
the existing ZFS_AC_CONFIG_ALWAYS_NO_UNUSED_BUT_SET_VARIABLE
which determines if gcc supports this option.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1004
* Move -R option up one position in the list to match
the Illumos documentation.
* Move -D option up one position and refresh it to
match the Illumos documentation.
* Move -p option up one position and refresh it to
match the Illumos documentation.
* Add the -n, -P documentation found in zfs receive
to zfs send, where it belongs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1187
The only valid options are -vnFu; the other ones seem to be
misplaced zfs send options.
Remove: -D -r -p -n -P
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1186
A fsck helper to accommodate distributions that expect to be able
to execute a fsck on all filesystem types. Currently this script
does nothing but it could be extended to act as a compatibility
wrapper for 'zpool scrub'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #964
Rather than just reporting the failure include the passed
mount point and error number.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1153
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process). A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.
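The hazard of a hard-coded 1 can be illustrated with two copies of the numbering scheme. Only the meaning of the value 1 is taken from the description above; the other values, the OLD_/NEW_ prefixed names, and wait_meaning() are assumptions made for this sketch.

```c
/* Numbering before the change: 1 means "wait for the process". */
enum {
	OLD_UMH_WAIT_EXEC = 0,	/* assumed value */
	OLD_UMH_WAIT_PROC = 1
};

/* Numbering after the change: 1 means "wait for the exec only". */
enum {
	NEW_UMH_WAIT_EXEC = 1,
	NEW_UMH_WAIT_PROC = 2	/* assumed value */
};

/* Describe what a raw wait value means under either scheme. */
static const char *
wait_meaning(int new_scheme, int wait)
{
	int exec = new_scheme ? NEW_UMH_WAIT_EXEC : OLD_UMH_WAIT_EXEC;
	int proc = new_scheme ? NEW_UMH_WAIT_PROC : OLD_UMH_WAIT_PROC;

	if (wait == proc)
		return ("wait for process");
	if (wait == exec)
		return ("wait for exec");
	return ("other");
}
```

A call site that passes the constant name keeps its meaning under both schemes, while a call site that passes the literal 1 silently changes behavior.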
One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
This reverts commit 7afcf5b1da which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still correctly mounts the requested
snapshot, it updates the vfsmount such that it references the
original dataset vfsmount. The result is that the snapshot itself
isn't visible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
Related to 91579709fc we need to
be very careful about not overrunning the stack in kernel space.
However, in user space we're already allowing slightly larger
stacks so this stack usage optimization is not required there.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Feature flags support for ZFS ported from Illumos. Only minimal
compatibility changes were made where required to accommodate Linux.
For a detailed description of feature flags see the original proposal
on zfs-discuss. They are conceptually very similar to Linux's
ext[234] style of feature flags.
http://lists.freebsd.org/pipermail/freebsd-fs/2011-May/011568.html
NOTE: This branch updates the default pool version for new pools
from 28 to 5000. Version 28 pools may still be created for
compatibility with Solaris by using the '-o version=28' option.
$ zpool create -o version=28 ...
Existing pools must be manually upgraded using 'zpool upgrade'.
$ zpool upgrade ...
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #778
To save valuable stack all zio's were made asynchronous when in the
txg_sync_thread context or during pool initialization. See commit
2fac4c2 for the original patch and motivation.
Unfortunately, the changes to dsl_pool_sync_context() made by the
feature flags broke this logic, causing __zio_execute() to dispatch
itself infinitely when called during pool initialization. This
commit refines the existing logic to specifically target only the two
cases we care about.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@25345e4666
https://www.illumos.org/issues/3349
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@57221772c3
https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@dfbb943217
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090
https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #939
This reverts commit d135245791.
Since feature flags have now been merged we can apply the real
upstream fix from Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #997
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@53089ab7c8
illumos/illumos-gate@ad135b5d64
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
mountall in Debian depends on being able to pass the -f parameter to
mount, which specifies a fake mount and just updates the mtab. Currently
mount.zfs will fail such a request if it is not passed with -o zfsutil.
This patch allows a fake mount on a non-legacy filesystem to succeed in
the same manner as a -o remount does, thus enabling mountall to work
correctly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1167
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c
    } else {
            int i;
            ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
            /* the blkptrs we are losing better be unallocated */
            for (i = dn->dn_next_nblkptr[txgoff];
                i < dnp->dn_nblkptr; i++)
                    ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
This usage is in fact safe, since the ASSERT ensures the index does
not exceed the maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning. To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains
at most 3 block pointers this overhead should be negligible.
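The workaround can be sketched outside the kernel like this. The array of booleans, MAX_NBLKPTR, and tail_is_unallocated() are hypothetical stand-ins for the dnode fields and the BP_IS_HOLE() checks; only the loop shape follows the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define	MAX_NBLKPTR	3	/* the "at most 3 block pointers" noted above */

/* Return true when every slot in [first, nblkptr) is a hole. */
static bool
tail_is_unallocated(const bool is_hole[MAX_NBLKPTR], int first, int nblkptr)
{
	int i;

	assert(first <= nblkptr && nblkptr <= MAX_NBLKPTR);

	/*
	 * Start at the constant 0 so the compiler can bound the index,
	 * then skip the slots before `first` inside the loop body.
	 */
	for (i = 0; i < nblkptr; i++) {
		if (i < first)
			continue;
		if (!is_hole[i])
			return (false);
	}
	return (true);
}
```

The extra iterations cost at most a few comparisons because the array is so small.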
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #950
Currently ZFS doesn't show any I/O time in e.g. "top" wait% or in
/proc/$pid/stat's blkio_ticks. Using io_schedule() instead of
schedule() in zio_wait()'s cv_wait() is the correct way to fix
this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1158
Closes #1175
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several hundred seconds.
The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device. This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.
This change isn't in any way harmful but it does conflict with
a required change to properly account for I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely on io_schedule() via cv_wait_io().
The additional debugging information which was added to the
delay event has been intentionally left in place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock. But the Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.
To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock. Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.
When it is contended we risk a lock inversion if we were to
block waiting for the lock. Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again. This process can be repeated safely until both
locks are acquired.
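The try-then-restart pattern can be sketched in user space with a C11 atomic_flag standing in for spa_namespace_lock. All names ending in _sketch are hypothetical, SKETCH_ERESTARTSYS mirrors the kernel-internal ERESTARTSYS value of 512, and the sketch leaves out the case where the caller already holds the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define	SKETCH_ERESTARTSYS	512	/* mirrors the kernel's ERESTARTSYS */

/* Stand-in for spa_namespace_lock. */
static atomic_flag namespace_lock_sketch = ATOMIC_FLAG_INIT;

static bool
namespace_tryenter_sketch(void)
{
	/* test_and_set returns the old value: false means we got it. */
	return (!atomic_flag_test_and_set(&namespace_lock_sketch));
}

static void
namespace_exit_sketch(void)
{
	atomic_flag_clear(&namespace_lock_sketch);
}

/* Open handler, imagined to run with the bdev mutex already held. */
static int
dev_open_sketch(void)
{
	/* Never block here: blocking is what risks the inversion. */
	if (!namespace_tryenter_sketch())
		return (-SKETCH_ERESTARTSYS);

	/* ... work that needs both locks would go here ... */

	namespace_exit_sketch();
	return (0);
}

/*
 * Caller side: retry while the handler asks for a restart. In the
 * kernel the bdev mutex is dropped and retaken between attempts.
 */
static int
blkdev_get_sketch(int max_tries)
{
	int err = -SKETCH_ERESTARTSYS;

	while (err == -SKETCH_ERESTARTSYS && max_tries-- > 0)
		err = dev_open_sketch();
	return (err);
}
```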
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #612
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The AUTHORS file was getting stale. Refresh its contents
using the authors listed in the git commit logs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ChangeLog was retired long ago, the git commit logs are
authoritative. To avoid any confusion remove the ChangeLog.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance optimization to reduce contention on the zl_lock
from zil_commit(). This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.
However, under Linux I'm not seeing the same contention that
was originally described. Therefore, I'm removing this code
in order to wean ourselves off any dependence on TSD. If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.
This just leaves one small ZFS TSD consumer. If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#174
The current state of udev and device-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:
$ ls -d /dev/dm-*
/dev/dm-0
/dev/dm-1
/dev/dm-2
/dev/dm-3
it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:
$ ls -d /dev/sd*
/dev/sda
/dev/sda1
/dev/sdb
/dev/sdb1
Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.
As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.
This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.
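The exception can be sketched as a small predicate. This assumes DM devices are recognized by a "dm-" name prefix, as in the /dev/dm-* listing above; should_switch_elevator() is a hypothetical helper, not the actual vdev_elevator_switch() code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Decide whether the I/O elevator may be changed for a device.
 * Whole disks always qualify; otherwise only DM devices do, since
 * they are labeled as partitions even when ZFS owns them entirely.
 */
static bool
should_switch_elevator(const char *devname, bool wholedisk)
{
	if (wholedisk)
		return (true);
	return (strncmp(devname, "dm-", 3) == 0);
}
```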
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1149
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented. This prevented
a zpool from using a zvol for its slog device.
This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c. Given the full path to a
device it will look up the device and verify its major number
against the registered zvol major number for the system. If
they match we know the device is a zvol.
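The major number comparison can be sketched in user space on Linux as shown below. The value 230 for SKETCH_ZVOL_MAJOR is an assumption for illustration, and dev_is_zvol() is a hypothetical helper that takes a dev_t (for example st_rdev from stat(2)) rather than a path; the kernel implementation resolves the device differently.

```c
#include <stdbool.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

#define	SKETCH_ZVOL_MAJOR	230	/* assumed zvol major number */

/* A device is treated as a zvol when its major number matches. */
static bool
dev_is_zvol(dev_t dev)
{
	return (major(dev) == SKETCH_ZVOL_MAJOR);
}
```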
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1131