Commit Graph

5107 Commits

Author SHA1 Message Date
Christopher Siden b9b24bb4ca Illumos #2762: zpool command should have better support for feature flags
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@57221772c3
  https://www.illumos.org/issues/2762

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson 3bc7e0fb0f Illumos #3090 and #3102
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@dfbb943217
  illumos changeset: 13777:b1e53580146d
  https://www.illumos.org/issues/3090
  https://www.illumos.org/issues/3102

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #939
2013-01-08 10:35:42 -08:00
Brian Behlendorf 5ac0c30a94 Revert "Temporarily disable the reguid test."
This reverts commit d135245791.
Since feature flags have now been merged we can apply the real
upstream fix from Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #997
2013-01-08 10:35:42 -08:00
Christopher Siden 9ae529ec5d Illumos #2619 and #2747
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@53089ab7c8
  illumos/illumos-gate@ad135b5d64
  illumos changeset: 13700:2889e2596bd6
  https://www.illumos.org/issues/2619
  https://www.illumos.org/issues/2747

NOTE: The grub specific changes were not ported.  This change
must be made to the Linux grub packages.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:35 -08:00
Brian Behlendorf 1c7b3eaf87 RHEL 6.4 compat, fallocate()
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure.  Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().

Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.

To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.
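
The fallback order described above can be modeled in user space. This is only a sketch: the `model_*` structures, the two sample hooks, and the run-time pointer check are assumptions made for illustration, and the real helper may rely on build-time detection instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for the kernel's fallocate hooks. */
typedef int (*model_fallocate_fn)(int mode, long offset, long len);

struct model_file_ops  { model_fallocate_fn fallocate; };
struct model_inode_ops { model_fallocate_fn fallocate; };

struct model_file {
	const struct model_file_ops  *f_op;	/* newer home of the hook */
	const struct model_inode_ops *i_op;	/* older home of the hook */
};

/* Sample hooks; the return values only identify which one was called. */
static int
new_style_fallocate(int mode, long offset, long len)
{
	(void) mode; (void) offset; (void) len;
	return (1);
}

static int
old_style_fallocate(int mode, long offset, long len)
{
	(void) mode; (void) offset; (void) len;
	return (2);
}

/*
 * Prefer the file_operations hook, fall back to the inode_operations
 * hook, and report -EOPNOTSUPP when neither is present.  Whether a
 * flag such as FALLOC_FL_PUNCH_HOLE is defined plays no part here.
 */
static int
model_filp_fallocate(const struct model_file *fp, int mode,
    long offset, long len)
{
	assert(fp != NULL);

	if (fp->f_op != NULL && fp->f_op->fallocate != NULL)
		return (fp->f_op->fallocate(mode, offset, len));
	if (fp->i_op != NULL && fp->i_op->fallocate != NULL)
		return (fp->i_op->fallocate(mode, offset, len));

	return (-EOPNOTSUPP);
}
```

The point of the sketch is that the decision is driven by which hook exists, not by the presence of a #define.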

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 09:53:13 -08:00
Dominik Honnef 15313c5e18 Fix duplicate words in zpool.8
Remove the duplicate words 'cannot be' from the zpool.8 man page.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1177
2013-01-07 11:34:30 -08:00
Will Rouesnel 462ee8e3f3 Allow fake mounts to succeed on non-legacy filesystems.
mountall in Debian depends on being able to pass the -f parameter to
mount, which specifies a fake mount and just updates the mtab. Currently
mount.zfs will fail such a request if it is not passed with -o zfsutil.

This patch allows a fake mount on a non-legacy filesystem to succeed in
the same manner as a -o remount does, thus enabling mountall to work
correctly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1167
2013-01-07 11:30:27 -08:00
Ned Bass 37f000c5aa Fix gcc array subscript above bounds warning
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c

    } else {
            int i;
            ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
            /* the blkptrs we are losing better be unallocated */
            for (i = dn->dn_next_nblkptr[txgoff];
                i < dnp->dn_nblkptr; i++)
                    ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));

This usage is in fact safe, since the ASSERT ensures the index does
not exceed the maximum possible number of block pointers. However, gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning.  To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body.  Since a dnode contains
at most 3 block pointers this overhead should be negligible.
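
A minimal user-space sketch of the reworked loop follows.  The names `fake_blkptr`, `MAX_NBLKPTR`, and `tail_is_unallocated()` are hypothetical stand-ins, not the identifiers used in dnode_sync.c.

```c
#include <assert.h>
#include <stddef.h>

#define	MAX_NBLKPTR	3	/* assumption: at most 3 block pointers */

/* Hypothetical stand-in for a block pointer; only the hole flag matters. */
struct fake_blkptr {
	int is_hole;
};

/*
 * Check that every entry in [start, nblkptr) is a hole.  The index
 * starts at zero, which the compiler can prove is in bounds, and the
 * entries before 'start' are skipped inside the loop body.
 * Returns 1 when all checked entries are holes, 0 otherwise.
 */
static int
tail_is_unallocated(const struct fake_blkptr bp[MAX_NBLKPTR],
    int start, int nblkptr)
{
	int i;

	assert(start >= 0 && start <= nblkptr && nblkptr <= MAX_NBLKPTR);

	for (i = 0; i < nblkptr; i++) {
		if (i < start)
			continue;	/* entry is being kept, skip it */
		if (!bp[i].is_hole)
			return (0);
	}

	return (1);
}
```

With at most three entries, the extra iterations spent skipping cost next to nothing.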

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #950
2013-01-07 11:21:52 -08:00
Brian Behlendorf e44056fcb6 Merge branch 'io_schedule'
Currently ZFS doesn't show any I/O time in, e.g., "top" wait% or in
/proc/$pid/stat's blkio_ticks.  Using io_schedule() instead of
schedule() in zio_wait()'s cv_wait() is the correct way to fix
this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1158
Closes #1175
2013-01-07 10:55:20 -08:00
Matt Johnston 72938d6905 Use cv_wait_io() which will account for iowait
Update zio_wait() to use cv_wait_io() to ensure the iowait time
is properly accounted for.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:52:52 -08:00
Matt Johnston 72f53c5694 Revert part of "Log I/Os longer than zio_delay_max (30s default)"
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several hundred seconds.

The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device.  This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.

This change isn't in any way harmful but it does conflict with
a required change to properly account for I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely on io_schedule() via cv_wait_io().

The additional debugging information which was added to the
delay event has been intentionally left in place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:51:04 -08:00
Matt Johnston 46a75aadb7 Add cv_wait_io() to account I/O time
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting.  The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #206
2013-01-07 10:29:26 -08:00
Brian Behlendorf 15f9d4e1c2 ZFS 0.6.0-rc13 2012-12-20 11:02:24 -08:00
Brian Behlendorf 02d25048d2 SPL 0.6.0-rc13 2012-12-20 11:01:47 -08:00
Brian Behlendorf 65d56083b4 Fix zpool on zvol lock inversion deadlock
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock.  But the Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.

To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock().  Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.

When it is contended we risk a lock inversion if we were to
block waiting for the lock.  Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again.  This process can be repeated safely until both
locks are acquired.
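
The retry protocol above can be modeled in user space.  In this sketch C11 atomic flags stand in for the two kernel locks, and `sketch_open()` and `sketch_blkdev_get()` are hypothetical names; the real code paths are considerably more involved.

```c
#include <assert.h>
#include <stdatomic.h>

#ifndef ERESTARTSYS
#define	ERESTARTSYS	512	/* kernel-internal errno, absent in user space */
#endif

/* Atomic flags standing in for spa_namespace_lock and bdev->bd_mutex. */
static atomic_flag namespace_lock = ATOMIC_FLAG_INIT;
static atomic_flag device_mutex = ATOMIC_FLAG_INIT;

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int
try_lock(atomic_flag *l)
{
	return (!atomic_flag_test_and_set(l));
}

static void
unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/*
 * Model of the open handler.  It is entered with device_mutex held, so
 * it must never block on namespace_lock.  When the lock is contended
 * it asks the caller to drop device_mutex and call again.
 */
static int
sketch_open(void)
{
	if (!try_lock(&namespace_lock))
		return (-ERESTARTSYS);

	/* ... the actual open work would happen here ... */

	unlock(&namespace_lock);
	return (0);
}

/* Model of the caller: take the device lock, call open, retry on request. */
static int
sketch_blkdev_get(int max_attempts)
{
	int attempt, rc;

	for (attempt = 0; attempt < max_attempts; attempt++) {
		while (!try_lock(&device_mutex))
			;	/* spin; a real mutex would sleep */
		rc = sketch_open();
		unlock(&device_mutex);
		if (rc != -ERESTARTSYS)
			return (rc);
	}

	return (-ERESTARTSYS);
}
```

The attempt limit exists only so the model terminates; the message above describes a retry that repeats until both locks are held.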

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #612
2012-12-20 09:57:39 -08:00
Brian Behlendorf d5446cfc52 Revert "Remove TSD zfs_fsyncer_key"
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-20 09:56:28 -08:00
Brian Behlendorf 23e3c743f6 Refresh AUTHORS
The AUTHORS file was getting stale.  Refresh its contents
using the authors listed in the git commit logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-19 09:47:25 -08:00
Brian Behlendorf 5b2fdbb69c Refresh AUTHORS
The AUTHORS file was getting stale.  Refresh its contents
using the authors listed in the git commit logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-19 09:40:18 -08:00
Brian Behlendorf dd5b6d96f1 Remove the ChangeLog
The ChangeLog was retired long ago, the git commit logs are
authoritative.  To avoid any confusion remove the ChangeLog.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-19 09:28:18 -08:00
Brian Behlendorf 840222eb0b Remove the ChangeLog
The ChangeLog was retired long ago, the git commit logs are
authoritative.  To avoid any confusion remove the ChangeLog.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-19 09:27:50 -08:00
Brian Behlendorf 31f2b5abdf Remove TSD zfs_fsyncer_key
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance optimization to reduce contention on the zl_lock
from zil_commit().  This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.

However, under Linux I'm not seeing the same contention that
was originally described.  Therefore, I'm removing this code
in order to wean ourselves off any dependence on TSD.  If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.

This just leaves one small ZFS TSD consumer.  If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#174
2012-12-19 09:08:01 -08:00
Brian Behlendorf 034f1b331e Fix spl_kmem_init_kallsyms_lookup() panic
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute.  To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#699
Closes zfsonlinux/zfs#859
2012-12-19 09:06:35 -08:00
Prakash Surya 84daaddedb Set elevator for DM devices despite vdev_wholedisk
The current state of udev and device-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:

    $ ls -d /dev/dm-*
    /dev/dm-0
    /dev/dm-1
    /dev/dm-2
    /dev/dm-3

it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:

    $ ls -d /dev/sd*
    /dev/sda
    /dev/sda1
    /dev/sdb
    /dev/sdb1

Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.

As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.

This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1149
2012-12-18 15:12:40 -08:00
Jorgen Lundman 6c2856726f Fix using zvol as slog device
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented.  This prevented
a zpool from using a zvol for its slog device.

This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c.  Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system.  If
they match we know the device is a zvol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1131
2012-12-18 11:02:28 -08:00
Richard Yao 30196bfd42 Do not use KERNEL_DIR env var in Makefile.am
A Gentoo user reported an issue where the build system would
attempt to recurse into the kernel source tree if KERNEL_DIR
is set in the environment. KERNEL_DIR is an environment variable
that is used when the kernel sources are in a non-standard
location, so it is necessary to stop relying on it to prevent
this issue.

https://bugs.gentoo.org/show_bug.cgi?id=433946

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-17 10:59:12 -08:00
Massimo Maggi 5e6320cd12 Fix get/set users/groups in quota props via numeric id
Fix setting/getting users/groups in quota properties through
numeric identifier.  This support was accidentally disabled
in the original port by applying the HAVE_IDMAP wrapper macro
too broadly.

Fix obtained by moving #ifdef HAVE_IDMAP to exclude only
the part of code that really needs IDMAP.  Now zfs (get|set)
(user|group)quota@1000 works as expected.
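
The narrowed guard can be illustrated with a small sketch.  `parse_numeric_id()` and `resolve_quota_id()` are hypothetical names, and name based lookups are omitted; the point is only that the numeric path sits outside the HAVE_IDMAP block.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a purely numeric identifier such as "1000".
 * Returns 0 and stores the value on success, -1 otherwise.
 */
static int
parse_numeric_id(const char *s, unsigned long *id)
{
	char *end;
	unsigned long v;

	assert(s != NULL && id != NULL);

	/* Reject signs and leading whitespace, which strtoul() accepts. */
	if (!isdigit((unsigned char)s[0]))
		return (-1);

	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno != 0 || *end != '\0')
		return (-1);

	*id = v;
	return (0);
}

/* Resolve the part after the '@' in a property such as userquota@1000. */
static int
resolve_quota_id(const char *who, unsigned long *id)
{
	/* Numeric ids need no identity mapping, so this is never guarded. */
	if (parse_numeric_id(who, id) == 0)
		return (0);

#ifdef HAVE_IDMAP
	/* Only an idmap based lookup would belong here (omitted). */
#endif

	return (-1);
}
```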

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1147
2012-12-17 09:52:58 -08:00
Richard Yao fd7fd5ebcf Do not use KERNEL_DIR env var in Makefile.am
A Gentoo user reported an issue where the build system would
attempt to recurse into the kernel source tree if KERNEL_DIR
is set in the environment. KERNEL_DIR is an environment variable
that is used when the kernel sources are in a non-standard
location, so it is necessary to stop relying on it to prevent
this issue.

https://bugs.gentoo.org/show_bug.cgi?id=433946

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-17 09:45:47 -08:00
Brian Behlendorf 8780c53961 Update SAs when an inode is dirtied
Revert the portion of commit d3aa3ea which always resulted in the
SAs being updated when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it in to a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implements the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
Closes #1140
2012-12-14 12:18:54 -08:00
Brian Behlendorf bd192c4f48 Update 69-vdev.rules .gitignore
Commit 2957f38 renamed 60-vdev.rules to 69-vdev.rules but failed
to update the .gitignore file to reflect this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-14 12:16:42 -08:00
Ned Bass 7afcf5b1da Avoid ELOOP on auto-mounted snapshots
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns.  Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
2012-12-13 08:57:11 -08:00
Brian Behlendorf 2ae1031962 Linux 3.7 compat, schedule_delayed_work()
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers.   While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.

I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel.  But frankly
the compatibility code to use this kernel interface has always
been problematic.

Therefore, this patch abandons direct use of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1053
2012-12-12 10:47:05 -08:00
Brian Behlendorf 18e0c500a7 Merge branch 'taskq'
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #199
2012-12-12 10:45:48 -08:00
Brian Behlendorf eb0be2ed46 Removed SPL_AC_3ARGS_INIT_WORK check
All consumers of the kernel delayed work queues have been shifted
over to rely on the taskq implementation.  This compatibility code
can now be removed.  Any new callers which need this functionality
should use the taskq interfaces for delayed work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:57:10 -08:00
Brian Behlendorf 33e94ef1dd kmem-cache: Use a taskq for async allocations
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernel's delayed work queue interface
and all the compatibility code it requires.

This code never actually used the delay functionality it was just
done this way to leverage the existing compatibility code.  All that
is required is a thread context to perform the allocation in.  The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf a10287e00d kmem-cache: Use taskqs for ageing
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces.  This allows us to abandon the kernel's
delayed work queue interface and all the compatibility code it
requires.

However, the delayed taskq interface does not allow us to schedule
a task for a specific cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function.  The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu.  It then blocks
waiting for them to complete and promptly reclaims any free slabs.

While restructuring the code wasn't the primary goal I think the
new code is far more understandable and maintainable.  It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 296a8e596d kmem-cache: spl_kmem_cache_create() may always sleep
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context.  In
practice there are no callers which ever do this.  This makes sense
since a cache is by design a long lived data structure.

To prevent abuse of this function going forward I'm removing the
code which is supposed to handle an atomic context.  All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf a5a98e7260 splat taskq:front: Reduce stack frame
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:front splat test over the frame size
limit.  To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.

  In function 'splat_taskq_test6_impl'
  error: the frame size of 1648 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 94ff5d38e3 splat taskq:order: Reduce stack frame
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:order splat test over the frame size
limit.  To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.

  In function 'splat_taskq_test5_impl'
  error: the frame size of 1680 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 3238e71763 splat taskq:cancel: Add test case
Add a test case for taskq_cancel_id() to verify it is working
properly.  Just like taskq:delay we start by dispatching 100
tasks.  However this time 1/3 of the tasks use taskq_dispatch()
and will be run immediately, and 2/3 use taskq_dispatch_delay().
The idea is to create a busy taskq with active, pending,
and delayed tasks.

After all the items have been successfully dispatched the test
begins randomly canceling known task ids.  It will do this for
5 seconds randomly canceling a task id and then sleeping for a
few milliseconds.   The task being canceled may have already run,
still be on the pending list, or may be currently being executed
by a worker thread.  The idea is to ensure we catch any subtle
race conditions.

Once all the non-canceled tasks have completed we cross check
the number of tasks which ran with the number of tasks which
were successfully canceled.  Additionally, we verify that the
taskq_cancel_id() function never blocks longer than needed.
This time is bounded by the longest run time of the task which
was dispatched.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:49 -08:00
Brian Behlendorf 2f35782620 splat taskq:delay: Add test case
Add a test case for taskq_dispatch_delay() to verify it is working
properly.  The test dispatches 100 tasks to a taskq with random
expiration times spread over 5 seconds.  As each task expires and
gets executed by a worker thread it verifies that it was run at
the correct time.  Once all the delayed tasks have been executed
we double check that all the dispatched tasks were successful.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf d9acd930b5 taskq delay/cancel functionality
Add the ability to dispatch a delayed task to a taskq.  The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached.  To achieve this two
new functions were added.

* taskq_dispatch_delay() -

  This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument.  The caller should pass the
desired time the task should be executed as an absolute value in
jiffies.  The task is guaranteed not to run before this time, but it
may run slightly later if all the worker threads are busy.

* taskq_cancel_id() -

  Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task.  There are three possible
return values.

  0      - The task was found and canceled before it was executed.
  ENOENT - The task was not found, either it was already run or an
           invalid task id was supplied by the caller.
  EBUSY  - The task is currently executing and may not be canceled.
           This function will block until the task has been completed.

* taskq_wait_all() -

  The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior.  It is only currently used by
the splat taskq regression tests.

* taskq_wait_id() -

  Historically, the only difference between this function and
taskq_wait() was that you passed the task id.  In both functions you
would block until ALL lower task ids had executed.  This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.

  To better accommodate the delay tasks this function was reimplemented.
It will now only block until the passed task id has been completed.
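
  The three taskq_cancel_id() return values listed above can be modeled
with a small user-space state table.  This is a sketch only: the
`model_task` structure, the `MT_*` states, and `model_cancel_id()` are
hypothetical, and the model returns EBUSY immediately where the real
function blocks until the running task completes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical task states for the model. */
enum model_state { MT_PENDING, MT_RUNNING, MT_DONE, MT_CANCELED };

struct model_task {
	unsigned long id;
	enum model_state state;
};

/*
 * Returns 0 if the task was still pending and is now canceled, EBUSY
 * if it is currently running, and ENOENT if it already ran, was
 * already canceled, or the id is unknown.
 */
static int
model_cancel_id(struct model_task *tasks, size_t ntasks, unsigned long id)
{
	size_t i;

	assert(tasks != NULL || ntasks == 0);

	for (i = 0; i < ntasks; i++) {
		if (tasks[i].id != id)
			continue;

		switch (tasks[i].state) {
		case MT_PENDING:
			tasks[i].state = MT_CANCELED;
			return (0);
		case MT_RUNNING:
			return (EBUSY);
		default:
			return (ENOENT);
		}
	}

	return (ENOENT);
}
```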

This is actually a fairly low risk change for a few reasons.

* Only new ZFS callers will make use of the new interfaces and
  very little common code was changed to support the new functions.

* The existing taskq_wait() implementation was not changed just
  slightly refactored.

* The newly optimized taskq_wait_id() implementation was never
  used by ZFS, so we can't accidentally introduce a new bug there.

NOTE: This functionality does not exist in the Illumos taskqs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf aed8671cb0 taskq style, remove #define wrappers
When the taskq implementation was originally written I wrapped all
the API functions in #define's.  This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.

However, in practice the taskq symbols never conflicted.  The only
major conflicts occurred with the kmem cache API.  Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf 472a34caff taskq style, convert spaces to soft tabs
Update the taskq implementation to conform with the style used
throughout the rest of the code.  There are no functional
changes in this commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Steven Johnson 794f145bf9 splat linux:shrinker: Fix fail-safe
Ensure the fail-safe is reset between successive tests.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:04:29 -08:00
Steven Johnson ca072ee70f splat linux:shrinker: Fix race condition
Ensure the test thread blocks until the shrinker has completed its
work.  This is done by putting the test thread to sleep and waking
it each time the shrinker callback runs.  Once the shrinker size
drops to zero or we time out the test is allowed to proceed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #96
Closes #125
Closes #182
2012-12-12 09:04:11 -08:00
Brian Behlendorf 576ec6aac4 splat command verbose behavior
The splat command takes a verbose option which when set prints
the internal debug log for every test.  This is helpful when
tracking down a common failure, but for a rare failure the
volume of log data is distracting.

Therefore, the verbose option has been adjusted to allow only
printing the debug log on failure.  The legacy behavior is still
available by specifying the verbose option twice.  For example:

$ splat -t all:all     # Never print the debug log
$ splat -v -t all:all  # Only print debug log on failure
$ splat -vv -t all:all # Always print the debug log
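
The three verbosity levels reduce to a small decision function.  The name `should_print_debug_log()` is hypothetical and is not taken from the splat sources.

```c
#include <assert.h>

/*
 * 'verbose' is the number of times -v was given on the command line.
 *   0  -> never print the debug log
 *   1  -> print it only when the test failed
 *   2+ -> always print it (the legacy behavior)
 * Returns 1 when the log should be printed, 0 otherwise.
 */
static int
should_print_debug_log(int verbose, int test_failed)
{
	assert(verbose >= 0);

	if (verbose >= 2)
		return (1);
	if (verbose == 1)
		return (test_failed != 0);

	return (0);
}
```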

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-11 15:08:19 -08:00
Richard Yao e4d89e9cfc Switch KM_SLEEP to KM_PUSHPAGE
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning.  See commit
b8d06fca08 for additional details.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1138
2012-12-10 09:44:45 -08:00
Brian Behlendorf 53c7411919 Revert "Fix unlink/xattr deadlock"
This reverts commit b00131d43c which
is no longer needed due to e89260a1c8.

This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently.  This effectively prevents
the deadlock condition from ever occurring.

Therefore we can safely revert back to the upstream synchronous
cleanup code.  This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #457
2012-12-05 13:41:30 -08:00
Brian Behlendorf d3aa3ea96e Preserve inode mtime/ctime in .writepage()
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistakenly roll back to the previous on-disk znode state.

Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode.  This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
2012-12-05 13:00:25 -08:00
Steven Johnson 9b88fa165f splat taskq:front: Fix race
The taskq:front test has a race condition where task 4 and 8
race to complete, due to an incorrectly calculated set of delay
"factors" (T). If task 4 wins and actually finishes first, the
verification of the order of completion will fail.

The delays calculated to order task completion do not take into
account the terminal line in the table, and so are all off by
a factor of 1. This causes all the tasks in all queues to finish
sooner than expected and the accumulated error is the root cause
of tasks 4 and 8 racing to complete first. Before the change the
"actual" table looks like I commented in #130.

I changed:

* the table in the comment to correctly reflect the test and the
  factor timings needed.
* the individual task delay factors of T so that ONLY 1 task will
  complete every 2T (on average).
* 1T was reduced from 100ms to 50ms. This halves the duration of
  the test and makes any remaining raciness more likely to cause
  failures, but it did not cause the test to fail.
* simplified the delay factor logic by using a table look-up
  instead of a switch.
* Added a "task started" message so that with -v it is possible
  to see the order tasks are started.
* Moved the "task completed" message inside the spinlock so that
  with -v the message truly reflects the absolute order of
  completion as guaranteed by the spinlock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #130
2012-12-05 12:23:40 -08:00