This reverts commit a430c11f0b. Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame. This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.
This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#500
On Linux the meaning of a processes priority is inverted with respect
to illumos. High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
Note this change also reverts some of the priorities changes in prior
commit 62aa81a. The rational is as follows:
spl_kmem_cache - High priority may result in blocked memory allocs
spl_system_taskq - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority. Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority. This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.
All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri. The vast majority of callers were
part of the test suite which won't have an external impact. The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri. This makes it more likely the process will be scheduled
which may help performance.
To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added. When disabled (0) all newly created kernel
threads will use the default kernel thread priority. When enabled (1)
the specified taskq priority will be used. By default this value is
enabled (1).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics. Initially only a single worker thread will be created
to service tasks dispatched to the queue. As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'. When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.
Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.
* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned. Default 4.
* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system. This is provided
for the purposes of debugging and troubleshooting. Default 1
(enabled).
This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#458
Under Illumos taskq_wait() returns when there are no more tasks
in the queue. This behavior differs from ZoL and FreeBSD where
taskq_wait() returns when all the tasks in the queue at the
beginning of the taskq_wait() call are complete. New tasks
added whilst taskq_wait() is running will be ignored.
This difference in semantics makes it possible that new subtle
issues could be introduced when porting changes from Illumos.
To avoid that possibility the taskq_wait() function is being
updated such that it blocks until the queue in empty.
The previous behavior remains available through the
taskq_wait_outstanding() interface. Note that this function
was previously called taskq_wait_all() but has been renamed
to avoid confusion.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#455
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the ability to dispatch a delayed task to a taskq. The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached. To achieve this two
new functions were added.
* taskq_dispatch_delay() -
This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument. The caller should pass the
desired time the task should be executed as an absolute value in
jiffies. The task is guarenteed not to run before this time, it
may run slightly latter if all the worker threads are busy.
* taskq_cancel_id() -
Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task. There are three possible
return values.
0 - The task was found and canceled before it was executed.
ENOENT - The task was not found, either it was already run or an
invalid task id was supplied by the caller.
EBUSY - The task is currently executing any may not be canceled.
This function will block until the task has been completed.
* taskq_wait_all() -
The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior. It is only curreny used by
the splat taskq regression tests.
* taskq_wait_id() -
Historically, the only difference between this function and
taskq_wait() was that you passed the task id. In both functions you
would block until ALL lower task ids which executed. This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.
To better accomidate the delay tasks this function was reimplemnted.
It will now only block until the passed task id has been completed.
This is actually a fairly low risk change for a few reasons.
* Only new ZFS callers will make use of the new interfaces and
very little common code was changed to support the new functions.
* The existing taskq_wait() implementation was not changed just
slightly refactored.
* The newly optimized taskq_wait_id() implementation was never
used by ZFS we can't accidentally introduce a new bug there.
NOTE: This functionality does not exist in the Illumos taskqs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the taskq implementation was originally written I wrapped all
the API functions in #define's. This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.
However, in practice the taskq symbols never conflicted. The only
major conflicts occured with the kmem cache API. Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the taskq implementation to conform with the style used
throughout the rest of the code. There are no functional
changes in this commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.
Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts. Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system. To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs. This will prevent them from attempting
to initiate any I/O during direct reclaim.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 372c257233. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.
In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).
In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.
Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#71
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.
Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.
Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).
To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit. It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq. This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.
commit 5aeb94743e3be0c51e86f73096334611ae3a058e
Author: Garrett D'Amore <garrett@nexenta.com>
Date: Wed Jul 27 07:13:44 2011 -0700
734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.
The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs. To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.
Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored. This is harmless for
the moment but it does need to be implemented at some point.
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes. This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris. Minor updates were
required to all the call sites where it was included of course.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
used to scale the number of threads based on the number of online
CPUs. As CPUs are added/removed we should rescale the thread
count appropriately, but currently this is only done at create.
I'm very surprised this has not surfaced until now. But the taskq_wait()
implementation work only wait successfully the first time it was called.
Subsequent usage of taskq_wait() on the taskq would not wait.
The issue was caused by tq->tq_lowest_id being set to MAX_INT after the
first wait completed. This caused subsequent waits which check that the
waiting id is less than the lowest taskq id to always succeed. The fix
is to ensure that tq->tq_lowest_id is never set larger than tq->tq_next.id.
Additional fixes which were added to this patch include:
1) Fix a race by placing the taskq_wait_check() in the tq->tq_lock spinlock.
2) taskq_wait() should wait for the largest outstanding id.
3) Multiple spelling corrections.
4) Added taskq wait regression test to validate correct behavior.
* spl-04-fix-taskq-spinlock-lockup.patch
Fixes a deadlock in the BIO completion handler, due to the taskq code
prematurely re-enabling interrupts when another spinlock had disabled
them in the IDE IRQ handler.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@161 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
configurable number of threads like the Solaris version and almost
all of the options are supported. Unfortunately, it appears to have
made absolutely no difference to our performance numbers. I need
to keep looking for where we are bottle necking.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@93 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
muck with #includes in existing Solaris style source to get it
to find the right stuff.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@18 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c