The current code contains a race condition that triggers when bit 2 in
spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is invoked
and another thread is concurrently accessing its magazine.
spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on each
magazine in the same thread when bit 2 in spl.spl_kmem_cache_expire is
set. This is unsafe because there is one magazine per CPU and the
magazines are lockless, so it is impossible to guarantee that another
CPU is not using its magazine when this function is called.
The solution is to only touch the local CPU's magazine and leave the
other CPUs' magazines to their respective CPUs.
Reported-by: DHE
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #274
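The local-only reap described above can be illustrated with a small
user-space sketch. The types and names here (struct magazine,
cache_reap_local(), NCPUS) are hypothetical stand-ins and not the SPL
code; how the caller learns its own CPU number is outside the sketch.

```c
#include <stddef.h>

#define NCPUS 4	/* hypothetical CPU count for the sketch */

/* Stand-in for a per-CPU magazine: a count of cached objects. */
struct magazine {
	size_t avail;
};

struct cache {
	struct magazine mag[NCPUS];	/* one lockless magazine per CPU */
};

/* Flush a magazine and return how many objects were released. */
static size_t magazine_flush(struct magazine *m)
{
	size_t n = m->avail;

	m->avail = 0;
	return n;
}

/*
 * Reap only the magazine owned by the calling CPU.  The other
 * magazines are deliberately left alone because their owners may be
 * using them without any lock held.
 */
static size_t cache_reap_local(struct cache *c, int this_cpu)
{
	return magazine_flush(&c->mag[this_cpu]);
}
```

Calling cache_reap_local(&c, 1) empties magazine 1 and leaves the
other three untouched.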
num_physpages was removed by
torvalds/linux@cfa11e08ed, so let's replace
it with totalram_pages.
This is a bug fix as much as it is a compatibility fix because
num_physpages did not reflect the number of pages actually available to
the kernel:
http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html
Also, there are known issues with memory calculations when ZFS is in a
Xen dom0. There is a chance that using totalram_pages could resolve
them. This conjecture is untested at the time of writing.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #273
Because spl_slab_size() was always returning -ENOSPC for caches of
type KMC_OFFSLAB the cache could never be created. Additionally
the slab size is rounded up to a page which is what kv_alloc()
expects. The kv_alloc() code will minimally allocate a page;
in the KMC_OFFSLAB case this could be reduced.
The basic regression tests kmem:slab_small, kmem:slab_large,
and kmem:slab_align were updated to test KMC_OFFSLAB.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Closes #266
It has been observed that it's possible to get in a state where
shrink_slabs() will spin, repeatedly invoking the generic kmem cache
shrinker. It fails to detect it's not making forward progress
reclaiming from the cache and doesn't give up. To ensure this
never occurs we unconditionally return -1 after reclaiming what
we can.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes zfsonlinux/zfs#1276
Closes zfsonlinux/zfs#1598
Closes zfsonlinux/zfs#1432
This allows us to get nanosecond resolution. It also means
we use the same time source as utimensat(now) etc.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #255
Commit 5c7a036 correctly relocated the creation of a taskq
and the registration of the kmem_cache_shrinker after the
initialization of the kmem tracking code. However, the
cleanup of these structures was not done before the leak
checks in spl_kmem_fini(). This resulted in an incorrect
'kmem leaked' warning even though there was no actual leak.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1569
This code has gotten somewhat stale and no longer builds cleanly
against modern kernels. The two issues addressed here are as
follows:
* The hlist_*_rcu interfaces in the kernel have been relatively
unstable. Since this isn't performance critical code just use
the long standing hlist_* variants.
* In older kernels the hash_ptr() function takes a 'void *' but
in newer kernels it expects a 'const void *'. To silence the
compiler warnings about this explicitly cast it to a 'void *'.
The memset function is a similar case but it always expects
a 'void *'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #256
Linux kernel commit torvalds/linux@59d8053f moved the definition of
struct proc_dir_entry from include/linux/proc_fs.h to the private
header fs/proc/internal.h. The SPL relied on that to map Solaris'
kstat to entries in /proc/spl/kstat.
Since the proc_dir_entry structure is now private the only safe
thing to do is wrap the opaque proc handle with our own structure.
This actually ends up simplifying the code and is good because it
moves us away from depending on implementation details of /proc.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Linux kernel commit torvalds/linux@d9dda78b renamed PDE() to
PDE_DATA(). To handle this, detect the preferred interface
and define a PDE_DATA() wrapper for consistency.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Re-order initialization in spl_kmem_init to allow for kmem tracing
to work. The spl_kmem_init function calls taskq_create prior to
initializing the tracking (calling spl_kmem_init_tracking). Since
taskq_create uses kmem_alloc, NULL dereferences occur because the
global kmem_list hasn't had its next & prev pointers initialized yet.
This commit moves the calls to spl_kmem_init_tracking earlier in the
spl_kmem_init function in order that the subsequent kmem_alloc calls
(by taskq_create) work properly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #243
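The ordering problem above comes down to using a list head before its
next and prev pointers point anywhere. A minimal user-space sketch,
with hypothetical names (tracking_init(), tracking_add()) standing in
for the tracking setup and the first tracked allocation:

```c
#include <stddef.h>

/* Minimal circular doubly linked list in the style of list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Zero-initialized at load time: next == prev == NULL. */
static struct list_node tracking_list;
static int tracking_ready;

static void tracking_init(void)
{
	tracking_list.next = tracking_list.prev = &tracking_list;
	tracking_ready = 1;
}

/* Returns 0 on success, -1 if called before tracking_init(). */
static int tracking_add(struct list_node *n)
{
	if (!tracking_ready)
		return -1;	/* unguarded code would follow a NULL next */
	n->next = tracking_list.next;
	n->prev = &tracking_list;
	tracking_list.next->prev = n;
	tracking_list.next = n;
	return 0;
}
```

The sketch guards the early call so it can be demonstrated safely;
the fix described above instead moves the initialization earlier so
the early call never happens.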
The existing taskq_wait_id() function can incorrectly block
indefinitely. Reimplement it more simply using wait_event()
in a similar fashion to taskq_wait_all().
This flaw was uncovered in the context of moving vn_rdwr() to
a taskq. Previously taskq_wait_id() had no consumers outside
the SPLAT task framework which is why the issue went unnoticed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Calling cond_resched() after each object is freed and then after each
slab is freed can cause slabs of objects to live for excessive periods
of time following reclamation. This interferes with the kernel's own
memory management when called from kswapd and can cause direct reclaim
to occur in response to memory pressure that should have been resolved.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
torvalds/linux@b67bfe0d42 changed
hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle
this by switching to hlist_for_each{,_rcu}, which works across all
supported kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
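The node-only iteration pattern mentioned above can be sketched in
user space as follows. The hlist types here are simplified stand-ins
(the kernel's hlist_node also carries a pprev pointer), and struct
item, item_add_head() and item_find() are hypothetical.

```c
#include <stddef.h>

struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

/* Recover the containing structure from a pointer to its member. */
#define hlist_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Node-only iteration: the caller resolves the entry itself. */
#define hlist_for_each(pos, head) \
	for ((pos) = (head)->first; (pos) != NULL; (pos) = (pos)->next)

struct item {
	int key;
	struct hlist_node node;
};

static void item_add_head(struct hlist_head *h, struct item *it)
{
	it->node.next = h->first;
	h->first = &it->node;
}

static struct item *item_find(struct hlist_head *h, int key)
{
	struct hlist_node *pos;

	hlist_for_each(pos, h) {
		struct item *it = hlist_entry(pos, struct item, node);

		if (it->key == key)
			return it;
	}
	return NULL;
}
```

Because the loop macro only ever sees the node pointer, its argument
list does not depend on how the entry-based variants are declared.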
This was a suggestion that Brian Behlendorf made when reviewing an early
pull request for Linux 3.9 support. This commit was made intentionally
easy to revert should we ever have a reason to reintroduce support for
older kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
torvalds/linux@dcf787f391 enforces
const-correctness in passing struct path *.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function prototype of vfs_getattr previously took struct vfsmount *
and struct dentry * as arguments. These would always be defined together
in a struct path *.
torvalds/linux@3dadecce20 modified
vfs_getattr to take struct path * as an argument instead.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
torvalds/linux@182be68478 removed the
preprocessor definition for f_vfsmnt. The ability to access the
mountpoint via ->f_path.mnt has been stable for a long time, so we
switch to that.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Long ago infrastructure was added to the SPL to keep an internal
debug log of the last few seconds of activity. This was helpful
during the early development, but these days it is no longer
needed. I haven't had to resort to this debug buffer to resolve
an issue for several years now.
Today better, more generic tools like systemtap and ftrace have
evolved to the point where they can be used for this purpose.
Along with the stack trace dumped to the system console, and in
rare cases a crash dump we almost always have the debug we need.
Therefore, I'm disabling the code which automatically dumps
this log to disk during an assertion except for the case where
spl_debug_panic_on_bug is set (disabled by default).
This should be viewed as a first step towards either:
a) Retiring this infrastructure and complexity entirely, or
b) Integrating this logging more properly with ftrace.
As part of this change I'm also removing from the packages the
undocumented spl utility which is used to decode the binary logs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The tsd_exit() and tsd_destroy() functions remove entries from
hash bins without taking the hash bin lock. They do take the
table lock, but tsd_get() and tsd_set() only take the hash bin
lock to allow for maximum concurrency.
The result is that while tsd_get() and tsd_set() are traversing
the hash bin list it can be modified by another thread which
happens to hash to the same value. To avoid this add the needed
locking to tsd_exit() and tsd_destroy().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior. The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache. On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.
This behavior is now configurable through the 'spl_kmem_cache_expire'
module option. The value is a bit mask with the following meaning.
0x1 - Solaris style cache aging eviction is enabled.
0x2 - Linux style low memory eviction is enabled.
Both methods may be safely enabled simultaneously, but by default
both are disabled. It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good. It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes #210
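As an illustration of the bit mask described above, the two bits can
be tested independently. The macro and function names below are
hypothetical; only the values 0x1 and 0x2 come from the text.

```c
/* Hypothetical names; the values are the ones documented above. */
#define EXPIRE_AGE	0x1u	/* Solaris style cache aging eviction */
#define EXPIRE_MEM	0x2u	/* Linux style low memory eviction */

/* Nonzero when Solaris style aging is requested. */
static int expire_age_enabled(unsigned int expire)
{
	return (expire & EXPIRE_AGE) != 0;
}

/* Nonzero when Linux style low memory eviction is requested. */
static int expire_mem_enabled(unsigned int expire)
{
	return (expire & EXPIRE_MEM) != 0;
}
```

A value of 0x3 therefore enables both methods and 0 enables neither.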
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
Commit a10287e00d slightly reworked
the slab ageing code such that it is no longer dependent on the
Linux delayed work queue interfaces.
This was good for portability and performance, but it requires us
to use the on_each_cpu() function to execute the spl_magazine_age()
function. That means that the function is now executing in interrupt
context whereas before it was scheduled in normal process context.
And that means we need to be slightly more careful about the locking
in the interrupt handler.
With the reworked code it's possible that we'll be holding the
skc->skc_lock and be interrupted to handle the spl_magazine_age()
IRQ. This will result in a deadlock and soft lockup errors unless
we're careful to detect the contention and avoid taking the lock in
the interrupt handler. So that's what this patch does.
Alternately, (and slightly more conventionally) we could have used
spin_lock_irqsave() to prevent this race entirely but I'd prefer to
avoid disabling interrupts as much as possible due to performance
concerns. There is absolutely no penalty for us not aging objects
out of the magazine due to contention.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes zfsonlinux/zfs#1193
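The "detect contention and skip" idea above can be sketched with a
try-lock. This is a user-space illustration only: the toy lock is
built on a C11 atomic flag, and every name in it (toy_lock,
toy_magazine_age(), and so on) is hypothetical. Whether the patch
itself uses a try-lock or some other contention check is not stated
above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin lock on a C11 atomic flag; stands in for skc->skc_lock. */
struct toy_lock {
	atomic_flag held;
};

static bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value: false means we won. */
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

struct toy_cache {
	struct toy_lock lock;
	int aged;	/* number of aging passes that actually ran */
};

/*
 * Aging pass as it might run from an interrupt handler: never wait
 * for the lock.  If the lock is already held the pass is skipped,
 * which costs nothing because aging is only an optimization.
 */
static bool toy_magazine_age(struct toy_cache *c)
{
	if (!toy_trylock(&c->lock))
		return false;
	c->aged++;
	toy_unlock(&c->lock);
	return true;
}
```

If the interrupted context already holds the lock, the pass returns
immediately instead of spinning on a lock that cannot be released
until the handler itself returns.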
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process). A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with
this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
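To make the hazard above concrete, the sketch below models the two
numberings side by side. Only the two meanings of the value 1 are
taken from the text; the remaining values are recalled from the
kernel headers and should be verified. The OLD_/NEW_ prefixes and the
helper names are hypothetical.

```c
/* Numbering recalled for kernels before Linux 3.4 (to be verified). */
enum { OLD_UMH_NO_WAIT = -1, OLD_UMH_WAIT_EXEC = 0, OLD_UMH_WAIT_PROC = 1 };

/* Numbering recalled for Linux 3.4 and later (to be verified). */
enum { NEW_UMH_NO_WAIT = 0, NEW_UMH_WAIT_EXEC = 1, NEW_UMH_WAIT_PROC = 2 };

/* Does this wait value mean "wait for the process to complete"? */
static int waits_for_exit_old(int wait)
{
	return wait == OLD_UMH_WAIT_PROC;
}

static int waits_for_exit_new(int wait)
{
	return wait == NEW_UMH_WAIT_PROC;
}
```

A call site that passes the literal 1 waits for process completion
under the old numbering but only for the exec under the new one,
while a call site that passes the constant name is correct on both.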
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure. Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().
Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.
To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting. The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#206
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute. To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#699
Closes zfsonlinux/zfs#859
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernel's delayed work queue interface
and all the compatibility code it requires.
This code never actually used the delay functionality; it was just
done this way to leverage the existing compatibility code. All that
is required is a thread context to perform the allocation in. The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces. This allows us to abandon the kernel's
delayed work queue interface and all the compatibility code it
requires.
However, the delayed taskq interface does not allow us to schedule
a task for a specific cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function. The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu. It then blocks
waiting for them to complete and promptly reclaims any free slabs.
While restructuring the code wasn't the primary goal I think the
new code is far more understandable and maintainable. It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context. In
practice there are no callers which ever do this. This makes sense
since a cache is by design a long lived data structure.
To prevent abuse of this function going forward I'm removing the
code which is supposed to handle an atomic context. All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the ability to dispatch a delayed task to a taskq. The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached. To achieve this two
new functions were added.
* taskq_dispatch_delay() -
This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument. The caller should pass the
desired time the task should be executed as an absolute value in
jiffies. The task is guaranteed not to run before this time, but it
may run slightly later if all the worker threads are busy.
* taskq_cancel_id() -
Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task. There are three possible
return values.
0 - The task was found and canceled before it was executed.
ENOENT - The task was not found, either it was already run or an
invalid task id was supplied by the caller.
EBUSY - The task is currently executing and may not be canceled.
This function will block until the task has been completed.
* taskq_wait_all() -
The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior. It is only currently used by
the splat taskq regression tests.
* taskq_wait_id() -
Historically, the only difference between this function and
taskq_wait() was that you passed the task id. In both functions you
would block until ALL lower task ids had executed. This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.
To better accommodate the delay tasks this function was reimplemented.
It will now only block until the passed task id has been completed.
This is actually a fairly low risk change for a few reasons.
* Only new ZFS callers will make use of the new interfaces and
very little common code was changed to support the new functions.
* The existing taskq_wait() implementation was not changed just
slightly refactored.
* The newly optimized taskq_wait_id() implementation was never
used by ZFS, so we can't accidentally introduce a new bug there.
NOTE: This functionality does not exist in the Illumos taskqs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
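The three return values documented above for taskq_cancel_id() can
be modeled with a small state table. This is a single-threaded sketch
with hypothetical names (toy_task, toy_cancel_id()); it does not
model the blocking that the real function performs in the EBUSY case.

```c
#include <errno.h>
#include <stddef.h>

enum toy_state { TOY_PENDING, TOY_RUNNING, TOY_DONE };

struct toy_task {
	unsigned long id;
	enum toy_state state;
};

/*
 * Return 0 if the task was pending and is now canceled, ENOENT if
 * the id is unknown or the task already ran, and EBUSY if the task
 * is executing right now.
 */
static int toy_cancel_id(struct toy_task *tasks, size_t n, unsigned long id)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].id != id)
			continue;
		switch (tasks[i].state) {
		case TOY_PENDING:
			tasks[i].state = TOY_DONE;	/* it will never run */
			return 0;
		case TOY_RUNNING:
			return EBUSY;
		case TOY_DONE:
			return ENOENT;
		}
	}
	return ENOENT;
}
```

Canceling the same pending task twice yields 0 and then ENOENT,
since after the first call the id no longer refers to a queued task.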
When the taskq implementation was originally written I wrapped all
the API functions in #define's. This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.
However, in practice the taskq symbols never conflicted. The only
major conflicts occurred with the kmem cache API. Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the taskq implementation to conform with the style used
throughout the rest of the code. There are no functional
changes in this commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added
by commit bcb1589 an entirely new vn_remove() implementation was
added. That function did not properly handle an error from
spl_kern_path_locked() which would result in a panic. This
patch addresses the issue by returning the error to the caller.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #187
Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks. Therefore, we clear __GFP_FS
for these allocations. The two deadlocks are:
* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
calls kmem_cache_alloc() which happens to need to allocate a
new slab. To allocate the new slab we enter FS level reclaim
and attempt to evict several inodes. To evict these inodes we
need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
just happens that obj1 and obj2 use the same hashed lock.
* Similar to the first case however instead of getting blocked
on the hash lock we block in txg_wait_open() which is waiting
for the next txg which isn't coming because the txg_sync
thread is blocked in kmem_cache_alloc().
Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS. However, in practice this is sufficient because
several very unlikely things must all occur concurrently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.
Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock. When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because only virtual slabs may have emergency objects and these
objects are guaranteed to have physical addresses, it can be
easily determined if the passed object is a virtual slab object
or an emergency object. This allows us to completely optimize
the emergency object free case out of the common free path.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the initial implementation emergency objects were tracked on a
per-cache list. The assumption was that under normal operation we
would never allocate more than a handful of these objects. So the
cost of walking the list during free was expected to be negligible.
However real world usage has shown that emergency objects tend to
be allocated in batches. A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.
Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimizes contention on the associated spin lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
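The address-ordered lookup described above can be tried out in user
space with the POSIX tsearch() family, which keeps keys in a binary
search tree. glibc happens to implement it as a red-black tree, but
POSIX does not require that, and this is not the tree code used by
the commit. The emergency_* names are hypothetical.

```c
#include <search.h>
#include <stddef.h>
#include <stdint.h>

/* Order two objects by their memory address. */
static int cmp_addr(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)a, y = (uintptr_t)b;

	return (x > y) - (x < y);
}

static void *emergency_root;	/* tree root; NULL means empty */

/* Track an object; returns 0 on success, -1 if out of memory. */
static int emergency_insert(void *obj)
{
	return tsearch(obj, &emergency_root, cmp_addr) ? 0 : -1;
}

/* Nonzero if the address is currently tracked. */
static int emergency_contains(const void *obj)
{
	return tfind(obj, &emergency_root, cmp_addr) != NULL;
}

/* Stop tracking an object; returns -1 if it was not tracked. */
static int emergency_remove(const void *obj)
{
	if (!emergency_contains(obj))
		return -1;
	tdelete(obj, &emergency_root, cmp_addr);
	return 0;
}
```

With a balanced tree each of these operations visits O(log n) nodes,
compared with the O(n) walk a list needs on every free.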
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks. In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.
With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time. Once a cache is
marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects. This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.
The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them later in kmem_cache_free().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().
This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.
In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to track the following
statistics per-txg.
txg - Unique txg number
state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
birth - Creation time
nread - Bytes read
nwritten - Bytes written
reads - IOPs read
writes - IOPs write
open_time - Length in nanoseconds the txg was open
quiesce_time - Length in nanoseconds the txg was quiescing
sync_time - Length in nanoseconds the txg was syncing
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move the kstat ks_update() callback under the ks_lock. This
enables dynamically sized kstats without modification to the
kstat API.
* Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
* Register a ->ks_update() callback which does:
o Frees any existing ks_data buffer.
o Set ks_data_size to the kstat array size.
o Set ks_data to an allocated buffer of size ks_data_size
o Populate the array of buffers with the required data.
The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer. The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access. To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
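The ks_update() steps listed above can be sketched in user space as
follows. The field names ks_data, ks_data_size and ks_update come
from the text; struct toy_kstat, the nsource field and the record
contents are hypothetical, and the ks_lock that the real callback
runs under is not modeled.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_kstat {
	void *ks_data;
	size_t ks_data_size;
	int (*ks_update)(struct toy_kstat *);
	size_t nsource;	/* hypothetical: records currently available */
};

/* Rebuild the data buffer so it matches the current record count. */
static int toy_update(struct toy_kstat *ks)
{
	long *rec;

	free(ks->ks_data);		/* free any existing buffer */
	ks->ks_data = NULL;
	ks->ks_data_size = ks->nsource * sizeof(long);	/* new size */
	if (ks->ks_data_size == 0)
		return 0;
	ks->ks_data = malloc(ks->ks_data_size);	/* allocate */
	if (ks->ks_data == NULL) {
		ks->ks_data_size = 0;
		return -1;
	}
	rec = ks->ks_data;		/* populate the records */
	for (size_t i = 0; i < ks->nsource; i++)
		rec[i] = (long)i * 10;
	return 0;
}
```

Each call resizes the buffer to whatever the source currently holds,
which is what lets a reader see a differently sized kstat on every
pass without any change to the calling interface.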
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.
This is good news for the vn implementation because it removes the
need for us to handle the locking. However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.
Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels. This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.
Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen until we either abandon the zpool.cache file
or implement alternate infrastructure to update it correctly in
user space.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.
This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.
This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.
On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().
This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #168
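One way to meet the shmem_truncate_range() constraints described
above is to round the exclusive end down to a page boundary and then
subtract one. The sketch below shows only that arithmetic; the exact
adjustment made by the patch is not spelled out here, and the names
toy_inclusive_end() and TOY_PAGE_SIZE are hypothetical. Offsets and
lengths are assumed to be non-negative.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size for the sketch */

/*
 * Convert the range [offset, offset + len) into an inclusive end
 * offset that falls on the last byte of a page.  Returns false when
 * rounding leaves nothing past offset, i.e. there is nothing to punch.
 */
static bool toy_inclusive_end(int64_t offset, int64_t len, int64_t *end_out)
{
	int64_t end = offset + len;	/* exclusive end */

	end -= end % TOY_PAGE_SIZE;	/* round down to a page boundary */
	if (end <= offset)
		return false;
	*end_out = end - 1;		/* make it inclusive */
	return true;
}
```

For offset 0 and length 10000 this produces an inclusive end of 8191,
the last byte of the second 4096-byte page.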
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system. To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs. This will prevent them from attempting
to initiate any I/O during direct reclaim.
This change was originally part of cd5ca4b but was reverted by
330fe01. It always should have had its own commit for exactly
this reason.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this
assumed that the TQ_* flags would never conflict with any of the
Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.
Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts. Instead a simple mapping
function is introduced to convert TQ_* -> KM_* where needed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
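A mapping function of the kind described above might look like the
following sketch. The flag values and the TOY_ prefixed names are
hypothetical and chosen only so that the two flag sets share no bits;
the real definitions are not shown here.

```c
/* Hypothetical flag values; the two sets intentionally do not overlap. */
#define TOY_TQ_SLEEP	0x01u
#define TOY_TQ_NOSLEEP	0x02u
#define TOY_TQ_PUSHPAGE	0x04u

#define TOY_KM_SLEEP	0x100u
#define TOY_KM_NOSLEEP	0x200u
#define TOY_KM_PUSHPAGE	0x400u

/* Translate taskq flags into the matching allocation flags. */
static unsigned int toy_tq_to_km(unsigned int tq_flags)
{
	if (tq_flags & TOY_TQ_NOSLEEP)
		return TOY_KM_NOSLEEP;
	if (tq_flags & TOY_TQ_PUSHPAGE)
		return TOY_KM_PUSHPAGE;
	return TOY_KM_SLEEP;	/* default when no other mode is set */
}
```

With an explicit translation step the numeric values of the two flag
families can change independently without one being mistaken for the
other.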
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast(). We had some troubles with this previously but
there may still be a small race, see commit d599e4f.
The following patch closes one small race and improves the ASSERTs
such that they log the offending value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
The workspace required by zlib to perform compression is roughly
512KB (order-7). These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.
It is far preferable to asynchronously vmalloc an additional slab
in case it's needed. Then simply block waiting for an existing
object to be released or for the new slab to be allocated.
This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
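As a quick check of the order-7 figure, an allocation of order n
spans 2^n contiguous pages. The helper below is a hypothetical
illustration and the 4096-byte page size is an assumption.

```c
#include <stddef.h>

/* Size in bytes of an allocation of the given page order. */
static size_t order_to_bytes(unsigned int order, size_t page_size)
{
	return page_size << order;
}
```

With 4096-byte pages, order_to_bytes(7, 4096) is 128 pages, or
524288 bytes.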