Commit 5c7a036 correctly relocated the creation of a taskq
and the registraction of the kmem_cache_shrinker after the
initialization of the kmem tracking code. However, the
cleanup of these structures was not done before the leak
checks in spl_kmem_fini(). This resulted in an incorrect
'kmem leaked' warning even though there was no actual leak.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#1569
This code has gotten something stale and no longer builds cleanly
against modern kernels. The two issues addressed here are as
follows:
* The hlist_*_rcu interfaces in the kernel have been relatively
unstable. Since this isn't performance critical code just use
the long standing hlist_* variants.
* In older kernels the hash_ptr() function takes a 'void *' but
in newer kernels it expects a 'const void *'. To silence the
compiler warnings about this explicitly cast it to a 'void *'.
The memset function is a similar case but it always expects
a 'void *'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#256
Re-order initialization in spl_kmem_init to allow for kmem tracing
to work. The spl_kmem_init function calls taskq_create prior to
initializing the tracking (calling spl_kmem_init_tracking). Since
taskq_create uses kmem_alloc, NULL dereferences occur because the
global kmem_list hasn't had its next & prev pointers initialized yet.
This commit moves the calls to spl_kmem_init_tracking earlier in the
spl_kmem_init function in order that the subsequent kmem_alloc calls
(by taskq_create) work properly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#243
Calling cond_resched() after each object is freed and then after each
slab is freed can cause slabs of objects to live for excessive periods
of time following reclaimation. This interferes with the kernel's own
memory management when called from kswapd and can cause direct reclaim
to occur in response to memory pressure that should have been resolved.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
torvalds/linux@b67bfe0d42 changed
hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle
this by switching to hlist_for_each{,_rcu}, which works across all
supported kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior. The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache. On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.
This behavior is now configurable through the 'spl_kmem_cache_expire'
module option. The value is a bit mask with the following meaning.
0x1 - Solaris style cache aging eviction is enabled.
0x2 - Linux style low memory eviction is enabled.
Both methods may be safely enabled simultaneously, but by default
both are disabled. It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good. It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes#210
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
Commit a10287e00d slightly reworked
the slab ageing code such that it is no longer dependent on the
Linux delayed work queue interfaces.
This was good for portability and performance, but it requires us
to use the on_each_cpu() function to execute the spl_magazine_age()
function. That means that the function is now executing in interrupt
context whereas before it was scheduled in normal process context.
And that means we need to be slightly more careful about the locking
in the interrupt handler.
With the reworked code it's possible that we'll be holding the
skc->skc_lock and be interrupted to handle the spl_magazine_age()
IRQ. This will result in a deadlock and soft lockup errors unless
we're careful to detect the contention and avoid taking the lock in
the interupt handler. So that's what this patch does.
Alternately, (and slightly more conventionally) we could have used
spin_lock_irqsave() to prevent this race entirely but I'd perfer to
avoid disabling interrupts as much as possible due to performance
concerns. There is absolutely no penalty for us not aging objects
out of the magazine due to contention.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closeszfsonlinux/zfs#1193
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernels delayed work queue interface
and all the compatibility code it requires.
This code never actually used the delay functionality it was just
done this way to leverage the existing compatibility code. All that
is required is a thread context to perform the allocation in. The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces. This allows us to abandon the kernels
delayed work queue interface and all the compatibility code it
requires.
However, the delayed taskq interface does not allow us to schedule
a task for a specfic cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function. The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu. It then blocks
waiting for them to complete and promptly reclaims any free slabs.
When restructing the code wasn't the primary goal I think the
new code is far more understable and maintainable. It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context. In
practice there are no callers which ever do this. This makes sense
since a cache is by design a long lived data structure.
To prevent abuse of this function going forward I'm removing the
code which is supported to handle an atomic context. All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks. Therefore, we clear __GFP_FS
for these allocations. The two deadlocks are:
* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
calls kmem_cache_alloc() which happens to need to allocate a
new slab. To allocate the new slab we enter FS level reclaim
and attempt to evict several inodes. To evict these inodes we
need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
just happens that obj1 and obj2 use the same hashed lock.
* Similar to the first case however instead of getting blocked
on the hash lock we block in txg_wait_open() which is waiting
for the next txg which isn't coming because the txg_sync
thread is blocked in kmem_cache_alloc().
Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS. However, it practice this is sufficient because
several very unlikely things must all occur concurrently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.
Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock. When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because only virtual slabs may have emergency objects and these
objects are guaranteed to have physical addresses. It can be
easily determined if the passed object is a virtual slab object
or an emergency object. This allows us to completely optimize
the emergency object free case out of the common free path.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the initial implementation emergency objects were tracked on a
per-cache list. The assumption was that under normal operation we
would never allocate more than a handful of these objects. So the
cost of walking the list during free was expected to be negligible.
However real world usage has shown that emergency objects tend to
be allocated in batches. A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.
Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimize contention on the assoicated spin lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks. In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.
With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time. Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects. This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.
The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Provide a flag to disable the use of emergency objects for a
specific kmem cache. There may be instances where under no
circumstances should you kmalloc() an emergency object. For
example, when you cache contains very large objects (>128k).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 2092cf68d8. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit b8b6e4c453. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address. Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs. The issue is that the Linux kernel does
not honor the flags passed to __vmalloc(). This makes it unsafe
to use in a writeback context. Unfortunately, this is a use case
ZFS depends on for correct operation.
Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.
While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code. This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.
This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue. This doesn't prevent the posibility
of the worker thread from deadlocking. However, the caller can now
safely block on a wait queue for the slab allocation to complete.
Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,. The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.
However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out. When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.
As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
These emergency allocations will likely be slow since they require
contiguous pages. However, their use should be rare so the impact
is expected to be minimal. If that turns out not to be the case in
practice further optimizations are possible.
One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed. Is they accumulate on a system and the list
grows freeing objects will become more expensive. This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.
Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented. See issue https://github.com/zfsonlinux/spl/issues/26
for full details. This work would also help reduce ZFS's memory
fragmentation problems.
The /proc/spl/kmem/slab file has had two new columns added at the
end. The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case. These value should give us a good idea
of how often these objects are needed. Based on these values under
real use cases we can tune the default behavior.
Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock. This should manifest itself as reduced cpu usage for
the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.
The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to workaround a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.
A kernel patch is available in the Gentoo bug tracker for
this issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#126
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#131
To minimize the chance of triggering an OOM during direct reclaim.
The kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered. This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.
Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress. Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made. Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations. However, since we are using none
of that infrastructure we're responsible for incrementing this
count.
If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked. If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
When memory pressure triggers direct memory reclaim, a slabs age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.
This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage. In addition,
it should allow for snappier reclaim under memory pressure.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#102
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.
Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#101
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s. This patch clarifies things
by cleanly breaking the two subsystem apart. The result of this
is the following behavior.
--enable-debug - Enable/disable code wrapped in ASSERT()s.
--disable-debug ASSERT()s are used to check invariants and
are never required for correct operation.
They are disabled by default because they
may impact performance.
--enable-debug-log - Enable/disable the debug log infrastructure.
--disable-debug-log This infrastructure allows the spl code and
its consumer to log messages to an in-kernel
log. The granularity of the logging can be
controlled by a debug mask. By default the
mask disables most debug messages resulting
in a negligible performance impact. Because
of this the debug log is enabled by default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check(). In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers. Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#58
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed. This same task is now accomplished
more cleanly with per super block shrinkers. This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.
This support has always been entirely optional. So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.
The downside of this change is that it will prevent the zfs arc
meta data limts from being enforced. However, the current zfs
implementation in this regard is already flawed and needs to
be reworked. If the arc needs to enfore a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure. It may have already been set when entering
kv_alloc() in which case it must remain set on exit. In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim. By clearing
it we allow the following NULL deref to potentially occur.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #287
The typo did not have any effect (apart from a negligible performance
impact) because skc->skc_flags * KMC_OFFSLAB is always non-null when
at least one bit in skc->skc_flags is set.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In a non-debug build the ASSERT() would be optimized away
which could cause pending work items to not be cancelled.
We must also use cancel_delayed_work_sync() rather than just
cancel_delayed_work() to actually wait until work items have
completed. Otherwise they might accidentally access free'd
memory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS bugs #279, #62, #363, #418
Update the the wrapper macros for the memory shrinker to handle
this 4th API change. The callback function now takes a
shrink_control structure. This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
To resolve a potiential filesystem corruption issue a second
argument was added to invalidate_inodes(). This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block. When only the legacy API is available the second
argument will be dropped for compatibility.
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache. After the entries are
pruned any slabs which they may have been using are reaped.
Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count. However, in practice this doesn't need to be
exactly correct. We simply need to reclaim some useful fraction
(but not all) of the cache. The caller can determine if more
needs to be done.
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header. The caller then can then use a single spl provided
shrinker API which does the right thing for your kernel.
SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
As part of vmalloc() a __pte_alloc_kernel() allocation may occur. This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur. This reclaim can trigger file IO which
can result in a deadlock. This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.
An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3). This code can be properly autoconf'ed
when the upstream bug is fixed.
1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules. This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.
Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address. The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn. Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address. All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
Long term we should try to get this, or another, interface made
available to modules again.
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.
A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.