The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks. In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.
With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time. Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects. This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.
The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Restructure the the SPLAT headers such that each test only
includes the minimal set of headers it requires.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Provide a flag to disable the use of emergency objects for a
specific kmem cache. There may be instances where under no
circumstances should you kmalloc() an emergency object. For
example, when you cache contains very large objects (>128k).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
PF_NOFS is a per-process debug flag which is set in current->flags to
detect when a process is performing an unsafe allocation. All tasks
with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because
if they enter direct reclaim and initiate I/O they may deadlock.
When debugging is disabled, any incorrect usage will be detected and
a call stack with a warning will be printed to the console. The flags
will then be automatically corrected to allow for safe execution. If
debugging is enabled this will be treated as a fatal condition.
To avoid any risk of conflicting with the existing PF_ flags. The
PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit. Only when
CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused,
will the PF_NOFS bit be valid. Happily, most existing distributions
ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs. The issue is that the Linux kernel does
not honor the flags passed to __vmalloc(). This makes it unsafe
to use in a writeback context. Unfortunately, this is a use case
ZFS depends on for correct operation.
Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.
While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code. This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.
This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue. This doesn't prevent the posibility
of the worker thread from deadlocking. However, the caller can now
safely block on a wait queue for the slab allocation to complete.
Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,. The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.
However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out. When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.
As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
These emergency allocations will likely be slow since they require
contiguous pages. However, their use should be rare so the impact
is expected to be minimal. If that turns out not to be the case in
practice further optimizations are possible.
One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed. Is they accumulate on a system and the list
grows freeing objects will become more expensive. This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.
Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented. See issue https://github.com/zfsonlinux/spl/issues/26
for full details. This work would also help reduce ZFS's memory
fragmentation problems.
The /proc/spl/kmem/slab file has had two new columns added at the
end. The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case. These value should give us a good idea
of how often these objects are needed. Based on these values under
real use cases we can tune the default behavior.
Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock. This should manifest itself as reduced cpu usage for
the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.
The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#131
The resolution of issue #31 made KM_PUSHPAGE imply GFP_NOFS. This
was done to prevent situations where filesystem operations which are
holding locks enter direct reclaim and attempt to reaquire those
same locks. This clearly will result in a deadlock.
This works for datasets which are implemented in terms for filesystem
operations. But unfortunately, swapping to a zvol will encounter
many of the same deadlocks and GFP_NOFS will not prevent this. As
such, it is appropriate to extend KM_PUSHPAGE to use the broader
GFP_NOIO mask to handle these non-filesystem cases.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#342Closes#105
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.
Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#101
By decreasing the number of target objects per slab we increase
the likelyhood that a slab can be freed. This reduces the level
of fragmentation in the slab which has been observed to be a
problem for certain workloads. The penalty for this is that we
also decrease the speed which need objects can be allocated.
One of the most common things you want to know when looking at
the slab is how much memory is being used. This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.
slab_kmem_total - Total kmem slab size
slab_kmem_avail - Alloc'd kmem slab size
slab_kmem_max - Max observed kmem slab size
slab_vmem_total - Total vmem slab size
slab_vmem_avail - Alloc'd vmem slab size
slab_vmem_max - Max observed vmem slab size
NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
As originally described in commit 82b8c8fa64
this was done to prevent certain deadlocks from occuring in the system.
However, as suspected the price for doing this proved to be too high.
The VM is having a hard time effectively reclaiming memory thus we are
reverting this change.
However, we still need to fundamentally handle the issue. Under
Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure
a memory allocations will succeed. We leverage this fact and redefine
KM_PUSHPAGE to include GFP_NOFS. This ensures that in these common
I/O path we don't trigger additional reclaim. This minimizes the
change to the Solaris code.
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory. The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation. Thus we end
up deadlocking.
A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS. This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().
The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP. This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced. The caller still should be able to reclaim memory from
the various slab caches. We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages. This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.
We shall see.
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
It turns out Solaris incidentally includes kstat.h from kmem.h. As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h. To make like easier for everyone I do the same.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time. To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.
From linux-2.6.31 mm/page_alloc.c:1166
/*
* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 1);
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address. The function name itself can them become a
define which calls a function pointer. This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel. This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name(). There are
autoconf checks to detect if the symbol is exported and if so to
use it directly. We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported. Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement. The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.
Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
of $LINUX because Module.symvers is a build product. When
$LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
$LINUX/include search paths to allow proper compilation when
the kernel target build directory is not the source directory.
Because vmem_free() was implemented as a macro using the ','
operator to evaluate both arguments and we performed the free
before evaluating size we would deference the free'd pointer.
To resolve the problem we just invert the ordering and evaluate
size first just as if it was evaluated by the caller when being
passed to this function. This ensure that if the caller is
doing something reckless like performing an assignment as
part of the size argument we still perform it and it simply
doesn't get removed by the macro. Oh course nobody should
be doing this sort of thing, but just in case.
- The previous magazine ageing sceme relied on the on_each_cpu()
function to call spl_magazine_age() on each cpu. It turns out
this could deadlock with do_flush_tlb_all() which also relies
on the IPI based on_each_cpu(). To avoid this problem a per-
magazine delayed work item is created and indepentantly
scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
spl_kmem_cache_t, they were hold overs from previous cleanup.
- struct work_struct work
- struct timer_list timer
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
spl_slab_reclaim() to cap the number of slabs reclaimed.
On Solaris all eligible slabs are always reclaimed, and this
is still the default behavior. However, I suspect that is
not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
slab/object free lists. Soft lockups were observed when
freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
incorrect 'break' to 'continue' to ensure all slabs are
checked.
- spl_cache_age() reworked to avoid a deadlock with
do_flush_tlb_all() which occured because we slept waiting
for completion in spl_cache_age(). To waiting for magazine
reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
back to using for_each_online_cpu() instead of the
spl_on_each_cpu() approach which was of course a bad idea
due to memory allocations which Ricardo pointed out.
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree. These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.
When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered. When a minor is unregistered we use the vnode
interface to unlink the special file.
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
if the older 4 argument version of on_each_cpu() should be
used or the new 3 argument version. The retry argument was
dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers. The old way this
worked was to pass a data point when initialized the workqueue.
The new API assumed the work item is embedding in a structure
and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
type checked in the bit operations. This silences the warnings.
- Updated autogen products and splat tests accordingly
- Added slab work queue task which gradually ages and free's slabs
from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock. We
now unlink the objects and slabs from the cache itself and attach
them to a private work list. The contents of the list are then
subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
per-cache skc_lock. This includes the addition of KMC_BIT_REAPING
bit mask which is used to prevent concurrent reaping, and to defer
new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
destroyed, this is used to catch any task accessing the cache while
it is being destroyed.
- Add comments to all the functions and additional comments to try
and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
rigerously stress the cache implementation and look for any problems.
This includes correctness and performance tests.
- Updated portable work queue interfaces
That said when working with a finite resource like memory failure really
is always a possibility. It would be far better longer term if the ZFS
code could be weened off this assumption and properly handle the cases
where an allocation fails. Still I've applied the patch to spl-0.3.4
since this layer is supposed to emulate Solaris as closely as possible.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@164 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
already supprt atomic64_t types.
* spl-07-kmem-cleanup.patch
This moves all the debugging code from sys/kmem.h to spl-kmem.c, because
the huge macros were hard to debug and were bloating functions that
allocated memory. I also fixed some other minor problems, including
32-bit fixes and a reported memory leak which was just due to using the
wrong free function.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@163 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
spl-05-div64.patch
This is a much less intrusive fix for undefined 64-bit division symbols
when compiling the DMU in 32-bit kernels.
* spl-06-atomic64.patch
This is a workaround for 32-bit kernels that don't have atomic64_t.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@162 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
at a time as I audit it. This chunk finishes moving the SPL entirely
off the linux slab on to the SPL implementation. It differs slightly
from the proposed version in that the spl continues to export to
all the Solaris types and functions. These do conflict with the
Linux slab so a module usings these interfaces must not include the
SPL slab if they also intend to use the linux slab. Or they must
explcitly #undef the macros which remap the functioin to their
spl_* equivilants.
A nice side of effect of dropping the entire linux slab is we
don't need to autoconf checks anymore. They kept messing with
the slab API endlessly!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@148 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
based of the spl_kmem_obj_t tacked on the end of each object.
This actually isn't so back because we are now allocing large
chunks for the slab and partitioning it ourselves. So there's
not a ton of wasted space. We may suffer a performance hit
however due to alignment issues.
- Remove remaining depenancies on the linux slab implementation.
We're standing on our own now for better or worse.
- Rework slabs to be either kmem or vmem based. If neither
KMC_VMEM of KMC_KMEM are specified we make a decent guess
about what will work best for their based on the object
size. Additionally we provide a kmem_virt() function caller
can use to see if they have a virtual or physical address.
- Minor fixups in the test suite.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@141 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
cycle count which was costing me overhead. It was hurting
performance pretty badly for heavily used caches. I'm also
thinking the hash may be hurting me as well and it might
be worth sticking a pointer in to a little space after the
alloced object.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@140 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
based by vmalloc()'ed memory. I now alloc a slab which is
roughly 32*spl_obj_size and in this block of memory I place
the slab descriptor, slab object descriptors, and objects
themselves. This greatly reduces vmalloc lock contention.
Still some minor cleanup remains and fine tuning but
it's working pretty well.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@139 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
well for the expected workloads. Improvement in this commit include:
- Added DEBUG_KMEM_TRACKING #define which can optionally be set
when DEBUG_KMEM is defined to do per allocation tracking. This
allows us to get all the lightweight kmem debugging enabled by
default which is pretty light weight, and only when looking
for a memory leak we can briefly enable the per alloc tracking.
- Added set_normalized_timespec() in to SPL to simply using
the timespec() primatives from within a module.
- Added per-spinlock cycle counters to the slab in an attempt
to run down a lock contention issue. The contended lock
was in vmalloc() but I'm going to leave the cycle counters
in place for a little while until I'm convinced there arn't
other locking improvement possible in the slab.
- Added a proc interface to the slab to export per slab
cache statistics to /proc/spl/kmem/slab for analysis.
- Reworked spl_slab_alloc() function to allocate from kmem for
small allocation and vmem for large allocations. This improved
things considerably but futher work is needed.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@138 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
when repopulating it. Plus I fixed a few more suble races in
that part of the code which were catching me. Finally I fixed
a small race in kmem_test8.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@137 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
factor of 10x improvement on SMP system due to reduced lock contention.
This may put me in the ballpark of what is needed. We can still further
improve things on NUMA systems by creating an additional L3 cache per
memory node instead of the current global pool. With luck this won't
be needed. I should also take another look at the locking now that
everything is working. There's a good chance I can tighten it up a
little bit and improve things a little more.
kmem_lock: time (sec) slabs objs hash
kmem_lock: tot/max/calc tot/max/calc size/depth
kmem_lock: 0.000999926 6/6/1 192/192/32 32768/0
kmem_lock: 0.000999926 4/4/2 128/128/64 32768/0
kmem_lock: 0.000999926 4/4/4 128/128/128 32768/0
kmem_lock: 0.000999926 4/4/8 128/128/256 32768/0
kmem_lock: 0.000999926 4/4/16 128/128/512 32768/0
kmem_lock: 0.000999926 4/4/32 128/128/1024 32768/0
kmem_lock: 0.000999926 4/4/64 128/128/2048 32768/0
kmem_lock: 0.000999926 8/8/128 256/256/4096 32768/0
kmem_lock: 0.003999704 24/23/256 768/736/8192 32768/1
kmem_lock: 0.012999038 44/41/512 1408/1312/16384 32768/1
kmem_lock: 0.051996153 96/93/1024 3072/2976/32768 32768/2
kmem_lock: 0.181986536 187/184/2048 5984/5888/65536 32768/3
kmem_lock: 0.655951469 342/339/4096 10944/10848/131072 32768/4
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@136 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
to be overly clever and the context switch when the semaphore was busy
was destroying performance. Converting to a simple spin lock bough me
a factor of 50 or so. That said it's still not good enough. Tests
show bad performance and we are still CPU bound. The logical fix is
I need to implement per-cpu hot caches to minimize the SMP contention.
Linux and Solaris both have this, I was hoping to do without but it
looks like that's not to be.
kmem_lock: time (sec) slabs objs hash
kmem_lock: tot/max/calc tot/max/calc size/depth
kmem_lock: 0.022000000 7/6/64 224/177/2048 32768/1
kmem_lock: 0.039000000 13/13/128 416/404/4096 32768/1
kmem_lock: 0.079000000 23/21/256 736/672/8192 32768/1
kmem_lock: 0.158000000 48/47/512 1536/1504/16384 32768/1
kmem_lock: 0.345000000 105/105/1024 3360/3358/32768 32768/2
kmem_lock: 0.760000000 202/200/2048 6464/6400/65536 32768/3
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@135 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
longer be based on the linux slab but to be its own complete
implementation. The new slab behaves much more like the
Solaris slab than the Linux slab.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@132 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
working on this branch for the next few days I suggested you work
off of the 0.3.1 tag. The following changes are fairly extensive
and are designed to make the SPL compatible with all kernels in
the range of 2.6.18-2.6.25. There were 13 relevant API changes
between these releases and I have added the needed autoconf tests
to check for them. However, this has not all been tested extensively.
I'll sort of the breakage on Fedora Core 9 and RHEL5 this week.
SPL_AC_TYPE_UINTPTR_T
SPL_AC_TYPE_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_DESTROY_INT
SPL_AC_ATOMIC_PANIC_NOTIFIER
SPL_AC_3ARGS_INIT_WORK
SPL_AC_2ARGS_REGISTER_SYSCTL
SPL_AC_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_CREATE_DTOR
SPL_AC_3ARG_KMEM_CACHE_CREATE_CTOR
SPL_AC_SET_SHRINKER
SPL_AC_PATH_IN_NAMEIDATA
SPL_AC_TASK_CURR
SPL_AC_CTL_UNNUMBERED
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@119 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c