Commit Graph

398 Commits

Author SHA1 Message Date
Brian Behlendorf 1a20496834 Make slab reclaim more aggressive
Many people have noticed that the kmem cache implementation is slow
to release its memory.  This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty.  Unused
objects which are cached in the magazines will still prevent a slab
from being freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Richard Yao a988a35a93 Enforce architecture-specific barriers around clear_bit()
The comment above the Linux 3.16 kernel's clear_bit() states:

/**
 * clear_bit - Clears a bit in memory
 * @nr: Bit to clear
 * @addr: Address to start counting from
 *
 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
 * in order to ensure changes are visible on other processors.
 */

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Richard Yao c2fa09454e Add hooks for disabling direct reclaim
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf c3eabc75b1 Refactor generic memory allocation interfaces
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic. This eliminates the potential for
   surprises that were previously possible where kmem flags were
   mapped to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to the newly-added spl_kmem_alloc_max
   parameter.  This ensures that small allocations will not contend
   on a single global lock, large allocations can still be handled,
   and potentially limited virtual address space will not be squandered.
   This behavior is entirely different than under Illumos due to
   different memory management strategies employed by the respective
   kernels.  However, this functionally provides the semantics required.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified in to
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocations functions that can be safely used in the same way
   they are used on Illumos.   Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux specific changes
   which were previously required to avoid deadlocking the system.

This change will be largely transparent to existing callers but there
are a few caveats:

1. Because the headers were refactored and extraneous includes removed
   callers may find they need to explicitly add additional #includes.
   In particular, kmem_cache.h must now be explicitly includes to
   access the SPL's kmem cache implementation.  This behavior is
   different from Illumos but it was done to avoid always masking
   the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers such as ZFS which did not will not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flags has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() functions has been removed.  For callers which
   need to distinguish between a physical and virtual address use
   is_vmalloc_addr().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf b34b95635a Fix kmem cstyle issues
Address all cstyle issues in the kmem, vmem, and kmem_cache source
and headers.  This will done to make it easier to review subsequent
changes which will rework the kmem/vmem implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf e5b9b344c7 Refactor existing code
This change introduces no functional changes to the memory management
interfaces.  It only restructures the existing codes by separating the
kmem, vmem, and kmem cache implementations in the separate source and
header files.

Splitting this functionality in to separate files required the addition
of spl_vmem_{init,fini}() and spl_kmem_cache_{initi,fini}() functions.

Additionally, several minor changes to the #include's were required to
accommodate the removal of extraneous header from kmem.h.

But again, while large this patch introduces no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:08 -08:00
Ned Bass 52479ecf58 Remove compat includes from sys/types.h
Don't include the compatibility code in linux/*_compat.h in the public
header sys/types.h. This causes problems when an external code base
includes the ZFS headers and has its own conflicting compatibility code.
Lustre, in particular, defined SHRINK_STOP for compatibility with
pre-3.12 kernels in a way that conflicted with the SPL's definition.
Because Lustre ZFS OSD includes ZFS headers it fails to build due to a
'"SHRINK_STOP" redefined' compiler warning.  To avoid such conflicts
only include the compat headers from .c files or private headers.

Also, for consistency, include sys/*.h before linux/*.h then sort by
header name.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #411
2014-11-19 10:35:12 -08:00
Brian Behlendorf 8d9a23e82c Retire legacy debugging infrastructure
When the SPL was originally written Linux tracepoints were still
in their infancy.  Therefore, an entire debugging subsystem was
added to facilite tracing which served us well for many years.

Now that Linux tracepoints have matured they provide all the
functionality of the previous tracing subsystem.  Rather than
maintain parallel functionality it makes sense to fully adopt
tracepoints.  Therefore, this patch retires the legacy debugging
infrastructure.

See zfsonlinux/zfs@bc9f413 for the tracepoint changes.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #408
2014-11-19 10:35:07 -08:00
Richard Yao ad9863e80b kmem_cache: Call constructor/destructor on each alloc/free
This has a few benefits. First, it fixes a regression that "Rework
generic memory allocation interfaces" appears to have triggered in
splat's slab_reap and slab_age tests. Second, it makes porting code from
Illumos to ZFSOnLinux easier. Third, it has the side effect of making
reclaim from slab caches that specify reclaim functions an order of
magnitude faster. The splat slab_reap test usually took 30 to 40
seconds. With this change, it takes 3 to 4.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #369
2014-10-28 09:21:08 -07:00
Tim Chase 802a4a2ad5 Linux 3.12 compat: shrinker semantics
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects.  It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics
and updates the splat shrinker test case appropriately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #403
2014-10-28 09:20:13 -07:00
Brian Behlendorf 599662c538 Remove kern_path() wrapper
The kern_path() function has been available since Linux 2.6.28.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf 3d5392cefa Remove kvasprintf() wrapper
The kvasprintf() function has been available since Linux 2.6.22.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf 0fac9c9e6d Remove proc_handler() wrapper
As of Linux 2.6.32 the proc handlers where updated to expect only
five arguments.  Therefore there is no longer a need to maintain
this compatibility code and this infrastructure can be simplified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf 68a829b29d Remove credential configure checks.
The groups_search() function was never exported by a mainline kernel
therefore we drop this compatibility code and always provide our own
implementation.

Additionally, the cred_t structure has been available since 2.6.29
so there is no longer a need to maintain compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 137af025f6 Remove set_fs_pwd() configure check
This function has never been exported by any mainline and was only
briefly available under RHEL5.  Therefore this check is being removed
and the code update to always use the wrapper function.

The next step will be to eliminate all this code.  If ZFS were updated
not to assume that it's pwd was / there would be no need for this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 3c49a16989 Remove user_path_dir() wrapper
The user_path_dir() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 44778f4110 Remove kallsyms_lookup_name() wrapper
After the removable of get_vmalloc_info(), the unused global memory
variables, and the optional dcache/icache shrinkers there is no
longer a need for the kallsyms compatibility code.  This allows
us to eliminate another brittle area of the code by removing the
kernel upcall this functionality depended on for older kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 89a461e70c Remove shrink_{i,d}node_cache() wrappers
This is optional functionality which may or may not be useful to
ZFS when using older kernels.  It is never a hard requirement.
Therefore this functionality is being removed from the SPL and
a simpler slimmed down version will be added to ZFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 8bbbe46f86 Remove global memory variables
Platforms such as Illumos and FreeBSD have historically provided
global variables which summerize the memory state of a system.
Linux on the otherhand doesn't expose any of this information
to kernel modules and uses entirely different mechanisms for
memory management.

In order to simplify the original ZFS port to Linux these global
variables were emulated by the SPL for the benefit of ZFS.  As ZoL
has matured over the years it has moved steadily away from these
interfaces and now no longer depends on them at all.

Therefore, this patch completely removes the global variables
availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
and swapfs_reserve.  This greatly simplifies the memory management
code and eliminates a common area of confusion.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf e1310afae3 Remove get_vmalloc_info() wrapper
The get_vmalloc_info() function was used to back the vmem_size()
function.  This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.

However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 50e41ab1e1 Remove on_each_cpu() wrapper
The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 2bc5666f53 Remove i_mutex() configure check
The inode structure has used i_mutex as its internal locking
primitive since 2.6.16.  The compatibility code to check for
the previous semaphore primitive has been removed.  However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 82f2f1a3af Simplify the time compatibility wrappers
Many of the time functions had grown overly complex in order to
handle kernel compatibility issues.  However, as of Linux 2.6.26
all the required functionality is available.  This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf 87f8055a91 Map highbit64() to fls64()
The fls64() function has been available since Linux 2.6.16 and
it should be used to implemented highbit64().  This allows us
to provide an optimized implementation and simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf 9c91800d19 Remove CTL_UNNUMBERED sysctl interface
Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19.  There is no longer any reason to maintain this
compatibility code.  There also issue any reason to keep around
the CTL_NAME macro and helpers so they have been retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf b38bf6a4e3 Remove register_sysctl() compatibility code
The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf bb4dee3df2 Remove utsname() wrapper
There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly.  This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:41 -07:00
Brian Behlendorf aa363c5c05 Remove sysctl_vfs_cache_pressure assumption
The generic SPL cache shrinkers make the assumption that the
caches only contain VFS cache data and therefore should be scaled
based on vfs_cache_pressure.  This is not strictly true and it
should not be assumed.

Removing this tuning should not have any impact on the stock
behavior because vfs_cache_pressure=100 by default.  This means
that no scaling will take place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf a80d69caf0 Remove adaptive mutex implementation
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf 6203295438 Make license compatibility checks consistent
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Turbo Fredriksson e3020723dc Linux 3.16 compat: smp_mb__after_clear_bit()
The smp_mb__{before,after}_clear_bit functions have been renamed
smp_mb__{before,after}_atomic.  Rather than adding a compatibility
function to handle this the code has been updated to use smp_wmb().

This has the advantage of being a stable functionally equivalent
interface.  On many architectures smp_mb__after_clear_bit() expands
to smp_wmb().  Others might be able to do something slightly more
efficient but this will be safe and correct on all of them.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #386
2014-09-22 16:24:55 -07:00
Richard Yao ec18fe3ce8 Cleanup vn_rename() and vn_remove()
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #370
2014-08-13 16:25:44 -07:00
Ned Bass 2fc44f66ec Linux 3.17 compat: remove wait_on_bit action function
Linux kernel 3.17 removes the action function argument from
wait_on_bit().  Add autoconf test and compatibility macro to support
the new interface.

The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule().  This API change
was made to consolidate all of those redundant wait functions.

References: torvalds/linux@7431620

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #378
2014-08-11 14:17:00 -07:00
Brian Behlendorf f2297b5a89 Set spl_kmem_cache_slab_limit=16384 to default
For small objects the Linux slab allocator should be used to make the most
efficient use of the memory.  However, large objects are not supported by
the Linux slab and therefore the SPL implementation is preferred.  A cutoff
of 16K was determined to be optimal for architectures using 4K pages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #356
Closes #379
2014-08-08 08:51:45 -07:00
Brian Behlendorf c1aef26944 Set spl_kmem_cache_reclaim=0 to default
Reinstate the correct default behavior of returning the number of objects
in the cache for reclaim.  This behavior was disabled in recent releases
to do occasional reports of spinning in shrink_slabs().  Those issues have
been resolved and can no longer can be reproduced.  See commit 376dc35.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #358
Closes #379
2014-08-08 08:50:03 -07:00
Tim Chase 7f23e00109 Add functions and macros as used upstream.
Added highbit64() and howmany() which are used in recent upstream
code.  Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #363
2014-07-22 09:47:48 -07:00
Brian Behlendorf 377e12f14a Rate limit debugging stack traces
There have been issues in the past where excessive debug logging
to the console has resulted in significant performance impacts.
In the vast majority of these cases only a few stack traces are
required to diagnose the issue.  Therefore, stack traces dumped to
the console will now we limited to 5 every 60s.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #374
2014-07-22 09:47:24 -07:00
Tim Chase f6a869614e Safer debugging and assertion macros.
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.

The following macros have been converted to not use do/while(0):
	PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL

PANIC has been converted to a wrapper around the new spl_PANIC() function.

The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().

The __ASSERT() macro was not touched.  It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2014-07-01 15:14:43 -07:00
Brian Behlendorf 376dc35e22 Add spl_kmem_cache_reclaim module option
The correct behavior for all registered shrinkers is to return the
number of objects in their cache.  In theory this allows the Linux
VM to balance memory reclaim across all registered caches.

In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim.  This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
core where in reclaim.  Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.

Therefore, this patch makes this behavior tunable.  Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once.  This is the default behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #358
2014-05-22 10:30:12 -07:00
Brian Behlendorf a073aeb060 Add KMC_SLAB cache type
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL.  These include:

1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.

Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator.  This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on.  However, there are some things we need to be careful of:

1) The Linux slab allocator was never designed to work well with
   large objects.  Because the SPL slab must still handle this use
   case a cut off limit was added to transition from Linux slab
   backed objects to kmem or vmem backed slabs.

   spl_kmem_cache_slab_limit - Objects less than or equal to this
   size in bytes will be backed by the Linux slab.  By default
   this value is zero which disables the Linux slab functionality.
   Reasonable values for this cut off limit are in the range of
   4096-16386 bytes.

   spl_kmem_cache_kmem_limit - Objects less than or equal to this
   size in bytes will be backed by a kmem slab.  Objects over this
   size will be vmem backed instead.  This value defaults to
   1/8 a page, or 512 bytes on an x86_64 architecture.

2) Be aware that using the Linux slab may inadvertently introduce
   new deadlocks.  Care has been taken previously to ensure that
   all allocations which occur in the write path use GFP_NOIO.
   However, there may be internal allocations performed in the
   Linux slab which do not honor these flags.  If this is the case
   a deadlock may occur.

The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us.  And ideally
need to move completely away from using the SPLs slab for large
memory allocations.  This patch is a first step.

NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
   backed caches but it is not supported for kmem/vmem backed caches.

2) Regardless of the spl_kmem_cache_*_limit settings a cache may
   be explicitly set to a given type by passed the KMC_KMEM,
   KMC_VMEM, or KMC_SLAB flags during cache creation.

3) The constructors, destructors, and reclaim callbacks are all
   functional and will be called regardless of the cache type.

4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
   the issues involved in presenting correct object accounting.
   Instead they will appear in /proc/slabinfo under the same names.

5) Several kmem SPLAT tests needed to be fixed because they relied
   incorrectly on internal kmem slab accounting.  With the updated
   test cases all the SPLAT tests pass as expected.

6) An autoconf test was added to ensure that the __GFP_COMP flag
   was correctly added to the default flags used when allocating
   a slab.  This is required to ensure all pages in higher order
   slabs are properly refcounted, see ae16ed9.

7) When using the SLUB allocator there is no need to attempt to
   set the __GFP_COMP flag.  This has been the default behavior
   for the SLUB since Linux 2.6.25.

8) When using the SLUB it may be desirable to set the slub_nomerge
   kernel parameter to prevent caches from being merged.

Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #356
2014-05-22 10:28:01 -07:00
Chunwei Chen ad3412efd7 Linux 3.15: vfs_rename() added a flags argument
Detect the updated vfs_rename() interface and call it with an
extra flags argument.

References:
  torvalds/linux@520c8b1

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
2014-05-07 13:38:17 -07:00
Andrey Vesnovaty 703371d8c7 Evenly distribute the taskq threads across available CPUs
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.

The implementation based on assumption that task queues initialization
performed in serial manner.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #336
2014-04-25 15:29:18 -07:00
Chunwei Chen ae16ed992b Fix crash when using ZFS on Ceph rbd
When using __get_free_pages to get high order memory, only the first page's
_count will set to 1, other's will be 0. When an internal page get passed into
rbd, it will eventully go into tcp_sendpage. There, it will be called with
get_page and put_page, and get freed erroneously when _count jump back to 0.

The solution to this problem is to use compound page. All pages in a
high order compound page share a single _count. So get_page and put_page in
tcp_sendpage will not cause _count jump to 0.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #251
2014-04-25 15:26:52 -07:00
Richard Yao 89aa97059d Change spl_kmem_cache_expire default setting to 2
This behavior is more consistent with the way memory reclaim
is expected to work under Linux.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #349
2014-04-14 16:29:01 -07:00
Andrey Vesnovaty bdfbe594a1 Expose max/min objs per slab and max slab size
By default maximal number of objects in slab can't exceed (16*2 - 1) and slab
size can't exceed 32M.
Today's high end servers having couple hundreds of RAM available for ARC may
run into a trouble with virtual memory because of the restriction mentioned
above.

Problem:
Reasons for very high number of virtual memory allocations:
	* Real slab size very small relative to the size of the entire RAM
	* Slabs allocated on virtual memory and fill entire ARC

The result is very high number of allocated virtual memory ranges (hundreds of
ranges). When virtual memory subsystem manages high number of ranges its
performance become so poor that it freezes from time to time.

Solution:
Number of objects per slab should be increased taking into account maximal
slab size which can also be increased if needed.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #337
2014-04-14 09:42:04 -07:00
Richard Yao acf0ade362 Simplify hostid logic
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist

Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224
2014-04-14 09:04:41 -07:00
Tim Chase 3ceb71e896 Call kthread_create() correctly with fixed arguments.
The kernel's kthread_create() function is defined as "..." and there is
no va_list variant at the moment.  The task name is pre-formatted into
a local buffer and passed to kthread_create() with fixed arguments.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #347
2014-04-11 09:41:40 -07:00
Tim Chase ed650dee76 De-inline spl_kthread_create().
The function was defined as a static inline with variable arguments
which causes gcc to generate errors on some distros.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #346
2014-04-09 19:17:12 -07:00
Tim Chase 17a527cb0f Support post-3.13 kthread_create() semantics.
Provide spl_kthread_create() as a wrapper to the kernel's kthread_create()
to provide pre-3.13 semantics.  Re-try if the call is interrupted or if it
would have returned -ENOMEM.  Otherwise return NULL.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #339
2014-04-08 12:44:42 -07:00
Brian Behlendorf e19101e08f splat cred:groupmember: Fix false positives
Due to certain assumptions made in the the cred:groupmember test it
could result in false positives when run on specific distributions.
This was solely a bug in the test case and not in the groupmember()
function which the test case was validating.

To prevent future false positives the test case has been rewritten
to be both more rigerous and to make fewer assumptions about the
system.

Minor style cleanup was done to cr_groups_search() and groupmember()
functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-08 12:44:41 -07:00
Brian Behlendorf aeeb4e0c0a Remove default taskq thread to CPU bindings
When this code was written it appears to have been assumed that
every taskq would have a large number of threads.  In this case
it would make sense to attempt to evenly bind the threads over
all available CPUs.  However, it failed to consider that creating
taskqs with a small number of threads will cause the CPUs with
lower ids become over-subscribed.

For this reason the kthread_bind() call is being removed and
we're leaving the kernel to schedule these threads as it sees fit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #325
2014-01-07 10:46:24 -08:00
Brian Behlendorf 921a35adeb Add module versioning
Use the standard Linux MODULE_VERSION macro to expose the installed
spl and splat module versions.  This will also automatically add a
checksum of the .c files and headers in "srcversion".  See:

  /sys/module/spl/version
  /sys/module/spl/srcversion
  /sys/module/splat/version
  /sys/module/splat/srcversion

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1923

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-12-06 11:03:43 -08:00
Richard Yao 50a0749eba Linux 3.13 compat: Pass NULL for new delegated inode argument
This check was originally added for SLES10, a093c6a, to check for
a 'struct vfsmount *' argument which they added.  However, since
SLES10 is based on a 2.6.16 kernel which is no longer supported
this functionality was dropped.  The checks were refactored to
support Linux 3.13 without concern for historical versions.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #312
2013-12-02 10:37:49 -08:00
Richard Yao 3e96de17d7 Linux 3.13 compat: Remove unused flags variable from __cv_init()
GCC 4.8.1 complained about an unused flags variable when building
against Linux 2.6.26.8:

/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:
In function ‘__cv_init’:
/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:39:6:
error: variable ‘flags’ set but not used
[-Werror=unused-but-set-variable]
  int flags = KM_SLEEP;
        ^
	cc1: all warnings being treated as errors

Additionally, the superfluous code uses a preempt_count variable that is
no longer available on Linux 3.13. Deleting the unnecessary code fixes a
Linux 3.13 compatibility issue.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #312
2013-12-02 10:11:19 -08:00
Ned Bass 184c687387 Emulate illumos interface cv_timedwait_hires()
Needed for Illumos #3582. This interface is supposed to support
a variable-resolution timeout with nanosecond granularity.  This
implementation rounds up to microsecond resolution, as nanosecond-
precision timing is rarely needed for real-world performance
tuning and may incur unnecessary busy-waiting.  usleep_range() is
used if available, otherwise udelay() or msleep() are used
depending on the length of the delay interval.

Add flags from sys/callo.h as these are used to control the behavior of
cv_timedwait_hires().  Specifically,

CALLOUT_FLAG_ABSOLUTE
    Normally, the expiration passed to the timeout API functions is
    an expiration interval. If this flag is specified, then it is
    interpreted as the expiration time itself.

CALLOUT_FLAG_ROUNDUP
    Roundup the expiration time to the next resolution boundary. If this
    flag is not specified, the expiration time is rounded down.

References:
    https://www.illumos.org/issues/3582
    illumos/illumos-gate@0689f76

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #304
2013-11-04 09:49:24 -08:00
Ned Bass f483a97a41 3537 add kstat_waitq_enter and friends
These kstat interfaces are required to port
"Illumos #3537 want pool io kstats" to ZFS on Linux.

kstat_waitq_enter()
kstat_waitq_exit()
kstat_runq_enter()
kstat_runq_exit()

Additionally, zero out the ks_data buffer in __kstat_create() so
that the kstat_io_t counters are initialized to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:41:52 -07:00
Cyril Plisko ffbf0e57c2 Kstat to use private lock by default
While porting Illumos #3537 I found that ks_lock member of kstat_t
structure is different between Illumos and SPL. It is a pointer to
the kmutex_t in Illumos, but the mutex lock itself in SPL.
Apparently Illumos kstat API allows consumer to override the lock
if required. With SPL implementation it is not possible anymore.

Things were alright until the first attempt to actually override
the lock. Porting of Illumos #3537 introduced such code for the
first time.

In order to provide the Solaris/Illumos like functionality we:
  1. convert ks_lock to "kmutex_t *ks_lock"
  2. create a new field "kmutex_t ks_private_lock"
  3. On kstat_create() ks_lock = &ks_private_lock

Thus if consumer doesn't care we still have our internal lock in use.
If, however, consumer does care she has a chance to set ks_lock to
anything else before calling kstat_install().

The rest of the code will use ks_lock regardless of its origin.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
2013-10-25 13:41:30 -07:00
Brian Behlendorf ce07767f79 Revert "Add KSTAT_TYPE_TXG type"
This reverts commit dba79fcbf2 in
favor of using the generic KSTAT_TYPE_RAW callbacks.  The advantage
of this approach is that arbitrary types can be added without the
need to add them to the SPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Prakash Surya 56d40a686b Add callbacks for displaying KSTAT_TYPE_RAW kstats
The current implementation for displaying kstats of type KSTAT_TYPE_RAW
is rather crude. This patch attempts to enhance this handling by
allowing a kstat user to register formatting callbacks which can
optionally be used.

The callbacks allow the user to implement functions for interpreting
their data and transposing it into a character buffer. This buffer,
containing a string representation of the raw data, is then be displayed
through the current /proc textual interface.

Additionally the kstats are made writable because it's now possible
to provide a useful handler via the existing ks_update() interface.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Brian Behlendorf 429fe89cee Consistently use local_irq_disable/local_irq_enable
It was observed that spl_kmem_cache_alloc() uses local_irq_save()
and saves the interrupt state in a local variable.  This would
normally be fine except that spl_kmem_cache_alloc() calls
spl_cache_refill() which re-enables interrupts.  It is then
possible that while interrupts are enabled the process is
rescheduled to a different cpu before being disable again.
This could result in us restoring the saved interrupt state
from one cpu to another.

What the consequences of this are aren't perfectly clear, but
this is clearly a bug and it has the potential to cause issues.
The code has been updated to just use local_irq_enable() and
local_irq_disable() to avoid this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-09 14:00:56 -07:00
Richard Yao df2c0f1849 Replace current_kernel_time() with getnstimeofday()
current_kernel_time() is used by the SPLAT, but it is not meant for
performance measurement. We modify the SPLAT to use getnstimeofday(),
which is equivalent to the gethrestime() function on Solaris.
Additionally, we update gethrestime() to invoke getnstimeofday().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #279
2013-10-09 13:28:30 -07:00
Richard Yao f7fd6ddd96 Linux 3.8 compat: Use kuid_t/kgid_t when required
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Original-patch-by: DHE
Rewrite-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #260
2013-08-09 10:09:29 -07:00
Richard Yao e3c4d44886 PaX/GrSecurity Linux 3.8.y compat: Use __no_const on struct ctl_table
The PaX team started constifying `struct ctl_table` as of their Linux
3.8.0 patchset. This lead to zfsonlinux/spl#225 and Gentoo bug #463012.

While investigating our options, I learned that there is a preprocessor
directive called CONSTIFY_PLUGIN that we can use to detect the presence
of the PaX changes and adjust the code accordingly.

The PaX Team had suggested adopting ctl_table_no_const, but supporting
older kernels required declaring that whenever the CONSTIFY_PLUGIN was
set. Future compiler changes could potentially cause that to break in
the presence of -Werror, so instead we define our own spl_ctl_table
typdef and use that. This should be compatible with all PaX kernels.

This introduces a Linux kernel version number check to prevent a build
failure on versions of the PaX GCC plugin that existed for kernels
before Linux 3.8.0. Affected versions of the PaX plugin will trigger a
compiler error when they see no_const cast on a non-constified
structure.  Ordinarily, we would need an autotools check to catch that.
However, it is safe to do a kernel version check instead of an autotools
check in this specific instance because the affected versions of the PaX
GCC plugin only exist for Linux kernels before 3.8.0 and the
constification of `struct ctl_table` by the PaX developers only occurs
in Linux 3.8.0 and later.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #225
2013-08-08 09:51:34 -07:00
Richard Yao 251e7a779b Fix race in spl_kmem_cache_reap_now()
The current code contains a race condition that triggers when bit 2 in
spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is invoked
and another thread is concurrently accessing its magazine.

spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on each
magazine in the same thread when bit 2 in spl.spl_kmem_cache_expire is
set. This is unsafe because there is one magazine per CPU and the
magazines are lockless, so it is impossible to guarentee that another
CPU is not using its magazine when this function is called.

The solution is to only touch the local CPU's magazine and leave other
CPU's magazines to other CPUs.

Reported-by: DHE
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #274
2013-08-08 09:14:41 -07:00
Richard Yao ba06298072 Linux 3.11 compat: Replace num_physpages with totalram_pages
num_physpages was removed by
torvalds/linux@cfa11e08ed, so lets replace
it with totalram_pages.

This is a bug fix as much as it is a compatibility fix because
num_physpages did not reflect the number of pages actually available to
the kernel:

http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html

Also, there are known issues with memory calculations when ZFS is in a
Xen dom0. There is a chance that using totalram_pages could resolve
them. This conjecture is untested at the time of writing.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #273
2013-08-08 09:14:29 -07:00
Brian Behlendorf ceb3872825 Fix KMC_OFFSLAB type caches
Because spl_slab_size() was always returning -ENOSPC for caches of
type KMC_OFFSLAB the cache could never be created.  Additionally
the slab size is rounded up to a page which is what kv_alloc()
expects.  The kv_alloc() code will minimally allocate a page,
in the KMC_OFFSLAB case this could be reduced.

The basic regression tests kmem:slab_small, kmem:slab_large,
and kmem:slab_align regression were updated to test KMC_OFFSLAB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Closes #266
2013-07-30 15:39:23 -07:00
Brian Behlendorf b9b3715346 Return -1 for generic kmem cache shrinker
It has been observed that it's possible to get in a state where
shrink_slabs() will spin repeated invoking the generic kmem cache
shrinker.  It fails to detect it's not making forward progress
reclaiming from the cache and doesn't give up.  To ensure this
never occurs we unconditionally return -1 after reclaiming what
we can.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes zfsonlinux/zfs#1276
Closes zfsonlinux/zfs#1598
Closes zfsonlinux/zfs#1432
2013-07-30 15:33:24 -07:00
James H c47efbc7fd Modify gethrestime to use current_kernel_time()
This allows us to get nanosecond resolution. It also means
we use the same time source as utimensat(now) etc.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #255
2013-07-15 09:17:19 -07:00
Brian Behlendorf ab4e74cc38 Fix bogus kmem leak warning
Commit 5c7a036 correctly relocated the creation of a taskq
and the registraction of the kmem_cache_shrinker after the
initialization of the kmem tracking code.  However, the
cleanup of these structures was not done before the leak
checks in spl_kmem_fini().  This resulted in an incorrect
'kmem leaked' warning even though there was no actual leak.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1569
2013-07-10 15:08:22 -07:00
Brian Behlendorf b1424adda5 Fix --enable-debug-kmem-tracking option
This code has gotten something stale and no longer builds cleanly
against modern kernels.  The two issues addressed here are as
follows:

* The hlist_*_rcu interfaces in the kernel have been relatively
  unstable.  Since this isn't performance critical code just use
  the long standing hlist_* variants.

* In older kernels the hash_ptr() function takes a 'void *' but
  in newer kernels it expects a 'const void *'.  To silence the
  compiler warnings about this explicitly cast it to a 'void *'.
  The memset function is a similar case but it always expects
  a 'void *'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #256
2013-07-09 09:23:54 -07:00
Richard Yao f2a745c41d Linux 3.10 compat: Do not rely on struct proc_dir_entry definition
Linux kernel commit torvalds/linux#59d8053f moved the definition of
struct proc_dir_entry from include/linux/proc_fs.h to the private
header fs/proc/internal.h. The SPL relied on that to map Solaris'
kstat to entries in /proc/spl/kstat.

Since the proc_dir_entry structure is now private the only safe
thing to do is wrap the opaque proc handle with our own structure.
This actually ends up simplify the code and is good because it
moves us away from depending on implementation details of /proc.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:25:18 -07:00
Yuxuan Shui 1ddf9722dc Linux 3.10 compat: replace PDE()->data with PDE_DATA()
Linux kernel commit torvalds/linux@d9dda78b renamed PDE() to
PDE_DATA().  To handle this detect the prefered interface
and define a PDE_DATA() wrapper for consistency.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:14:21 -07:00
Tim Chase 5c7a0369e2 Fix --enable-debug-kmem-tracking option
Re-order initialization in spl_kmem_init to allow for kmem tracing
to work.  The spl_kmem_init function calls taskq_create prior to
initializing the tracking (calling spl_kmem_init_tracking).  Since
taskq_create uses kmem_alloc, NULL dereferences occur because the
global kmem_list hasn't had its next & prev pointers initialized yet.

This commit moves the calls to spl_kmem_init_tracking earlier in the
spl_kmem_init function in order that the subsequent kmem_alloc calls
(by taskq_create) work properly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #243
2013-06-18 11:40:33 -07:00
Brian Behlendorf 99c452bbba Fix taskq_wait_id()
The existing taskq_wait_id() function can incorrectly block
indefinitely.  Reimplement it more simply using wait_event()
in a similar fashion to taskq_wait_all().

This flaw was uncovered in the context of moving vn_rdwr() to
a taskq.  Previously taskq_wait_id() had no consumers outside
the SPLAT task framework which is why the issue went unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-03 14:32:29 -07:00
Richard Yao feaf1e321d Do not call cond_resched() in spl_slab_reclaim()
Calling cond_resched() after each object is freed and then after each
slab is freed can cause slabs of objects to live for excessive periods
of time following reclaimation. This interferes with the kernel's own
memory management when called from kswapd and can cause direct reclaim
to occur in response to memory pressure that should have been resolved.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2013-03-21 12:58:44 -07:00
Richard Yao 4a31e5aa9b Linux 3.9 compat: Switch to hlist_for_each{,_rcu}
torvalds/linux@b67bfe0d42 changed
hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle
this by switching to hlist_for_each{,_rcu}, which works across all
supported kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:34 -07:00
Richard Yao 8274ed5988 Drop support for 3 argument version of set_fs_pwd
This was a suggestion that Brian Behlendorf made when reviewing an early
pull request for Linux 3.9 support. This commit was made intentionally
easy to revert should we ever have a reason to reintroduce support for
older kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:31 -07:00
Richard Yao a54718cfe0 Linux 3.9 compat: set_fs_root takes const struct path *
torvalds/linux@dcf787f391 enforces
const-correctness in passing struct path *.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:29 -07:00
Richard Yao 2a305c34c8 Linux 3.9 compat: vfs_getattr takes two arguments
The function prototype of vfs_getattr previoulsy took struct vfsmount *
and struct dentry * as arguments. These would always be defined together
in a struct path *.

torvalds/linux@3dadecce20 modified
vfs_getattr to take struct path * is taken as an argument instead.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:26 -07:00
Richard Yao bc90df6688 Linux 3.9 compat: Do not depend on f_vfsmnt
torvalds/linux@182be68478 removed the
preprocessor definition for f_vfsmnt. The ability to access the
mountpoint via ->f_path.mnt has been stable for a long time, so we
switch to that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:23 -07:00
Ned Bass 3d6af2dd6d Refresh links to web site
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-04 19:09:34 -08:00
Brian Behlendorf 4bf3909e51 Disable automatic log dumping
Long ago infrastructure was added to the SPL to keep an internal
debug log of the last few seconds of activity.  This was helpful
during the early development, but these days it is no longer
needed.  I haven't had to resort to this debug buffer to resolve
an issue for several years now.

Today better more generic tools like systemtap and ftrace have
evolved to the point where they can be used for this purpose.
Along with the stack trace dumped to the system console, and in
rare cases a crash dump we almost always have the debug we need.

Therefore, I'm disabling the code which automatically dumps
this log to disk during an assertion except for the case where
spl_debug_panic_on_bug is set (disabled by default).

This should be viewed as a first step towards either.

  a) Retiring this infrastructure and complexity entirely, or
  b) Integrating this logging more properly with ftrace.

As part of this change I'm also removing from the packages the
undocumented spl utility which is used to decode the binary logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-05 16:13:27 -08:00
Brian Behlendorf 6ef94aa67a Fix tsd_get/set() race with tsd_exit/destroy()
The tsd_exit() and tsd_destroy() functions remove entries from
hash bins without taking the hash bin lock.  They do take the
table lock, but tsd_get() and tsd_set() only take the hash bin
lock to allow for maximum concurency.

The result is that while tsd_get() and tsd_set() are traversing
the hash bin list it can be modified by another thread in which
happens to hash to the same value.  To avoid this add the needed
locking to tsd_exit() and tsd_destroy().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2013-01-31 13:54:59 -08:00
Brian Behlendorf 0936c3449f Add spl_kmem_cache_expire module option
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior.  The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache.  On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.

This behavior is now configurable through the 'spl_kmem_cache_expire'
module option.  The value is a bit mask with the following meaning.

  0x1 - Solaris style cache aging eviction is enabled.
  0x2 - Linux style low memory eviction is enabled.

Both methods may be safely enabled simultaneously, but by default
both are disabled.  It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good.  It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes #210
2013-01-28 09:34:12 -08:00
Brian Behlendorf 84dd1f4f15 Remove spl_invalidate_inodes()
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
2013-01-17 11:40:47 -08:00
Brian Behlendorf d4899f4747 kmem-cache: Fix slab ageing soft lockup
Commit a10287e00d slightly reworked
the slab ageing code such that it is no longer dependent on the
Linux delayed work queue interfaces.

This was good for portability and performance, but it requires us
to use the on_each_cpu() function to execute the spl_magazine_age()
function.  That means that the function is now executing in interrupt
context whereas before it was scheduled in normal process context.
And that means we need to be slightly more careful about the locking
in the interrupt handler.

With the reworked code it's possible that we'll be holding the
skc->skc_lock and be interrupted to handle the spl_magazine_age()
IRQ.  This will result in a deadlock and soft lockup errors unless
we're careful to detect the contention and avoid taking the lock in
the interupt handler.  So that's what this patch does.

Alternately, (and slightly more conventionally) we could have used
spin_lock_irqsave() to prevent this race entirely but I'd perfer to
avoid disabling interrupts as much as possible due to performance
concerns.  There is absolutely no penalty for us not aging objects
out of the magazine due to contention.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes zfsonlinux/zfs#1193
2013-01-14 10:07:58 -08:00
Ned Bass 8842263bd0 call_usermodehelper() should wait for process
As of Linux 3.4 the UMH_WAIT_* constants were renumbered.  In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process).  A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with
this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-09 16:54:19 -08:00
Brian Behlendorf 1c7b3eaf87 RHEL 6.4 compat, fallocate()
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure.  Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().

Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.

To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 09:53:13 -08:00
Matt Johnston 46a75aadb7 Add cv_wait_io() to account I/O time
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting.  The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #206
2013-01-07 10:29:26 -08:00
Brian Behlendorf 034f1b331e Fix spl_kmem_init_kallsyms_lookup() panic
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute.  To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#699
Closes zfsonlinux/zfs#859
2012-12-19 09:06:35 -08:00
Brian Behlendorf 33e94ef1dd kmem-cache: Use a taskq for async allocations
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernels delayed work queue interface
and all the compatibility code it requires.

This code never actually used the delay functionality it was just
done this way to leverage the existing compatibility code.  All that
is required is a thread context to perform the allocation in.  The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf a10287e00d kmem-cache: Use taskqs for ageing
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces.  This allows us to abandon the kernels
delayed work queue interface and all the compatibility code it
requires.

However, the delayed taskq interface does not allow us to schedule
a task for a specfic cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function.  The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu.  It then blocks
waiting for them to complete and promptly reclaims any free slabs.

When restructing the code wasn't the primary goal I think the
new code is far more understable and maintainable.  It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 296a8e596d kmem-cache: spl_kmem_cache_create() may always sleep
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context.  In
practice there are no callers which ever do this.  This makes sense
since a cache is by design a long lived data structure.

To prevent abuse of this function going forward I'm removing the
code which is supported to handle an atomic context.  All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf d9acd930b5 taskq delay/cancel functionality
Add the ability to dispatch a delayed task to a taskq.  The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached.  To achieve this two
new functions were added.

* taskq_dispatch_delay() -

  This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument.  The caller should pass the
desired time the task should be executed as an absolute value in
jiffies.  The task is guarenteed not to run before this time, it
may run slightly latter if all the worker threads are busy.

* taskq_cancel_id() -

  Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task.  There are three possible
return values.

  0      - The task was found and canceled before it was executed.
  ENOENT - The task was not found, either it was already run or an
           invalid task id was supplied by the caller.
  EBUSY  - The task is currently executing any may not be canceled.
           This function will block until the task has been completed.

* taskq_wait_all() -

  The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior.  It is only curreny used by
the splat taskq regression tests.

* taskq_wait_id() -

  Historically, the only difference between this function and
taskq_wait() was that you passed the task id.  In both functions you
would block until ALL lower task ids which executed.  This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.

  To better accomidate the delay tasks this function was reimplemnted.
It will now only block until the passed task id has been completed.

This is actually a fairly low risk change for a few reasons.

* Only new ZFS callers will make use of the new interfaces and
  very little common code was changed to support the new functions.

* The existing taskq_wait() implementation was not changed just
  slightly refactored.

* The newly optimized taskq_wait_id() implementation was never
  used by ZFS we can't accidentally introduce a new bug there.

NOTE: This functionality does not exist in the Illumos taskqs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf aed8671cb0 taskq style, remove #define wrappers
When the taskq implementation was originally written I wrapped all
the API functions in #define's.  This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.

However, in practice the taskq symbols never conflicted.  The only
major conflicts occured with the kmem cache API.  Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf 472a34caff taskq style, convert spaces to soft tabs
Update the taskq implementation to conform with the style used
throughout the rest of the code.  There are no functional
changes in this commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf 053678f3b0 Handle errors from spl_kern_path_locked()
When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added
by commit bcb1589 an entirely new vn_remove() implementation was
added.  That function did not properly handle an error from
spl_kern_path_locked() which would result in an panic.  This
patch addresses the issue by returning the error to the caller.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #187
2012-12-03 12:06:25 -08:00
Brian Behlendorf 043f9b5724 Disable FS reclaim when allocating new slabs
Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks.  Therefore, we clear __GFP_FS
for these allocations.  The two deadlocks are:

* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
  calls kmem_cache_alloc() which happens to need to allocate a
  new slab.  To allocate the new slab we enter FS level reclaim
  and attempt to evict several inodes.  To evict these inodes we
  need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
  just happens that obj1 and obj2 use the same hashed lock.

* Similar to the first case however instead of getting blocked
  on the hash lock we block in txg_wait_open() which is waiting
  for the next txg which isn't coming because the txg_sync
  thread is blocked in kmem_cache_alloc().

Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS.  However, it practice this is sufficient because
several very unlikely things must all occur concurrently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
2012-11-27 13:43:27 -08:00
Brian Behlendorf dc1b30224f Never spin in kmem_cache_alloc()
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.

Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock.  When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 15:48:39 -08:00
Brian Behlendorf a1af8fb1ea Optimize spl_kmem_cache_free()
Because only virtual slabs may have emergency objects and these
objects are guaranteed to have physical addresses.  It can be
easily determined if the passed object is a virtual slab object
or an emergency object.  This allows us to completely optimize
the emergency object free case out of the common free path.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf ed3163484d Track emergency object in rbtree
In the initial implementation emergency objects were tracked on a
per-cache list.  The assumption was that under normal operation we
would never allocate more than a handful of these objects.  So the
cost of walking the list during free was expected to be negligible.

However real world usage has shown that emergency objects tend to
be allocated in batches.  A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.

Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimize contention on the assoicated spin lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf 165f13c33a Improved vmem cached deadlock detection
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks.  In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.

With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time.  Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects.  This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.

The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:15 -08:00
Brian Behlendorf d2733258d0 Condition variable reference counts
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().

This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.

In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:55 -08:00
Brian Behlendorf dba79fcbf2 Add KSTAT_TYPE_TXG type
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to tracks the following
statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:17:40 -07:00
Brian Behlendorf 71c9f0b003 Make kstat.ks_update() callback atomic
Move the kstat ks_update() callback under the ks_lock.  This
enables dynamically sized kstats without modification to the
kstat API.

  * Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
  * Register a ->ks_update() callback which does:
    o Frees any existing ks_data buffer.
    o Set ks_data_size to the kstat array size.
    o Set ks_data to an allocated buffer of size ks_data_size
    o Populate the array of buffers with the required data.

The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer.  The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access.  To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-23 09:36:19 -07:00
Yuxuan Shui bcb15891ab Linux 3.6 compat, kern_path_locked() added
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154
2012-10-14 16:26:21 -07:00
Massimo Maggi dea3505dff Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 16:22:29 -07:00
Etienne Dechamps bbdc6ae495 Add interface for file hole punching.
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.

This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.

This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.

On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().

This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #168
2012-10-04 16:22:07 -07:00
Brian Behlendorf 3050c9314f Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

This change was originally part of cd5ca4b but was reverted by
330fe01.  It always should have had its own commit for exactly
this reason.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 12:27:09 -07:00
Brian Behlendorf 9b51f21841 Remove TQ_SLEEP -> KM_SLEEP mapping
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP.  Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags.  When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.

Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts.  Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
2012-09-12 11:41:42 -07:00
Brian Behlendorf 330fe010e4 Revert "Switch KM_SLEEP to KM_PUSHPAGE"
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 10:07:48 -07:00
Brian Behlendorf 3c60f5054c Debug cv_destroy() with mutex held
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast().  We had some troubles with this previously but
there may still be a small race, see commit d599e4f.

The following patch closes one small race and improves the ASSERTs
such that they log the offending value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
2012-09-10 10:23:26 -07:00
Brian Behlendorf 95331f4437 Set KMC_NOEMERGENCY for zlib workspaces
The workspace required by zlib to perform compression is roughly
512MB (order-7).  These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.

It is far preferable to asynchronously vmalloc an additional slab
in case it's needed.  Then simply block waiting for an existing
object to be released or for the new slab to be allocated.

This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
2012-09-07 14:36:26 -07:00
Brian Behlendorf cb5c2acebb Add KMC_NOEMERGENCY slab flag
Provide a flag to disable the use of emergency objects for a
specific kmem cache.  There may be instances where under no
circumstances should you kmalloc() an emergency object.  For
example, when you cache contains very large objects (>128k).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-07 14:27:03 -07:00
Brian Behlendorf 46b3945d5d Suppress task_hash_table_init() large allocation warning
When various kernel debuging options are enabled this allocation
may be larger than usual as shown by the following warning.  It
is in no way harmful so we suppress the warning.

  SPL: large kmem_alloc(40960, 0x80d0) at
  tsd_hash_table_init:358 (76495/76495)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #93
2012-08-30 21:02:52 -07:00
Brian Behlendorf cd5ca4b2f8 Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 500e95c884 Revert "Disable vmalloc() direct reclaim"
This reverts commit 2092cf68d8.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 617f79de6a Revert "Fix NULL deref in balance_pgdat()"
This reverts commit b8b6e4c453.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf bc03e07a7c Revert "Detect kernels that honor gfp flags passed to vmalloc()"
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address.  Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf d47e664ad4 Revert "Add TASKQ_NORECLAIM flag"
This reverts commit 372c257233.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't prevent the posibility
of the worker thread from deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  Is they accumulate on a system and the list
grows freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.

Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These value should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock.   This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Prakash Surya 08850eddcb Avoid calling smp_processor_id in spl_magazine_age
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.

The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 09:43:22 -07:00
Richard Yao 15d0411297 Remove Makefile from non-toplevel .gitignore files
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files result in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.

Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #152
2012-08-23 12:49:04 -07:00
Prakash Surya 9baf44bc17 Wrap trace_set_debug_header in trace_[get|put]_tcd
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #160
2012-08-23 10:01:20 -07:00
Richard Yao 6576a1a70d Fix incorrect type in spl_kmem_cache_set_move() parameter
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2012-08-01 16:35:18 -07:00
Etienne Dechamps a9f2397ee9 Determine the hostid on demand.
Currently, the SPL tries to determine the hostid at module load. The
hostid is usually determined by running the userland program "hostid"
during module initialization.

Unfortunately, when the module initializes, it may be way too soon to be
able to run any userland programs. This is especially true when the
module is compiled directly inside the kernel (built-in); in that case,
the SPL would try to run hostid when the kernel is still initializing,
which of course is doomed to fail.

This patch fixes the issue by deferring hostid generation until
something actually needs the hostid (that is, when zone_get_hostid() is
called), thus switching to a "on-initialization" model to a "on-demand"
(lazy loading) model. ZFS only needs the hostid when some pool
operations are requested, and this always happens way after the kernel
has finished initialization, thus solving the problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:14:02 -07:00
Etienne Dechamps c167aadb27 Add script for builtin module building.
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building SPL as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the spl source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the spl
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:13:09 -07:00
Etienne Dechamps 38b5ff4d07 Fix undefined reference on spl_mutex_spin_max().
Commit 3160d4f56b changed the set of
conditions under which spl_mutex_spin_max would be implemented as a
function by changing an #if in sys/mutex.h. The corresponding
implementation file spl-mutex.c, however, has not been updated to
reflect the change. This results in undefined reference errors on
spl_mutex_spin_max under the following condition:

((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR)

This patch fixes the issue by using the same #if in sys/mutex.h and
spl-mutex.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:54:53 -07:00
Etienne Dechamps 94aac9c9bc Use MODULE variable in module Makefile like zfs.
In zfs, each module Makefile contains a MODULE variable which contains
the name of the module, and the following declarations reference this
variable.

In spl, there is a MODULES variable which is never used. Rename it to
MODULE and use it like in zfs. This improves consistency between the two
build systems.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:53:48 -07:00
Brian Behlendorf e8267acd25 32-bit compat, hostid_read()
Explicitly cast the sizeof in hostid_read() to prevent the
following compiler warning on 32-bit systems.

  module/spl/spl-generic.c:490:10: error: format '%lu' expects
  argument of type 'long unsigned int', but argument 4 has type
  'unsigned int' [-Werror=format]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 11:14:04 -07:00
Richard Yao 36811b4430 Detect kernels that honor gfp flags passed to vmalloc()
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to workaround a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.

A kernel patch is available in the Gentoo bug tracker for
this issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #126
2012-07-11 11:44:27 -07:00
Richard Yao 973e8269bd Constify memory management functions
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #131
2012-07-03 16:07:27 -07:00
Brian Behlendorf 44e406d712 PowerPC Compatibility
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #119
2012-07-02 09:33:09 -07:00
Brian Behlendorf 2371321e8a Fix invalid context bug
In the module unload path the vm_file_cache was being destroyed
under a spin lock.  Because this operation might sleep it was
possible, although very very unlikely, that this could result
in a deadlock.

This issue was indentified by using a Linux debug kernel and
has been fixed by moving the kmem_cache_destroy() out from under
the spin lock.  There is no need to lock this operation here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#771
2012-06-11 09:17:45 -07:00
Jorgen Lundman 93b0dc92ea Fix ARM 64-bit division
Correctly implementating 64-bit division for ARM requires more than
just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols.
They are need to be implemented is such a way that the quotient and
remainder and left in specific registers after the division operation
completes.  This change updates the wrapper functions to accomplish
this according to the official ARM Run-time ABI.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#706
2012-05-22 09:27:11 -07:00
Brian Behlendorf 38d31a1e57 Remove Solaris module emulation
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #109
2012-05-18 13:57:44 -07:00
Brian Behlendorf b78d4b9d98 Ensure a minimum of one slab is reclaimed
To minimize the chance of triggering an OOM during direct reclaim.
The kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered.  This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.

Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress.  Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:54:28 -07:00
Brian Behlendorf 06089b9e19 Ensure direct reclaim forward progress
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made.  Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations.  However, since we are using none
of that infrastructure we're responsible for incrementing this
count.

If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked.  If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:54:19 -07:00
Prakash Surya c0e0fc14e3 Ignore slab cache age and delay in direct reclaim
When memory pressure triggers direct memory reclaim, a slabs age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.

This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage.  In addition,
it should allow for snappier reclaim under memory pressure.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #102
2012-05-07 11:50:04 -07:00
Prakash Surya cef7605c34 Throttle number of freed slabs based on nr_to_scan
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.

Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #101
2012-05-07 11:46:15 -07:00
Jorgen Lundman ef6f91ce0c Add missing 64-bit divide for 32-bit ARM
Leverage the existing generic 64-bit division operations which
were originally implemented for x86 to support ARM.  All that is
required is to make the symbols available to the linker with the
expected names.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 10:07:54 -07:00
Prakash Surya 05b8f50c33 Update a comment to reflect new taskq internals
As of the removal of the taskq work list made in commit:

    commit 2c02b71b14
    Author: Prakash Surya <surya1@llnl.gov>
    Date:   Mon Dec 5 17:32:48 2011 -0800

        Replace tq_work_list and tq_threads in taskq_t

        To lay the ground work for introducing the taskq_dispatch_prealloc()
        interface, the tq_work_list and tq_threads fields had to be replaced
        with new alternatives in the taskq_t structure.

the comment above taskq_wait_check has been incorrect. This change is an
attempt at bringing that description more in line with the current
implementation. Essentially, references to the old task work list had to
be updated to reference the new taskq thread active list.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2012-04-30 10:49:15 -07:00
Brian Behlendorf b29012b999 Remove condition variable names
Long ago I added support to the spl for condition variable names
because I thought they might be needed.  It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.

  cv_init(9F)
    Parameters
      name - Descriptive string. This is obsolete and should be
             NULL. (Non-NULL strings are legal, but they're a
             waste of kernel memory.)

Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-06 12:06:19 -07:00
Brian Behlendorf 0835057ee7 Add SPL_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:11:50 -07:00
Brian Behlendorf 9a8b7a7458 Add basic dynamic kstat support
Add the bare minimum functionality to support dynamic kstats.  A
complete kstat implementation should be done as part of issue #84.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
2012-02-02 11:28:00 -08:00
Brian Behlendorf 4b2220f0b9 Add --enable-debug-log configure option
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s.  This patch clarifies things
by cleanly breaking the two subsystem apart.  The result of this
is the following behavior.

--enable-debug      - Enable/disable code wrapped in ASSERT()s.
--disable-debug       ASSERT()s are used to check invariants and
                      are never required for correct operation.
                      They are disabled by default because they
                      may impact performance.

--enable-debug-log  - Enable/disable the debug log infrastructure.
--disable-debug-log   This infrastructure allows the spl code and
                      its consumer to log messages to an in-kernel
                      log.  The granularity of the logging can be
                      controlled by a debug mask.  By default the
                      mask disables most debug messages resulting
                      in a negligible performance impact.  Because
                      of this the debug log is enabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:27:54 -08:00
Ned Bass 3c6ed5410b Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq

Taskq threads need only add themselves to the work wait queue if
there are no pending work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:49 -08:00
Ned Bass 0bb43ca282 Revert "Taskq locking optimizations"
This reverts commit ec2b41049f.

A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there is no pending work items,
leading to deadlock:

1. taksq thread enables interrupts
2. dispatcher thread runs, queues work item, call wake_up()
3. taskq thread runs, adds self to waitq, sleeps

This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive().  The handler would
run immediately within the race window.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:39 -08:00
Ned Bass ec2b41049f Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq outside of tq_lock

Taskq threads need only add themselves to the work wait queue if there
are no pending work items.  Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.

3) Call wake_up() outside of tq->tq_lock

Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().

A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes.  The following table summarizes the
results using data from the kernel lock profiler.

                        tq_lock time    %diff   Wall clock (s)  %diff
original:               39117614.10     0       41.72           0
exclusive threads:      31871483.61     18.5    34.2            18.0
unlocked add/rm waitq:  13794303.90     64.7    16.17           61.2
unlocked wake_up():     1589172.08      95.9    16.61           60.2

Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows.  %diff is calculated with respect to "original" as
100*(orig-new)/orig.

Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time.  This is because the point of contention shifts from
the taskq lock to the wait queue lock.  But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #32
2012-01-18 10:36:57 -08:00
Brian Behlendorf 5f6c14b1ed Proxmox VE kernel compat, invalidate_inodes()
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check().  In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers.  Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #58
2011-12-21 14:29:45 -08:00