Commit Graph

349 Commits

Author SHA1 Message Date
Brian Behlendorf 043f9b5724 Disable FS reclaim when allocating new slabs
Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks.  Therefore, we clear __GFP_FS
for these allocations.  The two deadlocks are:

* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
  calls kmem_cache_alloc() which happens to need to allocate a
  new slab.  To allocate the new slab we enter FS level reclaim
  and attempt to evict several inodes.  To evict these inodes we
  need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
  just happens that obj1 and obj2 use the same hashed lock.

* Similar to the first case however instead of getting blocked
  on the hash lock we block in txg_wait_open() which is waiting
  for the next txg which isn't coming because the txg_sync
  thread is blocked in kmem_cache_alloc().

Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS.  However, it practice this is sufficient because
several very unlikely things must all occur concurrently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
2012-11-27 13:43:27 -08:00
Brian Behlendorf dc1b30224f Never spin in kmem_cache_alloc()
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.

Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock.  When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 15:48:39 -08:00
Brian Behlendorf a1af8fb1ea Optimize spl_kmem_cache_free()
Because only virtual slabs may have emergency objects and these
objects are guaranteed to have physical addresses.  It can be
easily determined if the passed object is a virtual slab object
or an emergency object.  This allows us to completely optimize
the emergency object free case out of the common free path.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf ed3163484d Track emergency object in rbtree
In the initial implementation emergency objects were tracked on a
per-cache list.  The assumption was that under normal operation we
would never allocate more than a handful of these objects.  So the
cost of walking the list during free was expected to be negligible.

However real world usage has shown that emergency objects tend to
be allocated in batches.  A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.

Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimize contention on the assoicated spin lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf 165f13c33a Improved vmem cached deadlock detection
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks.  In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.

With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time.  Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects.  This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.

The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:15 -08:00
Brian Behlendorf 1112486356 splat kmem:slab_overcommit: Disabled
Disable this test because it may result in an OOM event on the
system which can result in the test infrastructure being killed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:57 -08:00
Brian Behlendorf b8296bf3e6 splat atomic:64-bit: Create thread outside spin lock
The Fedora 3.6 debug kernel identified the following issue where
we create a thread under a spin lock.  This isn't safe because
sleeping could result in a deadlock.  Therefore the lock is changed
to a mutex so it's safe to sleep.

  BUG: sleeping function called from invalid context at mm/slub.c:930
  in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat
  1 lock held by splat/10583:

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:57 -08:00
Brian Behlendorf 0e149d4204 splat: Fix log buffer locking
The Fedora 3.6 debug kernel identified the following issue where
we call copy_to_user() under a spin lock().  This used to be safe
in older kernels but no longer appears to be true so the spin
lock was changed to a mutex.  None of this code is performance
critical so allowing the process to sleep is harmless.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:56 -08:00
Brian Behlendorf df870a697f splat: Cleanup headers
Restructure the the SPLAT headers such that each test only
includes the minimal set of headers it requires.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:56 -08:00
Brian Behlendorf d2733258d0 Condition variable reference counts
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().

This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.

In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:55 -08:00
Brian Behlendorf dba79fcbf2 Add KSTAT_TYPE_TXG type
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to tracks the following
statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:17:40 -07:00
Brian Behlendorf 71c9f0b003 Make kstat.ks_update() callback atomic
Move the kstat ks_update() callback under the ks_lock.  This
enables dynamically sized kstats without modification to the
kstat API.

  * Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
  * Register a ->ks_update() callback which does:
    o Frees any existing ks_data buffer.
    o Set ks_data_size to the kstat array size.
    o Set ks_data to an allocated buffer of size ks_data_size
    o Populate the array of buffers with the required data.

The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer.  The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access.  To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-23 09:36:19 -07:00
Brian Behlendorf 1e0c2c2ccf Linux 3.7 compat, __clear_close_on_exec() removed
Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec()
function out of include/linux/fdtable.h and in to fs/file.c
making it unavailable to the SPL.

Now as it turns out we only used this function to tear down
some test infrastructure for the vn_getf()/vn_releasef() SPLAT
regression tests.  Rather than implement even more autoconf
compatibilty code to handle this we just remove the test case.
This also allows us to drop three existing autoconf tests.

This does mean the SPLAT tests will no longer verify these
functions but historically they have never been a problem.
And if we feel we absolutely need this test coverage I'm
sure a more portable version of the test case could be added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #183
2012-10-18 13:36:44 -07:00
Yuxuan Shui bcb15891ab Linux 3.6 compat, kern_path_locked() added
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154
2012-10-14 16:26:21 -07:00
Massimo Maggi dea3505dff Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 16:22:29 -07:00
Etienne Dechamps bbdc6ae495 Add interface for file hole punching.
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.

This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.

This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.

On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().

This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #168
2012-10-04 16:22:07 -07:00
Brian Behlendorf 3050c9314f Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

This change was originally part of cd5ca4b but was reverted by
330fe01.  It always should have had its own commit for exactly
this reason.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 12:27:09 -07:00
Brian Behlendorf 9b51f21841 Remove TQ_SLEEP -> KM_SLEEP mapping
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP.  Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags.  When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.

Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts.  Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
2012-09-12 11:41:42 -07:00
Brian Behlendorf 330fe010e4 Revert "Switch KM_SLEEP to KM_PUSHPAGE"
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 10:07:48 -07:00
Brian Behlendorf 3c60f5054c Debug cv_destroy() with mutex held
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast().  We had some troubles with this previously but
there may still be a small race, see commit d599e4f.

The following patch closes one small race and improves the ASSERTs
such that they log the offending value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
2012-09-10 10:23:26 -07:00
Brian Behlendorf 95331f4437 Set KMC_NOEMERGENCY for zlib workspaces
The workspace required by zlib to perform compression is roughly
512MB (order-7).  These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.

It is far preferable to asynchronously vmalloc an additional slab
in case it's needed.  Then simply block waiting for an existing
object to be released or for the new slab to be allocated.

This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
2012-09-07 14:36:26 -07:00
Brian Behlendorf cb5c2acebb Add KMC_NOEMERGENCY slab flag
Provide a flag to disable the use of emergency objects for a
specific kmem cache.  There may be instances where under no
circumstances should you kmalloc() an emergency object.  For
example, when you cache contains very large objects (>128k).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-07 14:27:03 -07:00
Brian Behlendorf 46b3945d5d Suppress task_hash_table_init() large allocation warning
When various kernel debuging options are enabled this allocation
may be larger than usual as shown by the following warning.  It
is in no way harmful so we suppress the warning.

  SPL: large kmem_alloc(40960, 0x80d0) at
  tsd_hash_table_init:358 (76495/76495)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #93
2012-08-30 21:02:52 -07:00
Brian Behlendorf efcd0ca32d Enhance SPLAT kmem:slab_overcommit test
After the emergency slab objects were merged I started observing
timeout failures in the kmem:slab_overcommit test.  These were
due to the ineffecient way the slab_overcommit reclaim function
was implemented.  And due to the additional cost of potentially
allocating ten of thousands of emergency objects and tracking
them on a single list.

This patch addresses the first concern by enhansing the test
case to trace all of the allocations objects as a linked list.
This allows for a cleaner version of the reclaim function to
simply release SPLAT_KMEM_OBJ_RECLAIM objects.

Since this touches some common code all the tests which share
these data structions were also updated.  After making these
changes slab_overcommit is reliably passing.  However, there
is certainly additional cleanup which could be done here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-30 15:49:00 -07:00
Brian Behlendorf cd5ca4b2f8 Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 500e95c884 Revert "Disable vmalloc() direct reclaim"
This reverts commit 2092cf68d8.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 617f79de6a Revert "Fix NULL deref in balance_pgdat()"
This reverts commit b8b6e4c453.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf bc03e07a7c Revert "Detect kernels that honor gfp flags passed to vmalloc()"
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address.  Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf d47e664ad4 Revert "Add TASKQ_NORECLAIM flag"
This reverts commit 372c257233.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't prevent the posibility
of the worker thread from deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  Is they accumulate on a system and the list
grows freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.

Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These value should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock.   This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Prakash Surya 08850eddcb Avoid calling smp_processor_id in spl_magazine_age
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.

The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 09:43:22 -07:00
Richard Yao 15d0411297 Remove Makefile from non-toplevel .gitignore files
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files result in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.

Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #152
2012-08-23 12:49:04 -07:00
Prakash Surya 9baf44bc17 Wrap trace_set_debug_header in trace_[get|put]_tcd
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #160
2012-08-23 10:01:20 -07:00
Richard Yao 6576a1a70d Fix incorrect type in spl_kmem_cache_set_move() parameter
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2012-08-01 16:35:18 -07:00
Etienne Dechamps a9f2397ee9 Determine the hostid on demand.
Currently, the SPL tries to determine the hostid at module load. The
hostid is usually determined by running the userland program "hostid"
during module initialization.

Unfortunately, when the module initializes, it may be way too soon to be
able to run any userland programs. This is especially true when the
module is compiled directly inside the kernel (built-in); in that case,
the SPL would try to run hostid when the kernel is still initializing,
which of course is doomed to fail.

This patch fixes the issue by deferring hostid generation until
something actually needs the hostid (that is, when zone_get_hostid() is
called), thus switching to a "on-initialization" model to a "on-demand"
(lazy loading) model. ZFS only needs the hostid when some pool
operations are requested, and this always happens way after the kernel
has finished initialization, thus solving the problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:14:02 -07:00
Etienne Dechamps c167aadb27 Add script for builtin module building.
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building SPL as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the spl source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the spl
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:13:09 -07:00
Etienne Dechamps 38b5ff4d07 Fix undefined reference on spl_mutex_spin_max().
Commit 3160d4f56b changed the set of
conditions under which spl_mutex_spin_max would be implemented as a
function by changing an #if in sys/mutex.h. The corresponding
implementation file spl-mutex.c, however, has not been updated to
reflect the change. This results in undefined reference errors on
spl_mutex_spin_max under the following condition:

((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR)

This patch fixes the issue by using the same #if in sys/mutex.h and
spl-mutex.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:54:53 -07:00
Etienne Dechamps 94aac9c9bc Use MODULE variable in module Makefile like zfs.
In zfs, each module Makefile contains a MODULE variable which contains
the name of the module, and the following declarations reference this
variable.

In spl, there is a MODULES variable which is never used. Rename it to
MODULE and use it like in zfs. This improves consistency between the two
build systems.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:53:48 -07:00
Brian Behlendorf e8267acd25 32-bit compat, hostid_read()
Explicitly cast the sizeof in hostid_read() to prevent the
following compiler warning on 32-bit systems.

  module/spl/spl-generic.c:490:10: error: format '%lu' expects
  argument of type 'long unsigned int', but argument 4 has type
  'unsigned int' [-Werror=format]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 11:14:04 -07:00
Richard Yao 36811b4430 Detect kernels that honor gfp flags passed to vmalloc()
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to workaround a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.

A kernel patch is available in the Gentoo bug tracker for
this issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #126
2012-07-11 11:44:27 -07:00
Richard Yao 973e8269bd Constify memory management functions
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #131
2012-07-03 16:07:27 -07:00
Brian Behlendorf 44e406d712 PowerPC Compatibility
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #119
2012-07-02 09:33:09 -07:00
Richard Yao e0093fea58 Linux 3.4 compat, __clear_close_on_exec replaces FD_CLR
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree.  This
causes the following failure:

  error: implicit declaration of function '__FD_CLR'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current
__clear_close_on_exec() interface for readability.  Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which used the older
FD_CLR() interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #124
2012-06-13 16:18:51 -07:00
Brian Behlendorf eaac9ba510 Fix uninit variable in slab reclaim test
Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test
as potentially unitialized.  In practice this will never occur but
to keep gcc happy we initialize the variable to zero.

Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>
2012-06-13 16:17:22 -07:00
Brian Behlendorf 2371321e8a Fix invalid context bug
In the module unload path the vm_file_cache was being destroyed
under a spin lock.  Because this operation might sleep it was
possible, although very very unlikely, that this could result
in a deadlock.

This issue was indentified by using a Linux debug kernel and
has been fixed by moving the kmem_cache_destroy() out from under
the spin lock.  There is no need to lock this operation here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#771
2012-06-11 09:17:45 -07:00
Jorgen Lundman 93b0dc92ea Fix ARM 64-bit division
Correctly implementating 64-bit division for ARM requires more than
just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols.
They are need to be implemented is such a way that the quotient and
remainder and left in specific registers after the division operation
completes.  This change updates the wrapper functions to accomplish
this according to the official ARM Run-time ABI.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#706
2012-05-22 09:27:11 -07:00
Brian Behlendorf 38d31a1e57 Remove Solaris module emulation
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #109
2012-05-18 13:57:44 -07:00
Prakash Surya a9a7a01cf5 Add SPLAT test to exercise slab direct reclaim
This test is designed to verify that direct reclaim is functioning as
expected.  We allocate a large number of objects thus creating a large
number of slabs.  We then apply memory pressure and expect that the
direct reclaim path can easily recover those slabs.  The registered
reclaim function will free the objects and the slab shrinker will call
it repeatedly until at least a single slab can be freed.

Note it may not be possible to reclaim every last slab via direct reclaim
without a failure because the shrinker_rwsem may be contended.  For this
reason, quickly reclaiming 3/4 of the slabs is considered a success.

This should all be possible within 10 seconds.  For reference, on a
system with 2G of memory this test takes roughly 0.2 seconds to run.
It may take longer on larger memory systems but should still easily
complete in the alloted 10 seconds.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:55:59 -07:00
Brian Behlendorf b78d4b9d98 Ensure a minimum of one slab is reclaimed
To minimize the chance of triggering an OOM during direct reclaim.
The kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered.  This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.

Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress.  Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:54:28 -07:00
Brian Behlendorf 06089b9e19 Ensure direct reclaim forward progress
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made.  Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations.  However, since we are using none
of that infrastructure we're responsible for incrementing this
count.

If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked.  If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:54:19 -07:00
Prakash Surya c0e0fc14e3 Ignore slab cache age and delay in direct reclaim
When memory pressure triggers direct memory reclaim, a slabs age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.

This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage.  In addition,
it should allow for snappier reclaim under memory pressure.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #102
2012-05-07 11:50:04 -07:00
Prakash Surya cef7605c34 Throttle number of freed slabs based on nr_to_scan
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.

Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #101
2012-05-07 11:46:15 -07:00
Jorgen Lundman ef6f91ce0c Add missing 64-bit divide for 32-bit ARM
Leverage the existing generic 64-bit division operations which
were originally implemented for x86 to support ARM.  All that is
required is to make the symbols available to the linker with the
expected names.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 10:07:54 -07:00
Prakash Surya 05b8f50c33 Update a comment to reflect new taskq internals
As of the removal of the taskq work list made in commit:

    commit 2c02b71b14
    Author: Prakash Surya <surya1@llnl.gov>
    Date:   Mon Dec 5 17:32:48 2011 -0800

        Replace tq_work_list and tq_threads in taskq_t

        To lay the ground work for introducing the taskq_dispatch_prealloc()
        interface, the tq_work_list and tq_threads fields had to be replaced
        with new alternatives in the taskq_t structure.

the comment above taskq_wait_check has been incorrect. This change is an
attempt at bringing that description more in line with the current
implementation. Essentially, references to the old task work list had to
be updated to reference the new taskq thread active list.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2012-04-30 10:49:15 -07:00
Brian Behlendorf b29012b999 Remove condition variable names
Long ago I added support to the spl for condition variable names
because I thought they might be needed.  It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.

  cv_init(9F)
    Parameters
      name - Descriptive string. This is obsolete and should be
             NULL. (Non-NULL strings are legal, but they're a
             waste of kernel memory.)

Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-06 12:06:19 -07:00
Brian Behlendorf 0835057ee7 Add SPL_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:11:50 -07:00
Brian Behlendorf 9a8b7a7458 Add basic dynamic kstat support
Add the bare minimum functionality to support dynamic kstats.  A
complete kstat implementation should be done as part of issue #84.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
2012-02-02 11:28:00 -08:00
Brian Behlendorf 4b2220f0b9 Add --enable-debug-log configure option
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s.  This patch clarifies things
by cleanly breaking the two subsystem apart.  The result of this
is the following behavior.

--enable-debug      - Enable/disable code wrapped in ASSERT()s.
--disable-debug       ASSERT()s are used to check invariants and
                      are never required for correct operation.
                      They are disabled by default because they
                      may impact performance.

--enable-debug-log  - Enable/disable the debug log infrastructure.
--disable-debug-log   This infrastructure allows the spl code and
                      its consumer to log messages to an in-kernel
                      log.  The granularity of the logging can be
                      controlled by a debug mask.  By default the
                      mask disables most debug messages resulting
                      in a negligible performance impact.  Because
                      of this the debug log is enabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:27:54 -08:00
Ned Bass 3c6ed5410b Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq

Taskq threads need only add themselves to the work wait queue if
there are no pending work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:49 -08:00
Ned Bass 0bb43ca282 Revert "Taskq locking optimizations"
This reverts commit ec2b41049f.

A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there is no pending work items,
leading to deadlock:

1. taksq thread enables interrupts
2. dispatcher thread runs, queues work item, call wake_up()
3. taskq thread runs, adds self to waitq, sleeps

This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive().  The handler would
run immediately within the race window.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:39 -08:00
Ned Bass ec2b41049f Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq outside of tq_lock

Taskq threads need only add themselves to the work wait queue if there
are no pending work items.  Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.

3) Call wake_up() outside of tq->tq_lock

Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().

A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes.  The following table summarizes the
results using data from the kernel lock profiler.

                        tq_lock time    %diff   Wall clock (s)  %diff
original:               39117614.10     0       41.72           0
exclusive threads:      31871483.61     18.5    34.2            18.0
unlocked add/rm waitq:  13794303.90     64.7    16.17           61.2
unlocked wake_up():     1589172.08      95.9    16.61           60.2

Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows.  %diff is calculated with respect to "original" as
100*(orig-new)/orig.

Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time.  This is because the point of contention shifts from
the taskq lock to the wait queue lock.  But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #32
2012-01-18 10:36:57 -08:00
Ned Bass cf5d23fa1e Add taskq contention splat test
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue.  This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example.  This test should always pass.  Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-18 10:36:51 -08:00
Darik Horn 966e5200d3 Fix `make distclean` for `--with-config=user`
Apply the same fix to SPL that was applied to ZFS earlier at:
zfsonlinux/zfs@d433c20651

Additionally quote @LINUX_SYMBOLS@ because it is a null substitution
in this configuration, which results in a `[ -f  ]` expression that
incorrectly evaluates to true.

  # ./configure --with-config=user
  # make distclean

  Making distclean in module
  make[1]: Entering directory `/spl/module'
  make -C  SUBDIRS=`pwd`  clean
  make: Entering an unknown directory
  make: *** SUBDIRS=/spl/module: No such file or directory.  Stop.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-17 10:06:00 -08:00
Brian Behlendorf 5f6c14b1ed Proxmox VE kernel compat, invalidate_inodes()
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check().  In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers.  Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #58
2011-12-21 14:29:45 -08:00
Prakash Surya 8f2503e0af Store copy of tqent_flags prior to servicing task
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.

In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).

In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.

Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #71
2011-12-16 16:54:00 -08:00
Prakash Surya e7e5f78e7b Swap taskq_ent_t with taskqid_t in taskq_thread_t
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.

Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.

Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).

To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
2011-12-16 13:26:54 -08:00
Prakash Surya 699d5ee8a9 Exercise new taskq interface in splat-taskq tests
The splat-taskq test functions were slightly modified to exercise
the new taskq interface in addition to the old interface.  If the
old interface passes each of its tests, the new interface is
exercised.  Both sub tests (old interface and new interface) must
pass for each test as a whole to pass.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #65
2011-12-13 16:10:57 -08:00
Prakash Surya 44217f7aad Implement taskq_dispatch_prealloc() interface
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit.  It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq.  This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.

    commit 5aeb94743e3be0c51e86f73096334611ae3a058e
    Author: Garrett D'Amore <garrett@nexenta.com>
    Date:   Wed Jul 27 07:13:44 2011 -0700

    734 taskq_dispatch_prealloc() desired
    943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya ac1e5b6033 Add Test: "Single task queue, recursive dispatch"
Added another splat taskq test to ensure tasks can be recursively
submitted to a single task queue without issue. When the
taskq_dispatch_prealloc() interface is introduced, this use case
can potentially cause a deadlock if a taskq_ent_t is dispatched
while its tqent_list field is not empty. This _should_ never be
a problem with the existing taskq_dispatch() interface.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya 2c02b71b14 Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.

The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.

The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:50 -08:00
Prakash Surya 046a70c93b Replace struct spl_task with struct taskq_ent
The spl_task structure was renamed to taskq_ent, and all of
its fields were renamed to have a prefix of 'tqent' rather
than 't'. This was to align with the naming convention which
the ZFS code assumes.  Previously these fields were private
so the name never mattered.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 12:28:09 -08:00
Prakash Surya ed948fa72b Add SPLAT_TEST_FINI call for SPLAT_TASKQ_TEST6_ID
This change adds the neglected SPLAT_TEST_FINI call for the
SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_*
tests.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #64
2011-12-13 12:26:16 -08:00
Prakash Surya e05bec805b Fix a typo referencing an incorrect symbol
The splat_taskq_test4_common function was incorrectly referencing
the splat_taskq-test13_func symbol, when it meant to be using the
splat_taskq_test4_func symbol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #61
2011-11-21 16:52:36 -08:00
Brian Behlendorf 1114ae6ae7 Prepend spl_ to all init/fini functions
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict.  Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #56
2011-11-11 09:18:28 -08:00
Brian Behlendorf fe71c0e567 Linux 3.1 compat, shrink_*cache_memory
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed.  This same task is now accomplished
more cleanly with per super block shrinkers.  This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.

This support has always been entirely optional.  So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.

The downside of this change is that it will prevent the zfs arc
meta data limts from being enforced.  However, the current zfs
implementation in this regard is already flawed and needs to
be reworked.  If the arc needs to enfore a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 19:36:30 -08:00
Brian Behlendorf 12ff95ff57 Linux 3.1 compat, kern_path_parent()
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules.  As of Linux 3.1 it is now longer easily
available.  To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 16:51:25 -08:00
Brian Behlendorf b8b6e4c453 Fix NULL deref in balance_pgdat()
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
kv_alloc() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #287
2011-11-03 09:50:22 -07:00
Gunnar Beutner f3989ed322 vn_rdwr() didn't properly advance the file position
This would cause problems when using 'zfs send' with a file as the
target (rather than a pipe or a socket as is usually the case) as
for each write the destination offset in the file would be 0.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #391
2011-10-18 16:51:35 -07:00
Brian Behlendorf ecc3981007 Fix various typos in comments
Just clean up some of the typos and spelling mistakes in the
comments of spl-kmem.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:32:49 -07:00
Gunnar Beutner 8d177c181f Fixed typo in spl_slab_alloc()
The typo did not have any effect (apart from a negligible performance
impact) because skc->skc_flags * KMC_OFFSLAB is always non-null when
at least one bit in skc->skc_flags is set.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:03:43 -07:00
Gunnar Beutner 64c075c3f4 Properly destroy work items in spl_kmem_cache_destroy()
In a non-debug build the ASSERT() would be optimized away
which could cause pending work items to not be cancelled.

We must also use cancel_delayed_work_sync() rather than just
cancel_delayed_work() to actually wait until work items have
completed.  Otherwise they might accidentally access free'd
memory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS bugs #279, #62, #363, #418
2011-10-11 09:59:19 -07:00
Gunnar Beutner 763b2f3b57 Fixed invalid resource re-use in file_find()
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. find_file()
incorrectly assumed that file descriptors are globally unique.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
2011-10-11 09:51:51 -07:00
Brian Behlendorf 6b3b569df3 Remove /etc/hostid missing warning
No longer print the following warning to the console when the
/etc/hostid file is missing.  This is the expected default behavior.
Keeping the hostid in sync with the initramfs is now accomplished
by creating the /etc/hostid in the initramfs not on the system.

  SPL: The /etc/hostid file is not found.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-06 14:58:09 -07:00
Brian Behlendorf e80cd06b8e Fix 'make install' overly broad 'rm'
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels.  This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against.  This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.

The fix for this issue is to only remove extraneous build products
when DESTDIR is set.  This almost exclusively indicates we are
building packages and installed the build products in to a temporary
staging location.  Additionally, limit the removal the unneeded
build products to the target kernel version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
2011-07-20 09:37:41 -07:00
Darik Horn 0d54dcb566 Read the /etc/hostid file directly.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.

Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.

Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
2011-06-24 09:58:03 -07:00
Brian Behlendorf bf0c60c060 Add linux compatibility tests
While the splat tests were originally designed to stress test
the Solaris primatives.  I am extending them to include some kernel
compatibility tests.  Certain linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regression have slipped in.

Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory.  These are simply functional tests to
ensure we can call these functions safely.  Checking for correct
behavior is more difficult since other running processes will
influence the behavior.  However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.

Test 3 checks that shrinker functions are being registered and
called correctly.  As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
2011-06-21 14:02:46 -07:00
Brian Behlendorf a55bcaad18 Linux 3.0: Shrinker compatibility
Update the the wrapper macros for the memory shrinker to handle
this 4th API change.  The callback function now takes a
shrink_control structure.  This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
2011-06-21 14:02:39 -07:00
Brian Behlendorf 372c257233 Add TASKQ_NORECLAIM flag
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs.  To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
2011-05-06 15:23:58 -07:00
Darik Horn c95b308d12 Correct typos in the spl proc handler.
Correct a format typo that causes /proc/sys/kernel/spl/hostid
to return a decimal number instead of a hexadecimal number.
2011-04-24 20:56:07 -05:00
Darik Horn 5b8f76ea16 Make the SPL kernel messages consistent with ZFS.
Change the SPL kernel messages for module loading and module
unloading so that they are similar to the ZFS kernel messages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:13 -07:00
Darik Horn ad35b6a6e9 Remove the gawk dependency.
This reverts commit 1814251453.

Demote the gawk call back to awk and ensure that stderr is attached.  GNU gawk
tolerates a missing stderr handle, but many utilities do not, which could be
why a regular awk call was unexplainably failing on some systems.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:09 -07:00
Darik Horn fa6f7d8f9d Import spl_hostid as a module parameter.
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:

  $ modprobe spl spl_hostid=0x12345678

Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.

Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:01 -07:00
Brian Behlendorf 3dfc591ac4 Linux 2.6.39 compat, zlib_deflate_workspacesize()
The function zlib_deflate_workspacesize() now take 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K).  The caller can now specific the windowBits and
memLevel compression parameters to get a smaller workspace.

For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments.  When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through.  When it's not we assume the worst case and
a maximally sized workspace is used.
2011-04-20 14:39:15 -07:00
Brian Behlendorf b1cbc4610c Linux 2.6.39 compat, kern_path_parent()
The path_lookup() function has been renamed to kern_path_parent()
and the flags argument has been removed.  The only behavior now
offered is that of LOOKUP_PARENT.  The spl already always passed
this flag so dropping the flag does not impact us.
2011-04-20 12:30:17 -07:00
Brian Behlendorf 83c623aa1a Linux 2.6.39 compat, DEFINE_SPINLOCK()
This is a long over due compatibility change.  Way, way, way back
in 2007 there was a push to remove all consumers of SPIN_LOCK_UNLOCKED.
Finally, in 2011 with 2.6.39 all the consumers have been updated
and SPIN_LOCK_UNLOCKED was removed.  It's about time we use the
new API as well, this change does exactly that.  DEFINE_SPINLOCK()
was available as far back as 2.6.12 so there doesn't need to be
any additional autoconf-foo for this change.
2011-04-20 12:01:11 -07:00
Brian Behlendorf 98e2afd1c5 Fix unused variable
Flagged by the default -Wunused-but-set-variable gcc option when
running under Fedora 15.  Since it's correct this variable is
entirely unused this commit removes it.
2011-04-19 09:45:36 -07:00
Brian Behlendorf 9b0f9079d2 Linux 2.6.39 compat, invalidate_inodes()
To resolve a potiential filesystem corruption issue a second
argument was added to invalidate_inodes().  This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block.  When only the legacy API is available the second
argument will be dropped for compatibility.
2011-04-19 09:08:08 -07:00
Brian Behlendorf e76f4bf11d Add dnlc_reduce_cache() support
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache.  After the entries are
pruned any slabs which they may have been using are reaped.

Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count.  However, in practice this doesn't need to be
exactly correct.  We simply need to reclaim some useful fraction
(but not all) of the cache.  The caller can determine if more
needs to be done.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 3336e29cc2 Add slab usage summeries to /proc
One of the most common things you want to know when looking at
the slab is how much memory is being used.  This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.

  slab_kmem_total - Total kmem slab size
  slab_kmem_avail - Alloc'd kmem slab size
  slab_kmem_max   - Max observed kmem slab size
  slab_vmem_total - Total vmem slab size
  slab_vmem_avail - Alloc'd vmem slab size
  slab_vmem_max   - Max observed vmem slab size

NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
2011-04-06 20:06:03 -07:00
Brian Behlendorf d0a1038ff3 Update /proc/spl/kmem/slab output
The 'slab_fail', 'slab_create', and 'slab_destroy' columns in the slab
output have been removed because they are virtually always zero and
not very useful.

The much more useful 'size' and 'alloc' columns have been added which
show the total slab size and how much of the total size has been
allocated to objects.

Finally, the formatting has been updated to be much more human
readable while still being friendly for tool like awk to parse.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 495bd532ab Linux shrinker compat
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header.  The caller then can then use a single spl provided
shrinker API which does the right thing for your kernel.

SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
2011-04-06 20:06:03 -07:00
Brian Behlendorf 734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep to to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf 2092cf68d8 Disable vmalloc() direct reclaim
As part of vmalloc() a __pte_alloc_kernel() allocation may occur.  This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur.  This reclaim can trigger file IO which
can result in a deadlock.  This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.

An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3).  This code can be properly autoconf'ed
when the upstream bug is fixed.

1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
2011-03-20 15:12:08 -07:00
Brian Behlendorf 47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support.  By removing it from the spl we ensure not conflict
with the higher level packages.

This just leaves minimal vnode support for basical manipulation
of files.  This code is does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf 19c1eb829d Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress.  The test case simply takes
a 128k buffer, it compresses the buffer, it them uncompresses the
buffer, and finally it compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
2011-02-25 16:56:46 -08:00
Brian Behlendorf 5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place.  In reality the current implementation
was non-functional, it just was compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf 914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying it avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf 647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf a5b40eed17 Make vn_cache|vn_file_cache kmem caches
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be to large and performce is
much better for a kmem cache for them to be kmem backed.
2011-01-12 11:38:05 -08:00
Brian Behlendorf dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 4295b530ee Add vn_mode_to_vtype/vn_vtype to_mode helpers
Add simple helpers to convert a vnode->v_type to a inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare 3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf 6bf4d76f47 Linux Compat: inode->i_mutex/i_sem
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem.  This avoids the need to have to ugly
up the code with the required #define's at every call site.  At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
2011-01-11 12:14:48 -08:00
Brian Behlendorf b7dc313837 Add Thread Specific Data (TSD) Regression Test
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test.  This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.

The test will first create 32 keys via tsd_create() and register
a common destructor.  Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread.  Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.

The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys.  They then immediately
exit.  This is is designed to verify tsd_exit() which will be
called via thread_exit().  This must result in all registered
destructors being run and the memory for the tsd being free'd.

After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd.  At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.

If this all goes off without a hitch the test passes.  To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
2010-12-07 10:02:44 -08:00
Brian Behlendorf 9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional type of entries.  They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is using during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
Brian Behlendorf 058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition varaibles keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Brian Behlendorf 8655ce492f Linux 2.6.36 compat, use fops->unlocked_ioctl()
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed.  The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12.  Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
2010-11-10 13:16:12 -08:00
Brian Behlendorf 9b2048c26b Linux 2.6.36 compat, fs_struct->lock type change
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t.  If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't.  So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used.  We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
2010-11-09 13:29:47 -08:00
Brian Behlendorf 1e18307b61 Fix incorrect krw_type_t type
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.

  module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
  module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
  enumerated type ‘krw_type_t’
2010-11-09 10:18:01 -08:00
Brian Behlendorf 23aa63cbf5 Fix 2.6.35 shrinker callback API change
As of linux-2.6.35 the shrinker callback API now takes an additional
argument.  The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it.  This removes the need to always use global state for the
shrinker.

To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API.  Then we simply setup a callback
function with the correct number of arguments.  For now we do not make
use of the new 3rd argument.
2010-10-22 14:51:26 -07:00
Brian Behlendorf a7958f7eef Support custom build directories
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
2010-09-05 21:49:05 -07:00
Brian Behlendorf 2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00
Li Wei 4be55565fe Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similiar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-12 09:34:33 -07:00
Ricardo M. Correia 26f7245c7c Fix taskq code to not drop tasks when TQ_SLEEP is used.
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.

However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.

One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-02 11:20:31 -07:00
Brian Behlendorf 41f84a8d56 Strfree() should call kfree() not kmem_free()
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set.  Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.

A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
2010-07-30 22:20:58 -07:00
Brian Behlendorf 099dc9c2d2 Add uninstall Makefile targets
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.

Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
2010-07-28 14:55:32 -07:00
Brian Behlendorf 10129680f8 Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
2010-07-26 15:47:55 -07:00
Brian Behlendorf 849c50e7f2 Fix two minor compiler warnings
In cmd/splat.c there was a comparison between an __u32 and an int.  To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.

In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
2010-07-26 10:24:26 -07:00
Brian Behlendorf 8b0eb3f0dc Remove deadcode caused by removal of format1 arg
Commit 55abb0929e removed the never
used format1 argument of spl_debug_msg().  That in turn resulted
in some deadcode which should be removed since it's now useless.
2010-07-21 16:31:42 -07:00
Ricardo M. Correia 81672c0122 Display DEBUG keyword during module load when --enable-debug is used.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:31:03 -07:00
Ricardo M. Correia 2c762de830 Fix buggy kmem_{v}asprintf() functions
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:51:46 -07:00
Brian Behlendorf b17edc10a9 Prefix all SPL debug macros with 'S'
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros.  They must also build with DEBUG defined.
2010-07-20 13:30:40 -07:00
Brian Behlendorf 55abb0929e Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC.  It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.
2010-07-20 13:29:35 -07:00
Ned Bass 8f813bb168 Proposed fix for oops on SIGINT in splat atomic:64-bit test.
The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility.  If splat terminates before the threads, accesses to that memory
location by the other threads become invalid.  Splat synchronizes with
the threads with the call:

wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));

Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:

wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
2010-07-15 12:50:15 -07:00
Brian Behlendorf d0bd694ca9 Fix -Werror=format-security compiler option
Noticed under Ubuntu kernel builds we should be passing a
format specifier and the string, not just the string.
2010-07-14 11:53:57 -07:00
Brian Behlendorf f0ff89fc86 Linux 2.6.35 compat: filp_fsync() dropped 'stuct dentry *'
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h.  This will simplify
handling any further API changes in the future.
2010-07-14 11:40:55 -07:00
Brian Behlendorf a4bfd8ea1b Add __divdi3(), remove __udivdi3() kernel dependency
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The update implementation now passed all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for out purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
2010-07-13 16:44:02 -07:00
Brian Behlendorf 1814251453 Require gawk the usermode helper fails with awk
For some reason when awk invoked by the usermode helper the command
always fails.  Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distro I tested
with all had gawk installed instead of awk.  Anyway, the simplest
thing to do here is to just make gawk mandatory.  I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
2010-07-01 16:38:08 -07:00
Brian Behlendorf 7119bf7044 Add configure check for user_path_dir()
I didn't notice at the time but user_path_dir() was not introduced
at the same time as set_fs_pwd() change.  I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25.  This means
builds against 2.6.25-2.6.26 kernels were broken.

To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
2010-07-01 13:53:26 -07:00
Ned Bass 55f10ae5e9 Implementation of a regression test for TQ_FRONT.
Use 3 threads and 8 tasks.  Dispatch the final 3 tasks with TQ_FRONT.
The first three tasks keep the worker threads busy while we stuff the
queues.  Use msleep() to force a known execution order, assuming
TQ_FRONT is properly honored.  Verify that the expected completion
order occurs.

The splat_taskq_test5_order() function may be useful in more than
one test.  This commit generalizes it by renaming the function to
splat_taskq_test_order() and adding a name argument instead of
assuming SPLAT_TASKQ_TEST5_NAME as the test name.

The documentation for splat taskq regression test #5 swaps the two required
completion orders in the diagram.  This commit corrects the error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:52 -07:00
Ned Bass 1a73940d39 Initialize the /dev/splatctl device buffer
On open() and initialize the buffer with the SPL version string.  The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:46 -07:00
Ned Bass f0d8bb26b4 Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT.  Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task.  The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:38 -07:00
Brian Behlendorf 79a3bf130b Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed.  The upstream code has been updated
to depend entirely on the the procname member.  To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels.  Older kernels are
supported by having it expand to .ctl_name = X just as before.
2010-06-30 12:49:12 -07:00
Brian Behlendorf ede0bdffb6 Treat mutex->owner as volatile
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus.  This can result in incorrect behavior from
mutex_owned() and mutex_owner().  This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.

Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.

Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit().  When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().

Finally, improve the mutex regression tests.  For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread.  This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.

As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
2010-06-28 16:02:57 -07:00
Brian Behlendorf 616df2dd8b Fix subtle race in threads test case
The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited.  This basic thread implementation
was correct, this was simply a flaw in the test case.
2010-06-28 12:34:20 -07:00
Brian Behlendorf e6de04b73c Add kmem_vasprintf function
We might as well have both asprintf() variants.  This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
2010-06-24 09:41:59 -07:00
Brian Behlendorf 438683c0a9 Revert "Support TQ_FRONT flag used by taskq_dispatch()"
This reverts commit eb12b3782c.
2010-06-21 10:19:44 -07:00
Brian Behlendorf 3cb77549d1 Update warnings in kmem debug code
This fix was long overdue.  Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call.  However,
probably due to lack of time at the moment that informatin never
made it in to the error message.  This patch fixes that and trys
to standardize the kmem debug messages as well.
2010-06-16 16:01:16 -07:00
Brian Behlendorf eb12b3782c Support TQ_FRONT flag used by taskq_dispatch()
Allow taskq_dispatch() to insert work items at the head of the
queue instead of just the tail by passing the TQ_FRONT flag.
2010-06-11 15:57:25 -07:00