Commit Graph

217 Commits

Author SHA1 Message Date
Brian Behlendorf 734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep to to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf 2092cf68d8 Disable vmalloc() direct reclaim
As part of vmalloc() a __pte_alloc_kernel() allocation may occur.  This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur.  This reclaim can trigger file IO which
can result in a deadlock.  This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.

An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3).  This code can be properly autoconf'ed
when the upstream bug is fixed.

1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
2011-03-20 15:12:08 -07:00
Brian Behlendorf 47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support.  By removing it from the spl we ensure not conflict
with the higher level packages.

This just leaves minimal vnode support for basical manipulation
of files.  This code is does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf 5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place.  In reality the current implementation
was non-functional, it just was compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf 914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying it avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf 647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf a5b40eed17 Make vn_cache|vn_file_cache kmem caches
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be to large and performce is
much better for a kmem cache for them to be kmem backed.
2011-01-12 11:38:05 -08:00
Brian Behlendorf dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 4295b530ee Add vn_mode_to_vtype/vn_vtype to_mode helpers
Add simple helpers to convert a vnode->v_type to a inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare 3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf 6bf4d76f47 Linux Compat: inode->i_mutex/i_sem
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem.  This avoids the need to have to ugly
up the code with the required #define's at every call site.  At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
2011-01-11 12:14:48 -08:00
Brian Behlendorf 9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional type of entries.  They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is using during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
Brian Behlendorf 058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition varaibles keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Brian Behlendorf 8655ce492f Linux 2.6.36 compat, use fops->unlocked_ioctl()
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed.  The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12.  Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
2010-11-10 13:16:12 -08:00
Brian Behlendorf 9b2048c26b Linux 2.6.36 compat, fs_struct->lock type change
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t.  If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't.  So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used.  We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
2010-11-09 13:29:47 -08:00
Brian Behlendorf 23aa63cbf5 Fix 2.6.35 shrinker callback API change
As of linux-2.6.35 the shrinker callback API now takes an additional
argument.  The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it.  This removes the need to always use global state for the
shrinker.

To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API.  Then we simply setup a callback
function with the correct number of arguments.  For now we do not make
use of the new 3rd argument.
2010-10-22 14:51:26 -07:00
Brian Behlendorf a7958f7eef Support custom build directories
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
2010-09-05 21:49:05 -07:00
Brian Behlendorf 2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00
Li Wei 4be55565fe Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similiar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-12 09:34:33 -07:00
Ricardo M. Correia 26f7245c7c Fix taskq code to not drop tasks when TQ_SLEEP is used.
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.

However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.

One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-02 11:20:31 -07:00
Brian Behlendorf 41f84a8d56 Strfree() should call kfree() not kmem_free()
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set.  Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.

A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
2010-07-30 22:20:58 -07:00
Brian Behlendorf 10129680f8 Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
2010-07-26 15:47:55 -07:00
Brian Behlendorf 849c50e7f2 Fix two minor compiler warnings
In cmd/splat.c there was a comparison between an __u32 and an int.  To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.

In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
2010-07-26 10:24:26 -07:00
Brian Behlendorf 8b0eb3f0dc Remove deadcode caused by removal of format1 arg
Commit 55abb0929e removed the never
used format1 argument of spl_debug_msg().  That in turn resulted
in some deadcode which should be removed since it's now useless.
2010-07-21 16:31:42 -07:00
Ricardo M. Correia 81672c0122 Display DEBUG keyword during module load when --enable-debug is used.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:31:03 -07:00
Ricardo M. Correia 2c762de830 Fix buggy kmem_{v}asprintf() functions
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:51:46 -07:00
Brian Behlendorf b17edc10a9 Prefix all SPL debug macros with 'S'
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros.  They must also build with DEBUG defined.
2010-07-20 13:30:40 -07:00
Brian Behlendorf 55abb0929e Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC.  It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.
2010-07-20 13:29:35 -07:00
Brian Behlendorf d0bd694ca9 Fix -Werror=format-security compiler option
Noticed under Ubuntu kernel builds we should be passing a
format specifier and the string, not just the string.
2010-07-14 11:53:57 -07:00
Brian Behlendorf f0ff89fc86 Linux 2.6.35 compat: filp_fsync() dropped 'stuct dentry *'
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h.  This will simplify
handling any further API changes in the future.
2010-07-14 11:40:55 -07:00
Brian Behlendorf a4bfd8ea1b Add __divdi3(), remove __udivdi3() kernel dependency
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The update implementation now passed all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for out purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
2010-07-13 16:44:02 -07:00
Brian Behlendorf 1814251453 Require gawk the usermode helper fails with awk
For some reason when awk invoked by the usermode helper the command
always fails.  Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distro I tested
with all had gawk installed instead of awk.  Anyway, the simplest
thing to do here is to just make gawk mandatory.  I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
2010-07-01 16:38:08 -07:00
Brian Behlendorf 7119bf7044 Add configure check for user_path_dir()
I didn't notice at the time but user_path_dir() was not introduced
at the same time as set_fs_pwd() change.  I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25.  This means
builds against 2.6.25-2.6.26 kernels were broken.

To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
2010-07-01 13:53:26 -07:00
Ned Bass 1a73940d39 Initialize the /dev/splatctl device buffer
On open() and initialize the buffer with the SPL version string.  The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:46 -07:00
Ned Bass f0d8bb26b4 Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT.  Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task.  The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:38 -07:00
Brian Behlendorf 79a3bf130b Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed.  The upstream code has been updated
to depend entirely on the the procname member.  To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels.  Older kernels are
supported by having it expand to .ctl_name = X just as before.
2010-06-30 12:49:12 -07:00
Brian Behlendorf e6de04b73c Add kmem_vasprintf function
We might as well have both asprintf() variants.  This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
2010-06-24 09:41:59 -07:00
Brian Behlendorf 438683c0a9 Revert "Support TQ_FRONT flag used by taskq_dispatch()"
This reverts commit eb12b3782c.
2010-06-21 10:19:44 -07:00
Brian Behlendorf 3cb77549d1 Update warnings in kmem debug code
This fix was long overdue.  Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call.  However,
probably due to lack of time at the moment that informatin never
made it in to the error message.  This patch fixes that and trys
to standardize the kmem debug messages as well.
2010-06-16 16:01:16 -07:00
Brian Behlendorf eb12b3782c Support TQ_FRONT flag used by taskq_dispatch()
Allow taskq_dispatch() to insert work items at the head of the
queue instead of just the tail by passing the TQ_FRONT flag.
2010-06-11 15:57:25 -07:00
Brian Behlendorf b868e22f05 Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup().  They are all implemented as a thin layer which just calls
their Linux counterparts.  As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels.  If the kernel does
not provide it then spl-generic implements it.

Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
2010-06-11 15:57:25 -07:00
Brian Behlendorf ae4c36adce Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes.  This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris.  Minor updates were
required to all the call sites where it was included of course.
2010-06-11 15:57:25 -07:00
Alex Zhuravlev 1b4ad25e2f Stack overflow on 64-bit modulus operations on 32-bit architectures.
Running 'zpool create' on a 32-bit machine with an SPL compiled with
gcc 4.4.4 led to a stack overlow.  This turned out to be due to some
sort of 'optimization' by gcc:

uint64_t __umoddi3(uint64_t dividend, uint64_t divisor)
{
   return dividend - divisor * (dividend / divisor);
}

This code was supposed to be using __udivdi3 to implement /, but gcc
instead implemented it via __umoddi3 itself.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-06-03 09:06:55 -07:00
Brian Behlendorf 8a1c9a02fb Minor 32-bit fix cast to hrtime_t before the mutliply.
It's important to cast to hrtime_t before doing the multiply because
the ts.tv_sec type is only 32-bits and we need to promote it to 64-bits.
2010-05-23 09:51:17 -07:00
Brian Behlendorf 23d91792ef Use KM_NODEBUG macro in preference to __GFP_NOWARN. 2010-05-20 14:16:59 -07:00
Brian Behlendorf 3626ae6a70 Disable spl_debug_panic_on_bug by default.
While I may prefer to have the system panic on an SBUG and to get
crash dump for analysis.  I suspect most peoples systems are not
configured from crash dump and the best thing to so is to simply
halt the thread and print an error to the console.  This way they
have a good chance of actually saving the stack trace and debug log.
2010-05-20 10:15:51 -07:00
Brian Behlendorf 5198ea0e71 Remove kmem_set_warning() interface replace with __GFP_NOWARN flag.
Remove the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN.  This matches
the lower level linux flag of similar by slightly different function.
The idea is you can then explicitly set this flag on requests where
you know your breaking the max 8k rule but you need/want to do it
anyway.

This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise.  Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and theres no easy way to handle that.

Finally, I've added a stack_dump() call to the warning when it is
trigger to make to clear exactly where the allocation is taking place.
2010-05-19 16:53:13 -07:00
Brian Behlendorf 627a74972c Set default debug log patch to /tmp/spl-log.
Using /tmp/ is a preferable default, it can always be overriden
using the module option on a case-by-case basis.

Additionally standardize some log messages based on the same
default log level used by the kernel.
2010-05-19 16:17:06 -07:00
Brian Behlendorf 716154c592 Public Release Prep
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files.  Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
2010-05-17 15:18:00 -07:00
Brian Behlendorf 6020190e8f Use do_posix_clock_monotonic_gettime() as described by comment.
While this does incur slightly more overhead we should be using
do_posix_clock_monotonic_gettime() for gethrtime() as described
by the existing comment.
2010-05-14 09:31:22 -07:00
Brian Behlendorf f752b46eb3 Add cv_wait_interruptible() function.
This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux.  The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it.  This makes it impossible to woken by a signal
such as SIGTERM.  The cv_wait_interruptible() function was added
to handle this case.  It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up.  This means you do need to be careful and check issig() after
waking.
2010-05-14 09:24:51 -07:00
Brian Behlendorf 97f8f6d789 Dump log from current process when required
When dumping a debug log first check that it is safe to create
a new thread and block waiting for it.  If we are in an atomic
context or irqs and disabled it is not safe to sleep and we
must write out of the debug log from the current process.
2010-04-23 15:55:02 -07:00
Brian Behlendorf d05ec4b45f Assume TQ_SLEEP when not explicitly specified. 2010-04-23 14:39:47 -07:00
Ricardo Correia 663e02a135 Handle the FAPPEND option in vn_rdwr().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-04-23 14:39:42 -07:00
Brian Behlendorf 82a358d9c0 Update vn_set_pwd() to allow user|kernal address for filename
During module init spl_setup()->The vn_set_pwd("/") was failing
with -EFAULT because user_path_dir() and __user_walk() both
expect 'filename' to be a user space address and it's not in
this case.  To handle this the data segment size is increased
to to ensure strncpy_from_user() does not fail with -EFAULT.

Additionally, I've added a printk() warning to catch this and
log it to the console if it ever reoccurs.  I thought everything
was working properly here because there consequences of this
failing are subtle and usually non-critical.
2010-04-22 12:53:58 -07:00
Brian Behlendorf 16b719f006 Allow spl_config.h to be included by dependant packages (updated)
We need dependent packages to be able to include spl_config.h to
build properly.  This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed.  This solution
works as long as the spl_config.h is included before your projects
config.h.  That turns out to be easier said than done.  In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.

To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command.  This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file.  This eliminates
the problem entirely and makes header safe for inclusion.

Also in this change I have removed the few places in the code where
spl_config.h is included.  It is now added to the gcc compile line
to ensure the config results are always available.

Finally, I have also disabled the verbose kernel builds.  If you
want them back you can always build with 'make V=1'.  Since things
are working now they don't need to be on by default.
2010-03-22 14:45:33 -07:00
Brian Behlendorf aa600d8a38 Reduce max kmem based slab size
Allowing MAX_ORDER-1 sized allocations for kmem based slabs have
been observed to result in deadlocks.  To help prvent this limit
max kmem based slab size to MAX_ORDER-3.  Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.
2010-03-18 13:39:51 -07:00
Brian Behlendorf 3977f8370f Linux 2.6.32 compat, proc_handler() API change
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype.  It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.

I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions.  It's not pretty but API compat changes rarely are.
2010-03-04 12:14:56 -08:00
Ricardo M. Correia 694921bc49 sun-misc-gitignore
Add .gitignore files.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Ricardo M. Correia f7e8739c94 sun-fix-whitespace
Whitespace fixes.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Ricardo M. Correia b520b14305 sun-fix-panic-str
Fix panic() string, which was being used as a format string, instead of an already-formatted string.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Brian Behlendorf 82387586af Optimize lowest outstanding taskqid calculation in taskq_lowest_id()
In the initial version of taskq_lowest_id() the entire pending and
work list was locked under the tq->tq_lock to determine the lowest
outstanding taskqid.  At the time this done because I was rushed
and wanted to make sure it was right... fast was secondary.  Well now
fast is important too so I carefully thought through the pending
and work list management and convinced myself it is safe and correct
to simply check the first entry.  I added a large comment to the source
to explain this.  But basically as long as we are careful to ensure the
pending and work list stay sorted this is safe and fast.

The motivation for this chance was that I was observing as much as
10% of the total CPU time go to waiting on the tq->tq_lock when the
pending list was long.  This resolves that problems and frees up
that CPU time for something useful.
2010-01-04 15:52:26 -08:00
Brian Behlendorf ef1c7a0691 Strip __GFP_ZERO from kmalloc it is not available for older kernels.
This is needed to avoid a BUG_ON() on RHEL5.4 kernel 2.6.18-164.6.1,
since __GFP_ZERO is not a valid flag for kmalloc().
2009-12-23 12:57:10 -08:00
Brian Behlendorf 242f539a2e Add skc_flags and full header to /proc/spl/kmem/slab. 2009-12-11 11:20:08 -08:00
Brian Behlendorf d04c8a563c Atomic64 compatibility for 32-bit systems without kernel support.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing.  This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.

Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations.  Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total.  To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.

The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available.  On 32-bit archs this may
not be true, and actually that's just fine.  In that case the kernel will
will never be able to allocate more the 32-bits worth anyway.  So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
2009-12-04 15:54:12 -08:00
Brian Behlendorf db1aa22297 Correctly handle division on 32-bit RHEL5 systems by returning dividend. 2009-12-01 15:53:28 -08:00
Brian Behlendorf 0a6c005959 Ensure spl_config.h is include in spl-generic.c 2009-11-15 15:04:33 -08:00
Brian Behlendorf 8b45dda2bc Linux 2.6.31 kmem cache alignment fixes and cleanup.
The big fix here is the removal of kmalloc() in kv_alloc().  It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be pages aligned.  This is no longer true atleast in 2.6.31
there are no longer any alignment expectations.  Since kv_alloc()
requires the resulting address to be page align we no only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.

As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures.  This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account.  The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explict and we no longer count
on the compiler to auto promote types hopefully as we expected.

Always wnforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.

Use SPL_KMEM_CACHE_ALIGN in splat alignment test.
2009-11-13 11:12:43 -08:00
Brian Behlendorf c89fdee4d3 Remove __GFP_NOFAIL in kmem and retry internally.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time.  To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.

From linux-2.6.31 mm/page_alloc.c:1166
/*
 * __GFP_NOFAIL is not to be used in new code.
 *
 * All __GFP_NOFAIL callers should be fixed so that they
 * properly detect and handle allocation failures.
 *
 * We most definitely don't want callers attempting to
 * allocate greater than order-1 page units with
 * __GFP_NOFAIL.
 */
WARN_ON_ONCE(order > 1);
2009-11-12 15:11:24 -08:00
Brian Behlendorf baf2979ed3 Linux 2.6.31 Compatibility Updates
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.

min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels.  For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
2009-11-10 14:06:57 -08:00
Brian Behlendorf 055ffd98cf Autoconf --enable-debug-* cleanup
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it.  To summerize:

1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly.  This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.

2) --enable-debug-kmem=yes by default.  This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload.  Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.

3) --enable-debug-kmem-tracking=no by default.  This option was added
to provide a configure option to enable to detailed memory allocation
tracking.  This support was always there but you had to know where to
turn it on.  By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.

4) --enable-debug-kstat removed.  After further reflection I can't see
why you would ever really want to turn this support off.  It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c.  We can now always assume the top level directory
will be there.

5) --enable-debug-callb removed.  This never really did anything, it was
put in provisionally because it might have been needed.  It turns out
it was not so I am just removing it to prevent confusion.
2009-10-30 13:58:51 -07:00
Brian Behlendorf 5e9b5d832b Use Linux atomic primitives by default.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock.  This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primatives on the Linux counterparts.  It however was
likely not good for performance.

After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely.  The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either a atomic_t or atomic64_t.  The only lingering problem for both
implementations is that Solaris provides no atomic read function.  This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking.  I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API.  So really
we are just being bug-for-bug compatible here.

With this change the default implementation is layered on top of Linux
atomic types.  However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach.  Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
2009-10-30 10:55:25 -07:00
Brian Behlendorf 2b5adaf18f I should not have removed these, they are important. 2009-10-27 16:17:06 -07:00
Brian Behlendorf 4bd577d069 Rebase cmn_err on vcmn_err and don't warn about missing \n
The cmn_err/vcmn_err functions are layered on top of the debug
system which usually expects a newline at the end.  However, there
really doesn't need to be a newline there and there in fact should
not be for the CE_CONT case so let's just drop the warning.

Also we make a half-hearted attempt to handle a leading ! which
means only send it to the syslog not the console.  In this case
we just send to the the debug logs and not the console.
2009-10-27 16:13:35 -07:00
Brian Behlendorf 51a727e90f Set cwd to '/' for the process executing insmod.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the callers cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded.  This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd.  It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly.  All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.

Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module.  Because of this I have
had to add more autoconf magic than I'd like.  However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.

In addition, I have change the rootdir type from a 'void *' to the
correct 'vnode_t *' type.  And I've set rootdir to a non-NULL value.
2009-10-01 16:06:15 -07:00
Brian Behlendorf 4d54fdee1d Reimplement mutexs for Linux lock profiling/analysis
For a generic explanation of why mutexs needed to be reimplemented
to work with the kernel lock profiling see commits:
  e811949a57 and
  d28db80fd0

The specific changes made to the mutex implemetation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly case to a mutex struct and
passed directly to the Linux primative.

Just like with the rwlocks it is critical that these functions be
implemented as '#defines to ensure the location information is
preserved.  The preprocessor can then do a direct replacement of
the Solaris primative with the linux primative.

Just as with the rwlocks we need to track the lock owner.  Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you.  If your running a 2.6.29 or newer kernel on a
SMP system the lock owner will be tracked.  This was added to Linux
to support adaptive mutexs, more on that shortly.  Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build.  If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior.  This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.

Concerning adaptive mutexs these are a very recent development and
they did not make it in to either the latest FC11 of SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance.  From Linux kernel commit:
  0d66bf6d3514b35eb6897629059443132992dbd7
  "Testing with Ingo's test-mutex application...
  gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol.  The kmutex_t
implementation will use this symbol when it's available to
provide it's own adaptive mutexs.

Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds.  This code was now redundant.

Update mutexs validated on:
    - SLES10   (ppc64)
    - SLES11   (x86_64)
    - CHAOS4.2 (x86_64)
    - RHEL5.3  (x86_64)
    - RHEL6    (x86_64)
    - FC11     (x86_64)
2009-09-25 14:47:01 -07:00
Brian Behlendorf d28db80fd0 Update rwlocks to track owner to ensure correct semantics
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is help, we must
only do this when the current task in the holder.

This means we need to track the lock owner which is not something
tracked in a Linux semaphore.  After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly intergrate
with the kernel lock analysis tools.  My reasons:

1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel.  This allows us to use '#defines so the preprocessor
can do direct replacement of the Solaris primative with the linux
equivilant.  This is important because it then maintains the location
information for each rw_* call point.

2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself.  This removes
the need for a fancy lookup to get the owner which is optimal for
performance.  We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.

3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.

4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization.  This is good for all
the obvious reasons, we do give up the ability to specific the lock
name.  The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.

Update rwlocks validated on:
- SLES10   (ppc64)
- SLES11   (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3  (x86_64)
- RHEL6    (x86_64)
- FC11     (x86_64)
2009-09-25 14:14:35 -07:00
Brian Behlendorf e811949a57 Reimplement rwlocks for Linux lock profiling/analysis.
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools.  This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.

The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked.  Since the
rwsem was embedded in a wrapper structure the name was always the
same.  The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.

The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure.  This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible.  It can only be done by directly accessing some of
the internal data member of the rwsem structure.

For example, the Linux API provides a different function for dropping
a reader vs writer lock.  Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is.  This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock.  Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation.  This is
all do able, and what I did, but it does complicate things considerably.

The good news is that in addition to the profiling benefits of this
change.  We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.

The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem.  Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.

In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests.  Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.

This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing.  The full
SPLAT regression test suite was run numberous times on all of the following
platforms.  This includes various kernels ranging from 2.6.16 to 2.6.29.

- SLES10   (ppc64)
- SLES11   (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3  (x86_64)
- RHEL6    (x86_64)
- FC11     (x86_64)
2009-09-18 16:09:47 -07:00
Brian Behlendorf 6ae7fef5b9 Update global_page_state() support for 2.6.29 kernels.
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state().  The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it.  Several
  autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
  get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
  be to dynamically aquire it at load time.  This is an absolute
  last resort for old kernel which we don't want to patch to
  cleanly export the symbol.
2009-07-28 15:06:42 -07:00
Brian Behlendorf 6b09f73939 Remove get/put_task_struct as they are not available for SLES11
This interface is going away, and it's not as if most callers actually
use crhold/crfree when working with credentials.  So it'll be okay
they we're not taking a reference on the task structure the odds of
it going away while working with a credential and pretty small.
2009-07-28 15:04:21 -07:00
Brian Behlendorf ec7d53e99a Add basic credential support and splat tests.
The previous credential implementation simply provided the needed types and
a couple of dummy functions needed.  This update correctly ties the basic
Solaris credential API in to one of two Linux kernel APIs.

Prior to 2.6.29 the linux kernel embeded all credentials in the task
structure.  For these kernels, we pass around the entire task struct as if
it were the credential, then we use the helper functions to extract the
credential related bits.

As of 2.6.29 a new credential type was added which we can and do fairly
cleanly layer on top of.  Once again the helper functions nicely hide
the implementation details from all callers.

Three tests were added to the splat test framework to verify basic
correctness.  They should be extended as needed when need credential
functions are added.
2009-07-27 17:18:59 -07:00
Brian Behlendorf 7064b767c2 Positive Solaris ioctl return codes need to be negated for use by libc 2009-07-23 16:14:52 -07:00
Brian Behlendorf 2141116167 The HAVE_PATH_IN_NAMEIDATA compat macros should have been used here. 2009-07-22 14:28:19 -07:00
Brian Behlendorf 78d6de97bd Register a basic compat ioctl handler (32 vs 64 bit compat)
Simply pass the ioctl on to the normal handler.  If the ioctl
helper macros are used correctly this should be safe as they
will handle the packing/unpacking of the data encoded in the
ioctl command.  And actually, if the caller does not use the
IO* macros at all, and just passes small values, it will probably
be OK as well.  We only get in to trouble if they try and use
the upper 32-bits.  Endianness is not really a concern here, we
we are pretty much assumed they user and kernel will match.
2009-07-21 10:13:58 -07:00
Ricardo M. Correia e004f04c8b Prevent integer overflow after ~164 days of uptime.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2009-07-14 15:23:25 -07:00
Brian Behlendorf d3126abe75 Add ddi_copyin/ddi_copyout support for fake kernel originated ioctls. 2009-07-10 10:56:32 -07:00
Brian Behlendorf 915404bd50 Add basic support for TASKQ_THREADS_CPU_PCT taskq flag which is
used to scale the number of threads based on the number of online
CPUs.  As CPUs are added/removed we should rescale the thread
count appropriately, but currently this is only done at create.
2009-07-09 10:07:52 -07:00
Brian Behlendorf c0517c35d2 Use do_div on older kernel where do_div64 doesn't exist. 2009-06-26 13:10:52 -07:00
Brian Behlendorf 124ca8a5a9 SLES10 Fixes (part 7)
- Initial SLES testing uncovered a long standing bug in the debug
  tracing.  The tcd_for_each() macro expected a NULL to terminate
  the trace_data[i] array but this was only ever true due to luck.
  All trace_data[] iterators are now properly capped by TCD_TYPE_MAX.
- SPLAT_MAJOR 229 conflicted with a 'hvc' device on my SLES system.
  Since this was always an arbitrary choice I picked something else.
- The HAVE_PGDAT_LIST case should set pgdat_list_addr to the value stored
  at the address of the memory location returned by kallsyms_lookup_name().
2009-05-20 15:30:13 -07:00
Brian Behlendorf 5232d256b4 SLES10 Fixes (part 6)
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
  Instead for_each_zone() operated directly on pgdat_list which may or
  may not have been exported depending on how your kernel was compiled.
  Now new configure checks determine if you have the helpers or not, and
  if the needed symbols are exported.  If they are not exported then they
  are dynamically aquired at runtime by kallsyms_lookup_name().
2009-05-20 14:23:13 -07:00
Brian Behlendorf a093c6a499 SLES10 Fixes (part 4):
- Configure check for SLES specific API change to vfs_unlink()
  and vfs_rename() which added a 'struct vfsmount *' argument.
  This was for something called the linux-security-module, but
  it appears that it was never adopted upstream.
2009-05-20 11:31:55 -07:00
Brian Behlendorf 96dded3844 SLES10 Fixes (part 2):
- Configure check, the div64_64() function was renamed to
  div64_u64() as of 2.6.26.
- Configure check, the global_page_state() fuction was introduced
  in 2.6.18 kernels.  The earlier 2.6.16 based SLES10 must not try
  and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols aquired dynamically
  using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
  the HZ division as an 'unsigned long long' because the SPL only
  implements __udivdi3(), and not __divdi3() for 'long long' division
  on 32-bit arches.
2009-05-20 10:08:37 -07:00
Brian Behlendorf 0cbaeb117a Allow spl_config.h to be included by dependant packages
We need dependent packages to be able to include spl_config.h so they
can leverage the configure checks the SPL has done.  This is important
because several of the spl headers need the results of these checks to
work properly.  Unfortunately, the autoheader build product is always
private to a particular build and defined certain common things.
(PACKAGE, VERSION, etc).  This prevents other packages which also use
autoheader from being include because the definitions conflict.  To
avoid this problem the SPL build system leverage AH_BOTTOM to include
a spl_unconfig.h at the botton of the autoheader build product.  This
custom include undefs all known shared symbols to prevent the confict.
This does however mean that those definition are also not availble
to the SPL package either.  The SPL package therefore uses the
equivilant SPL_META_* definitions.
2009-03-17 14:55:59 -07:00
Brian Behlendorf e11d6c5f50 FC10/i686 Compatibility Update (2.6.27.19-170.2.35.fc10.i686)
In the interests of portability I have added a FC10/i686 box to
my list of development platforms.  The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support.  This patch
resolves all remaining issues observed under that environment.

1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added.  As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages.  The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state().  If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.

2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality.  The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.

3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory.  On x86 systems with limited virtual address space
failures resulted due to exhaustig the address space.  The tests
really need to problem exhausting all memory on the system thus
we need to use the physical address space.

4) Change kmem:slab_lock to cap it's memory usage at availrmem
instead of using the native linux nr_free_pages().  This provides
additional test coverage of the SPL Linux VM integration.

5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M.  On x86 based systems it is not possible to create
a kmem backed slab with entires of that size.  To compensate for
this the number of allocations performed in increased by 4x.

6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.

7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol.  This causes the module to fail
to load and we need to know exactly which symbol was not available.
2009-03-17 12:16:31 -07:00
Brian Behlendorf 7257ec4185 Fix taskq_wait() not waiting bug
I'm very surprised this has not surfaced until now.  But the taskq_wait()
implementation work only wait successfully the first time it was called.
Subsequent usage of taskq_wait() on the taskq would not wait.

The issue was caused by tq->tq_lowest_id being set to MAX_INT after the
first wait completed.  This caused subsequent waits which check that the
waiting id is less than the lowest taskq id to always succeed.  The fix
is to ensure that tq->tq_lowest_id is never set larger than tq->tq_next.id.

Additional fixes which were added to this patch include:
1) Fix a race by placing the taskq_wait_check() in the tq->tq_lock spinlock.
2) taskq_wait() should wait for the largest outstanding id.
3) Multiple spelling corrections.
4) Added taskq wait regression test to validate correct behavior.
2009-03-15 15:13:49 -07:00
Ricardo M. Correia a0b5ae8aca Fix off-by-1 truncation of hw_serial when converting from integer to string, when writing to /proc/sys/kernel/spl/spl_hostid.
Fixes hostid mismatch which leads to assertion failure when the hostid/hw_serial is a 10-character decimal number:

$ zpool status
  pool: lustre
 state: ONLINE
lt-zpool: zpool_main.c:3176: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
zsh: 5262 abort      zpool status
2009-03-12 15:47:50 -07:00
Ricardo M. Correia 6c33eb8162 Minor bug fix in XDR code introduced in last minute change before landing.
1) Removed xdr_bytesrec typedef which has no consumers.  If we re-add
   it should also probably be xdr_bytesrec_t.
2009-03-11 16:27:35 -07:00
Ricardo M. Correia f48b61938a Add XDR implementation
Added proper XDR implementation (Lustre bug 17662), needed for on-disk
compatibility between platforms of different endianness.
2009-03-11 13:00:26 -07:00
Brian Behlendorf c5f704607b Build system and packaging (RPM support)
An update to the build system to properly support all commonly
used Makefile targets these include:

  make all        # Build everything
  make install    # Install everything
  make clean	  # Clean up build products
  make distclean  # Clean up everything
  make dist       # Create package tarball
  make srpm       # Create package source RPM
  make rpm        # Create package binary RPMs
  make tags       # Create ctags and etags for everything

Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels.  To build binary
RPMs from the source RPM for your system simply run:

  rpmbuild --rebuild spl-x.y.z-1.src.rpm

This will produce two binary RPMs with correct 'requires'
dependencies for your kernel.  One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependant on the spl.

  spl-x.y.z-1_<kernel version>.x86_64.rpm
  spl-devel-x.y.2-1_<kernel version>.x86_64.rpm
2009-03-09 15:56:55 -07:00
Ricardo M. Correia 32f74c5280 XXX: Temporarily disable vmem_size(). 2009-03-05 10:13:59 -08:00
Brian Behlendorf d1ff2312b0 Linux VM Integration Cleanup
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address.  The function name itself can them become a
define which calls a function pointer.  This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel.  This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.

This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name().  There are
autoconf checks to detect if the symbol is exported and if so to
use it directly.  We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported.  Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement.  The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.

Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
   of $LINUX because Module.symvers is a build product.  When
   $LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
   $LINUX/include search paths to allow proper compilation when
   the kernel target build directory is not the source directory.
2009-03-04 10:04:15 -08:00
Brian Behlendorf 99639e4a13 Add zone_get_hostid() function
Minimal support added for the zone_get_hostid() function.  Only
global zones are supported therefore this function must be called
with a NULL argumment.  Additionally, I've added the HW_HOSTID_LEN
define and updated all instances where a hard coded magic value
of 11 was used; "A good riddance of bad rubbish!"
2009-02-19 11:26:17 -08:00
Brian Behlendorf 1315c88437 Coverity 9649, 9650, 9651: Uninit
This check was originally added to detect double initializations
of mutex types (which it did find).  Unfortunately, Coverity is
right that there is a very small chance we could trigger the
assertion by accident because an uninitialized stack variable
happens to contain the mutex magic.  This is particularly unlikely
since we do poison the mutexs when destroyed but still possible.
Therefore I'm simply removing the assertion.
2009-02-18 09:48:07 -08:00
Brian Behlendorf 9b1b8e4c24 kmem slab magazine ageing deadlock
- The previous magazine ageing sceme relied on the on_each_cpu()
  function to call spl_magazine_age() on each cpu.  It turns out
  this could deadlock with do_flush_tlb_all() which also relies
  on the IPI based on_each_cpu().  To avoid this problem a per-
  magazine delayed work item is created and indepentantly
  scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
  spl_kmem_cache_t, they were hold overs from previous cleanup.
    - struct work_struct work
    - struct timer_list timer
2009-02-17 15:52:18 -08:00
Brian Behlendorf 1a944a7d0b kmem slab fixes
- spl_slab_reclaim() 'continue' changed back to 'break' from commit
  37db7d8cf9.  The original was correct,
  I have added a comment to ensure this does not happen again.
- spl_slab_reclaim() further optimized by moving the destructor call
  in spl_slab_free() outside the skc->skc_lock.  This minimizes the
  length of time the spin lock is held, allows the destructors to
  be invoked concurrently for different objects, and as a bonus makes
  it safe (although unwise) to sleep in the destructors.
2009-02-13 10:28:55 -08:00
Brian Behlendorf 37db7d8cf9 kmem slab fixes
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
  spl_slab_reclaim() to cap the number of slabs reclaimed.
  On Solaris all eligible slabs are always reclaimed, and this
  is still the default behavior.  However, I suspect that is
  not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
  slab/object free lists.  Soft lockups were observed when
  freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
  incorrect 'break' to 'continue' to ensure all slabs are
  checked.
- spl_cache_age() reworked to avoid a deadlock with
  do_flush_tlb_all() which occured because we slept waiting
  for completion in spl_cache_age().  To waiting for magazine
  reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
  back to using for_each_online_cpu() instead of the
  spl_on_each_cpu() approach which was of course a bad idea
  due to memory allocations which Ricardo pointed out.
2009-02-12 13:32:10 -08:00
Brian Behlendorf 4ab13d3b5c Additional Linux VM integration
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In additional availrmem is now available and return a reasonable value
which is reasonably analogous to the Solaris meaning.  On linux we
return the sun of free and inactive pages since these are all easily
reclaimable.

All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior.  Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
2009-02-05 12:26:34 -08:00
Brian Behlendorf 36b313dacf Linux VM integration / device special files
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree.  These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.

When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered.  When a minor is unregistered we use the vnode
interface to unlink the special file.
2009-02-04 15:15:41 -08:00
Brian Behlendorf 31a033ecd4 2.6.27+ portability changes
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
  if the older 4 argument version of on_each_cpu() should be
  used or the new 3 argument version.  The retry argument was
  dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers.  The old way this
  worked was to pass a data point when initialized the workqueue.
  The new API assumed the work item is embedding in a structure
  and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
  type checked in the bit operations.  This silences the warnings.
- Updated autogen products and splat tests accordingly
2009-02-02 15:12:30 -08:00
Brian Behlendorf f220894e1f Make the number of system taskq threads based on the node of cores in the node, as is done for most linux system tasks 2009-02-02 08:53:53 -08:00
Brian Behlendorf ea3e6ca9e5 kmem_cache hardening and performance improvements
- Added slab work queue task which gradually ages and free's slabs
  from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
  maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock.  We
  now unlink the objects and slabs from the cache itself and attach
  them to a private work list.  The contents of the list are then
  subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
  per-cache skc_lock.  This includes the addition of KMC_BIT_REAPING
  bit mask which is used to prevent concurrent reaping, and to defer
  new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
  destroyed, this is used to catch any task accessing the cache while
  it is being destroyed.
- Add comments to all the functions and additional comments to try
  and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
  rigerously stress the cache implementation and look for any problems.
  This includes correctness and performance tests.
- Updated portable work queue interfaces
2009-01-30 20:54:49 -08:00
Brian Behlendorf 34e71c9e97 Remove debug check was was accidentally left in place an prevent the slab cache from working on systems with >4 cores 2009-01-26 20:10:23 -08:00
Brian Behlendorf 48e0606a52 Implement kmem cache alignment argument 2009-01-26 09:02:04 -08:00
Brian Behlendorf b6b2acc66e Minor fix for compiler warning when KMEM_TRACKING is enabled 2009-01-20 13:39:35 -08:00
Brian Behlendorf 15270e003e Ensure -NDEBUG does not get added to spl_config.h and is only set in the build options. This allows other kernel modules to use spl_config to leverage the reset of the config checks without getting confused with the debug options 2009-01-20 11:59:47 -08:00
Brian Behlendorf 617d5a673c Rename modules to module and update references 2009-01-15 10:44:54 -08:00