Commit Graph

396 Commits

Author SHA1 Message Date
Brian Behlendorf 38d31a1e57 Remove Solaris module emulation
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #109
2012-05-18 13:57:44 -07:00
Richard Yao f90096c905 Modify KM_PUSHPAGE to use GFP_NOIO instead of GFP_NOFS
The resolution of issue #31 made KM_PUSHPAGE imply GFP_NOFS.  This
was done to prevent situations where filesystem operations which are
holding locks enter direct reclaim and attempt to reaquire those
same locks.  This clearly will result in a deadlock.

This works for datasets which are implemented in terms for filesystem
operations.  But unfortunately, swapping to a zvol will encounter
many of the same deadlocks and GFP_NOFS will not prevent this.  As
such, it is appropriate to extend KM_PUSHPAGE to use the broader
GFP_NOIO mask to handle these non-filesystem cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#342
Closes #105
2012-05-07 12:05:27 -07:00
Prakash Surya cef7605c34 Throttle number of freed slabs based on nr_to_scan
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.

Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #101
2012-05-07 11:46:15 -07:00
Jorgen Lundman cb75844e85 Define the needed ISA types for ARM
Add the minimum required ISA types to support the ARM architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 09:56:15 -07:00
Brian Behlendorf b29012b999 Remove condition variable names
Long ago I added support to the spl for condition variable names
because I thought they might be needed.  It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.

  cv_init(9F)
    Parameters
      name - Descriptive string. This is obsolete and should be
             NULL. (Non-NULL strings are legal, but they're a
             waste of kernel memory.)

Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-06 12:06:19 -07:00
Brian Behlendorf 0835057ee7 Add SPL_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:11:50 -07:00
Brian Behlendorf 3c208a5480 Cleanly support debug packages
Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs
  '--with debug-log'           - Enables the internal debug log
  '--with debug-kmem'          - Enables basic memory accounting
  '--with debug-kmem-tracking' - Enables detailed memory tracking

  # For example:
  $ rpmbuild --rebuild --with debug spl-modules-0.6.0-rc6.src.rpm

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 14:24:22 -08:00
Brian Behlendorf feedc43601 Add missing spl_debug_* helpers
When building the spl with --disable-debug-log the __SDEBUG()
macro and spl_debug_* helper functions were undefined.  This
change adds the missing functions so the upper layers compiling
against the spl don't need to be aware of how the spl was built.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:41:46 -08:00
Brian Behlendorf 9a8b7a7458 Add basic dynamic kstat support
Add the bare minimum functionality to support dynamic kstats.  A
complete kstat implementation should be done as part of issue #84.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
2012-02-02 11:28:00 -08:00
Brian Behlendorf 4b2220f0b9 Add --enable-debug-log configure option
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s.  This patch clarifies things
by cleanly breaking the two subsystem apart.  The result of this
is the following behavior.

--enable-debug      - Enable/disable code wrapped in ASSERT()s.
--disable-debug       ASSERT()s are used to check invariants and
                      are never required for correct operation.
                      They are disabled by default because they
                      may impact performance.

--enable-debug-log  - Enable/disable the debug log infrastructure.
--disable-debug-log   This infrastructure allows the spl code and
                      its consumer to log messages to an in-kernel
                      log.  The granularity of the logging can be
                      controlled by a debug mask.  By default the
                      mask disables most debug messages resulting
                      in a negligible performance impact.  Because
                      of this the debug log is enabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:27:54 -08:00
Brian Behlendorf a2eda2ff48 Add the release component to headers
When the original build system code was added the release
component was accidentally omited from the development header
install path.  This patch adds the missing path component so
it's always clear exactly what release your compiling against.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-18 11:06:26 -08:00
Darik Horn 588d900433 Linux 3.2 compat: rw_semaphore.wait_lock is raw
The wait_lock member of the rw_semaphore struct became a raw_spinlock_t
in Linux 3.2 at torvalds/linux@ddb6c9b58a.

Wrap spin_lock_* function calls in a new spl_rwsem_* interface to
ensure type safety if raw_spinlock_t becomes architecture specific,
and to satisfy these compiler warnings:

  warning: passing argument 1 of ‘spinlock_check’
    from incompatible pointer type [enabled by default]
  note: expected ‘struct spinlock_t *’
    but argument is of type ‘struct raw_spinlock_t *’

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #76
Closes: zfsonlinux/zfs#463
2012-01-11 16:28:05 -08:00
Brian Behlendorf 5f6c14b1ed Proxmox VE kernel compat, invalidate_inodes()
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check().  In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers.  Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #58
2011-12-21 14:29:45 -08:00
Prakash Surya 8f2503e0af Store copy of tqent_flags prior to servicing task
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.

In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).

In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.

Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #71
2011-12-16 16:54:00 -08:00
Prakash Surya e7e5f78e7b Swap taskq_ent_t with taskqid_t in taskq_thread_t
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.

Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.

Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).

To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
2011-12-16 13:26:54 -08:00
Prakash Surya c2dceb5cd5 Add make rule for building Arch Linux packages
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:

    $ ./configure
    $ make pkg     # Alternatively, one can run 'make arch' as well

on an Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the spl userland utilties and
another for the spl kernel modules. The new packages can then be
installed by running:

    # pacman -U $package.pkg.tar.xz

In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.

NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #68
2011-12-14 16:44:10 -08:00
Prakash Surya 44217f7aad Implement taskq_dispatch_prealloc() interface
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit.  It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq.  This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.

    commit 5aeb94743e3be0c51e86f73096334611ae3a058e
    Author: Garrett D'Amore <garrett@nexenta.com>
    Date:   Wed Jul 27 07:13:44 2011 -0700

    734 taskq_dispatch_prealloc() desired
    943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya 2c02b71b14 Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.

The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.

The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:50 -08:00
Prakash Surya 93806f58a6 Fix usage of MUTEX macro in mutex_enter_nested
A call site of the MUTEX macro had incorrectly placed its closing
parenthesis, causing two parameters to be passed rather than one. This
change moves the misplaced parenthesis to fix the typographical error.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #70
2011-12-13 11:04:21 -08:00
Chris Dunlop 791dc876eb Allow 64-bit timestamps to be set on 64-bit kernels
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them.  Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
2011-12-12 11:06:03 -08:00
Brian Behlendorf 1114ae6ae7 Prepend spl_ to all init/fini functions
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict.  Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #56
2011-11-11 09:18:28 -08:00
Brian Behlendorf fe71c0e567 Linux 3.1 compat, shrink_*cache_memory
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed.  This same task is now accomplished
more cleanly with per super block shrinkers.  This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.

This support has always been entirely optional.  So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.

The downside of this change is that it will prevent the zfs arc
meta data limts from being enforced.  However, the current zfs
implementation in this regard is already flawed and needs to
be reworked.  If the arc needs to enfore a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 19:36:30 -08:00
Brian Behlendorf 0d0b523728 Linux 3.1 compat, vfs_fsync()
Preferentially use the vfs_fsync() function.  This function was
initially introduced in 2.6.29 and took three arguments.  As
of 2.6.35 the dentry argument was dropped from the function.
For older kernels fall back to using file_fsync() which also
took three arguments including the dentry.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 19:36:21 -08:00
Brian Behlendorf 12ff95ff57 Linux 3.1 compat, kern_path_parent()
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules.  As of Linux 3.1 it is now longer easily
available.  To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 16:51:25 -08:00
Gunnar Beutner f5e76dea03 Cleaned up MUTEX() #define
The old define assumed a specific layout of the kmutex_t struct. This
patch makes the macro independent from the actual struct layout.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:59:32 -07:00
Gunnar Beutner 66cdc93b8c Remove the spinlocks for mutex_enter()/mutex_exit()
The m_owner variable is protected by the mutex itself. Reading the variable
is guaranteed to be atomic (due to it being a word-sized reference) and
ACCESS_ONCE() takes care of read cache effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:58:57 -07:00
Gunnar Beutner 3160d4f56b Fix race condition in mutex_exit()
On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex
owner after releasing the mutex. This would cause mutex_owner()
to return an incorrect owner if another thread managed to lock the
mutex before mutex_exit() had a chance to clear the owner.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #167
2011-10-19 09:58:41 -07:00
Gunnar Beutner 763b2f3b57 Fixed invalid resource re-use in file_find()
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. find_file()
incorrectly assumed that file descriptors are globally unique.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
2011-10-11 09:51:51 -07:00
Brian Behlendorf 86fd39f354 Linux 2.6.39 compat, mutex owner
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().

As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct.  This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparision.
2011-06-24 13:00:08 -07:00
Darik Horn 0d54dcb566 Read the /etc/hostid file directly.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.

Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.

Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
2011-06-24 09:58:03 -07:00
Brian Behlendorf bf0c60c060 Add linux compatibility tests
While the splat tests were originally designed to stress test
the Solaris primatives.  I am extending them to include some kernel
compatibility tests.  Certain linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regression have slipped in.

Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory.  These are simply functional tests to
ensure we can call these functions safely.  Checking for correct
behavior is more difficult since other running processes will
influence the behavior.  However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.

Test 3 checks that shrinker functions are being registered and
called correctly.  As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
2011-06-21 14:02:46 -07:00
Brian Behlendorf a55bcaad18 Linux 3.0: Shrinker compatibility
Update the the wrapper macros for the memory shrinker to handle
this 4th API change.  The callback function now takes a
shrink_control structure.  This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
2011-06-21 14:02:39 -07:00
Brian Behlendorf 372c257233 Add TASKQ_NORECLAIM flag
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs.  To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
2011-05-06 15:23:58 -07:00
Brian Behlendorf c1f95c2b94 Correct MAXUID
The uid_t on most systems is in fact and unsigned 32-bit value.
This is almost always correct, however you could compile your
kernel to use an unsigned 16-bit value for uid_t.  In practice
I've never encountered a distribution which does this so I'm
willing to overlook this corner case for now.
2011-04-29 13:58:45 -07:00
Gunnar Beutner 9d4b7c17a0 Renamed 'struct fid' for NFS
Renamed 'struct fid' because its name conflicts with another
struct in the Linux kernel headers.  The fid_t typedef remains
unchanged intentionally.
2011-04-29 12:10:54 -07:00
Brian Behlendorf d837ae395b Fix 32-bit MAXOFFSET_T definition
The correct definition of MAXOFFSET_T under Solaris is in reality
tied to the maximum size of a 'long long' type.  With this in mind
MAXOFFSET_T is now defined as LLONG_MAX which ensures the correct
value is used on both 32-bit and 64-bit systems.
2011-04-22 16:17:13 -07:00
Darik Horn fa6f7d8f9d Import spl_hostid as a module parameter.
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:

  $ modprobe spl spl_hostid=0x12345678

Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.

Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:01 -07:00
Brian Behlendorf 3dfc591ac4 Linux 2.6.39 compat, zlib_deflate_workspacesize()
The function zlib_deflate_workspacesize() now take 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K).  The caller can now specific the windowBits and
memLevel compression parameters to get a smaller workspace.

For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments.  When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through.  When it's not we assume the worst case and
a maximally sized workspace is used.
2011-04-20 14:39:15 -07:00
Brian Behlendorf b1cbc4610c Linux 2.6.39 compat, kern_path_parent()
The path_lookup() function has been renamed to kern_path_parent()
and the flags argument has been removed.  The only behavior now
offered is that of LOOKUP_PARENT.  The spl already always passed
this flag so dropping the flag does not impact us.
2011-04-20 12:30:17 -07:00
Brian Behlendorf 9b0f9079d2 Linux 2.6.39 compat, invalidate_inodes()
To resolve a potiential filesystem corruption issue a second
argument was added to invalidate_inodes().  This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block.  When only the legacy API is available the second
argument will be dropped for compatibility.
2011-04-19 09:08:08 -07:00
Brian Behlendorf e76f4bf11d Add dnlc_reduce_cache() support
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache.  After the entries are
pruned any slabs which they may have been using are reaped.

Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count.  However, in practice this doesn't need to be
exactly correct.  We simply need to reclaim some useful fraction
(but not all) of the cache.  The caller can determine if more
needs to be done.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 83150861e6 Decrease target objects per slab
By decreasing the number of target objects per slab we increase
the likelyhood that a slab can be freed.  This reduces the level
of fragmentation in the slab which has been observed to be a
problem for certain workloads.  The penalty for this is that we
also decrease the speed which need objects can be allocated.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 3336e29cc2 Add slab usage summeries to /proc
One of the most common things you want to know when looking at
the slab is how much memory is being used.  This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.

  slab_kmem_total - Total kmem slab size
  slab_kmem_avail - Alloc'd kmem slab size
  slab_kmem_max   - Max observed kmem slab size
  slab_vmem_total - Total vmem slab size
  slab_vmem_avail - Alloc'd vmem slab size
  slab_vmem_max   - Max observed vmem slab size

NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 495bd532ab Linux shrinker compat
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header.  The caller then can then use a single spl provided
shrinker API which does the right thing for your kernel.

SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
2011-04-06 20:06:03 -07:00
Brian Behlendorf 91cb1d91a4 Add .va_dentry helper
While this extra structure memory does not exist under Solaris
it is needed under Linux to pass the dentry.  This allows the
dentry to be easily instantiated before the inode is unlocked.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep to to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf cb255ae572 Remove default GFP_NOFS allocations
As originally described in commit 82b8c8fa64
this was done to prevent certain deadlocks from occuring in the system.
However, as suspected the price for doing this proved to be too high.
The VM is having a hard time effectively reclaiming memory thus we are
reverting this change.

However, we still need to fundamentally handle the issue.  Under
Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure
a memory allocations will succeed.  We leverage this fact and redefine
KM_PUSHPAGE to include GFP_NOFS.  This ensures that in these common
I/O path we don't trigger additional reclaim.  This minimizes the
change to the Solaris code.
2011-03-19 14:50:39 -07:00
Brian Behlendorf 6788762766 Linux 2.6.31 compat, include linux/seq_file.h
Explicitly include the linux/seq_file.h header in vfs.h.  This header
is required for the sequence handlers and is included indirectly in
newer kernels.
2011-03-07 13:52:00 -08:00
Brian Behlendorf 47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support.  By removing it from the spl we ensure not conflict
with the higher level packages.

This just leaves minimal vnode support for basical manipulation
of files.  This code is does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf a4a1e1ecb4 Add TIMESPEC_OVERFLOW helper
Add the TIMESPEC_OVERFLOW helper macro to allow easy checking
of timespec overflow.
2011-03-02 11:34:43 -08:00
Brian Behlendorf 19c1eb829d Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress.  The test case simply takes
a 128k buffer, it compresses the buffer, it them uncompresses the
buffer, and finally it compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
2011-02-25 16:56:46 -08:00
Brian Behlendorf 5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place.  In reality the current implementation
was non-functional, it just was compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf 5a52a782a0 Use Linux flock struct
Rather than defining our own structure which will conflict with
Linux's version when building 32-bit.  Simply setup a typedef
to always use the correct Linux version for both 32 ad 64-bit
builds.
2011-02-23 14:32:15 -08:00
Brian Behlendorf 914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying it avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf 0aff071d18 Minor policy interface
Simply add the policy function wrappers.  They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and intergrade
as appropriate with Linux.
2011-01-27 16:06:09 -08:00
Brian Behlendorf ef57fb98e4 Add missing headers
Dependent packages require the following missing headers to
simplify compilation.  The headers are basically just stubbed
out with minimal content required.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 3fc97f9335 Add VSA_ACE_* and MAX_ACL_ENTRIES defines
The following flags are use to get the proper mask when getting
and setting ACLs.  I'm hopeful this can all largely go away at
some point.

We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
2011-01-27 16:06:09 -08:00
Brian Behlendorf e2b25f698c Add MAXUID define
For Linux the maximum uid can vary depending on how your kernel
is built.  The Linux kernel still can be compiled with 16 but uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?).  Given that caviot it is reasonably
safe to define the MAXUID as 2147483647.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 5f46a517f1 Add FIGNORECASE define
The FIGNORECASE case define is now needed, place it with the
related flags.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 3e5d3d3285 Add ksid_index_t and ksid_t types
Add the ksid_index_t enum and ksid_t type for use.  These types
are now used by packages which depend on the SPL.
2011-01-27 16:06:09 -08:00
Brian Behlendorf d700637207 Minimal VFS additions
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers.  It also makes some minor fid_t
additions for compatibility.
2011-01-27 16:06:04 -08:00
Brian Behlendorf 647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf bd6ac72b03 Add a few additional vnode #defines
These additional constants now have users in dependant packages.
2011-01-12 11:38:05 -08:00
Brian Behlendorf dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 1b439713f1 FRSYNC Should Use O_SYNC
The Solaris FRSYNC maps most logically to the Linux O_SYNC.  There
is no O_RSYNC on Linux but this wasn't noticed until just recently.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 4295b530ee Add vn_mode_to_vtype/vn_vtype to_mode helpers
Add simple helpers to convert a vnode->v_type to a inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare 3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf 6bf4d76f47 Linux Compat: inode->i_mutex/i_sem
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem.  This avoids the need to have to ugly
up the code with the required #define's at every call site.  At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
2011-01-11 12:14:48 -08:00
Brian Behlendorf 9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional type of entries.  They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is using during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
Ricardo M. Correia c2f997b0b3 Make kmutex_t typesafe in all cases.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.

This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().

We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:25:32 -08:00
Brian Behlendorf 058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition varaibles keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Ned Bass 00ba7ef900 Give ENOTSUP a valid user space error value
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported.  The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs.  As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:

    internal error: Unknown error 524
    Aborted

This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:25:49 -08:00
Brian Behlendorf a50cede388 Linux 2.6.36 compat, wrap RLIM64_INFINITY
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
2010-11-09 13:28:55 -08:00
Brian Behlendorf 8294c69bb7 Clear owner after dropping mutex
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock().  If we set it to NULL this check
fails and the lockdep support is immediately disabled.
2010-11-05 11:52:30 -07:00
Ricardo M. Correia a68d91d770 atomic_*_*_nv() functions need to return the new value atomically.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:03:25 -07:00
Brian Behlendorf a7958f7eef Support custom build directories
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which work slightly differently.  This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
2010-09-05 21:49:05 -07:00
Brian Behlendorf 73fc084e92 Move vendor check to spl-build.m4
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea.  This commit also updates all of the
autoconf style support scripts in config.
2010-09-02 16:12:02 -07:00
Brian Behlendorf 8371f981f1 Add list_link_replace() function
The list_link_replace() function with swap a new item it to the place
of an old item in a list.  It is the callers responsibility to ensure
all lists involved are locked properly.
2010-08-27 14:23:48 -07:00
Brian Behlendorf d85e28ad69 Add MUTEX_NOT_HELD() function
Simply implement the missing MUTEX_NOT_HELD() function using
the !MUTEX_HELD construct.
2010-08-27 14:23:48 -07:00
Brian Behlendorf 2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00
Brian Behlendorf 8dbd3fbd5e Add missing atomic functions
These functions were not previous needed so they were not added.
Now they are so add the full set.

atomic_inc_32_nv()
atomic_dec_32_nv()
atomic_inc_64_nv()
atomic_dec_64_nv()
2010-08-27 13:02:55 -07:00
Li Wei 4be55565fe Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similiar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-12 09:34:33 -07:00
Ned Bass 46aa7b3939 Correctly handle rwsem_is_locked() behavior
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5.  Details can be found here:

https://bugzilla.redhat.com/show_bug.cgi?id=526092

The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked().  The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.

This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present.  The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers.  Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.

Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros.  We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.

Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner().  This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.

The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit.  However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.

The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization.  Since it is no longer used, I removed it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-10 16:43:00 -07:00
Brian Behlendorf 099dc9c2d2 Add uninstall Makefile targets
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.

Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
2010-07-28 14:55:32 -07:00
Brian Behlendorf 287b2fb117 Add Debian and Slackware style packaging via alien
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages.  Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.

As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages.  The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:

  make rpm: Create .rpm packages
  make deb: Create .deb packages
  make tgz: Create .tgz packages
  make pkg: Create the right package type for your distribution

The solution comes with lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:

  1) Will not have the correct dependency information.
  2) Will not not include the kernel version in the release.
  3) Will not handle all differences between distributions.

But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
2010-07-27 15:52:34 -07:00
Brian Behlendorf 10129680f8 Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
2010-07-26 15:47:55 -07:00
Ricardo M. Correia 15b52c083e Fix max_ncpus definition.
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.

However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.

Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs than can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:49:25 -07:00
Ricardo M. Correia 81672c0122 Display DEBUG keyword during module load when --enable-debug is used.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:31:03 -07:00
Ricardo M. Correia 9dd5d138b2 Fix bcopy() to allow memory area overlap
Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:48:53 -07:00
Ricardo M. Correia 22cd0f19b1 Fix compilation error due to undefined ACCESS_ONCE macro.
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:47:52 -07:00
Brian Behlendorf b17edc10a9 Prefix all SPL debug macros with 'S'
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros.  They must also build with DEBUG defined.
2010-07-20 13:30:40 -07:00
Brian Behlendorf 55abb0929e Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC.  It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.
2010-07-20 13:29:35 -07:00
Brian Behlendorf f0ff89fc86 Linux 2.6.35 compat: filp_fsync() dropped 'stuct dentry *'
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h.  This will simplify
handling any further API changes in the future.
2010-07-14 11:40:55 -07:00
Brian Behlendorf 82b8c8fa64 Proposed fix for low memory ZFS deadlocks
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory.  The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation.  Thus we end
up deadlocking.

A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS.  This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().

The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP.  This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced.  The caller still should be able to reclaim memory from
the various slab caches.  We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages.  This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.

We shall see.
2010-07-13 21:30:56 -07:00
Brian Behlendorf a4bfd8ea1b Add __divdi3(), remove __udivdi3() kernel dependency
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The update implementation now passed all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for out purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
2010-07-13 16:44:02 -07:00
Ned Bass f0d8bb26b4 Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT.  Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task.  The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:38 -07:00
Brian Behlendorf c950d1480d Only make compiler warnings fatal with --enable-debug
While in theory I like the idea of compiler warnings always being
fatal.  In practice this causes problems when small harmless errors
cause build failures for end users.  To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure.  This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
2010-06-30 17:05:36 -07:00
Brian Behlendorf 6801b7154c Linux-2.6.33 compat, O_DSYNC flag added
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag.  As of linux-2.6.33 this behavior was
properly split in to O_SYNC and O_DSYNC respectively.
2010-06-30 12:49:39 -07:00
Brian Behlendorf 79a3bf130b Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed.  The upstream code has been updated
to depend entirely on the the procname member.  To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels.  Older kernels are
supported by having it expand to .ctl_name = X just as before.
2010-06-30 12:49:12 -07:00