Commit Graph

351 Commits

Author SHA1 Message Date
Brian Behlendorf 3e904f40b4 Mutex ASSERT on self deadlock
Generate an assertion if we're going to deadlock the system by
attempting to acquire a mutex the process is already holding.

There are currently no known instances of this under normal
operation, but it _might_ be possible when using a ZVOL as a
swap device.  I want to ensure we catch this immediately if it
were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf eb0f407a2b Add PF_NOFS debugging flag
PF_NOFS is a per-process debug flag which is set in current->flags to
detect when a process is performing an unsafe allocation.  All tasks
with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because
if they enter direct reclaim and initiate I/O they may deadlock.

When debugging is disabled, any incorrect usage will be detected and
a call stack with a warning will be printed to the console.  The flags
will then be automatically corrected to allow for safe execution.  If
debugging is enabled this will be treated as a fatal condition.

To avoid any risk of conflicting with the existing PF_ flags.  The
PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit.  Only when
CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused,
will the PF_NOFS bit be valid.  Happily, most existing distributions
ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf d47e664ad4 Revert "Add TASKQ_NORECLAIM flag"
This reverts commit 372c257233.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't prevent the posibility
of the worker thread from deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  Is they accumulate on a system and the list
grows freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.

Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These value should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock.   This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf c638e9ad04 Remove autotools products
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#718
2012-08-27 11:46:23 -07:00
Prakash Surya 45324c7c41 Add kpreempt_[dis|en]able macros in <sys/disp.h>
To support preempt enabled kernels in ZFS on Linux, there are a couple
places where the ZFS code needs to disable interrupts. This change adds
the Solaris preempt functions and maps them to the equivalent ZFS
functions, allowing the ZFS to make use of them.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 15:18:38 -07:00
Prakash Surya 08850eddcb Avoid calling smp_processor_id in spl_magazine_age
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.

The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 09:43:22 -07:00
Richard Yao 6576a1a70d Fix incorrect type in spl_kmem_cache_set_move() parameter
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2012-08-01 16:35:18 -07:00
Richard Yao 973e8269bd Constify memory management functions
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #131
2012-07-03 16:07:27 -07:00
Brian Behlendorf 44e406d712 PowerPC Compatibility
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #119
2012-07-02 09:33:09 -07:00
Brian Behlendorf 38d31a1e57 Remove Solaris module emulation
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #109
2012-05-18 13:57:44 -07:00
Richard Yao f90096c905 Modify KM_PUSHPAGE to use GFP_NOIO instead of GFP_NOFS
The resolution of issue #31 made KM_PUSHPAGE imply GFP_NOFS.  This
was done to prevent situations where filesystem operations which are
holding locks enter direct reclaim and attempt to reaquire those
same locks.  This clearly will result in a deadlock.

This works for datasets which are implemented in terms for filesystem
operations.  But unfortunately, swapping to a zvol will encounter
many of the same deadlocks and GFP_NOFS will not prevent this.  As
such, it is appropriate to extend KM_PUSHPAGE to use the broader
GFP_NOIO mask to handle these non-filesystem cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#342
Closes #105
2012-05-07 12:05:27 -07:00
Prakash Surya cef7605c34 Throttle number of freed slabs based on nr_to_scan
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.

Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #101
2012-05-07 11:46:15 -07:00
Jorgen Lundman cb75844e85 Define the needed ISA types for ARM
Add the minimum required ISA types to support the ARM architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 09:56:15 -07:00
Brian Behlendorf b29012b999 Remove condition variable names
Long ago I added support to the spl for condition variable names
because I thought they might be needed.  It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.

  cv_init(9F)
    Parameters
      name - Descriptive string. This is obsolete and should be
             NULL. (Non-NULL strings are legal, but they're a
             waste of kernel memory.)

Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-06 12:06:19 -07:00
Brian Behlendorf 0835057ee7 Add SPL_META_RELEASE to module load/unload messages
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:11:50 -07:00
Brian Behlendorf 9a8b7a7458 Add basic dynamic kstat support
Add the bare minimum functionality to support dynamic kstats.  A
complete kstat implementation should be done as part of issue #84.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
2012-02-02 11:28:00 -08:00
Brian Behlendorf 4b2220f0b9 Add --enable-debug-log configure option
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s.  This patch clarifies things
by cleanly breaking the two subsystem apart.  The result of this
is the following behavior.

--enable-debug      - Enable/disable code wrapped in ASSERT()s.
--disable-debug       ASSERT()s are used to check invariants and
                      are never required for correct operation.
                      They are disabled by default because they
                      may impact performance.

--enable-debug-log  - Enable/disable the debug log infrastructure.
--disable-debug-log   This infrastructure allows the spl code and
                      its consumer to log messages to an in-kernel
                      log.  The granularity of the logging can be
                      controlled by a debug mask.  By default the
                      mask disables most debug messages resulting
                      in a negligible performance impact.  Because
                      of this the debug log is enabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:27:54 -08:00
Darik Horn 588d900433 Linux 3.2 compat: rw_semaphore.wait_lock is raw
The wait_lock member of the rw_semaphore struct became a raw_spinlock_t
in Linux 3.2 at torvalds/linux@ddb6c9b58a.

Wrap spin_lock_* function calls in a new spl_rwsem_* interface to
ensure type safety if raw_spinlock_t becomes architecture specific,
and to satisfy these compiler warnings:

  warning: passing argument 1 of ‘spinlock_check’
    from incompatible pointer type [enabled by default]
  note: expected ‘struct spinlock_t *’
    but argument is of type ‘struct raw_spinlock_t *’

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #76
Closes: zfsonlinux/zfs#463
2012-01-11 16:28:05 -08:00
Prakash Surya 8f2503e0af Store copy of tqent_flags prior to servicing task
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.

In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).

In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.

Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #71
2011-12-16 16:54:00 -08:00
Prakash Surya e7e5f78e7b Swap taskq_ent_t with taskqid_t in taskq_thread_t
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.

Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.

Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).

To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
2011-12-16 13:26:54 -08:00
Prakash Surya 44217f7aad Implement taskq_dispatch_prealloc() interface
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit.  It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq.  This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.

    commit 5aeb94743e3be0c51e86f73096334611ae3a058e
    Author: Garrett D'Amore <garrett@nexenta.com>
    Date:   Wed Jul 27 07:13:44 2011 -0700

    734 taskq_dispatch_prealloc() desired
    943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya 2c02b71b14 Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.

The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.

The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:50 -08:00
Prakash Surya 93806f58a6 Fix usage of MUTEX macro in mutex_enter_nested
A call site of the MUTEX macro had incorrectly placed its closing
parenthesis, causing two parameters to be passed rather than one. This
change moves the misplaced parenthesis to fix the typographical error.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #70
2011-12-13 11:04:21 -08:00
Chris Dunlop 791dc876eb Allow 64-bit timestamps to be set on 64-bit kernels
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them.  Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
2011-12-12 11:06:03 -08:00
Brian Behlendorf 1114ae6ae7 Prepend spl_ to all init/fini functions
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict.  Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #56
2011-11-11 09:18:28 -08:00
Brian Behlendorf 12ff95ff57 Linux 3.1 compat, kern_path_parent()
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules.  As of Linux 3.1 it is now longer easily
available.  To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 16:51:25 -08:00
Gunnar Beutner f5e76dea03 Cleaned up MUTEX() #define
The old define assumed a specific layout of the kmutex_t struct. This
patch makes the macro independent from the actual struct layout.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:59:32 -07:00
Gunnar Beutner 66cdc93b8c Remove the spinlocks for mutex_enter()/mutex_exit()
The m_owner variable is protected by the mutex itself. Reading the variable
is guaranteed to be atomic (due to it being a word-sized reference) and
ACCESS_ONCE() takes care of read cache effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:58:57 -07:00
Gunnar Beutner 3160d4f56b Fix race condition in mutex_exit()
On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex
owner after releasing the mutex. This would cause mutex_owner()
to return an incorrect owner if another thread managed to lock the
mutex before mutex_exit() had a chance to clear the owner.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #167
2011-10-19 09:58:41 -07:00
Gunnar Beutner 763b2f3b57 Fixed invalid resource re-use in file_find()
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. find_file()
incorrectly assumed that file descriptors are globally unique.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
2011-10-11 09:51:51 -07:00
Brian Behlendorf 86fd39f354 Linux 2.6.39 compat, mutex owner
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().

As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct.  This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparision.
2011-06-24 13:00:08 -07:00
Darik Horn 0d54dcb566 Read the /etc/hostid file directly.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.

Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.

Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
2011-06-24 09:58:03 -07:00
Brian Behlendorf 372c257233 Add TASKQ_NORECLAIM flag
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs.  To support
this the TASKQ_NORECLAIM flags has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
2011-05-06 15:23:58 -07:00
Brian Behlendorf c1f95c2b94 Correct MAXUID
The uid_t on most systems is in fact and unsigned 32-bit value.
This is almost always correct, however you could compile your
kernel to use an unsigned 16-bit value for uid_t.  In practice
I've never encountered a distribution which does this so I'm
willing to overlook this corner case for now.
2011-04-29 13:58:45 -07:00
Gunnar Beutner 9d4b7c17a0 Renamed 'struct fid' for NFS
Renamed 'struct fid' because its name conflicts with another
struct in the Linux kernel headers.  The fid_t typedef remains
unchanged intentionally.
2011-04-29 12:10:54 -07:00
Brian Behlendorf d837ae395b Fix 32-bit MAXOFFSET_T definition
The correct definition of MAXOFFSET_T under Solaris is in reality
tied to the maximum size of a 'long long' type.  With this in mind
MAXOFFSET_T is now defined as LLONG_MAX which ensures the correct
value is used on both 32-bit and 64-bit systems.
2011-04-22 16:17:13 -07:00
Darik Horn fa6f7d8f9d Import spl_hostid as a module parameter.
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:

  $ modprobe spl spl_hostid=0x12345678

Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.

Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:01 -07:00
Brian Behlendorf 3dfc591ac4 Linux 2.6.39 compat, zlib_deflate_workspacesize()
The function zlib_deflate_workspacesize() now take 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K).  The caller can now specific the windowBits and
memLevel compression parameters to get a smaller workspace.

For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments.  When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through.  When it's not we assume the worst case and
a maximally sized workspace is used.
2011-04-20 14:39:15 -07:00
Brian Behlendorf e76f4bf11d Add dnlc_reduce_cache() support
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache.  After the entries are
pruned any slabs which they may have been using are reaped.

Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count.  However, in practice this doesn't need to be
exactly correct.  We simply need to reclaim some useful fraction
(but not all) of the cache.  The caller can determine if more
needs to be done.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 83150861e6 Decrease target objects per slab
By decreasing the number of target objects per slab we increase
the likelyhood that a slab can be freed.  This reduces the level
of fragmentation in the slab which has been observed to be a
problem for certain workloads.  The penalty for this is that we
also decrease the speed which need objects can be allocated.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 3336e29cc2 Add slab usage summeries to /proc
One of the most common things you want to know when looking at
the slab is how much memory is being used.  This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.

  slab_kmem_total - Total kmem slab size
  slab_kmem_avail - Alloc'd kmem slab size
  slab_kmem_max   - Max observed kmem slab size
  slab_vmem_total - Total vmem slab size
  slab_vmem_avail - Alloc'd vmem slab size
  slab_vmem_max   - Max observed vmem slab size

NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 91cb1d91a4 Add .va_dentry helper
While this extra structure memory does not exist under Solaris
it is needed under Linux to pass the dentry.  This allows the
dentry to be easily instantiated before the inode is unlocked.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep to to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf cb255ae572 Remove default GFP_NOFS allocations
As originally described in commit 82b8c8fa64
this was done to prevent certain deadlocks from occuring in the system.
However, as suspected the price for doing this proved to be too high.
The VM is having a hard time effectively reclaiming memory thus we are
reverting this change.

However, we still need to fundamentally handle the issue.  Under
Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure
a memory allocations will succeed.  We leverage this fact and redefine
KM_PUSHPAGE to include GFP_NOFS.  This ensures that in these common
I/O path we don't trigger additional reclaim.  This minimizes the
change to the Solaris code.
2011-03-19 14:50:39 -07:00
Brian Behlendorf 6788762766 Linux 2.6.31 compat, include linux/seq_file.h
Explicitly include the linux/seq_file.h header in vfs.h.  This header
is required for the sequence handlers and is included indirectly in
newer kernels.
2011-03-07 13:52:00 -08:00
Brian Behlendorf 47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support.  By removing it from the spl we ensure not conflict
with the higher level packages.

This just leaves minimal vnode support for basical manipulation
of files.  This code is does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf a4a1e1ecb4 Add TIMESPEC_OVERFLOW helper
Add the TIMESPEC_OVERFLOW helper macro to allow easy checking
of timespec overflow.
2011-03-02 11:34:43 -08:00
Brian Behlendorf 5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place.  In reality the current implementation
was non-functional, it just was compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf 5a52a782a0 Use Linux flock struct
Rather than defining our own structure which will conflict with
Linux's version when building 32-bit.  Simply setup a typedef
to always use the correct Linux version for both 32 ad 64-bit
builds.
2011-02-23 14:32:15 -08:00
Brian Behlendorf d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying it avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf 0aff071d18 Minor policy interface
Simply add the policy function wrappers.  They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and intergrade
as appropriate with Linux.
2011-01-27 16:06:09 -08:00
Brian Behlendorf ef57fb98e4 Add missing headers
Dependent packages require the following missing headers to
simplify compilation.  The headers are basically just stubbed
out with minimal content required.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 3fc97f9335 Add VSA_ACE_* and MAX_ACL_ENTRIES defines
The following flags are use to get the proper mask when getting
and setting ACLs.  I'm hopeful this can all largely go away at
some point.

We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
2011-01-27 16:06:09 -08:00
Brian Behlendorf e2b25f698c Add MAXUID define
For Linux the maximum uid can vary depending on how your kernel
is built.  The Linux kernel still can be compiled with 16 but uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?).  Given that caviot it is reasonably
safe to define the MAXUID as 2147483647.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 5f46a517f1 Add FIGNORECASE define
The FIGNORECASE case define is now needed, place it with the
related flags.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 3e5d3d3285 Add ksid_index_t and ksid_t types
Add the ksid_index_t enum and ksid_t type for use.  These types
are now used by packages which depend on the SPL.
2011-01-27 16:06:09 -08:00
Brian Behlendorf d700637207 Minimal VFS additions
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers.  It also makes some minor fid_t
additions for compatibility.
2011-01-27 16:06:04 -08:00
Brian Behlendorf 647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf bd6ac72b03 Add a few additional vnode #defines
These additional constants now have users in dependant packages.
2011-01-12 11:38:05 -08:00
Brian Behlendorf dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 1b439713f1 FRSYNC Should Use O_SYNC
The Solaris FRSYNC maps most logically to the Linux O_SYNC.  There
is no O_RSYNC on Linux but this wasn't noticed until just recently.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 4295b530ee Add vn_mode_to_vtype/vn_vtype to_mode helpers
Add simple helpers to convert a vnode->v_type to a inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare 3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf 9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional type of entries.  They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is using during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
Ricardo M. Correia c2f997b0b3 Make kmutex_t typesafe in all cases.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.

This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().

We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:25:32 -08:00
Brian Behlendorf 058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition varaibles keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Ned Bass 00ba7ef900 Give ENOTSUP a valid user space error value
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported.  The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs.  As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:

    internal error: Unknown error 524
    Aborted

This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:25:49 -08:00
Brian Behlendorf a50cede388 Linux 2.6.36 compat, wrap RLIM64_INFINITY
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
2010-11-09 13:28:55 -08:00
Brian Behlendorf 8294c69bb7 Clear owner after dropping mutex
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock().  If we set it to NULL this check
fails and the lockdep support is immediately disabled.
2010-11-05 11:52:30 -07:00
Ricardo M. Correia a68d91d770 atomic_*_*_nv() functions need to return the new value atomically.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:03:25 -07:00
Brian Behlendorf 8371f981f1 Add list_link_replace() function
The list_link_replace() function with swap a new item it to the place
of an old item in a list.  It is the callers responsibility to ensure
all lists involved are locked properly.
2010-08-27 14:23:48 -07:00
Brian Behlendorf d85e28ad69 Add MUTEX_NOT_HELD() function
Simply implement the missing MUTEX_NOT_HELD() function using
the !MUTEX_HELD construct.
2010-08-27 14:23:48 -07:00
Brian Behlendorf 2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00
Brian Behlendorf 8dbd3fbd5e Add missing atomic functions
These functions were not previous needed so they were not added.
Now they are so add the full set.

atomic_inc_32_nv()
atomic_dec_32_nv()
atomic_inc_64_nv()
atomic_dec_64_nv()
2010-08-27 13:02:55 -07:00
Li Wei 4be55565fe Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similiar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-12 09:34:33 -07:00
Ned Bass 46aa7b3939 Correctly handle rwsem_is_locked() behavior
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5.  Details can be found here:

https://bugzilla.redhat.com/show_bug.cgi?id=526092

The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked().  The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.

This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present.  The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers.  Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.

Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros.  We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.

Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner().  This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.

The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit.  However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.

The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization.  Since it is no longer used, I removed it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-10 16:43:00 -07:00
Brian Behlendorf 10129680f8 Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
2010-07-26 15:47:55 -07:00
Ricardo M. Correia 15b52c083e Fix max_ncpus definition.
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.

However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.

Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs than can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:49:25 -07:00
Ricardo M. Correia 81672c0122 Display DEBUG keyword during module load when --enable-debug is used.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:31:03 -07:00
Ricardo M. Correia 9dd5d138b2 Fix bcopy() to allow memory area overlap
Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:48:53 -07:00
Ricardo M. Correia 22cd0f19b1 Fix compilation error due to undefined ACCESS_ONCE macro.
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:47:52 -07:00
Brian Behlendorf 55abb0929e Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC.  It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.
2010-07-20 13:29:35 -07:00
Brian Behlendorf 82b8c8fa64 Proposed fix for low memory ZFS deadlocks
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory.  The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation.  Thus we end
up deadlocking.

A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS.  This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().

The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP.  This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced.  The caller still should be able to reclaim memory from
the various slab caches.  We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages.  This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.

We shall see.
2010-07-13 21:30:56 -07:00
Brian Behlendorf a4bfd8ea1b Add __divdi3(), remove __udivdi3() kernel dependency
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The update implementation now passed all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for out purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
2010-07-13 16:44:02 -07:00
Ned Bass f0d8bb26b4 Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT.  Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task.  The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:38 -07:00
Brian Behlendorf 6801b7154c Linux-2.6.33 compat, O_DSYNC flag added
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag.  As of linux-2.6.33 this behavior was
properly split in to O_SYNC and O_DSYNC respectively.
2010-06-30 12:49:39 -07:00
Brian Behlendorf ede0bdffb6 Treat mutex->owner as volatile
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus.  This can result in incorrect behavior from
mutex_owned() and mutex_owner().  This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.

Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.

Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit().  When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().

Finally, improve the mutex regression tests.  For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread.  This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.

As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
2010-06-28 16:02:57 -07:00
Brian Behlendorf 5be4767ae1 Accept but ignore TASKQ_DC_BATCH and TQ_FRONT
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored.  This is harmless for
the moment but it does need to be implemented at some point.
2010-06-28 11:39:43 -07:00
Brian Behlendorf e6de04b73c Add kmem_vasprintf function
We might as well have both asprintf() variants.  This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
2010-06-24 09:41:59 -07:00
Brian Behlendorf 438683c0a9 Revert "Support TQ_FRONT flag used by taskq_dispatch()"
This reverts commit eb12b3782c.
2010-06-21 10:19:44 -07:00
Brian Behlendorf def465ad4b Include kstat.h from kmem.h
It turns out Solaris incidentally includes kstat.h from kmem.h.  As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h.  To make like easier for everyone I do the same.
2010-06-14 14:18:48 -07:00
Brian Behlendorf eb12b3782c Support TQ_FRONT flag used by taskq_dispatch()
Allow taskq_dispatch() to insert work items at the head of the
queue instead of just the tail by passing the TQ_FRONT flag.
2010-06-11 15:57:25 -07:00
Brian Behlendorf 32c6147dee Minor cleanup and Solaris API additions.
Minor formatting cleanups.

API additions:
* {U}INT8_{MIN,MAX}, {U}INT16_{MIN,MAX} macros.
* id_t typedef
* ddi_get_lbolt(), ddi_get_lbolt64() functions.
2010-06-11 15:57:25 -07:00
Brian Behlendorf b868e22f05 Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup().  They are all implemented as a thin layer which just calls
their Linux counterparts.  As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels.  If the kernel does
not provide it then spl-generic implements it.

Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
2010-06-11 15:57:25 -07:00
Brian Behlendorf bb1bb2c4c4 Add xuio_* structures and typedefs.
Add the basic xuio structure and typedefs for Solaris style zero copy.
There's a decent chance this will not be the way I handle this on Linux
but providing the basic types simplifies things for now.
2010-06-11 15:57:25 -07:00
Brian Behlendorf 750a7101f8 Stub out additional missing headers 2010-06-11 15:57:25 -07:00
Brian Behlendorf ae4c36adce Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes.  This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris.  Minor updates were
required to all the call sites where it was included of course.
2010-06-11 15:57:25 -07:00
Brian Behlendorf 32f5faff69 Simplify rwlock implementation.
Remove RW_COUNT() from the rwlock implementation.  The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock.  While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes.  In
short it hasn't been worth the trouble.

With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions.  As for rw_upgrade() it remains only implemented for
the generic rwsem implemenation.  It remains to be determined if its
worth the effort of adding a custom implementation for each arch.
2010-05-20 14:20:34 -07:00
Brian Behlendorf 23d91792ef Use KM_NODEBUG macro in preference to __GFP_NOWARN. 2010-05-20 14:16:59 -07:00