Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters. However, I've now seen several instances in
OpenSolaris code where they do the following:
cv_broadcast();
cv_destroy();
This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT. This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request. But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.
Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled. This may take some time because each waiter must
acquire the mutex.
This change may have an impact on performance for frequently
created and destroyed condition variables. That however is a price
worth paying it avoid crashing your system. If performance issues
are observed they can be addressed by the caller.
Simply add the policy function wrappers. They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and intergrade
as appropriate with Linux.
Dependent packages require the following missing headers to
simplify compilation. The headers are basically just stubbed
out with minimal content required.
The following flags are use to get the proper mask when getting
and setting ACLs. I'm hopeful this can all largely go away at
some point.
We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
For Linux the maximum uid can vary depending on how your kernel
is built. The Linux kernel still can be compiled with 16 but uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?). Given that caviot it is reasonably
safe to define the MAXUID as 2147483647.
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers. It also makes some minor fid_t
additions for compatibility.
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking. This causes processes
to go in the D state which increases the load average. The load
average is the summation of processes in D state and run queue.
To avoid this it can be desirable to sleep interruptibly. These
processes do not count against the load average but may be woken by
a signal. It is up to the caller to determine why the process
was woken it may be for one of three reasons.
1) cv_signal()/cv_broadcast()
2) the timeout expired
3) a signal was received
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem. This avoids the need to have to ugly
up the code with the required #define's at every call site. At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels. This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.
The majority of the entries in the hash table are for specific tsd
entries. These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.
The hash table contains two additional type of entries. They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create(). It is used to store the address of the destructor function
and it is used as an anchor point. All tsd entries which use the same
key will be linked to this entry. This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.
The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key. The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it. This
list is using during tsd_exit() to ensure all registered destructors
are run for the process. The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid. Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.
This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().
We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For debugging purposes the condition varaibles keep track of the
mutex used during a wait. The idea is to validate that all callers
always use the same mutex. Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe. My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex. However, there is overhead in doing this and it does
appear to be allowed under Solaris.
To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped. This ensures that while the condition variable
is in use the incorrect mutex case is detected. It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.
Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant. The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex. It turns out that was not the case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs. As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:
internal error: Unknown error 524
Aborted
This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock(). If we set it to NULL this check
fails and the lockdep support is immediately disabled.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea. This commit also updates all of the
autoconf style support scripts in config.
The list_link_replace() function with swap a new item it to the place
of an old item in a list. It is the callers responsibility to ensure
all lists involved are locked properly.
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
These functions were not previous needed so they were not added.
Now they are so add the full set.
atomic_inc_32_nv()
atomic_dec_32_nv()
atomic_inc_64_nv()
atomic_dec_64_nv()
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO. In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow. This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The
original mask is then restored at vn_close() time. Hats off to the
loop driver which does something similiar for the same reason.
[...]
shrink_slab+0xdc/0x153
try_to_free_pages+0x1da/0x2d7
__alloc_pages+0x1d7/0x2da
do_generic_mapping_read+0x2c9/0x36f
file_read_actor+0x0/0x145
__generic_file_aio_read+0x14f/0x19b
generic_file_aio_read+0x34/0x39
do_sync_read+0xc7/0x104
vfs_read+0xcb/0x171
:spl:vn_rdwr+0x2b8/0x402
:zfs:vdev_file_io_start+0xad/0xe1
[...]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5. Details can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=526092
The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked(). The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.
This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present. The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers. Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.
Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros. We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.
Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner(). This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.
The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit. However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.
The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization. Since it is no longer used, I removed it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.
Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages. Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.
As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages. The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:
make rpm: Create .rpm packages
make deb: Create .deb packages
make tgz: Create .tgz packages
make pkg: Create the right package type for your distribution
The solution comes with lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:
1) Will not have the correct dependency information.
2) Will not not include the kernel version in the release.
3) Will not handle all differences between distributions.
But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.
However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.
Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs than can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory. The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation. Thus we end
up deadlocking.
A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS. This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().
The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP. This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced. The caller still should be able to reclaim memory from
the various slab caches. We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages. This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.
We shall see.
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this. That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.
Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests. Much to my surprise the unsigned
64-bit division regression tests failed.
This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel. This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed. After some investigation this turned out to be exactly
the case.
Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl. There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.
The update implementation now passed all the unsigned and signed
regression tests. This should be functional, but not fast, which is
good enough for out purposes. If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform. I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.
Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
While in theory I like the idea of compiler warnings always being
fatal. In practice this causes problems when small harmless errors
cause build failures for end users. To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure. This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag. As of linux-2.6.33 this behavior was
properly split in to O_SYNC and O_DSYNC respectively.
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed. The upstream code has been updated
to depend entirely on the the procname member. To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels. Older kernels are
supported by having it expand to .ctl_name = X just as before.
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus. This can result in incorrect behavior from
mutex_owned() and mutex_owner(). This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.
Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.
Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit(). When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().
Finally, improve the mutex regression tests. For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread. This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.
As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored. This is harmless for
the moment but it does need to be implemented at some point.
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().