The feature branch 'fix-taskq' in Linux's ZFS tree changes the taskq_dispatch()
flag from TQ_SLEEP to TQ_NOSLEEP to avoid sleeping in some circumstances.
However, this has the side effect that taskq_dispatch() now may fail, and since
the return code was not even being checked, it could lead to zio's not being
scheduled to execute.
I'm fixing this in a simplistic but not very elegant way, by just looping until
taskq_dispatch() succeeds.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The dmu_object_info_t structures which are roughly 60 bytes
have been moved from the stack to the heap. Every little
bit helps and there's absolutely no reason these temporary
working variables need to be on the stack.
During module load we could deadlock because the zvol_init()
callpath took the spa_namespace_lock before the zvol_state_lock.
The rest of the zvol code takes the locks in the opposite order.
In particular, I observed the following deadlock cause by the
lock inversion.
I've fixed the ording by creating an unlocked version of
zvol_create_minor and zvol_remove_minor. This allows me to
take the zvol_state_lock before the spa_namespace_lock in
zvol_cr_minors_common and simply call the unlocked version.
With the update to onnv_141 how minor devices were created and
removed for ZVOL was substantially changed. The updated system
is much more tightly integrated with Solaris's /dev/ filesystem.
This is great for Solaris but bad for Linux.
On the kernel side the ZFS_IOC_{CREATE,REMOVE}_MINOR ioctl
entry points have been re-added. They now call directly in
to the ZVOL to create the needed minor node and add the sysfs
entried for udev.
Also as part of this change I've decided it would really be
best if all the zvols were in a /dev/zvol directory like on
Solaris. Organizationally this makes sense and on the code
side it allows us to know a block device is a zvol simply by
where it is located in /dev/. Unless Solaris there still is
to ./dsk or ./rdsk as part of the path.
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.
Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
We should just make a best effort when removing zvol minors
from the system during destroy. Failure here should never
prevent the pool from being destroyed. For example, if a
pool is so heavily damaged it cannot be opened (part of the
minor removal) we still want to be able to destroy it. The
worst case here is we may orphan a few minors but even that
is unlikely and not particularly harmful.
This was caught under Debian Lenny builds because they are one of
the few/only current distros based on a 2.6.26 kernel. In one
of the build conditionals I accidently failed to assign the
return code to rc before returning.
Devices were only being created at module load time or when a
dataset was created. Similiar devices were not always being
removed at all the correct times. This patch updates all the
places where devices should either be created or removed. I'm
reasonably sure I got them all but if theres a case I missed
we can catch it with a follow up patch.
module load/unload
zfs create/remove
zpool import/export
zpool destroy
This patch also adds a simple regression test to zconfig.sh
to ensure zpool import/export is basically working properly.
This test specifically checks that devices are created
properly, removed after export, created after import, and
removed as a consequence of a zpool destroy.
With the recent ZVOL update zvol_set_volblocksize() accidentally
lost its mutex_exit(). This was noticed when zvol_create_minor()
blocked on the zvol_state_lock while it was holding the
spa_namespace_lock(). This caused everything to get blocked
up and hung the system.
Due to limited stack space recursive functions are frowned upon in
the Linux kernel. However, they often are the most elegant solution
to a problem. The following code preserves the recursive function
traverse_visitbp() but moves the local variables AND function
arguments to the heap to minimize the stack frame size. Enough
space is initially allocated on the stack for 20 levels of recursion.
This change does ugly-up-the-code but it reduces the worst case
usage from roughly 4160 bytes to 960 bytes on x86_64 archs.
These changes are now taken care of by the fix-stack-traverse_impl
topic branch which not only solves the uninit problem but also
moves these locals off the stack and on to the heap.
Move dsl_dataset_t local variable from the stack to the heap.
This reduces the stack usage of this function from 2048 bytes
to 176 bytes for x84_64 arches.
Much to my surprise bcopy() under Linux appears to copy the data in
word sized chunks. It does the right thing but if you buffer is not
a multiple of the word size you will be reading past the end of your
buffer. Or at least that is what valgrind is reporting. We should
be using mempcy() anyway on Linux so replace bcopy() with memcpy()
to resolve the issue.
==305== Thread 211:
==305== Invalid read of size 8
==305== at 0x3BCD28357D: _wordcopy_fwd_dest_aligned (in /lib64/libc-2.11.1.so)
==305== by 0x3BCD282B05: bcopy (in /lib64/libc-2.11.1.so)
==305== by 0x58D7FEF: dmu_write (dmu.c:730)
==305== by 0x591C942: spa_history_write (spa_history.c:165)
==305== by 0x591D255: spa_history_log_sync (spa_history.c:277)
==305== by 0x591D545: log_internal (spa_history.c:450)
==305== by 0x591D5EC: spa_history_log_internal (spa_history.c:475)
==305== by 0x5902319: dsl_prop_set_sync (dsl_prop.c:707)
==305== by 0x5906A7D: dsl_sync_task_group_sync (dsl_synctask.c:199)
==305== by 0x58FF4EC: dsl_pool_sync (dsl_pool.c:376)
==305== by 0x591744C: spa_sync (spa.c:5365)
==305== by 0x5922C85: txg_sync_thread (txg.c:414)
This may not strictly be needed but it does keep gcc happy. We
should keep our eye on this though if the extra bcopy significantly
impacts performance. It may.
The following cleanup was missed in the first pass when the ZVOL
implementation was updated. An extra instance of a zvol_state_t
was removed from the stack and the error handling was simplified.
There are cases where under Linux it is not safe to sleep in
taskq_dispatch(). Rather than adding Linux specific code to
detect these cases I opted to keep it simple and just never
allow a sleep here. The impact of this should be minimal.
I missed a instanse of removing the & operator when reducing the
stack usage in this function. This unfortunately doesn't cause
a compile warning but it is does cause ztest failures. Anyway,
update the topic branch to correct this mistake.