Deep recursive call chains are contributing to segfaults in ztest due to
heavy stack use. Inlining zio_execute() helps reduce the stack depth of
the zio_notify_parent() -> zio_execute() -> zio_wait() recursive cycle.
I am no longer seeing ztest segfaults in this code path with this change.
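Not the actual diff, but the general shape of forcing gcc to fold a helper into its caller looks roughly like this (a minimal sketch; the function name, body, and the exact attribute spelling used in the tree may differ):

    /*
     * Sketch: mark the helper so gcc always inlines it, folding its
     * stack frame into the caller instead of adding another level
     * to the recursive call chain.
     */
    static inline __attribute__((always_inline)) void
    zio_execute_helper(zio_t *zio)
    {
            /* pipeline stages run here; no separate frame remains */
    }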
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Deep recursive call chains are contributing to segfaults in ztest due
to heavy stack use. Inlining dbuf_findbp() helps reduce the stack depth
of the dbuf_findbp() -> dbuf_hold_impl() cycle. However, segfaults are
still occurring in this code path, so further reductions are still needed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Deep recursive call chains are contributing to segfaults in ztest due
to heavy stack use. Inlining zio_notify_parent() helps reduce the
stack depth of the zio_notify_parent() -> zio_execute() -> zio_done()
recursive cycle. I am no longer seeing ztest segfaults in this code
path with this change combined with the zio_done() stack reduction in
the previous commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spa_load function may call itself recursively through
the spa_load_impl function. This call path of spa_load->
spa_load_impl->spa_load->spa_load_impl takes 640 bytes of
stack. By forcing spa_load_impl to be inlined as part of
spa_load this can be reduced to 448 bytes, for a savings of
192 bytes.
The feature branch 'fix-taskq' in Linux's ZFS tree changes the taskq_dispatch()
flag from TQ_SLEEP to TQ_NOSLEEP to avoid sleeping in some circumstances.
However, this has the side effect that taskq_dispatch() now may fail, and since
the return code was not even being checked, it could lead to zios not being
scheduled to execute.
I'm fixing this in a simplistic but not very elegant way, by just looping until
taskq_dispatch() succeeds.
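The workaround amounts to something like the following at each dispatch site (a sketch only; tq and the dispatched function and argument stand in for the real call):

    /*
     * Sketch: retry the dispatch until the taskq accepts it rather
     * than silently dropping the zio when TQ_NOSLEEP fails.
     */
    while (taskq_dispatch(tq, (task_func_t *)zio_execute,
        zio, TQ_NOSLEEP) == 0)
            ;	/* spin until the dispatch succeeds */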
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The dmu_object_info_t structures which are roughly 60 bytes
have been moved from the stack to the heap. Every little
bit helps and there's absolutely no reason these temporary
working variables need to be on the stack.
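The conversion is the usual stack-to-heap pattern, roughly as follows (a sketch; the surrounding call site is illustrative):

    /*
     * Sketch: allocate the dmu_object_info_t for the duration of the
     * function instead of declaring it on the stack.
     */
    dmu_object_info_t *doi;

    doi = kmem_alloc(sizeof (dmu_object_info_t), KM_SLEEP);
    dmu_object_info_from_db(db, doi);
    /* ... use doi->doi_data_block_size, etc ... */
    kmem_free(doi, sizeof (dmu_object_info_t));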
During module load we could deadlock because the zvol_init()
callpath took the spa_namespace_lock before the zvol_state_lock.
The rest of the zvol code takes the locks in the opposite order.
In particular, I observed the following deadlock caused by the
lock inversion.
I've fixed the ordering by creating an unlocked version of
zvol_create_minor and zvol_remove_minor. This allows me to
take the zvol_state_lock before the spa_namespace_lock in
zvol_cr_minors_common and simply call the unlocked version.
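In rough terms the split looks like this (a sketch only; the real functions carry the full creation logic and error handling, and the double-underscore name is just illustrative):

    /*
     * Sketch: the unlocked version assumes zvol_state_lock is already
     * held; the locked wrapper keeps the old interface.
     */
    static int
    __zvol_create_minor(const char *name)
    {
            ASSERT(MUTEX_HELD(&zvol_state_lock));
            /* ... allocate the zvol_state_t, create the block device ... */
            return (0);
    }

    int
    zvol_create_minor(const char *name)
    {
            int error;

            mutex_enter(&zvol_state_lock);
            error = __zvol_create_minor(name);
            mutex_exit(&zvol_state_lock);

            return (error);
    }

This lets the common code take zvol_state_lock first, then spa_namespace_lock, and call the unlocked version directly.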
With the update to onnv_141 how minor devices were created and
removed for ZVOL was substantially changed. The updated system
is much more tightly integrated with Solaris's /dev/ filesystem.
This is great for Solaris but bad for Linux.
On the kernel side the ZFS_IOC_{CREATE,REMOVE}_MINOR ioctl
entry points have been re-added. They now call directly into
the ZVOL code to create the needed minor node and add the sysfs
entries for udev.
Also as part of this change I've decided it would really be
best if all the zvols were in a /dev/zvol directory like on
Solaris. Organizationally this makes sense and on the code
side it allows us to know a block device is a zvol simply by
where it is located in /dev/. Unlike Solaris, there is no ./dsk
or ./rdsk component as part of the path.
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.
Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
We should just make a best effort when removing zvol minors
from the system during destroy. Failure here should never
prevent the pool from being destroyed. For example, if a
pool is so heavily damaged it cannot be opened (part of the
minor removal) we still want to be able to destroy it. The
worst case here is we may orphan a few minors but even that
is unlikely and not particularly harmful.
This was caught under Debian Lenny builds because they are one of
the few/only current distros based on a 2.6.26 kernel. In one
of the build conditionals I accidentally failed to assign the
return code to rc before returning.
Devices were only being created at module load time or when a
dataset was created. Similarly, devices were not always being
removed at all the correct times. This patch updates all the
places where devices should either be created or removed. I'm
reasonably sure I got them all, but if there's a case I missed
we can catch it with a follow up patch.
module load/unload
zfs create/remove
zpool import/export
zpool destroy
This patch also adds a simple regression test to zconfig.sh
to ensure zpool import/export is basically working properly.
This test specifically checks that devices are created
properly, removed after export, created after import, and
removed as a consequence of a zpool destroy.
With the recent ZVOL update zvol_set_volblocksize() accidentally
lost its mutex_exit(). This was noticed when zvol_create_minor()
blocked on the zvol_state_lock while it was holding the
spa_namespace_lock. This caused everything to get blocked
up and hung the system.
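The fix is simply restoring the unlock on the exit path, i.e. the standard pattern shown below (a sketch; the function and variable names are illustrative):

    /*
     * Sketch: every path out of zvol_set_volblocksize() must drop
     * zvol_state_lock, typically through a common exit label.
     */
    mutex_enter(&zvol_state_lock);
    zv = zvol_find_by_name(name);
    if (zv == NULL) {
            error = ENXIO;
            goto out;
    }
    /* ... update the volume block size ... */
    out:
            mutex_exit(&zvol_state_lock);
            return (error);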
Due to limited stack space recursive functions are frowned upon in
the Linux kernel. However, they often are the most elegant solution
to a problem. The following code preserves the recursive function
traverse_visitbp() but moves the local variables AND function
arguments to the heap to minimize the stack frame size. Enough
space is initially allocated on the stack for 20 levels of recursion.
This change does ugly up the code but it reduces the worst case
usage from roughly 4160 bytes to 960 bytes on x86_64 archs.
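Stripped to its essentials, the pattern bundles what used to be stack locals and arguments into a per-level structure (a sketch; the structure name, fields, and depth handling are illustrative rather than the actual code):

    /*
     * Sketch: one frame per recursion level, allocated up front, so
     * each call works out of its slot instead of its stack frame.
     */
    typedef struct traverse_frame {
            const dnode_phys_t *tf_dnp;
            blkptr_t *tf_bp;
            zbookmark_t tf_zb;
    } traverse_frame_t;

    #define	TRAVERSE_LEVELS	20

    traverse_frame_t *frames;

    frames = kmem_zalloc(sizeof (traverse_frame_t) * TRAVERSE_LEVELS,
        KM_SLEEP);
    /* recursion at depth d works out of &frames[d] instead of locals */
    kmem_free(frames, sizeof (traverse_frame_t) * TRAVERSE_LEVELS);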
These changes are now taken care of by the fix-stack-traverse_impl
topic branch which not only solves the uninit problem but also
moves these locals off the stack and on to the heap.
Move dsl_dataset_t local variable from the stack to the heap.
This reduces the stack usage of this function from 2048 bytes
to 176 bytes for x86_64 arches.
Much to my surprise bcopy() under Linux appears to copy the data in
word sized chunks. It does the right thing but if your buffer is not
a multiple of the word size you will be reading past the end of your
buffer. Or at least that is what valgrind is reporting. We should
be using memcpy() anyway on Linux so replace bcopy() with memcpy()
to resolve the issue.
==305== Thread 211:
==305== Invalid read of size 8
==305== at 0x3BCD28357D: _wordcopy_fwd_dest_aligned (in /lib64/libc-2.11.1.so)
==305== by 0x3BCD282B05: bcopy (in /lib64/libc-2.11.1.so)
==305== by 0x58D7FEF: dmu_write (dmu.c:730)
==305== by 0x591C942: spa_history_write (spa_history.c:165)
==305== by 0x591D255: spa_history_log_sync (spa_history.c:277)
==305== by 0x591D545: log_internal (spa_history.c:450)
==305== by 0x591D5EC: spa_history_log_internal (spa_history.c:475)
==305== by 0x5902319: dsl_prop_set_sync (dsl_prop.c:707)
==305== by 0x5906A7D: dsl_sync_task_group_sync (dsl_synctask.c:199)
==305== by 0x58FF4EC: dsl_pool_sync (dsl_pool.c:376)
==305== by 0x591744C: spa_sync (spa.c:5365)
==305== by 0x5922C85: txg_sync_thread (txg.c:414)
This may not strictly be needed but it does keep gcc happy. We
should keep our eye on this, though, in case the extra bcopy
significantly impacts performance. It may.
The following cleanup was missed in the first pass when the ZVOL
implementation was updated. An extra instance of a zvol_state_t
was removed from the stack and the error handling was simplified.
There are cases where under Linux it is not safe to sleep in
taskq_dispatch(). Rather than adding Linux specific code to
detect these cases I opted to keep it simple and just never
allow a sleep here. The impact of this should be minimal.
I missed an instance of removing the & operator when reducing the
stack usage in this function. This unfortunately doesn't cause
a compile warning but it does cause ztest failures. Anyway,
update the topic branch to correct this mistake.
Certain functions must never be automatically inlined by gcc because
they are stack heavy or called recursively. This patch flags all
such functions I have found as 'noinline' to prevent gcc from making
the optimization.
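The annotation is the standard gcc attribute, which the Linux kernel already wraps as 'noinline'; roughly (a sketch; the helper is a placeholder, not one of the flagged ZFS functions):

    /*
     * Sketch: 'noinline' keeps gcc from merging a stack-heavy or
     * recursive helper into its caller, which would otherwise
     * combine their stack frames.
     */
    static noinline void
    stack_heavy_helper(void)
    {
            char scratch[512];	/* deliberately large frame */

            scratch[0] = '\0';
            (void) scratch;
    }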
Reduce stack usage in dsl_deleg_get, gcc flagged it as consuming a
whopping 1040 bytes or potentially 1/4 of a 4K stack. This patch
moves all the large structures and buffer off the stack and on to
the heap. This includes 2 zap_cursor_t structs each 52 bytes in
size, 2 zap_attribute_t structs each 280 bytes in size, and 1
256 byte char array. The total savings on the stack is 880 bytes
after you account for the 5 new pointers added.
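The conversion itself is mechanical, along these lines (a sketch; only one cursor/attribute pair is shown and the ZAP being walked is elided):

    /*
     * Sketch: zap_cursor_t and zap_attribute_t now come from the
     * heap instead of dsl_deleg_get()'s stack frame.
     */
    zap_cursor_t *basezc;
    zap_attribute_t *baseza;

    basezc = kmem_alloc(sizeof (zap_cursor_t), KM_SLEEP);
    baseza = kmem_alloc(sizeof (zap_attribute_t), KM_SLEEP);

    for (zap_cursor_init(basezc, mos, zapobj);
        zap_cursor_retrieve(basezc, baseza) == 0;
        zap_cursor_advance(basezc)) {
            /* ... walk the delegation zap as before ... */
    }
    zap_cursor_fini(basezc);

    kmem_free(baseza, sizeof (zap_attribute_t));
    kmem_free(basezc, sizeof (zap_cursor_t));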
Also the source buffer length has been increased from MAXNAMELEN
to MAXNAMELEN+strlen(MOS_DIR_NAME)+1 as described by the comment in
dsl_dir_name(). A buffer overrun may have been possible with the
slightly smaller buffer.
Reduce kernel stack usage of lzjb_compress() by moving its uint16 array
off the stack and on to the heap. The exact performance implications
of this I have not measured but we absolutely need to keep stack
usage to a minimum. If/when this becomes an issue we can optimize.
Move xuio stat structures from a header to the dmu.c source as is
done with all the other kstat interfaces. This information is local
to dmu.c, which registers the xuio kstat, and should stay that way.
If you're only going to allow one allocator to be used and it is defined
at compile time there is no point including the others in the build.
This patch could/should be refined for Linux to make the metaslab
allocator configurable at run time. That might be a bit tricky however,
since you would need to quiesce all IO. Short of that, making it
configurable as a module load option would be a reasonable compromise.
In the linux kernel 'current' is defined to mean the current process
and can never be used as a local variable in a function. Simply
replace all usage of 'current' with 'curr' in this function.
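For context, 'current' on Linux expands to the running task, so the old declaration does not even compile (a sketch; the variable's type here is a placeholder):

    #include <asm/current.h>	/* defines: #define current get_current() */

    /* Before: 'uint64_t current = 0;' fails to build, since the name
     * expands to get_current().  After: */
    uint64_t curr = 0;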
Updated fix to detect if we are in an interrupt and only sleep if it
is safe to do so. I guess it must be safe to sleep under Solaris;
this must be handled in a soft interrupt handler there.
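On the Linux side that amounts to something like this at the dispatch site (a sketch; tq, func, and arg stand in for the real call):

    #include <linux/hardirq.h>

    /*
     * Sketch: sleeping is illegal in interrupt context on Linux, so
     * only use TQ_SLEEP when we are in process context.
     */
    uint_t flags = in_interrupt() ? TQ_NOSLEEP : TQ_SLEEP;

    (void) taskq_dispatch(tq, func, arg, flags);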
The ZVOL interfaces changed significantly with the latest update. I've
updated the Linux version of the code to handle this and it looks like
the net result has been a simpler implementation which is good! Plus,
I'm relatively sure the ZIL integration is right this time although it
needs some serious crash testing to verify that.
Also minor additions to vdev_disk for .hold and .rele callbacks.
Currently, they do nothing and I may be able to simply stub them out
with NULLs for Linux since opening the device in Linux should have
much the same effect. More investigation is needed though, since
the ZFS interface may make some demands here I'm overlooking.