It turns out the zil allocates quite large buffers. This isn't
all the surprising but we need to suppress the warnings until
it's clear what to do about it.
This topic branch leverages the Solaris style FMA call points
in ZFS to create a user space visible event notification system
under Linux. This new system is called zevent and it unifies
all previous Solaris style ereports and sysevent notifications.
Under this Linux specific scheme when a sysevent or ereport event
occurs an nvlist describing the event is created which looks almost
exactly like a Solaris ereport. These events are queued up in the
kernel when they occur and conditionally logged to the console.
It is then up to a user space application to consume the events
and do whatever it likes with them.
To make this possible the existing /dev/zfs ABI has been extended
with two new ioctls which behave as follows.
* ZFS_IOC_EVENTS_NEXT
Get the next pending event. The kernel will keep track of the last
event consumed by the file descriptor and provide the next one if
available. If no new events are available the ioctl() will block
waiting for the next event. This ioctl may also be called in a
non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the
non-blocking case if no events are available ENOENT will be returned.
It is possible that ESHUTDOWN will be returned if the ioctl() is
called while module unloading is in progress. And finally ENOMEM
may occur if the provided nvlist buffer is not large enough to
contain the entire event.
* ZFS_IOC_EVENTS_CLEAR
Clear are events queued by the kernel. The kernel will keep a fairly
large number of recent events queued, use this ioctl to clear the
in kernel list. This will effect all user space processes consuming
events.
The zpool command has been extended to use this events ABI with the
'events' subcommand. You may run 'zpool events -v' to output a
verbose log of all recent events. This is very similar to the
Solaris 'fmdump -ev' command with the key difference being it also
includes what would be considered sysevents under Solaris. You
may also run in follow mode with the '-f' option. To clear the
in kernel event queue use the '-c' option.
$ sudo cmd/zpool/zpool events -fv
TIME CLASS
May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync
class = "ereport.fs.zfs.config.sync"
ena = 0x40982b7897700001
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0xed976600de75dfa6
(end detector)
time = 0x4bec8bc3 0x2e5aed98
pool = "zpios"
pool_guid = 0xed976600de75dfa6
pool_context = 0x0
While the 'zpool events' command is handy for interactive debugging
it is not expected to be the primary consumer of zevents. This ABI
was primarily added to facilitate the addition of a user space
monitoring daemon. This daemon would consume all events posted by
the kernel and based on the type of event perform an action. For
most events simply forwarding them on to syslog is likely enough.
But this interface also cleanly allows for more sophisticated
actions to be taken such as generating an email for a failed drive
Specifically, the following bugs are fixed in this patch:
1) zdb wasn't getting the correct device size when the vdev is a
block device. In Solaris, fstat64() returns the device size but
in Linux an ioctl() is needed.
2) We were opening block devices with O_DIRECT, which caused pread64()
to fail with EINVAL due to memory alignment issues. This was fixed by
the previous umem cache alignment fix in stub implementation to align
objects correctly. But we still needed to add a check for the error here.
3) We also make sure that we don't try to open a block device in write
mode in userspace. This shouldn't happen, because zdb opens devices
in read-only mode, and ztest only uses files.
In vn_open(), if fstat64() returned an error, the real errno
was being obscured by calling close().
Add error handling for both pwrite64() calls in vn_rdwr().
With this patch applied I get the following failure 100% of the time,
I'd prefer to debug it and keep moving forward but I do not have the
time right now so I'm reverting the patch to the version which worked.
Ricardo please fix.
(gdb) bt
0 ztest_dmu_write_parallel (za=0x2aaaac898960) at
../../cmd/ztest/ztest.c:2566
1 0x0000000000405a79 in ztest_thread (arg=<value optimized out>)
at ../../cmd/ztest/ztest.c:3862
2 0x00002b2e6a7a841d in zk_thread_helper (arg=<value optimized out>)
at ../../lib/libzpool/kernel.c:131
3 0x000000379be06367 in start_thread (arg=<value optimized out>)
at pthread_create.c:297
4 0x000000379b2d30ad in clone () from /lib64/libc.so.6
This resolves previous scalabily concerns about the cost of calling
curthread which previously required a list walk. The kthread address
is now tracked as thread specific data which can be quickly returned.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The intent here is to fully remove the previous Solaris thread
implementation so we don't need to simulate both Solaris kernel
and user space thread APIs. The few user space consumers of the
thread API have been updated to use the kthread API. In order
to support this we needed to more fully support the kthread API
and that means not doing crazy things like casting a thread id
to a pointer and using that as was done before. This first
implementation is not effecient but it does provide all the
corrent semantics. If/when performance becomes and issue we
can and should just natively adopt pthreads which is portable.
Let me finish by saying I'm not proud of any of this and I would
love to see it improved. However, this slow implementation does
at least provide all the correct kthread API semantics whereas
the previous method of casting the thread ID to a pointer was
dodgy at best.
code to only use the kthread API regardless of if it is compiled in the
kernel or user space. The kthread API will be layered on top of pthreads
as best as possible in zfs_context, this is non optimal but much clearer.
These changes bring the zfs-0.4.4 tree in to compliance with
the spl-0.4.4 packaging changes. The bottom line is 2 source
rpms and 4 binary rpms will now be generated when creating
packages there will be:
zfs-<version>.src.rpm
- Fully rebuildable source rpm for libzfs and utils.
zfs-modules-<version>.src.rpm
- Fully rebuildable source rpm for kernel modules.
zfs-<version>.<arch>.rpm
- Binary rpm for libzfs and utils. The utils in this package are
compatible with all zfs-module rpms of the same version.
zfs-devel-<version>.<arch>.rpm
- Binary rpm containing headers for building against libzfs libraries.
zfs-modules-<verion>-<kernel>.arch.rpm
- Binary rpm containing the kernel modules for a specific kernel build.
The package name contains the kernel version and you should have one
of these packages installed to match every kernel on your system.
zfs-modules-devel-<verion>-<kernel>.arch.rpm
- Binary rpm containing development header and module symbols needed
for building additional kernel modules which are dependent on the
zfs module stack.
Expect minor interations on these changes as I validate they work
properly on CHAOS, RHEL, Fedora, and SLES style distros.