While I completely agree that udev is the lesser of many possible
evils when solving the device issue... it is still evil. After
attempting to craft a single rule which will work for various
versions of udev in various distros, I've come to the conclusion
that the only maintainable way to solve this issue is to split the
rule from any particular configuration.
This commit provides a generic 60-zpool.rules file which uses a
small helper utility 'zpool_id' to parse a configuration file, by
default located in /etc/zfs/zdev.conf. The helper script maps
a by-path udev name to a more friendly name of <channel><rank>
for large configurations.
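
The by-path to <channel><rank> mapping could be sketched in C roughly
as follows. This is only an illustration: the function name
zdev_lookup() is hypothetical, and the line format it parses
("<channel> <rank> <by-path name>", with '#' comment lines) is an
assumption, not the zdev.conf format shipped by this commit.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find 'by_path' in the config text 'conf' and
 * write "<channel><rank>" to 'out'.
 *
 * ASSUMED line format (not taken from the commit):
 *     <channel> <rank> <by-path name>
 * Lines beginning with '#' and lines that do not parse are skipped.
 * Returns 0 on a match, -1 if nothing matched or 'out' is too small.
 */
int
zdev_lookup(const char *conf, const char *by_path, char *out, size_t len)
{
	const char *line = conf;

	while (*line != '\0') {
		const char *end = strchr(line, '\n');
		size_t n = (end != NULL) ? (size_t)(end - line) : strlen(line);
		char buf[256], channel[32], path[192];
		unsigned rank;

		if (n < sizeof (buf)) {
			memcpy(buf, line, n);
			buf[n] = '\0';
			if (buf[0] != '#' &&
			    sscanf(buf, "%31s %u %191s",
			    channel, &rank, path) == 3 &&
			    strcmp(path, by_path) == 0) {
				int w = snprintf(out, len, "%s%u",
				    channel, rank);
				return ((w > 0 && (size_t)w < len) ? 0 : -1);
			}
		}
		/* Advance past the newline, or stop at the end of text. */
		line = (end != NULL) ? end + 1 : line + n;
	}
	return (-1);
}
```

Taking the config as a string keeps the sketch independent of where
the file lives; a caller would read /etc/zfs/zdev.conf into memory
first.
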
As part of this change all of the support scripts which rely on
this udev naming convention have been updated as needed. Example
zdev.conf files have also been added for 3 different systems, but
you will always need to add one for your exact hardware.
Finally, included in these changes are the proper tweaks to the
build system to ensure everything still gets packaged properly
in the rpms and can run in or out of tree.
At last a useful user space interface for the Linux ZFS port arrives.
With the addition of the ZVOL, real ZFS-based block devices are available
and can be compared head to head with Linux's MD and LVM block drivers.
The Linux ZVOL has not yet had any performance work done but from a user
perspective it should be functionally complete and behave like any other
Linux block device.
The ZVOL has so far been tested using zconfig.sh on the following x86_64
based platforms: FC11, CHAOS4, RHEL5, RHEL6, and SLES11. However, more
testing is required to ensure everything is working as designed.
What follows is a somewhat detailed list of the changes included in this
commit to make ZVOLs possible. A few other issues were addressed in
the context of these changes which will also be mentioned.
* zvol_create_link_common() simplified to issue the ioctl to
create the device and then wait up to 10 seconds for it to appear.
The device will be created within a few milliseconds by udev under
/dev/<pool>/<volume>. Note this naming convention is slightly
different than on Solaris but I feel it is more Linuxy.
* Removed support for dump vdevs. This concept is specific to Solaris
and does not map cleanly to Linux. Under Linux generating system cores
is preferably done over the network via netdump, or alternately to a
block device via O_DIRECT.
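
The "wait up to 10 seconds for it to appear" step from the first item
above can be sketched as a simple poll on the device path. This is a
minimal illustration, not the code in zvol_create_link_common(); the
name wait_for_path() and the 10 ms poll interval are assumptions.

```c
#ifndef _POSIX_C_SOURCE
#define	_POSIX_C_SOURCE 200809L	/* nanosleep() */
#endif
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: poll until 'path' exists or roughly 'timeout_ms'
 * milliseconds have passed. Returns 0 once stat() succeeds. Returns -1
 * with errno set to ETIMEDOUT on timeout, or to the stat() error for
 * any failure other than ENOENT.
 */
int
wait_for_path(const char *path, int timeout_ms)
{
	const struct timespec delay = { 0, 10 * 1000 * 1000 }; /* 10 ms */
	struct stat st;
	int waited_ms = 0;

	for (;;) {
		if (stat(path, &st) == 0)
			return (0);
		if (errno != ENOENT)
			return (-1);
		if (waited_ms >= timeout_ms) {
			errno = ETIMEDOUT;
			return (-1);
		}
		(void) nanosleep(&delay, NULL);
		waited_ms += 10;
	}
}
```

A caller would pass the expected /dev/<pool>/<volume> path and a
timeout of 10000 after issuing the create ioctl.
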
Based on the block device type we can expect a specific naming
convention. With this in mind update efi_get_info() to be more
aware of the type when parsing out the partition number. In
addition, be aware that not all block device types are partitionable.
Finally, when attempting to look up a device partition by appending
the partition number to the whole device, take into account the
kernel naming scheme. If the last character of the device name
is a digit the partition will always be 'p#' instead of just '#'.
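
As a rough sketch of the naming rule just described (the function name
partition_path() is hypothetical and this is not the libefi code
itself):

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the partition device name for a whole
 * device. If the device name ends in a digit the kernel inserts a 'p'
 * before the partition number, otherwise the number is appended
 * directly. Returns 0 on success, -1 if 'dev' is empty or 'out' is
 * too small.
 */
int
partition_path(const char *dev, unsigned part, char *out, size_t len)
{
	size_t n = strlen(dev);
	const char *sep;
	int w;

	if (n == 0)
		return (-1);

	sep = isdigit((unsigned char)dev[n - 1]) ? "p" : "";
	w = snprintf(out, len, "%s%s%u", dev, sep, part);

	return ((w > 0 && (size_t)w < len) ? 0 : -1);
}
```

For example, partition 1 of /dev/sda becomes /dev/sda1, while
partition 1 of /dev/md0 becomes /dev/md0p1.
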
In check_disk() we should only check the entire device if it is
not a whole disk. If it is a whole disk with an EFI label on it,
it is possible that libblkid will misidentify the device as a
filesystem. I had a case yesterday where 2 bytes in the EFI
GUID happened to be set to the right values such that libblkid
decided there was a minix filesystem there. If it's a whole
device we look for an EFI label.
If we are able to read the backup EFI label from a device but
the primary is corrupt, then don't bother trying to stat
the partitions in /dev/. The kernel will not create devices
using the backup label when the primary is damaged.
Add code to determine if we have a udev path instead of a
normal device path. In this case use the -part# partition
naming scheme instead of the /dev/disk# scheme. This is
important because we always want to access devices using
the full path provided at configuration time.
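
A minimal sketch of that check follows. It assumes a udev path can be
recognized by a "/dev/disk/" prefix; that test, and the name
append_partition(), are illustrative assumptions rather than the
code added by this commit. For non-udev paths the sketch simply
appends the number and ignores the 'p#' case.

```c
#include <stdio.h>
#include <string.h>

/* ASSUMPTION: udev-provided names live under this directory. */
#define	UDEV_DISK_ROOT	"/dev/disk/"

/*
 * Hypothetical helper: append a partition number to the path given at
 * configuration time. udev paths get "-part#", everything else gets a
 * bare "#". Returns 0 on success, -1 if 'out' is too small.
 */
int
append_partition(const char *path, unsigned part, char *out, size_t len)
{
	int is_udev =
	    (strncmp(path, UDEV_DISK_ROOT, strlen(UDEV_DISK_ROOT)) == 0);
	int w = snprintf(out, len, is_udev ? "%s-part%u" : "%s%u",
	    path, part);

	return ((w > 0 && (size_t)w < len) ? 0 : -1);
}
```

Because the original path is extended rather than resolved to its
/dev/sdX target, the device keeps being accessed by the name the
administrator configured.
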
Re-added support for zpool_relabel_disk(). Now that we have
the full libefi library in place we do have access to this
functionality.
Lots of additional paranoia to ensure EFI labels are written
correctly. These changes include:
1) Removing the O_NDELAY flag when opening a file descriptor
for libefi. This flag should really only be used when you
do not intend to do any file IO. Under Solaris only ioctl()s
were performed; under Linux we do perform reads and writes.
2) Use O_DIRECT to ensure any caching is bypassed while
writing or reading the EFI labels. This change forces the
use of sector aligned memory buffers which are allocated
using posix_memalign().
3) Add additional efi_debug error messages to efi_ioctl().
4) While doing an fsync is good to ensure the EFI label is on
disk we can, and should, go one step further by issuing the
BLKFLSBUF ioctl(). This signals the kernel to instruct the
drive to flush its on-disk cache.
5) Because of some initial strangeness I observed in testing
with some flaky drives, be extra paranoid in zpool_label_disk().
After we've written the device without error, flushed the drive
caches, and correctly detected the new partitions created by the
kernel, additionally read back the EFI label from user
space to make sure it is intact and correct. I don't think we
can ever be too careful here.
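
Items 2), 4) and 5) above can be sketched together as an aligned
buffer allocation followed by a write, flush and read-back. This is a
minimal illustration under assumptions, not the libefi or
zpool_label_disk() code: both function names are hypothetical, the
caller is expected to have opened the device with O_RDWR | O_DIRECT,
and BLKFLSBUF is only issued when the descriptor is a block device.

```c
#ifndef _GNU_SOURCE
#define	_GNU_SOURCE	/* posix_memalign(), pread(), pwrite() */
#endif
#include <linux/fs.h>	/* BLKFLSBUF */
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Allocate a zeroed, sector aligned buffer as O_DIRECT IO requires. */
void *
alloc_sector_buf(size_t sector_size, size_t nsectors)
{
	void *buf = NULL;

	if (posix_memalign(&buf, sector_size, sector_size * nsectors) != 0)
		return (NULL);
	memset(buf, 0, sector_size * nsectors);
	return (buf);
}

/*
 * Write 'nsectors' sectors from 'buf' at offset 'off', flush them, then
 * read them back and compare. Returns 0 only if every step succeeds
 * and the data read back matches what was written.
 */
int
write_flush_verify(int fd, const void *buf, size_t sector_size,
    size_t nsectors, off_t off)
{
	size_t len = sector_size * nsectors;
	struct stat st;
	void *check;
	int rc = -1;

	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return (-1);
	if (fsync(fd) != 0 || fstat(fd, &st) != 0)
		return (-1);
	/* BLKFLSBUF only applies to block devices. */
	if (S_ISBLK(st.st_mode) && ioctl(fd, BLKFLSBUF, 0) != 0)
		return (-1);

	check = alloc_sector_buf(sector_size, nsectors);
	if (check == NULL)
		return (-1);
	if (pread(fd, check, len, off) == (ssize_t)len &&
	    memcmp(buf, check, len) == 0)
		rc = 0;
	free(check);
	return (rc);
}
```

The read-back uses its own aligned buffer so the same routine works
unchanged on an O_DIRECT descriptor.
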
NOTE: There was recently some concern expressed that writing EFI
labels from user space on Linux was not the right way to do this,
and that instead two kernel ioctl()s should be used to create and
remove partitions. After some investigation it's clear to me that
using those ioctl()s would be a bad idea. They in fact don't
actually write partition tables to the disk, they only create
the partition devices in the kernel. So what you really want
to do is write the label out from user space, then prompt the
kernel to re-read the partition table from disk to create the partitions.
This is in fact exactly what newer versions of parted do.
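
On Linux, one way to prompt that re-read is the BLKRRPART ioctl(). The
sketch below is only an illustration of the idea; the function name
reread_partition_table() is hypothetical and the commit's actual code
path may differ.

```c
#include <errno.h>
#include <linux/fs.h>	/* BLKRRPART */
#include <sys/ioctl.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: after the label has been written from user
 * space, ask the kernel to re-read the partition table of the block
 * device open on 'fd'. Returns 0 on success and -1 with errno set on
 * failure; errno is ENOTBLK if 'fd' is not a block device.
 */
int
reread_partition_table(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return (-1);
	if (!S_ISBLK(st.st_mode)) {
		errno = ENOTBLK;
		return (-1);
	}
	return (ioctl(fd, BLKRRPART, 0));
}
```

The ioctl() can fail with EBUSY while a partition on the device is in
use, so a caller may need to handle that case.
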