After spending considerable time thinking about this I've come to the
conclusion that on Linux systems we don't need Solaris style devid
support. Instead was can simply use udev if we are careful, there
are even some advantages.
The Solaris style devid's are designed to provide a mechanism by which
a device can be opened reliably regardless of it's location in the system.
This is exactly what udev provides us on Linux, a flexible mechanism for
consistently identifing the same devices regardless of probing order.
We just need to be careful to always open the device by the path provided
at creation time, this path must be stored in ZPOOL_CONFIG_PATH. This
in fact has certain advantages.
For example, if in your system you always want the zpool to be able to
locate the disk regardless of physical location you can create the pool
using /dev/disk/by-id/. This is perhaps what you'ld want on a desktop
system where the exact location is not that important. It's more
critical that all the disks can be found.
However, in an enterprise setup there's a good chace that the physical
location of each drive is important. You have like set things up such
that your raid groups span multiple hosts adapters, such that you can
lose an adapter without downtime. In this case you would want to use
the /dev/disk/by-path/ path to ensure the path information is preserved
and you always open the disks at the right physical locations. This
would ensure your system never gets accidently misconfigured and still
just works because the zpool was still able to locate the disk.
Finally, if you want to get really fancy you can always create your
own udev rules. This way you could implement whatever lookup sceme
you wanted in user space for your drives. This would include nice
cosmetic things like being able to control the device names in tools
like zpool status, since the name as just based of the device names.
I've yet to come up with a good reason to implement devid support on
Linux since we have udev. But I've still just commented it out for now
because somebody might come up with a really good I forgot.
The majority of this this patch concerns itself with doing a direct
replacement of Solaris's libdiskmgt library with libblkid+libefi.
You'll notice that this patch removes all libdiskmgt code instead of
ifdef'ing it out. This was done to minimize any confusion when reading
the code because it seems unlikely we will ever port libdiskmgt to Linux.
Despite the replacement the behavior of the tools should have remained
the same with one exception. For the moment, we are unable to check
the partitions of devices which have an MBR style partition table when
creating a filesystem. If a non-efi partition sceme is detected on a
whole disk device we prompt the user to explicity use the force option.
It would not be a ton of work to make the tool aware of MBR style
partitions if this becomes a problem.
I've done basic sanity checking for various configurations and all
the issues I'm aware of have been addressed. Even things like blkid
misidentifing a disk as ext3 when it is added to a zfs pool. I'm
careful to always zero out the first 4k of any new zfs partition. That
all said this is all new code and while it looks like it's working right
for me we should keep an eye on it for any strange behavior.
The major change here is to fix up libefi to be linux aware. For
the most part this wasn't too hard but there were a few major issues.
First off I needed to handle the DKIOCGMEDIAINFO and DKIOCINFO ioctls.
There is no direct equivilant for these ioctls under linux. To handle
this I added wrapper functions which under Solaris simple call the ioctls.
But under Linux dig around the system a little bit getting the needed
info to fill in the requested structures.
Secondly the efi_ioctl() call was adapted such that under linux it directly
read or writes out the partition table. Under Solaris this work was
handed off to the kernel via an ioctl. In the efi_write() case we also
ensure we prompt the kernel via BLKRRPART to re-scan the new partition
table. The libefi generated partition tables are correct but older
versions of ~parted-1.8.1 can not read them without a small patch.
The kernel and fdisk are able to read them just fine.
Thirdly efi_alloc_and_init() which is used by zpool to determine if a
device is a 'wholedisk' was updated to be linux aware. This check is
performed by using the partition number for the device, which the
partition number is 0 on linux it is a 'wholedisk'. However, certain
device type such as the loopback and ram disks needed to be excluded
because they do not support partitioning.
Forthly the zpool command was made symlink aware so it can correctly
resolve udev entries such as /dev/disk/by-*/*. This symlinks are
fully expanded ensuring all block devices are recognized. When a
when a 'wholedisk' block device is detected we now properly write
out an efi label and place zfs in the first partition (0th slice).
This partition is created 1MiB in to the disk to ensure it is aligned
nicely with all high end block devices I'm aware of.
This all works for me now but it did take quite a bit of work to get
it all sorted out. It would not surprise me if certain special cases
were missed so we should keep any eye of for any odd behavior.
This include updating all the Makefile.am to have the correct
include paths and libraries. In addition, the zlib m4 macro was
updated to more correctly integrate with the Makefiles. And I
added two new macros libblkid and libuuid which will be needed by
subsequent commits for blkid and uuid support respectively. The
blkid support is optional, the uuid support is mandatory for libefi.
Under FC11 rpm builds by default add the --fortify-source option which
ensures that functions flagged with certain attributes must have their
return codes checked. Normally this is just a warning but we always
build with -Werror so this is fatal. Simply wrap the function in a
verify call to ensure we catch a failure if there is one.
With this patch applied I get the following failure 100% of the time,
I'd prefer to debug it and keep moving forward but I do not have the
time right now so I'm reverting the patch to the version which worked.
Ricardo please fix.
(gdb) bt
0 ztest_dmu_write_parallel (za=0x2aaaac898960) at
../../cmd/ztest/ztest.c:2566
1 0x0000000000405a79 in ztest_thread (arg=<value optimized out>)
at ../../cmd/ztest/ztest.c:3862
2 0x00002b2e6a7a841d in zk_thread_helper (arg=<value optimized out>)
at ../../lib/libzpool/kernel.c:131
3 0x000000379be06367 in start_thread (arg=<value optimized out>)
at pthread_create.c:297
4 0x000000379b2d30ad in clone () from /lib64/libc.so.6
This resolves previous scalabily concerns about the cost of calling
curthread which previously required a list walk. The kthread address
is now tracked as thread specific data which can be quickly returned.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
A compat ioctl handler for zpios was added which simply passes the
ioctl on to the usual handler. The IOWR macro's correctly handle
this. Additionally replace the use of 'struct timespec' which uses
longs internally and is therefore different sizes on 32-bit vs 64-bit
objects with 'struct zpios_timespec_t'. This custom structure uses
uint32_t types internally and is safe to pass through an ioctl. The
helper functions for this new type were also moved to a common place
so they may be used safely by the user or kernel code.
The intent here is to fully remove the previous Solaris thread
implementation so we don't need to simulate both Solaris kernel
and user space thread APIs. The few user space consumers of the
thread API have been updated to use the kthread API. In order
to support this we needed to more fully support the kthread API
and that means not doing crazy things like casting a thread id
to a pointer and using that as was done before. This first
implementation is not effecient but it does provide all the
corrent semantics. If/when performance becomes and issue we
can and should just natively adopt pthreads which is portable.
Let me finish by saying I'm not proud of any of this and I would
love to see it improved. However, this slow implementation does
at least provide all the correct kthread API semantics whereas
the previous method of casting the thread ID to a pointer was
dodgy at best.
within an ASSERT with the ASSERTV macro which will ensure it will
be removed when the ASSERTs are commented out. This makes gcc much
happier, makes the variables usage explicit, and removes the need
for the compiler to detect it is unused and do the right thing.