Misc fixed based on testing with the dragon config.

In check_disk() we should only check the entire device if it
not a whole disk.  It is a whole disk with an EFI label on it,
it is possible that libblkid will misidentify the device as a
filesystem.  I had a case yesterday where 2 bytes in the EFI
GUID happened we set to the right values such that libblkid
decided there was a minux filesystem there.  If it's a whole
device we look for a EFI label.

If we are able to read the backup EFI label from a device but
the primary is corrupt.  Then don't bother trying to stat
the partitions in /dev/ the kernel will not create devices
using the backup label when the primary is damaged.

Add code to determine if we have a udev path instead of a
normal device path.  In this case use the -part# partition
naming scheme instead of the /dev/disk# scheme.  This is
important because we always want to access devices using
the full path provided at configuration time.

Readded support for zpool_relabel_disk() now that we have
the full libefi library in place we do have access to this
functionality.

Lots of additional paranoia to ensure EFI label are written
correctly.  These changes include:

1) Removing the O_NDELAY flag when opening a file descriptor
for libefi.  This flag should really only be used when you
do not intend to do any file IO.  Under Solaris only ioctl()'s
were performed under linux we do perform reads and writes.

2) Use O_DIRECT to ensure any caching is bypassed while
writing or reading the EFI labels.  This change forces the
use of sector aligned memory buffers which are allocated
using posix_memalign().

3) Add additional efi_debug error messages to efi_ioctl().

4) While doing a fsync is good to ensure the EFI label is on
disk we can, and should go one step futher by issuing the
BLKFLSBUF ioctl().  This signals the kernel to instruct the
drive to flush it's on-disk cache.

5) Because of some initial strangeness I observed in testing
with some flakey drives be extra paranoid in zpool_label_disk().
After we've written the device without error, flushed the drive
caches, correctly detected the new partitions created by the
kernel.  Then additionally read back the EFI label from user
space to make sure it is intact and correct.  I don't think we
can ever be to careful here.

NOTE: The was recently some concern expressed that writing EFI
labels from user space on Linux was not the right way to do this.
That instead two kernel ioctl()s should be used to create and
remove partitions.  After some investigation it's clear to me
using those ioctl() would be a bad idea.  The in fact don't
actually write partition tables to the disk, they only create
the partition devices in the kernel.  So what you really want
to do is write the label out from user space, then prompt the
kernel to re-read the partition from disk to create the partitions.
This is in fact exactly what newer version of parted do.
This commit is contained in:
Brian Behlendorf 2009-10-23 11:57:59 -07:00
parent 6367f93602
commit 24f3d6e49e
3 changed files with 126 additions and 87 deletions

View File

@ -237,14 +237,9 @@ check_disk(const char *path, blkid_cache cache, int force,
int err = 0; int err = 0;
int fd, i; int fd, i;
/* Check the device as given */ /* This is not a wholedisk we only check the given partition */
err = check_slice(path, cache, force, isspare);
if (err)
return (err);
/* Additional checking is only required for whole disks */
if (!iswholedisk) if (!iswholedisk)
return 0; return check_slice(path, cache, force, isspare);
/* /*
* When the device is a whole disk try to read the efi partition * When the device is a whole disk try to read the efi partition
@ -252,9 +247,9 @@ check_disk(const char *path, blkid_cache cache, int force,
* partitions. However, when it fails it may simply be because * partitions. However, when it fails it may simply be because
* the disk is partitioned via the MBR. Since we currently can * the disk is partitioned via the MBR. Since we currently can
* not easily decode the MBR return a failure and prompt to the * not easily decode the MBR return a failure and prompt to the
* user to use --force since we cannot check the partitions. * user to use force option since we cannot check the partitions.
*/ */
if ((fd = open(path, O_RDONLY|O_NDELAY)) < 0) { if ((fd = open(path, O_RDWR|O_DIRECT)) < 0) {
check_error(errno); check_error(errno);
return -1; return -1;
} }
@ -265,11 +260,9 @@ check_disk(const char *path, blkid_cache cache, int force,
if (force) { if (force) {
return 0; return 0;
} else { } else {
vdev_error(gettext( vdev_error(gettext("%s does not contain an EFI "
"%s may contain a non-efi partition table " "label but it may contain partition\n"
"describing existing\nfilesystems. If you are " "information in the MBR.\n"), path);
"sure you want to use this device use the\n"
"force command line option.\n"), path);
return -1; return -1;
} }
} }
@ -279,14 +272,18 @@ check_disk(const char *path, blkid_cache cache, int force,
* label at the end of the device is intact. Rather than use this * label at the end of the device is intact. Rather than use this
* label we should play it safe and treat this as a non efi device. * label we should play it safe and treat this as a non efi device.
*/ */
if (!force && vtoc->efi_flags & EFI_GPT_PRIMARY_CORRUPT) { if (vtoc->efi_flags & EFI_GPT_PRIMARY_CORRUPT) {
vdev_error(gettext(
"%s contains a corrupt primary efi partition table. "
"If you are\nsure you want to use this device use "
"the force command line option.\n"), path);
efi_free(vtoc); efi_free(vtoc);
(void) close(fd); (void) close(fd);
return -1;
if (force) {
/* Partitions will no be created using the backup */
return 0;
} else {
vdev_error(gettext("%s contains a corrupt primary "
"EFI label.\n"), path);
return -1;
}
} }
for (i = 0; i < vtoc->efi_nparts; i++) { for (i = 0; i < vtoc->efi_nparts; i++) {
@ -295,15 +292,13 @@ check_disk(const char *path, blkid_cache cache, int force,
uuid_is_null((uchar_t *)&vtoc->efi_parts[i].p_guid)) uuid_is_null((uchar_t *)&vtoc->efi_parts[i].p_guid))
continue; continue;
/* Resolve possible symlink to safely append partition */ if (strncmp(path, UDISK_ROOT, strlen(UDISK_ROOT)) == 0)
if (realpath(path, slice_path) == NULL) { (void) snprintf(slice_path, sizeof (slice_path),
(void) fprintf(stderr, "%s%s%d", path, "-part", i+1);
gettext("cannot resolve path '%s'\n"), slice_path); else
err = errno; (void) snprintf(slice_path, sizeof (slice_path),
break; "%s%d", path, i+1);
}
sprintf(slice_path, "%s%d", slice_path, i+1);
err = check_slice(slice_path, cache, force, isspare); err = check_slice(slice_path, cache, force, isspare);
if (err) if (err)
break; break;
@ -369,7 +364,7 @@ is_whole_disk(const char *arg)
(void) snprintf(path, sizeof (path), "%s%s%s", (void) snprintf(path, sizeof (path), "%s%s%s",
RDISK_ROOT, strrchr(arg, '/'), BACKUP_SLICE); RDISK_ROOT, strrchr(arg, '/'), BACKUP_SLICE);
if ((fd = open(path, O_RDWR | O_NDELAY)) < 0) if ((fd = open(path, O_RDWR|O_DIRECT)) < 0)
return (B_FALSE); return (B_FALSE);
if (efi_alloc_and_init(fd, EFI_NUMPAR, &label) != 0) { if (efi_alloc_and_init(fd, EFI_NUMPAR, &label) != 0) {
(void) close(fd); (void) close(fd);

View File

@ -111,7 +111,7 @@ int efi_debug = 1;
int efi_debug = 0; int efi_debug = 0;
#endif #endif
static int efi_read(int, struct dk_gpt *); static int efi_read(int, struct dk_gpt *);
/* /*
* Return a 32-bit CRC of the contents of the buffer. Pre-and-post * Return a 32-bit CRC of the contents of the buffer. Pre-and-post
@ -394,12 +394,20 @@ efi_ioctl(int fd, int cmd, dk_efi_t *dk_ioc)
} }
error = lseek(fd, dk_ioc->dki_lba * lbsize, SEEK_SET); error = lseek(fd, dk_ioc->dki_lba * lbsize, SEEK_SET);
if (error == -1) if (error == -1) {
if (efi_debug)
(void) fprintf(stderr, "DKIOCGETEFI lseek "
"error: %d\n", errno);
return error; return error;
}
error = read(fd, data, dk_ioc->dki_length); error = read(fd, data, dk_ioc->dki_length);
if (error == -1) if (error == -1) {
if (efi_debug)
(void) fprintf(stderr, "DKIOCGETEFI read "
"error: %d\n", errno);
return error; return error;
}
if (error != dk_ioc->dki_length) { if (error != dk_ioc->dki_length) {
if (efi_debug) if (efi_debug)
@ -421,12 +429,20 @@ efi_ioctl(int fd, int cmd, dk_efi_t *dk_ioc)
} }
error = lseek(fd, dk_ioc->dki_lba * lbsize, SEEK_SET); error = lseek(fd, dk_ioc->dki_lba * lbsize, SEEK_SET);
if (error == -1) if (error == -1) {
if (efi_debug)
(void) fprintf(stderr, "DKIOCSETEFI lseek "
"error: %d\n", errno);
return error; return error;
}
error = write(fd, data, dk_ioc->dki_length); error = write(fd, data, dk_ioc->dki_length);
if (error == -1) if (error == -1) {
if (efi_debug)
(void) fprintf(stderr, "DKIOCSETEFI write "
"error: %d\n", errno);
return error; return error;
}
if (error != dk_ioc->dki_length) { if (error != dk_ioc->dki_length) {
if (efi_debug) if (efi_debug)
@ -436,10 +452,15 @@ efi_ioctl(int fd, int cmd, dk_efi_t *dk_ioc)
return -1; return -1;
} }
error = fdatasync(fd); /* Sync the new EFI table to disk */
error = fsync(fd);
if (error == -1) if (error == -1)
return error; return error;
/* Ensure any local disk cache is also flushed */
if (ioctl(fd, BLKFLSBUF, 0) == -1)
return error;
error = 0; error = 0;
break; break;
@ -597,9 +618,11 @@ efi_read(int fd, struct dk_gpt *vtoc)
} }
} }
if ((dk_ioc.dki_data = calloc(label_len, 1)) == NULL) if (posix_memalign((void **)&dk_ioc.dki_data,
disk_info.dki_lbsize, label_len))
return (VT_ERROR); return (VT_ERROR);
memset(dk_ioc.dki_data, 0, label_len);
dk_ioc.dki_length = disk_info.dki_lbsize; dk_ioc.dki_length = disk_info.dki_lbsize;
user_length = vtoc->efi_nparts; user_length = vtoc->efi_nparts;
efi = dk_ioc.dki_data; efi = dk_ioc.dki_data;
@ -795,12 +818,14 @@ write_pmbr(int fd, struct dk_gpt *vtoc)
int len; int len;
len = (vtoc->efi_lbasize == 0) ? sizeof (mb) : vtoc->efi_lbasize; len = (vtoc->efi_lbasize == 0) ? sizeof (mb) : vtoc->efi_lbasize;
buf = calloc(len, 1); if (posix_memalign((void **)&buf, len, len))
return (VT_ERROR);
/* /*
* Preserve any boot code and disk signature if the first block is * Preserve any boot code and disk signature if the first block is
* already an MBR. * already an MBR.
*/ */
memset(buf, 0, len);
dk_ioc.dki_lba = 0; dk_ioc.dki_lba = 0;
dk_ioc.dki_length = len; dk_ioc.dki_length = len;
/* LINTED -- always longlong aligned */ /* LINTED -- always longlong aligned */
@ -886,10 +911,9 @@ check_input(struct dk_gpt *vtoc)
if ((vtoc->efi_parts[i].p_tag == V_UNASSIGNED) && if ((vtoc->efi_parts[i].p_tag == V_UNASSIGNED) &&
(vtoc->efi_parts[i].p_size != 0)) { (vtoc->efi_parts[i].p_size != 0)) {
if (efi_debug) { if (efi_debug) {
(void) fprintf(stderr, (void) fprintf(stderr, "partition %d is "
"partition %d is \"unassigned\" but has a size of %llu", "\"unassigned\" but has a size of %llu",
i, i, vtoc->efi_parts[i].p_size);
vtoc->efi_parts[i].p_size);
} }
return (VT_EINVAL); return (VT_EINVAL);
} }
@ -902,9 +926,9 @@ check_input(struct dk_gpt *vtoc)
if (vtoc->efi_parts[i].p_tag == V_RESERVED) { if (vtoc->efi_parts[i].p_tag == V_RESERVED) {
if (resv_part != -1) { if (resv_part != -1) {
if (efi_debug) { if (efi_debug) {
(void) fprintf(stderr, (void) fprintf(stderr, "found "
"found duplicate reserved partition at %d\n", "duplicate reserved partition "
i); "at %d\n", i);
} }
return (VT_EINVAL); return (VT_EINVAL);
} }
@ -955,8 +979,8 @@ check_input(struct dk_gpt *vtoc)
(istart <= endsect)) { (istart <= endsect)) {
if (efi_debug) { if (efi_debug) {
(void) fprintf(stderr, (void) fprintf(stderr,
"Partition %d overlaps partition %d.", "Partition %d overlaps "
i, j); "partition %d.", i, j);
} }
return (VT_EINVAL); return (VT_EINVAL);
} }
@ -1106,9 +1130,11 @@ efi_write(int fd, struct dk_gpt *vtoc)
* for backup GPT header. * for backup GPT header.
*/ */
lba_backup_gpt_hdr = vtoc->efi_last_u_lba + 1 + nblocks; lba_backup_gpt_hdr = vtoc->efi_last_u_lba + 1 + nblocks;
if ((dk_ioc.dki_data = calloc(dk_ioc.dki_length, 1)) == NULL) if (posix_memalign((void **)&dk_ioc.dki_data,
vtoc->efi_lbasize, dk_ioc.dki_length))
return (VT_ERROR); return (VT_ERROR);
memset(dk_ioc.dki_data, 0, dk_ioc.dki_length);
efi = dk_ioc.dki_data; efi = dk_ioc.dki_data;
/* stuff user's input into EFI struct */ /* stuff user's input into EFI struct */

View File

@ -1759,23 +1759,14 @@ is_guid_type(zpool_handle_t *zhp, uint64_t guid, const char *type)
* the disk to use the new unallocated space. * the disk to use the new unallocated space.
*/ */
static int static int
zpool_relabel_disk(libzfs_handle_t *hdl, const char *name) zpool_relabel_disk(libzfs_handle_t *hdl, const char *path)
{ {
#if 0
char path[MAXPATHLEN];
char errbuf[1024]; char errbuf[1024];
int fd, error; int fd, error;
int (*_efi_use_whole_disk)(int);
if ((_efi_use_whole_disk = (int (*)(int))dlsym(RTLD_DEFAULT, if ((fd = open(path, O_RDWR|O_DIRECT)) < 0) {
"efi_use_whole_disk")) == NULL)
return (-1);
(void) snprintf(path, sizeof (path), "%s/%s", RDISK_ROOT, name);
if ((fd = open(path, O_RDWR | O_NDELAY)) < 0) {
zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot " zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot "
"relabel '%s': unable to open device"), name); "relabel '%s': unable to open device"), path);
return (zfs_error(hdl, EZFS_OPENFAILED, errbuf)); return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
} }
@ -1784,20 +1775,14 @@ zpool_relabel_disk(libzfs_handle_t *hdl, const char *name)
* does not have any unallocated space left. If so, we simply * does not have any unallocated space left. If so, we simply
* ignore that error and continue on. * ignore that error and continue on.
*/ */
error = _efi_use_whole_disk(fd); error = efi_use_whole_disk(fd);
(void) close(fd); (void) close(fd);
if (error && error != VT_ENOSPC) { if (error && error != VT_ENOSPC) {
zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot " zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot "
"relabel '%s': unable to read disk capacity"), name); "relabel '%s': unable to read disk capacity"), path);
return (zfs_error(hdl, EZFS_NOCAP, errbuf)); return (zfs_error(hdl, EZFS_NOCAP, errbuf));
} }
return (0); return (0);
#else
char errbuf[1024];
zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot "
"relabel '%s/%s': libefi is unsupported"), DISK_ROOT, name);
return (zfs_error(hdl, EZFS_NOTSUP, errbuf));
#endif
} }
/* /*
@ -1853,7 +1838,6 @@ zpool_vdev_online(zpool_handle_t *zhp, const char *path, int flags,
} }
if (wholedisk) { if (wholedisk) {
pathname += strlen(DISK_ROOT) + 1;
(void) zpool_relabel_disk(zhp->zpool_hdl, pathname); (void) zpool_relabel_disk(zhp->zpool_hdl, pathname);
} }
} }
@ -3070,7 +3054,7 @@ read_efi_label(nvlist_t *config, diskaddr_t *sb)
(void) snprintf(diskname, sizeof (diskname), "%s%s", RDISK_ROOT, (void) snprintf(diskname, sizeof (diskname), "%s%s", RDISK_ROOT,
strrchr(path, '/')); strrchr(path, '/'));
if ((fd = open(diskname, O_RDONLY|O_NDELAY)) >= 0) { if ((fd = open(diskname, O_RDWR|O_DIRECT)) >= 0) {
struct dk_gpt *vtoc; struct dk_gpt *vtoc;
if ((err = efi_alloc_and_read(fd, &vtoc)) >= 0) { if ((err = efi_alloc_and_read(fd, &vtoc)) >= 0) {
@ -3119,7 +3103,6 @@ find_start_block(nvlist_t *config)
int int
zpool_label_disk_wait(char *path, int timeout) zpool_label_disk_wait(char *path, int timeout)
{ {
#if defined(__linux__)
struct stat64 statbuf; struct stat64 statbuf;
int i; int i;
@ -3138,9 +3121,31 @@ zpool_label_disk_wait(char *path, int timeout)
} }
return (ENOENT); return (ENOENT);
#else }
return (0);
#endif int
zpool_label_disk_check(char *path)
{
struct dk_gpt *vtoc;
int fd, err;
if ((fd = open(path, O_RDWR|O_DIRECT)) < 0)
return errno;
if ((err = efi_alloc_and_read(fd, &vtoc)) != 0) {
(void) close(fd);
return err;
}
if (vtoc->efi_flags & EFI_GPT_PRIMARY_CORRUPT) {
efi_free(vtoc);
(void) close(fd);
return EIDRM;
}
efi_free(vtoc);
(void) close(fd);
return 0;
} }
/* /*
@ -3152,7 +3157,7 @@ zpool_label_disk(libzfs_handle_t *hdl, zpool_handle_t *zhp, char *name)
{ {
char path[MAXPATHLEN]; char path[MAXPATHLEN];
struct dk_gpt *vtoc; struct dk_gpt *vtoc;
int fd; int rval, fd;
size_t resv = EFI_MIN_RESV_SIZE; size_t resv = EFI_MIN_RESV_SIZE;
uint64_t slice_size; uint64_t slice_size;
diskaddr_t start_block; diskaddr_t start_block;
@ -3188,14 +3193,13 @@ zpool_label_disk(libzfs_handle_t *hdl, zpool_handle_t *zhp, char *name)
(void) snprintf(path, sizeof (path), "%s/%s%s", RDISK_ROOT, name, (void) snprintf(path, sizeof (path), "%s/%s%s", RDISK_ROOT, name,
BACKUP_SLICE); BACKUP_SLICE);
if ((fd = open(path, O_RDWR | O_NDELAY)) < 0) { if ((fd = open(path, O_RDWR|O_DIRECT)) < 0) {
/* /*
* This shouldn't happen. We've long since verified that this * This shouldn't happen. We've long since verified that this
* is a valid device. * is a valid device.
*/ */
printf("errno =%d\n", errno); zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
zfs_error_aux(hdl, "unable to open device '%s': %d"), path, errno);
dgettext(TEXT_DOMAIN, "unable to open device"));
return (zfs_error(hdl, EZFS_OPENFAILED, errbuf)); return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
} }
@ -3238,7 +3242,7 @@ zpool_label_disk(libzfs_handle_t *hdl, zpool_handle_t *zhp, char *name)
vtoc->efi_parts[8].p_size = resv; vtoc->efi_parts[8].p_size = resv;
vtoc->efi_parts[8].p_tag = V_RESERVED; vtoc->efi_parts[8].p_tag = V_RESERVED;
if (efi_write(fd, vtoc) != 0) { if ((rval = efi_write(fd, vtoc)) != 0) {
/* /*
* Some block drivers (like pcata) may not support EFI * Some block drivers (like pcata) may not support EFI
* GPT labels. Print out a helpful error message dir- * GPT labels. Print out a helpful error message dir-
@ -3248,22 +3252,36 @@ zpool_label_disk(libzfs_handle_t *hdl, zpool_handle_t *zhp, char *name)
(void) close(fd); (void) close(fd);
efi_free(vtoc); efi_free(vtoc);
zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "try using "
"try using fdisk(1M) and then provide a specific slice")); "parted(8) and then provide a specific slice: %d"), rval);
return (zfs_error(hdl, EZFS_LABELFAILED, errbuf)); return (zfs_error(hdl, EZFS_LABELFAILED, errbuf));
} }
(void) close(fd); (void) close(fd);
efi_free(vtoc); efi_free(vtoc);
#if defined(__linux__) /* Wait for the first expected slice to appear. */
/* Wait for the first expected slice to appear */
(void) snprintf(path, sizeof (path), "%s/%s%s", (void) snprintf(path, sizeof (path), "%s/%s%s",
DISK_ROOT, name, FIRST_SLICE); DISK_ROOT, name, FIRST_SLICE);
return zpool_label_disk_wait(path, 3000); rval = zpool_label_disk_wait(path, 3000);
#else if (rval) {
return (0); zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "failed to "
#endif "detect device partitions on '%s': %d"), path, rval);
return (zfs_error(hdl, EZFS_LABELFAILED, errbuf));
}
/* We can't be to paranoid. Read the label back and verify it. */
(void) snprintf(path, sizeof (path), "%s/%s", DISK_ROOT, name);
rval = zpool_label_disk_check(path);
if (rval) {
zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "freshly written "
"EFI label on '%s' is damaged. Ensure\nthis device "
"is not in in use, and is functioning properly: %d"),
path, rval);
return (zfs_error(hdl, EZFS_LABELFAILED, errbuf));
}
return 0;
} }
static boolean_t static boolean_t