The zpool_id script is posixly correct and does not use bash
features, so change its whackbang from /bin/bash to /bin/sh.
Debian policy also stipulates that system scripts be dash compatible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Direct invocation of GNU egrep is deprecated by its man page, and the
its argument in the zpool_id script is not an extended expression.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency and safety, quote all variables in the zpool_id
script. This accomodates a `-c CONFIG` parameter value with
whitespace in the path name.
Also fix a typo in the usage synopsis for `-h`.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439
The /lib/udev/path_id helper became a builtin command in the udev 174
release, so test whether path_id is external in the zpool_id script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #429
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline. Change these error messages
to use the zfsonlinux.org mirror instead.
This commit depends on:
zfsonlinux/zfsonlinux.github.com@8e10ead3dc
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export all symbols already marked extern in the zfs_vfsops.h
header. Several non-static symbols have also been added to
the header and exportewd. This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.
Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base. All other zfsvfs_* functions have already been renamed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When compiling under Debian Lenny with gcc version 4.3.2
(Debian 4.3.2-1.1) the following warning occurs. To quiet
the warning initialize 'error' to zero. Newer versions of
gcc correctly determine that this uninitialized varible is
impossible because ZFS_NUM_USERQUOTA_PROPS is known to be
greater than zero.
cmd/zfs/zfs_main.c:2377: warning: "error" may be
used uninitialized in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This warning was accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b. The fix is to
simply initialize the variable to ZFS_DELEG_WHO_UNKNOWN.
cmd/zfs/zfs_main.c:4460:25: warning: 'who_type' may be
used uninitialized in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These warnings were accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b. The fix is to
simply add the missing format specifier.
cmd/zfs/zfs_main.c:4565: warning: format not a string
literal and no format arguments
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Completely disable the zfs binary from attempting to directly update
/etc/mtab. The Linux port relies entirely on the mount.zfs helper
to safely update /etc/mtab. If we left the /etc/mtab updates to
the zfs binary then they could race with concurrent non-zfs mounts.
Routing everything through the system mount command ensures the
/etc/mtab updates are locked properly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #329
This change moves the default install location for the zfs udev
rules from /etc/udev/ to /lib/udev/. The correct convention is
for rules provided by a package to be installed in /lib/udev/.
The /etc/udev/ directory is reserved for custom rules or local
overrides.
Additionally, this patch cleans up some abuse of the bindir install
location by adding a udevdir and udevruledir install directories.
This allows us to revert to the default bin install location. The
udev install directories can be set with the following new options.
--with-udevdir=DIR install udev helpers [EPREFIX/lib/udev]
--with-udevruledir=DIR install udev rules [UDEVDIR/rules.d]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#356
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place. There is a very
good description of the issue and fix in Illumus #883.
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full. It should instead fail quickly on the full LUNs and
move onto devices which have more availability.
Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
The 'zfs get' command should be able to deal with mountpoint
as an argument. It already works with 'zfs list' command:
# zfs list /export/home/estibi
NAME USED AVAIL REFER MOUNTPOINT
rpool/export/home/estibi 1.14G 3.86G 1.14G /export/home/estibi
but it fails with 'zfs get':
# zfs get all /export/home/estibi
cannot open '/export/home/estibi': invalid dataset name
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Deano <deano@rattie.demon.co.uk>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
Unfortunately, ztest is hard coded to export the zdb utility to
be installed in a certain location. When the packaging was updated
to install zdb in /sbin/ ztest was broken. To fix this I'm updating
ztest to check both common install paths.
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present. This change unifies the default behavior
as follows:
root@prasad:~# ./zpool list 1 2
no pools available
no pools available
root@prasad:~# ./zpool iostat 1 2
no pools available
no pools available
root@prasad:~# ./zpool status 1 2
no pools available
no pools available
root@prasad:~# ./zpool list tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool iostat tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool status tan 1 2
cannot open 'tan': no such pool
Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
Update udev helper scripts to deal with device-mapper devices created
by multipathd. These enhancements are targeted at a particular
storage network topology under evaluation at LLNL consisting of two
SAS switches providing redundant connectivity between multiple server
nodes and disk enclosures.
The key to making these systems manageable is to create shortnames for
each disk that conveys its physical location in a drawer. In a
direct-attached topology we infer a disk's enclosure from the PCI bus
number and HBA port number in the by-path name provided by udev. In a
switched topology, however, multiple drawers are accessed via a single
HBA port. We therefore resort to assigning drawer identifiers based
on which switch port a drive's enclosure is connected to. This
information is available from sysfs.
Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
symbolic links in /dev/disk/by-id of the form
<label>-<UUID>-switch-port:<X>-slot:<Y>. <label> is a string that
depends on the subsystem that created the link and defaults to
"dm-uuid-mpath" (this prefix is used by multipathd). <UUID> is a
unique identifier for the disk typically obtained from the scsi_id
program, and <X> and <Y> denote the switch port and disk slot numbers,
respectively.
Add a callout script sas_switch_id for use by multipathd to help
create symlinks of the form described above. Update zpool_id and the
udev zpool rules file to handle both multipath devices and
conventional drives.
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Fedora 15 /etc/mtab is now a symlink to /proc/mounts by
default. When /etc/mtab is a symlink the mount.zfs helper
should not update it. There was code in place to handle this
case but it used stat() which traverses the link and then issues
the stat on /proc/mounts. We need to use lstat() to prevent the
link traversal and instead stat /etc/mtab.
Closes#270
The previous commit 8a7e1ceefa wasn't
quite right. This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.
For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable. This can lead to build failures
on older Linux platforms such as Debian Lenny. Since this is
an optional build argument this changes add a new autoconf check
for the option. If it is supported by the installed version of
gcc then it is used otherwise it is omited.
See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
We will never bring over the pyzfs.py helper script from Solaris
to Linux. Instead the missing functionality will be directly
integrated in to the zfs commands and libraries. To avoid
confusion remove the warning about the missing pyzfs.py utility
and simply use the default internal support.
The Illumous developers are of the same mind and have proposed an
initial patch to do this which has been integrated in to the 'allow'
development branch. After some additional testing this code
can be merged in to master as the right long term solution.
The zpool_id and zpool_layout helper scripts have been updated to
use the more common /usr/bin/awk symlink. On Fedora/Redhat systems
there are both /bin/awk and /usr/bin/awk symlinks to your installed
version of awk. On Debian/Ubuntu systems only the /usr/bin/awk
symlink exists.
Additionally, add the '\<' token to the beginning of the regex
pattern to prevent partial matches. This pattern only appears to
work with gawk despite the mawk man page claiming to support this
extended regex. Thus you will need to have gawk installed to use
these optional helper scripts. A comment has been added to the
script to reflect this reality.
With the addition of the mount helper we accidentally regressed
the ability to manually mount snapshots. This commit updates
the mount helper to expect the possibility of a ZFS_TYPE_SNAPSHOT.
All snapshot will be automatically treated as 'legacy' type mounts
so they can be mounted manually.
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
Disable the gethostid() override for Solaris behavior because Linux systems
implement the POSIX standard in a way that allows a negative result.
Mask the gethostid() result to the lower four bytes, like coreutils does in
/usr/bin/hostid, to prevent junk bits or sign-extension on systems that have an
eight byte long type. This can cause a spurious hostid mismatch that prevents
zpool import on 64-bit systems.
As of gcc-4.6 the option -Wunused-but-set-variable is enabled by
default. While this is a useful warning there are numerous places
in the ZFS code when a variable is set and then only checked in an
ASSERT(). To avoid having to update every instance of this in the
code we now set -Wno-unused-but-set-variable to suppress the warning.
Additionally, when building with --enable-debug and -Werror set these
warning also become fatal. We can reevaluate the suppression of these
error at a later time if it becomes an issue. For now we are basically
just reverting to the previous gcc behavior.
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'value' as being set but never used. This generates a
warning and a build failure when using --enable-debug. Once again
this is correct but I'm reluctant to remove 'value' because we are
breaking the string in to name/value pairs. While it is not used
now there's a good chance it will be soon and I'd rather not have
to reinvent this. To suppress the warning with just as a VERIFY().
This was observed under Fedora 15.
cmd/mount_zfs/mount_zfs.c: In function ‘parse_option’:
cmd/mount_zfs/mount_zfs.c:112:21: error: variable ‘value’ set but not
used [-Werror=unused-but-set-variable]
Some udev hooks are not designed to be idempotent, so calling udevadm
trigger outside of the distribution's initialization scripts can have
unexpected (and potentially dangerous) side effects. For example, the
system time may change or devices may appear multiple times. See Ubuntu
launchpad bug 320200 and this mailing list post for more details:
https://lists.ubuntu.com/archives/ubuntu-devel/2009-January/027260.html
To avoid these problems we call udevadm trigger with --action=change
--subsystem-match=block. The first argument tells udev just to refresh
devices, and make sure everything's as it should be. The second
argument limits the scope to block devices, so devices belonging to
other subsystems cannot be affected.
This doesn't fix the problem on older udev implementations that don't
provide udevadm but instead have udevtrigger as a standalone program.
In this case the above options aren't available so there's no way to
call call udevtrigger safely. But we can live with that since this
issue only exists in optional test and helper scripts, and most
zfs-on-linux users are running newer systems anyways.
This commit fixes issue on
https://github.com/behlendorf/zfs/issues/#issue/172
Changes:
- update BLKZNAME to use _IOR instead of _IO. Kernel 2.6.32 allows
read parameters (copy_to_user) with _IO, while newer kernels (tested
Archlinux's 2.6.37 kernel) enforces _IOR (which is correct)
- fix return code and message on error
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash(). The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use. Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
Compiling with 'LDFLAGS=-Wl,--as-needed' exposed the fact that
there were some library linking problems introduced by mount_zfs.
In particular, the libzfs library does use nvpair symbols, and
mount_zfs contains no dependencies on libzpool.
Closes#161Closes#162
New versions glibc declare getcwd() with the warn_unused_result attribute.
This results in a warning because the updated mount helper was not
checking this return value. This issue was fixed by checking the return
type and in the case of an error simply returning the passed dataset.
One possible, but unlikely, error would be having your cwd directory
unlinked while the mount command was running.
cmd/mount_zfs/mount_zfs.c: In function ‘parse_dataset’:
cmd/mount_zfs/mount_zfs.c:223:2: error: ignoring return value of
‘getcwd’, declared with attribute warn_unused_result
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed. Unfortunately, every distribution
has their own idea of the _right_ way to do things. Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway. I have
instead added support to provide multiple distribution specific
init scripts.
The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.
Currently, there is zfs.fedora and a more generic zfs.lsb init
script. Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.
This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
When updating /etc/mtab we should be careful and strip certain
options. In particular, we need to strip 'zfsutil' because if
we don't the mount utility will helpfull provide it to the
mount helper when we issue mount(8) again. This subverts the
check that the caller is zfs(8) and not mount(8).
Allow the mount(8) utility to always operate on all datasets when
remounting them read-only. This critical for rc.sysinit/umountroot
which remounts the root filesystem read-only during shutdown to
ensure everything is correctly flushed to disk.
Fix minor typo, the check to set zfsutil should use the bitwise
'&'. I must have accidentally hit the adjacent '*' and obviously
neither the compiler or my code review caught this. Fix it now.
When run with a root '/' cwd the mount.zfs helper would strip not
only the '/' but also the next character from the dataset name.
For example, '/tank' was changed to 'ank' instead of just 'tank'.
Originally, this was done for the '/tmp' cwd case where we needed
to strip the '/' following the cwd. For example '/tmp/tank' needed
to remove the '/tmp' cwd plus 1 character for the '/'.
This change fixes the problem by checking the cwd and if it ends in
a '/' it does not strip and extra character. Otherwise it will strip
the next character. I believe this should only ever be true for the
root directory.
Closes#148
Several issues related to strange mount/umount behavior were reported
and this commit should address most of them. The original idea was
to put in place a zfs mount helper (mount.zfs). This helper is used
to enforce 'legacy' mount behavior, and perform any extra mount argument
processing (selinux, zfsutil, etc). This helper wasn't ready for the
0.6.0-rc1 release but with this change it's functional but needs to
extensively tested.
This change addresses the following open issues.
Closes#101Closes#107Closes#113Closes#115Closes#119