Commit Graph

3145 Commits

Author SHA1 Message Date
Brian Behlendorf fd1cd4888a Merge branch 'udev'
Merge the remaining udev restructuring changes and cleanup.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Fuller <inbox@kylefuller.co.uk>
Signed-off-by: Zachary Bedell <zac@thebedells.org>
Closes #356
2011-08-22 09:27:20 -07:00
Brian Behlendorf aa2b4896c9 Fix autoconf variable substitution in init scripts.
Change the variable substitution in the init script templates
according to the method described in the Autoconf manual;
Chapter 4.7.2: Installation Directory Variables.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 16:26:14 -07:00
Kyle Fuller f0102d6e75 Make dracut module-setup.sh an autoconf config file
This ensures that module-setup.sh script will always be able to
install the required dracut components regardless of how the zfs
package was configured.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 16:26:13 -07:00
Kyle Fuller 146cde8f4a Move 90-zfs udev rule from dracut to udev/rules.d
This rule does not need to be dracut specific.  Automatically loading
the zfs module stack when a zfs device is detected is usually desirable.
My only concern is that this might cause trouble for large pools where
we don't want to automatically import the pool until all the disks are
available.  However, we'll cross that bridge when we come to it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 16:26:13 -07:00
Brian Behlendorf 9c4f40b894 Buildbot suppression rules
The warnings listed in the suppression file will be suppressed
and not flagged during regular buildbot builds.  These warnings
are expected, harmless, and can obscure real issues unless they
are suppressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 16:26:06 -07:00
Brian Behlendorf 4c069d3494 Fixed uninitialized variable
This warning was accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b.  The fix is to
simply initialize the variable to ZFS_DELEG_WHO_UNKNOWN.

  cmd/zfs/zfs_main.c:4460:25: warning: 'who_type' may be
  used uninitialized in this function

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 16:26:06 -07:00
Brian Behlendorf 29b35200a7 Fix missing format arguments
These warnings were accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b.  The fix is to
simply add the missing format specifier.

  cmd/zfs/zfs_main.c:4565: warning: format not a string
  literal and no format arguments

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 15:16:34 -07:00
Brian Behlendorf 95d9fd028b Fix incompatible pointer type warning
This warning was accidentally introduced by commit
f3ab88d646 which updated the
.readpages() implementation.  The fix is to simply cast
the helper function to the appropriate type when passed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 15:16:30 -07:00
Brian Behlendorf b740d602bd Disable zfs /etc/mtab updates
Completely disable the zfs binary from attempting to directly update
/etc/mtab.  The Linux port relies entirely on the mount.zfs helper
to safely update /etc/mtab.  If we left the /etc/mtab updates to
the zfs binary then they could race with concurrent non-zfs mounts.
Routing everything through the system mount command ensures the
/etc/mtab updates are locked properly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #329
2011-08-19 11:53:01 -07:00
Brian Behlendorf ddd052aa83 Improve HAVE_EVICT_INODE check
The hardened gentoo kernel defines all of the super block
operation callbacks as const.  This prevents the autoconf test
from assigning the callback and results in a false negative.
By moving the assignment in to the declaration we can avoid
this issue and get a correct result for this patched kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #296
2011-08-08 16:42:09 -07:00
Brian Behlendorf de0a1c099b Autogen refresh for udev changes
Run autogen.sh using the same autotools versions as upstream:

 * autoconf-2.63
 * automake-1.11.1
 * libtool-2.2.6b
2011-08-08 16:30:27 -07:00
Kyle Fuller 12d06bac9b Move udev rules from /etc/udev to /lib/udev
This change moves the default install location for the zfs udev
rules from /etc/udev/ to /lib/udev/.  The correct convention is
for rules provided by a package to be installed in /lib/udev/.
The /etc/udev/ directory is reserved for custom rules or local
overrides.

Additionally, this patch cleans up some abuse of the bindir install
location by adding a udevdir and udevruledir install directories.
This allows us to revert to the default bin install location.  The
udev install directories can be set with the following new options.

  --with-udevdir=DIR      install udev helpers [EPREFIX/lib/udev]
  --with-udevruledir=DIR  install udev rules [UDEVDIR/rules.d]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #356
2011-08-08 16:21:10 -07:00
Brian Behlendorf f3ab88d646 Correctly lock pages for .readpages()
Unlike the .readpage() callback which is passed a single locked page
to be populated.  The .readpages() callback is passed a list of unlocked
pages which are all marked for read-ahead (PG_readahead set).  It is
the responsibly of .readpages() to ensure to pages are properly locked
before being populated.

Prior to this change the requested read-ahead pages would be updated
outside of the page lock which is unsafe.  The unlocked pages would then
be unlocked again which is harmless but should have been immediately
detected as bug.  Unfortunately, newer kernels failed detect this issue
because the check is done with a VM_BUG_ON which is disabled by default.
Luckily, the old Debian Lenny 2.6.26 kernel caught this because it
simply uses a BUG_ON.

The straight forward fix for this is to update the .readpages() callback
to use the read_cache_pages() helper function.  The helper function will
ensure that each page in the list is properly locked before it is passed
to the .readpage() callback.  In addition resolving the bug, this results
in a nice simplification of the existing code.

The downside to this change is that instead of passing one large read
request to the dmu multiple smaller ones are submitted.  All of these
requests however are marked for readahead so the lower layers should
issue a large I/O regardless.  Thus most of the request should hit the
ARC cache.

Futher optimization of this code can be done in the future is a perform
analysis determines it to be worthwhile.  But for the moment, it is
preferable that code be correct and understandable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #355
2011-08-08 13:24:52 -07:00
Brian Behlendorf 76659dc110 Add backing_device_info per-filesystem
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk.  The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance.  Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info).  The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback.  The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme.  However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi
functionality can cause problems.  If an application handles a page
fault it can enter the balance_dirty_pages() callpath.  This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the
dirty pages will not get written out.  Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block.  It is then registered
when the filesystem mounted and unregistered on unmount.  It will
not be registered for mounted snapshots which are read-only.  This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2011-08-04 13:37:38 -07:00
Brian Behlendorf 3c0e5c0f45 Cleanup mmap(2) writes
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct.  In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache.  This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.

Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds.  However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.

This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean.  When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction.  The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.

This process is normally entirely asynchronous and will be repeated
for every dirty page.  This may initially sound inefficient but most
of these pages will end up in a few txgs.  That means when they are
eventually written to disk they should be nicely batched.  However,
there is room for improvement.  It may still be desirable to batch
up the pages in to larger writes for the dmu.  This would reduce
the number of callbacks and small 4k buffer required by the ARC.

Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set.  Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-02 10:34:55 -07:00
Gunnar Beutner ddd0fd9ef6 Use libzfs_run_process() in libshare.
This should simplify the code a bit by re-using existing code
to fork and exec a process.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #190
2011-08-01 13:52:35 -07:00
Gunnar Beutner 3132cb397a Use /dev/null for stdout/stderr in libzfs_run_process().
Simply closing the stdout and/or stderr file descriptors for
the child process can have bad side effects if for example
the child writes to stdout/stderr after open()ing a file.
The open() call might have returned the same file descriptor
one would usually expect for stdout/stderr (1 and 2), thereby
causing mis-directed writes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #190
2011-08-01 13:52:23 -07:00
James H 5333eb0b3b Call exportfs -v once for NFS shares.
At the moment we call exportfs -v every time we check whether an
NFS share is active. This happens every time you run a zfs or
zpool command, making them extremely slow when you have a lot of
exports. The time taken is approx O(n2) of the number of shares.

This commit stores the output from exportfs -v in a temporary file
and use this to speed up subsequent accesses.

This mechanism is still too slow - if you have tens of thousands
of NFS shares it will still be painful running ANY zfs/zpool
command.

Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #341
2011-08-01 13:50:40 -07:00
Brian Behlendorf 77999e804f Merge branch 'illumos'
Merge in ten upstream fixes which have already been made to both
the Illumos and FreeBSD ZFS implementations.  This brings us up
to date with the latest ZFS changes in Illumos.

Credit goes to Martin Matuska of the FreeBSD project for posting
an excellent summary of the upstream patches we were missing.

Illumos #1313: Integer overflow in txg_delay()
Illumos #278:  get rid zfs of python and pyzfs dependencies
Illumos #1043: Recursive zfs snapshot destroy fails
Illumos #883:  ZIL reuse during remount corruption
Illumos #1092: zfs refratio property
Illumos #1051: zfs should handle
Illumos #510:  'zfs get' enhancement - mountpoint as an argument
Illumos #175:  zfs vdev cache consumes excessive memory
Illumos #764:  panic in zfs:dbuf_sync_list
Illumos #xxx:  zdb -vvv broken after zfs diff integration

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #340
2011-08-01 12:10:58 -07:00
Martin Matuska cddafdcbc5 Illumos #1313: Integer overflow in txg_delay()
The function txg_delay() is used to delay txg (transaction group)
threads in ZFS.  The timeout value for this function is calculated
using:

    int timeout = ddi_get_lbolt() + ticks;

Later, the actual wait is performed:

    while (ddi_get_lbolt() < timeout &&
        tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
            (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
                timeout - ddi_get_lbolt());

The ddi_get_lbolt() function returns current uptime in clock ticks
and is typed as clock_t.  The clock_t type on 64-bit architectures
is int64_t.

The "timeout" variable will overflow depending on the tick frequency
(e.g. for 1000 it will overflow in 28.855 days). This will make the
expression "ddi_get_lbolt() < timeout" always false - txg threads will
not be delayed anymore at all. This leads to a slowdown in ZFS writes.

The attached patch initializes timeout as clock_t to match the return
value of ddi_get_lbolt().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #352
2011-08-01 12:09:43 -07:00
Alexander Stetsenko 0b7936d5c2 Illumos #278: get rid zfs of python and pyzfs dependencies
Remove all python and pyzfs dependencies for consistency and
to ensure full functionality even in a mimimalist environment.

Reviewed by: gordon.w.ross@gmail.com
Reviewed by: trisk@opensolaris.org
Reviewed by: alexander.r.eremin@gmail.com
Reviewed by: jerry.jelinek@joyent.com
Approved by: garrett@nexenta.com

References to Illumos issue and patch:
- https://www.illumos.org/issues/278
- https://github.com/illumos/illumos-gate/commit/1af68beac3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Issue #160

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-01 12:09:36 -07:00
Martin Matuska ca5252204a Illumos #1043: Recursive zfs snapshot destroy fails
Prior to revision 11314 if a user was recursively destroying
snapshots of a dataset the target dataset was not required to
exist.  The zfs_secpolicy_destroy_snaps() function introduced
the security check on the target dataset, so since then if the
target dataset does not exist, the recursive destroy is not
performed.  Before 11314, only a delete permission check on
the snapshot's master dataset was performed.

Steps to reproduce:
zfs create pool/a
zfs snapshot pool/a@s1
zfs destroy -r pool@s1

Therefore I suggest to fallback to the old security check, if
the target snapshot does not exist and continue with the destroy.

References to Illumos issue and patch:
- https://www.illumos.org/issues/1043
- https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Eric Schrock 3e31d2b080 Illumos #883: ZIL reuse during remount corruption
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place.  There is a very
good description of the issue and fix in Illumus #883.

Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Matt Ahrens f5fc4acaa7 Illumos #1092: zfs refratio property
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (ie, includes blocks in children, but not blocks
shared with the origin).

This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).

Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
George Wilson 6d974228ef Illumos #1051: zfs should handle imbalanced luns
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full.  It should instead fail quickly on the full LUNs and
move onto devices which have more availability.

Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Shampavman bb939d1085 Illumos #510: 'zfs get' enhancement - mountpoint as an argument
The 'zfs get' command should be able to deal with mountpoint
as an argument.  It already works with 'zfs list' command:

  # zfs list /export/home/estibi
  NAME                       USED  AVAIL  REFER  MOUNTPOINT
  rpool/export/home/estibi  1.14G  3.86G  1.14G  /export/home/estibi

but it fails with 'zfs get':

  # zfs get all /export/home/estibi
  cannot open '/export/home/estibi': invalid dataset name

Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Deano <deano@rattie.demon.co.uk>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Garrett D'Amore 2cc6c8db12 Illumos #175: zfs vdev cache consumes excessive memory
Note that with the current ZFS code, it turns out that the vdev
cache is not helpful, and in some cases actually harmful.  It
is better if we disable this.  Once some time has passed, we
should actually remove this to simplify the code.  For now we
just disable it by setting the zfs_vdev_cache_size to zero.
Note that Solaris 11 has made these same changes.

References to Illumos issue and patch:
- https://www.illumos.org/issues/175
- https://github.com/illumos/illumos-gate/commit/b68a40a845

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Gordon Ross ef3c1dea70 Illumos #764: panic in zfs:dbuf_sync_list
Hypothesis about what's going on here.

At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);

These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)

Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.

Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().

References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com
Approved by: Garrett D'Amore <garrett@nexenta.com>

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:11 -07:00
Tim Haley 7b8518cb8d Illumos #xxx: zdb -vvv broken after zfs diff integration
References to Illumos issue and patch:
- https://github.com/illumos/illumos-gate/commit/163eb7ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:02 -07:00
Brian Behlendorf bfb73f9277 Add .gitignore for zfs.<distro> init scripts
Treat the automatically generated zfs.<distro> init scripts
as build products by adding them to a directory specific
.gitignore file.
2011-08-01 10:27:54 -07:00
Kyle Fuller 5faa9c0367 Turn the init.d scripts into autoconf config files
This change ensures the paths used by the provided init scripts
always reference the prefixes provided at configure time.  The
@sbindir@ and @sysconfdir@ prefixes will be correctly replaced
at build time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #336
2011-08-01 09:54:44 -07:00
Zachary Bedell 7f4afd300b Wrap dracut scripts to 79 chars
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #347
2011-07-31 11:28:44 -07:00
Kyle Fuller 7304278d7f Make autogen.sh executable
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-26 10:15:35 -07:00
Zachary Bedell a4719e54e8 Catch return errors from zpool commands
This fixes a bug that can effect first reboot after install
using Dracut.  The Dracut module didn't check the return
value from several calls to z* functions.  This resulted in
"Using no pools available as root" on boot if the ZFS module
didn't auto-import the pools.  It's most likely to happen on
initial restart after a fresh install & requires juggling in
the Dracut emergency holographic shell to fix.

This patch checks return codes & output from zpool list and
related functions and correctly falls into the explicit zpool
import code branch if the module didn't import the pool at load.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-25 10:43:44 -07:00
Zachary Bedell 1ef5e8296a Soft to hard tabs
For consistency with the upstream sources and the rest of the
project use hard instead of soft tabs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-25 10:42:07 -07:00
Brian Behlendorf beb9826902 Fix txg_sync_thread deadlock
Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE.
Because these functions are called from txg_sync_thread we
must ensure they don't reenter the zfs filesystem code via
the .writepage callback.  This would result in a deadlock.

This deadlock is rare and has only been observed once under
an abusive mmap() write workload.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-22 15:24:57 -07:00
Brian Behlendorf f7ef846ea1 Add missing <pool> option
The bootfs example in the dracut documentation was sightly incorrect
because it lacked the trailing required pool argument.  Add it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-22 15:18:24 -07:00
Brian Behlendorf 0da7869690 Fix the configure CONFIG_* option detection
The latest kernels no longer define AUTOCONF_INCLUDED which was
being used to detect the new style autoconf.h kernel configure
options.  This results in the CONFIG_* checks always failing
incorrectly for newer kernels.

The fix for this is a simplification of the testing method.
Rather than attempting to explicitly include to renamed config
header.  It is simpler to unconditionally include <linux/module.h>
which must pick up the correctly named header.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #320
2011-07-22 15:07:16 -07:00
Brian Behlendorf 22872ff5da Use zfs_mknode() to create dataset root
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies.  This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.

Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.

This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code.  There have been minor
modifications to accomidate the Linux VFS but that is all.

This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory.  This ensures it is created properly with
support for system attributes.  Previously it wasn't which
is why it behaved differently that all other directories
when modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #116
2011-07-20 19:52:26 -07:00
Brian Behlendorf 9fd91daeef Honor setgit bit on directories
Newly created files were always being created with the fsuid/fsgid
in the current users credentials.  This is correct except in the
case when the parent directory sets the 'setgit' bit.  In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory.  Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.

Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #262
2011-07-20 14:07:13 -07:00
Brian Behlendorf fe0ed8f910 Fix 'make install' overly broad 'rm'
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels.  This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against.  This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.

The fix for this issue is to only remove extraneous build products
when DESTDIR is set.  This almost exclusively indicates we are
building packages and installed the build products in to a temporary
staging location.  Additionally, limit the removal the unneeded
build products to the target kernel version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
2011-07-20 09:38:51 -07:00
Brian Behlendorf cfc9a5c88f Fix zpl_writepage() deadlock
Disable the normal reclaim path for zpl_putpage().  This ensures that
all memory allocations under this call path will never enter direct
reclaim.  If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.

This sitution is typically handled in Linux by marking each offending
allocation GFP_NOFS.  However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.

The following example of the above described deadlock was triggered
by test 074 in the xfstest suite.

Call Trace:
 [<ffffffff814dcdb2>] down_write+0x32/0x40
 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
 [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
 [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
 [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
 [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50
 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
 [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
 [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
 [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
 [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
 [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
 [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
 [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
 [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
 [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
 [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
 [<ffffffff81122521>] do_writepages+0x21/0x40
 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #327
2011-07-19 16:17:01 -07:00
Brian Behlendorf abd39a8289 Fix zio_execute() deadlock
To avoid deadlocking the system it is crucial that all memory
allocations performed in the zio_execute() call path are marked
KM_PUSHPAGE (GFP_NOFS).  This ensures that while a z_wr_iss
thread is processing the syncing transaction group it does
not re-enter the filesystem code and deadlock on itself.

Call Trace:
 [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl]
 [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs]
 [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs]
 [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs]
 [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs]
 [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs]
 [<ffffffff8115f907>] writeout+0xa7/0xd0
 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
 [<ffffffff811559ab>] compact_zone+0x4fb/0x780
 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
 [<ffffffff81159932>] kmem_getpages+0x62/0x170
 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160
 [<ffffffff8115b059>] __kmalloc+0x189/0x220
 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl]
 [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs]
 [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs]
 [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs]
 [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs]
 [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs]
 [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs]
 [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs]
 [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs]
 [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl]
 [<ffffffff8108ddf6>] kthread+0x96/0xa0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #291
2011-07-19 11:55:42 -07:00
Brian Behlendorf a140dc5469 Fix mmap(2)/write(2)/read(2) deadlock
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages.  Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.

The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case.  The individual pages will still be locked
thus serializing access.  Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do.  This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.

This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks.  It's up to each filesystem
to perform its own file locking.  Under Linux the VFS provides many
of these services.

It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
2011-07-19 11:55:42 -07:00
Brian Behlendorf 7f9d9946f8 Update 'zpool import' man page
The following supported options were missing from the zpool.8
man page.  The OpenSolaris man pages originally used were simply
out of date with the code.

zpool import
  -F Recovery mode
  -m Allow missing log devices
  -N Import but don't mount
  -n Determine if recoverable but don't do it

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-19 11:24:20 -07:00
Brian Behlendorf 61f218b090 Fix send/recv 'dataset is busy' errors
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility chanages.  As part of these changes
instead of holding an active reference on the namepsace (which is
no longer posible) a reference is taken on the super block.  This
reference ensures the super block remains valid while it is in use.

To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference.  This indicates that the filesystem is
active and in use.

Unfortunately, in the 'zfs recv' case this is not the case.  The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.

The most straight forward fix for this is to simply update the
code to always take the reference even when it's zero.  This
may expose us to very very unlikely concurrent umount/send/recv
case but the consequences of that are minor.

Closes #319
2011-07-15 16:37:19 -07:00
Kyle Fuller 615ab66d18 Provide a rc.d script for archlinux
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d.  This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #322
2011-07-11 14:12:23 -07:00
Brian Behlendorf 057e8eee35 Improve fstat(2) performance
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper.  However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date.  Unfortunately, this isn't
the case for the blksize, blocks, and atime fields.  At the
moment the authoritative values are still stored in the znode.

This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode.  At some
latter date we should be able to strictly use the inode values
and further improve performance.

The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex.  This overhead is
unavoidable until the inode is kept strictly up to date.  The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros.  These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation.  However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.

=================== Performance Tests ========================

This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop.  The test results
show the zfs stat(2) performance is now only 22% slower than
ext4.  This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.

filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x

The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files.  The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache.  As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.

A little surprisingly the zfs cold cache performance is better
than ext4.  This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times.  By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.

filesystem    | cold    | hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
2011-07-11 09:11:22 -07:00
Brian Behlendorf abd8610cd5 Add L2ARC tunables
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:

     l2arc_write_max         max write bytes per interval
     l2arc_write_boost       extra write bytes during device warmup
     l2arc_noprefetch        skip caching prefetched buffers
     l2arc_headroom          number of max device writes to precache
     l2arc_feed_secs         seconds between L2ARC writing
     l2arc_feed_min_ms       min feed interval in milliseconds
     l2arc_feed_again        turbo L2ARC warmup
     l2arc_norw              no reads during writes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #316
2011-07-08 12:44:11 -07:00
Brian Behlendorf e0f86c9862 Update 'zfs send' documentation
The -D and -p options were missing from the manpage.  This commit
adds documentation for these features.

Closes #311
2011-07-08 12:16:09 -07:00