zfs/dracut
Brian Behlendorf ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly associated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed.  They include:

* The dnlc_reduce_cache() function, which was called by the ARC
to drop dentries for the POSIX layer, was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries, removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit; however, that's not
strictly correct since it results in overzealous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
metadata into the cache, if a buffer cannot be recycled,
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpected collapses
of the ARC cache.  These occurred because when just the
arc_meta_limit was exceeded, reclaim pressure would be exerted on
the arc_c value via arc_shrink().  This effectively shrunk the
entire cache when really we just needed to reclaim metadata.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
90zfs Linux 3.1 compat, super_block->s_shrink 2012-01-11 11:46:02 -08:00
Makefile.am Add dracut support 2011-03-17 16:52:04 -07:00
Makefile.in Linux 3.1 compat, super_block->s_shrink 2012-01-11 11:46:02 -08:00
README.dracut.markdown Wrap dracut scripts to 79 chars 2011-07-31 11:28:44 -07:00

README.dracut.markdown

How to set up a zfs root filesystem using dracut

  1. Install the zfs-dracut package. This package adds a zfs dracut module to the /usr/share/dracut/modules.d/ directory which allows dracut to create an initramfs which is zfs aware.

  2. Set the bootfs property for the bootable dataset in the pool. Then set the dataset mountpoint property to '/'.

    $ zpool set bootfs=pool/dataset pool
    $ zfs set mountpoint=/ pool/dataset

Alternately, legacy mountpoints can be used by setting the 'root=' option on the kernel line of your grub.conf/menu.lst configuration file. Then set the dataset mountpoint property to 'legacy'.

    grub.conf/menu.lst: kernel ... root=ZFS=pool/dataset

    $ zfs set mountpoint=legacy pool/dataset

  3. To set zfs module options, put them in the /etc/modprobe.d/zfs.conf file. The complete list of zfs module options is available by running the modinfo zfs command. Commonly set options include: zfs_arc_min, zfs_arc_max, zfs_prefetch_disable, and zfs_vdev_max_pending.

  4. Finally, create your new initramfs by running dracut.

    $ dracut --force /path/to/initramfs kernel_version
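As a rough illustration of the module options step, the sketch below writes a modprobe options file. The function name write_zfs_options and the option values are assumptions made up for this example; only the option names come from the list above.

```shell
#!/bin/sh
# Write an options line for the zfs module into the file named by $1.
# The values are illustrative only; pick ones that suit your system.
write_zfs_options() {
    conf=$1
    cat > "$conf" <<'EOF'
# Example values: cap the ARC at 1 GiB and disable prefetch.
options zfs zfs_arc_max=1073741824 zfs_prefetch_disable=1
EOF
}
```

Calling write_zfs_options /etc/modprobe.d/zfs.conf (as root) before running dracut would let the options be picked up when the initramfs is rebuilt, assuming the dracut module copies that file into the image.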

Kernel Command Line

The initramfs' behavior is influenced by the following kernel command line parameters passed in from the boot loader:

  • root=...: If not set, importable pools are searched for a bootfs attribute. If an explicitly set root is desired, you may use root=ZFS:pool/dataset

  • zfs_force=0: If set to 1, the initramfs will run zpool import -f when attempting to import pools if the required pool isn't automatically imported by the zfs module. This can save you a trip to a bootcd if hostid has changed, but is dangerous and can lead to zpool corruption, particularly in cases where storage is on a shared fabric such as iSCSI where multiple hosts can access storage devices concurrently. Please understand the implications of force-importing a pool before enabling this option!

  • spl_hostid: By default, the hostid used by the SPL module is read from /etc/hostid inside the initramfs. This file is placed there from the host system when the initramfs is built, which effectively ties the ramdisk to the host which builds it. If a different hostid is desired, one may be set in this attribute and will override any file present in the ramdisk. The format should be hex exactly as found in the /etc/hostid file, e.g. spl_hostid=0x00bab10c.

Note that changing the hostid between boots will most likely lead to an un-importable pool since the last importing hostid won't match. In order to recover from this, you may use the zfs_force option or boot from a different filesystem and zpool import -f then zpool export the pool before rebooting with the new hostid.
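To make the parameter handling concrete, here is a minimal sketch of pulling one name=value pair out of a kernel command line string. The helper name cmdline_get is hypothetical; Dracut ships its own argument helpers, and this is not a copy of them.

```shell
#!/bin/sh
# Print the value of the last "name=value" token found in a command line.
# $1 = parameter name, $2 = command line string (e.g. from /proc/cmdline).
# Returns non-zero when the parameter is absent.
cmdline_get() {
    name=$1
    value=
    found=1
    set -f                      # do not glob-expand tokens while splitting
    for token in $2; do
        case "$token" in
            "$name"=*) value=${token#"$name"=}; found=0 ;;
        esac
    done
    set +f
    [ "$found" -eq 0 ] && printf '%s\n' "$value"
    return "$found"
}
```

For example, cmdline_get zfs_force "$(cat /proc/cmdline)" prints 1 when zfs_force=1 was passed by the boot loader, and fails when the parameter is missing so the caller can fall back to the default of 0.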

How it Works

The Dracut module consists of the following files (excluding the Makefiles):

  • module-setup.sh: Script run by the initramfs builder to create the ramdisk. Contains instructions on which files are required by the modules and z* programs. Also triggers inclusion of /etc/hostid and the zpool cache. This file is not included in the initramfs.

  • 90-zfs.rules: udev rules which trigger loading of the ZFS modules at boot.

  • parse-zfs.sh: Run early in the initramfs boot process to parse the kernel command line and determine if ZFS is the active root filesystem.

  • mount-zfs.sh: Run later in the initramfs boot process, after udev has settled, to mount the root dataset.

module-setup.sh

This file is run by the Dracut script within the live system, not at boot time. It's not included in the final initramfs. Functions in this script describe which files are needed by ZFS at boot time.

Currently all the various z* and spl modules are included, a dependency is asserted on udev-rules, and the various zfs, zpool, etc. helpers are included. Dracut provides library functions which automatically gather the shared libs necessary to run each of these binaries, so statically built binaries are not required.

The zpool and zvol udev rules files are copied from where they are installed by the ZFS build. PACKAGERS TAKE NOTE: If you move /etc/udev/rules/60-z*.rules, you'll need to update this file to match.

Currently this file also includes /etc/hostid and /etc/zfs/zpool.cache which means the generated ramdisk is specific to the host system which built it. If a generic initramfs is required, it may be preferable to omit these files and specify the spl_hostid from the boot loader instead.

parse-zfs.sh

Run during the cmdline phase of the initramfs boot process, this script performs some basic sanity checks on kernel command line parameters to determine if booting from ZFS is likely to be what is desired. Dracut requires this script to adjust the root variable if required and to set rootok=1 if a mountable root filesystem is available. Unfortunately this script must run before udev is settled and kernel modules are known to be loaded, so accessing the zpool and zfs commands is unsafe.

If the root=ZFS... parameter is set on the command line, then it's at least certain that ZFS is what is desired, though this script is unable to determine if ZFS is in fact available. This script will alter the root parameter to replace several historical forms of specifying the pool and dataset name with the canonical form of zfs:pool/dataset.

If no root= parameter is set, the best this script can do is guess that ZFS is desired. At present, no other known filesystems will work with no root= parameter, though this might possibly interfere with using the compiled-in default root in the kernel image. It's considered unlikely that would ever be the case when an initramfs is in use, so this script sets root=zfs:AUTO and hopes for the best.
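The rewriting described above can be sketched as a small case statement. This is an illustration, not the contents of parse-zfs.sh: the function name normalize_zfs_root is hypothetical, and the set of accepted spellings is limited to the ones that appear in this README, which may not match the full list the real script handles.

```shell
#!/bin/sh
# Map a root= value onto the canonical zfs:pool/dataset form.
# An empty value is treated as "guess ZFS" and becomes zfs:AUTO.
# Returns non-zero for values that do not look like a ZFS root.
normalize_zfs_root() {
    case "$1" in
        ZFS=*) printf 'zfs:%s\n' "${1#ZFS=}" ;;
        ZFS:*) printf 'zfs:%s\n' "${1#ZFS:}" ;;
        zfs:*) printf '%s\n' "$1" ;;
        "")    printf 'zfs:AUTO\n' ;;
        *)     return 1 ;;
    esac
}
```

A caller would assign the result back to root and set rootok=1 only when the function succeeds, leaving other root types for other Dracut modules to claim.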

Once the root=... (or lack thereof) parameter is parsed, a dummy symlink is created from /dev/root -> /dev/null to satisfy parts of the Dracut process which check for presence of a single root device node.

Finally, an initqueue/finished hook is registered which causes the initqueue phase of Dracut to wait for /dev/zfs to become available before attempting to mount anything.
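One way to picture the hook registration is a tiny script dropped into a hook directory whose exit status reports whether the device node exists yet. The sketch assumes Dracut keeps re-running such "finished" scripts until they all succeed; the function name register_wait_hook, the file name zfs.sh, and the directory argument are placeholders for this example.

```shell
#!/bin/sh
# Create a one-line check script in directory $1 that succeeds only once
# the path $2 (default /dev/zfs) exists.
register_wait_hook() {
    hook_dir=$1
    device=${2:-/dev/zfs}
    printf '[ -e "%s" ]\n' "$device" > "$hook_dir/zfs.sh"
}
```

Keeping the check to a single existence test means the hook is cheap to re-run on every pass of the initqueue loop.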

mount-zfs.sh

This script is run after udev has settled and all tasks in the initqueue have succeeded. This ensures that /dev/zfs is available and that the various ZFS modules are successfully loaded. As it is now safe to call zpool and friends, we can proceed to find the bootfs attribute if necessary.

If the root parameter was explicitly set on the command line, no parsing is necessary. The list of imported pools is checked to see if the desired pool is already imported. If it's not, an attempt is made to import the pool explicitly, though no force is attempted. Finally the specified dataset is mounted on $NEWROOT, first using the -o zfsutil option to handle non-legacy mounts, then if that fails, without zfsutil to handle legacy mount points.

If no root parameter was specified, this script attempts to find a pool with its bootfs attribute set. First, already-imported pools are scanned and if an appropriate pool is found, no additional pools are imported. If no pool with bootfs is found, any additional pools in the system are imported with zpool import -N -a, and the scan for bootfs is tried again. If no bootfs is found with all pools imported, all pools are re-exported, and boot fails. Assuming a bootfs is found, an attempt is made to mount it to $NEWROOT, first with, then without the zfsutil option as above.
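The bootfs scan and the two-step mount can be sketched as follows. This is an illustration under stated assumptions rather than the script itself: first_bootfs and mount_zfs_root are hypothetical names, and the scan expects one bootfs value per line on standard input, with "-" marking pools where the property is unset (the format printed by zpool list -H -o bootfs).

```shell
#!/bin/sh
# Print the first bootfs value that is actually set; fail if none is.
# Input: one value per line, "-" meaning "not set".
first_bootfs() {
    while read -r bootfs; do
        if [ -n "$bootfs" ] && [ "$bootfs" != "-" ]; then
            printf '%s\n' "$bootfs"
            return 0
        fi
    done
    return 1
}

# Mount dataset $1 on directory $2: try zfsutil first (non-legacy
# mountpoints), then fall back to a plain mount (legacy mountpoints).
mount_zfs_root() {
    mount -t zfs -o zfsutil "$1" "$2" || mount -t zfs "$1" "$2"
}
```

In use, something like zpool list -H -o bootfs | first_bootfs would feed the scan, and a second pass after zpool import -N -a would repeat it with every pool visible.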

Ordinarily pools are imported without the force option which may cause boot to fail if the hostid has changed or a pool has been physically moved between servers. The zfs_force kernel parameter is provided which when set to 1 causes zpool import to be run with the -f flag. Forcing pool import can lead to serious data corruption and loss of pools, so this option should be used with extreme caution. Note that even with this flag set, if the required zpool was auto-imported by the kernel module, no additional zpool import commands are run, so nothing is forced.
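The interaction between zfs_force and auto-imported pools can be sketched as a function that only decides which command would run. The name plan_import is hypothetical, and passing -N on the explicit import is an assumption borrowed from the zpool import -N -a call mentioned earlier.

```shell
#!/bin/sh
# Print the import command needed for pool $1, or nothing if it is
# already imported. $2 = zfs_force value, $3 = space-separated list of
# pools that are currently imported.
plan_import() {
    for p in $3; do
        # Already imported by the kernel module: nothing to run or force.
        [ "$p" = "$1" ] && return 0
    done
    if [ "$2" = "1" ]; then
        printf 'zpool import -N -f %s\n' "$1"
    else
        printf 'zpool import -N %s\n' "$1"
    fi
}
```

Printing the command instead of running it keeps the decision easy to inspect, which matters here because a forced import is the step that can damage a pool shared between hosts.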