Commit Graph

567 Commits

Author SHA1 Message Date
Tom Caputi 4807c0badb Encryption patch follow-up
* PBKDF2 implementation changed to OpenSSL implementation.

* HKDF implementation moved to its own file and tests
  added to ensure correctness.

* Removed libzfs's now unnecessary dependency on libzpool
  and libicp.

* Ztest can now create and test encrypted datasets. This is
  currently disabled until issue #6526 is resolved, but
  otherwise functions as advertised.

* Several small bug fixes discovered after enabling ztest
  to run on encrypted datasets.

* Fixed coverity defects added by the encryption patch.

* Updated man pages for encrypted send / receive behavior.

* Fixed a bug where encrypted datasets could receive
  DRR_WRITE_EMBEDDED records.

* Minor code cleanups / consolidation.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:54:48 -04:00
Alek P 01ff0d7540 Update the default for zfs_txg_history
It's often useful to have access to txg history for debugging
purposes. This patch changes the default from 0 to 100 TXGs
worth of history preserved.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6691
2017-09-29 15:58:52 -07:00
LOLi 90cdf2833d Add mdoc style checker
Add a new make 'mancheck' target which uses mandoc -Tlint to verify
manpage files: currently only zfs(8), zpool(8) zdb(8) and zgenhostid(8)
are supported.

Additionally fix some outstanding manpage formatting issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6646
2017-09-16 10:51:24 -07:00
George Melikov 7c9abcf887 OpenZFS 8435 - zpool.1m and zfs.1m: minor cleanup
3796 listsnapshots not documented in zpool man page

Authored by: George Melikov <mail@gmelikov.ru>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru

OpenZFS-issue: https://www.illumos.org/issues/8435
OpenZFS-commit: openzfs/openzfs@a058d1c

Porting notes: OpenZFS review applied,
some ZoL changes were reverted.
See https://github.com/openzfs/openzfs/pull/415
2017-09-15 13:13:52 -07:00
LOLi 835db58592 Add -vnP support to 'zfs send' for bookmarks
This leverages the functionality introduced in cf7684b to expose
verbose, dry-run and parsable 'zfs send' options for bookmarks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3666 
Closes #6601
2017-09-08 15:24:31 -07:00
Mike Swanson 57858fb5ca Recommend compression=on in zfs(8) dedup section
compression=lz4 depends on the lz4 feature being enabled, while
compression=on will let ZFS use either lzjb or lz4 where appropriate.
It also allows the documentation to not go out of date if/when ZFS
picks a new default in the future.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #6614
2017-09-08 15:21:58 -07:00
bunder2015 2917956841 zfs(8) manpage corrections
Corrected indent of the note located at the bottom of the options for
zfs send as well as remove an extra whitespace

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6590
2017-09-05 13:45:18 -07:00
George Melikov 076e9b946e Remove copyright duplicate in zpool man page
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #6553
2017-08-24 10:36:17 -07:00
Giuseppe Di Natale d7323e79a6 OpenZFS 8547 - update mandoc to 1.14.3
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8547
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c66b804
Closes #6549
2017-08-24 10:30:42 -07:00
Alek P e4b6b2db12 OpenZFS 8414 - Implemented zpool scrub pause/resume
Authored by: Alek Pinchuk <apinchuk@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://www.illumos.org/issues/8414
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c29616076
Closes #6538
2017-08-24 10:27:20 -07:00
Tom Caputi 9b8407638d Send / Recv Fixes following b52563
This patch fixes several issues discovered after
the encryption patch was merged:

* Fixed a bug where encrypted datasets could attempt
  to receive embedded data records.

* Fixed a bug where dirty records created by the recv
  code wasn't properly setting the dr_raw flag.

* Fixed a typo where a dmu_tx_commit() was changed to
  dmu_tx_abort()

* Fixed a few error handling bugs unrelated to the
  encryption patch in dmu_recv_stream()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6512 
Closes #6524 
Closes #6545
2017-08-23 16:54:24 -07:00
Brian Behlendorf c8f9061fc7 Retire legacy test infrastructure
* Removed zpios kmod, utility, headers and man page.

* Removed unused scripts zpios-profile/*, zpios-test/*,
  zpool-config/*, smb.sh, zpios-sanity.sh, zpios-survey.sh,
  zpios.sh, and zpool-create.sh.

* Removed zfs-script-config.sh.in.  When building 'make' generates
  a common.sh with in-tree path information from the common.sh.in
  template.  This file and sourced by the test scripts and used
  for in-tree testing, it is not included in the packages.  When
  building packages 'make install' uses the same template to
  create a new common.sh which is appropriate for the packaging.

* Removed unused functions/variables from scripts/common.sh.in.
  Only minimal path information and configuration environment
  variables remain.

* Removed unused scripts from scripts/ directory.

* Remaining shell scripts in the scripts directory updated to
  cleanly pass shellcheck and added to checked scripts.

* Renamed tests/test-runner/cmd/ to tests/test-runner/bin/ to
  match install location name.

* Removed last traces of the --enable-debug-dmu-tx configure
  options which was retired some time ago.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6509
2017-08-15 17:26:38 -07:00
sckobras d49d9c2bdc vdev_id: implement slot numbering by port id
With HPE hardware and hpsa-driven SAS adapters, only a single phy is
reported, but no individual per-port phys (ie. no phy* entry below
port_dir), which breaks topology detection in the current sas_handler
code. Instead, slot information can be derived directly from the port
number. This change implements a new slot keyword "port" similar to
"id" and "lun", and assumes a default phy/port of 0 if no individual
phy entry can be found. It allows to use the "sas_direct" topology with
current HPE Dxxxx and Apollo 45xx JBODs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes #6484
2017-08-14 15:18:26 -07:00
Don Brady d977122da9 Add corruption failure option to zinject(8)
Added a 'corrupt' error option that will flip a bit in the data
after a read operation.  This is useful for generating checksum
errors at the device layer (in a mirror config for example). It
is also used to validate the diagnosis of checksum errors from
the zfs diagnosis engine.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6345
2017-08-14 15:17:15 -07:00
Tom Caputi b525630342 Native Encryption for ZFS on Linux
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494 
Closes #5769
2017-08-14 10:36:48 -07:00
Fabian-Gruenbichler b58237e769 Man page fixes
* ztest.1 man page: fix typo
* zfs-module-parameters.5 man page: fix grammar

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #6492
2017-08-10 15:45:25 -07:00
Fabian-Gruenbichler 945b7f1c63 spl-module-parameters.5 manpage: fix macro
There is no '.sh' macro in troff/groff/man, only '.SH' for section
headers. I assume .sp for a line break was intended here like
in the rest of the man page.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #643
2017-08-10 15:22:31 -07:00
Fabian-Gruenbichler a02fa347b7 splat.1 manpage: fix spelling of 'hexadecimal'
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #642
2017-08-10 15:21:54 -07:00
bunder2015 0f69f42b43 Correct man page generation
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6409
Closes #6410
2017-07-27 19:06:34 -07:00
Ned Bass 8740cf4a2f Add line info and SET_ERROR() to ZFS debug log
Redefine the SET_ERROR macro in terms of __dprintf() so the error
return codes get logged as both tracepoint events (if tracepoints are
enabled) and as ZFS debug log entries.  This also allows us to use
the same definition of SET_ERROR() in kernel and user space.

Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise
or'd into zfs_flags. Setting this flag enables both dprintf() and
SET_ERROR() messages in the debug log. That is, setting
ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF|ZFS_DEBUG_SET_ERROR are
equivalent (this was done for sake of simplicity). Leaving
ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which
helps avoid cluttering up the logs.

To enable SET_ERROR() logging, run:

  echo 1 >   /sys/module/zfs/parameters/zfs_dbgmsg_enable
  echo 512 > /sys/module/zfs/parameters/zfs_flags

Remove the zfs_set_error_class tracepoints event class since
SET_ERROR() now uses __dprintf(). This sacrifices a bit of
granularity when selecting individual tracepoint events to enable but
it makes the code simpler.

Include file, function, and line number information in debug log
entries.  The information is now added to the message buffer in
__dprintf() and as a result the zfs_dprintf_class tracepoints event
class was changed from a 4 parameter interface to a single parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6400
2017-07-25 23:09:48 -07:00
Brian Behlendorf 9ff13dbe92 Fix zpool-features.5 indentation
The userobj_accounting feature described in the zpool-features.5
man page was incorrectly indented.  Fix it.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6402
2017-07-25 18:57:00 -07:00
Olaf Faaland b9373170e3 Add zgenhostid utility script
Turning the multihost property on requires that a hostid be set to allow
ZFS to determine when a foreign system is attemping to import a pool.
The error message instructing the user to set a hostid refers to
genhostid(1).

Genhostid(1) is not available on SUSE Linux.  This commit adds a script
modeled after genhostid(1) for those users.

Zgenhostid checks for an /etc/hostid file; if it does not exist, it
creates one and stores a value.  If the user has provided a hostid as an
argument, that value is used.  Otherwise, a random hostid is generated
and stored.

This differs from the CENTOS 6/7 versions of genhostid, which overwrite
the /etc/hostid file even though their manpages state otherwise.

A man page for zgenhostid is added. The one for genhostid is in (1), but
I put zgenhostid in (8) because I believe it's more appropriate.

The mmp tests are modified to use zgenhostid to set the hostid instead
of using the spl_hostid module parameter.  zgenhostid will not replace
an existing /etc/hostid file, so new mmp_clear_hostid calls are
required.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6358
Closes #6379
2017-07-25 13:22:03 -04:00
Giuseppe Di Natale d6bcf7ff5e Restrict zpool iostat/status -c to search path
zpool iostat/status -c is supposed to be restricted
by its search path, but currently isn't. To prevent
arbitrary scripts from being executed, disallow '/'
from commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6353 
Closes #6359
2017-07-24 11:53:59 -07:00
Ned Bass 7a8ed6b8b7 Minor fixes in zpool iostat -c documentation (#6370)
- Use nested [] notation to denote optional script list elements
- Fix space before comma after smarctl(8)
- Fix typo and formatting error in reference to -v option
- Fix spelling errors

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6370
2017-07-20 17:04:35 -07:00
Olaf Faaland 379ca9cf2b Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP.  When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp.  Property defaults to off.

During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock.  Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".

Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval.  The period is specified in milliseconds.  The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.

Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path.  Abbreviated
output below.

$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg   timestamp  mmp_delay   vdev_guid   vdev_label vdev_path
20468    261337  250274925   68396651780       3    /dev/sda
20468    261339  252023374   6267402363293     1    /dev/sdc
20468    261340  252000858   6698080955233     1    /dev/sdx
20468    261341  251980635   783892869810      2    /dev/sdy
20468    261342  253385953   8923255792467     3    /dev/sdd
20468    261344  253336622   042125143176      0    /dev/sdab
20468    261345  253310522   1200778101278     2    /dev/sde
20468    261346  253286429   0950576198362     2    /dev/sdt
20468    261347  253261545   96209817917       3    /dev/sds
20468    261349  253238188   8555725937673     3    /dev/sdb

Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.

When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test.  For example, the
pool is exported to run zdb and then imported again.  Add a new ztest
function, "-M", to alter ztest behavior to prevent this.

Add new tests to verify the new functionality.  Tests provided by
Giuseppe Di Natale.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-13 13:54:00 -04:00
LOLi cf8738d853 Add port of FreeBSD 'volmode' property
The volmode property may be set to control the visibility of ZVOL
block devices.

This allow switching ZVOL between three modes:
   full - existing fully functional behaviour (default)
   dev  - hide partitions on ZVOL block devices
   none - not exposing volumes outside ZFS

Additionally the new zvol_volmode module parameter can be used to
control the default behaviour.

This functionality can be used, for instance, on "backup" pools to
avoid cluttering /dev with unneeded zd* devices.

Original-patch-by: mav <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb
Closes #1796 
Closes #3438 
Closes #6233
2017-07-12 13:05:37 -07:00
Matthew Ahrens a896468c78 OpenZFS 8067 - zdb should be able to dump literal embedded block pointer
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8067
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085
Closes #6319
2017-07-07 11:28:01 -07:00
Alek P 0ea05c64f8 Implemented zpool scrub pause/resume
Currently, there is no way to pause a scrub. Pausing may
be useful when the pool is busy with other I/O to preserve
bandwidth.

This patch adds the ability to pause and resume scrubbing.
This is achieved by maintaining a persistent on-disk scrub state.
While the state is 'paused' we do not scrub any more blocks.
We do however perform regular scan housekeeping such as
freeing async destroyed and deadlist blocks while paused.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6167
2017-07-06 22:16:13 -07:00
Brian Behlendorf 44f09cdc59 Convert man zfs.8 to mdoc (OpenZFS sync)
* Fixed some typos
* Additional description for some commands arguments
* Text reworked to be in sync with OpenZFS
* Added Linux as .Os type

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6282
2017-06-29 09:55:30 -07:00
George Melikov cda0317e4d Convert man zpool.8 to mdoc (OpenZFS sync)
* Fixed some typos
* Additional description for some commands arguments
* `listsnapshots` remained
* Text reworked to be in sync with OpenZFS
* Added Linux as .Os type
* Updated `zpool events` section.
* Updated `zpool iostat|status -c` sections
* Added zed(8) reference to SEE ALSO

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #6245
2017-06-28 09:26:46 -07:00
Don Brady 0241e491a0 Inject zinject(8) a percentage amount of dev errs
In the original form of device error injection, it was an all or nothing
situation.  To help simulate intermittent error conditions, you can now
specify a real number percentage value. This is also very useful for our
ZFS fault diagnosis testing and for injecting intermittent errors during
load testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6227
2017-06-16 17:21:11 -07:00
chrisrd 627791f3c0 Fix manual description of zfs_arc_dnode_limit
In arc_evict_state() we start pruning when arc_dnode_size >
arc_dnode_limit, i.e. arc_dnode_limit is a ceiling rather than a
floor.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6228
2017-06-14 13:23:02 -07:00
Giuseppe Di Natale 1b7c1e5ce9 OpenZFS 7578 - Fix/improve some aspects of ZIL writing
- After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

 - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG.  Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

 - Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size.  In some
cases efficiency stopped to almost as low as 50%.  In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas.  This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently.  Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

 - While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by:   iXsystems, Inc.

Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
Closes #6191
2017-06-09 09:15:37 -07:00
Giuseppe Di Natale 099700d9df zpool iostat/status -c improvements
Users can now provide their own scripts to be run
with 'zpool iostat/status -c'. User scripts should be
placed in ~/.zpool.d to be included in zpool's
default search path.

Provide a script which can be used with
'zpool iostat|status -c' that will return the type of
device (hdd, sdd, file).

Provide a script to get various values from smartctl
when using 'zpool iostat/status -c'.

Allow users to define the ZPOOL_SCRIPTS_PATH
environment variable which can be used to override
the default 'zpool iostat/status -c' search path.

Allow the ZPOOL_SCRIPTS_ENABLED environment
variable to enable or disable 'zpool status/iostat -c'
functionality.

Use the new smart script to provide the serial command.

Install /etc/sudoers.d/zfs file which contains the sudoer
rule for smartctl as a sample.

Allow 'zpool iostat/status -c' tests to run in tree.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6121 
Closes #6153
2017-06-05 10:52:15 -07:00
Alek P bec1067d54 Implemented zpool sync command
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)' but 'zpool sync'
will return when data has hit the main storage instead of potentially
just the ZIL as is the case with the 'sync(2)' cmd.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6122
2017-05-19 12:33:11 -07:00
Tony Hutter 4a283c7f77 Force fault a vdev with 'zpool offline -f'
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline.  Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED.  The -f faults
persist across imports, unless they were set with the temporary
(-t) flag.  Both persistent and temporary faults can be cleared
with zpool clear.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6094
2017-05-19 12:30:16 -07:00
LOLi a3eeab2de6 Add property overriding (-o|-x) to 'zfs receive'
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.

This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.

References:
   https://www.illumos.org/issues/2745
   https://www.illumos.org/issues/3753

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1350 
Closes #5349
2017-05-09 16:21:09 -07:00
Christian Schwarz 305bc4b370 Make createtxg and guid properties public
Document the existence of `createtxg` and `guid` native properties
in man pages and zfs command output.

One of the great features of ZFS is incremental replication of
snapshots, possibly between pools on different machines.

Shell scripts are commonly used to auomate this procedure. They have to
find the most recent common snapshot between both sides and then
perform incremental send & recv.
Currently, scripts rely on the sorting order of `zfs list`, which
defaults to `createtxg`, and the assumption that snapshot names on
either side do not change.

By making `createtxg` and `guid` part of the public ZFS interface,
scripts are enabled to use

  a) `createtxg` to determine the logical & temporal order of snapshots
     (the creation property is not an equivalent substitute since
      multiple snapshots may be created within one second)
  b) `guid` to uniquely identify a snapshot, independent of its current
      display name

This has the potential of making scripts safer and correct.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #6102
2017-05-09 15:36:53 -07:00
Brian Behlendorf 8fa5250f5d Default to zvol_request_async=0
Change the default ZVOL behavior so requests are handled asynchronously.
This behavior is functionally the same as in the zfs-0.6.4 release.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5902
2017-05-04 18:01:50 -04:00
LOLi dddef7d600 More ashift improvements
This commit allow higher ashift values (up to 16) in 'zpool create'

The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability the safely roll back a pool to recover it.

Since b02fe35 the largest uberblock size we support is 8K: this
allow us to store a minimum number of 16 uberblocks in the vdev
label, even with higher ashift values.

Additionally change 'ashift' pool property behaviour: if set it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.

Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2024 
Closes #4205 
Closes #4740 
Closes #5763
2017-05-03 09:31:05 -07:00
Debabrata Banerjee 03b60eee78 Allow scaling of arc in proportion to pagecache
When multiple filesystems are in use, memory pressure causes arc_cache
to collapse to a minimum. Allow arc_cache to maintain proportional size
even when hit rates are disproportionate. We do this only via evictable
size from the kernel shrinker, thus it's only in effect under memory
pressure.

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #6035
2017-05-02 15:50:49 -04:00
Chunwei Chen 692e55b8fe Reinstate zvol_taskq to fix aio on zvol
Commit 37f9dac removed the zvol_taskq for processing zvol requests.
This was removed as part of switching to make_request_fn and was
motivated by a concern at the time over dispatch latency.

However, this also made all bio request synchronous, and caused
serious performance issues as the bio submitter would wait for
every bio it submitted, effectively making the IO depth 1.

This patch reinstate zvol_taskq, and to make sure overlapped I/Os
are ordered properly, we take range lock in zvol_request, and pass
it along with bio to the I/O functions zvol_{write,discard,read}.

In order to facilitate benchmarks a zvol_request_sync module
option was added to switch between sync and async request handling.
For the moment, the default behavior is synchronous but this is
likely to change pending additional testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5824
2017-04-26 13:54:40 -07:00
Tim Chase e815485fe9 Update documentation for zfs_vdev_queue_depth_pct
It was documented as being related to zfs_vdev_async_max_active
when it is actually related to zfs_vdev_async_write_max_active.
Also, expand the documentation to describe the allocation throttle
which was introduced as part of OpenZFS 7090 in 3dfb57a.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #6064
2017-04-26 13:48:28 -07:00
Dan Kimmel a7004725d0 OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7628 - create long versions of ZFS send / receive options

Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: bunder2015 <omfgbunder@gmail.com>
Ported-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Most of 7252 was already picked up during ABD work.  This
  commit represents the gap from the final commit to openzfs.
- Fixed split_large_blocks check in do_dump()
- An alternate version of the write_compressible() function was
  implemented for Linux which does not depend on fio.  The behavior
  of fio differs significantly based on the exact version.
- mkholes was replaced with truncate for Linux.

OpenZFS-issue: https://www.illumos.org/issues/7252
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294
Closes #6067
2017-04-26 12:31:43 -07:00
bunder2015 603a178479 Fix typo in zfs-module-parameters man page
Fix typo in zfs-module-parameters man page

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Mike McQuaid <mike@mikemcquaid.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Neal Gompa <ngompa13@gmail.com>
Reviewed-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Hajo Möller <dasjoe@gmail.com>
Reviewed-by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: ilovezfs <ilovezfs@icloud.com>
Reviewed-by: @jwittlincohen
Reviewed-by: Jack Draak <jackdraak@gmail.com>
Reviewed-by: @ptx0
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: @Kokokokoka
Reviewed-by: @JCount <JCount@hush.ai>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6054
2017-04-24 10:56:44 -07:00
DeHackEd 321204bec6 Typo in zfs-module-parameters(5)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #6061
2017-04-24 10:34:37 -07:00
Tony Hutter d6418de057 Prebaked scripts for zpool status/iostat -c
This patch updates the "zpool status/iostat -c" commands to only run
"pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever
you install to).  The scripts can only be run from -c as an unprivileged
user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is
set by root).  This was done to encourage scripts to be written is such
a way that normal users can use them, and to be cautious.  If your
script needs to run a privileged command, consider adding the
appropriate line in /etc/sudoers.  See zpool(8) for an example of how
to do this.

The patch also allows the scripts to output custom column names.  If
the script outputs a line like:

name=value

then "name" is used for the column name, and "value" is its value.
Multiple columns can be specified by outputting multiple lines.  Column
names and values can have spaces.  If the value is empty, a dash (-) is
printed instead.

After all the "name=value" lines are read (if any), zpool will take the
next the next line of output (if any) and print it without a column
header.  After that, no more lines will be processed. This can be
useful for printing errors.

Lastly, this patch also disables the -c option with the latency and
request size histograms, since it produced awkward output and made the
code harder to maintain.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5852
2017-04-21 09:27:04 -07:00
LOLi 038091fd4f Documentation fixes for zfs(8) and 'zfs' binary
* bookmarks are not supported when sending all intermediary snaps (-I)
* add missing compressed (-c) option to the 'zfs' help and manpage

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6028
2017-04-20 12:12:50 -07:00
Richard Yao 066753103f OpenZFS 6392 - zdb: introduce -V for verbatim import
Authored by: Richard Yao <ryao@gentoo.org>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
  This was already implemented in ZFS on Linux. This patch
  is to resolved the deltas present in our version.

OpenZFS-issue: https://www.illumos.org/issues/6392
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9bb97de
Closes #6020
2017-04-18 09:50:15 -07:00
DHE 06226b5936 Increase zfs_vdev_async_write_min_active to 2
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #4825
Closes #5926
2017-04-14 14:03:44 -07:00
Debabrata Banerjee 66aca24730 SEEK_HOLE should not block on txg_wait_synced()
Force flushing of txg's can be painfully slow when competing for disk
IO, since this is a process meant to execute asynchronously. Optimize
this path via allowing data/hole seeking if the file is clean, but if
dirty fall back to old logic. This is a compromise to disabling the
feature entirely.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #4306
Closes #5962
2017-04-13 10:51:20 -07:00
Brian Behlendorf a44e7faa6c OpenZFS 6410 - teach zdb to perform object lookups by path
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Replaced zdb.8 with upstream mdoc zdb.1m version.  Updated to
  include Linux specific features: -V verbatium imports and
  improved label printing (-u, and -l).
- Minor changes to `zdb -h` output to honor 80 character limit.

OpenZFS-issue: https://www.illumos.org/issues/6410
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ed61ec1
Closes #6006
2017-04-13 09:40:56 -07:00
Giuseppe Di Natale 42db43e982 OpenZFS 2932 - support crash dumps to raidz, etc. pools
Authored by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2932
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b
Closes #5984
Closes #5216
2017-04-10 10:24:17 -07:00
Tom Matthews d456708525 list -o props should be alloc,free not used,avail
Manpage suggests the zpool list properties include 'used'
and 'available', when these are invalid property names.
Use alloc and free in their place.

```
$ zpool list -o name,size,used   2>&1 |head -1
bad property list: invalid property 'used'
$ zpool list -o name,size,avail   2>&1 |head -1
bad property list: invalid property 'avail'
$ zpool list -o name,size,available   2>&1 |head -1
bad property list: invalid property 'available'
$ zpool list -o name,size,alloc,free
NAME    SIZE  ALLOC   FREE
apool   464M   203M   261M
bpool  3.62T  1.97T  1.65T
```

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Matthews <tom@axiom-partners.com>
Closes #5959
2017-04-04 11:03:33 -07:00
wli5 39ccc90967 Update documentation for new parameter "zfs_qat_disable"
Update documentation in zfs-module-parameters.5 for new
parameter "zfs_qat_disable" which was introduced by #5846.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #5914
2017-03-27 12:33:57 -07:00
DeHackEd f974e41426 zfs(8) fixes
Documentation fixes for zfs(8)

* White space issue in the userused@user property section
* zfs send supports using bookmarks as the origin snapshot

Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #5906
2017-03-20 15:14:28 -07:00
Chunwei Chen 9b77d1c958 Fix nfs snapdir automount
The current implementation for allowing nfs to access snapdir is very buggy.
It uses a special fh for snapdirs, such that the next time nfsd does
fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd
never knows it cross a mountpoint.

The problem is that nfsd will not hold a reference on the vfsmount of the
snapshot. This cause auto unmounter to unmount the snapshot even though nfs is
still holding dentries in it.

To fix this, we return the inode for the snapdirs themselves. However, we also
trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate
and see the mountpoint and do crossmnt.

Because nfsd will now be aware that these are different filesystems users
must add crossmnt to their export options to access snapshot directories.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3794
Closes #4716
Closes #5810 
Closes #5833
2017-03-08 09:26:33 -08:00
bunder2015 db4ed56538 Corrected highlight for zpool man page
SS is already highlighted and the fB/fR tags break the highlighting
prematurely, removing the tags highlights the entire line.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #5873
2017-03-07 13:01:39 -08:00
Olaf Faaland 3c9e0d673e Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped.  If so, do not dump it.

Make a similar check when dumping Uberblocks for zdb -lu.  Check whether
a label already dumped contains an identical Uberblock.  If so, do not
dump the Uberblock.

When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3

Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.

If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.

With additional l's or u's, increase verbosity as follows:

-l      Dump each unique configuration only once.
        Indicate which labels it appears in.
-ll     In addition, dump label space usage stats.
-lll    Dump every configuration, unique or not.

-u      Dump each unique, valid, uberblock only once.
        Indicate which labels it appears in.
-uu     In addition, state which slots are invalid.
-uuu    Dump every uberblock, unique or not.
-uuuu   Dump the uberblock blockpointer (used to be -uuu)

Make exit values conform to the manual page.  Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.

Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.

An example of the output:
	------------------------------------
	LABEL 0
	------------------------------------
	    version: 5000
	    name: 'pool'
	    state: 1
	    txg: 880
	    < ... redacted ... >
	    features_for_read:
		com.delphix:hole_birth
		com.delphix:embedded_data
	    labels = 0
	    Uberblock[0]
		magic = 0000000000bab10c
		version = 5000
		txg = 0
		guid_sum = 3038694082047428541
		timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
		labels = 0 1 2 3
	    Uberblock[4]
		magic = 0000000000bab10c
		version = 5000
		txg = 772
		guid_sum = 9045970794941528051
		timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
		labels = 0
	    < ... redacted ... >
	------------------------------------
	LABEL 1
	------------------------------------
	    version: 5000
	    name: 'pool'
	    state: 1
	    txg: 14
	    < ... redacted ... >
		com.delphix:embedded_data
	    labels = 1 2 3
	    Uberblock[4]
		magic = 0000000000bab10c
		version = 5000
		txg = 4
		guid_sum = 7793930272573252584
		timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
		labels = 1 2 3
	    < ... redacted ... >

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-06 16:01:45 -08:00
Matthew Ahrens c30e58c462 zfs_arc_num_sublists_per_state should be common to all multilists
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.  This tuning may be overridden
on a per-multilist basis.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5764
2017-02-15 15:49:33 -08:00
Tony Hutter b291029e86 Enclosure LED fixes
- Pass $VDEV_ENC_SYSFS_PATH to 'zpool [iostat|status] -c' to include
  enclosure LED sysfs path.

- Set LEDs correctly after import.  This includes clearing any erroniously
  set LEDs prior to the import, and setting the LED for any UNAVAIL drives.

- Include symlink for vdev_attach-led.sh in Makefile.am.

- Print the VDEV path in all-syslog.sh, and fix it so the pool GUID actually
  prints.    

Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5716 
Closes #5751
2017-02-10 16:09:45 -08:00
David Quigley bef78122e6 Add missing module_param for zfs_per_txg_dirty_frees_percent
When the code was added this tunable was not exposed via module params. Also it
was not documented. This patch changes the type from a uint32 to a ulong as
done with other percentage tunables and also documents it in the
zfs-module-parameters man page.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #5750
2017-02-07 09:44:03 -08:00
Don Brady 35a357a9ef OpenZFS 6866 - zdb -l non-zero status if no label
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Don Brady <don.brady@intel.com>

OpenZFS-issue: https://www.illumos.org/issues/6866
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3e4fae5
Closes #5730 

Porting Notes:
- Omitted the illumos-specific `/dev/dsk` and `/dev/rdsk`
path conversions since they don't apply on linux.
2017-02-03 14:18:28 -08:00
Tim Chase b81a3ddc32 Update deadman operation to better align with upstream OpenZFS
The deadman in ZoL didn't behave quite as it did in upstream
OpenZFS.  In addition to the 2 purposes for which OpenZFS used the
zfs_deadman_synctime_ms parameter, ZoL also used it to determine how
frequently the deadman would fire once it has been triggered.

This patch adds the zfs_deadman_checktime_ms parameter to control how
frequently the subsequent checks are performed.

The deadman is now disabled for suspended pools.

As had been the case, unlike upstream OpenZFS, ZoL will not panic when
a hung IO is detected.

The module parameter documentation has been upated to include the new
parameter and to better describe the operation of the deadmen.
    
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5695
2017-01-31 14:19:08 -08:00
George Melikov ed828c0c37 OpenZFS 7280 - Allow changing global libzpool variables in zdb and ztest through command line
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e60744
Closes #5676
2017-01-31 10:13:10 -08:00
George Melikov fa603f8233 OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's
Porting notes:
- 'zfs_dbgmsg_print()' reintroduced to userspace.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7277
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2f
Closes #5684
2017-01-28 12:16:43 -08:00
George Melikov aeacdefedc OpenZFS 7386 - zfs get does not work properly with bookmarks
Authored by: Marcel Telka <marcel@telka.sk>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7386
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edb901a
Closes #5666
2017-01-26 14:42:15 -08:00
Håkan Johansson 9ef3906a5a Minor man-page formatting fixes
fB -> \fB in zpool.8 (Properties -> cachefile)
\fN -> \fB in zfs-module-parameters.5 (zfs_dirty_data_max_max_percent)
Three | -> \fR|\fI fixes for arguments of diff and inherit in zfs.8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5645
2017-01-24 09:09:02 -08:00
George Melikov a0aacd3741 OpenZFS 7257 - zfs manpage user property length needs to be updated
Since zpool version 16, this limit is actually 8192 characters.
Additionally, this limit is actually 8192 bytes, as it supports UTF-8.

Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7257
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3bc7169
Closes #5608
2017-01-17 15:30:01 -08:00
George Melikov fdbaf44ffb OpenZFS 7276 - zfs(1m) manpage could better describe space properties
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7276
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d750135
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29c6739
Closes #5549
2017-01-13 13:31:29 -08:00
Don Brady 4e21fd060a OpenZFS 7303 - dynamic metaslab selection
This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).

Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>

OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd
Closes #5404
2017-01-12 11:52:56 -08:00
ka7 4e33ba4c38 Fix spelling
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5547 
Closes #5543
2017-01-03 11:31:18 -06:00
bunder2015 53ed2db212 Remove extra + from zfs man page
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #5508
2016-12-21 11:06:02 -08:00
Brian Behlendorf 27f2b90d3e Revert "Disable zio_dva_throttle_enabled by default"
Enable zio_dva_throttle_enabled=1 by default. Subsequent
testing has been unable to reproduce the suspected regression.

Tested-by: kernelOfTruth kerneloftruth@gmail.com
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov
Reverts #5335
Closes #5289
Closes #5457
2016-12-08 13:57:42 -07:00
Chunwei Chen 493492559e Limit number of tasks shown in taskq proc
To prevent holding tq_lock for too long.

Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq
would easily cause a lockup. While that bug has been fixed. It's probably
still a good idea to do this just in case task lists grow too large.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #586
2016-12-01 11:06:27 -07:00
Tony Hutter 8720e9e748 Add -c to zpool iostat & status to run command
This patch adds a command (-c) option to zpool status and zpool iostat.  The
-c option allows you to run an arbitrary command on each vdev and display
the first line of output in zpool status/iostat.  The environment vars
VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path"
before running the command.  For device mapper, multipath, or partitioned
vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk.  This can be useful
if the command you're running requires a /dev/sd* device.

The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device
instead of using libdevmapper.  This not only removes the libdevmapper
requirement at build time, but also allows you to resolve device mapper
devices without being root.  This means that UDEV_UPATH get set correctly
when running zpool status/iostat as an unprivileged user.

Example:

$ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH'

NAME        STATE     READ WRITE CKSUM
mypool      ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    mpatha  ONLINE       0     0     0  I am /dev/mapper/mpatha, /dev/sdc
    sdb     ONLINE       0     0     0  I am /dev/sdb1, /dev/sdb

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5368
2016-11-29 14:45:38 -07:00
LOLi 2f71caf2d9 Allow zfs unshare <protocol> -a
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.

Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.

Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3238 
Closes #5367
2016-11-29 12:22:38 -07:00
DeHackEd 6146e17ebe Fix man page formatting in zfs-module-parameters
Bold and Normal codes were mixed up in a few places resulting in
bad highlighting.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #5397
2016-11-14 17:03:57 -08:00
Håkan Johansson 1c91161258 Repair indent of zpool.8 man page
Repair indent of zpool.8 man page, just before zpool labelclear
details.  Accidentally introduced by 193a37cb2 (git bisect).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5394
2016-11-14 09:47:49 -08:00
Romain Dolbeau 7f547f85fe Add parity generation/rebuild using AVX-512 for x86-64
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.

avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes #5219
2016-11-02 12:40:23 -07:00
Brian Behlendorf 76a87a902e Disable zio_dva_throttle_enabled by default
Until it can be determined definitively that a performance
regression wasn't introduced accidentally by 3dfb57a this
functionality is being disabled by default.  It can be re-
enabled by setting zio_dva_throttle_enabled=1.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5335 
Issue #5289
2016-10-26 09:13:43 -07:00
LOLi e4010f2719 Allow for '-o feature@<feature>=disabled' on the command line
Sometimes it is desirable to specifically disable one or several
features directly on the 'zpool create' command line.

$ zpool create -o feature@<feature>=disabled ...

Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3460 
Closes #5142 
Closes #5324
2016-10-25 16:17:47 -07:00
Romain Dolbeau 24cdeaf12e Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bits
This is not useful on micro-architecture with a weak NEON
implementation (only 64 bits); the native version is slower &
the byteswap barely faster than scalar.  On A53 or A57, it's
a small improvement on scalar but OK for byteswap.

Results from an A53 system:
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation   native         byteswap       
scalar           1008227510     755880264      
aarch64_neon     1198098720     1044818671     
fastest          aarch64_neon   aarch64_neon 

Results from a A57 system:
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon 

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #5248
2016-10-21 10:55:49 -07:00
Tony Hutter 6078881aa1 Multipath autoreplace, control enclosure LEDs, event rate limiting
1. Enable multipath autoreplace support for FMA.

This extends FMA autoreplace to work with multipath disks.  This
requires libdevmapper to be installed at build time.

2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online

Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED.  Your enclosure must
be supported by the Linux SES driver for this to work.  The enclosure LED
scripts work for multipath devices as well.  The scripts will clear the LED
when the fault is cleared.

3. Rate limit ZIO delay and checksum events so as not to flood ZED

ZIO delay and checksum events are rate limited to 5/sec in the zfs module.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2449 
Closes #3017 
Closes #5159
2016-10-19 12:55:59 -07:00
Don Brady 3dfb57a35e OpenZFS 7090 - zfs should throttle allocations
OpenZFS 7090 - zfs should throttle allocations

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>

When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.

In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7
Closes #5258 

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
2016-10-13 17:59:18 -07:00
Brian Behlendorf 7515f8f63d Fix file permissions
The following new test cases need to have execute permissions set:

  userquota/groupspace_003_pos.ksh
  userquota/userquota_013_pos.ksh
  userquota/userspace_003_pos.ksh
  upgrade/upgrade_userobj_001_pos.ksh
  upgrade/setup.ksh
  upgrade/cleanup.ksh

The following source files accidentally were marked executable:

  lib/libzpool/kernel.c
  lib/libshare/nfs.c
  lib/libzfs/libzfs_dataset.c
  lib/libzfs/libzfs_util.c
  tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c
  tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c
  cmd/zed/zed_exec.c
  module/icp/core/kcf_sched.c
  module/zfs/dsl_pool.c
  module/zfs/arc.c
  module/nvpair/nvpair.c
  man/man5/zfs-module-parameters.5

Reviewed-by: GeLiXin <ge.lixin@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5241
2016-10-08 14:57:56 -07:00
Jinshan Xiong 1de321e626 Add support for user/group dnode accounting & quota
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
2016-10-07 09:45:13 -07:00
ilovezfs 125a406e24 OpenZFS 6585 - sha512, skein, and edonr have an unenforced dependency on extensible dataset
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>

In any pool without the extensible dataset feature flag already enabled,
creating a dataset with dedup set to use one of the new checksums would
result in the following panic as soon as any data was added:

panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
&refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c
line 390

Inpsection showed that feature->fi_feature was 7, which is the value of
SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum.  This commit
adds extensible dataset as a dependency for the sha512, edonr, and skein
feature flags, which prevents the panic.

OpenZFS-issue: https://www.illumos.org/issues/6585
OpenZFS-commit: 892586e8a1
Porting Notes:
This code was originally from Illumos, but I actually ported it from:
openzfsonosx/zfs@b62a652
2016-10-03 14:51:21 -07:00
Tony Hutter 3c67d83a8a OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:

    b5e030c8db

The list of porting changes includes:

- Copied module/icp/include/sha2/sha2.h directly from illumos

- Removed from module/icp/algs/sha2/sha2.c:
	#pragma inline(SHA256Init, SHA384Init, SHA512Init)

- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
  it now takes in an extra parameter.

- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c

- Added skein & edonr to libicp/Makefile.am

- Added sha512.S.  It was generated from sha512-x86_64.pl in Illumos.

- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.

- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
  to not #include the non-existant endian.h.

- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
  around a compiler warning.

- Fixup test files:
	- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
	- Remove <note.h> and define NOTE() as NOP.
	- Define u_longlong_t
	- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
	- Rename NULL to 0 in "no test vector" array entries to get around a
	  compiler warning.
	- Remove "for isa in $($ISAINFO); do" stuff
	- Add/update Makefiles
	- Add some userspace headers like stdio.h/stdlib.h in places of
	  sys/types.h.

- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.

- Update scripts/zfs2zol-patch.sed

- include <sys/sha2.h> in sha2_impl.h

- Add sha2.h to include/sys/Makefile.am

- Add skein and edonr dirs to icp Makefile

- Add new checksums to zpool_get.cfg

- Move checksum switch block from zfs_secpolicy_setprop() to
  zfs_check_settable()

- Fix -Wuninitialized error in edonr_byteorder.h on PPC

- Fix stack frame size errors on ARM32
  	- Don't unroll loops in Skein on 32-bit to save stack space
  	- Add memory barriers in sha2.c on 32-bit to save stack space

- Add filetest_001_pos.ksh checksum sanity test

- Add option to write psudorandom data in file_write utility
2016-10-03 14:51:15 -07:00
Romain Dolbeau 62a65a654e Add parity generation/rebuild using 128-bits NEON for Aarch64
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.

The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #4801
2016-10-03 09:44:00 -07:00
Brian Behlendorf 9ea9e0b9a1 Enable ignore_hole_birth module option by default
Enable ignore_hole_birth by default until all known hole birth bugs
have been resolved and relevant test cases added.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4809
Closes #5099
2016-09-16 14:05:30 -07:00
Dan Kimmel 2aa34383b9 DLPX-40252 integrate EP-476 compressed zfs send/receive
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:58:58 -07:00
George Wilson d3c2ae1c08 OpenZFS 6950 - ARC should cache compressed data
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.

I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.

OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
2016-09-13 09:58:33 -07:00
GeLiXin 9907cc1cc8 Add zfs_arc_meta_limit_percent tunable
ARC will evict meta buffers that exceed the arc_meta_limit. Before a further
investigating on whether we should take special protection on meta buffers,
this tunable make arc_meta_limit adjustable for different workloads.

People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko,
so some range check is added to guarantee a suitable arc_meta_limit.

Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style
tunable as well.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4957
2016-08-23 13:03:01 -07:00
Gvozden Neskovic 70b258fc96 Fletcher4 implementation using avx512f instruction set
Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per
loop iteration. Size alignment of main fletcher4 methods is adjusted
accordingly. New implementation is called 'avx512f'.

Note: byteswap method can be implemented more efficiently when avx512bw hardware
becomes available. Currently, it is ~ 2x slower than native method.

Table shows result of full (native) fletcher4 calculation for different buffer size:

fletcher4   4KB     16KB    64KB    128KB   256KB   1MB     16MB
--------------------------------------------------------------------
[scalar]    1213    1228    1231    1231    1225    1200    1160
[sse2]      2374    2442    2459    2456    2462    2250    2220
[avx2]      4288    4753    4871    4893    4900    4050    3882
[avx512f]   5975    8445    9196    9221    9262    6307    5620

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
2016-08-16 14:11:14 -07:00
Rich Ercolani 6d836e6f8b Add tunable to ignore hole_birth
Adds a module option which disables the hole_birth optimization
which has been responsible for several recent bugs, including
issue #4050.

Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4833
2016-08-15 09:52:56 -07:00
Tim Chase 25458cbef9 Limit the amount of dnode metadata in the ARC
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_mata_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made.  The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
2016-07-25 15:26:38 -07:00
Gvozden Neskovic c9187d867f Fixes and enhancements of SIMD raidz parity
- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
2016-07-19 16:43:07 -07:00
Tyler J. Stachecki 35a76a0366 Implementation of SSE optimized Fletcher-4
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
the fastest implementation available on the host system.

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789
2016-07-15 10:42:35 -07:00
Gvozden Neskovic ae25d22235 Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.
The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783
2016-07-13 10:24:55 -07:00
Paul Dagnelie e6d3a843d6 OpenZFS 6393 - zfs receive a full send as a clone
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e
2016-06-28 13:47:03 -07:00
Matthew Ahrens 47dfff3b86 OpenZFS 2605, 6980, 6902
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
2016-06-28 13:47:02 -07:00
Ned Bass 50c957f702 Implement large_dnode pool feature
Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by an large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-06-24 13:13:21 -07:00
Gvozden Neskovic ab9f4b0b82 SIMD implementation of vdev_raidz generate and reconstruct routines
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept first 3 options, and
  the other implementations can be set once module is finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328
2016-06-21 09:27:26 -07:00
Brian Behlendorf f74b821a66 Add `zfs allow` and `zfs unallow` support
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487
2016-06-07 09:16:52 -07:00
Jinshan Xiong 1eeb4562a7 Implementation of AVX2 optimized Fletcher-4
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle trough all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330
2016-06-02 14:30:51 -07:00
Tony Hutter 26ef0cc7db OpenZFS 6531 - Provide mechanism to artificially limit disk performance
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
2016-05-26 10:11:51 -07:00
Tony Hutter 7e945072d1 Add request size histograms (-r) to zpool iostat, minor man page fix
Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659
2016-05-25 15:49:35 -07:00
DHE 8342673502 Improve zfs-module-parameters(5)
Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes descriptions them more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4671
2016-05-23 11:08:45 -07:00
Christer Ekholm 3491d6eb06 Consistently use parsable instead of parseable
This is a purely cosmetical change, to consistently prefer one of
two (both acceptable) choises for the word parsable in documentation and
code. I don't really care which to use, but acording to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4682
2016-05-23 10:20:42 -07:00
Brian Behlendorf 81b4c075ec Remove additional cruft from manpages
These changes should have been part of the original 930b0d4
commit but were overlooked because 193a37c had not yet been
merged when the original change was ported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4631
2016-05-17 15:36:46 -07:00
Richard Laager 61a3d06f84 zfs.8: Relative paths must start with ./
Simply containing a slash is not enough, presumably because foo/bar
could be either a dataset or a mountpoint.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #4655
2016-05-16 14:19:57 -07:00
Matthew Ahrens c5ee751394 Illumos 1644 add ZFS "clones" property
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
 https://www.illumos.org/issues/1644

Ported-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:32 -07:00
Richard Laager 930b0d4c77 Illumos 1502 Remove conversion cruft from manpages
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>

References:
 https://www.illumos.org/issues/1502

Ported-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>

Conflicts:
	man/man8/zpool.8
2016-05-16 12:26:31 -07:00
Ruben Kerkhof 32a6c3d756 zfs.8 & mount.zfs.8: fix a few typos
filesytem -> filesystem
defntext -> defcontext

Signed-off-by: Ruben Kerkhof <ruben@rubenkerkhof.com>
2016-05-16 12:26:31 -07:00
Richard Laager d919da83fa zfs.8 & zpool.8: Standardize property value order
The default value is now always listed first.

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:31 -07:00
Richard Laager 8fd888baa7 zfs.8 & zpool.8: Various documentation edits
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:31 -07:00
Richard Laager 8e07f9a916 zfs.8: Improve zfs upgrade documentation
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:31 -07:00
Richard Laager 9ef21991f3 zfs.8: Cleanup stray code
Bad copy-and-paste?

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:31 -07:00
Richard Laager 8c5edae993 zfs.8 & zpool.8: Drop legal/illegal
There's a convention in documentation that these words not be used to
mean "invalid".

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:31 -07:00
Richard Laager 9bb3e153c4 zfs.8: Fix minor typos and the like
This commit only contains the most trivial of changes.

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:30 -07:00
Richard Laager 879dbef094 zfs.8: Rework native vs user properties
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:30 -07:00
Richard Laager 6a107f4199 zfs.8 & zpool.8: Linux/Solaris differences
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:30 -07:00
Richard Laager a5eb2d8746 zfs.8: Improve mount option documentation
This change is primarily about adding inline references in the
properties section to the traditional mount option names.

There are some other editorial changes too.

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:30 -07:00
Richard Laager 7e0754c675 zfs.8: Improve consistency in size documentation
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:30 -07:00
Richard Laager cab1aa295e zfs.8: Drop references to Oracle documentation
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:30 -07:00
Richard Laager 76281da4eb zfs.8: zfs get and zfs list accept mountpoints
Signed-off-by: Richard Laager <rlaager@wiktel.com>
2016-05-16 12:26:29 -07:00
Tony Hutter 193a37cb24 Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues
Update the zfs module to collect statistics on average latencies, queue sizes,
and keep an internal histogram of all IO latencies.  Along with this, update
"zpool iostat" with some new options to print out the stats:

-l: Include average IO latencies stats:

 total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
 read  write   read  write   read  write   read  write   wait
-----  -----  -----  -----  -----  -----  -----  -----  -----
    -   41ms      -    2ms      -   46ms      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -      -      -      -      -      -      -      -      -
    -   49ms      -    2ms      -   47ms      -      -      -
    -      -      -      -      -      -      -      -      -
    -    2ms      -    1ms      -      -      -    1ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  2ms    1ms    2ms  412us   26us   25us      -    5ms      -
    -    1ms      -  413us      -   25us      -    5ms      -
    -    1ms      -  460us      -   29us      -    5ms      -
196us    1ms  196us  370us    7us   23us      -    5ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----

-w: Print out latency histograms:

sdb           total           disk         sync_queue      async_queue
latency    read   write    read   write    read   write    read   write   scrub
-------  ------  ------  ------  ------  ------  ------  ------  ------  ------
1ns           0       0       0       0       0       0       0       0       0
...
33us          0       0       0       0       0       0       0       0       0
66us          0       0     107    2486       2     788      12      12       0
131us         2     797     359    4499      10     558     184     184       6
262us        22     801     264    1563      10     286     287     287      24
524us        87     575      71   52086      15    1063     136     136      92
1ms         152    1190       5   41292       4    1693     252     252     141
2ms         245    2018       0   50007       0    2322     371     371     220
4ms         189    7455      22  162957       0    3912    6726    6726     199
8ms         108    9461       0  102320       0    5775    2526    2526      86
17ms         23   11287       0   37142       0    8043    1813    1813      19
34ms          0   14725       0   24015       0   11732    3071    3071       0
67ms          0   23597       0    7914       0   18113    5025    5025       0
134ms         0   33798       0     254       0   25755    7326    7326       0
268ms         0   51780       0      12       0   41593   10002   10002       0
537ms         0   77808       0       0       0   64255   13120   13120       0
1s            0  105281       0       0       0   83805   20841   20841       0
2s            0   88248       0       0       0   73772   14006   14006       0
4s            0   47266       0       0       0   29783   17176   17176       0
9s            0   10460       0       0       0    4130    6295    6295       0
17s           0       0       0       0       0       0       0       0       0
34s           0       0       0       0       0       0       0       0       0
69s           0       0       0       0       0       0       0       0       0
137s          0       0       0       0       0       0       0       0       0
-------------------------------------------------------------------------------

-h: Help

-H: Scripted mode. Do not display headers, and separate fields by a single
    tab instead of arbitrary space.

-q: Include current number of entries in sync & async read/write queues,
    and scrub queue:

 syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
 pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0    227    394      0     19      0      0      0      0
    0      0    227    394      0     19      0      0      0      0
    0      0    108     98      0     19      0      0      0      0
    0      0     19     98      0      0      0      0      0      0
    0      0     78     98      0      0      0      0      0      0
    0      0     19     88      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----

-p: Display numbers in parseable (exact) values.

Also, update iostat syntax to allow the user to specify specific vdevs
to show statistics for.  The three options for choosing pools/vdevs are:

Display a list of pools:
    zpool iostat ... [pool ...]

Display a list of vdevs from a specific pool:
    zpool iostat ... [pool vdev ...]

Display a list of vdevs from any pools:
    zpool iostat ... [vdev ...]

Lastly, allow zpool command "interval" value to be floating point:
    zpool iostat -v 0.5

Signed-off-by: Tony Hutter <hutter2@llnl.gov
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4433
2016-05-12 12:36:32 -07:00
Adam Stevko 2a8b84b747 OpenZFS 3993, 4700
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options

Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352

Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses.  The functionality is the same.
2016-05-11 11:49:37 -07:00
Don Brady 39fc0cb557 Add support for devid and phys_path keys in vdev disk labels
This is foundational work for ZED.

Updates a leaf vdev's persistent device strings on Linux platform

* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES

Some examples:

    path: '/dev/sdb1'
    devid: 'scsi-350000394a8ca4fbc-part1'
    phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'

    path: '/dev/mapper/mpatha'
    devid: 'dm-uuid-mpath-35000c5006304de3f'

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2856
Closes #3978
Closes #4416
2016-03-31 13:45:53 -07:00
Richard Laager 1c0120832c Correct typo in spa_load_verify_metadata docs
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4471
2016-03-29 18:33:16 -07:00
Brian Behlendorf 7d11e37e55 Require libblkid
Historically libblkid support was detected as part of configure
and optionally enabled.  This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.

This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools.  For distributions
which include a modern version of libblkid there is no change in
behavior.  Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.

Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2448
2016-03-09 10:39:22 -08:00
smh 9f500936c8 FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay

References:
  https://github.com/freebsd/freebsd@5c7a6f5d
  https://github.com/freebsd/freebsd@31b7f68d
  https://github.com/freebsd/freebsd@e186f564

Performance Testing:
  https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141

Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
  collegue Andriy Gapon has been included. It applied perfectly, but
  added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4334
2016-02-26 11:24:35 -08:00
Brian Behlendorf a77f29f93c Change full path subcommand flag from -p to -P
Commit d2f3e29 introduced the -p option which outputs full paths
for vdevs to multiple zpool subcommands.  When this was merged
there was no conflict for this flag letter.  However it's certain
there will be a conflict with the -p (parsable) flag used by other
subcommands.  Therefore, -p is being changed to -P to avoid this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4368
2016-02-26 09:06:26 -08:00
Richard Yao d2f3e292dc Add -gLp to zpool subcommands for alt vdev names
The following options have been added to the zpool add, iostat,
list, status, and split subcommands.  The default behavior was
not modified, from zfs(8).

  -g    Display vdev GUIDs  instead  of  the  normal  short
        device  names.  These GUIDs can be used in-place of
        device   names   for    the    zpool    detach/off‐
        line/remove/replace commands.

  -L    Display real paths for vdevs resolving all symbolic
        links. This can be used to lookup the current block
        device  name regardless of the /dev/disk/ path used
        to open it.

  -p    Display  full  paths  for vdevs instead of only the
        last component of the path.  This can  be  used  in
        conjunction with the -L flag.

This behavior may also be enabled using the following environment
variables.

  ZPOOL_VDEV_NAME_GUID
  ZPOOL_VDEV_NAME_FOLLOW_LINKS
  ZPOOL_VDEV_NAME_PATH

This change is based on worked originally started by Richard Yao
to add a -g option.  Then extended by @ilovezfs to add a -L option
for openzfsonosx.  Those changes have been merged, re-factored,
a -p option added and extended to all relevant zpool subcommands.

Original-patch-by: Richard Yao <ryao@gentoo.org>
Extended-by: ilovezfs <ilovezfs@icloud.com>
Extended-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2011
Closes #4341
2016-02-25 11:58:39 -08:00
Brian Behlendorf 8a09d5fd46 Add l2arc_max_block_size tunable
Set a limit for the largest compressed block which can be written
to an L2ARC device.  By default this limit is set to 16M so there
is no change in behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4323
2016-02-25 09:44:00 -08:00
Chunwei Chen 8f3b403a73 Allow kicking a taskq to spawn more threads
This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529
2016-02-05 14:08:31 -08:00
kernelOfTruth a966c5640e Reintroduce zfs_remove() synchronous deletes
Reintroduce a slightly adapted version of the Illumos logic for
synchronous unlinks.  The basic idea here is that only files
smaller than zfs_delete_blocks (20480) blocks should be deleted
synchronously.  Unlinking larger files should be handled
asynchronously to minimize impact to the caller.

To accomplish this iput() which is responsible for calling
zfs_znode_delete() on Linux is only called in the delete_now
path.  Otherwise zfs_async_iput() is used which allows the
last reference to be dropped by a taskq thread effectively
making the removal asynchronous.

Porting notes:
- Add zfs_delete_blocks module option for performance analysis.
  The default value is DMU_MAX_DELETEBLKCNT which is the same
  as upstream.  Reducing this value means that smaller files
  will be unlinked asynchronously like large files.
- All occurrences of zfsvfs changes to zsb.

Ported-by: KernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-26 15:26:02 -08:00
George Wilson ba5ad9a48d Illumos 6251 - add tunable to disable free_bpobj processing
6251 - add tunable to disable free_bpobj processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/6251
  https://github.com/illumos/illumos-gate/commit/139510f

Porting notes:
- Added as module option declaration.
- Added to zfs-module-parameters.5 man page.

Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-25 13:15:17 -08:00
Brian Behlendorf 89666a8e1c Increase default user space stack size
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest.  This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler.  Doubling the stack size resolves the
issue safely for all distribution and leaves us some headroom.

$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...

loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4215
2016-01-13 13:55:12 -08:00
Brian Behlendorf d297a5a3a1 Add spl_kmem_cache_kmem_threads man page entry
The spl_kmem_cache_kmem_threads module option was accidentally
omitted from the documentation.  Add it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #512
2016-01-12 15:04:37 -08:00
Matthew Ahrens 7f60329a26 Illumos 5987 - zfs prefetch code needs work
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/5987 zfs prefetch code needs work
  illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work

Porting notes:
- [module/zfs/dbuf.c]
  - 5f6d0b6 Handle block pointers with a corrupt logical size
- [module/zfs/dmu_zfetch.c]
  - c65aa5b Fix gcc missing parenthesis warnings
  - 428870f Update core ZFS code from build 121 to build 141.
  - 79c76d5 Change KM_PUSHPAGE -> KM_SLEEP
  - b8d06fc Switch KM_SLEEP to KM_PUSHPAGE
  - Account for ISO C90 - mixed declarations and code - warnings
  - Module parameters (new/changed):
    - Replaced zfetch_block_cap with zfetch_max_distance
      (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024))
    - Preserved zfs_prefetch_disable as 'int' for consistency with
      existing Linux module options.
- [include/sys/trace_arc.h]
  - Added new tracepoints
    - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async);
    - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch);
- [man/man5/zfs-module-parameters.5]
  - Updated man page

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:02:33 -08:00
Matthew Ahrens 9867e8be2a Illumos 4891 - want zdb option to dump all metadata
4891 want zdb option to dump all metadata
Reviewed by: Sonu Pillai <sonu.pillai@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>

We'd like a way for zdb to dump metadata in a machine-readable
format, so that we can bring that back from a customer site for
in-house diagnosis.  Think of it as a crash dump for zpools,
which can be used for post-mortem analysis of a malfunctioning
pool

References:
  https://www.illumos.org/issues/4891
  https://github.com/illumos/illumos-gate/commit/df15e41

Porting notes:
- [cmd/zdb/zdb.c]
  - a5778ea zdb: Introduce -V for verbatim import
  - In main() getopt 'opt' variable removed and the code was
    brought back in line with illumos.
- [lib/libzpool/kernel.c]
  - 1e33ac1 Fix Solaris thread dependency by using pthreads
  - f0e324f Update utsname support
  - 4d58b69 Fix vn_open/vn_rdwr error handling
  - In vn_open() allocate 'dumppath' on heap instead of stack
  - Properly handle 'dump_fd == -1' error path
  - Free 'realpath' after added vn_dumpdir_code block

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 11:36:54 -08:00
nathancheek 53e0313506 Man page whitespace
Signed-off-by: nathancheek <myself@nathancheek.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4184
2016-01-11 09:55:50 -08:00
Paul Dagnelie fcff0f35bd Illumos 5960, 5925
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/5960
  https://www.illumos.org/issues/5925
  https://github.com/illumos/illumos-gate/commit/a2cdcdd

Porting notes:
- [lib/libzfs/libzfs_sendrecv.c]
  - b8864a2 Fix gcc cast warnings
  - 325f023 Add linux kernel device support
  - 5c3f61e Increase Linux pipe buffer size on 'zfs receive'
- [module/zfs/zfs_vnops.c]
  - 3558fd7 Prototype/structure update for Linux
  - c12e3a5 Restructure zfs_readdir() to fix regressions
- [module/zfs/zvol.c]
  - Function @zvol_map_block() isn't needed in ZoL
  - 9965059 Prefetch start and end of volumes
- [module/zfs/dmu.c]
  - Fixed ISO C90 - mixed declarations and code
  - Function dmu_prefetch() 'int i' is initialized before
    the following code block (c90 vs. c99)
- [module/zfs/dbuf.c]
  - fc5bb51 Fix stack dbuf_hold_impl()
  - 9b67f60 Illumos 4757, 4913
  - 34229a2 Reduce stack usage for recursive traverse_visitbp()
- [module/zfs/dmu_send.c]
  - Fixed ISO C90 - mixed declarations and code
  - b58986e Use large stacks when available
  - 241b541 Illumos 5959 - clean up per-dataset feature count code
  - 77aef6f Use vmem_alloc() for nvlists
  - 00b4602 Add linux kernel memory support

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-08 15:08:19 -08:00
Chris Williamson 23de906c72 Illumos 5745 - zfs set allows only one dataset property to be set at a time
5745 zfs set allows only one dataset property to be set at a time
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5745
  https://github.com/illumos/illumos-gate/commit/3092556

Porting notes:
- Fix the missing braces around initializer, zfs_cmd_t zc = {"\0"};
- Remove extra format argument in zfs_do_set()
- Declare at the top:
  - zfs_prop_t prop;
  - nvpair_t *elem;
  - nvpair_t *next;
  - int i;
- Additionally initialize:
  - int added_resv = 0;
  - zfs_prop_t prop = 0;
- Assign 0 install of NULL for uint64_t types.
  - zc->zc_nvlist_conf = '\0';
  - zc->zc_nvlist_src = '\0';
  - zc->zc_nvlist_dst = '\0';

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3574
2015-12-29 16:59:26 -08:00
DHE dcb6bed1df Make zio_taskq_batch_pct user configurable
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4110
2015-12-18 13:46:23 -08:00
Ned Bass 6b4e21c60e Man page white space and spelling corrections
Correct some misspelled words and grammatical errors, and remove
trailing white space in the man pages.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4115
2015-12-18 13:33:37 -08:00
Turbo Fredriksson b23d54305a Remove shareiscsi description and example from zfs(8).
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-10-13 16:52:29 -07:00
Turbo Fredriksson 291b06c37e Unmount is part of the shutdown process, not the boot process.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #3762
2015-10-13 16:52:02 -07:00
Don Brady 56b3986316 Add large block support to zpios(1) benchmark
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.

  `zpios ... -S size | --blocksize size ...`

This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`

Example run below
(non realistic results from a VM and output abbreviated for space)

```
 --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
 --threaddelay=0 --cleanup --human-readable --verbose --cleanup
 --blocksize=128K,256K,512K,1M

 th-cnt  rg-cnt  rg-sz  ch-sz  blksz  wr-data wr-bw   rd-data rd-bw
---------------------------------------------------------------------
 4       750     8m     1m     128k   5g      90.06m  5g      93.37m
 4       750     8m     1m     256k   5g      79.71m  5g      99.81m
 4       750     8m     1m     512k   5g      42.20m  5g      93.14m
 4       750     8m     1m     1m     5g      35.51m  5g      89.36m
 8       750     8m     1m     128k   5g      85.49m  5g      90.81m
 8       750     8m     1m     256k   5g      61.42m  5g      99.24m
 8       750     8m     1m     512k   5g      49.09m  5g     108.78m
 16      750     8m     1m     128k   5g      86.28m  5g      88.73m
 16      750     8m     1m     256k   5g      64.34m  5g      93.47m
 16      750     8m     1m     512k   5g      68.84m  5g     124.47m
 16      750     8m     1m     1m     5g      53.97m  5g      97.20m
---------------------------------------------------------------------
```

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3795
Closes #2071
2015-09-22 09:13:20 -07:00
Ned Bass 3af56fd95f Honor xattr=sa dataset property
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:

* Libzfs has historically included "xattr" in its list of default mount
  options. This overrides the dataset property, so the dataset is always
  configured to use directory-based xattrs even when the xattr dataset
  property is set to off or sa. Address this by removing "xattr" from
  the set of default mount options in libzfs.

* There was no way to enable system attribute-based extended attributes
  using temporary mount options. Add the mount options "saxattr" and
  "dirxattr" which enable the xattr behavior their names suggest.  This
  approach has the advantages of mirroring the valid xattr dataset
  property values and following existing conventions for mount option
  names.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3787
2015-09-19 14:04:14 -07:00
Brian Behlendorf 9965059ab9 Prefetch start and end of volumes
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume.  Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
2015-09-09 14:38:29 -07:00
Brian Behlendorf 3b36f8319d Add dbgmsg kstat
Internally ZFS keeps a small log to facilitate debugging.  By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1.  The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.

$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp    message
1441141408   spa=tank async request task=1
1441141408   txg 70 open pool version 5000; software version 5000/5; ...
1441141409   spa=tank async request task=32
1441141409   txg 72 import pool version 5000; software version 5000/5; ...
1441141414   command: lt-zpool import tank

Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log.  As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat.  For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.

$ ZFS_DEBUG=on ./cmd/ztest/ztest -V

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3728
2015-09-04 16:08:14 -07:00
Brian Behlendorf 0500e835af Support accessing .zfs/snapshot via NFS
This patch is based on the previous work done by @andrey-ve and
@yshui.  It triggers the automount by using kern_path() to traverse
to the known snapshout mount point.  Once the snapshot is mounted
NFS can access the contents of the snapshot.

Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server.  To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally.  System administators desiring this
functionally must explicitly enable it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2797
Closes #1655
Closes #616
2015-09-04 13:23:53 -07:00
Brian Behlendorf e20cd6f7a8 Merge branch 'zvol'
Performance improvements for zvols.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3720
2015-09-04 13:14:21 -07:00
Richard Yao 37f9dac592 zvol processing should use struct bio
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.

Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.

Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:

1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat.  Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.

Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:30:24 -04:00
Brian Behlendorf 0282c4137e Add temporary mount options
Add the required kernel side infrastructure to parse arbitrary
mount options.  This enables us to support temporary mount
options in largely the same way it is handled on other platforms.

See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #985
Closes #3351
2015-09-03 14:14:55 -07:00
Brian Behlendorf 6cde64351e Add spa_slop_shift module option
Allow for easy turning of a pools reserved free space.  Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve.  Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3724
2015-09-02 09:30:18 -07:00
Andreas Buschmann bba365cfc8 Add extra keyword 'slot' to vdev_id.conf
Add new keyword 'slot' to vdev_id.conf
This selects from where to get the slot number for a SAS/SATA disk
Needed to enable access to the physical position of a disk in a
Supermicro 2027R-AR24NV .

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3693
2015-08-30 10:03:56 -07:00
Brian Behlendorf 7e8bddd019 Update arc_memory_throttle() to check pageout
This brings the behavior of arc_memory_throttle() back in sync with
illumos.  The updated memory throttling policy roughly goes like this:

* Never throttle if more than 10% of memory is free.  This threshold
  is configurable with the zfs_arc_lotsfree_percent module option.

* Minimize any throttling of kswapd even when free memory is below
  the set threshold.  Allow it to write out pages as quickly as
  possible to help alleviate the memory pressure.

* Delay all other threads when free memory is below the set threshold
  in order to avoid compounding the memory pressure.  Buffers will be
  evicted from the ARC to reduce the issue.

The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3637
2015-07-30 11:52:12 -07:00
Brian Behlendorf 11f552fa90 Update arc_available_memory() to check freemem
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages.  This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim.  Conceptually
this brings arc_available_memory() back in sync with illumos.

It is also desirable that the target amount of free memory be tunable
on a system.  While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.

zfs_arc_sys_free - The target number of bytes the ARC should leave
                   as free memory on the system.  This value can
                   checked in /proc/spl/kstat/zfs/arcstats and
                   setting this module option will override the
                   default value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3637
2015-07-30 11:50:22 -07:00
Brian Behlendorf 6339c1b9dc Bound zvol_threads module option
The zvol_threads module option should be bounded to a reasonable
range.  The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most.  The default value of 32 is a reasonable
default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3614
2015-07-29 07:42:11 -07:00
Brian Behlendorf 62aa81a577 Add defclsyspri macro
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority.  Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority.  This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.

All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri.  The vast majority of callers were
part of the test suite which won't have an external impact.  The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri.  This makes it more likely the process will be scheduled
which may help performance.

To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added.  When disabled (0) all newly created kernel
threads will use the default kernel thread priority.  When enabled (1)
the specified taskq priority will be used.  By default this value is
enabled (1).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-07-23 13:25:49 -07:00
Brian Behlendorf 728d6ae91e Reinstate zfs_arc_p_min_shift
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c".  This was done to improve specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).

This change may still be desirable but it needs to be re-investigated.
in the context of the recent ARC changes from upstream.  Therefore
this code is being restored to facilitate benchmarking.  By setting
"zfs_arc_p_min_shift=64" we easily compare the performance.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:32 -07:00
Manoj Joseph 93f6d7e2e5 Illumos 5764 - "zfs send -nv" directs output to stderr
5764 "zfs send -nv" directs output to stderr
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/dc5f28a
  https://www.illumos.org/issues/5764

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3585
2015-07-14 10:28:32 -07:00
Justin T. Gibbs 99197f034e Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://github.com/illumos/illumos-gate/commit/db1741f
  https://www.illumos.org/issues/5661

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3571
2015-07-10 12:11:45 -07:00
Colin Ian King 8f3439733f man: fix spelling mistakes in manual
A few minor mistakes than should be fixed:

zpool:
  compatability -> compatibility

zfs:
  accessable -> accessible
  availible  -> available

zfs-events:
  availible -> available

zfs-module-parameters:
  proceding -> proceeding

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3544
2015-07-01 10:58:31 -07:00
Brian Behlendorf f7a973d99b Add TASKQ_DYNAMIC feature
Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics.  Initially only a single worker thread will be created
to service tasks dispatched to the queue.  As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'.  When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.

Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.

* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned.  Default 4.

* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system.  This is provided
for the purposes of debugging and troubleshooting.  Default 1
(enabled).

This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458
2015-06-24 15:14:18 -07:00
Etienne Dechamps 99b14de421 Make metaslab_aliquot a module parameter.
This seems generally useful. metaslab_aliquot is the ZFS allocation
granularity, which is roughly equivalent to what is called the stripe
size in traditional RAID arrays. It seems relevant to performance
tuning.

Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-22 14:19:38 -07:00
Hajo Möller 410921241d Add -y option to `zpool iostat`
sysstat's iostat omits the first report when the -y option is used.
This patch adds that functionality and omits the first report with
statistics since system boot.

Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3439
2015-06-17 10:39:20 -07:00
Prakash Surya ca0bf58d65 Illumos 5497 - lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Porting notes and other significant code changes:

The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.

Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.

The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.

Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:

    302f753 - Integrate ARC more tightly with Linux
    39e055c - Adjust arc_p based on "bytes" in arc_shrink
    f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
    77765b5 - Remove "arc_meta_used" from arc_adjust calculation
    94520ca - Prune metadata from ghost lists in arc_adjust_meta

Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:

    fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
    fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }

The following arcstats have been removed:

    recycle_miss - Used by arcstat.py and arc_summary.py, both of which
    have been updated appropriately.

    l2_writes_hdr_miss

The following arcstats have been added:

    evict_not_enough - Number of times arc_evict_state() was unable to
    evict enough buffers to reach its target amount.

    evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
    because it was being written to the l2arc.

    l2_writes_lock_retry - Replaces l2_writes_hdr_miss.  Number of times
    l2arc_write_done() failed to acquire hash_lock (and re-tries).

    arc_meta_min - Shows the value of the zfs_arc_meta_min module
    parameter (see below).

The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme.  Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.

The following module parameters have been added:

    zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
    before moving on to the next sub-list.

    zfs_arc_meta_min - Enforce a floor on the amount of metadata in
    the ARC.

    zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
    ARC state.

    zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
    the target size to be considered "overflowing".

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
2015-06-11 10:27:25 -07:00
Turbo Fredriksson d050c627b5 Improve on the ZFS events documentation
* Add information about the 'zpool events' command in zpool(8).
* More events and payloads defined in zfs-events(5).
* I/O Stages and I/O Flags sections added.
* Remove unused legacy "zio_deadline" payload define.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3467
2015-06-09 11:19:19 -07:00
Tim Chase b1b85c8772 Zdb should be able to open the root dataset
If the pool/dataset command-line argument is specified with a trailing
slash, for example, "tank/", it is interpreted as the root dataset.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3415
2015-05-15 10:52:46 -07:00
Chris Dunlap 492b1d2ef0 Update ZED copyright boilerplate
This commit updates the copyright boilerplate within the ZED subtree.

The instructions for appending a contributor copyright line have
been removed.  Manually maintaining copyright notices in this
manner is error-prone, imprecise at a file-scope granularity, and
oftentimes inaccurate.  These lines can become a pernicious source of
merge conflicts.  A commit log is better suited to maintaining this
information.  Consequently, a line has been added to the boilerplate
to refer to the git commit log for authoritative copyright attribution.

To account for the scenario where a file may become separated from
the codebase and commit history (i.e., it is copied somewhere else),
a line has been added to identify the file's origin.

http://softwarefreedom.org/resources/2012/ManagingCopyrightInformation.html

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3384
2015-05-11 15:07:00 -07:00
Matthew Ahrens f1512ee61e Illumos 5027 - zfs large block support
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  https://github.com/illumos/illumos-gate/commit/b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this one the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #354
2015-05-11 12:23:16 -07:00
Turbo Fredriksson 859735c095 Add the '-a' option to 'zpool export'
Support exporting all imported pools in one go, using 'zpool export -a'.

This is accomplished by moving the export parts from zpool_do_export()
in to the new function zpool_export_one().  The for_each_pool() function
is used to enumerate the list of pools to be exported.  Passing an argc
of 0 implies the function should be called on all pools.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #3203
2015-05-04 10:17:08 -07:00
Jerry Jelinek 788eb90c4c Illumos 3897 - zfs filesystem and snapshot limits
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3897
  https://github.com/illumos/illumos-gate/commit/a2afb61

Porting Notes:

dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:22:51 -07:00
Paul B. Henson 0bf8501ae1 5410 Document -S option to zfs inherit
5410 Document -S option to zfs inherit
5412 Mention -S option when zfs inherit fails on quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5410
  https://github.com/illumos/illumos-gate/commit/5ff8cfa9

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3279
2015-04-24 15:16:49 -07:00
cburroughs 7008109646 align zfs_autoimport_disable manpage with reality
The default was changed in #2820.

Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3341
2015-04-24 14:58:38 -07:00
DHE 614e598c88 Fix formatting error in zfs(8)
Commit b1a3e93217 accidentally
introduced an intentation error between the 'zfs receive'
and 'zfs allow' detailed documentation sections.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3312
2015-04-23 13:59:00 -07:00
Turbo Fredriksson b467db454e Document bookmarks a little better in zfs(8)
Add a basic summary to zfs(8) describing bookmarks.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3268
2015-04-14 14:07:07 -07:00
Brian Behlendorf 74aa2ba259 Update zfs_pd_bytes_max default in zfs(8)
Commit b738bc5 should have updated the default value of zfs_pd_bytes_max
in the zfs(8) man page.  The correct default value is 50*1024*1024.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-03-31 11:55:28 -07:00
George Wilson b738bc5a0f Illumos 5694 - traverse_prefetcher does not prefetch enough
5694 traverse_prefetcher does not prefetch enough
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5694
  https://github.com/illumos/illumos-gate/commit/34d7ce05

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3230
2015-03-27 15:02:50 -07:00
Isaac Huang e89bd69775 zio_injection_enabled should not be a module option
The zio_inject.c keeps zio_injection_enabled as a counter of
fault handlers, so it should not be exported to user space as
a module option.

Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c,
where the symbols are defined.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3199
2015-03-24 13:22:03 -07:00
Brian Behlendorf bc88866657 Fix arc_adjust_meta() behavior
The goal of this function is to evict enough meta data buffers from the
ARC in order to enforce the arc_meta_limit.  Achieving this is slightly
more complicated than it appears because it is common for data buffers
to have holds on meta data buffers.  In addition, dnode meta data buffers
will be held by the dnodes in the block preventing them from being freed.
This means we can't simply traverse the ARC and expect to always find
enough unheld meta data buffer to release.

Therefore, this function has been updated to make alternating passes
over the ARC releasing data buffers and then newly unheld meta data
buffers.  This ensures forward progress is maintained and arc_meta_used
will decrease.  Normally this is sufficient, but if required the ARC
will call the registered prune callbacks causing dentry and inodes to
be dropped from the VFS cache.  This will make dnode meta data buffers
available for reclaim.  The number of total restarts in limited by
zfs_arc_meta_adjust_restarts to prevent spinning in the rare case
where all meta data is pinned.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Brian Behlendorf 2cbb06b561 Restructure per-filesystem reclaim
Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL.  The ARC could invoke this
call back if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC.  The ZPL
would iterate over all ZFS super blocks and perform the reclaim.

For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems.  This patch
is designed to address those issues.

1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback.  The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.

2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality.  This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed.  This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes.  This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.

3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback.  This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache.  If this isn't done than inode can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Turbo Fredriksson b1a3e93217 Move duplicate information about the 'zfs send -e' option.
The extra one was under the 'zfs receive' command (which isn't relevant).
Instead, it should have been further up (still in the 'zfs send' option).

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3194
2015-03-19 10:39:43 -07:00
Brian Behlendorf 6442f3cfe3 Retire zio_bulk_flags
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches.  Today
this code works well and there's no compelling reason to keep
this functionality.  In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:08:49 -08:00
Brian Behlendorf 3018bffa9b Refine slab cache sizing
This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator.  Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space.  However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
the we be as frugal as possible.  A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf b1c3ae48a7 Update spl-module-parameters(5) man page
The spl-module-parameters(5) was not kept up to date.  Refresh
the man page so that it lists all the possible module options,
describes what the do, and justify why the default values are
set they way the are.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf 1a20496834 Make slab reclaim more aggressive
Many people have noticed that the kmem cache implementation is slow
to release its memory.  This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty.  Unused
objects which are cached in the magazines will still prevent a slab
from being freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf c3eabc75b1 Refactor generic memory allocation interfaces
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic. This eliminates the potential for
   surprises that were previously possible where kmem flags were
   mapped to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to the newly-added spl_kmem_alloc_max
   parameter.  This ensures that small allocations will not contend
   on a single global lock, large allocations can still be handled,
   and potentially limited virtual address space will not be squandered.
   This behavior is entirely different than under Illumos due to
   different memory management strategies employed by the respective
   kernels.  However, this functionally provides the semantics required.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified in to
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocations functions that can be safely used in the same way
   they are used on Illumos.   Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux specific changes
   which were previously required to avoid deadlocking the system.

This change will be largely transparent to existing callers but there
are a few caveats:

1. Because the headers were refactored and extraneous includes removed
   callers may find they need to explicitly add additional #includes.
   In particular, kmem_cache.h must now be explicitly includes to
   access the SPL's kmem cache implementation.  This behavior is
   different from Illumos but it was done to avoid always masking
   the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers such as ZFS which did not will not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flags has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() functions has been removed.  For callers which
   need to distinguish between a physical and virtual address use
   is_vmalloc_addr().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Ned Bass 33b6dbbc51 Document zfs_flags module parameter
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter.  Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.

zfs_flags (int)
            Set  additional debugging flags. The following flags may be
            bitwise-or'd together.

            +-------------------------------------------------------+
            |Value   Symbolic Name                                  |
            |        Description                                    |
            +-------------------------------------------------------+
            |    1   ZFS_DEBUG_DPRINTF                              |
            |        Enable dprintf entries in the debug log.       |
            +-------------------------------------------------------+
            |    2   ZFS_DEBUG_DBUF_VERIFY *                        |
            |        Enable extra dbuf verifications.               |
            +-------------------------------------------------------+
            |    4   ZFS_DEBUG_DNODE_VERIFY *                       |
            |        Enable extra dnode verifications.              |
            +-------------------------------------------------------+
            |    8   ZFS_DEBUG_SNAPNAMES                            |
            |        Enable snapshot name verification.             |
            +-------------------------------------------------------+
            |   16   ZFS_DEBUG_MODIFY                               |
            |        Check for illegally modified ARC buffers.      |
            +-------------------------------------------------------+
            |   32   ZFS_DEBUG_SPA                                  |
            |        Enable spa_dbgmsg entries in the debug log.    |
            +-------------------------------------------------------+
            |   64   ZFS_DEBUG_ZIO_FREE                             |
            |        Enable verification of block frees.            |
            +-------------------------------------------------------+
            |  128   ZFS_DEBUG_HISTOGRAM_VERIFY                     |
            |        Enable extra spacemap histogram verifications. |
            +-------------------------------------------------------+
            * Requires debug build.

            Default value: 0.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2988
2015-01-07 15:50:49 -08:00
Randall Mason 33c0819425 Fix small spelling mistake
recieve becomes receive

Signed-off-by: Randall Mason <ClashTheBunny@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2877
2014-11-14 15:48:51 -08:00
Daniil Lunev 62bdd5eb7a Illumos 4924 - LZ4 Compression for metadata
Reviewed by Matthew Ahrens <mahrens@delphix.com>
Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://github.com/illumos/illumos-gate/commit/b8289d2
  https://www.illumos.org/issues/3756

Porting notes:

The static function zfs_prop_activate_feature() was removed because
this change removes the only caller.  The function was not removed
from Illumos but instead left as dead code.  However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1540
2014-10-20 16:17:49 -07:00
Brian Behlendorf a80d69caf0 Remove adaptive mutex implementation
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Turbo Fredriksson 971808ec9f Add a stern warning about dedup
Users intending to use dedup should be clearly advised about
its memory requirements and the risks involved.

Thanx to Sachiru for comments and suggestions.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2754
2014-10-08 17:07:11 -07:00
Turbo Fredriksson a215ee16c0 Add an example for 'zfs bookmark' to the Example section.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2762
2014-10-07 11:29:26 -07:00
Richard Yao 83e9986f6e Implement -t option to zpool create for temporary pool names
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.

26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.

This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
2014-09-30 10:46:59 -07:00
Richard Yao 00d2a8c92f zpool import -t should not update cachefile
zpool import's -t parameter is intended for use with -R when operating
on pools that belong to other systems. Like -R, pools imported in this
way should not update the cachefile unless explicitly requested. The
initial implementation allowed the cachefile to be updated when -R was
not used. This went uncaught during testing because -R had implicitly
disabled use of the cachefile.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
2014-09-30 10:46:58 -07:00
Brian Behlendorf aa0ac7caa4 Make user stack limit configurable
To aid in detecting and debugging stack overflow issues make the
user space stack limit configurable via a new ZFS_STACK_SIZE
environment variable.  The value assigned to ZFS_STACK_SIZE will
be used as the default stack size in bytes.

Because this is mainly useful as a debugging aid in conjunction
with ztest the stack limit is disabled by default.  See the ztest(1)
man page for additional details on using the ZFS_STACK_SIZE
environment variable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2743
Issue #2293
2014-09-30 10:46:55 -07:00
Chris Dunlap dcca723ace Refer to ZED's scripts as ZEDLETs
The executables invoked by the ZED in response to a given zevent
have been generically referred to as "scripts".  By convention,
these scripts have aimed to be /bin/sh compatible for reasons of
portability and comprehensibility.  However, the ZED only requires
they be executable and (ideally) capable of reading environment
variables.  As such, these scripts are now referred to as ZEDLETs
(ZFS Event Daemon Linkage for Executable Tasks).

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2735
2014-09-25 13:54:17 -07:00
Max Grossman 36283ca233 Illumos 5138 - add tunable for maximum number of blocks freed in one txg
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5138
  https://github.com/illumos/illumos-gate/commit/af3465d

Porting notes:

Because support for exposing a uint64_t parameter wasn't added
until v3.17-rc1 the zfs_free_max_blocks variable has been declared
as a unsigned long.  This is already far larger than required and
it allows us to avoid additional autoconf compatibility code.

The default value has been set to 100,000 on Linux instead of
ULONG_MAX which is used on Illumos.  This was done to limit the
number of outstanding IOs in the system when snapshots are destroyed.
This helps ensure individual TXG sync times are kept reasonable and
memory isn't wasted managing a huge backlog of outstanding IOs.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2675
Closes #2581
2014-09-23 14:26:34 -07:00
Matthew Ahrens b8bcca18f7 Illumos 5161 - add tunable for number of metaslabs per vdev
5161 add tunable for number of metaslabs per vdev
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5161
  https://github.com/illumos/illumos-gate/commit/bf3e216

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2698
2014-09-23 10:00:02 -07:00
Turbo Fredriksson 71bd064555 Document environment variables for zdb, zfs, zinject and zpool.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2691
2014-09-18 15:01:03 -07:00
Tim Chase 52dd454d05 Document the "readonly" pool property
This documentation is based FreeBSD's zpool(8) man page.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2682
2014-09-09 11:35:46 -07:00
Alexey Smirnoff 0dfc732416 Change the default 'zfs_dedup_prefetch' value to '0'
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2639
2014-09-04 09:50:45 -07:00
Matthew Ahrens dea377c0d9 Illumos 4970-4974 - extreme rewind enhancements
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4970
  https://www.illumos.org/issues/4971
  https://www.illumos.org/issues/4972
  https://www.illumos.org/issues/4973
  https://www.illumos.org/issues/4974
  https://github.com/illumos/illumos-gate/commit/e42d205

Notes:
    This set of patches adds a set of tunable parameters for the
    "extreme rewind" mode of pool import which allows control over
    the traversal performed during such an import.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2598
2014-08-26 16:29:57 -07:00
Matthew Ahrens 49ddb31506 Illumos 5034 - ARC's buf_hash_table is too small
5034 ARC's buf_hash_table is too small

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5034
  https://github.com/illumos/illumos-gate/commit/63e911b

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2615
2014-08-26 16:14:49 -07:00
George Wilson f3a7f6610f Illumos 4976-4984 - metaslab improvements
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4976
  https://www.illumos.org/issues/4978
  https://www.illumos.org/issues/4979
  https://www.illumos.org/issues/4980
  https://www.illumos.org/issues/4981
  https://www.illumos.org/issues/4982
  https://www.illumos.org/issues/4983
  https://www.illumos.org/issues/4984
  https://github.com/illumos/illumos-gate/commit/2e4c998

Notes:
    The "zdb -M" option has been re-tasked to display the new metaslab
    fragmentation metric and the new "zdb -I" option is used to control
    the maximum number of in-flight I/Os.

    The new fragmentation metric is derived from the space map histogram
    which has been rolled up to the vdev and pool level and is presented
    to the user via "zpool list".

    Add a number of module parameters related to the new metaslab weighting
    logic.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:40:49 -07:00
Turbo Fredriksson f67d709080 Create an 'overlay' property
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.

Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
2014-08-15 13:39:19 -07:00
Matthew Ahrens fbeddd60b7 Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4390
  https://github.com/illumos/illumos-gate/commit/7fd05ac

Porting notes:

Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code.  This patch should reduce its
stack footprint a bit more.

The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.

The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global.  Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2545
2014-08-04 11:50:52 -07:00
Matthew Ahrens 9b67f60560 Illumos 4757, 4913
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4757
  https://www.illumos.org/issues/4913
  https://github.com/illumos/illumos-gate/commit/5d7b4d4

Porting notes:

For compatibility with the fastpath code the zio_done() function
needed to be updated.  Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2544
2014-08-01 14:28:05 -07:00
Matthew Ahrens faf0f58c69 Illumos 3835 zfs need not store 2 copies of all metadata
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Description from Matt Ahrens's bug report at Delphix:

    Add a new zfs property, "redundant_metadata" which can have values
    "all" or "most".  The default will be "all", which is the current
    behavior.  Setting to "most" will cause us to only store 1 copy of
    level-1 indirect blocks of user data files.

Additional notes:

    The new man page section for this property states

        "The exact behavior of which metadata blocks
         are stored redundantly may change in future releases."

    and:

        "When set to most, ZFS stores an extra copy of most types of
         metadata. This can improve performance of random writes,
         because less metadata must be written."

    The current implementation is as described above in Matt's blog.
    It is controlled by a new global integer
    "zfs_redundant_metadata_most_ditto_level", currently initialized
    to 2. When "redundant_metadata" is set to "most", only indirect
    blocks of the specified level and higher will have additional ditto
    blocks created.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2542
2014-07-31 09:49:34 -07:00
Matthew Ahrens da536844d5 Illumos 4368, 4369.
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4369
  https://www.illumos.org/issues/4368
  https://github.com/illumos/illumos-gate/commit/78f1710

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2530
2014-07-29 10:55:29 -07:00
Max Grossman b0bc7a84d9 Illumos 4370, 4371
4370 avoid transmitting holes during zfs send
4371 DMU code clean up

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4370
  https://www.illumos.org/issues/4371
  https://github.com/illumos/illumos-gate/commit/43466aa

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2529
2014-07-28 14:29:58 -07:00
Matthew Ahrens fa86b5dbb6 Illumos 4171, 4172
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features

Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4171
  https://www.illumos.org/issues/4172
  https://github.com/illumos/illumos-gate/commit/2acef22

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2528
2014-07-25 16:40:07 -07:00
Turbo Fredriksson 79eb71dc6c Support '-H' (scripted mode) to 'zpool get'
This functionality is already available in 'zfs get'.  Providing
it for 'zpool get' is useful and good for consistency.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2522
2014-07-25 11:58:36 -07:00
Turbo Fredriksson a60e668bd2 Initial attempt to document events and payloads.
In no way complete - most have been trial and error and some
deducing what they could mean. It needs more information from
someone that knows the code better. But this is a start and
it lays the basic structure for adding this additional detail.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2357
2014-07-25 11:58:36 -07:00
George Wilson 93cf20764a Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.

This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.

The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram

In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:

    * 4K sector devices will not see any compression benefit
    * large space_maps require more metadata on-disk
    * large space_maps require more time to load (typically random reads)

Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.

A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.

References:
  https://www.illumos.org/issues/4101
  https://www.illumos.org/issues/4102
  https://www.illumos.org/issues/4103
  https://www.illumos.org/issues/4105
  https://www.illumos.org/issues/4106
  https://github.com/illumos/illumos-gate/commit/0713e23

Porting notes:

A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:39:16 -07:00
Richard Yao a5778ea242 zdb: Introduce -V for verbatim import
When given a pool name via -e, zdb would attempt an import. If it
failed, then it would attempt a verbatim import. This behavior is
not always desirable so a -V switch is added to zdb to control the
behavior. When specified, a verbatim import is done. Otherwise,
the behavior is as it was previously, except no verbatim import
is done on failure.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2372
2014-07-17 11:40:32 -07:00
Tim Chase f4a4046bd6 Convert zfs_mg_noalloc_threshold to a module parameter and document
The parameter was added as illumos issue 4081 which was committed to
zfsonlinux in ac72fac3ea.  This patch
documents the parameter and allows for it to be set as a module parameter.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2483
2014-07-16 16:49:25 -07:00
Tim Chase 52e68edc2d Document the optional "device" argument for "zpool split"
Most ZFS implementations seemed to have missed this bit of documentation.
The additional text is based on FreeBSD's man page.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2416
2014-07-01 14:16:43 -07:00
Turbo Fredriksson 628668a39f Add information about the -o option to zpool replace
Users need to be aware that when replacing devices in an existing
pool they may need to override automatically detected ashift value.
This will all depend on the exact hardware they are using.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2024
2014-06-27 08:31:07 -07:00
SenH 1567e0758b Fix man zpool property feature_guid
The property name gets mangled with the explanation due to the property
length.  Fixed by putting the explanation on the next line.

Before:
  unsupported@feature_Info rmation about unsupported features that are
  enabled on the pool. See zpool-features(5) for details.

After:
  unsupported@feature_guid
  Information about unsupported features that are enabled on the pool. See
  zpool-features(5) for details.

Signed-off-by: SenH <sen@senhaerens.be>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2419
2014-06-26 16:21:15 -07:00
Turbo Fredriksson 21b446a79e Document the -X and -T options to 'zpool import'
These options have existed for a long time but have historically
been undocumented because they are not guaranteed to be safe.  They
should only be used as a last resort when attempting to recover a
damaged pool.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1130
2014-06-06 15:49:34 -07:00
Tim Chase 27b293be8a Expand the description of scan-related and other parameters.
Document that the scan-related parameters are, in fact, applicable only
to scrub and/or resilver operations as appropriate.

Expand a few of the prefetch-related descriptions.

Add clarification to other module parameters.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2361
2014-06-06 13:04:43 -07:00
Turbo Fredriksson beb4be77b7 Man page updates for 'zfs share'
* Remove the references to share(1M), unshare(1M) and dfstab(4)
  since they are not applicable to Linux.
* Add the exact exportfs command line used when setting sharenfs=on.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue: #1641
2014-06-06 13:00:51 -07:00
Turbo Fredriksson 022f7bf68e Document the fact that ashift is vdev specific, not a pool global.
Users need to be aware that when adding devices to an existing pool
they may need to override automatically detected ashift value.
This will all depend on the exact hardware they are using.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2024
2014-06-06 12:52:01 -07:00
George Wilson aa7d06a98a Illumos #4101 finer-grained control of metaslab_debug
Today the metaslab_debug logic performs two tasks:

- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync

This change provides knobs for each of these independently.

References:
  https://illumos.org/issues/4101
  https://github.com/illumos/illumos-gate/commit/0713e23

Notes:

1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2227
2014-05-06 09:46:04 -07:00
Andrey Vesnovaty 703371d8c7 Evenly distribute the taskq threads across available CPUs
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.

The implementation based on assumption that task queues initialization
performed in serial manner.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #336
2014-04-25 15:29:18 -07:00
Chris Dunlap 9e246ac3d8 Initial implementation of zed (ZFS Event Daemon)
zed monitors ZFS events.  When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent.  The zevent
nvpairs are passed to the scripts as environment variables.

Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution.  Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents.  Plans are to address this in future commits.

Initial scripts have been developed to log events to syslog
and send email in response to checksum/data/io errors and
resilver.finish/scrub.finish events.  By default, email will only
be sent if the ZED_EMAIL variable is configured in zed.rc (which is
serving as a config file of sorts until a proper configuration file
is implemented).

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
2014-04-02 13:10:03 -07:00
Richard Yao 26b42f3f9d Implement -t option to zpool import for temporary pool names
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2189
2014-03-20 12:05:30 -07:00
Turbo Fredriksson 4e26f2fccd Fix NAME section of manpages zhack and fsck.zfs.
In Debian GNU/Linux a program called 'linitian' is used to make sure
that packages conforms to the Debian GNU/Linux packaging guide lines.
This fixes the problem reported as:

     W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man1/zhack.1.gz
     W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man8/fsck.zfs.8.gz

Not something that ZoL needs to addhere to, but every other man page
have their NAME section in a special way - why not these two as well?

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2161
2014-03-10 09:19:17 -07:00
Prakash Surya 624227854e Disable arc_p adapt dampener by default
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya f521ce1b9c Allow "arc_p" to drop to zero or grow to "arc_c"
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:27 -08:00
Prakash Surya 89c8cac493 Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 14:53:28 -08:00
Tim Chase 6d111134c0 Implement relatime.
Add the "relatime" property.  When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access.  This behavior
is compatible with the Linux "relatime" mount option.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2064
Closes #1917
2014-01-29 15:50:44 -08:00
Brian Behlendorf 4c995417bc Remove incorrect use of EXTRA_DIST for man pages
Setting the 'dist_' prefix is the correct way to instruct Automake
to include these files in the distribution.  The EXTRA_DIST variable
is reserved for files which are not covered by the automatic rules.

  http://www.gnu.org/software/automake/manual/automake.html#Basics

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-01-17 11:54:22 -08:00
Brian Behlendorf 3566d5c7c3 Remove incorrect use of EXTRA_DIST for man pages
Setting the 'dist_' prefix is the correct way to instruct Automake
to include these files in the distribution.  The EXTRA_DIST variable
is reserved for files which are not covered by the automatic rules.

  http://www.gnu.org/software/automake/manual/automake.html#Basics

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-01-17 11:50:08 -08:00
Ned Bass 09d0b30fd1 vdev_id: support per-channel slot mappings
The vdev_id udev helper currently applies slot renumbering rules to
every channel (JBOD) in the system.  This is too inflexible for systems
with non-homogeneous storage topologies.  The "slot" keyword now takes
an optional third parameter which names a channel to which the mapping
will apply.  If the third parameter is omitted then the rule applies to
all channels.  The first-specified rule that can match a slot takes
precedence.  Therefore a channel-specific rule for a given slot should
generally appear before a generic rule for the same slot number.  In
this way a custom slot mapping can be applied to a particular channel
and a default mapping applied to the rest.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2056
2014-01-17 11:17:54 -08:00
Matthew Thode 11b9ec23b9 Add full SELinux support
Four new dataset properties have been added to support SELinux.  They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8).  When one of
these properties is set to something other than 'none'.  That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.

For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:

  $ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media

This will ensure the filesystem is automatically mounted with that
rootcontext.  It is equivalent to manually specifying the rootcontext
with the -o option like this:

  $ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media

By default all four contexts are set to 'none'.  Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.

Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #1504
2013-12-19 10:37:31 -08:00
Turbo Fredriksson fd8febbd1e Add zfs_send_corrupt_data module option
Tuning setting to ignore read/checksum errors when sending data.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1982
Issue #1897
2013-12-18 16:46:35 -08:00
Brian Behlendorf d17eab9ce0 Update zfs(8) Snapshots section
The Snapshots section of the zfs(8) man page is incorrect and should
have been updated as part of #1312.  Snapshots of volumes can be
accessed independently and their visibility is determined by the
'snapdev=hidden|visible' property.  This is analogous to the existing
'snapdir=hidden|visible' property.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1921
2013-12-16 09:41:45 -08:00
Matthew Ahrens e8b96c6007 Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver.  The scheduler
issues a number of concurrent i/os from each class to the device.  Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes).  The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is.  See the block comment in vdev_queue.c (reproduced
below) for more details.

2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load.  The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system.  When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount.  This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens.  One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync().  Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes.  See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.

This diff has several other effects, including:

 * the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.

 * the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently.  There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.

 * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc.  This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).

--matt

APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline based.  The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.

For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due".  One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os.  This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future.  If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due.  Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).

Notes on porting to ZFS on Linux:

- zio_t gained new members io_physdone and io_phys_children.  Because
  object caches in the Linux port call the constructor only once at
  allocation time, objects may contain residual data when retrieved
  from the cache. Therefore zio_create() was updated to zero out the two
  new fields.

- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
  (vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
  This tree has been replaced by vq->vq_active_tree which is now used
  for the same purpose.

- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
  the number of vdev I/O buffers to pre-allocate.  That global no longer
  exists, so we instead use the sum of the *_max_active values for each of
  the five I/O classes described above.

- The Illumos implementation of dmu_tx_delay() delays a transaction by
  sleeping in condition variable embedded in the thread
  (curthread->t_delay_cv).  We do not have an equivalent CV to use in
  Linux, so this change replaced the delay logic with a wrapper called
  zfs_sleep_until(). This wrapper could be adopted upstream and in other
  downstream ports to abstract away operating system-specific delay logic.

- These tunables are added as module parameters, and descriptions added
  to the zfs-module-parameters.5 man page.

  spa_asize_inflation
  zfs_deadman_synctime_ms
  zfs_vdev_max_active
  zfs_vdev_async_write_active_min_dirty_percent
  zfs_vdev_async_write_active_max_dirty_percent
  zfs_vdev_async_read_max_active
  zfs_vdev_async_read_min_active
  zfs_vdev_async_write_max_active
  zfs_vdev_async_write_min_active
  zfs_vdev_scrub_max_active
  zfs_vdev_scrub_min_active
  zfs_vdev_sync_read_max_active
  zfs_vdev_sync_read_min_active
  zfs_vdev_sync_write_max_active
  zfs_vdev_sync_write_min_active
  zfs_dirty_data_max_percent
  zfs_delay_min_dirty_percent
  zfs_dirty_data_max_max_percent
  zfs_dirty_data_max
  zfs_dirty_data_max_max
  zfs_dirty_data_sync
  zfs_delay_scale

  The latter four have type unsigned long, whereas they are uint64_t in
  Illumos.  This accommodates Linux's module_param() supported types, but
  means they may overflow on 32-bit architectures.

  The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
  likely to overflow on 32-bit systems, since they express physical RAM
  sizes in bytes.  In fact, Illumos initializes zfs_dirty_data_max_max to
  2^32 which does overflow. To resolve that, this port instead initializes
  it in arc_init() to 25% of physical RAM, and adds the tunable
  zfs_dirty_data_max_max_percent to override that percentage.  While this
  solution doesn't completely avoid the overflow issue, it should be a
  reasonable default for most systems, and the minority of affected
  systems can work around the issue by overriding the defaults.

- Fixed reversed logic in comment above zfs_delay_scale declaration.

- Clarified comments in vdev_queue.c regarding when per-queue minimums take
  effect.

- Replaced dmu_tx_write_limit in the dmu_tx kstat file
  with dmu_tx_dirty_delay and dmu_tx_dirty_over_max.  The first counts
  how many times a transaction has been delayed because the pool dirty
  data has exceeded zfs_delay_min_dirty_percent.  The latter counts how
  many times the pool dirty data has exceeded zfs_dirty_data_max (which
  we expect to never happen).

- The original patch would have regressed the bug fixed in
  zfsonlinux/zfs@c418410, which prevented users from setting the
  zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
  A similar fix is added to vdev_queue_aggregate().

- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
  heap instead of the stack.  In Linux we can't afford such large
  structures on the stack.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  http://www.illumos.org/issues/4045
  illumos/illumos-gate@69962b5647

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-12-06 09:32:43 -08:00
Turbo Fredriksson 30607d9b7b Document SPL module parameters.
This is a first draft of a spl-module-parameters(5) man page. I have
just extracted the parameter name and its description with modinfo,
then checked the source what type it is and its default value.

This will need more work, preferably someone that actually know these
values and what to use them for.  Similar to zfsonlinux/zfs#1856, but
for the spl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1856
2013-11-21 12:32:41 -08:00
Yuri Pankov 54d5378fae Illumos #2583
2583 Add -p (parsable) option to zfs list

References:
  https://www.illumos.org/issues/2583
  illumos/illumos-gate@43d68d68c1

Ported-by: Gregor Kopka <gregor@kopka.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #937
2013-11-21 11:13:53 -08:00
Turbo Fredriksson 29714574fa Document ZFS module parameters.
This is a first draft of a zfs-module-parameters(5) man page. I have
just extracted the parameter name and its description with modinfo,
then checked the source what type it is and its default value.

This will need more work, preferably someone that actually know these
values and what to use them for.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1856
2013-11-20 16:00:33 -08:00
Bassu 7a4f54688e Explain 'zfs list -t snap -o name -s name' speedup
Commit 0cee240 from FreeBSD dramatically speeds up 'zfs list'
performance assuming you're only interested in the dataset
names.  This optimization should be mentioned in the man page
to allow end users to take advantage of it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1847
2013-11-08 14:21:58 -08:00