Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Olaf Faaland	379ca9cf2b	Multi-modifier protection (MMP) Add multihost=on\|off pool property to control MMP. When enabled a new thread writes uberblocks to the last slot in each label, at a set frequency, to indicate to other hosts the pool is actively imported. These uberblocks are the last synced uberblock with an updated timestamp. Property defaults to off. During tryimport, find the "best" uberblock (newest txg and timestamp) repeatedly, checking for change in the found uberblock. Include the results of the activity test in the config returned by tryimport. These results are reported to user in "zpool import". Allow the user to control the period between MMP writes, and the duration of the activity test on import, via a new module parameter zfs_multihost_interval. The period is specified in milliseconds. The activity test duration is calculated from this value, and from the mmp_delay in the "best" uberblock found initially. Add a kstat interface to export statistics about Multiple Modifier Protection (MMP) updates. Include the last synced txg number, the timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV label that received the last MMP update, and the VDEV path. Abbreviated output below. $ cat /proc/spl/kstat/zfs/mypool/multihost 31 0 0x01 10 880 105092382393521 105144180101111 txg timestamp mmp_delay vdev_guid vdev_label vdev_path 20468 261337 250274925 68396651780 3 /dev/sda 20468 261339 252023374 6267402363293 1 /dev/sdc 20468 261340 252000858 6698080955233 1 /dev/sdx 20468 261341 251980635 783892869810 2 /dev/sdy 20468 261342 253385953 8923255792467 3 /dev/sdd 20468 261344 253336622 042125143176 0 /dev/sdab 20468 261345 253310522 1200778101278 2 /dev/sde 20468 261346 253286429 0950576198362 2 /dev/sdt 20468 261347 253261545 96209817917 3 /dev/sds 20468 261349 253238188 8555725937673 3 /dev/sdb Add a new tunable zfs_multihost_history to specify the number of MMP updates to store history for. By default it is set to zero meaning that no MMP statistics are stored. When using ztest to generate activity, for automated tests of the MMP function, some test functions interfere with the test. For example, the pool is exported to run zdb and then imported again. Add a new ztest function, "-M", to alter ztest behavior to prevent this. Add new tests to verify the new functionality. Tests provided by Giuseppe Di Natale. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #745 Closes #6279	2017-07-13 13:54:00 -04:00
LOLi	cf8738d853	Add port of FreeBSD 'volmode' property The volmode property may be set to control the visibility of ZVOL block devices. This allow switching ZVOL between three modes: full - existing fully functional behaviour (default) dev - hide partitions on ZVOL block devices none - not exposing volumes outside ZFS Additionally the new zvol_volmode module parameter can be used to control the default behaviour. This functionality can be used, for instance, on "backup" pools to avoid cluttering /dev with unneeded zd* devices. Original-patch-by: mav <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb Closes #1796 Closes #3438 Closes #6233	2017-07-12 13:05:37 -07:00
Matthew Ahrens	a896468c78	OpenZFS 8067 - zdb should be able to dump literal embedded block pointer Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8067 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085 Closes #6319	2017-07-07 11:28:01 -07:00
Alek P	0ea05c64f8	Implemented zpool scrub pause/resume Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #6167	2017-07-06 22:16:13 -07:00
Brian Behlendorf	44f09cdc59	Convert man zfs.8 to mdoc (OpenZFS sync) * Fixed some typos * Additional description for some commands arguments * Text reworked to be in sync with OpenZFS * Added Linux as .Os type Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6282	2017-06-29 09:55:30 -07:00
George Melikov	cda0317e4d	Convert man zpool.8 to mdoc (OpenZFS sync) * Fixed some typos * Additional description for some commands arguments * `listsnapshots` remained * Text reworked to be in sync with OpenZFS * Added Linux as .Os type * Updated `zpool events` section. * Updated `zpool iostat\|status -c` sections * Added zed(8) reference to SEE ALSO Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #6245	2017-06-28 09:26:46 -07:00
Don Brady	0241e491a0	Inject zinject(8) a percentage amount of dev errs In the original form of device error injection, it was an all or nothing situation. To help simulate intermittent error conditions, you can now specify a real number percentage value. This is also very useful for our ZFS fault diagnosis testing and for injecting intermittent errors during load testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #6227	2017-06-16 17:21:11 -07:00
chrisrd	627791f3c0	Fix manual description of zfs_arc_dnode_limit In arc_evict_state() we start pruning when arc_dnode_size > arc_dnode_limit, i.e. arc_dnode_limit is a ceiling rather than a floor. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #6228	2017-06-14 13:23:02 -07:00
Giuseppe Di Natale	1b7c1e5ce9	OpenZFS 7578 - Fix/improve some aspects of ZIL writing - After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. - Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. - While there, untangle and unify code of z_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Sponsored by: iXsystems, Inc. Authored by: Alexander Motin <mav@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <ryao@gentoo.org> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7578 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac Closes #6191	2017-06-09 09:15:37 -07:00
Giuseppe Di Natale	099700d9df	zpool iostat/status -c improvements Users can now provide their own scripts to be run with 'zpool iostat/status -c'. User scripts should be placed in ~/.zpool.d to be included in zpool's default search path. Provide a script which can be used with 'zpool iostat\|status -c' that will return the type of device (hdd, sdd, file). Provide a script to get various values from smartctl when using 'zpool iostat/status -c'. Allow users to define the ZPOOL_SCRIPTS_PATH environment variable which can be used to override the default 'zpool iostat/status -c' search path. Allow the ZPOOL_SCRIPTS_ENABLED environment variable to enable or disable 'zpool status/iostat -c' functionality. Use the new smart script to provide the serial command. Install /etc/sudoers.d/zfs file which contains the sudoer rule for smartctl as a sample. Allow 'zpool iostat/status -c' tests to run in tree. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6121 Closes #6153	2017-06-05 10:52:15 -07:00
Alek P	bec1067d54	Implemented zpool sync command This addition will enable us to sync an open TXG to the main pool on demand. The functionality is similar to 'sync(2)' but 'zpool sync' will return when data has hit the main storage instead of potentially just the ZIL as is the case with the 'sync(2)' cmd. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #6122	2017-05-19 12:33:11 -07:00
Tony Hutter	4a283c7f77	Force fault a vdev with 'zpool offline -f' This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6094	2017-05-19 12:30:16 -07:00
LOLi	a3eeab2de6	Add property overriding (-o\|-x) to 'zfs receive' This allows users to specify "-o property=value" to override and "-x property" to exclude properties when receiving a zfs send stream. Both native and user properties can be specified. This is useful when using zfs send/receive for periodic backup/replication because it lets users change properties such as canmount, mountpoint, or compression without modifying the source. References: https://www.illumos.org/issues/2745 https://www.illumos.org/issues/3753 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1350 Closes #5349	2017-05-09 16:21:09 -07:00
Christian Schwarz	305bc4b370	Make createtxg and guid properties public Document the existence of `createtxg` and `guid` native properties in man pages and zfs command output. One of the great features of ZFS is incremental replication of snapshots, possibly between pools on different machines. Shell scripts are commonly used to auomate this procedure. They have to find the most recent common snapshot between both sides and then perform incremental send & recv. Currently, scripts rely on the sorting order of `zfs list`, which defaults to `createtxg`, and the assumption that snapshot names on either side do not change. By making `createtxg` and `guid` part of the public ZFS interface, scripts are enabled to use a) `createtxg` to determine the logical & temporal order of snapshots (the creation property is not an equivalent substitute since multiple snapshots may be created within one second) b) `guid` to uniquely identify a snapshot, independent of its current display name This has the potential of making scripts safer and correct. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #6102	2017-05-09 15:36:53 -07:00
Brian Behlendorf	8fa5250f5d	Default to zvol_request_async=0 Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5902	2017-05-04 18:01:50 -04:00
LOLi	dddef7d600	More ashift improvements This commit allow higher ashift values (up to 16) in 'zpool create' The ashift value was previously limited to 13 (8K block) in `b41c990` because the limited number of uberblocks we could fit in the statically sized (128K) vdev label ring buffer could prevent the ability the safely roll back a pool to recover it. Since `b02fe35` the largest uberblock size we support is 8K: this allow us to store a minimum number of 16 uberblocks in the vdev label, even with higher ashift values. Additionally change 'ashift' pool property behaviour: if set it will be used as the default hint value in subsequent vdev operations ('zpool add', 'attach' and 'replace'). A custom ashift value can still be specified from the command line, if desired. Finally, fix a bug in add-o_ashift.ksh caused by a missing variable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #2024 Closes #4205 Closes #4740 Closes #5763	2017-05-03 09:31:05 -07:00
Debabrata Banerjee	03b60eee78	Allow scaling of arc in proportion to pagecache When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6035	2017-05-02 15:50:49 -04:00
Chunwei Chen	692e55b8fe	Reinstate zvol_taskq to fix aio on zvol Commit `37f9dac` removed the zvol_taskq for processing zvol requests. This was removed as part of switching to make_request_fn and was motivated by a concern at the time over dispatch latency. However, this also made all bio request synchronous, and caused serious performance issues as the bio submitter would wait for every bio it submitted, effectively making the IO depth 1. This patch reinstate zvol_taskq, and to make sure overlapped I/Os are ordered properly, we take range lock in zvol_request, and pass it along with bio to the I/O functions zvol_{write,discard,read}. In order to facilitate benchmarks a zvol_request_sync module option was added to switch between sync and async request handling. For the moment, the default behavior is synchronous but this is likely to change pending additional testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5824	2017-04-26 13:54:40 -07:00
Tim Chase	e815485fe9	Update documentation for zfs_vdev_queue_depth_pct It was documented as being related to zfs_vdev_async_max_active when it is actually related to zfs_vdev_async_write_max_active. Also, expand the documentation to describe the allocation throttle which was introduced as part of OpenZFS 7090 in `3dfb57a`. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #6064	2017-04-26 13:48:28 -07:00
Dan Kimmel	a7004725d0	OpenZFS 7252 - compressed zfs send / receive OpenZFS 7252 - compressed zfs send / receive OpenZFS 7628 - create long versions of ZFS send / receive options Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Ported-by: bunder2015 <omfgbunder@gmail.com> Ported-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Most of 7252 was already picked up during ABD work. This commit represents the gap from the final commit to openzfs. - Fixed split_large_blocks check in do_dump() - An alternate version of the write_compressible() function was implemented for Linux which does not depend on fio. The behavior of fio differs significantly based on the exact version. - mkholes was replaced with truncate for Linux. OpenZFS-issue: https://www.illumos.org/issues/7252 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294 Closes #6067	2017-04-26 12:31:43 -07:00
bunder2015	603a178479	Fix typo in zfs-module-parameters man page Fix typo in zfs-module-parameters man page Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Mike McQuaid <mike@mikemcquaid.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Neal Gompa <ngompa13@gmail.com> Reviewed-by: Jason Zaman <jason@perfinion.com> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Hajo Möller <dasjoe@gmail.com> Reviewed-by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Matthew Thode <prometheanfire@gentoo.org> Reviewed-by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: kernelOfTruth <kerneloftruth@gmail.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: ilovezfs <ilovezfs@icloud.com> Reviewed-by: @jwittlincohen Reviewed-by: Jack Draak <jackdraak@gmail.com> Reviewed-by: @ptx0 Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: @Kokokokoka Reviewed-by: @JCount <JCount@hush.ai> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #6054	2017-04-24 10:56:44 -07:00
DeHackEd	321204bec6	Typo in zfs-module-parameters(5) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #6061	2017-04-24 10:34:37 -07:00
Tony Hutter	d6418de057	Prebaked scripts for zpool status/iostat -c This patch updates the "zpool status/iostat -c" commands to only run "pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever you install to). The scripts can only be run from -c as an unprivileged user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is set by root). This was done to encourage scripts to be written is such a way that normal users can use them, and to be cautious. If your script needs to run a privileged command, consider adding the appropriate line in /etc/sudoers. See zpool(8) for an example of how to do this. The patch also allows the scripts to output custom column names. If the script outputs a line like: name=value then "name" is used for the column name, and "value" is its value. Multiple columns can be specified by outputting multiple lines. Column names and values can have spaces. If the value is empty, a dash (-) is printed instead. After all the "name=value" lines are read (if any), zpool will take the next the next line of output (if any) and print it without a column header. After that, no more lines will be processed. This can be useful for printing errors. Lastly, this patch also disables the -c option with the latency and request size histograms, since it produced awkward output and made the code harder to maintain. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #5852	2017-04-21 09:27:04 -07:00
LOLi	038091fd4f	Documentation fixes for zfs(8) and 'zfs' binary * bookmarks are not supported when sending all intermediary snaps (-I) * add missing compressed (-c) option to the 'zfs' help and manpage Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Richard Laager <rlaager@wiktel.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6028	2017-04-20 12:12:50 -07:00
Richard Yao	066753103f	OpenZFS 6392 - zdb: introduce -V for verbatim import Authored by: Richard Yao <ryao@gentoo.org> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: This was already implemented in ZFS on Linux. This patch is to resolved the deltas present in our version. OpenZFS-issue: https://www.illumos.org/issues/6392 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9bb97de Closes #6020	2017-04-18 09:50:15 -07:00
DHE	06226b5936	Increase zfs_vdev_async_write_min_active to 2 Resilver operations frequently cause only a small amount of dirty data to be written to disk at a time, resulting in the IO scheduler to only issue 1 write at a time to the resilvering disk. When it is rotational media the drive will often travel past the next sector to be written before receiving a write command from ZFS, significantly delaying the write of the next sector. Raise zfs_vdev_async_write_min_active so that drives are kept fed during resilvering. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue #4825 Closes #5926	2017-04-14 14:03:44 -07:00
Debabrata Banerjee	66aca24730	SEEK_HOLE should not block on txg_wait_synced() Force flushing of txg's can be painfully slow when competing for disk IO, since this is a process meant to execute asynchronously. Optimize this path via allowing data/hole seeking if the file is clean, but if dirty fall back to old logic. This is a compromise to disabling the feature entirely. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #4306 Closes #5962	2017-04-13 10:51:20 -07:00
Brian Behlendorf	a44e7faa6c	OpenZFS 6410 - teach zdb to perform object lookups by path Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Replaced zdb.8 with upstream mdoc zdb.1m version. Updated to include Linux specific features: -V verbatium imports and improved label printing (-u, and -l). - Minor changes to `zdb -h` output to honor 80 character limit. OpenZFS-issue: https://www.illumos.org/issues/6410 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ed61ec1 Closes #6006	2017-04-13 09:40:56 -07:00
Giuseppe Di Natale	42db43e982	OpenZFS 2932 - support crash dumps to raidz, etc. pools Authored by: Bill Pijewski <wdp@joyent.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/2932 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b Closes #5984 Closes #5216	2017-04-10 10:24:17 -07:00
Tom Matthews	d456708525	list -o props should be alloc,free not used,avail Manpage suggests the zpool list properties include 'used' and 'available', when these are invalid property names. Use alloc and free in their place. ``` $ zpool list -o name,size,used 2>&1 \|head -1 bad property list: invalid property 'used' $ zpool list -o name,size,avail 2>&1 \|head -1 bad property list: invalid property 'avail' $ zpool list -o name,size,available 2>&1 \|head -1 bad property list: invalid property 'available' $ zpool list -o name,size,alloc,free NAME SIZE ALLOC FREE apool 464M 203M 261M bpool 3.62T 1.97T 1.65T ``` Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Matthews <tom@axiom-partners.com> Closes #5959	2017-04-04 11:03:33 -07:00
wli5	39ccc90967	Update documentation for new parameter "zfs_qat_disable" Update documentation in zfs-module-parameters.5 for new parameter "zfs_qat_disable" which was introduced by #5846. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #5914	2017-03-27 12:33:57 -07:00
DeHackEd	f974e41426	zfs(8) fixes Documentation fixes for zfs(8) * White space issue in the userused@user property section * zfs send supports using bookmarks as the origin snapshot Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #5906	2017-03-20 15:14:28 -07:00
Chunwei Chen	9b77d1c958	Fix nfs snapdir automount The current implementation for allowing nfs to access snapdir is very buggy. It uses a special fh for snapdirs, such that the next time nfsd does fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd never knows it cross a mountpoint. The problem is that nfsd will not hold a reference on the vfsmount of the snapshot. This cause auto unmounter to unmount the snapshot even though nfs is still holding dentries in it. To fix this, we return the inode for the snapdirs themselves. However, we also trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate and see the mountpoint and do crossmnt. Because nfsd will now be aware that these are different filesystems users must add crossmnt to their export options to access snapshot directories. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3794 Closes #4716 Closes #5810 Closes #5833	2017-03-08 09:26:33 -08:00
bunder2015	db4ed56538	Corrected highlight for zpool man page SS is already highlighted and the fB/fR tags break the highlighting prematurely, removing the tags highlights the entire line. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #5873	2017-03-07 13:01:39 -08:00
Olaf Faaland	3c9e0d673e	Dump unique configurations and Uberblocks in zdb -lu For zdb -l, detect when the configuration nvlist in some label l (l>0) is the same as a configuration already dumped. If so, do not dump it. Make a similar check when dumping Uberblocks for zdb -lu. Check whether a label already dumped contains an identical Uberblock. If so, do not dump the Uberblock. When dumping a configuration or Uberblock, state which labels it is found in (0-3), for example: labels = 1 2 3 Detecting redundant uberblocks or configurations is accomplished by calculating checksums of the uberblocks and the packed nvlists containing the configuration. If there is nothing unique to be dumped for a label (ie the configuration and uberblocks have checksums matching those already dumped) print nothing for that label. With additional l's or u's, increase verbosity as follows: -l Dump each unique configuration only once. Indicate which labels it appears in. -ll In addition, dump label space usage stats. -lll Dump every configuration, unique or not. -u Dump each unique, valid, uberblock only once. Indicate which labels it appears in. -uu In addition, state which slots are invalid. -uuu Dump every uberblock, unique or not. -uuuu Dump the uberblock blockpointer (used to be -uuu) Make exit values conform to the manual page. Failing to unpack a configuration nvlist is considered an error, as well as failing to open or read from the device. Add three tests, zdb_00{3,4,5}_pos to verify the above functionality. An example of the output: ------------------------------------ LABEL 0 ------------------------------------ version: 5000 name: 'pool' state: 1 txg: 880 < ... redacted ... > features_for_read: com.delphix:hole_birth com.delphix:embedded_data labels = 0 Uberblock[0] magic = 0000000000bab10c version = 5000 txg = 0 guid_sum = 3038694082047428541 timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017 labels = 0 1 2 3 Uberblock[4] magic = 0000000000bab10c version = 5000 txg = 772 guid_sum = 9045970794941528051 timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017 labels = 0 < ... redacted ... > ------------------------------------ LABEL 1 ------------------------------------ version: 5000 name: 'pool' state: 1 txg: 14 < ... redacted ... > com.delphix:embedded_data labels = 1 2 3 Uberblock[4] magic = 0000000000bab10c version = 5000 txg = 4 guid_sum = 7793930272573252584 timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017 labels = 1 2 3 < ... redacted ... > Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5738	2017-03-06 16:01:45 -08:00
Matthew Ahrens	c30e58c462	zfs_arc_num_sublists_per_state should be common to all multilists The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists. This tuning may be overridden on a per-multilist basis. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5764	2017-02-15 15:49:33 -08:00
Tony Hutter	b291029e86	Enclosure LED fixes - Pass $VDEV_ENC_SYSFS_PATH to 'zpool [iostat\|status] -c' to include enclosure LED sysfs path. - Set LEDs correctly after import. This includes clearing any erroniously set LEDs prior to the import, and setting the LED for any UNAVAIL drives. - Include symlink for vdev_attach-led.sh in Makefile.am. - Print the VDEV path in all-syslog.sh, and fix it so the pool GUID actually prints. Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #5716 Closes #5751	2017-02-10 16:09:45 -08:00
David Quigley	bef78122e6	Add missing module_param for zfs_per_txg_dirty_frees_percent When the code was added this tunable was not exposed via module params. Also it was not documented. This patch changes the type from a uint32 to a ulong as done with other percentage tunables and also documents it in the zfs-module-parameters man page. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Quigley <david.quigley@intel.com> Closes #5750	2017-02-07 09:44:03 -08:00
Don Brady	35a357a9ef	OpenZFS 6866 - zdb -l non-zero status if no label Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Don Brady <don.brady@intel.com> OpenZFS-issue: https://www.illumos.org/issues/6866 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3e4fae5 Closes #5730 Porting Notes: - Omitted the illumos-specific `/dev/dsk` and `/dev/rdsk` path conversions since they don't apply on linux.	2017-02-03 14:18:28 -08:00
Tim Chase	b81a3ddc32	Update deadman operation to better align with upstream OpenZFS The deadman in ZoL didn't behave quite as it did in upstream OpenZFS. In addition to the 2 purposes for which OpenZFS used the zfs_deadman_synctime_ms parameter, ZoL also used it to determine how frequently the deadman would fire once it has been triggered. This patch adds the zfs_deadman_checktime_ms parameter to control how frequently the subsequent checks are performed. The deadman is now disabled for suspended pools. As had been the case, unlike upstream OpenZFS, ZoL will not panic when a hung IO is detected. The module parameter documentation has been upated to include the new parameter and to better describe the operation of the deadmen. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5695	2017-01-31 14:19:08 -08:00
George Melikov	ed828c0c37	OpenZFS 7280 - Allow changing global libzpool variables in zdb and ztest through command line Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7280 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e60744 Closes #5676	2017-01-31 10:13:10 -08:00
George Melikov	fa603f8233	OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's Porting notes: - 'zfs_dbgmsg_print()' reintroduced to userspace. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7277 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2f Closes #5684	2017-01-28 12:16:43 -08:00
George Melikov	aeacdefedc	OpenZFS 7386 - zfs get does not work properly with bookmarks Authored by: Marcel Telka <marcel@telka.sk> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7386 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edb901a Closes #5666	2017-01-26 14:42:15 -08:00
Håkan Johansson	9ef3906a5a	Minor man-page formatting fixes fB -> \fB in zpool.8 (Properties -> cachefile) \fN -> \fB in zfs-module-parameters.5 (zfs_dirty_data_max_max_percent) Three \| -> \fR\|\fI fixes for arguments of diff and inherit in zfs.8. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5645	2017-01-24 09:09:02 -08:00
George Melikov	a0aacd3741	OpenZFS 7257 - zfs manpage user property length needs to be updated Since zpool version 16, this limit is actually 8192 characters. Additionally, this limit is actually 8192 bytes, as it supports UTF-8. Authored by: Eli Rosenthal <eli.rosenthal@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7257 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3bc7169 Closes #5608	2017-01-17 15:30:01 -08:00
George Melikov	fdbaf44ffb	OpenZFS 7276 - zfs(1m) manpage could better describe space properties Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7276 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d750135 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29c6739 Closes #5549	2017-01-13 13:31:29 -08:00
Don Brady	4e21fd060a	OpenZFS 7303 - dynamic metaslab selection This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based). Porting Notes: The metaslab allocation tracing code is conditionally removed on linux (dependent on mdb debugger). Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Don Brady <don.brady@intel.com> OpenZFS-issue: https://www.illumos.org/issues/7303 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd Closes #5404	2017-01-12 11:52:56 -08:00
ka7	4e33ba4c38	Fix spelling Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5547 Closes #5543	2017-01-03 11:31:18 -06:00
bunder2015	53ed2db212	Remove extra + from zfs man page Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #5508	2016-12-21 11:06:02 -08:00
Brian Behlendorf	27f2b90d3e	Revert "Disable zio_dva_throttle_enabled by default" Enable zio_dva_throttle_enabled=1 by default. Subsequent testing has been unable to reproduce the suspected regression. Tested-by: kernelOfTruth kerneloftruth@gmail.com Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov Reverts #5335 Closes #5289 Closes #5457	2016-12-08 13:57:42 -07:00
Chunwei Chen	493492559e	Limit number of tasks shown in taskq proc To prevent holding tq_lock for too long. Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq would easily cause a lockup. While that bug has been fixed. It's probably still a good idea to do this just in case task lists grow too large. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #586	2016-12-01 11:06:27 -07:00
Tony Hutter	8720e9e748	Add -c to zpool iostat & status to run command This patch adds a command (-c) option to zpool status and zpool iostat. The -c option allows you to run an arbitrary command on each vdev and display the first line of output in zpool status/iostat. The environment vars VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path" before running the command. For device mapper, multipath, or partitioned vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk. This can be useful if the command you're running requires a /dev/sd* device. The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device instead of using libdevmapper. This not only removes the libdevmapper requirement at build time, but also allows you to resolve device mapper devices without being root. This means that UDEV_UPATH get set correctly when running zpool status/iostat as an unprivileged user. Example: $ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH' NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 mpatha ONLINE 0 0 0 I am /dev/mapper/mpatha, /dev/sdc sdb ONLINE 0 0 0 I am /dev/sdb1, /dev/sdb Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #5368	2016-11-29 14:45:38 -07:00
LOLi	2f71caf2d9	Allow zfs unshare <protocol> -a Allow `zfs unshare <protocol> -a` command to share or unshare all datasets of a given protocol, nfs or smb. Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases. To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some function wrappers were added around them. Finally, fix and issue in smb_is_share_active() that would leave SMB shares exported when invoking 'zfs unshare -a' Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #3238 Closes #5367	2016-11-29 12:22:38 -07:00
DeHackEd	6146e17ebe	Fix man page formatting in zfs-module-parameters Bold and Normal codes were mixed up in a few places resulting in bad highlighting. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #5397	2016-11-14 17:03:57 -08:00
Håkan Johansson	1c91161258	Repair indent of zpool.8 man page Repair indent of zpool.8 man page, just before zpool labelclear details. Accidentally introduced by `193a37cb2` (git bisect). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5394	2016-11-14 09:47:49 -08:00
Romain Dolbeau	7f547f85fe	Add parity generation/rebuild using AVX-512 for x86-64 avx512f should work on all AVX512 hardware, since it only uses Foundation instructions. avx512bw should be faster on hardware supporting the AVW512BW extension. We can use full-width pshufb (instead of relying on the 256 bits AVX2 pshufb). As a side-effect, the code is also unrolled more. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name> Closes #5219	2016-11-02 12:40:23 -07:00
Brian Behlendorf	76a87a902e	Disable zio_dva_throttle_enabled by default Until it can be determined definitively that a performance regression wasn't introduced accidentally by `3dfb57a` this functionality is being disabled by default. It can be re- enabled by setting zio_dva_throttle_enabled=1. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5335 Issue #5289	2016-10-26 09:13:43 -07:00
LOLi	e4010f2719	Allow for '-o feature@<feature>=disabled' on the command line Sometimes it is desirable to specifically disable one or several features directly on the 'zpool create' command line. $ zpool create -o feature@<feature>=disabled ... Original-patch-by: Turbo Fredriksson <turbo@bayour.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #3460 Closes #5142 Closes #5324	2016-10-25 16:17:47 -07:00
Romain Dolbeau	24cdeaf12e	Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bits This is not useful on micro-architecture with a weak NEON implementation (only 64 bits); the native version is slower & the byteswap barely faster than scalar. On A53 or A57, it's a small improvement on scalar but OK for byteswap. Results from an A53 system: 0 0 0x01 -1 0 1499068294333000 1499101101878000 implementation native byteswap scalar 1008227510 755880264 aarch64_neon 1198098720 1044818671 fastest aarch64_neon aarch64_neon Results from a A57 system: 0 0 0x01 -1 0 4407214734807033 4407233933777404 implementation native byteswap scalar 2302071241 1124873346 aarch64_neon 2542214946 2245570352 fastest aarch64_neon aarch64_neon Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #5248	2016-10-21 10:55:49 -07:00
Tony Hutter	6078881aa1	Multipath autoreplace, control enclosure LEDs, event rate limiting 1. Enable multipath autoreplace support for FMA. This extends FMA autoreplace to work with multipath disks. This requires libdevmapper to be installed at build time. 2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must be supported by the Linux SES driver for this to work. The enclosure LED scripts work for multipath devices as well. The scripts will clear the LED when the fault is cleared. 3. Rate limit ZIO delay and checksum events so as not to flood ZED ZIO delay and checksum events are rate limited to 5/sec in the zfs module. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #2449 Closes #3017 Closes #5159	2016-10-19 12:55:59 -07:00
Don Brady	3dfb57a35e	OpenZFS 7090 - zfs should throttle allocations OpenZFS 7090 - zfs should throttle allocations Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Ported-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. OpenZFS-issue: https://www.illumos.org/issues/7090 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7 Closes #5258 Porting Notes: - Maintained minimal stack in zio_done - Preserve linux-specific io sizes in zio_write_compress - Added module params and documentation - Updated to use optimize AVL cmp macros	2016-10-13 17:59:18 -07:00
Brian Behlendorf	7515f8f63d	Fix file permissions The following new test cases need to have execute permissions set: userquota/groupspace_003_pos.ksh userquota/userquota_013_pos.ksh userquota/userspace_003_pos.ksh upgrade/upgrade_userobj_001_pos.ksh upgrade/setup.ksh upgrade/cleanup.ksh The following source files accidentally were marked executable: lib/libzpool/kernel.c lib/libshare/nfs.c lib/libzfs/libzfs_dataset.c lib/libzfs/libzfs_util.c tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c cmd/zed/zed_exec.c module/icp/core/kcf_sched.c module/zfs/dsl_pool.c module/zfs/arc.c module/nvpair/nvpair.c man/man5/zfs-module-parameters.5 Reviewed-by: GeLiXin <ge.lixin@zte.com.cn> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5241	2016-10-08 14:57:56 -07:00
Jinshan Xiong	1de321e626	Add support for user/group dnode accounting & quota This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting). A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnode as dirty. ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500 Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>	2016-10-07 09:45:13 -07:00
ilovezfs	125a406e24	OpenZFS 6585 - sha512, skein, and edonr have an unenforced dependency on extensible dataset Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Tony Hutter <hutter2@llnl.gov> In any pool without the extensible dataset feature flag already enabled, creating a dataset with dedup set to use one of the new checksums would result in the following panic as soon as any data was added: panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature, &refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390 Inpsection showed that feature->fi_feature was 7, which is the value of SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum. This commit adds extensible dataset as a dependency for the sha512, edonr, and skein feature flags, which prevents the panic. OpenZFS-issue: https://www.illumos.org/issues/6585 OpenZFS-commit: `892586e8a1` Porting Notes: This code was originally from Illumos, but I actually ported it from: openzfsonosx/zfs@b62a652	2016-10-03 14:51:21 -07:00
Tony Hutter	3c67d83a8a	OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> Ported by: Tony Hutter <hutter2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Porting Notes: This code is ported on top of the Illumos Crypto Framework code: `b5e030c8db` The list of porting changes includes: - Copied module/icp/include/sha2/sha2.h directly from illumos - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init) - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter. - Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c - Added skein & edonr to libicp/Makefile.am - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos. - Updated ztest.c with new fletcher_4_() args; used NULL for new CTX argument. - In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section to not #include the non-existant endian.h. - In skein_test.c, renane NULL to 0 in "no test vector" array entries to get around a compiler warning. - Fixup test files: - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>, - Remove <note.h> and define NOTE() as NOP. - Define u_longlong_t - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p" - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Remove "for isa in $($ISAINFO); do" stuff - Add/update Makefiles - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h. - EXPORT_SYMBOL _Init/_Update/_Final... routines in ICP modules. - Update scripts/zfs2zol-patch.sed - include <sys/sha2.h> in sha2_impl.h - Add sha2.h to include/sys/Makefile.am - Add skein and edonr dirs to icp Makefile - Add new checksums to zpool_get.cfg - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable() - Fix -Wuninitialized error in edonr_byteorder.h on PPC - Fix stack frame size errors on ARM32 - Don't unroll loops in Skein on 32-bit to save stack space - Add memory barriers in sha2.c on 32-bit to save stack space - Add filetest_001_pos.ksh checksum sanity test - Add option to write psudorandom data in file_write utility	2016-10-03 14:51:15 -07:00
Romain Dolbeau	62a65a654e	Add parity generation/rebuild using 128-bits NEON for Aarch64 This re-use the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitly disable vector registers, they have to be locally re-enabled explicitly. As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique. Even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fails in debug mode. This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro. It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge. The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unroll some functions some more. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #4801	2016-10-03 09:44:00 -07:00
Brian Behlendorf	9ea9e0b9a1	Enable ignore_hole_birth module option by default Enable ignore_hole_birth by default until all known hole birth bugs have been resolved and relevant test cases added. Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4809 Closes #5099	2016-09-16 14:05:30 -07:00
Dan Kimmel	2aa34383b9	DLPX-40252 integrate EP-476 compressed zfs send/receive Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:58:58 -07:00
George Wilson	d3c2ae1c08	OpenZFS 6950 - ARC should cache compressed data Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently access data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is inuse it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled that the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078	2016-09-13 09:58:33 -07:00
GeLiXin	9907cc1cc8	Add zfs_arc_meta_limit_percent tunable ARC will evict meta buffers that exceed the arc_meta_limit. Before a further investigating on whether we should take special protection on meta buffers, this tunable make arc_meta_limit adjustable for different workloads. People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko, so some range check is added to guarantee a suitable arc_meta_limit. Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style tunable as well. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4957	2016-08-23 13:03:01 -07:00
Gvozden Neskovic	70b258fc96	Fletcher4 implementation using avx512f instruction set Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop iteration. Size alignment of main fletcher4 methods is adjusted accordingly. New implementation is called 'avx512f'. Note: byteswap method can be implemented more efficiently when avx512bw hardware becomes available. Currently, it is ~ 2x slower than native method. Table shows result of full (native) fletcher4 calculation for different buffer size: fletcher4 4KB 16KB 64KB 128KB 256KB 1MB 16MB -------------------------------------------------------------------- [scalar] 1213 1228 1231 1231 1225 1200 1160 [sse2] 2374 2442 2459 2456 2462 2250 2220 [avx2] 4288 4753 4871 4893 4900 4050 3882 [avx512f] 5975 8445 9196 9221 9262 6307 5620 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4952	2016-08-16 14:11:14 -07:00
Rich Ercolani	6d836e6f8b	Add tunable to ignore hole_birth Adds a module option which disables the hole_birth optimization which has been responsible for several recent bugs, including issue #4050. Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4833	2016-08-15 09:52:56 -07:00
Tim Chase	25458cbef9	Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4345 Issue #4512 Issue #4773 Closes #4858	2016-07-25 15:26:38 -07:00
Gvozden Neskovic	c9187d867f	Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4813 Issue #1392	2016-07-19 16:43:07 -07:00
Tyler J. Stachecki	35a76a0366	Implementation of SSE optimized Fletcher-4 Builds off of `1eeb4562` (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4789	2016-07-15 10:42:35 -07:00
Gvozden Neskovic	ae25d22235	Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4783	2016-07-13 10:24:55 -07:00
Paul Dagnelie	e6d3a843d6	OpenZFS 6393 - zfs receive a full send as a clone Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e	2016-06-28 13:47:03 -07:00
Matthew Ahrens	47dfff3b86	OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f Porting notes: - All rsend and snapshop tests enabled and updated for Linux. - Fix misuse of input argument in traverse_visitbp(). - Fix ISO C90 warnings and errors. - Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning. - Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported. - Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream. - Fix mktree xattr creation, 'user.' prefix required. - Minor fixes to newly enabled test cases - Long holds for volumes allowed during receive for minor registration.	2016-06-28 13:47:02 -07:00
Ned Bass	50c957f702	Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3542	2016-06-24 13:13:21 -07:00
Gvozden Neskovic	ab9f4b0b82	SIMD implementation of vdev_raidz generate and reconstruct routines This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ level. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. Original implementation is still present and can be selected via module parameter. Patch contains: - specialized gen/rec routines for all RAIDZ levels, - new scalar raidz implementation (unrolled), - two x86_64 SIMD implementations (SSE and AVX2 instructions sets), - fastest routines selected on module load (benchmark). - cmd/raidz_test - verify and benchmark all implementations - added raidz_test to the ZFS Test Suite New zfs module parameters: - zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept first 3 options, and the other implementations can be set once module is finished loading. Possible values for this option are: "fastest" - use the fastest math available "original" - use the original raidz code "scalar" - new scalar impl "sse" - new SSE impl if available "avx2" - new AVX2 impl if available See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4328	2016-06-21 09:27:26 -07:00
Brian Behlendorf	f74b821a66	Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non- privileged users should be able to run all of the following commands: * zpool [list \| iostat \| status \| get] * zfs [list \| get] Historically this functionality was not available on Linux. In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used. Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future. This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensures only an authorized user can perform it. Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using the both the .ksh and .sh suffixes. Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #362 Closes #434 Closes #4100 Closes #4394 Closes #4410 Closes #4487	2016-06-07 09:16:52 -07:00
Jinshan Xiong	1eeb4562a7	Implementation of AVX2 optimized Fletcher-4 New functionality: - Preserves existing scalar implementation. - Adds AVX2 optimized Fletcher-4 computation. - Fastest routines selected on module load (benchmark). - Test case for Fletcher-4 added to ztest. New zcommon module parameters: - zfs_fletcher_4_impl (str): selects the implementation to use. "fastest" - use the fastest version available "cycle" - cycle trough all available impl for ztest "scalar" - use the original version "avx2" - new AVX2 implementation if available Performance comparison (Intel i7 CPU, 1MB data buffers): - Scalar: 4216 MB/s - AVX2: 14499 MB/s See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Andreas Dilger <andreas.dilger@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4330	2016-06-02 14:30:51 -07:00
Tony Hutter	26ef0cc7db	OpenZFS 6531 - Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130 Porting notes: - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file. - Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function. - Updated zinject man page - Updated zpool_scrub test files	2016-05-26 10:11:51 -07:00
Tony Hutter	7e945072d1	Add request size histograms (-r) to zpool iostat, minor man page fix Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working. $ zpool iostat -r mypool sync_read sync_write async_read async_write scrub req_size ind agg ind agg ind agg ind agg ind agg ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 512 0 0 0 0 0 0 530 0 0 0 1K 0 0 260 0 0 0 116 246 0 0 2K 0 0 0 0 0 0 0 431 0 0 4K 0 0 0 0 0 0 3 107 0 0 8K 15 0 35 0 0 0 0 6 0 0 16K 0 0 0 0 0 0 0 39 0 0 32K 0 0 0 0 0 0 0 0 0 0 64K 20 0 40 0 0 0 0 0 0 0 128K 0 0 20 0 0 0 0 0 0 0 256K 0 0 0 0 0 0 0 0 0 0 512K 0 0 0 0 0 0 0 0 0 0 1M 0 0 0 0 0 0 0 0 0 0 2M 0 0 0 0 0 0 0 0 0 0 4M 0 0 0 0 0 0 155 19 0 0 8M 0 0 0 0 0 0 0 811 0 0 16M 0 0 0 0 0 0 0 68 0 0 -------------------------------------------------------------------------------- Also rename the stray "-G" in the man page to be "-w" for latency histograms. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4659	2016-05-25 15:49:35 -07:00
DHE	8342673502	Improve zfs-module-parameters(5) Various rewrites to the descriptions of module parameters. Corrects spelling mistakes, makes descriptions them more user-friendly and describes some ZFS quirks which should be understood before changing parameter values. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4671	2016-05-23 11:08:45 -07:00
Christer Ekholm	3491d6eb06	Consistently use parsable instead of parseable This is a purely cosmetical change, to consistently prefer one of two (both acceptable) choises for the word parsable in documentation and code. I don't really care which to use, but acording to wiktionary https://en.wiktionary.org/wiki/parsable#English parsable is preferred. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4682	2016-05-23 10:20:42 -07:00
Brian Behlendorf	81b4c075ec	Remove additional cruft from manpages These changes should have been part of the original `930b0d4` commit but were overlooked because `193a37c` had not yet been merged when the original change was ported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4631	2016-05-17 15:36:46 -07:00
Richard Laager	61a3d06f84	zfs.8: Relative paths must start with ./ Simply containing a slash is not enough, presumably because foo/bar could be either a dataset or a mountpoint. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #4655	2016-05-16 14:19:57 -07:00
Matthew Ahrens	c5ee751394	Illumos 1644 add ZFS "clones" property Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/1644 Ported-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:32 -07:00
Richard Laager	930b0d4c77	Illumos 1502 Remove conversion cruft from manpages Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed by: Garrett D'Amore <garrett.damore@gmail.com> References: https://www.illumos.org/issues/1502 Ported-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Richard Laager <rlaager@wiktel.com> Conflicts: man/man8/zpool.8	2016-05-16 12:26:31 -07:00
Ruben Kerkhof	32a6c3d756	zfs.8 & mount.zfs.8: fix a few typos filesytem -> filesystem defntext -> defcontext Signed-off-by: Ruben Kerkhof <ruben@rubenkerkhof.com>	2016-05-16 12:26:31 -07:00
Richard Laager	d919da83fa	zfs.8 & zpool.8: Standardize property value order The default value is now always listed first. Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:31 -07:00
Richard Laager	8fd888baa7	zfs.8 & zpool.8: Various documentation edits Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:31 -07:00
Richard Laager	8e07f9a916	zfs.8: Improve zfs upgrade documentation Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:31 -07:00
Richard Laager	9ef21991f3	zfs.8: Cleanup stray code Bad copy-and-paste? Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:31 -07:00
Richard Laager	8c5edae993	zfs.8 & zpool.8: Drop legal/illegal There's a convention in documentation that these words not be used to mean "invalid". Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:31 -07:00
Richard Laager	9bb3e153c4	zfs.8: Fix minor typos and the like This commit only contains the most trivial of changes. Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:30 -07:00
Richard Laager	879dbef094	zfs.8: Rework native vs user properties Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:30 -07:00
Richard Laager	6a107f4199	zfs.8 & zpool.8: Linux/Solaris differences Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:30 -07:00
Richard Laager	a5eb2d8746	zfs.8: Improve mount option documentation This change is primarily about adding inline references in the properties section to the traditional mount option names. There are some other editorial changes too. Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:30 -07:00
Richard Laager	7e0754c675	zfs.8: Improve consistency in size documentation Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:30 -07:00
Richard Laager	cab1aa295e	zfs.8: Drop references to Oracle documentation Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:30 -07:00
Richard Laager	76281da4eb	zfs.8: zfs get and zfs list accept mountpoints Signed-off-by: Richard Laager <rlaager@wiktel.com>	2016-05-16 12:26:29 -07:00
Tony Hutter	193a37cb24	Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues Update the zfs module to collect statistics on average latencies, queue sizes, and keep an internal histogram of all IO latencies. Along with this, update "zpool iostat" with some new options to print out the stats: -l: Include average IO latencies stats: total_wait disk_wait syncq_wait asyncq_wait scrub read write read write read write read write wait ----- ----- ----- ----- ----- ----- ----- ----- ----- - 41ms - 2ms - 46ms - 4ms - - 5ms - 1ms - 1us - 4ms - - 5ms - 1ms - 1us - 4ms - - - - - - - - - - - 49ms - 2ms - 47ms - - - - - - - - - - - - - 2ms - 1ms - - - 1ms - ----- ----- ----- ----- ----- ----- ----- ----- ----- 1ms 1ms 1ms 413us 16us 25us - 5ms - 1ms 1ms 1ms 413us 16us 25us - 5ms - 2ms 1ms 2ms 412us 26us 25us - 5ms - - 1ms - 413us - 25us - 5ms - - 1ms - 460us - 29us - 5ms - 196us 1ms 196us 370us 7us 23us - 5ms - ----- ----- ----- ----- ----- ----- ----- ----- ----- -w: Print out latency histograms: sdb total disk sync_queue async_queue latency read write read write read write read write scrub ------- ------ ------ ------ ------ ------ ------ ------ ------ ------ 1ns 0 0 0 0 0 0 0 0 0 ... 33us 0 0 0 0 0 0 0 0 0 66us 0 0 107 2486 2 788 12 12 0 131us 2 797 359 4499 10 558 184 184 6 262us 22 801 264 1563 10 286 287 287 24 524us 87 575 71 52086 15 1063 136 136 92 1ms 152 1190 5 41292 4 1693 252 252 141 2ms 245 2018 0 50007 0 2322 371 371 220 4ms 189 7455 22 162957 0 3912 6726 6726 199 8ms 108 9461 0 102320 0 5775 2526 2526 86 17ms 23 11287 0 37142 0 8043 1813 1813 19 34ms 0 14725 0 24015 0 11732 3071 3071 0 67ms 0 23597 0 7914 0 18113 5025 5025 0 134ms 0 33798 0 254 0 25755 7326 7326 0 268ms 0 51780 0 12 0 41593 10002 10002 0 537ms 0 77808 0 0 0 64255 13120 13120 0 1s 0 105281 0 0 0 83805 20841 20841 0 2s 0 88248 0 0 0 73772 14006 14006 0 4s 0 47266 0 0 0 29783 17176 17176 0 9s 0 10460 0 0 0 4130 6295 6295 0 17s 0 0 0 0 0 0 0 0 0 34s 0 0 0 0 0 0 0 0 0 69s 0 0 0 0 0 0 0 0 0 137s 0 0 0 0 0 0 0 0 0 ------------------------------------------------------------------------------- -h: Help -H: Scripted mode. Do not display headers, and separate fields by a single tab instead of arbitrary space. -q: Include current number of entries in sync & async read/write queues, and scrub queue: syncq_read syncq_write asyncq_read asyncq_write scrubq_read pend activ pend activ pend activ pend activ pend activ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 0 0 78 29 0 0 0 0 0 0 0 0 78 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 227 394 0 19 0 0 0 0 0 0 227 394 0 19 0 0 0 0 0 0 108 98 0 19 0 0 0 0 0 0 19 98 0 0 0 0 0 0 0 0 78 98 0 0 0 0 0 0 0 0 19 88 0 0 0 0 0 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -p: Display numbers in parseable (exact) values. Also, update iostat syntax to allow the user to specify specific vdevs to show statistics for. The three options for choosing pools/vdevs are: Display a list of pools: zpool iostat ... [pool ...] Display a list of vdevs from a specific pool: zpool iostat ... [pool vdev ...] Display a list of vdevs from any pools: zpool iostat ... [vdev ...] Lastly, allow zpool command "interval" value to be floating point: zpool iostat -v 0.5 Signed-off-by: Tony Hutter <hutter2@llnl.gov Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4433	2016-05-12 12:36:32 -07:00
Adam Stevko	2a8b84b747	OpenZFS 3993, 4700 3993 zpool(1M) and zfs(1M) should support -p for "list" and "get" 4700 "zpool get" doesn't support -H or -o options Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/3993 OpenZFS-issue: https://www.illumos.org/issues/4700 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352 Porting notes: I removed ZoL's zpool_get_prop_literal() in favor of zpool_get_prop(..., boolean_t literal) since that's what OpenZFS uses. The functionality is the same.	2016-05-11 11:49:37 -07:00
Don Brady	39fc0cb557	Add support for devid and phys_path keys in vdev disk labels This is foundational work for ZED. Updates a leaf vdev's persistent device strings on Linux platform * only applies for a dedicated leaf vdev (aka whole disk) * updated during pool create\|add\|attach\|import * used for matching device matching during auto-{online,expand,replace} * stored in a leaf disk config label (i.e. alongside 'path' NVP) * can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES Some examples: path: '/dev/sdb1' devid: 'scsi-350000394a8ca4fbc-part1' phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0' path: '/dev/mapper/mpatha' devid: 'dm-uuid-mpath-35000c5006304de3f' Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2856 Closes #3978 Closes #4416	2016-03-31 13:45:53 -07:00
Richard Laager	1c0120832c	Correct typo in spa_load_verify_metadata docs Signed-off-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4471	2016-03-29 18:33:16 -07:00
Brian Behlendorf	7d11e37e55	Require libblkid Historically libblkid support was detected as part of configure and optionally enabled. This was done because at the time support for detecting ZFS pool vdevs had just be added to libblkid and those updated packages were not yet part of many distributions. This is no longer the case and any reasonably current distribution will ship a version of libblkid which can detect ZFS pool vdevs. This patch makes libblkid mandatory at build time and libblkid the preferred method of scanning for ZFS pools. For distributions which include a modern version of libblkid there is no change in behavior. Explicitly scanning the default search paths is still supported and can be enabled with the '-s' command line option. Additionally making libblkid mandatory means that the 'zpool create' command can reliably detect if a specified device has an existing non-ZFS filesystem (ext4, xfs) and print a warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2448	2016-03-09 10:39:22 -08:00
smh	9f500936c8	FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. The new algorithm takes into the following additional factors: * Load of the vdevs (number outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered such as; is the device backing the vdev a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parrallel dd's. With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates. The following additional sysctls where added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes where based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay References: https://github.com/freebsd/freebsd@5c7a6f5d https://github.com/freebsd/freebsd@31b7f68d https://github.com/freebsd/freebsd@e186f564 Performance Testing: https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141 Porting notes: - The tunables were adjusted to have ZoL-style names. - The code was modified to use ZoL's vd_nonrot. - Fixes were done to make cstyle.pl happy - Merge conflicts were handled manually - freebsd/freebsd@e186f564bc by my collegue Andriy Gapon has been included. It applied perfectly, but added a cstyle regression. - This replaces `556011dbec` entirely. - A typo "IO'a" has been corrected to say "IO's" - Descriptions of new tunables were added to man/man5/zfs-module-parameters.5. Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4334	2016-02-26 11:24:35 -08:00
Brian Behlendorf	a77f29f93c	Change full path subcommand flag from -p to -P Commit `d2f3e29` introduced the -p option which outputs full paths for vdevs to multiple zpool subcommands. When this was merged there was no conflict for this flag letter. However it's certain there will be a conflict with the -p (parsable) flag used by other subcommands. Therefore, -p is being changed to -P to avoid this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4368	2016-02-26 09:06:26 -08:00
Richard Yao	d2f3e292dc	Add -gLp to zpool subcommands for alt vdev names The following options have been added to the zpool add, iostat, list, status, and split subcommands. The default behavior was not modified, from zfs(8). -g Display vdev GUIDs instead of the normal short device names. These GUIDs can be used in-place of device names for the zpool detach/off‐ line/remove/replace commands. -L Display real paths for vdevs resolving all symbolic links. This can be used to lookup the current block device name regardless of the /dev/disk/ path used to open it. -p Display full paths for vdevs instead of only the last component of the path. This can be used in conjunction with the -L flag. This behavior may also be enabled using the following environment variables. ZPOOL_VDEV_NAME_GUID ZPOOL_VDEV_NAME_FOLLOW_LINKS ZPOOL_VDEV_NAME_PATH This change is based on worked originally started by Richard Yao to add a -g option. Then extended by @ilovezfs to add a -L option for openzfsonosx. Those changes have been merged, re-factored, a -p option added and extended to all relevant zpool subcommands. Original-patch-by: Richard Yao <ryao@gentoo.org> Extended-by: ilovezfs <ilovezfs@icloud.com> Extended-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: ilovezfs <ilovezfs@icloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2011 Closes #4341	2016-02-25 11:58:39 -08:00
Brian Behlendorf	8a09d5fd46	Add l2arc_max_block_size tunable Set a limit for the largest compressed block which can be written to an L2ARC device. By default this limit is set to 16M so there is no change in behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4323	2016-02-25 09:44:00 -08:00
Chunwei Chen	8f3b403a73	Allow kicking a taskq to spawn more threads This patch add a module parameter spl_taskq_kick. When writing non-zero value to it, it will scan all the taskq, if a taskq contains a task pending for more than 5 seconds, it will be forced to spawn a new thread. This is use as an emergency recovery from deadlock, not a general solution. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #529	2016-02-05 14:08:31 -08:00
kernelOfTruth	a966c5640e	Reintroduce zfs_remove() synchronous deletes Reintroduce a slightly adapted version of the Illumos logic for synchronous unlinks. The basic idea here is that only files smaller than zfs_delete_blocks (20480) blocks should be deleted synchronously. Unlinking larger files should be handled asynchronously to minimize impact to the caller. To accomplish this iput() which is responsible for calling zfs_znode_delete() on Linux is only called in the delete_now path. Otherwise zfs_async_iput() is used which allows the last reference to be dropped by a taskq thread effectively making the removal asynchronous. Porting notes: - Add zfs_delete_blocks module option for performance analysis. The default value is DMU_MAX_DELETEBLKCNT which is the same as upstream. Reducing this value means that smaller files will be unlinked asynchronously like large files. - All occurrences of zfsvfs changes to zsb. Ported-by: KernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:26:02 -08:00
George Wilson	ba5ad9a48d	Illumos 6251 - add tunable to disable free_bpobj processing 6251 - add tunable to disable free_bpobj processing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Albert Lee <trisk@omniti.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/6251 https://github.com/illumos/illumos-gate/commit/139510f Porting notes: - Added as module option declaration. - Added to zfs-module-parameters.5 man page. Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-25 13:15:17 -08:00
Brian Behlendorf	89666a8e1c	Increase default user space stack size Under RHEL6/CentOS6 the default stack size must be increased to 32K to prevent overflowing the stack when running ztest. This isn't an issue for other distributions due to either the version of pthreads or perhaps the compiler. Doubling the stack size resolves the issue safely for all distribution and leaves us some headroom. $ sudo -E ztest -V -T 300 -f /var/tmp/ 5 vdevs, 7 datasets, 23 threads, 300 seconds... loading space map for vdev 0 of 1, metaslab 0 of 30 ... ... loading space map for vdev 0 of 1, metaslab 14 of 30 ... child died with signal 11 Exited ztest with error 3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4215	2016-01-13 13:55:12 -08:00
Brian Behlendorf	d297a5a3a1	Add spl_kmem_cache_kmem_threads man page entry The spl_kmem_cache_kmem_threads module option was accidentally omitted from the documentation. Add it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #512	2016-01-12 15:04:37 -08:00
Matthew Ahrens	7f60329a26	Illumos 5987 - zfs prefetch code needs work 5987 zfs prefetch code needs work Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/5987 zfs prefetch code needs work illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work Porting notes: - [module/zfs/dbuf.c] - `5f6d0b6` Handle block pointers with a corrupt logical size - [module/zfs/dmu_zfetch.c] - `c65aa5b` Fix gcc missing parenthesis warnings - `428870f` Update core ZFS code from build 121 to build 141. - `79c76d5` Change KM_PUSHPAGE -> KM_SLEEP - `b8d06fc` Switch KM_SLEEP to KM_PUSHPAGE - Account for ISO C90 - mixed declarations and code - warnings - Module parameters (new/changed): - Replaced zfetch_block_cap with zfetch_max_distance (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024)) - Preserved zfs_prefetch_disable as 'int' for consistency with existing Linux module options. - [include/sys/trace_arc.h] - Added new tracepoints - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async); - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch); - [man/man5/zfs-module-parameters.5] - Updated man page Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 09:02:33 -08:00
Matthew Ahrens	9867e8be2a	Illumos 4891 - want zdb option to dump all metadata 4891 want zdb option to dump all metadata Reviewed by: Sonu Pillai <sonu.pillai@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> We'd like a way for zdb to dump metadata in a machine-readable format, so that we can bring that back from a customer site for in-house diagnosis. Think of it as a crash dump for zpools, which can be used for post-mortem analysis of a malfunctioning pool References: https://www.illumos.org/issues/4891 https://github.com/illumos/illumos-gate/commit/df15e41 Porting notes: - [cmd/zdb/zdb.c] - `a5778ea` zdb: Introduce -V for verbatim import - In main() getopt 'opt' variable removed and the code was brought back in line with illumos. - [lib/libzpool/kernel.c] - `1e33ac1` Fix Solaris thread dependency by using pthreads - `f0e324f` Update utsname support - `4d58b69` Fix vn_open/vn_rdwr error handling - In vn_open() allocate 'dumppath' on heap instead of stack - Properly handle 'dump_fd == -1' error path - Free 'realpath' after added vn_dumpdir_code block Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 11:36:54 -08:00
nathancheek	53e0313506	Man page whitespace Signed-off-by: nathancheek <myself@nathancheek.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4184	2016-01-11 09:55:50 -08:00
Paul Dagnelie	fcff0f35bd	Illumos 5960, 5925 5960 zfs recv should prefetch indirect blocks 5925 zfs receive -o origin= Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/5960 https://www.illumos.org/issues/5925 https://github.com/illumos/illumos-gate/commit/a2cdcdd Porting notes: - [lib/libzfs/libzfs_sendrecv.c] - `b8864a2` Fix gcc cast warnings - `325f023` Add linux kernel device support - `5c3f61e` Increase Linux pipe buffer size on 'zfs receive' - [module/zfs/zfs_vnops.c] - `3558fd7` Prototype/structure update for Linux - `c12e3a5` Restructure zfs_readdir() to fix regressions - [module/zfs/zvol.c] - Function @zvol_map_block() isn't needed in ZoL - `9965059` Prefetch start and end of volumes - [module/zfs/dmu.c] - Fixed ISO C90 - mixed declarations and code - Function dmu_prefetch() 'int i' is initialized before the following code block (c90 vs. c99) - [module/zfs/dbuf.c] - `fc5bb51` Fix stack dbuf_hold_impl() - `9b67f60` Illumos 4757, 4913 - `34229a2` Reduce stack usage for recursive traverse_visitbp() - [module/zfs/dmu_send.c] - Fixed ISO C90 - mixed declarations and code - `b58986e` Use large stacks when available - `241b541` Illumos 5959 - clean up per-dataset feature count code - `77aef6f` Use vmem_alloc() for nvlists - `00b4602` Add linux kernel memory support Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-08 15:08:19 -08:00
Chris Williamson	23de906c72	Illumos 5745 - zfs set allows only one dataset property to be set at a time 5745 zfs set allows only one dataset property to be set at a time Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Reviewed by: Richard PALO <richard@NetBSD.org> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Rich Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5745 https://github.com/illumos/illumos-gate/commit/3092556 Porting notes: - Fix the missing braces around initializer, zfs_cmd_t zc = {"\0"}; - Remove extra format argument in zfs_do_set() - Declare at the top: - zfs_prop_t prop; - nvpair_t elem; - nvpair_t next; - int i; - Additionally initialize: - int added_resv = 0; - zfs_prop_t prop = 0; - Assign 0 install of NULL for uint64_t types. - zc->zc_nvlist_conf = '\0'; - zc->zc_nvlist_src = '\0'; - zc->zc_nvlist_dst = '\0'; Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3574	2015-12-29 16:59:26 -08:00
DHE	dcb6bed1df	Make zio_taskq_batch_pct user configurable Adds zio_taskq_batch_pct as an exported module parameter, allowing users to modify it at module load time. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4110	2015-12-18 13:46:23 -08:00
Ned Bass	6b4e21c60e	Man page white space and spelling corrections Correct some misspelled words and grammatical errors, and remove trailing white space in the man pages. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4115	2015-12-18 13:33:37 -08:00
Turbo Fredriksson	b23d54305a	Remove shareiscsi description and example from zfs(8). Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-10-13 16:52:29 -07:00
Turbo Fredriksson	291b06c37e	Unmount is part of the shutdown process, not the boot process. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #3762	2015-10-13 16:52:02 -07:00
Don Brady	56b3986316	Add large block support to zpios(1) benchmark As part of the large block support effort, it makes sense to add support for large blocks to zpios(1). The specifying of a zfs block size for zpios is optional and will default to 128K if the block size is not specified. `zpios ... -S size \| --blocksize size ...` This will use size ZFS blocks for each test, specified as a comma delimited list with an optional unit suffix. The supported range is powers of two from 128K through 16M. A range of block sizes can be tested as follows: `-S 128K,256K,512K,1M` Example run below (non realistic results from a VM and output abbreviated for space) ``` --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K --threaddelay=0 --cleanup --human-readable --verbose --cleanup --blocksize=128K,256K,512K,1M th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw --------------------------------------------------------------------- 4 750 8m 1m 128k 5g 90.06m 5g 93.37m 4 750 8m 1m 256k 5g 79.71m 5g 99.81m 4 750 8m 1m 512k 5g 42.20m 5g 93.14m 4 750 8m 1m 1m 5g 35.51m 5g 89.36m 8 750 8m 1m 128k 5g 85.49m 5g 90.81m 8 750 8m 1m 256k 5g 61.42m 5g 99.24m 8 750 8m 1m 512k 5g 49.09m 5g 108.78m 16 750 8m 1m 128k 5g 86.28m 5g 88.73m 16 750 8m 1m 256k 5g 64.34m 5g 93.47m 16 750 8m 1m 512k 5g 68.84m 5g 124.47m 16 750 8m 1m 1m 5g 53.97m 5g 97.20m --------------------------------------------------------------------- ``` Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3795 Closes #2071	2015-09-22 09:13:20 -07:00
Ned Bass	3af56fd95f	Honor xattr=sa dataset property ZFS incorrectly uses directory-based extended attributes even when xattr=sa is specified as a dataset property or mount option. Support to honor temporary mount options including "xattr" was added in commit `0282c4137e`. There are two issues with the mount option handling: * Libzfs has historically included "xattr" in its list of default mount options. This overrides the dataset property, so the dataset is always configured to use directory-based xattrs even when the xattr dataset property is set to off or sa. Address this by removing "xattr" from the set of default mount options in libzfs. * There was no way to enable system attribute-based extended attributes using temporary mount options. Add the mount options "saxattr" and "dirxattr" which enable the xattr behavior their names suggest. This approach has the advantages of mirroring the valid xattr dataset property values and following existing conventions for mount option names. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3787	2015-09-19 14:04:14 -07:00
Brian Behlendorf	9965059ab9	Prefetch start and end of volumes When adding a zvol to the system prefetch zvol_prefetch_bytes from the start and end of the volume. Prefetching these regions of the volume is desirable because they are likely to be accessed immediately by blkid(8), the kernel scanning for a partition table, or another task which probes the devices. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3659	2015-09-09 14:38:29 -07:00
Brian Behlendorf	3b36f8319d	Add dbgmsg kstat Internally ZFS keeps a small log to facilitate debugging. By default the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file. Writing 0 to this proc file clears the log. $ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable $ echo 0 >/proc/spl/kstat/zfs/dbgmsg $ zpool import tank $ cat /proc/spl/kstat/zfs/dbgmsg 1 0 0x01 -1 0 2492357525542 2525836565501 timestamp message 1441141408 spa=tank async request task=1 1441141408 txg 70 open pool version 5000; software version 5000/5; ... 1441141409 spa=tank async request task=32 1441141409 txg 72 import pool version 5000; software version 5000/5; ... 1441141414 command: lt-zpool import tank Note the zfs_dbgmsg() and dprintf() functions are both now mapped to the same log. As mentioned above the kernel debug log can be accessed though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers log messages are immediately written to stdout after applying the ZFS_DEBUG environment variable. $ ZFS_DEBUG=on ./cmd/ztest/ztest -V Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3728	2015-09-04 16:08:14 -07:00
Brian Behlendorf	0500e835af	Support accessing .zfs/snapshot via NFS This patch is based on the previous work done by @andrey-ve and @yshui. It triggers the automount by using kern_path() to traverse to the known snapshout mount point. Once the snapshot is mounted NFS can access the contents of the snapshot. Allowing NFS clients to access to the .zfs/snapshot directory would normally mean that a root user on a client mounting an export with 'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate snapshots on the server. To prevent configuration mistakes a zfs_admin_snapshot module option was added which disables the mkdir/rmdir/mv functionally. System administators desiring this functionally must explicitly enable it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2797 Closes #1655 Closes #616	2015-09-04 13:23:53 -07:00
Brian Behlendorf	e20cd6f7a8	Merge branch 'zvol' Performance improvements for zvols. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3720	2015-09-04 13:14:21 -07:00
Richard Yao	37f9dac592	zvol processing should use struct bio Internally, zvols are files exposed through the block device API. This is intended to reduce overhead when things require block devices. However, the ZoL zvol code emulates a traditional block device in that it has a top half and a bottom half. This is an unnecessary source of overhead that does not exist on any other OpenZFS platform does this. This patch removes it. Early users of this patch reported double digit performance gains in IOPS on zvols in the range of 50% to 80%. Comments in the code suggest that the current implementation was done to obtain IO merging from Linux's IO elevator. However, the DMU already does write merging while arc_read() should implicitly merge read IOs because only 1 thread is permitted to fetch the buffer into ARC. In addition, commercial ZFSOnLinux distributions report that regular files are more performant than zvols under the current implementation, and the main consumers of zvols are VMs and iSCSI targets, which have their own elevators to merge IOs. Some minor refactoring allows us to register zfs_request() as our ->make_request() handler in place of the generic_make_request() function. This eliminates the layer of code that broke IO requests on zvols into a top half and a bottom half. This has several benefits: 1. No per zvol spinlocks. 2. No redundant IO elevator processing. 3. Interrupts are disabled only when actually necessary. 4. No redispatching of IOs when all taskq threads are busy. 5. Linux's page out routines will properly block. 6. Many autotools checks become obsolete. An unfortunate consequence of eliminating the layer that generic_make_request() is that we no longer calls the instrumentation hooks for block IO accounting. Those hooks are GPL-exported, so we cannot call them ourselves and consequently, we lose the ability to do IO monitoring via iostat. Since zvols are internally files mapped as block devices, this should be okay. Anyone who is willing to accept the performance penalty for the block IO layer's accounting could use the loop device in between the zvol and its consumer. Alternatively, perf and ftrace likely could be used. Also, tools like latencytop will still work. Tools such as latencytop sometimes provide a better view of performance bottlenecks than the traditional block IO accounting tools do. Lastly, if direct reclaim occurs during spacemap loading and swap is on a zvol, this code will deadlock. That deadlock could already occur with sync=always on zvols. Given that swap on zvols is not yet production ready, this is not a blocker. Signed-off-by: Richard Yao <ryao@gentoo.org>	2015-09-04 15:30:24 -04:00
Brian Behlendorf	0282c4137e	Add temporary mount options Add the required kernel side infrastructure to parse arbitrary mount options. This enables us to support temporary mount options in largely the same way it is handled on other platforms. See the 'Temporary Mount Point Properties' section of zfs(8) for complete details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #985 Closes #3351	2015-09-03 14:14:55 -07:00
Brian Behlendorf	6cde64351e	Add spa_slop_shift module option Allow for easy turning of a pools reserved free space. Previous versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity in reserve. Commits `3d45fdd` and `0c60cc3` increased this to 1/32. Setting spa_slop_shift=6 will restore the previous default setting. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3724	2015-09-02 09:30:18 -07:00
Andreas Buschmann	bba365cfc8	Add extra keyword 'slot' to vdev_id.conf Add new keyword 'slot' to vdev_id.conf This selects from where to get the slot number for a SAS/SATA disk Needed to enable access to the physical position of a disk in a Supermicro 2027R-AR24NV . Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3693	2015-08-30 10:03:56 -07:00
Brian Behlendorf	7e8bddd019	Update arc_memory_throttle() to check pageout This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:52:12 -07:00
Brian Behlendorf	11f552fa90	Update arc_available_memory() to check freemem While Linux doesn't provide detailed information about the state of the VM it does provide us total free pages. This information should be incorporated in to the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can checked in /proc/spl/kstat/zfs/arcstats and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:50:22 -07:00
Brian Behlendorf	6339c1b9dc	Bound zvol_threads module option The zvol_threads module option should be bounded to a reasonable range. The taskq must have at least 1 thread and shouldn't have more than 1,024 at most. The default value of 32 is a reasonable default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3614	2015-07-29 07:42:11 -07:00
Brian Behlendorf	62aa81a577	Add defclsyspri macro Add a new defclsyspri macro which can be used to request the default Linux scheduler priority. Neither the minclsyspri or maxclsyspri map to the default Linux kernel thread priority. This makes it awkward to create taskqs which run with the same priority as the rest of the kernel threads on the system which can lead to performance issues. All SPL callers which previously used minclsyspri or maxclsyspri have been changed to use defclsyspri. The vast majority of callers were part of the test suite which won't have an external impact. The few places where it could impact performance the change was from maxclsyspri to defclsyspri. This makes it more likely the process will be scheduled which may help performance. To facilitate further performance analysis the spl_taskq_thread_priority module option has been added. When disabled (0) all newly created kernel threads will use the default kernel thread priority. When enabled (1) the specified taskq priority will be used. By default this value is enabled (1). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-23 13:25:49 -07:00
Brian Behlendorf	728d6ae91e	Reinstate zfs_arc_p_min_shift Commit `f521ce1` removed the minimum value for "arc_p" allowing it to drop to zero or grow to "arc_c". This was done to improve specific workload which constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). This change may still be desirable but it needs to be re-investigated. in the context of the recent ARC changes from upstream. Therefore this code is being restored to facilitate benchmarking. By setting "zfs_arc_p_min_shift=64" we easily compare the performance. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:32 -07:00
Manoj Joseph	93f6d7e2e5	Illumos 5764 - "zfs send -nv" directs output to stderr 5764 "zfs send -nv" directs output to stderr Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/dc5f28a https://www.illumos.org/issues/5764 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3585	2015-07-14 10:28:32 -07:00
Justin T. Gibbs	99197f034e	Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled 5661 ZFS: "compression = on" should use lz4 if feature is enabled Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://github.com/illumos/illumos-gate/commit/db1741f https://www.illumos.org/issues/5661 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3571	2015-07-10 12:11:45 -07:00
Colin Ian King	8f3439733f	man: fix spelling mistakes in manual A few minor mistakes than should be fixed: zpool: compatability -> compatibility zfs: accessable -> accessible availible -> available zfs-events: availible -> available zfs-module-parameters: proceding -> proceeding Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3544	2015-07-01 10:58:31 -07:00
Brian Behlendorf	f7a973d99b	Add TASKQ_DYNAMIC feature Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic semantics. Initially only a single worker thread will be created to service tasks dispatched to the queue. As additional threads are needed they will be dynamically spawned up to the max number specified by 'nthreads'. When the threads are no longer needed, because the taskq is empty, they will automatically terminate. Due to the low cost of creating and destroying threads under Linux by default new threads and spawned and terminated aggressively. There are two modules options which can be tuned to adjust this behavior if needed. * spl_taskq_thread_sequential - The number of sequential tasks, without interruption, which needed to be handled by a worker thread before a new worker thread is spawned. Default 4. * spl_taskq_thread_dynamic - Provides the ability to completely disable the use of dynamic taskqs on the system. This is provided for the purposes of debugging and troubleshooting. Default 1 (enabled). This behavior is fundamentally consistent with the dynamic taskq implementation found in both illumos and FreeBSD. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #458	2015-06-24 15:14:18 -07:00
Etienne Dechamps	99b14de421	Make metaslab_aliquot a module parameter. This seems generally useful. metaslab_aliquot is the ZFS allocation granularity, which is roughly equivalent to what is called the stripe size in traditional RAID arrays. It seems relevant to performance tuning. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:38 -07:00
Hajo Möller	410921241d	Add -y option to `zpool iostat` sysstat's iostat omits the first report when the -y option is used. This patch adds that functionality and omits the first report with statistics since system boot. Signed-off-by: Hajo Möller <dasjoe@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3439	2015-06-17 10:39:20 -07:00
Prakash Surya	ca0bf58d65	Illumos 5497 - lock contention on arcs_mtx Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Porting notes and other significant code changes: The illumos 5368 patch (ARC should cache more metadata), which was never picked up by ZoL, is mostly reverted by this patch. Since ZoL relies on the kernel asynchronously calling the shrinker to actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every time it runs. The arc_adapt_thread() function no longer calls arc_do_user_evicts() since the newly-added arc_user_evicts_thread() calls it periodically. Notable conflicting ZoL commits which conflicted with this patch or whose effects are either duplicated or un-done by this patch: `302f753` - Integrate ARC more tightly with Linux `39e055c` - Adjust arc_p based on "bytes" in arc_shrink `f521ce1` - Allow "arc_p" to drop to zero or grow to "arc_c" `77765b5` - Remove "arc_meta_used" from arc_adjust calculation `94520ca` - Prune metadata from ghost lists in arc_adjust_meta Trace support for multilist_insert() and multilist_remove() has been added and produces the following output: fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 } fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 } The following arcstats have been removed: recycle_miss - Used by arcstat.py and arc_summary.py, both of which have been updated appropriately. l2_writes_hdr_miss The following arcstats have been added: evict_not_enough - Number of times arc_evict_state() was unable to evict enough buffers to reach its target amount. evict_l2_skip - Number of times arc_evict_hdr() skipped eviction because it was being written to the l2arc. l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times l2arc_write_done() failed to acquire hash_lock (and re-tries). arc_meta_min - Shows the value of the zfs_arc_meta_min module parameter (see below). The "index" column of the "dbuf" kstat has been removed since it doesn't have a direct analog in the new multilist scheme. Additional multilist- related stats could be added in the future but would likely require extensions to the mulilist API. The following module parameters have been added: zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list before moving on to the next sub-list. zfs_arc_meta_min - Enforce a floor on the amount of metadata in the ARC. zfs_arc_num_sublists_per_state - Number of multilist sub-lists per ARC state. zfs_arc_overflow_shift - Controls amount by which the ARC must exceed the target size to be considered "overflowing". Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov	2015-06-11 10:27:25 -07:00
Turbo Fredriksson	d050c627b5	Improve on the ZFS events documentation * Add information about the 'zpool events' command in zpool(8). * More events and payloads defined in zfs-events(5). * I/O Stages and I/O Flags sections added. * Remove unused legacy "zio_deadline" payload define. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3467	2015-06-09 11:19:19 -07:00
Tim Chase	b1b85c8772	Zdb should be able to open the root dataset If the pool/dataset command-line argument is specified with a trailing slash, for example, "tank/", it is interpreted as the root dataset. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3415	2015-05-15 10:52:46 -07:00
Chris Dunlap	492b1d2ef0	Update ZED copyright boilerplate This commit updates the copyright boilerplate within the ZED subtree. The instructions for appending a contributor copyright line have been removed. Manually maintaining copyright notices in this manner is error-prone, imprecise at a file-scope granularity, and oftentimes inaccurate. These lines can become a pernicious source of merge conflicts. A commit log is better suited to maintaining this information. Consequently, a line has been added to the boilerplate to refer to the git commit log for authoritative copyright attribution. To account for the scenario where a file may become separated from the codebase and commit history (i.e., it is copied somewhere else), a line has been added to identify the file's origin. http://softwarefreedom.org/resources/2012/ManagingCopyrightInformation.html Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3384	2015-05-11 15:07:00 -07:00
Matthew Ahrens	f1512ee61e	Illumos 5027 - zfs large block support 5027 zfs large block support Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #354	2015-05-11 12:23:16 -07:00
Turbo Fredriksson	859735c095	Add the '-a' option to 'zpool export' Support exporting all imported pools in one go, using 'zpool export -a'. This is accomplished by moving the export parts from zpool_do_export() in to the new function zpool_export_one(). The for_each_pool() function is used to enumerate the list of pools to be exported. Passing an argc of 0 implies the function should be called on all pools. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #3203	2015-05-04 10:17:08 -07:00
Jerry Jelinek	788eb90c4c	Illumos 3897 - zfs filesystem and snapshot limits 3897 zfs filesystem and snapshot limits Author: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/a2afb61 Porting Notes: dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc(). Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:22:51 -07:00
Paul B. Henson	0bf8501ae1	5410 Document -S option to zfs inherit 5410 Document -S option to zfs inherit 5412 Mention -S option when zfs inherit fails on quota Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5410 https://github.com/illumos/illumos-gate/commit/5ff8cfa9 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3279	2015-04-24 15:16:49 -07:00
cburroughs	7008109646	align zfs_autoimport_disable manpage with reality The default was changed in #2820. Signed-off-by: cburroughs <chris.burroughs@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3341	2015-04-24 14:58:38 -07:00
DHE	614e598c88	Fix formatting error in zfs(8) Commit `b1a3e93217` accidentally introduced an intentation error between the 'zfs receive' and 'zfs allow' detailed documentation sections. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3312	2015-04-23 13:59:00 -07:00
Turbo Fredriksson	b467db454e	Document bookmarks a little better in zfs(8) Add a basic summary to zfs(8) describing bookmarks. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3268	2015-04-14 14:07:07 -07:00
Brian Behlendorf	74aa2ba259	Update zfs_pd_bytes_max default in zfs(8) Commit `b738bc5` should have updated the default value of zfs_pd_bytes_max in the zfs(8) man page. The correct default value is 5010241024. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-03-31 11:55:28 -07:00
George Wilson	b738bc5a0f	Illumos 5694 - traverse_prefetcher does not prefetch enough 5694 traverse_prefetcher does not prefetch enough Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5694 https://github.com/illumos/illumos-gate/commit/34d7ce05 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3230	2015-03-27 15:02:50 -07:00
Isaac Huang	e89bd69775	zio_injection_enabled should not be a module option The zio_inject.c keeps zio_injection_enabled as a counter of fault handlers, so it should not be exported to user space as a module option. Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c, where the symbols are defined. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3199	2015-03-24 13:22:03 -07:00
Brian Behlendorf	bc88866657	Fix arc_adjust_meta() behavior The goal of this function is to evict enough meta data buffers from the ARC in order to enforce the arc_meta_limit. Achieving this is slightly more complicated than it appears because it is common for data buffers to have holds on meta data buffers. In addition, dnode meta data buffers will be held by the dnodes in the block preventing them from being freed. This means we can't simply traverse the ARC and expect to always find enough unheld meta data buffer to release. Therefore, this function has been updated to make alternating passes over the ARC releasing data buffers and then newly unheld meta data buffers. This ensures forward progress is maintained and arc_meta_used will decrease. Normally this is sufficient, but if required the ARC will call the registered prune callbacks causing dentry and inodes to be dropped from the VFS cache. This will make dnode meta data buffers available for reclaim. The number of total restarts in limited by zfs_arc_meta_adjust_restarts to prevent spinning in the rare case where all meta data is pinned. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	2cbb06b561	Restructure per-filesystem reclaim Originally when the ARC prune callback was introduced the idea was to register a single callback for the ZPL. The ARC could invoke this call back if it needed the ZPL to drop dentries, inodes, or other cache objects which might be pinning buffers in the ARC. The ZPL would iterate over all ZFS super blocks and perform the reclaim. For the most part this design has worked well but due to limitations in 2.6.35 and earlier kernels there were some problems. This patch is designed to address those issues. 1) iterate_supers_type() is not provided by all kernels which makes it impossible to safely iterate over all zpl_fs_type filesystems in a single callback. The most straight forward and portable way to resolve this is to register a callback per-filesystem during mount. The arc_*_prune_callback() functions have always supported multiple callbacks so this is functionally a very small change. 2) Commit `050d22b` removed the non-portable shrink_dcache_memory() and shrink_icache_memory() functions and didn't replace them with equivalent functionality. This meant that for Linux 3.1 and older kernels the ARC had no mechanism to drop dentries and inodes from the caches if needed. This patch adds that missing functionality by calling shrink_dcache_parent() to release dentries which may be pinning inodes. This will result in all unused cache entries being dropped which is a bit heavy handed but it's the only interface available for old kernels. 3) A zpl_drop_inode() callback is registered for kernels older than 2.6.35 which do not support the .evict_inode callback. This ensures that when the last reference on an inode is dropped it is immediately removed from the cache. If this isn't done than inode can end up on the global unused LRU with no mechanism available to ZFS to drop them. Since the ARC buffers are not dropped the hottest inodes can still be recreated without performing disk IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Turbo Fredriksson	b1a3e93217	Move duplicate information about the 'zfs send -e' option. The extra one was under the 'zfs receive' command (which isn't relevant). Instead, it should have been further up (still in the 'zfs send' option). Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3194	2015-03-19 10:39:43 -07:00
Brian Behlendorf	6442f3cfe3	Retire zio_bulk_flags Long ago the zio_bulk_flags module parameter was introduced to facilitate debugging and profiling the zio_buf_caches. Today this code works well and there's no compelling reason to keep this functionality. In fact it's preferable to revert this so the code is more consistent with other ZFS implementations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:08:49 -08:00
Brian Behlendorf	3018bffa9b	Refine slab cache sizing This change is designed to improve the memory utilization of slabs by more carefully setting their size. The way the code currently works is problematic for slabs which contain large objects (>1MB). This is due to slabs being unconditionally rounded up to a power of two which may result in unused space at the end of the slab. The reason the existing code rounds up every slab is because it assumes it will backed by the buddy allocator. Since the buddy allocator can only performs power of two allocations this is desirable because it avoids wasting any space. However, this logic breaks down if slab is backed by vmalloc() which operates at a page level granularity. In this case, the optimal thing to do is calculate the minimum required slab size given certain constraints (object size, alignment, objects/slab, etc). Therefore, this patch reworks the spl_slab_size() function so that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs. KMC_KMEM slabs are rounded up to the nearest power of two, and KMC_VMEM slabs are allowed to be the minimum required size. This change also reduces the default number of objects per slab. This reduces how much memory a single cache object can pin, which can result in significant memory saving for highly fragmented caches. But depending on the workload it may result in slabs being allocated and freed more frequently. In practice, this has been shown to be a better default for most workloads. Also the maximum slab size has been reduced to 4MB on 32-bit systems. Due to the limited virtual address space it's critical the we be as frugal as possible. A limit of 4M still lets us reasonably comfortably allocate a limited number of 1MB objects. Finally, the kmem:slab_small and kmem:slab_large SPLAT tests were extended to provide better test coverage of various object sizes and alignments. Caches are created with random parameters and their basic functionality is verified by allocating several slabs worth of objects. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	b1c3ae48a7	Update spl-module-parameters(5) man page The spl-module-parameters(5) was not kept up to date. Refresh the man page so that it lists all the possible module options, describes what the do, and justify why the default values are set they way the are. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	1a20496834	Make slab reclaim more aggressive Many people have noticed that the kmem cache implementation is slow to release its memory. This patch makes the reclaim behavior more aggressive by immediately freeing a slab once it is empty. Unused objects which are cached in the magazines will still prevent a slab from being freed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	c3eabc75b1	Refactor generic memory allocation interfaces This patch achieves the following goals: 1. It replaces the preprocessor kmem flag to gfp flag mapping with proper translation logic. This eliminates the potential for surprises that were previously possible where kmem flags were mapped to gfp flags. 2. It maps vmem_alloc() allocations to kmem_alloc() for allocations sized less than or equal to the newly-added spl_kmem_alloc_max parameter. This ensures that small allocations will not contend on a single global lock, large allocations can still be handled, and potentially limited virtual address space will not be squandered. This behavior is entirely different than under Illumos due to different memory management strategies employed by the respective kernels. However, this functionally provides the semantics required. 3. The --disable-debug-kmem, --enable-debug-kmem (default), and --enable-debug-kmem-tracking allocators have been unified in to a single spl_kmem_alloc_impl() allocation function. This was done to simplify the code and make it more maintainable. 4. Improve portability by exposing an implementation of the memory allocations functions that can be safely used in the same way they are used on Illumos. Specifically, callers may safely use KM_SLEEP in contexts which perform filesystem IO. This allows us to eliminate an entire class of Linux specific changes which were previously required to avoid deadlocking the system. This change will be largely transparent to existing callers but there are a few caveats: 1. Because the headers were refactored and extraneous includes removed callers may find they need to explicitly add additional #includes. In particular, kmem_cache.h must now be explicitly includes to access the SPL's kmem cache implementation. This behavior is different from Illumos but it was done to avoid always masking the Linux slab functions when kmem.h is included. 2. Callers, like Lustre, which made assumptions about the definitions of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated. Other callers such as ZFS which did not will not require changes. 3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It retains its original meaning of allowing allocations to access reserved memory. KM_PUSHPAGE callers can be converted back to KM_SLEEP. 4. The KM_NODEBUG flags has been retired and the default warning threshold increased to 32k. 5. The kmem_virt() functions has been removed. For callers which need to distinguish between a physical and virtual address use is_vmalloc_addr(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Ned Bass	33b6dbbc51	Document zfs_flags module parameter Add a table describing the debugging flags that can be set in the zfs_flags module parameter. Also change the module_param type to 'uint' so users aren't shown a negative value. The updated man page text is reproduced below for convenience. zfs_flags (int) Set additional debugging flags. The following flags may be bitwise-or'd together. +-------------------------------------------------------+ \|Value Symbolic Name \| \| Description \| +-------------------------------------------------------+ \| 1 ZFS_DEBUG_DPRINTF \| \| Enable dprintf entries in the debug log. \| +-------------------------------------------------------+ \| 2 ZFS_DEBUG_DBUF_VERIFY * \| \| Enable extra dbuf verifications. \| +-------------------------------------------------------+ \| 4 ZFS_DEBUG_DNODE_VERIFY * \| \| Enable extra dnode verifications. \| +-------------------------------------------------------+ \| 8 ZFS_DEBUG_SNAPNAMES \| \| Enable snapshot name verification. \| +-------------------------------------------------------+ \| 16 ZFS_DEBUG_MODIFY \| \| Check for illegally modified ARC buffers. \| +-------------------------------------------------------+ \| 32 ZFS_DEBUG_SPA \| \| Enable spa_dbgmsg entries in the debug log. \| +-------------------------------------------------------+ \| 64 ZFS_DEBUG_ZIO_FREE \| \| Enable verification of block frees. \| +-------------------------------------------------------+ \| 128 ZFS_DEBUG_HISTOGRAM_VERIFY \| \| Enable extra spacemap histogram verifications. \| +-------------------------------------------------------+ * Requires debug build. Default value: 0. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2988	2015-01-07 15:50:49 -08:00
Randall Mason	33c0819425	Fix small spelling mistake recieve becomes receive Signed-off-by: Randall Mason <ClashTheBunny@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2877	2014-11-14 15:48:51 -08:00
Daniil Lunev	62bdd5eb7a	Illumos 4924 - LZ4 Compression for metadata Reviewed by Matthew Ahrens <mahrens@delphix.com> Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://github.com/illumos/illumos-gate/commit/b8289d2 https://www.illumos.org/issues/3756 Porting notes: The static function zfs_prop_activate_feature() was removed because this change removes the only caller. The function was not removed from Illumos but instead left as dead code. However, to keep gcc happy it was removed from Linux and may be easily restored if needed. Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1540	2014-10-20 16:17:49 -07:00
Brian Behlendorf	a80d69caf0	Remove adaptive mutex implementation Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs. There is no longer any point in keeping this code so it is being removed to simplify the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00
Turbo Fredriksson	971808ec9f	Add a stern warning about dedup Users intending to use dedup should be clearly advised about its memory requirements and the risks involved. Thanx to Sachiru for comments and suggestions. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2754	2014-10-08 17:07:11 -07:00
Turbo Fredriksson	a215ee16c0	Add an example for 'zfs bookmark' to the Example section. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2762	2014-10-07 11:29:26 -07:00
Richard Yao	83e9986f6e	Implement -t option to zpool create for temporary pool names Creating virtual machines that have their rootfs on ZFS on hosts that have their rootfs on ZFS causes SPA namespace collisions when the standard name rpool is used. The solution is either to give each guest pool a name unique to the host, which is not always desireable, or boot a VM environment containing an ISO image to install it, which is cumbersome. `26b42f3f9d` introduced `zpool import -t ...` to simplify situations where a host must access a guest's pool when there is a SPA namespace conflict. We build upon that to introduce `zpool import -t tname ...`. That allows us to create a pool whose in-core name is tname, but whose on-disk name is the normal name specified. This simplifies the creation of machine images that use a rootfs on ZFS. That benefits not only real world deployments, but also ZFSOnLinux development by decreasing the time needed to perform rootfs on ZFS experiments. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2417	2014-09-30 10:46:59 -07:00
Richard Yao	00d2a8c92f	zpool import -t should not update cachefile zpool import's -t parameter is intended for use with -R when operating on pools that belong to other systems. Like -R, pools imported in this way should not update the cachefile unless explicitly requested. The initial implementation allowed the cachefile to be updated when -R was not used. This went uncaught during testing because -R had implicitly disabled use of the cachefile. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2417	2014-09-30 10:46:58 -07:00
Brian Behlendorf	aa0ac7caa4	Make user stack limit configurable To aid in detecting and debugging stack overflow issues make the user space stack limit configurable via a new ZFS_STACK_SIZE environment variable. The value assigned to ZFS_STACK_SIZE will be used as the default stack size in bytes. Because this is mainly useful as a debugging aid in conjunction with ztest the stack limit is disabled by default. See the ztest(1) man page for additional details on using the ZFS_STACK_SIZE environment variable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2743 Issue #2293	2014-09-30 10:46:55 -07:00
Chris Dunlap	dcca723ace	Refer to ZED's scripts as ZEDLETs The executables invoked by the ZED in response to a given zevent have been generically referred to as "scripts". By convention, these scripts have aimed to be /bin/sh compatible for reasons of portability and comprehensibility. However, the ZED only requires they be executable and (ideally) capable of reading environment variables. As such, these scripts are now referred to as ZEDLETs (ZFS Event Daemon Linkage for Executable Tasks). Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2735	2014-09-25 13:54:17 -07:00
Max Grossman	36283ca233	Illumos 5138 - add tunable for maximum number of blocks freed in one txg Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5138 https://github.com/illumos/illumos-gate/commit/af3465d Porting notes: Because support for exposing a uint64_t parameter wasn't added until v3.17-rc1 the zfs_free_max_blocks variable has been declared as a unsigned long. This is already far larger than required and it allows us to avoid additional autoconf compatibility code. The default value has been set to 100,000 on Linux instead of ULONG_MAX which is used on Illumos. This was done to limit the number of outstanding IOs in the system when snapshots are destroyed. This helps ensure individual TXG sync times are kept reasonable and memory isn't wasted managing a huge backlog of outstanding IOs. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2675 Closes #2581	2014-09-23 14:26:34 -07:00
Matthew Ahrens	b8bcca18f7	Illumos 5161 - add tunable for number of metaslabs per vdev 5161 add tunable for number of metaslabs per vdev Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5161 https://github.com/illumos/illumos-gate/commit/bf3e216 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2698	2014-09-23 10:00:02 -07:00
Turbo Fredriksson	71bd064555	Document environment variables for zdb, zfs, zinject and zpool. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2691	2014-09-18 15:01:03 -07:00
Tim Chase	52dd454d05	Document the "readonly" pool property This documentation is based FreeBSD's zpool(8) man page. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2682	2014-09-09 11:35:46 -07:00
Alexey Smirnoff	0dfc732416	Change the default 'zfs_dedup_prefetch' value to '0' This gives a huge performance improvement in operations with deduped datasets especially when the bottleneck is the amount of ram available for zfs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2639	2014-09-04 09:50:45 -07:00
Matthew Ahrens	dea377c0d9	Illumos 4970-4974 - extreme rewind enhancements 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4970 https://www.illumos.org/issues/4971 https://www.illumos.org/issues/4972 https://www.illumos.org/issues/4973 https://www.illumos.org/issues/4974 https://github.com/illumos/illumos-gate/commit/e42d205 Notes: This set of patches adds a set of tunable parameters for the "extreme rewind" mode of pool import which allows control over the traversal performed during such an import. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2598	2014-08-26 16:29:57 -07:00
Matthew Ahrens	49ddb31506	Illumos 5034 - ARC's buf_hash_table is too small 5034 ARC's buf_hash_table is too small Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5034 https://github.com/illumos/illumos-gate/commit/63e911b Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2615	2014-08-26 16:14:49 -07:00
George Wilson	f3a7f6610f	Illumos 4976-4984 - metaslab improvements 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4983 need to collect metaslab information via mdb 4984 device selection should use fragmentation metric Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4976 https://www.illumos.org/issues/4978 https://www.illumos.org/issues/4979 https://www.illumos.org/issues/4980 https://www.illumos.org/issues/4981 https://www.illumos.org/issues/4982 https://www.illumos.org/issues/4983 https://www.illumos.org/issues/4984 https://github.com/illumos/illumos-gate/commit/2e4c998 Notes: The "zdb -M" option has been re-tasked to display the new metaslab fragmentation metric and the new "zdb -I" option is used to control the maximum number of in-flight I/Os. The new fragmentation metric is derived from the space map histogram which has been rolled up to the vdev and pool level and is presented to the user via "zpool list". Add a number of module parameters related to the new metaslab weighting logic. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2595	2014-08-18 08:40:49 -07:00
Turbo Fredriksson	f67d709080	Create an 'overlay' property Add a new 'overlay' property (default 'off') that controls whether the filesystem should be mounted even if the mountpoint is busy or if it should fail with a 'mountpoint not empty'. Doing overlay mounts is the default mount behavior on Linux, but not in ZFS. It have been decided that following the ZFS behavior should be the default, but this overlay allows for site administrator to override this decision on a per-dataset basis. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #2503	2014-08-15 13:39:19 -07:00
Matthew Ahrens	fbeddd60b7	Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol 4390 i/o errors when deleting filesystem/zvol can lead to space map corruption Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4390 https://github.com/illumos/illumos-gate/commit/7fd05ac Porting notes: Previous stack-reduction efforts in traverse_visitb() caused a fair number of un-mergable pieces of code. This patch should reduce its stack footprint a bit more. The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated using kmem_zalloc() for the purpose of stack reduction. The new global zfs_free_leak_on_eio has been defined as an integer rather than a boolean_t as was the case with the related zfs_recover global. Also, zfs_free_leak_on_eio's definition has been inserted into zfs_debug.c for consistency with the existing definition of zfs_recover. Illumos placed it in spa_misc.c. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2545	2014-08-04 11:50:52 -07:00
Matthew Ahrens	9b67f60560	Illumos 4757, 4913 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4757 https://www.illumos.org/issues/4913 https://github.com/illumos/illumos-gate/commit/5d7b4d4 Porting notes: For compatibility with the fastpath code the zio_done() function needed to be updated. Because embedded-data block pointers do not require DVAs to be allocated the associated vdevs will not be marked and therefore should not be unmarked. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2544	2014-08-01 14:28:05 -07:00
Matthew Ahrens	faf0f58c69	Illumos 3835 zfs need not store 2 copies of all metadata Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Description from Matt Ahrens's bug report at Delphix: Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. Additional notes: The new man page section for this property states "The exact behavior of which metadata blocks are stored redundantly may change in future releases." and: "When set to most, ZFS stores an extra copy of most types of metadata. This can improve performance of random writes, because less metadata must be written." The current implementation is as described above in Matt's blog. It is controlled by a new global integer "zfs_redundant_metadata_most_ditto_level", currently initialized to 2. When "redundant_metadata" is set to "most", only indirect blocks of the specified level and higher will have additional ditto blocks created. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2542	2014-07-31 09:49:34 -07:00
Matthew Ahrens	da536844d5	Illumos 4368, 4369. 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4369 https://www.illumos.org/issues/4368 https://github.com/illumos/illumos-gate/commit/78f1710 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2530	2014-07-29 10:55:29 -07:00
Max Grossman	b0bc7a84d9	Illumos 4370, 4371 4370 avoid transmitting holes during zfs send 4371 DMU code clean up Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Garrett D'Amore <garrett@damore.org>a References: https://www.illumos.org/issues/4370 https://www.illumos.org/issues/4371 https://github.com/illumos/illumos-gate/commit/43466aa Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2529	2014-07-28 14:29:58 -07:00
Matthew Ahrens	fa86b5dbb6	Illumos 4171, 4172 4171 clean up spa_feature_*() interfaces 4172 implement extensible_dataset feature for use by other zpool features Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org>a References: https://www.illumos.org/issues/4171 https://www.illumos.org/issues/4172 https://github.com/illumos/illumos-gate/commit/2acef22 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2528	2014-07-25 16:40:07 -07:00
Turbo Fredriksson	79eb71dc6c	Support '-H' (scripted mode) to 'zpool get' This functionality is already available in 'zfs get'. Providing it for 'zpool get' is useful and good for consistency. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #2522	2014-07-25 11:58:36 -07:00
Turbo Fredriksson	a60e668bd2	Initial attempt to document events and payloads. In no way complete - most have been trial and error and some deducing what they could mean. It needs more information from someone that knows the code better. But this is a start and it lays the basic structure for adding this additional detail. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2357	2014-07-25 11:58:36 -07:00
George Wilson	93cf20764a	Illumos #4101 , #4102 , #4103 , #4105 , #4106 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Sebastien Roy <seb@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation; then unloading that space_map and repeating the process. This change modifies the space_map's to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer size sizes enough to fit the new metadata. The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram In addition to the above, this patch allows the space_map block size to be increase. Currently the block size is set to be 4K in size, which has certain implications including the following: * 4K sector devices will not see any compression benefit * large space_maps require more metadata on-disk * large space_maps require more time to load (typically random reads) Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable. A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not deal with removing top-level vdevs that are interior vdevs (i.e. mirror) correctly. The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked; because top-level vdevs don't have DTL space map objects associated with them. References: https://www.illumos.org/issues/4101 https://www.illumos.org/issues/4102 https://www.illumos.org/issues/4103 https://www.illumos.org/issues/4105 https://www.illumos.org/issues/4106 https://github.com/illumos/illumos-gate/commit/0713e23 Porting notes: A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:39:16 -07:00
Richard Yao	a5778ea242	zdb: Introduce -V for verbatim import When given a pool name via -e, zdb would attempt an import. If it failed, then it would attempt a verbatim import. This behavior is not always desirable so a -V switch is added to zdb to control the behavior. When specified, a verbatim import is done. Otherwise, the behavior is as it was previously, except no verbatim import is done on failure. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2372	2014-07-17 11:40:32 -07:00
Tim Chase	f4a4046bd6	Convert zfs_mg_noalloc_threshold to a module parameter and document The parameter was added as illumos issue 4081 which was committed to zfsonlinux in `ac72fac3ea`. This patch documents the parameter and allows for it to be set as a module parameter. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2483	2014-07-16 16:49:25 -07:00
Tim Chase	52e68edc2d	Document the optional "device" argument for "zpool split" Most ZFS implementations seemed to have missed this bit of documentation. The additional text is based on FreeBSD's man page. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2416	2014-07-01 14:16:43 -07:00

... 2 3 4 5 6 ...

493 Commits