Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Max Grossman	5dc8b7365f	Illumos 5765 - add support for estimating send stream size with lzc_send_space when source is a bookmark 5765 add support for estimating send stream size with lzc_send_space when source is a bookmark Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@nexenta.com> References: https://www.illumos.org/issues/5765 https://github.com/illumos/illumos-gate/commit/643da460 Porting notes: * Unused variable 'recordsize' in dmu_send_estimate() dropped Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3397	2015-05-13 09:03:59 -07:00
Justin T. Gibbs	19b3b1d2a2	Illumos 5393 - spurious failures from dsl_dataset_hold_obj() 5393 spurious failures from dsl_dataset_hold_obj() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5393 https://github.com/illumos/illumos-gate/commit/e1f3c20 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3403	2015-05-13 08:56:04 -07:00
Justin T. Gibbs	63b33e878a	Illumos 5562 - ZFS sa_handle's violate kmem invariants, debug kernels panic on boot 5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Robert Mustacchi <rm@fingolfin.org> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5562 https://github.com/illumos/illumos-gate/commit/0fda3cc5 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3388	2015-05-11 15:10:57 -07:00
Matthew Ahrens	252e1a54ab	Illumos 5810 - zdb should print details of bpobj 5810 zdb should print details of bpobj Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5810 https://github.com/illumos/illumos-gate/commit/732885fc Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3387	2015-05-11 15:10:24 -07:00
Matthew Ahrens	10400bfeac	Illumos 5351, 5352 - scrub pauses 5351 scrub goes for an extra second each txg 5352 scrub should pause when there is some dirty data Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5351 https://github.com/illumos/illumos-gate/commit/6f6a76a Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3383	2015-05-11 15:09:56 -07:00
Matthew Ahrens	08dc1b2ddd	Illumos 5350 - clean up code in dnode_sync() Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5350 https://github.com/illumos/illumos-gate/commit/e651831 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3382	2015-05-11 15:09:51 -07:00
Alex Reece	7224c67fea	Illumos 5422 - preserve AVL invariants in dn_dbufs Author: Alex Reece <alex@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Albert Lee <trisk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5422 https://github.com/illumos/illumos-gate/commit/a846f19 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3381	2015-05-11 15:09:29 -07:00
David Lamparter	214806c7e9	Safely handle security / ACL failures The security and ACL operations should all be performed atomically. To accomplish this there would need to significant invasive changes made to the common code base. For the moment it's desirable for compatibility reasons to avoid this. Therefore the code has been updated to attempt to unwind the operation in case of failure rather than panic. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2445	2015-05-11 14:35:14 -07:00
Matthew Ahrens	f1512ee61e	Illumos 5027 - zfs large block support 5027 zfs large block support Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #354	2015-05-11 12:23:16 -07:00
Brian Behlendorf	f9cab37291	Remove metaslab_min_alloc_size module option The metaslab_min_alloc_size option is no longer used in the code. This functionality was removed by commit `f3a7f66` and the module options should have been dropped at that time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-11 09:51:57 -07:00
Chris Dunlop	16fcdea363	arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc With debugging enabled and depending on your kernel config, the size of arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost() to greater than 1024 bytes. Let's avoid this. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3377	2015-05-08 14:14:35 -07:00
Matthew Ahrens	63e3a8616b	Illumos 5349 - verify that block pointer is plausible before reading 5349 verify that block pointer is plausible before reading Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Xin Li <delphij@FreeBSD.org> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5349 https://github.com/illumos/illumos-gate/commit/f63ab3d5 Porting notes: * Several variable declarations were moved due to C style needs Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3373	2015-05-08 14:09:15 -07:00
Chris Dunlop	f0da4d1508	Wait for all znodes to be released before tearing down the superblock By the time we're tearing down our superblock the VFS has started releasing all our inodes/znodes. Some of this work may have been handed off to our iput taskq so we need to wait for that work to complete. However the iput from the taskq can itself result in additional work being added to the taskq: dsl_pool_iput_taskq iput iput_final evict destroy_inode zpl_inode_destroy zfs_inode_destroy zfs_iput_async(ZTOI(zp->z_xattr_parent)) taskq_dispatch(dsl_pool_iput_taskq..., iput, ...) Let's wait until all our znodes have been released. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3281	2015-05-06 14:13:19 -07:00
Matthew Ahrens	7a3066ffdd	Illumos 5348 - zio_checksum_error() only fills in info if ECKSUM 5348 zio_checksum_error() only fills in info if ECKSUM Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5348 https://github.com/illumos/illumos-gate/commit/373dc1cf Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3372	2015-05-05 11:05:37 -07:00
Matthew Ahrens	f3c517d814	Illumos 5820 - verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5820 https://github.com/illumos/illumos-gate/commit/34e8acef00 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3364	2015-05-04 10:49:49 -07:00
Matthew Ahrens	36c6ffb6b6	Illumos 5808 - spa_check_logs is not necessary on readonly pools 5808 spa_check_logs is not necessary on readonly pools Reviewed by: George Wilson <george@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5808 https://github.com/illumos/illumos-gate/commit/23367a2f Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3369	2015-05-04 10:45:42 -07:00
Will Andrews	50f9ea0149	Illumos 5814 - bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. 5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5814 https://github.com/illumos/illumos-gate/commit/b67dde11 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3368	2015-05-04 10:28:18 -07:00
Christopher Siden	0c60cc326b	Illumos 4951 - ZFS administrative commands (fix) 4951 ZFS administrative commands should use reserved space, not fail with ENOSPC Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/4951 https://github.com/illumos/illumos-gate/commit/c39f2c8 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:10 -07:00
Matthew Ahrens	3d45fdd6c0	Illumos 4951 - ZFS administrative commands should use reserved space 4951 ZFS administrative commands should use reserved space, not with ENOSPC Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4373 https://github.com/illumos/illumos-gate/commit/7d46dc6 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:10 -07:00
Jerry Jelinek	a0c9a17aef	Illumos 4901 - zfs filesystem/snapshot limit leaks 4901 zfs filesystem/snapshot limit leaks Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4901 https://github.com/illumos/illumos-gate/commit/adf3407 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:09 -07:00
Matthew Ahrens	83017311e4	Illumos 3654,3656 3654 zdb should print number of ganged blocks 3656 remove unused function zap_cursor_move_to_key() Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3654 https://www.illumos.org/issues/3656 https://github.com/illumos/illumos-gate/commit/d5ee8a1 Porting Notes: 3655 and 3657 were part of this commit but those hunks were dropped since they apply to mdb. Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:09 -07:00
tuxoko	6102d0376e	Add cond_resched to zfs_zget to prevent infinite loop It's been reported that threads would loop infinitely inside zfs_zget. The speculated cause for this is that if an inode is marked for evict, zfs_zget would see that and loop. However, if the looping thread doesn't yield, the inode may not have a chance to finish evict, thus causing a infinite loop. This patch solve this issue by add cond_resched to zfs_zget, making the looping thread to yield when needed. Tested-by: jlavoy <jalavoy@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3349	2015-05-04 09:12:41 -07:00
Jason Zaman	c9520ecc0f	dmu: fix integer overflows The params to the functions are uint64_t, but the offsets to memcpy / bcopy are calculated using 32bit ints. This patch changes them to also be uint64_t so there isnt an overflow. PaX's Size Overflow caught this when formatting a zvol. Gentoo bug: #546490 PAX: offset: 1ffffb000 db->db_offset: 1ffffa000 db->db_size: 2000 size: 5000 PAX: size overflow detected in function dmu_read /var/tmp/portage/sys-fs/zfs-kmod-0.6.3-r1/work/zfs-zfs-0.6.3/module/zfs/../../module/zfs/dmu.c:781 cicus.366_146 max, count: 15 CPU: 1 PID: 2236 Comm: zvol/10 Tainted: P O 3.17.7-hardened-r1 #1 Call Trace: [<ffffffffa0382ee8>] ? dsl_dataset_get_holds+0x9d58/0x343ce [zfs] [<ffffffff81a59c88>] dump_stack+0x4e/0x7a [<ffffffffa0393c2a>] ? dsl_dataset_get_holds+0x1aa9a/0x343ce [zfs] [<ffffffff81206696>] report_size_overflow+0x36/0x40 [<ffffffffa02dba2b>] dmu_read+0x52b/0x920 [zfs] [<ffffffffa0373ad1>] zrl_is_locked+0x7d1/0x1ce0 [zfs] [<ffffffffa0364cd2>] zil_clean+0x9d2/0xc00 [zfs] [<ffffffffa0364f21>] zil_commit+0x21/0x30 [zfs] [<ffffffffa0373fe1>] zrl_is_locked+0xce1/0x1ce0 [zfs] [<ffffffff81a5e2c7>] ? __schedule+0x547/0xbc0 [<ffffffffa01582e6>] taskq_cancel_id+0x2a6/0x5b0 [spl] [<ffffffff81103eb0>] ? wake_up_state+0x20/0x20 [<ffffffffa0158150>] ? taskq_cancel_id+0x110/0x5b0 [spl] [<ffffffff810f7ff4>] kthread+0xc4/0xe0 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170 [<ffffffff81a62fa4>] ret_from_fork+0x74/0xa0 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170 Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3333	2015-05-04 09:12:00 -07:00
George Wilson	98b254188a	Illumos #5244 - zio pipeline callers should explicitly invoke next stage 5244 zio pipeline callers should explicitly invoke next stage Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5244 https://github.com/illumos/illumos-gate/commit/738f37b Porting Notes: 1. The unported "2932 support crash dumps to raidz, etc. pools" caused a merge conflict due to a copyright difference in module/zfs/vdev_raidz.c. 2. The unported "4128 disks in zpools never go away when pulled" and additional Linux-specific changes caused merge conflicts in module/zfs/vdev_disk.c. Ported-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2828	2015-04-30 15:07:47 -07:00
Matthew Ahrens	8dd86a10cf	Illumos 5812 - assertion failed in zrl_tryenter(): zr_owner==NULL 5812 assertion failed in zrl_tryenter(): zr_owner==NULL Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5812 https://github.com/illumos/illumos-gate/commit/8df1730 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3357	2015-04-30 14:43:40 -07:00
Justin T. Gibbs	6186e29753	Illumos 5592 - NULL pointer dereference in dsl_prop_notify_all_cb() 5592 NULL pointer dereference in dsl_prop_notify_all_cb() Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5592 https://github.com/illumos/illumos-gate/commit/9d47dec Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:58 -07:00
Justin T. Gibbs	6ebebaceb1	Illumos 5531 - NULL pointer dereference in dsl_prop_get_ds() 5531 NULL pointer dereference in dsl_prop_get_ds() Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5531 https://github.com/illumos/illumos-gate/commit/e57a022 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:44 -07:00
Justin T. Gibbs	0c66c32d1d	Illumos 5056 - ZFS deadlock on db_mtx and dn_holds 5056 ZFS deadlock on db_mtx and dn_holds Author: Justin Gibbs <justing@spectralogic.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5056 https://github.com/illumos/illumos-gate/commit/bc9014e Porting Notes: sa_handle_get_from_db(): - the original patch includes an otherwise unmentioned fix for a possible usage of an uninitialised variable dmu_objset_open_impl(): - Under Illumos list_link_init() is the same as filling a list_node_t with NULLs, so they don't notice if they miss doing list_link_init() on a zero'd containing structure (e.g. allocated with kmem_zalloc as here). Under Linux, not so much: an uninitialised list_node_t goes "Boom!" some time later when it's used or destroyed. dmu_objset_evict_dbufs(): - reduce stack usage using kmem_alloc() Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:34 -07:00
Justin T. Gibbs	d683ddbb72	Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFS 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5314 https://github.com/illumos/illumos-gate/commit/c137962 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:20 -07:00
Justin T. Gibbs	945dd93525	Illumos 5310 - Remove always true tests for non-NULL ds->ds_phys 5310 Remove always true tests for non-NULL ds->ds_phys Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5310 https://github.com/illumos/illumos-gate/commit/d808a4f Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:08 -07:00
Alex Reece	9925c28cde	Illumos 5095 - panic when adding a duplicate dbuf to dn_dbufs 5095 panic when adding a duplicate dbuf to dn_dbufs Author: Alex Reece <alex@delphix.com> Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Josef Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5095 https://github.com/illumos/illumos-gate/commit/86bb58a Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:49 -07:00
Justin T. Gibbs	5aea3644d6	Illumos 5038 - Remove "old-style" flexible array usage in ZFS. 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5038 https://github.com/illumos/illumos-gate/commit/7f18da4 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:24 -07:00
Alex Reece	8951cb8dfb	Illumos 4873 - zvol unmap calls can take a very long time for larger datasets 4873 zvol unmap calls can take a very long time for larger datasets Author: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4873 https://github.com/illumos/illumos-gate/commit/0f6d88a Porting Notes: dbuf_free_range(): - reduce stack usage using kmem_alloc() - the sorted AVL tree will handle the spill block case correctly without all the special handling in the for() loop Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:03 -07:00
Jorgen Lundman	58c4aa00c6	Illumos 4975 - missing mutex_destroy() calls in zfs 4975 missing mutex_destroy() calls in zfs Author: Jorgen Lundman <lundman@lundman.net> Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Reviewed by: Seth Nimbosa <darth.Serious@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4975 https://github.com/illumos/illumos-gate/commit/d2b3cbb Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:23:38 -07:00
Alex Reece	ca227e54a8	Illumos 3897 - zfs filesystem and snapshot limits (fix leak) 3897 zfs filesystem and snapshot limits (fix leak) Author: Alex Reece <alex.reece@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/fb7001f Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:23:14 -07:00
Jerry Jelinek	788eb90c4c	Illumos 3897 - zfs filesystem and snapshot limits 3897 zfs filesystem and snapshot limits Author: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/a2afb61 Porting Notes: dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc(). Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:22:51 -07:00
tuxoko	ecfb0b5f42	Fix misuse of input argument in traverse_visitbp In traverse_visitbp(), the input argument dnp is modified in the middle to point to a temporary buffer. Originally this doesn't matter, because no user of TRAVERSE_POST dereferences it. However, in `fbeddd6` a piece of code is added dereferencing dnp after the modification, creating a possible bug. We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case, so we don't modify the input argument. Also we introduce different local variables in the DMU_OT_OBJSET case to prevent confusion between the input argument. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2060	2015-04-28 09:43:50 -07:00
Isaac Huang	0336f3d001	Remove useless variable spa_active_count This isn't required for the Linux port because the kernel tracks if a module is busy. The prototype for spa_busy() is also removed since its definition was already removed. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3262	2015-04-27 09:22:05 -07:00
Justin T. Gibbs	ec8501ee12	5313 Allow I/Os to be aggregated across ZIO priority classes Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@SpectraLogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5313 https://github.com/illumos/illumos-gate/commit/fe319232 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3280	2015-04-24 15:16:56 -07:00
Ned Bass	4eb30c6864	Serialize access to spa->spa_feat_stats nvlist The function spa_add_feature_stats() manipulates the shared nvlist spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to protect the list. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3335	2015-04-24 15:04:43 -07:00
Chunwei Chen	07012da668	Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb) The following panic would occur under certain heavy load: [ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held [ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P OE 3.18.10 #7 The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which would purge all tsd data for the thread. However, ZFS_ENTER is designed to be reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data. Instead, we rely on the new behavior of tsd_set. When NULL is passed as the new value to tsd_set, it will automatically remove the tsd entry specified the the key for the current thread. rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to do clean up. Now we explicitly call tsd_set(key, NULL) on them. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3247	2015-04-24 14:57:54 -07:00
Brian Behlendorf	a438ff0e85	Extend PF_FSTRANS critical regions Additional testing has shown that the region covered by PF_FSTRANS needs to be extended to cover the zpl_xattr_security_init() and init_acl() functions. The zpl_mark_dirty() function can also recurse and therefore must always be protected. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3331	2015-04-24 09:54:22 -07:00
Brian Behlendorf	7fad6290eb	Mark additional functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all NFS, xattr, ctldir, and super function calls. This is related to `40d06e3`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3225	2015-04-17 09:35:24 -07:00
Tim Chase	5074bfe8ad	Allocate zfs_znode_cache on the Linux slab The Linux slab, in general, performs better than the SPl slab in cases where a lot of objects are allocated and fragmentation is likely present. This patch fixes pathologically bad behavior in cases where the ARC is filled with mostly metadata and a user program needs to allocate and dirty enough memory which would require an insignificant amount of the ARC to be reclaimed. If zfs_znode_cache is on the SPL slab, the system may spin for a very long time trying to reclaim sufficient memory. If it is on the Linux slab, the behavior has been observed to be much more predictible; the memory is reclaimed more efficiently. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3283	2015-04-14 12:19:22 -07:00
Brian Behlendorf	f42d7f4111	Use vmem_alloc() in spa_config_write() The packed nvlist allocated in spa_config_write() may exceed the warning threshold for large configurations. Use the vmem interfaces for this short lived allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3251	2015-04-07 15:10:19 -07:00
Tim Chase	40d06e3c78	Mark all ZPL and ioctl functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl calls as well as the l2arc and adapt ARC threads. This obviates the need for MUTEX_FSTRANS so its previous uses and definition have been eliminated. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3225	2015-04-03 11:38:59 -07:00
Matthew Ahrens	0f7d2a4b3d	Illumus 5693 - ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override 5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5693 https://github.com/illumos/illumos-gate/commit/7f7ace3 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3231	2015-03-27 15:02:56 -07:00
George Wilson	b738bc5a0f	Illumos 5694 - traverse_prefetcher does not prefetch enough 5694 traverse_prefetcher does not prefetch enough Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5694 https://github.com/illumos/illumos-gate/commit/34d7ce05 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3230	2015-03-27 15:02:50 -07:00
Chris Dunlop	ee2f17aa2a	Align code with Illumos Align code in traverse_visitbp() with that in Illumos in preparation for applying Illumos-5694. No functional change: use a temporary variable pd to replace multiple occurrences of td->td_pfd. This increases our stack use slightly more then normal because the function is called recursively. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3230	2015-03-27 14:52:45 -07:00
Prakash Surya	a4069eef2e	Illumos 5695 - dmu_sync'ed holes do not retain birth time 5695 dmu_sync'ed holes do not retain birth time Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5695 https://github.com/illumos/illumos-gate/commit/70163ac Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3229	2015-03-27 14:51:34 -07:00
Ned Bass	58806b4cdc	dbuf_free_range() overzealously frees dbufs When called to free a spill block from a dnode, dbuf_free_range() has a bug that results in all dbufs for the dnode getting freed. A variety of problems may result from this bug, but a common one was a zap lookup tripping an ASSERT because the zap buffers had been zeroed out. This could happen on a dataset with xattr=sa set when extended attributes are written and removed on a directory concurrently with I/O to files in that directory. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #3195 Fixes #3204 Fixes #3222	2015-03-25 14:48:22 -07:00
Tim Chase	ded576e28f	Set the maximum ZVOL transfer size correctly ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS). This cap was removed in torvalds/linux@34b48db. This patch changes it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in dmu_tx_hold_write() to allow the maximum transfer size. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3212	2015-03-25 11:58:42 -07:00
Isaac Huang	e89bd69775	zio_injection_enabled should not be a module option The zio_inject.c keeps zio_injection_enabled as a counter of fault handlers, so it should not be exported to user space as a module option. Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c, where the symbols are defined. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3199	2015-03-24 13:22:03 -07:00
Chris Dunlop	d07b7c7f21	Reduce size of zfs_sb_t: allocate z_hold_mtx separately zfs_sb_t has grown to the point where using kmem_zalloc() for allocations is triggering the 32k warning threshold. We can't safely convert this entire allocation to use vmem_alloc() instead of kmem_alloc() because the backing_dev_info structure is embedded here. It depends on the bit_waitqueue() function which won't behave properly when given a virtual address. Instead, use vmem_alloc() to allocate the z_hold_mtx array separately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #3178	2015-03-24 13:17:44 -07:00
Brian Behlendorf	bc88866657	Fix arc_adjust_meta() behavior The goal of this function is to evict enough meta data buffers from the ARC in order to enforce the arc_meta_limit. Achieving this is slightly more complicated than it appears because it is common for data buffers to have holds on meta data buffers. In addition, dnode meta data buffers will be held by the dnodes in the block preventing them from being freed. This means we can't simply traverse the ARC and expect to always find enough unheld meta data buffer to release. Therefore, this function has been updated to make alternating passes over the ARC releasing data buffers and then newly unheld meta data buffers. This ensures forward progress is maintained and arc_meta_used will decrease. Normally this is sufficient, but if required the ARC will call the registered prune callbacks causing dentry and inodes to be dropped from the VFS cache. This will make dnode meta data buffers available for reclaim. The number of total restarts in limited by zfs_arc_meta_adjust_restarts to prevent spinning in the rare case where all meta data is pinned. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	2cbb06b561	Restructure per-filesystem reclaim Originally when the ARC prune callback was introduced the idea was to register a single callback for the ZPL. The ARC could invoke this call back if it needed the ZPL to drop dentries, inodes, or other cache objects which might be pinning buffers in the ARC. The ZPL would iterate over all ZFS super blocks and perform the reclaim. For the most part this design has worked well but due to limitations in 2.6.35 and earlier kernels there were some problems. This patch is designed to address those issues. 1) iterate_supers_type() is not provided by all kernels which makes it impossible to safely iterate over all zpl_fs_type filesystems in a single callback. The most straight forward and portable way to resolve this is to register a callback per-filesystem during mount. The arc_*_prune_callback() functions have always supported multiple callbacks so this is functionally a very small change. 2) Commit `050d22b` removed the non-portable shrink_dcache_memory() and shrink_icache_memory() functions and didn't replace them with equivalent functionality. This meant that for Linux 3.1 and older kernels the ARC had no mechanism to drop dentries and inodes from the caches if needed. This patch adds that missing functionality by calling shrink_dcache_parent() to release dentries which may be pinning inodes. This will result in all unused cache entries being dropped which is a bit heavy handed but it's the only interface available for old kernels. 3) A zpl_drop_inode() callback is registered for kernels older than 2.6.35 which do not support the .evict_inode callback. This ensures that when the last reference on an inode is dropped it is immediately removed from the cache. If this isn't done than inode can end up on the global unused LRU with no mechanism available to ZFS to drop them. Since the ARC buffers are not dropped the hottest inodes can still be recreated without performing disk IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	596a8935a1	Fix arc_meta_max accounting The arc_meta_max value should be increased when space it consumed not when it is returned. This ensure's that arc_meta_max is always up to date. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Chunwei Chen	40749aa7a6	Use MUTEX_FSTRANS on l2arc_buflist_mtx Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock scenario: 1. arc_release() -> hash_lock -> l2arc_buflist_mtx 2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) -> arc_buf_remove_ref() -> hash_lock Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3160	2015-03-18 09:29:38 -07:00
Justin T. Gibbs	4c7b7eedcd	Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption 5630 stale bonus buffer in recycled dnode_t leads to data corruption Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5630 https://github.com/illumos/illumos-gate/commit/cd485b4 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3172	2015-03-12 15:40:39 -07:00
Josef 'Jeff' Sipek	73ad4a9f3c	Illumos 5047 - don't use atomic__nv if you discard the return value 5047 don't use atomic__nv if you discard the return value Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5047 https://github.com/illumos/illumos-gate/commit/640c167 Porting Notes: Several hunks from the original patch where not specific to ZFS and thus were dropped. Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3172	2015-03-12 15:40:33 -07:00
Brian Behlendorf	7f3e466283	Mark zfs_inactive() with PF_FSTRANS Allowing direct reclaim to re-enter the VFS in the zfs_inactive() call path has historically been problematic for ZoL. Therefore, in order to avoid an entire class of current and future issues caused by this PF_FSTRANS is set for all zfs_inactive() callers. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3163	2015-03-10 09:21:48 -07:00
Ned Bass	417104bdd3	Use cached feature info in spa_add_feature_stats() Avoid issuing I/O to the pool when retrieving feature flags information. Trying to read the ZAPs from disk means that zpool clear would hang if the pool is suspended and recovery would require a reboot. To keep the feature stats resident in memory, we hang a cached nvlist off of the spa. It is built up from disk the first time spa_add_feature_stats() is called, and refreshed thereafter using the cached feature reference counts. spa_add_feature_stats() gets called at pool import time so we can be sure the cached nvlist will be available if the pool is later suspended. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3082	2015-03-05 14:11:10 -08:00
Brian Behlendorf	989fd514b1	Change ASSERT(!"...") to cmn_err(CE_PANIC, ...) There are a handful of ASSERT(!"...")'s throughout the code base for cases which should be impossible. This patch converts them to use cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that additional debugging is logged if they were to occur. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1445	2015-03-03 13:22:21 -08:00
Brian Behlendorf	8c45def24a	Linux 4.0 compat: bdi_setup_and_register() The 'capabilities' argument which was passed to bdi_setup_and_register() has been removed. File systems should no longer pass BDI_CAP_MAP_COPY. For our purposes this means there are now three different interfaces which must be handled. A zpl_bdi_setup_and_register() wrapper function has been introduced to provide a single interface to the ZPL code. * 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported. * 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments. * 4.0 - x.y, bdi_setup_and_register() takes 2 arguments. I've also taken this opportunity to remove HAVE_BDI because kernels older then 2.6.32 are no longer supported. All kernels newer than this will have one of the above interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #3128	2015-03-03 10:49:45 -08:00
Brian Behlendorf	4ec15b8dcf	Use MUTEX_FSTRANS mutex type There are regions in the ZFS code where it is desirable to be able to be set PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to set/clear this flag in all the correct places, but this is undesirable for a few reasons. 1) It would require changes to a significant amount of the ZFS code. This would complicate applying patches from upstream. 2) It would be easy to accidentally miss a critical region in the initial patch or to have an future change introduce a new one. Both of these concerns can be addressed by using a new mutex type which is responsible for managing PF_FSTRANS, support for which was added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch 'kmem-rework'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3050 Closes #3055 Closes #3062 Closes #3132 Closes #3142 Closes #2983	2015-03-03 10:46:40 -08:00
Isaac Huang	d14cfd83da	Fix deadlock between zpool export and zfs list Pool reference count is NOT checked in spa_export_common() if the pool has been imported readonly=on, i.e. spa->spa_sync_on is FALSE. Then zpool export and zfs list may deadlock: 1. Pool A is imported readonly. 2. zpool export A and zfs list are run concurrently. 3. zfs command gets reference on the spa, which holds a dbuf on on the MOS meta dnode. 4. zpool command grabs spa_namespace_lock, and tries to evict dbufs of the MOS meta dnode. The dbuf held by zfs command can't be evicted as its reference count is not 0. 5. zpool command blocks in dnode_special_close() waiting for the MOS meta dnode reference count to drop to 0, with spa_namespace_lock held. 6. zfs command tries to get the spa_namespace_lock with a reference on the spa held, which holds a dbuf on the MOS meta dnode. 7. Now zpool command and zfs command deadlock each other. Also any further zfs/zpool command will block on spa_namespace_lock forever. The fix is to always check pool reference count in spa_export_common(), no matter whether the pool was imported readonly or not. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2034	2015-03-02 11:50:06 -08:00
Brian Behlendorf	87a63dd702	Prevent "zpool destroy\|export" when suspended Cleanly destroying or exporting a pool requires that the pool not be suspended. Therefore, set the POOL_CHECK_SUSPENDED flag for these ioctls so the utilities will output a descriptive error message rather than block. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2878	2015-03-02 11:50:06 -08:00
Brian Behlendorf	b4f3666a16	Retire spl_module_init()/spl_module_fini() In the original implementation of the SPL wrappers were provided for module initialization and cleanup. This was done to abstract away any compatibility code which might be needed for the SPL. As it turned out the only significant compatibility issue was that the default pwd during module load differed under Illumos and Linux. Since this is such as minor thing and the wrappers complicate the code they are being retired. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2985	2015-02-24 11:37:44 -08:00
Brian Behlendorf	1efdc45ea8	Fix O_APPEND open(2) flag As described in flags section of open(2): O_APPEND: The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). O_APPEND may lead to corrupted files on NFS filesys- tems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition. This issue was originally overlooked because normally the generic VFS code handles this for a filesystem. However, because ZFS explictly registers a zpl_write() function it's responsible for the seek. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3124	2015-02-24 11:21:54 -08:00
Dan Swartzendruber	1611bb7b4f	Set zfs_autoimport_disable default value to 1 When loading the ZFS kernel modules they should not populate the spa namespace using the cache file. This behavior isn't consistent with other Linux kernel modules and we need to move away from it. Removing this makes the whole startup process predictable with four basic steps which are driven by the init system. 1) modprobe 2) zpool import 3) zfs mount 4) zfs share This change also helps lay the groundwork for eventually removing the kobj_* compatibility code on the kernel side. It may need to be preserved in userspace because libzfs_init() depends on it. This is why the conditional must be wrapped with an #ifdef _KERNEL. Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2820	2015-02-17 16:09:41 -08:00
Brian Behlendorf	7d2868d5fc	Skip bad DVAs during free by setting zfs_recover=1 When a bad DVA is encountered in metaslab_free_dva() the system should treat it as fatal. This indicates that somehow a damaged DVA was written to disk and that should be impossible. However, we have seen a handful of reports over the years of pools somehow being damaged in this way. Since this damage can render otherwise intact pools unimportable, and the consequence of skipping the bad DVA is only leaked free space, it makes sense to provide a mechanism to ignore the bad DVA. Setting the zfs_recover=1 module option will cause the DVA to be ignored which may allow the pool to be imported. Since zfs_recover=0 by default any pool attempting to free a bad DVA will treat it as a fatal error preserving the current behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3099 Issue #3090 Issue #2720	2015-02-13 16:02:04 -08:00
Andrey Vesnovaty	5f15fa2216	Fix readdir for .zfs/snapshot directory dmu_snapshot_list_next stores the index of the next snapshot entry to the offp argument, which zpl_snapdir_iterate then uses for the dir_emit. This result in an off-by-one error. Therefore a temporary variable should be used. This was a regression introduced in commit zfsonlinux/zfs@0f37d0c. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2930	2015-02-10 16:34:30 -08:00
Brian Behlendorf	3941503c0a	Retire zio_cons()/zio_dest() The zio_cons() constructor and zio_dest() destructor don't exist in the upstream Illumos code. They were introduced as a workaround to avoid issue #2523. Since this issue has now been resolved this code is being reverted to bring ZoL back in sync with Illumos. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:09:49 -08:00
Brian Behlendorf	6442f3cfe3	Retire zio_bulk_flags Long ago the zio_bulk_flags module parameter was introduced to facilitate debugging and profiling the zio_buf_caches. Today this code works well and there's no compelling reason to keep this functionality. In fact it's preferable to revert this so the code is more consistent with other ZFS implementations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:08:49 -08:00
Jörg Thalheim	534759fad3	Linux 3.19 compat: file_inode was added struct access f->f_dentry->d_inode was replaced by accessor function file_inode(f) Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3084	2015-02-10 11:24:51 -08:00
Brian Behlendorf	77aef6f60e	Use vmem_alloc() for nvlists Several of the nvlist functions may perform allocations larger than the 32k warning threshold. Convert them to use vmem_alloc() so the best allocator is used. Commit `efcd79a` retired KM_NODEBUG which was used to suppress large allocation warnings. Concurrently the large allocation warning threshold was increased from 8k to 32k. The goal was to identify the remaining locations, such as this one, where the allocation can be larger than 32k. This patch is expected fine tuning resulting for the kmem-rework changes, see commit `6e9710f`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3057 Closes #3079 Closes #3081	2015-02-10 11:00:08 -08:00
Brian Behlendorf	afe373260e	Revert "Don't read space maps during import for readonly pools" This reverts commit `7fc8c33ede` which accidentally introduced a ztest failure. ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2 child exited with code 3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-02-09 16:56:59 -08:00
Brian Behlendorf	7fc8c33ede	Don't read space maps during import for readonly pools Normally when importing a pool the space maps for all top level vdevs are read from disk. The space maps will be required latter when an allocation is performed and free blocks need to be located. However, if the pool is imported readonly then we are guaranteed that no allocations can occur. In this case the space maps need not be loaded.. A similar argument can be made for the DTLs (dirty time logs). Because a pool import will fail if the space maps cannot be read. The ability to safely ignore them makes it more likely that a damaged pool can be imported readonly to recover its contents. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2831	2015-02-09 16:43:03 -08:00
Justin T. Gibbs	33b4de513e	Illumos 5311 - traverse_dnode may report success when it should not 5311 traverse_dnode may report success when it should not Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@spectralogic.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/2a89c2c https://www.illumos.org/issues/5311 Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2970	2015-02-06 12:07:15 -08:00
Ned Bass	a62d1b02e3	Fix SA header size accounting The functions sa_find_sizes() and sa_build_layout() fail to account for the additional 2 bytes of SA header space when calculating whether a variable size attribute might spill over. They may consequently determine that an attribute will fit in the bonus buffer along with a spill block pointer, when in reality the attribute would be partially overwritten by the spill block pointer if spill over occurs. This also causes an inconsistency between the SA header size and the number of variable size attributes in the layout, tripping an assertion when debugging is on. The following reproducer demonstrates the problem. ln -s $(perl -e 'print "z" x 20') file setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file Even though sa_find_sizes() computes the index of the attribute where spill-over will occur, sa_build_layouts() discards the result and recomputes it itself. As it turns out, both functions get it wrong. Since this computation is awkward and, as history has shown, easy to screw up, let's just do it in one place. This patch fixes the bug in sa_find_sizes() and updates sa_build_layout() to use the result computed there. Also improve the comments in sa_find_sizes(). Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3070	2015-02-06 09:26:46 -08:00
Brian Behlendorf	e2c4acde55	Skip evicting dbufs when walking the dbuf hash When a dbuf is in the DB_EVICTING state it may no longer be on the dn_dbufs list. In which case it's unsafe to call DB_DNODE_ENTER. Therefore, any dbuf which is found in this safe must be skipped. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2553 Closes #2495	2015-02-06 09:24:28 -08:00
Tim Chase	aa2ef419e4	Spurious ENOMEM returns when reading dbufs kstat Commit `7b2d78a046` fixed some improper uses of snprintf(), however, in __dbuf_stats_hash_table_data() the return value of snprintf is propagated to the caller. This caused spurious ENOMEM errors when reading the dbufs kstat. This commit causes the actual number of characters written to be returned. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3072	2015-02-04 16:35:26 -08:00
avg	037763e44e	fix l2arc compression buffers leak Commit log from FreeBSD: We have observed that arc_release() can be called concurrently with a l2arc in-flight write. Also, we have observed that arc_hdr_destroy() can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar circumstances. Previously the l2arc headers would be freed while leaking their associated compression buffers. Now the buffers are placed on l2arc_free_on_write list for delayed freeing. This is similar to what was already done to arc buffers that were supposed to be freed concurrently with in-flight writes of those buffers. In addition to fixing the discovered leaks this change also adds some protective code to assert that a compression buffer associated with a l2arc header is never leaked. A new kstat l2_cdata_free_on_write is added. It keeps a count of delayed compression buffer frees which previously would have been leaks. Tested by: Vitalij Satanivskij <satan@ukr.net> et al Requested by: many MFC after: 2 weeks Sponsored by: HybridCluster / ClusterHQ References: https://illumos.org/issues/5222 https://github.com/freebsd/freebsd/commit/b98f85d http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781 http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3029	2015-02-03 16:54:16 -08:00
Brian Behlendorf	19ea3d25df	Use zio buffers in zil_itx_create() The zil_itx_create() function uses the vmem_alloc() allocator for its buffers because when logging a write that buffer may be as large as 64K. This is non-optimal because we may need to allocate many of of these buffers and this interface has the potential to be slow. Instead, use zio_data_buf_alloc() which is specifically designed to be able to efficiently allocate a wide range of buffer sizes. In addition, do some cleanup and use the zil_itx_destroy() function to always free an itx structure. This way we're always sure the right allocation functions are used. Notice that in the current code kmem_free() and vmem_free() were both used. This happened to work because these wrappers map to the same internal SPL function. This was identified as a potential problem when a low-end memory constrained system began logging the following warnings. There was no deadlock here just repeated allocation failures resulting in increased latency. Possible memory allocation deadlock: size=65792 lflags=0x42d0 Pid: 20118, comm: kvm Tainted: P O 3.2.0-0.bpo.4-amd64 Call Trace: [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl] [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl] [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs] [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs] [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs] [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs] [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3059	2015-02-02 11:20:41 -08:00
Brian Behlendorf	0365064a97	Handle closing an unopened ZVOL Thank to commit `a4430fce69` we're now correctly returning EROFS when opening a zvol on a read-only pool. Unfortunately, it looks like this causes us to trigger some unexpected behavior by __blkdev_get(). In the failure case it's possible __blkdev_get() will call __blkdev_put() for a bdev which was never successfully opened. This results in us trying to close the device again and hitting the NULL dereference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1343	2015-01-30 14:44:14 -08:00
Brian Behlendorf	a127e841de	Add zvol_open() error handling for readonly property Rather than ASSERT when for some reason the readonly property of a zvol can't be read cleanly handle the failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1343	2015-01-30 14:44:06 -08:00
Tim Chase	b0cf0676c0	Fix removal of SA in sa_modify_attrs() The sa_modify_attrs() function can add, remove or replace an SA. The main loop in the function uses the index "i" to iterate over the existing SAs and uses the index "j" for writing them into a new buffer via SA_ADD_BULK_ATTR(). The write index, "j" is incremented on remove (SA_REMOVE) operations which leads to a corruption in the new SA buffer. This patch remove the increment for SA_REMOVE operations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3028	2015-01-21 16:35:14 -08:00
Richard Yao	841c9d43c7	Use kmem_vasprintf() in log_internal() An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be simplified by using kmem_asprintf(). It is not clear that switching to kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to kmem_asprintf() is cleanup that simplifies debugging such that it would become clear that this is a bug in glibc should the issue persist. It also brings this function almost back in sync with Illumos. This was possible due to the recently reworked kmem code which allows us to use KM_SLEEP in the same fashion as Illumos. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2791 Issue #2781	2015-01-21 15:30:24 -08:00
Tim Chase	3c832b8cc1	Linux 3.12 compat: split shrinker has s_shrink The split count/scan shrinker callbacks introduced in 3.12 broke the test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers. This patch re-enables the per-superblock shrinkers when the split shrinker callbacks have been detected. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2975	2015-01-20 14:07:59 -08:00
Brian Behlendorf	81971b137a	Revert "SA spill block cache" The SA spill_cache was originally introduced to avoid the need to perform large kmem or vmem allocations. Instead a small dedicated cache of preallocated SA buffers was kept. This solution was viable while the maximum block size was limited to 128K. But with the planned increase of the maximum block size to 16M callers need to migrate to the zio_buf_alloc(). However, they should be aware this interface is expected to change again once the zio buffers are fully backed by scatter-gather lists. Alternately, if the callers know these buffers will never be large or be infrequently accessed they may kmem_alloc() or vmem_alloc() the needed temporary space. This change has the additional benegit of bringing the code back inline with the upstream Illumos source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:28 -08:00
Brian Behlendorf	285b29d959	Revert "Pre-allocate vdev I/O buffers" Commit `86dd0fd` added preallocated I/O buffers. This is no longer required after the recent kmem changes designed to make our memory allocation interfaces behave more like those found on Illumos. A deadlock in this situation is no longer possible. However, these allocations still have the potential to be expensive. So a potential future optimization might be to perform then KM_NOSLEEP so that they either succeed of fail quicky. Either case is acceptable here because we can safely abort the aggregation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:28 -08:00
Brian Behlendorf	79c76d5b65	Change KM_PUSHPAGE -> KM_SLEEP By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. For others fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc() which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers which allows us to dip in to reserved memory. This is again the same as upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:26 -08:00
Brian Behlendorf	efcd79a883	Retire KM_NODEBUG Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress the large allocation warning have been replaced by vmem_alloc() as appropriate. The updated vmem_alloc() call will not print a warning regardless of the size of the allocation. A careful reader will notice that not all callers have been changed to vmem_alloc(). Some have only had the KM_NODEBUG flag removed. This was possible because the default warning threshold has been increased to 32k. This is desirable because it minimizes the need for Linux specific code changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:40:32 -08:00
Richard Yao	71f8548ea4	Use is_vmalloc_addr() in vdev_disk.c The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to cleanup the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:28:05 -08:00
Brian Behlendorf	92119cc259	Mark IO pipeline with PF_FSTRANS In order to avoid deadlocking in the IO pipeline it is critical that pageout be avoided during direct memory reclaim. This ensures that the pipeline threads can always make forward progress and never end up blocking on a DMU transaction. For this very reason Linux now provides the PF_FSTRANS flag which may be set in the process context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:28:05 -08:00
Brian Behlendorf	d958324f97	Fix zfs_putpage() lock inversion (again) This is a follow up commit to `74328ee` which correctly resolved a lock inversion between zfs_putpage() and zfs_free_range(). Unfortunately, in the process it accidentally introduced another inversion between zfs_putpage() and zfs_read(). The page must be unlocked before taking the range lock. This patch corrects that issue. In addition, because the locking rules here are subtle a block comment has been added clearly explaining why the ordering here is critical. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #2976	2015-01-08 16:09:41 -08:00
Ned Bass	33b6dbbc51	Document zfs_flags module parameter Add a table describing the debugging flags that can be set in the zfs_flags module parameter. Also change the module_param type to 'uint' so users aren't shown a negative value. The updated man page text is reproduced below for convenience. zfs_flags (int) Set additional debugging flags. The following flags may be bitwise-or'd together. +-------------------------------------------------------+ \|Value Symbolic Name \| \| Description \| +-------------------------------------------------------+ \| 1 ZFS_DEBUG_DPRINTF \| \| Enable dprintf entries in the debug log. \| +-------------------------------------------------------+ \| 2 ZFS_DEBUG_DBUF_VERIFY * \| \| Enable extra dbuf verifications. \| +-------------------------------------------------------+ \| 4 ZFS_DEBUG_DNODE_VERIFY * \| \| Enable extra dnode verifications. \| +-------------------------------------------------------+ \| 8 ZFS_DEBUG_SNAPNAMES \| \| Enable snapshot name verification. \| +-------------------------------------------------------+ \| 16 ZFS_DEBUG_MODIFY \| \| Check for illegally modified ARC buffers. \| +-------------------------------------------------------+ \| 32 ZFS_DEBUG_SPA \| \| Enable spa_dbgmsg entries in the debug log. \| +-------------------------------------------------------+ \| 64 ZFS_DEBUG_ZIO_FREE \| \| Enable verification of block frees. \| +-------------------------------------------------------+ \| 128 ZFS_DEBUG_HISTOGRAM_VERIFY \| \| Enable extra spacemap histogram verifications. \| +-------------------------------------------------------+ * Requires debug build. Default value: 0. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2988	2015-01-07 15:50:49 -08:00
Ned Bass	49ee64e5e6	Remove duplicate typedefs from trace.h Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate typedef declarations with the same type. The trace.h header contains some typedefs to avoid 'unknown type' errors for C files that haven't declared the type in question. But this causes build failures for C files that have already declared the type. Newer versions of GCC (e.g. v4.6) allow duplicate typedefs with the same type unless pedantic error checking is in force. To support the older versions we need to remove the duplicate typedefs. Removal of the typedefs means we can't built tracepoints code using those types unless the required headers have been included. To facilitate this, all tracepoint event declarations have been moved out of trace.h into separate headers. Each new header is explicitly included from the C file that uses the events defined therein. The trace.h header is still indirectly included form zfs_context.h and provides the implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces. This makes those interfaces readily available throughout the code base. The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also still provided by trace.h, so it is a prerequisite for the other trace_*.h headers. These new Linux implementation-specific headers do introduce a small divergence from upstream ZFS in several core C files, but this should not present a significant maintenance burden. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2953	2015-01-06 16:53:24 -08:00
Brian Behlendorf	74328ee18f	Fix zfs_putpage() lock inversion There exists a lock inversions involving the zfs range lock and the individual page writeback bits which can result in a deadlock. To prevent this we must always manipulate the writeback bit while holding the range lock. The exact deadlock is as follows: ------ Process A ------ ------ Process B ------ zpl_writepages zpl_fallocate write_cache_pages zpl_fallocate_common zpl_putpage zfs_space zfs_putpage (set bit) zfs_freesp zfs_range_lock (wait on lock) zfs_free_range (take lock) [has not yet initiated I/O, truncate_inode_pages_range the bit will not be cleared] wait_on_page_writeback (wait on bit) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <richard.yao@clusterhq.com> Issue #2976	2014-12-22 09:31:56 -08:00
Boris Protopopov	9063f65476	Correct error returns to unify cross-pool operation error handling Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2911	2014-12-19 10:51:24 -08:00
Brian Behlendorf	c944be5d7e	Fix snapshots with dirty inodes Filesystems which are mounted read-only or are immutable because they are snapshots must not be allowed to dirty and inode. This will result in a write which will correctly cause a kernel panic because these filesystem are (and must be) immutable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2812	2014-11-20 10:38:16 -08:00
Isaac Huang	29b763cd2c	bio_alloc() with __GFP_WAIT never returns NULL Mark the error handling branch as unlikely() because the current kernel interface can never return NULL. However, we want to keep the error handling in case this behavior changes in the futre. Plus fix a small style issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #2703	2014-11-19 12:50:49 -05:00
Ned Bass	aaed7c408c	Explicitly include SPL compat headers Inclusion of SPL compatibility headers was moved out of the public header sys/types.h to avoid conflicts with external packages. Include a few compatiblity headers explicitly to cope with that change. Also, sort some linux-specific inclusions alphabetically. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2898	2014-11-19 12:30:39 -05:00
Ned Bass	7b2d78a046	Fix improper null-byte termination handling Fix a few cases where null-byte termination of strings was done unnecessarily or incorrectly. - The snprintf() function always produces a null-byte terminated string for non-negative return values, so it is not necessary to write out a null-byte as a separate step. - Also, it is unsafe to use the return value of snprintf() as an offset for placing a null-byte, because if the output was truncated the return value is the number of bytes that _would_ have been written had enough space been available. Therefore the return value may index beyond the array boundaries. - Finally, snprintf() accounts for the null-byte when limiting its output size, so there is no need to pass it a size parameter that is one less than the buffer size. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2875	2014-11-17 15:28:59 -08:00
smh	89b1cd6581	Prevent ZFS leaking pool free space When processing async destroys ZFS would leak space every txg timeout (5 seconds by default), if no writes occurred, until the pool is totally full. At this point it would be unfixable without a pool recreation. In addition if the machine was rebooted with the pool in this situation would fail to import on boot, hanging indefinitely, as the import process requires the ability to write data to the pool. Any attempts to query the pool status during the hung import would not return as the import holds the pool lock. The only way to import such a pool would be to specify -o readonly=on to the zpool import. zdb -bb <pool> can be used to check for "deferred free" size which is where this lost space will be counted. References: https://github.com/freebsd/freebsd/commit/48431b7 http://svnweb.freebsd.org/base?view=revision&revision=273158 https://reviews.csiden.org/r/132/ Porting notes: This issue was filed as illumos 5347 and a more comprehensive fix is under review. Once that change is finalized it will be integrated, in the meanwhile the FreeBSD fix has been merged to prevent the issue. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Ahrens mahrens@delphix.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2896	2014-11-17 11:35:38 -08:00
Tim Chase	4254acb057	Undirty freed spill blocks. If a spill block's dbuf hasn't yet been written when a spill block is freed, the unwritten version will still be written. This patch handles the case in which a spill block's dbuf is freed and undirties it to prevent it from being written. The most common case in which this could happen is when xattr=sa is being used and a long xattr is immediately replaced by a short xattr as in: setfattr -n user.test -v very_very_very..._long_value <file> setfattr -n user.test -v short_value <file> The first value must be sufficiently long that a spill block is generated and the second value must be short enough to not require a spill block. In practice, this would typically happen due to internal xattr operations as a result of setting acltype=posixacl. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2663 Closes #2700 Closes #2701 Closes #2717 Closes #2863 Closes #2884	2014-11-17 11:25:48 -08:00
Prakash Surya	0b39b9f96f	Swap DTRACE_PROBE* with Linux tracepoints This patch leverages Linux tracepoints from within the ZFS on Linux code base. It also refactors the debug code to bring it back in sync with Illumos. The information exported via tracepoints can be used for a variety of reasons (e.g. debugging, tuning, general exploration/understanding, etc). It is advantageous to use Linux tracepoints as the mechanism to export this kind of information (as opposed to something else) for a number of reasons: * A number of external tools can make use of our tracepoints "automatically" (e.g. perf, systemtap) * Tracepoints are designed to be extremely cheap when disabled * It's one of the "accepted" ways to export this kind of information; many other kernel subsystems use tracepoints too. Unfortunately, though, there are a few caveats as well: * Linux tracepoints appear to only be available to GPL licensed modules due to the way certain kernel functions are exported. Thus, to actually make use of the tracepoints introduced by this patch, one might have to patch and re-compile the kernel; exporting the necessary functions to non-GPL modules. * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux tracepoints are not available for unsigned kernel modules (tracepoints will get disabled due to the module's 'F' taint). Thus, one either has to sign the zfs kernel module prior to loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or newer. Assuming the above two requirements are satisfied, lets look at an example of how this patch can be used and what information it exposes (all commands run as 'root'): # list all zfs tracepoints available $ ls /sys/kernel/debug/tracing/events/zfs enable filter zfs_arc__delete zfs_arc__evict zfs_arc__hit zfs_arc__miss zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write zfs_new_state__mfu zfs_new_state__mru # enable all zfs tracepoints, clear the tracepoint ring buffer $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable $ echo 0 > /sys/kernel/debug/tracing/trace # import zpool called 'tank', inspect tracepoint data (each line was # truncated, they're too long for a commit message otherwise) $ zpool import tank $ cat /sys/kernel/debug/tracing/trace \| head -n35 # tracer: nop # # entries-in-buffer/entries-written: 1219/1219 #P:8 # # _-----=> irqs-off # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / delay # TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\| \| \| lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr... z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr... z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr... z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr... z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr... z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr... z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr... z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ... To highlight the kind of detailed information that is being exported using this infrastructure, I've taken the first tracepoint line from the output above and reformatted it such that it fits in 80 columns: lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr { dva 0x1:0x40082 birth 15491 cksum0 0x163edbff3a flags 0x640 datacnt 1 type 1 size 2048 spa 3133524293419867460 state_type 0 access 0 mru_hits 0 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 1 } bp { dva0 0x1:0x40082 dva1 0x1:0x3000e5 dva2 0x1:0x5a006e cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00 lsize 2048 } zb { objset 0 object 0 level -1 blkid 0 } For the specific tracepoint shown here, 'zfs_arc__miss', data is exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!). This kind of precise and detailed information can be extremely valuable when trying to answer certain kinds of questions. For anybody unfamiliar but looking to build on this, I found the XFS source code along with the following three web links to be extremely helpful: * http://lwn.net/Articles/379903/ * http://lwn.net/Articles/381064/ * http://lwn.net/Articles/383362/ I should also node the more "boring" aspects of this patch: * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to support a sixth paramter. This parameter is used to populate the contents of the new conftest.h file. If no sixth parameter is provided, conftest.h will be empty. * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced. This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro, except it has support for a fifth option that is then passed as the sixth parameter to ZFS_LINUX_COMPILE_IFELSE. These autoconf changes were needed to test the availability of the Linux tracepoint macros. Due to the odd nature of the Linux tracepoint macro API, a separate ".h" must be created (the path and filename is used internally by the kernel's define_trace.h file). * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This is to determine if we can safely enable the Linux tracepoint functionality. We need to selectively disable the tracepoint code due to the kernel exporting certain functions as GPL only. Without this check, the build process will fail at link time. In addition, the SET_ERROR macro was modified into a tracepoint as well. To do this, the 'sdt.h' file was moved into the 'include/sys' directory and now contains a userspace portion and a kernel space portion. The dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as well. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:55 -08:00
Ned Bass	29e57d15c8	Fix dprintf format specifiers Fix a few dprintf format specifiers that disagreed with their argument types. These came to light as compiler errors when converting dprintf to use the Linux trace buffer. Previously this wasn't a problem, presumably because the SPL debug logging uses vsnprintf which must perform automatic type conversion. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:45 -08:00
Ned Bass	59ec819a0c	Move a few internal ARC strucutres to arc_impl.h Add a new file named arc_impl.h and move a few internal ARC structure definitions into this file. This is needed in order to allow the Linux tracepoint functions to grub around in the internals of these structures. Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:27 -08:00
Prakash Surya	fb42a49328	Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO 5213 panic in metaslab_init due to space_map_open returning ENXIO Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com References: https://www.illumos.org/issues/5213 https://reviews.csiden.org/r/110 Porting notes: For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2745	2014-11-14 15:37:45 -08:00
Chris Wedgwood	b31d8ea77c	Reduce buf/dbuf mutex contention Due to evidence of contention both the buf_hash_table and the dbuf_hash_table sizes have been increased from 256 to 8192. This increase in hash table size adds approximating 0.5M to our fixed memory footprint. This relatively small increase is not expected to cause problems even on low memory machines. This footprint will also become dynamic when the persistent L2ARC support is finalized. In the meanwhile, this small change significantly reduces contention for certain workloads. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #1291	2014-11-14 14:59:21 -08:00
Alex Zhuravlev	0f69910833	Export symbols for ZIL interface These symbols are needed by consumers (i.e. Lustre) who wish to integrate with the ZIL. In addition the zil_rollback_destroy() prototype was removed because the implementation of this function was removed long ago. Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2892	2014-11-14 14:39:43 -08:00
Alexander Pyhalov	bb9d808c5a	Fix modules installation directory When building zfs modules with kernel, compiled from deb.src, the packaging process ends up installing the modules in the wrong place. Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2822	2014-10-28 09:46:14 -07:00
Tim Chase	ed6e9cc235	Linux 3.12 compat: shrinker semantics The new shrinker API as of Linux 3.12 modifies "struct shrinker" by replacing the @shrink callback with the pair of @count_objects and @scan_objects. It also requires the return value of @count_objects to return the number of objects actually freed whereas the previous @shrink callback returned the number of remaining freeable objects. This patch adds support for the new @scan_objects return value semantics. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2837	2014-10-28 09:34:51 -07:00
Matthew Ahrens	9635861742	Illumos 5164-5165 - space map fixes 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled space map_histogram feature Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5164 https://www.illumos.org/issues/5165 https://github.com/illumos/illumos-gate/commit/b1be289 Porting Notes: The metaslab_fragmentation() hunk was dropped from this patch because it was already resolved by commit `8b0a084`. The comment modified in metaslab.c was updated to use the correct variable name, space_map_blksz. The upstream commit incorrectly used space_map_blksize. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Alex Reece	b02fe35d37	Illumos 4958 zdb trips assert on pools with ashift >= 0xe 4958 zdb trips assert on pools with ashift >= 0xe Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4958 https://github.com/illumos/illumos-gate/commit/2a104a5 Porting notes: Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present in Linux but not yet in *BSD. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Brian Behlendorf	5f6d0b6f5a	Handle block pointers with a corrupt logical size The general strategy used by ZFS to verify that blocks are valid is to checksum everything. This has the advantage of being extremely robust and generically applicable regardless of the contents of the block. If a blocks checksum is valid then its contents are trusted by the higher layers. This system works exceptionally well as long as bad data is never written with a valid checksum. If this does somehow occur due to a software bug or a memory bit-flip on a non-ECC system it may result in kernel panic. One such place where this could occur is if somehow the logical size stored in a block pointer exceeds the maximum block size. This will result in an attempt to allocate a buffer greater than the maximum block size causing a system panic. To prevent this from happening the arc_read() function has been updated to detect this specific case. If a block pointer with an invalid logical size is passed it will treat the block as if it contained a checksum error. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2678	2014-10-23 09:20:52 -07:00
Ned Bass	bc151f7b31	Remove checks for mandatory locks The Linux VFS handles mandatory locks generically so we shouldn't need to check for conflicting locks in zfs_read(), zfs_write(), or zfs_freesp(). Linux 3.18 removed the lock_may_read() and lock_may_write() interfaces which we were relying on for this purpose. Rather than emulating those interfaces we remove the redundant checks. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2804	2014-10-22 11:06:53 -07:00
Matthew Ahrens	88904bb3e3	Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy 5162 zfs recv should use loaned arc buffer to avoid copy Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5162 https://github.com/illumos/illumos-gate/commit/8a90470 Porting notes: Fix spelling error 's/arena/area/' in dmu.c. In restore_write() declare bonus and abuf at the top of the function. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2696	2014-10-21 16:32:11 -07:00
Matthew Ahrens	4b20a6f509	Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness When a clone is created of a snapshot that has been marked for deferred destroy (with "zfs destroy -d"), the clone "inherits" the defer_destroy flag from the origin, and any snapshots of the clone "inherit" the defer_destroy flag from the clone. This causes a strange situation where the clone's snapshots are marked for defer_destroy but they have no holds or clones. If the clone's snapshot gets a hold or clone, which is then deleted, we will honor the incorrectly-set defer_destroy flag and delete the snapshot! Steps to reproduce: * zpool create test c1t1d0 * zfs create test/fs * zfs snapshot test/fs@a * zfs clone test/fs@a test/clone * zfs destroy -d test/fs@a * zfs clone test/fs@a test/clone2 * zfs snapshot test/clone2@a * zfs hold hld test/clone2@a * zfs release hld test/clone2@a * zfs list -r -t all test <test/clone2@a has been destroyed> We noticed that this causes dcenter to get very confused, because it treats snapshots that are marked defer_destroy as not existing. So it won't see any snapshots of the clone that's marked defer_destroy. 5150 - zfs clone of a defer_destroy snapshot causes strangeness Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/projects/illumos-gate//issues/5150 https://github.com/illumos/illumos-gate/commit/42fcb65 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2690	2014-10-21 15:26:58 -07:00
Matthew Ahrens	6c59307a3c	Illumos 3693 - restore_object uses at least two transactions to restore an object Restore_object should not use two transactions to restore an object: * one transaction is used for dmu_object_claim * another transaction is used to set compression, checksum and most importantly bonus data * furthermore dmu_object_reclaim internally uses multiple transactions * dmu_free_long_range frees chunks in separate transactions * dnode_reallocate is executed in a distinct transaction The fact the dnode_allocate/dnode_reallocate are executed in one transaction and bonus (re-)population is executed in a different transaction may lead to violation of ZFS consistency assertions if the transactions are assigned to different transaction groups. Also, if the first transaction group is successfully written to a permanent storage, but the second transaction is lost, then an invalid dnode may be created on the stable storage. 3693 restore_object uses at least two transactions to restore an object Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com> Approved by: Robert Mustacchi <rm@joyent.com> Original authors: Matthew Ahrens and Andriy Gapon References: https://www.illumos.org/issues/3693 https://github.com/illumos/illumos-gate/commit/e77d42e Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2689	2014-10-21 15:26:50 -07:00
Tim Chase	356d9ed4c8	Don't perform ACL-to-mode translation on empty ACL In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to create a traditional mode value based on an ACL. If no ACL exists, this processing shouldn't be done. Problems caused by this were most evident on version 4 filesystems which not only don't have system attributes, and also frequently have empty ACLs. On such filesystems, performing a chown() operation could have the effect of dirtying the mode bits in memory but not on the file system as follows: # create a file with typical mode of 664 echo test > test chown anyuser test ls -l test and the mode will show up as all zeroes. Unmounting/mounting and/or exporting/importing the filesystem will reveal the proper mode again. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1264	2014-10-21 09:23:27 -07:00
Daniil Lunev	62bdd5eb7a	Illumos 4924 - LZ4 Compression for metadata Reviewed by Matthew Ahrens <mahrens@delphix.com> Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://github.com/illumos/illumos-gate/commit/b8289d2 https://www.illumos.org/issues/3756 Porting notes: The static function zfs_prop_activate_feature() was removed because this change removes the only caller. The function was not removed from Illumos but instead left as dead code. However, to keep gcc happy it was removed from Linux and may be easily restored if needed. Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1540	2014-10-20 16:17:49 -07:00
Brian Behlendorf	ba232d8aea	Suppress AIO kmem warnings The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc() to allocate enough memory to hold the vectorized IO. While this allocation will be small it's been observed in practice to sometimes slightly exceed the 8K warning threshold by a few kilobytes. Therefore, the KM_NODEBUG flag has been added to suppress warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2774	2014-10-20 16:10:25 -07:00
Brian Behlendorf	33074f2254	Handle NULL mirror child vdev When selecting a mirror child it's possible that map allocated by vdev_mirror_map_allc() contains a NULL for the child vdev. In this case the child should be skipped and the read issues to another member of the mirror. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1744	2014-10-17 14:59:05 -07:00
Brian Behlendorf	f0e324f25d	Update utsname support Modify the code to use the utsname() kernel function rather than a global variable. This results is cleaner more portable code because utsname() is already provided by the kernel and can be easily emulated in user space via uname(2). This means that it will behave consistently in both contexts. This is also has the benefit that it allows the removal of a few _KERNEL pre-processor conditions. And it also is a pre-requisite for a proper FUSE port because we need to provide a valid utsname. Finally, it allows us to remove this functionality from the SPL and all the related compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:57 -07:00
Brian Behlendorf	050d22b068	Remove shrink_dcache_memory() and shrink_icache_memory() This functionality is optional and until Linux 3.0, which provided per-filesystem shinkers, they was never a reasonable interface. Therefore, this functionality is being dropped for earlier kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:50 -07:00
Brian Behlendorf	60bba62814	Update code to use misc_register()/misc_deregister() When ZPIOS was originally written it was designed to use the device_create() and device_destroy() functions. Unfortunately, these functions changed considerably over the years making them difficult to rely on. As it turns out a better choice would have been to use the misc_register()/misc_deregister() functions. This interface for registering character devices has remained stable, is simple, and provides everything we need. Therefore the code has been reworked to use this interface. The higher level ZFS code has always depended on these same interfaces so this is also as a step towards minimizing our kernel dependencies. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:44 -07:00
Brian Behlendorf	e6659763c6	Improve VERIFY() error in dmu_write() This is a debug patch designed to ensure an error code is logged to the console when this VERIFY() is hit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #1440	2014-10-08 09:18:14 -07:00
Brian Behlendorf	8878261fc9	Fix CPU_SEQID use in preemptible context Commit `e022864` introduced a regression for kernels which are built with CONFIG_DEBUG_PREEMPT. The use of CPU_SEQID in a preemptible context causes zio_nowait() to trigger the BUG. Since CPU_SEQID is simply being used as a random index the usage here is safe. To resolve the issue preempt is disable while calling CPU_SEQID. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2769	2014-10-07 16:40:29 -07:00
Matthew Ahrens	e022864d19	Illumos 5176 - lock contention on godfather zio 5176 lock contention on godfather zio Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5176 https://github.com/illumos/illumos-gate/commit/6f834bc Porting notes: Under Linux max_ncpus is defined as num_possible_cpus(). This is largest number of cpu ids which might be available during the life time of the system boot. This value can be larger than the number of present cpus if CONFIG_HOTPLUG_CPU is defined. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2711	2014-10-07 11:24:24 -07:00
Richard Yao	83e9986f6e	Implement -t option to zpool create for temporary pool names Creating virtual machines that have their rootfs on ZFS on hosts that have their rootfs on ZFS causes SPA namespace collisions when the standard name rpool is used. The solution is either to give each guest pool a name unique to the host, which is not always desireable, or boot a VM environment containing an ISO image to install it, which is cumbersome. `26b42f3f9d` introduced `zpool import -t ...` to simplify situations where a host must access a guest's pool when there is a SPA namespace conflict. We build upon that to introduce `zpool import -t tname ...`. That allows us to create a pool whose in-core name is tname, but whose on-disk name is the normal name specified. This simplifies the creation of machine images that use a rootfs on ZFS. That benefits not only real world deployments, but also ZFSOnLinux development by decreasing the time needed to perform rootfs on ZFS experiments. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2417	2014-09-30 10:46:59 -07:00
Tim Chase	cb08f06307	Perform whole-page page truncation for hole-punching under a range lock As an attempt to perform the page truncation more optimally, the hole-punching support added in `223df0161f` truncated performed the operation in two steps: first, sub-page "stubs" were zeroed under the range lock in zfs_free_range() using the new zfs_zero_partial_page() function and then the whole pages were truncated within zfs_freesp(). This left a window of opportunity during which the full pages could be touched. This patch closes the window by moving the whole-page truncation into zfs_free_range() under the range lock. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2733	2014-09-29 09:22:03 -07:00
Max Grossman	36283ca233	Illumos 5138 - add tunable for maximum number of blocks freed in one txg Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5138 https://github.com/illumos/illumos-gate/commit/af3465d Porting notes: Because support for exposing a uint64_t parameter wasn't added until v3.17-rc1 the zfs_free_max_blocks variable has been declared as a unsigned long. This is already far larger than required and it allows us to avoid additional autoconf compatibility code. The default value has been set to 100,000 on Linux instead of ULONG_MAX which is used on Illumos. This was done to limit the number of outstanding IOs in the system when snapshots are destroyed. This helps ensure individual TXG sync times are kept reasonable and memory isn't wasted managing a huge backlog of outstanding IOs. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2675 Closes #2581	2014-09-23 14:26:34 -07:00
Alex Reece	acbad6ff67	Illumos 4753 - increase number of outstanding async writes when sync task is waiting Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4753 https://github.com/illumos/illumos-gate/commit/73527f4 Comments by Matt Ahrens from the issue tracker: When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number). Initially we might just have a tunable for "minimum async writes while a synctask is waiting" and set it to 3. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2716	2014-09-23 13:50:55 -07:00
Matthew Ahrens	d97aa48f7c	Illumos 5139 - SEEK_HOLE failed to report a hole at end of file 5139 SEEK_HOLE failed to report a hole at end of file Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Peng Dai <peng.dai@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5139 https://github.com/illumos/illumos-gate/commit/0fbc0cd Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2714	2014-09-23 10:38:45 -07:00
Richard Yao	485c581c41	Fix function call with uninitialized value in vdev_inuse LLVM's static analyzer reported that we could pass an uninitialized pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct. An attempt to repurpose a spare or L2ARC drive from an exported pool will cause the pool_guid passed to spa_by_guid() to be unintialized information from the stack. This will cause non-deterministic behavior. Since there is no reason why we cannot repurpose such disks, we modify vdev_inuse() to avoid calling spa_by_guid() when they are detected. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2330	2014-09-23 10:32:45 -07:00
Matthew Ahrens	b8bcca18f7	Illumos 5161 - add tunable for number of metaslabs per vdev 5161 add tunable for number of metaslabs per vdev Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5161 https://github.com/illumos/illumos-gate/commit/bf3e216 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2698	2014-09-23 10:00:02 -07:00
Matthew Ahrens	ebcf49365a	Illumos 5177 - remove dead code from dsl_scan.c 5177 remove dead code from dsl_scan.c Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5177 https://github.com/illumos/illumos-gate/commit/5f37736 Porting notes: The local variable 'buf' was removed from dsl_scan_visitbp(). This wasn't part of the original patch but it should have been. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2712	2014-09-22 15:52:58 -07:00
Adam Leventhal	64dbba3679	Illumos 5174 - add sdt probe for blocked read in dbuf_read() 5174 add sdt probe for blocked read in dbuf_read() Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5174 https://github.com/illumos/illumos-gate/commit/f6164ad Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2710	2014-09-22 14:20:25 -07:00
Matthew Ahrens	6d9036f350	Illumos 5140 - message about "%recv could not be opened" is printed when booting after crash Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/projects/illumos-gate//issues/5140 https://github.com/illumos/illumos-gate/commit/2243853 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2676	2014-09-18 15:04:59 -07:00
Brian Behlendorf	2d50158343	Fix z_teardown_inactive_lock deadlock When rolling back a mounted filesystem zfs_suspend() is called which acquires the z_teardown_inactive_lock. This lock can not be dropped until the filesystem has been rolled back and resumed in zfs_resume_fs(). Therefore, we must not call iput() under this lock because it may result in the inode->evict() handler being called which also takes this lock. Instead use zfs_iput_async() to ensure dropping the last reference is deferred and runs in a safe context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2670	2014-09-11 09:11:45 -07:00
Tim Chase	223df0161f	Implement fallocate FALLOC_FL_PUNCH_HOLE Add support for the FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE mode of fallocate(2). Mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add the zfs_zero_partial_page() function, modeled after update_page(), to handle zeroing partial pages in a hole-punching operation. It must be used under a range lock for the requested region in order that the ARC and page cache stay in sync. Move the existing page cache truncation via truncate_setsize() into zfs_freesp() for better source structure compatibility with upstream code. Add page cache truncation to zfs_freesp() and zfs_free_range() to handle hole punching. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2619	2014-09-08 13:52:25 -07:00
George Wilson	4f68d7878f	Illumos 5117 - spacemap reallocation can cause corruption 5117 space map reallocation can cause corruption Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/projects/illumos-gate/issues/5117 https://github.com/illumos/illumos-gate/commit/e503a68 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2662	2014-09-08 09:42:39 -07:00
Brian Behlendorf	ceb49b0acd	Add object type checking to zap_lockdir() If a non-ZAP object is passed to zap_lockdir() it will be treated as a valid ZAP object. This can result in zap_lockdir() attempting to read what it believes are leaf blocks from invalid disk locations. The SCSI layer will eventually generate errors for these bogus IOs but the caller will hang in zap_get_leaf_byblk(). The good news is that is a situation which can not occur unless the pool has been damaged. The bad news is that there are reports from both FreeBSD and Solaris of damaged pools. Specifically, there are normal files in the filesystem which reference another normal file as their parent. Since pools like this are known to exist the zap_lockdir() function has been updated to verify the type of the object. If a non-ZAP object has been passed it EINVAL will be returned immediately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2597 Issue #2602	2014-09-08 09:15:38 -07:00
Richard Yao	cd3939c5f0	Linux AIO Support nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fallback to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead and we implement fops->aio_read and fops->aio_write to eliminate it. ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync. One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this: 1. Other platforms do not implement aio_write() synchronously and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem. 2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software. 3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate. 4. Making an asynchronous write synchronous is non sequitur. Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software. It should be noted that this is also a problem in the kernel itself where nfsd does not pass O_SYNC on files opened with it and instead relies on a open()/write()/close() to enforce synchronous behavior when the flush is only guarenteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any file system that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO. It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification: ``` Which operations are cancelable is implementation-defined. ``` http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio use it according to a recursive grep of my system's `/usr/src/debug`. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date when it is clear that there are consumers that benefit from its implementation to justify the work. Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility whle on Illumos, it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #223 Closes #2373	2014-09-05 15:11:43 -07:00
Alex Reece	f38dfec3fd	Illumos 5049 - panic when removing log device Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <skiselkov@gmail.com> Approved by: Rich Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5049 https://github.com/illumos/illumos-gate/commit/2986efa Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2636	2014-09-05 09:06:27 -07:00
Stanislav Seletskiy	2078f21015	Fix invalid locking order in rename operation This commit should prevent a deadlock on dp_config_rwlock when running `zfs rename` by ensuring zvol_rename_minors() is not called under this lock. Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2652. Closes #2525.	2014-09-04 09:50:46 -07:00
Alexey Smirnoff	0dfc732416	Change the default 'zfs_dedup_prefetch' value to '0' This gives a huge performance improvement in operations with deduped datasets especially when the bottleneck is the amount of ram available for zfs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2639	2014-09-04 09:50:45 -07:00
Dan Swartzendruber	287be44f53	Improve handling of filesystem versions Change mount code to diagnose filesystem versions that are not supported by the current implementation. Change upgrade code to do likewise and refuse to upgrade a pool if any filesystems on it are a version which is not supported by the current implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Closes: #2616	2014-09-03 09:17:14 -07:00
Matthew Ahrens	dea377c0d9	Illumos 4970-4974 - extreme rewind enhancements 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4970 https://www.illumos.org/issues/4971 https://www.illumos.org/issues/4972 https://www.illumos.org/issues/4973 https://www.illumos.org/issues/4974 https://github.com/illumos/illumos-gate/commit/e42d205 Notes: This set of patches adds a set of tunable parameters for the "extreme rewind" mode of pool import which allows control over the traversal performed during such an import. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2598	2014-08-26 16:29:57 -07:00
Matthew Ahrens	49ddb31506	Illumos 5034 - ARC's buf_hash_table is too small 5034 ARC's buf_hash_table is too small Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5034 https://github.com/illumos/illumos-gate/commit/63e911b Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2615	2014-08-26 16:14:49 -07:00
Isaac Huang	0426c16804	Fixed memory leaks in zevent handling Some nvlist_t could be leaked in error handling paths. Also make sure cb argument to zfs_zevent_post() cannnot be NULL. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2158	2014-08-20 10:45:16 -07:00
Matthew Ahrens	bd089c5477	Illumos 4631 - zvol_get_stats triggering too many reads 4631 zvol_get_stats triggering too many reads Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4631 https://github.com/illumos/illumos-gate/commit/bbfa8ea Ported-by: Boris Protopopov <bprotopopov@hotmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2612 Closes #2480	2014-08-20 09:17:00 -07:00
Tim Chase	8b0a0840b4	Don't upgrade a metaslab when the pool is not writable Illumos 4982 added code to metaslab_fragmentation() to proactively update space maps when the spacemap_histogram feature is enabled. This should only happen when the pool is writeable. References: https://www.illumos.org/issues/4982 https://github.com/illumos/illumos-gate/commit/2e4c998 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2595	2014-08-18 08:47:19 -07:00
George Wilson	f3a7f6610f	Illumos 4976-4984 - metaslab improvements 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4983 need to collect metaslab information via mdb 4984 device selection should use fragmentation metric Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4976 https://www.illumos.org/issues/4978 https://www.illumos.org/issues/4979 https://www.illumos.org/issues/4980 https://www.illumos.org/issues/4981 https://www.illumos.org/issues/4982 https://www.illumos.org/issues/4983 https://www.illumos.org/issues/4984 https://github.com/illumos/illumos-gate/commit/2e4c998 Notes: The "zdb -M" option has been re-tasked to display the new metaslab fragmentation metric and the new "zdb -I" option is used to control the maximum number of in-flight I/Os. The new fragmentation metric is derived from the space map histogram which has been rolled up to the vdev and pool level and is presented to the user via "zpool list". Add a number of module parameters related to the new metaslab weighting logic. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2595	2014-08-18 08:40:49 -07:00
Turbo Fredriksson	f67d709080	Create an 'overlay' property Add a new 'overlay' property (default 'off') that controls whether the filesystem should be mounted even if the mountpoint is busy or if it should fail with a 'mountpoint not empty'. Doing overlay mounts is the default mount behavior on Linux, but not in ZFS. It have been decided that following the ZFS behavior should be the default, but this overlay allows for site administrator to override this decision on a per-dataset basis. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #2503	2014-08-15 13:39:19 -07:00
Brian Behlendorf	0d5c500d6c	Revert "Revert "Revert "Fix unlink/xattr deadlock""" This reverts commit `7973e46` which brings the basic flow of the code back in line with the other ZFS implementations. This was possible due to the following related changes. `e89260a` Directory xattr znodes hold a reference on their parent `6f9548c` Fix deadlock in zfs_zget() `0a50679` Add zfs_iput_async() interface `4dd1893` Avoid 128K kmem allocations in mzap_upgrade() Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #457 Closes #2058 Closes #2128 Closes #2240	2014-08-11 16:12:36 -07:00
Brian Behlendorf	0a50679ce9	Add zfs_iput_async() interface Handle all iputs in zfs_purgedir() and zfs_inode_destroy() asynchronously to prevent deadlocks. When the iputs are allowed to run synchronously in the destroy call path deadlocks between xattr directory inodes and their parent file inodes are possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #457	2014-08-11 16:11:43 -07:00
Brian Behlendorf	4dd18932ba	Avoid 128K kmem allocations in mzap_upgrade() As originally implemented the mzap_upgrade() function will perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc(). These large allocations can potentially block indefinitely if contiguous memory is not available. Since this allocation is done under the zap->zap_rwlock it can appear as if there is a deadlock in zap_lockdir(). This is shown below. The optimal fix for this would be to rework mzap_upgrade() such that no large allocations are required. This could be done but it would result in us diverging further from the other implementations. Therefore I've opted against doing this unless it becomes absolutely necessary. Instead mzap_upgrade() has been updated to use zio_buf_alloc() which can reliably provide buffers of up to SPA_MAXBLOCKSIZE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Close #2580	2014-08-11 16:10:32 -07:00
Brian Behlendorf	50b25b2187	Avoid dynamic allocation of 'search zio' As part of commit `e8b96c6` the search zio used by the vdev_queue_io_to_issue() function was moved to the heap to minimize stack usage. Functionally this is fine, but to maximize performance it's best to minimize the number of dynamic allocations. To avoid this allocation temporary space for the search zio has been reserved in the vdev_queue structure. All access must be serialized through the vq_lock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2572	2014-08-11 08:44:54 -07:00
Brian Behlendorf	ab6f407faa	Use KM_PUSHPAGE in dsl_dataset_rollback_check() The dsl_dataset_rollback_check() function is executed in the txg_sync context. To prevent a potential deadlock due to direct memory reclaim it must use KM_PUSHPAGE. This was introduced by the recent 'zfs bookmark' features, commit `da53684`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Eric Dillmann <eric@jave.fr> Closes #2569	2014-08-06 16:09:28 -07:00
Matthew Ahrens	5dbd68a352	Illumos 4914 - zfs on-disk bookmark structure should be named _phys_t 4914 zfs on-disk bookmark structure should be named _phys_t Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4914 https://github.com/illumos/illumos-gate/commit/7802d7b Porting notes: There were a number of zfsonlinux-specific uses of zbookmark_t which needed to be updated. This should reduce the likelihood of further problems like issue #2094 from occurring. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2558	2014-08-06 14:48:41 -07:00
Matthew Ahrens	1fa8f795d5	Illumos 4881 - zfs send performance regression with embedded data 4881 zfs send performance degradation when embedded block pointers are encountered Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4881 https://github.com/illumos/illumos-gate/commit/06315b7 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2547	2014-08-06 14:44:10 -07:00
Saso Kiselkov	3bec585e6c	Illumos 4897 - Space accounting mismatch in L2ARC/zpool 4897 Space accounting mismatch in L2ARC/zpool Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> From the illumos issue tracker: L2ARC vdev space usage statistics are calculated as the delta between the maximum and minimum vdev offset ever written to by the L2ARC fill thread, but do not inform the user of how much space in between these two offsets is actually taken up by cached buffers. This fix changes that so that vdev space usage stats on L2ARC devices accurately track the volume of buffers stored on them, allowing users to see the exact L2ARC usage in "zpool iostat -v". References: https://www.illumos.org/issues/4897 https://github.com/illumos/illumos-gate/commit/3038a2b Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2555	2014-08-06 13:44:10 -07:00
Matthew Ahrens	fbeddd60b7	Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol 4390 i/o errors when deleting filesystem/zvol can lead to space map corruption Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4390 https://github.com/illumos/illumos-gate/commit/7fd05ac Porting notes: Previous stack-reduction efforts in traverse_visitb() caused a fair number of un-mergable pieces of code. This patch should reduce its stack footprint a bit more. The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated using kmem_zalloc() for the purpose of stack reduction. The new global zfs_free_leak_on_eio has been defined as an integer rather than a boolean_t as was the case with the related zfs_recover global. Also, zfs_free_leak_on_eio's definition has been inserted into zfs_debug.c for consistency with the existing definition of zfs_recover. Illumos placed it in spa_misc.c. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2545	2014-08-04 11:50:52 -07:00
Matthew Ahrens	9b67f60560	Illumos 4757, 4913 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4757 https://www.illumos.org/issues/4913 https://github.com/illumos/illumos-gate/commit/5d7b4d4 Porting notes: For compatibility with the fastpath code the zio_done() function needed to be updated. Because embedded-data block pointers do not require DVAs to be allocated the associated vdevs will not be marked and therefore should not be unmarked. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2544	2014-08-01 14:28:05 -07:00
Matthew Ahrens	faf0f58c69	Illumos 3835 zfs need not store 2 copies of all metadata Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Description from Matt Ahrens's bug report at Delphix: Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. Additional notes: The new man page section for this property states "The exact behavior of which metadata blocks are stored redundantly may change in future releases." and: "When set to most, ZFS stores an extra copy of most types of metadata. This can improve performance of random writes, because less metadata must be written." The current implementation is as described above in Matt's blog. It is controlled by a new global integer "zfs_redundant_metadata_most_ditto_level", currently initialized to 2. When "redundant_metadata" is set to "most", only indirect blocks of the specified level and higher will have additional ditto blocks created. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2542	2014-07-31 09:49:34 -07:00
George Wilson	672692c7b7	Illumos 4754, 4755 4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4754 https://www.illumos.org/issues/4755 https://github.com/illumos/illumos-gate/commit/b6240e8 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2533	2014-07-30 10:30:05 -07:00
Matthew Ahrens	9bd274ddd8	Illumos #4374 4374 dn_free_ranges should use range_tree_t Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4374 https://github.com/illumos/illumos-gate/commit/bf16b11 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2531	2014-07-30 09:20:35 -07:00
Matthew Ahrens	da536844d5	Illumos 4368, 4369. 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4369 https://www.illumos.org/issues/4368 https://github.com/illumos/illumos-gate/commit/78f1710 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2530	2014-07-29 10:55:29 -07:00
Max Grossman	b0bc7a84d9	Illumos 4370, 4371 4370 avoid transmitting holes during zfs send 4371 DMU code clean up Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Garrett D'Amore <garrett@damore.org>a References: https://www.illumos.org/issues/4370 https://www.illumos.org/issues/4371 https://github.com/illumos/illumos-gate/commit/43466aa Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2529	2014-07-28 14:29:58 -07:00
Matthew Ahrens	fa86b5dbb6	Illumos 4171, 4172 4171 clean up spa_feature_*() interfaces 4172 implement extensible_dataset feature for use by other zpool features Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org>a References: https://www.illumos.org/issues/4171 https://www.illumos.org/issues/4172 https://github.com/illumos/illumos-gate/commit/2acef22 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2528	2014-07-25 16:40:07 -07:00
Brian Behlendorf	7a8f0e80ea	zfs_trunc() should use dmu_tx_assign(tx, TXG_WAIT) As part of the write throttle & i/o schedule performance work the zfs_trunc() function should have been updated to use TXG_WAIT. Using TXG_WAIT ensures that the tx will be part of the next txg. If TXG_NOWAIT is used and retried for ERESTART errors then the tx can suffer from starvation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2488	2014-07-22 09:41:38 -07:00
George Wilson	080b310015	Illumos #4756 Fix metaslab_group_preload deadlock 4756 metaslab_group_preload() could deadlock Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> The metaslab_group_preload() function grabs the mg_lock and then later tries to grab the metaslab lock. This lock ordering may lead to a deadlock since other consumers of the mg_lock will grab the metaslab lock first. References: https://www.illumos.org/issues/4756 https://github.com/illumos/illumos-gate/commit/30beaff Ported-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:41:32 -07:00
George Wilson	3c51c5cb1f	Illumos #4730 destroy metaslab group taskq 4730 metaslab group taskq should be destroyed in metaslab_group_destroy() Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4730 https://github.com/illumos/illumos-gate/commit/be08211 Porting notes: Under ZFSonlinux, one of the effects of not destroying the taskq is that zdb would never exit (due to the SPL taskq implementation). Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:41:06 -07:00
George Wilson	93cf20764a	Illumos #4101 , #4102 , #4103 , #4105 , #4106 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Sebastien Roy <seb@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation; then unloading that space_map and repeating the process. This change modifies the space_map's to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer size sizes enough to fit the new metadata. The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram In addition to the above, this patch allows the space_map block size to be increase. Currently the block size is set to be 4K in size, which has certain implications including the following: * 4K sector devices will not see any compression benefit * large space_maps require more metadata on-disk * large space_maps require more time to load (typically random reads) Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable. A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not deal with removing top-level vdevs that are interior vdevs (i.e. mirror) correctly. The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked; because top-level vdevs don't have DTL space map objects associated with them. References: https://www.illumos.org/issues/4101 https://www.illumos.org/issues/4102 https://www.illumos.org/issues/4103 https://www.illumos.org/issues/4105 https://www.illumos.org/issues/4106 https://github.com/illumos/illumos-gate/commit/0713e23 Porting notes: A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:39:16 -07:00
Prakash Surya	1be627f5c2	Move metaslab_group_alloc_update() call This changes moves the called to metaslab_group_alloc_update() to the metaslab_sync_reassess() function. The original placement of the call within metaslab_sync_done() appears to have been a simple mistake, introduced by `ac72fac3ea`. This aligns us more closely to the upstream illumos code base. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-07-22 09:38:16 -07:00
Brian Behlendorf	1e8db77102	Fix zil_commit() NULL dereference Update the current code to ensure inodes are never dirtied if they are part of a read-only file system or snapshot. If they do somehow get dirtied an attempt will make made to write them to disk. In the case of snapshots, which don't have a ZIL, this will result in a NULL dereference in zil_commit(). Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2405	2014-07-17 15:15:07 -07:00
George Wilson	2fbc542ebd	Illumos 4168, 4169, 4170: ztest, zdb and zhack fixes 4168 ztest assertion failure in dbuf_undirty 4169 verbatim import causes zdb to segfault 4170 zhack leaves pool in ACTIVE state Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4168 https://www.illumos.org/issues/4169 https://www.illumos.org/issues/4170 https://github.com/illumos/illumos-gate/commit/7fdd916 Porting notes: Of particular interest when troubleshooting corrupted pools, the commonly-used "zdb -e" operation may perform verbatim imports and furthermore, it will soon have direct support for verbatim imports via a new "-V" option. The 4169 fix eliminates a common segfault case in which spa_history_log_version() tries to access an un-opened dsl_pool_t. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2451 Closes #2283 Closes #2467	2014-07-17 11:37:57 -07:00
Tim Chase	f4a4046bd6	Convert zfs_mg_noalloc_threshold to a module parameter and document The parameter was added as illumos issue 4081 which was committed to zfsonlinux in `ac72fac3ea`. This patch documents the parameter and allows for it to be set as a module parameter. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2483	2014-07-16 16:49:25 -07:00
Andrew Barnes	61e99a73bc	Preserve asize when last mirror child promoted to top-level vdev If the smaller of 2 different sized child vdev's of a mirrored vdev is detached, and the pool has the autoexpand property set to off, as the remaining larger vdev is promoted to a top level vdev it fails to retain the asize of the original top level mirror vdev and therefore partially autoexpands. This partially autoexpanded state leaves the new vdev too large to re-mirror by adding the smaller vdev back in, and the pool fails to utilize the space until next imported. If the autoexpand property is set to on, the child vdev grows in size after it has been promoted to a top level vdev as expected. This commit causes the remaining child mirror to retain the asize of its old parent mirror vdev if the autoexpand property is set to off, this allows the smaller vdev to be re-added if required the vdev can then be told to expand if required by the usual using zpool online -e. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Barnes <barnes333@gmail.com> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #1208	2014-07-02 14:04:29 -07:00
Dan McDonald	ee4712284c	Illumos #4936 fix potential overflow in lz4 4936 lz4 could theoretically overflow a pointer with a certain input Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Ported by: Tim Chase <tim@chase2k.com> References: https://illumos.org/issues/4936 https://github.com/illumos/illumos-gate/commit/58d0718 Porting notes: This fixes the widely-reported "20-year-old vulnerability" in LZO/LZ4 implementations which inherited said bug from the reference implementation. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2429	2014-07-01 14:10:47 -07:00
Tim Chase	4240dc332d	Comment the lack of real_LZ4_uncompress() Added several comments regarding the removal of real_LZ4_uncompress() which exists in the reference implementation but has been removed here since it's not used. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-07-01 14:09:56 -07:00
Brian Behlendorf	7f6884f419	Revert "Fix __zio_execute() asynchronous dispatch" This reverts commit `91579709fc` which limited the asynchronous dispatch to kernel space. We want to do this for two reasons: 1) While we have slightly more headroom in user space excessively deep stacks have been observed while running ztest, see #2293. 2) Removing this conditional makes the pipeline behave consistently regardless of if it's executing in kernel space or user space. This way we're more likely to uncover subtle issues with ztest. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2384	2014-06-11 16:32:57 -07:00
Brian Behlendorf	6795a698f4	Use default slab types We should not override the default memory type of the kmem cache. This was done previously to force certain objects which were slightly over object size limit cut off in to KMC_KMEM caches for better performance. The zfsonlinux/spl#356 patch slightly increases the default cut off from 511 bytes 1024 bytes for x86_64. This means there is long longer a need to override the default for the caches. And since the default values are now being used the new spl_kmem_cache_slab_limit and spl_kmem_cache_kmem_limit tunables will apply to all kmem caches. The following is a list of caches that will be impacted: \| object size \| forced type \| default type ----------------- \| ------------- \| ------------- \| -------------- dnode_t \| 936 bytes \| KMC_KMEM \| KMC_KMEM zio_cache \| 1104 bytes \| KMC_KMEM \| KMC_VMEM zio_link_cache \| 48 bytes \| KMC_KMEM \| KMC_KMEM zio_vdev_cache \| 131088 bytes \| KMC_VMEM \| KMC_VMEM zio_buf_512 \| 512 bytes \| KMC_KMEM \| KMC_KMEM zio_data_buf_512 \| 512 bytes \| KMC_KMEM \| KMC_KMEM zio_buf_1024 \| 1024 bytes \| KMC_KMEM \| KMC_KMEM zio_data_buf_1024 \| 1024 bytes \| +KMC_VMEM \| +KMC_KMEM * Cache memory type will change from KMC_KMEM to KMC_VMEM. + Cache memory type will change from KMC_VMEM to KMC_KMEM. This patch removes another slight point of divergence between ZoL and Illumos. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes #2337	2014-05-22 10:39:52 -07:00
HC	f9a1ac4d59	Honor zfs_nocacheflush for file vdevs For consistency with disk vdevs honor the zfs_nocacheflush tunable. This setting is available primarily for debugging and performance analysis. Signed-off-by: HC <mmttdebbcc@yahoo.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2336	2014-05-19 13:30:48 -07:00
Tim Chase	83021b47c2	Calculate header size correctly in sa_find_sizes() In the case where a variable-sized SA overlaps the spill block pointer and a new variable-sized SA is being added, the header size was improperly calculated to include the to-be-moved SA. This problem could be reproduced when xattr=sa enabled as follows: ln -s $(perl -e 'print "x" x 120') blah setfattr -n security.selinux -v blahblah -h blah The symlink is large enough to interfere with the spill block pointer and has a typical SA registration as follows (shown in modified "zdb -dddd" <SA attr layout obj> format): [ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ] Adding the SA xattr will attempt to extend the registration to: [ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ] but since the ZPL_SYMLINK SA interferes with the spill block pointer, it must also be moved to the spill block which will have a registration of: [ ZPL_SYMLINK ZPL_DXATTR ] This commit updates extra_hdrsize when this condition occurs, allowing hdrsize to be subsequently decreased appropriately. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #2214 Issue #2228 Issue #2316 Issue #2343	2014-05-19 11:55:50 -07:00
Tim Chase	3937ab20f3	Allow for lock-free reading zfsdev_state_list. Restructure the zfsdev_state_list to allow for lock-free reading by converting to a simple singly-linked list from which items are never deleted and over which only forward iterations are performed. It depends on, among other things, the atomicity of accessing the zs_minor integer and zs_next pointer. This fixes a lock inversion in which the zfsdev_state_lock is used by both the sync task (txg_sync) and indirectly by any user program which uses /dev/zfs; the zfsdev_release method uses the same lock and then blocks on the sync task. The most typical failure scenerio occurs when the sync task is cleaning up a user hold while various concurrent "zfs" commands are in progress. Neither Illumos nor Solaris are affected by this issue because they use DDI interface which provides lock-free reading of device state via the ddi_get_soft_state() function. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2301	2014-05-19 11:45:11 -07:00
Chunwei Chen	bc25c9325b	Use a dedicated taskq for vdev_file Originally, vdev_file used system_taskq. This would cause a deadlock, especially on system with few CPUs. The reason is that the prefetcher threads, which are on system_taskq, will sometimes be blocked waiting for I/O to finish. If the prefetcher threads consume all the tasks in system_taskq, the I/O cannot be served and thus results in a deadlock. We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2270	2014-05-14 16:20:21 -07:00
Brian Behlendorf	2c33b91275	Handle vdev_lookup_top() failure in dva_get_dsize_sync() The dva_get_dsize_sync() function incorrectly assumes that the call to vdev_lookup_top() cannot fail. However, the NULL dereference at clearly shows that under certain circumstances it is possible. Note that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio. BUG: unable to handle kernel NULL pointer dereference at 00000570 crash> struct -o vdev struct vdev { [0] uint64_t vdev_id; ... ... [1376] uint64_t vdev_deflate_ratio; Given that this can happen this patch add the required error handling. In the case where vdev_lookup_top() fails assume that no deflation will occur for the DVA and use the asize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com> Closes #1707 Closes #1987 Closes #1891 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-05-06 10:41:48 -07:00
Tim Chase	962d524212	Check the dataset type more rigorously when fetching properties. When fetching property values of snapshots, a check against the head dataset type must be performed. Previously, this additional check was performed only when fetching "version", "normalize", "utf8only" or "case". This caused the ZPL properties "acltype", "exec", "devices", "nbmand", "setuid" and "xattr" to be erroneously displayed with meaningless values for snapshots of volumes. It also did not allow for the display of "volsize" of a snapshot of a volume. This patch adds the headcheck flag paramater to zfs_prop_valid_for_type() and zprop_valid_for_type() to indicate the check is being done against a head dataset's type in order that properties valid only for snapshots are handled correctly. This allows the the head check in get_numeric_property() to be performed when fetching a property for a snapshot. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2265	2014-05-06 10:41:46 -07:00
Brian Behlendorf	1ce0457348	Fix style A minor style issue was accidentally introduced by `aa7d06a`. This change resolves that style problem. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-05-06 10:41:17 -07:00
George Wilson	aa7d06a98a	Illumos #4101 finer-grained control of metaslab_debug Today the metaslab_debug logic performs two tasks: - load all metaslabs on import/open - don't unload metaslabs at the end of spa_sync This change provides knobs for each of these independently. References: https://illumos.org/issues/4101 https://github.com/illumos/illumos-gate/commit/0713e23 Notes: 1) This is a small piece of the metaslab improvement patch from Illumos. It was worth bringing over before the rest, since it's low risk and it can be useful on fragmented pools (e.g. Lustre MDTs). metaslab_debug_unload would give the performance benefit of the old metaslab_debug option without causing unwanted delay during pool import. Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2227	2014-05-06 09:46:04 -07:00
Brian Behlendorf	cc79a5c263	Treat spill block dbufs as meta data When the system attributes (SAs) for an object exceed what can can be stored in the bonus area of a dnode a spill block is allocated. These spill blocks are currently considered data blocks. However, they should be accounted for as meta data because they are effectively an extension of the dnode. While this may seem like a minor accounting issue it has broader implications. The key thing to be aware of is that each spill block will hold a reference on its parent dnode. The dnode in turn holds a reference on its dbuf in the dnode object. This means that a single 512 byte data buffer for a spill block can pin over 16k of meta data. This is analogous to the small file situation described in `2b13331` where a relatively small number of data buffer can cause the ARC to exceed the meta limit. However, unlike the small file case a spill block can legitimately be considered meta data. By changing the spill block to meta data they will now be dropped from the cache when the meta limit is reached. This then allows the dnodes and dbufs which the spill block was pinning to be released. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes #2294	2014-05-05 13:56:59 -07:00
Brian Behlendorf	12f9a6a3f9	dmu_tx_assign() should not return ENOMEM As described in the comment above dmu_tx_assign() this function must only fail if the pool is out of space. If for some other reason the TX cannot be assigned (such as memory pressure) ERESTART must be returned. Alternately, EAGAIN could be returned to inject a delay but that isn't required because the caller will block on the condition variable waiting for the next TXG. /* * Assign tx to a transaction group. txg_how can be one of: * * (1) TXG_WAIT. If the current open txg is full, waits until there's * a new one. This should be used when you're not holding locks. * It will only fail if we're truly out of space (or over quota). * ... */ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2287	2014-05-01 12:08:53 -07:00
Richard Yao	9d317793aa	Implement File Attribute Support We add support for lsattr and chattr to resolve a regression caused by `88c283952f` that broke Python's xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr, which depended on Python's xattr.list(). Only attributes common to both Solaris and Linux are supported. These are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File attributes exclusive to Solaris are present in the ZFS code, but cannot be accessed or modified through this method. That was the case prior to this patch. The resolution of issue zfsonlinux/zfs#229 should implement some method to permit access and modification of Solaris-specific attributes. References: https://bugs.gentoo.org/show_bug.cgi?id=483516 Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1691	2014-05-01 10:11:18 -07:00
Chunwei Chen	17584980b9	Add assertion to catch 0-count page Some network related block device uses tcp_sendpage, which doesn't behave well when using 0-count page. Add assertion to catch them. This has a runtime dependency on: zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2277	2014-04-25 15:41:19 -07:00
Ned Bass	de39ec11b8	Fix LZ4 endianness autodetection Endianness detection in LZ4 is broken in user-space builds. This bug corrupts compressed data and manifests itself in several ztest failures. When LZ4 was originally ported to Illumos ZFS, the proper checks for Linux were stripped out. The Linux port then inherited the remaining detection code that works on Illumos but not on Linux. The current LZ4 endianness check misuses the condition defined(__BIG_ENDIAN) to indicate a big-endian system. On Linux __BIG_ENDIAN is defined uncondtionally in the user-space header /usr/include/endian.h, regardless of the endianness of the system. The kernel does not use this header, so only user-space builds are affected. While we could fix this by restoring the upstream LZ4 endianness detection code, reliable checks already exist in libspl/include/sys/isa_defs.h. This change uses the libspl results to replace the word-size and endianness checks in LZ4, simplifying the code and reducing duplication. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #1963 Fixes #1964 Fixes #1965	2014-04-20 16:55:42 -07:00
Brian Behlendorf	4fd762f8ad	Fix zfsdev_ioctl() kmem leak warning Due to an asymmetry in the kmem accounting a memory leak was being reported when it was only an accounting issue. All memory allocated with kmem_alloc() must be released with kmem_free() or it will not be properly accounted for. In this case the code used strfree() to release the memory allocated by kmem_alloc(). Presumably this was done because the size of the memory region wasn't available when the memory needed to be freed. To resolve this issue the code has been updated to use strdup() instead of kmem_alloc() to allocate the memory. Like strfree(), strdup() is not integrated with the memory accounting. This means we can use strfree() to release it like Illumos. SPL: kmem leaked 10/4368729 bytes address size data func:line ffff880067e9aa40 10 ZZZZZZZZZZ zfsdev_ioctl:5655 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #2262	2014-04-18 13:30:15 -07:00
DHE	2dbedf5484	Uninitialized variable spa_autoreplace used Caught by ztest and valgrind. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2259	2014-04-16 10:59:24 -07:00
Chunwei Chen	0b75bdb369	Use ddi_time_after and friends to compare time Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion from screwing things. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2142	2014-04-14 13:27:56 -07:00
Chunwei Chen	b761912b34	Linux 3.14 compat: rq_for_each_segment in dmu_req_copy rq_for_each_segment changed from taking bio_vec * to taking bio_vec. We provide rq_for_each_segment4 which takes both. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:51 -07:00
Chunwei Chen	22760eebef	Revert "Fix zvol+btrfs hang" After the dmu_req_copy change, bi_io_vecs are not touched, so this is no longer needed. This reverts commit `e26ade5101`. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:47 -07:00
Chunwei Chen	215b4634c7	Refactor dmu_req_copy for immutable biovec changes Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it can continue in subsequent passes. However, after the immutable biovec changes in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how many bytes are already copied and it will skip to the right spot accordingly. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:43 -07:00
Chunwei Chen	d4541210f3	Linux 3.14 compat: Immutable biovec changes in vdev_disk.c bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter. This patch creates BIO_BI_*(bio) macros to hide the differences. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:38 -07:00
Chunwei Chen	408ec0d2e1	Linux 3.14 compat: posix_acl_{create,chmod} posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod} Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:27:03 -07:00
Richard Yao	f3ad9cd67a	Fix locking order in zfs_zget() Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-04-04 09:12:47 -07:00
Richard Yao	6f9548c487	Fix deadlock in zfs_zget() zfsonlinux/zfs#180 occurred because of a race between inode eviction and zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call to the VFS to learn whether an inode is being evicted. If it was being evicted the operation was retried after dropping and reacquiring the relevant resources. Unfortunately, this introduced another deadlock. INFO: task kworker/u24:6:891 blocked for more than 120 seconds. Tainted: P O 3.13.6 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000 Workqueue: writeback bdi_writeback_workfn (flush-zfs-5) ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80 ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098 ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000 Call Trace: [<ffffffff813dd1d4>] schedule+0x24/0x70 [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0 [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0 [<ffffffff811608c5>] ilookup+0x65/0xd0 [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs] [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs] [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs] [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs] [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs] [<ffffffff811071ec>] do_writepages+0x1c/0x40 [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0 [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420 [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320 [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490 [<ffffffff8106072c>] process_one_work+0x16c/0x490 [<ffffffff810613f3>] worker_thread+0x113/0x390 [<ffffffff81066edf>] kthread+0xdf/0x100 This patch implements the original fix in a slightly different manner in order to avoid both deadlocks. Instead of relying on a call to ilookup() which can block in __wait_on_freeing_inode() the return value from igrab() is used. This gives us the information that ilookup() provided without the risk of a deadlock. Alternately, this race could be closed by registering an sops->drop_inode() callback. The callback would need to detect the active SA hold thereby informing the VFS that this inode should not be evicted. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #180	2014-04-04 09:11:54 -07:00
Brian Behlendorf	8ac67298b1	Revert "Fixed a use-after-free bug in zfs_zget()." This reverts commit `36df284366`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-04-03 16:23:28 -07:00
Brian Behlendorf	904ea2763e	Add automatic hot spare functionality When a vdev starts getting I/O or checksum errors it is now possible to automatically rebuild to a hot spare device. To cleanly support this functionality in a shell script some additional information was added to all zevent ereports which include a vdev. This covers both io and checksum zevents but may be used but other scripts. In the Illumos FMA solution the same information is required but it is retrieved through the libzfs library interface. Specifically the following members were added: vdev_spare_paths - List of vdev paths for all hot spares. vdev_spare_guids - List of vdev guids for all hot spares. vdev_read_errors - Read errors for the problematic vdev vdev_write_errors - Write errors for the problematic vdev vdev_cksum_errors - Checksum errors for the problematic vdev. By default the required hot spare scripts are installed but this functionality is disabled. To enable hot sparing uncomment the ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the /etc/zfs/zed.d/zed.rc configuration file. These scripts do no add support for the autoexpand property. At a minimum this requires adding a new udev rule to detect when a new device is added to the system. It also requires that the autoexpand policy be ported from Illumos, see: https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c Support for detecting the correct name of a vdev when it's not a whole disk was added by Turbo Fredriksson. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Issue #2	2014-04-02 13:10:08 -07:00
Brian Behlendorf	9b101a7320	Clarify zpool_events_next() comment Due to the very poorly chosen argument name 'cleanup_fd' it was completely unclear that this file descriptor is used to track the current cursor location. When the file descriptor is created by opening ZFS_DEV a private cursor is created in the kernel for the returned file descriptor. Subsequent calls to zpool_events_next() and zpool_events_seek() then require the file descriptor as an argument to reposition the cursor. When the file descriptor is closed the kernel state tracking the cursor is destroyed. This patch contains no functional change, it just changes a few variable names and clarifies the documentation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Issue #2	2014-03-31 16:11:08 -07:00
Brian Behlendorf	75e3ff58fe	Add zpool_events_seek() functionality The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers to seek around the zevent file descriptor by EID. When a specific EID is passed and it exists the cursor will be positioned there. If the EID is no longer cached by the kernel ENOENT is returned. The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek to those respective locations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Issue #2	2014-03-31 16:10:57 -07:00
Brian Behlendorf	a2f1945ee3	Add a unique "eid" value to all zevents Tagging each zevent with a unique monotonically increasing EID (Event IDentifier) provides the required infrastructure for a user space daemon to reliably process zevents. By writing the EID to persistent storage the daemon can safely resume where it left off in the event stream when it's restarted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Issue #2	2014-03-31 16:10:41 -07:00
Boris Protopopov	0ed212dc0e	Illumos #4089 NULL pointer dereference in arc_read() 4089 NULL pointer dereference in arc_read() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4089 illumos/illumos-gate@57815f6b95 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2171 Issue #2165 Closes #2198	2014-03-24 11:06:57 -07:00
Richard Yao	26b42f3f9d	Implement -t option to zpool import for temporary pool names Originally, users had to handle spa namespace collisions by either exporting the already imported pool or by specifying a new name for the pool with a conflicting name. In the case of root pools from virtual guests, neither approach to collision resolution is reasonable. This is addressed by extending the new name syntax with a -t option to specify that the new name is temporary. When specified, this sets an internal flag that is passed into the kernel to tell it that all label updates should refer to the name used in the original label. Consequently, the original pool name will be retained on export. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2189	2014-03-20 12:05:30 -07:00
Andrey Vesnovaty	00fcdee1f8	Fix regression introduced in port of Illumos #3744 Remove the redundant call to zfs_unmount_snap() which was being done after char array was freed, This fixes an upstream regression that was introduced in commit zfsonlinux/zfs@d09f25dc66, which ported the Illumos 3744 changes. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2156	2014-03-20 11:00:48 -07:00
Boris Protopopov	47fe91b54c	Illumos #4088 use after free in arc_release() 4088 use after free in arc_release() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4088 illumos/illumos-gate@ccc22e1304 From the illumos issue: A race-induced use after free occurs in arc_release() where the ARC header is used outside the critical section protected by the hash_lock. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2162	2014-03-10 09:11:15 -07:00
Tim Chase	a45fc6a677	Use KM_PUSHPAGE in spa_add() for spa_label_features. The spa_label_features nvlist is used in the sync context during pool version upgrade. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2168	2014-03-10 09:09:30 -07:00
Brian Behlendorf	e74400155f	Export symbols dsl_sync_task{_nowait} These are needed by consumers (i.e. Lustre) who wish to perform a callback in the syncing context. Both a blocking and non-blocking version are available to callers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-03-07 10:01:36 -08:00
Ned Bass	a77c4c8332	Improve reporting of tx assignment wait times Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call dmu_tx_wait() themselves before retrying if the assignment fails. The wait times for such callers are not accounted for in the dmu_tx_assign kstat histogram, because the histogram only records time spent in dmu_tx_assign(). This change moves the histogram update to dmu_tx_wait() to properly account for all time spent there. One downside of this approach is that it is possible to call dmu_tx_wait() multiple times before successfully assigning a transaction, in which case the cumulative wait time would not be recorded. However, this case should not often arise in practice, because most callers currently use one of these forms: dmu_tx_assign(tx, TXG_WAIT); dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); The first form should make just one call to dmu_tx_delay() inside of dmu_tx_assign(). The second form retries with TXG_WAITED if the first assignment fails and incurs a delay, in which case no further waiting is performed. Therefore transaction delays normally occur in one call to dmu_tx_wait() so the histogram should be fairly accurate. Another possible downside of this approach is that the histogram will no longer record overhead outside of dmu_tx_wait() such as in dmu_tx_try_assign(). While I'm not aware of any reason for concern on this point, it is conceivable that lock contention, long list traversal, etc. could cause assignment delays that would not be reflected in the histogram. Therefore the histogram should strictly be used for visibility in to the normal delay mechanisms and not as a profiling tool for code performance. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1915	2014-03-04 12:22:24 -08:00
Ned Bass	3ccab25205	replace nreserved with ndirty in txgs kstat The nreserved column in the txgs kstat file always contains 0 following the write throttle restructuring of commit `e8b96c6007`. Prior to that commit, the nreserved column showed the number of bytes temporarily reserved in the pool by a transaction group at sync time. The new write throttle did away with temporary reservations and uses the amount of dirty data instead. To approximate the old output of the txgs kstat, the number of dirty bytes per-txg was passed in as the nreserved value to spa_txg_history_set_io(). This approach did not work as intended, because the per-txg dirty value is decremented as data is written out to disk, so it is zero by the time we call spa_txg_history_set_io(). To fix this, save the number of dirty bytes before calling spa_sync(), and pass this value in to spa_txg_history_set_io(). Also, since the name "nreserved" is now a misnomer, the column heading is now labeled "ndirty". Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1696	2014-03-04 12:22:24 -08:00
Ned Bass	3d920a1567	dmu_tx kstat cleanup A few counters in the dmu_tx kstats are obsolete or no longer bumped properly. - The sync task restructuring commit `13fe019870` removed the code that bumpted dmu_tx_quota. The counter is now bumped in two cases, instead of just the one case as before (after the result of dsl_dataset_check_quota call). The second case is where we check the requested reservation against the actual pool size, as this is an implicit quota of sorts. - The write throttle restructuring commit `e8b96c6007` makes dmu_tx_how and dmu_tx_inflight obsolete, so they are removed. Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1914	2014-03-04 12:22:24 -08:00
Richard Yao	cecb7487fc	Invalidate Linux buffer cache on vdevs upon each flush Userland tools such as blkid, grub2-probe and zdb will go through the buffer cache. However, ZFS uses on submit_bio() to bypass the buffer cache when performing IO operations on vdevs for efficiency purposes. This permits the on-disk state and buffer cache to fall out of synchronization. That causes seemingly random failures when tools reading stale metadata from the buffer cache try to access references to data that is no longer there. A particularly bad failure this causes involves grub2-probe, which is used by grub2-mkconfig. Ordinarily, a rootfs might be called rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe, grub2-mkconfig will generate a configuration file containing /ROOT/gentoo, which omits the pool name and causes a boot failure. This is avoidable by calling invalidate_bdev() on each flush, which is a simple way to ensure that all non-dirty pages are wiped. Since userland tools rarely access vdevs directly, this should be a fancy noop >99.999% of the time and have little impact on IO. We could have tried a finer grained approach for the rare instances in which the vdevs are accessed frequently by userland. However, that would require consideration of corner cases and it is not worth the effort. Memory-wise, it would have been better to use a Linux kernel API hook to disable the buffer cache on such devices, but it provides us no way of doing that, so we opt for this approach instead. We should revisit that idea in the future when higher priority issues have been tackled. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2150	2014-03-04 12:22:03 -08:00
Alexander Stetsenko	36f92e93e5	Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel 4574 get_clones_stat does not call zap_count in non-debug kernel Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/4574 illumos/illumos-gate@03d1795fa6 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2147	2014-03-04 11:50:13 -08:00
Tim Chase	13a7ba1c2c	Fix zap_lookup() in feature_is_supported(). The length (number of integers) argument passed to zap_lookup was wrong; likely as a result of performing stack-reduction on the function. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2141	2014-03-04 11:44:44 -08:00
Andrew Barnes	1ba1615925	Remove recursion from dsl_dir_willuse_space() Remove recursion from dsl_dir_willuse_space() to reduce stack usage. Issues with stack overflow were observed in zfs recv of zvols, likelihood of an overflow is proportional to the depth of the dataset as dsl_dir_willuse_space() recurses to parent datasets. Signed-off-by: Andrew Barnes <barnes333@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2069	2014-03-04 11:22:27 -08:00
Prakash Surya	2b13331d62	Set "arc_meta_limit" to 3/4 arc_c_max by default Unfortunately, this change is an cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try and describe the problem.. Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode, this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to, "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable). This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space get filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once the "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will reside. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory. If "arc_c" and "arc_meta_limit" are sized appropriately, this situation wont occur. This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata. So, to side step the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math. To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyways). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed: - 16K for the dnode object's block - [ 16384 bytes] - N * sizeof(dnode_t) -------------- [ N * 928 bytes] - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes] - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes] - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes] To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable: Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280) Pinned "arc_meta_used" bytes = 17000 + N * 1544 This pinned space is regardless of the size of the files, and is only dependent on the number of pinned dnodes sharing a physical block (i.e. N). For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block. Now, given a files size of S, we can determine the total amount of space that will be consumed in the arc: Total = 17000 + N * 1544 + S * N ^^^^^^^^^^^^^^^^ ^^^^^ metadata data So, given these formulas, we can generate a table which states the ratio of pinned metadata to total arc (meta + data) using different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file). File Sizes (S) \| 512 \| 1024 \| 2048 \| 4096 \| 8192 \| 16384 \| ---+----------+----------+----------+----------+----------+----------+ 1 \| 0.973132 \| 0.947670 \| 0.900544 \| 0.819081 \| 0.693597 \| 0.530921 \| 2 \| 0.951497 \| 0.907481 \| 0.830632 \| 0.710325 \| 0.550779 \| 0.380051 \| N 4 \| 0.918807 \| 0.849809 \| 0.738842 \| 0.585844 \| 0.414271 \| 0.261250 \| 8 \| 0.877541 \| 0.781803 \| 0.641770 \| 0.472505 \| 0.309333 \| 0.182965 \| 16 \| 0.835819 \| 0.717945 \| 0.559996 \| 0.388885 \| 0.241376 \| 0.137253 \| 32 \| 0.802106 \| 0.669597 \| 0.503304 \| 0.336277 \| 0.202123 \| 0.112423 \| As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set the "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all. This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	cc7f677c16	Split "data_size" into "meta" and "data" Previously, the "data_size" field in the arcstats kstat contained the amount of cached "metadata" and "data" in the ARC. The problem is this then made it difficult to extract out just the "metadata" size, or just the "data" size. To make it easier to distinguish the two values, "data_size" has been modified to count only buffers of type ARC_BUFC_DATA, and "meta_size" was added to count only buffers of type ARC_BUFC_METADATA. If one wants the old "data_size" value, simply sum the new "data_size" and "meta_size" values. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	da8ccd0ee0	Prioritize "metadata" in arc_get_data_buf When the arc is at it's size limit and a new buffer is added, data will be evicted (or recycled) from the arc to make room for this new buffer. As far as I can tell, this is to try and keep the arc from over stepping it's bounds (i.e. keep it below the size limitation placed on it). This makes sense conceptually, but there appears to be a subtle flaw in its current implementation, resulting in metadata buffers being throttled. When it evicts from the arc's lists, it also passes in a "type" so as to remove a buffer of the same type that it is adding. The problem with this is that once the size limit is hit, the ratio of "metadata" to "data" contained in the arc essentially becomes fixed. For example, consider the following scenario: * the size of the arc is capped at 10G * the meta_limit is capped at 4G * 9G of the arc contains "data" * 1G of the arc contains "metadata" Now, every time a new "metadata" buffer is created and added to the arc, an older "metadata" buffer(s) will be removed from the arc; preserving the 9G "data" to 1G "metadata" ratio that was in-place when the size limit was reached. This occurs even though the amount of "metadata" is far below the "metadata" limit. This can result in the arc behaving pathologically for certain workloads. To fix this, the arc_get_data_buf function was modified to evict "data" from the arc even when adding a "metadata" buffer; unless it's at the "metadata" limit. In addition, arc_evict now more closely resembles arc_evict_ghost; such that when evicting "data" from the arc, it may make a second pass over the arc lists and evict "metadata" if it cannot meet the eviction size the first time around. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	77765b540b	Remove "arc_meta_used" from arc_adjust calculation Using "arc_meta_used" to determine if the arc's mru list is over it's target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	94520ca462	Prune metadata from ghost lists in arc_adjust_meta To maintain a strict limit on the metadata contained in the arc, while preventing the arc buffer headers from completely consuming the "arc_meta_used" space, we need to evict metadata buffers from the arc's ghost lists along with the regular lists. This change modifies arc_adjust_meta such that it more closely models the adjustments made in arc_adjust. "arc_meta_used" is used similarly to "arc_size", and "arc_meta_limit" is used similarly to "arc_c". Testing metadata intensive workloads (e.g. creating, copying, and removing millions of small files and/or directories) has shown this change to make a dramatic improvement to the hit rate maintained in the arc. While I think there is still room for improvement, this is a big step in the right direction. In addition, zpl_free_cached_objects was made into a no-op as I'm not yet sure how to properly implement that function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	1e3cb67b53	Revert "Return -1 from arc_shrinker_func()" This reverts commit `c11a12bc3b`. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	624227854e	Disable arc_p adapt dampener by default It's unclear why adjustments to arc_p need to be dampened as they are in arc_adjust. With that said, it's removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm disabling by default until its usefulness is better understood. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	f521ce1b9c	Allow "arc_p" to drop to zero or grow to "arc_c" Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:27 -08:00
Prakash Surya	89c8cac493	Disable aggressive arc_p growth by default For specific workloads consisting mainly of mfu data and new anon data buffers, the aggressive growth of arc_p found in the arc_get_data_buf() function can have detrimental effects on the mfu list size and ghost list hit rate. Running a workload consisting of two processes: * Process 1 is creating many small files * Process 2 is tar'ing a directory consisting of many small files I've seen arc_p and the mru grow to their maximum size, while the mru ghost list receives 100K times fewer hits than the mfu ghost list. Ideally, as the mfu ghost list receives hits, arc_p should be driven down and the size of the mfu should increase. Given the specific workload I was testing with, the mfu list size should grow to a point where almost no mfu ghost list hits would occur. Unfortunately, this does not happen because the newly dirtied anon buffers constancy drive arc_p to its maximum value and keep it there (effectively prioritizing the mru list and starving the mfu list down to a negligible size). The logic to increment arc_p from within the arc_get_data_buf() function was introduced many years ago in this upstream commit: commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc Author: maybee <none@none> Date: Wed Dec 20 15:46:12 2006 -0800 6505658 target MRU size (arc.p) needs to be adjusted more aggressively and since I don't fully understand the motivation for the change, I am reluctant to completely remove it. As a way to test out how it's removal might affect performance, I've disabled that code by default, but left it tunable via a module option. Thus, if its removal is found to be grossly detrimental for certain workloads, it can be re-enabled on the fly, without a code change. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 14:53:28 -08:00
Prakash Surya	39e055c44b	Adjust arc_p based on "bytes" in arc_shrink In an attempt to prevent arc_c from collapsing "too fast", the arc_shrink() function was updated to take a "bytes" parameter by this change: commit `302f753f16` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Tue Mar 13 14:29:16 2012 -0700 Integrate ARC more tightly with Linux Unfortunately, that change failed to make a similar change to the way that arc_p was updated. So, there still exists the possibility for arc_p to collapse to near 0 when the kernel start calling the arc's shrinkers. This change attempts to fix this, by decrementing arc_p by the "bytes" parameter in the same way that arc_c is updated. In addition, a minimum value of arc_p is attempted to be maintained, similar to the way a minimum arc_p value is maintained in arc_adapt(). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 14:53:08 -08:00
Brian Behlendorf	9141582592	Set zfs_arc_min to 4MB Decrease the mimimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB. 1) Large systems with over a 1TB of memory are being deployed and reserving 1/32 of this memory (32GB) as the mimimum requirement is overkill. 2) Tiny systems like the raspberry pi may only have 256MB of memory in which case 64MB is far too large. The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure you may still set a larger value with zfs_arc_min. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Issue #2110	2014-02-21 14:52:02 -08:00
Richard Yao	4f2dcb3eee	Add erratum for issue #2094 ZoL commit `1421c89` unintentionally changed the disk format in a forward- compatible, but not backward compatible way. This was accomplished by adding an entry to zbookmark_t, which is included in a couple of on-disk structures. That lead to the creation of pools with incorrect dsl_scan_phys_t objects that could only be imported by versions of ZoL containing that commit. Such pools cannot be imported by other versions of ZFS or past versions of ZoL. The additional field has been removed by the previous commit. However, affected pools must be imported and scrubbed using a version of ZoL with this commit applied. This will return the pools to a state in which they may be imported by other implementations. The 'zpool import' or 'zpool status' command can be used to determine if a pool is impacted. A message similar to one of the following means your pool must be scrubbed to restore compatibility. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #1 detected. action: The pool can be imported using its name or numeric identifier, however there is a compatibility issue which should be corrected by running 'zpool scrub' see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... $ zpool status pool: zol-0.6.2-173 state: ONLINE scan: pool compatibility issue detected. see: https://github.com/zfsonlinux/zfs/issues/2094 action: To correct the issue run 'zpool scrub'. config: ... If there was an async destroy in progress 'zpool import' will prevent the pool from being imported. Further advice on how to proceed will be provided by the error message as follows. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #2 detected. action: The pool can not be imported with this version of ZFS due to an active asynchronous destroy. Revert to an earlier version and allow the destroy to complete before updating. see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... Pools affected by the damaged dsl_scan_phys_t can be detected prior to an upgrade by running the following command as root: zdb -dddd poolname 1 \| grep -P '^\t\tscan = ' \| sed -e 's;scan = ;;' \| wc -w Note that `poolname` must be replaced with the name of the pool you wish to check. A value of 25 indicates the dsl_scan_phys_t has been damaged. A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0 indicates that there has never been a scrub run on the pool. The regression caused by the change to zbookmark_t never made it into a tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL stable respositorys. Only those using the HEAD version directly from Github after the 0.6.2 but before the 0.6.3 tag are affected. This patch does have one limitation that should be mentioned. It will not detect errata #2 on a pool unless errata #1 is also present. It expected this will not be a significant problem because pools impacted by errata #2 have a high probably of being impacted by errata #1. End users can ensure they do no hit this unlikely case by waiting for all asynchronous destroy operations to complete before updating ZoL. The presence of any background destroys on any imported pools can be checked by running `zpool get freeing` as root. This will display a non-zero value for any pool with an active asynchronous destroy. Lastly, it is expected that no user data has been lost as a result of this erratum. Original-patch-by: Tim Chase <tim@chase2k.com> Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2094	2014-02-21 12:10:40 -08:00
Brian Behlendorf	ffe9d38275	Add generic errata infrastructure From time to time it may be necessary to inform the pool administrator about an errata which impacts their pool. These errata will by shown to the administrator through the 'zpool status' and 'zpool import' output as appropriate. The errata must clearly describe the issue detected, how the pool is impacted, and what action should be taken to resolve the situation. Additional information for each errata will be provided at http://zfsonlinux.org/msg/ZFS-8000-ER. To accomplish the above this patch adds the required infrastructure to allow the kernel modules to notify the utilities that an errata has been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t which has been added to the pool configuration nvlist. To add a new errata the following changes must be made: * A new errata identifier must be assigned by adding a new enum value to the zpool_errata_t type. New enums must be added to the end to preserve the existing ordering. * Code must be added to detect the issue. This does not strictly need to be done at pool import time but doing so will make the errata visible in 'zpool import' as well as 'zpool status'. Once detected the spa->spa_errata member should be set to the new enum. * If possible code should be added to clear the spa->spa_errata member once the errata has been resolved. * The show_import() and status_callback() functions must be updated to include an informational message describing the errata. This should include an action message describing what an administrator should do to address the errata. * The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be updated to describe the errata. This space can be used to provide as much additional information as needed to fully describe the errata. A link to this documentation will be automatically generated in the output of 'zpool import' and 'zpool status'. Original-idea-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.or Issue #2094	2014-02-21 12:10:40 -08:00
Richard Yao	ed9e8368d3	Revert changes to zbookmark_t Commit `1421c89142` added a field to zbookmark_t that unintentinoally caused a disk format change. This negatively affected backward compatibility and platform portability. Therefore, this field is being removed. The function that field permitted is left unimplemented until a later patch that will reimplement the field in a way that does not affect the disk format. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2094	2014-02-21 12:10:39 -08:00
Tim Chase	98fad86293	Propagate errors when registering "relatime" property callback. Various errors can occur when registering property callbacks. As the author's comments indicate, the code is very paranoid about preserving the first-seen error when registering callbacks. This patch causes an error seen while registering the "relatime" callback to not clobber a previously-seen error. Reported-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2117	2014-02-12 09:38:28 -08:00
Brian Behlendorf	c5cb66addc	Fix corrupted l2_asize in arcstats Commit `e0b0ca9` accidentally corrupted the l2_asize displayed in arcstats. This was caused by changing the l2arc_buf_hdr.b_asize member from an int to uint32_t type. There are places in the code where this field is cast to a uint64_t resulting in the b_hits member being treated as part of b_asize. To resolve the issue the type has been changed to a uint64_t, and the b_hits member is placed after the enum to prevent the size of the structure from increasing. This is a good example of exactly why it's a bad idea to use ambiguous types (int) in these structures. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1990	2014-02-05 12:24:53 -08:00
Matthew Ahrens	2e7b7657cd	4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Refences: https://www.illumos.org/issues/4188 illumos/illumos-gate@bb411a08b0 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2091	2014-01-31 10:49:34 -08:00
Matthew Ahrens	8b4646494c	Illumos 4504 traverse_visitbp: visit group before user 4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> References: https://illumos.org/issues/4504 http://code.delphix.com/illumos-4504 http://svnweb.freebsd.org/base?view=revision&revision=260812 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2079	2014-01-29 15:50:49 -08:00
Tim Chase	6d111134c0	Implement relatime. Add the "relatime" property. When set to "on", a file's atime will only be updated if the existing atime at least a day old or if the existing ctime or mtime has been updated since the last access. This behavior is compatible with the Linux "relatime" mount option. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2064 Closes #1917	2014-01-29 15:50:44 -08:00
Cyril Plisko	01b738f457	Call gethrtime() only once per new txg creation When transitioning current open TXG into QUIESCE state and opening a new one txg_quiesce() calls gethrtime(): - to mark the birth time of the new TXG - to record the SPA txg history kstat - implicitely inside spa_txg_history_add() These timestamps are practically the same, so that the first one can be used instead of the other two. The only visible difference is that inside spa_txg_history_add() the time spent in kmem_zalloc() will be counted towards the opened TXG. Since at this point the new TXG already exists (tx->tx_open_txg has been already incremented) it is actually a correct accounting. In any case this extra work is only happening when spa_txg_history kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect the normal processing in any way. Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com> Issue #2075	2014-01-23 13:31:51 -08:00
Igor Lvovsky	478d64fdae	Add additional state TXG_STATE_WAIT_FOR_SYNC for txg. In several cases when digging into kstats we can found two txgs in SYNC state, e.g. txg birth state nreserved nread nwritten ... 985452 258127184872561 C 0 373948416 2376272384 ... 985453 258129016180616 C 0 378173440 28793344 ... 985454 258129016271523 S 0 0 0 ... 985455 258130864245986 S 0 0 0 ... 985456 258130867458851 O 0 0 0 ... However only first txg (985454) is really syncing at this moment. The other one (985455) marked as SYNCED is actually in a post-QUIESCED state and waiting to start sync. So, the new TXG_STATE_WAIT_FOR_SYNC state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to reveal this situation. txg birth state nreserved nread nwritten ... 1086896 235261068743969 C 0 163577856 8437248 ... 1086897 235262870830801 C 0 280625152 822594048 ... 1086898 235264172219064 S 0 0 0 ... 1086899 235264936134407 W 0 0 0 ... 1086900 235264936296156 O 0 0 0 ... Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2075	2014-01-23 13:31:51 -08:00
Shen Yan	93292b3081	Use enum type(zfetch_dirn_t) instead Fix code with zfetch_dirn_t, which is more readable and clear. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2068	2014-01-23 12:56:33 -08:00
Tim Chase	4461aa6118	Allow chown/chgrp when no ACL SAs exist. From the comment in the commit: Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES SA in which case ENOENT is returned from zfs_acl_node_read() when the SA can't be located. Allow chown/chgrp to succeed in these cases rather than returning an error that makes no sense in the context of the caller. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfs-osx/zfs#86 Closes #1911 Closes #2029	2014-01-23 11:07:29 -08:00
Ned Bass	04aa2de8f7	vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE) The vdev_file_io_start() function may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1928	2014-01-23 09:58:07 -08:00
Brian Behlendorf	35d3e32274	Use long holds in zvol_set_volsize() Under Linux the zvol_set_volsize() function was originally written to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the dmu_objset_own()/dmu_objset_disown() interfaces were added but the ZVOL code wasn't updated to take advantage of them. This was never an issue but after the dsl_pool_config changes the code now takes the config lock twice. The cleanest solution is to shift to using dmu_objset_own() which takes a long hold on the dataset and does not hold the dsl pool lock. This patch also slightly restructures the existing code such that it more closely resembles the upstream Illumos code. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2039	2014-01-14 14:46:12 -08:00
Brian Behlendorf	fd23720ae1	Drain iput taskq outside z_teardown_lock It's unsafe to drain the iput taskq while holding the z_teardown_lock as a writer. This is because when the last reference on an inode is dropped it may still have pages which need to be written to disk. This will be done through zpl_writepages which will acquire the z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're holding the lock as a writer in zfs_sb_teardown the unmount will deadlock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #1988	2014-01-09 15:54:08 -08:00
Brian Behlendorf	4fcc43790c	Force LZ4_FORCE_SW_BITCOUNT for Sparc This change was proposed for Sparc but it's not clear to me why it's required. Proper support exists in the lz4 code to detect the endianness and the required builtins are available for gcc. Still I'm including the patch because it will only impact Sparc and it may resolve a case which hasn't occured to me. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:54:03 -08:00
Brian Behlendorf	b585bc4afa	Fix zfs_getattr_fast types On Sparc sp->blksize will be a 64-bit value which is then cast incorrectly to a 32-bit value. For big endian systems this results in an incorrect value for sp->blksize. To resolve the problem local variables of the correct size are used and then assigned to sp->blksize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:50:23 -08:00
Brian Behlendorf	aa0218d6a1	Fix nvlist 'Bus Error' for Sparc The mis-aligned memory accesses in nvpair_native_embedded() and nvpair_native_embedded_array() will cause a 'Bus Error' for architectures such as Sparc which not fully byte addressible. To avoid this issue care is taken to avoid dereferencing the potentially mis-aligned packed nvlist_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:50:15 -08:00
Brian Behlendorf	7f89ae6ba0	Use local variable to read zp->z_mode When accessing the zp->z_mode through the SA bulk interface we expect that 64-bits are available to hold the result. However, on 32-bit platforms mode_t will only be 32-bits so we cannot pass it to SA_ADD_BULK_ATTR(). Instead a local uint64_t variable must be used and the result assigned to zp->z_mode. This went unnoticed on 32-bit little endian platforms because the bytes happen to end up in the correct 32-bits. But on big endian platforms like Sparc the zp->z_mode will always end up set to zero. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: marku89 <mar42@kola.li> Issue #1700	2014-01-09 15:50:11 -08:00
John Layman	ecf3d9b8e6	Add ddt, ddt_entry, and l2arc_hdr caches Back the allocations for ddt tables+entries and l2arc headers with kmem caches. This will reduce the cost of allocating these commonly used structures and allow for greater visibility of them through the /proc/spl/kmem/slab interface. Signed-off-by: John Layman <jlayman@sagecloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1893	2014-01-07 10:33:11 -08:00
Tim Chase	fb8e608d9d	Fix the creation of ZPOOL_HIST_CMD pool history entries. Move the libzfs_fini() after the zpool_log_history() call so the ZPOOL_HIST_CMD entry can get written. Fix the handling of saved_poolname in zfsdev_ioctl() which was broken as part of the stack-reduction work in `a168788053`. Since ZoL destroys the TSD data in which the previously successful ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY ioctl has a very important restriction: it can only successfully write a long entry following a successful ioctl() if no intervening vops have been performed. Some of zfs subcommands do perform intervening vops and to do the logging themselves. At the moment, the "create" and "clone" subcommands have been modified appropriately. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1998	2014-01-07 09:00:26 -08:00
Tim Chase	5d862cb0d9	Properly handle updates of variably-sized SA entries. During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths. Previously, a size-changing update of a variably-sized SA that occurred when there were other variably-sized SAs in the bonus buffer would cause the subsequent SAs to be corrupted. The most common case in which this would occur is when a mode change caused the ZPL_DACL_ACES entry to change size when a ZPL_DXATTR (SA xattr) entry already existed. The following sequence would have caused a failure when xattr=sa was in force and would corrupt the bonus buffer: open(filename, O_WRONLY \| O_CREAT, 0600); ... lsetxattr(filename, ...); /* create xattr SA / chmod(filename, 0650); / enlarges the ACL */ Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1978	2013-12-20 13:52:33 -08:00
Brian Behlendorf	ac0340970c	Register correct handlers for nvlist_{dup,pack,unpack} This change is related to commit `81eaf15` which ensured the correct allocation handlers were installed for nvlist_alloc(). The nvlist functions nvlist_dup(), nvlist_pack(), and nvlist_unpack() suffer from the same issue and have been updated accordingly. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1937	2013-12-20 13:52:28 -08:00
Matthew Thode	11b9ec23b9	Add full SELinux support Four new dataset properties have been added to support SELinux. They are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map directly to the context options described in mount(8). When one of these properties is set to something other than 'none'. That string will be passed verbatim as a mount option for the given context when the filesystem is mounted. For example, if you wanted the rootcontext for a filesystem to be set to 'system_u:object_r:fs_t' you would set the property as follows: $ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media This will ensure the filesystem is automatically mounted with that rootcontext. It is equivalent to manually specifying the rootcontext with the -o option like this: $ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media By default all four contexts are set to 'none'. Further information on SELinux contexts is detailed in mount(8) and selinux(8) man pages. Signed-off-by: Matthew Thode <prometheanfire@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #1504	2013-12-19 10:37:31 -08:00
Michael Kjorling	d1d7e2689d	cstyle: Resolve C style issues The vast majority of these changes are in Linux specific code. They are the result of not having an automated style checker to validate the code when it was originally written. Others were caused when the common code was slightly adjusted for Linux. This patch contains no functional changes. It only refreshes the code to conform to style guide. Everyone submitting patches for inclusion upstream should now run 'make checkstyle' and resolve any warning prior to opening a pull request. The automated builders have been updated to fail a build if when 'make checkstyle' detects an issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1821	2013-12-18 16:46:35 -08:00
Turbo Fredriksson	fd8febbd1e	Add zfs_send_corrupt_data module option Tuning setting to ignore read/checksum errors when sending data. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1982 Issue #1897	2013-12-18 16:46:35 -08:00
Chunwei Chen	7dc71949f2	Fix z_sync_cnt decrement in zfs_close The comment in zfs_close states that "Under Linux the zfs_close() hook is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close is associated with every successful struct file creation/deletion, which should always be balanced. Here is an example of what's wrong: Process A B open(O_SYNC) z_sync_cnt = 1 open(O_SYNC) z_sync_cnt = 2 close() z_sync_cnt = 0 So z_sync_cnt is 0 even if B still has the file with O_SYNC. Also moves the generic_file_open call before zfs_open to ensure that in the case generic_file_open fails z_sync_cnt is not incremented. This is safe because generic_file_open has no side effects. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1962	2013-12-17 10:28:27 -08:00
Brian Behlendorf	ce37ebd2eb	cstyle: zvol.c Update zvol.c to conform to the style guidelines, verified by running cstyle.pl on the source file. This patch contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #1821	2013-12-16 09:41:45 -08:00
Brian Behlendorf	2e0358cbca	Sync /dev/zfs ioctl ordering In order to minimize any future disruption caused by the addition and removal /dev/zfs ioctls this patch makes the following changes. 1) Sync ZoL's ioctl ordering such that it matches Illumos. For historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID ioctls were out of order. 2) Move Linux and FreeBSD specific ioctls in to their own reserved ranges. This allows us to preserve the existing ordering when new ioctls are added by either Illumos or FreeBSD. When an ioctl is no longer needed it should be retired in place. This change alters the ZFS user/kernel ABI so make sure you rebuild both your user and kernel modules. However, it should allow for a much stabler interface going forward. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1973	2013-12-16 09:41:39 -08:00
Brian Behlendorf	ba6a24026c	Remove ZFC_IOC__MINOR ioctl()s Early versions of ZFS coordinated the creation and destruction of device minors from userspace. This was inherently racy and in late 2009 these ioctl()s were removed leaving everything up to the kernel. This significantly simplified the code. However, we never picked up these changes in ZoL since we'd already significantly adjusted this code for Linux. This patch aims to rectify that by finally removing ZFC_IOC__MINOR ioctl()s and moving all the functionality down in to the kernel. Since this cleanup will change the kernel/user ABI it's being done in the same tag as the previous libzfs_core ABI changes. This will minimize, but not eliminate, the disruption to end users. Once merged ZoL, Illumos, and FreeBSD will basically be back in sync in regards to handling ZVOLs in the common code. While each platform must have its own custom zvol.c implemenation the interfaces provided are consistent. NOTES: 1) This patch introduces one subtle change in behavior which could not be easily avoided. Prior to this change callers of 'zfs create -V ...' were guaranteed that upon exit the /dev/zvol/ block device link would be created or an error returned. That's no longer the case. The utilities will no longer block waiting for the symlink to be created. Callers are now responsible for blocking, this is why a 'udev_wait' call was added to the 'label' function in scripts/common.sh. 2) The read-only behavior of a ZVOL now solely depends on if the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant policy setting in the gendisk structure was removed. This both simplifies the code and allows us to safely leverage set_disk_ro() to issue a KOBJ_CHANGE uevent. See the comment in the code for futher details on this. 3) Because __zvol_create_minor() and zvol_alloc() may now be called in a sync task they must use KM_PUSHPAGE. References: illumos/illumos-gate@681d9761e8 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #1969	2013-12-16 09:15:57 -08:00
George Wilson	dda12da9f1	Illumos #4121 vdev_label_init read only 4121 vdev_label_init should treat request as succeeded when pool is read only Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/4121 illumos/illumos-gate@973c78e94b Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1863	2013-12-12 10:24:01 -08:00
Tim Chase	84b0aac5fd	Fix atime handling. Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP() followed by zfs_inode_update() to update the atime. However, since atimes are cached in the znode for delayed writing, the zfs_inode_update() function would effectively ignore the cached atime by reading it from the SA. This commit moves the updating of the atime in the inode into zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP() macro and eliminates the call to zfs_inode_update() in the atime-modifying vnops. It's possible the same thing could have been done directly in zfs_inode_update() but I wasn't sure that it was safe in all cases where it is called. The effect is that atime handling is as if "strictatime" were selected; even if the filesystem is mounted with "relatime". Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1949	2013-12-12 10:23:58 -08:00
david.chen	be5db977ea	Remove MAX when initializing arc_c_max The MAX when initializing arc_c_max doesn't make any sense because it hasn't been set anywhere before. Though, arc_c_max should be implicitly set to zero when initializing arc_stats, so the MAX doesn't make any difference. The MAX was mistakenly left if place when the Illumos default values were changed for Linux. Signed-off-by: david.chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1941	2013-12-10 10:05:40 -08:00
Ned Bass	b6e335bfc4	Revert "Use directory xattrs for symlinks" This reverts commit `6a7c0ccca4`. A proper fix for Issue #1648 was landed under Issue #1890, so this is no longer needed. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1648	2013-12-10 09:48:30 -08:00
James Pan	472e7c6085	sa_find_sizes() may compute wrong SA header size Under the right conditions sa_find_sizes() will compute an incorrect size of the system attribute (SA) header. This causes a failed assertion when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead to corruption of SA data. The bug presents itself when there are more than two variable-length SAs of just the right size to fit in the bonus buffer of a dnode. The existing logic fails to account for the SA header space needed to store the sizes of all the variable-length SAs. A reproducer was possible on Linux by setting the xattr=sa dataset property and storing xattrs on symbolic links (Issue #1648). Note the corrupt link target name: $ zfs set xattr=sa tank/fish $ cd /tank/fish $ ln -fs 12345678901234567 link $ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link $ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link $ ls -l link lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567 Commit `6a7c0ccca4` worked around this bug by forcing xattr's on symlinks to be stored in directory format. This change implements a proper fix, so the workaround can now be reverted. The reference link below contains a reproducer for FreeBSD. References: http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1890	2013-12-10 09:48:15 -08:00
Brian Behlendorf	90ee9ed32f	Fix 'zfs diff' shares error When creating a dataset with ZoL a zsb->z_shares_dir ZAP object will not be created because shares are unimplemented. Instead ZoL just sets zsb->z_shares_dir to zero to indicate there are no shares. However, if you import a pool which was created with a different ZFS implementation then the shares ZAP object may exist. Code was added to handle this case but it clearly wasn't sufficiently tested with other ZFS pools. There was a bug in the zpl_shares_getattr() function which passed the wrong inode to zfs_getattr_fast() for the case where are shares ZAP object does exist. This causes an EIO to be returned to stat64() which in turn causes 'zfs diff' to fail. This fix is the pass the correct inode after a sucessful zfs_zget(). Additionally, only put away the references if we were able to get one. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Graham Booker <https://github.com/gbooker> Signed-off-by: timemaster67 <https://github.com/timemaster67> Closes #1426 Closes #481	2013-12-06 09:42:39 -08:00
Brian Behlendorf	99e349db92	Add module versioning Use the standard Linux MODULE_VERSION macro to expose the installed zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions. This will also automatically add a checksum of the .c files and headers in "srcversion". See: /sys/module/zavl/version /sys/module/zavl/srcversion /sys/module/znvpair/version /sys/module/znvpair/srcversion /sys/module/zunicode/version /sys/module/zunicode/srcversion /sys/module/zcommon/version /sys/module/zcommon/srcversion /sys/module/zfs/version /sys/module/zfs/srcversion /sys/module/zpios/version /sys/module/zpios/srcversion Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1923	2013-12-06 09:34:41 -08:00
Matthew Ahrens	e8b96c6007	Illumos #4045 write throttle & i/o scheduler performance work 4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev__max_active values or zfs_vdev_max_active instead. the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brendan Gregg <brendan.gregg@joyent.com> Approved by: Robert Mustacchi <rm@joyent.com> References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5647 Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1913	2013-12-06 09:32:43 -08:00
Matthew Ahrens	384f8a09f8	Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT) Fix a lock contention issue by allowing threads not holding ZPL locks to block when waiting to assign a transaction. Porting Notes: zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This case may be a contention point just like zfs_write(), however it is not safe to block here since it may be called during memory reclaim. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4347 illumos/illumos-gate@e722410c49 Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-12-06 09:30:51 -08:00
Brian Behlendorf	2e40f09410	Remove incorrect ASSERT in zfs_sb_teardown() As part of zfs_sb_teardown() there is an assertion that all inodes which are part of the zsb->z_all_znodes list have at least one reference on them. This is always true for the standard unmount case but there are two other cases where it is not strictly true. * zfs_ioc_rollback() - This is the most common case and it results from the fact that we aren't unmounting the filesystem. During a normal unmount the MS_ACTIVE flag will be cleared on the super block causing iput_final() to evict the inode when its reference count drops to zero. However, during a rollback MS_ACTIVE remains set since we're rolling back a live filesystem and need to preserve the existing super block. This allows inodes with a zero reference count to stay in the cache thereby violating the assertion. * destroy_inode() / zfs_sb_teardown() - There exists a small race between dropping the last reference on an inode and removing it from the zsb->z_all_znodes list. This is unlikely to occur but could also trigger the assertion which is incorrect. The inode may safely have a zero reference count in this case. Since allowing a zero reference count on the inode is expected and safe for both of these cases the simplest thing to do is remove the ASSERT. This code is only enabled for default builds so removing this entirely is a very safe change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #1417 Closes #1536	2013-12-02 15:58:58 -08:00
Tim Chase	f707635fa5	Some nvlist allocations in hold processing need to use KM_PUSHPAGE. This should hopefully catch the rest of the allocations in the user hold/release processing that were missed by commit `65c67ea86e`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1852 Closes #1855	2013-12-02 14:02:46 -08:00
Etienne Dechamps	119a394ab0	Only commit the ZIL once in zpl_writepages() (msync() case). Currently, using msync() results in the following code path: sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage In such a code path, zil_commit() is called as part of zpl_putpage(). This means that for each page, the write is handed to the DMU, the ZIL is committed, and only then do we move on to the next page. As one might imagine, this results in atrocious performance where there is a large number of pages to write: instead of committing a batch of N writes, we do N commits containing one page each. In some extreme cases this can result in msync() being ~700 times slower than it should be, as well as very inefficient use of ZIL resources. This patch fixes this issue by making sure that the requested writes are batched and then committed only once. Unfortunately, the implementation is somewhat non-trivial because there is no way to run write_cache_pages in SYNC mode (so that we get all pages) without making it wait on the writeback tag for each page. The solution implemented here is composed of two parts: - I added a new callback system to the ZIL, which allows the caller to be notified when its ITX gets written to stable storage. One nice thing is that the callback is called not only in zil_commit() but in zil_sync() as well, which means that the caller doesn't have to care whether the write ended up in the ZIL or the DMU: it will get notified as soon as it's safe, period. This is an improvement over dmu_tx_callback_register() that was used previously, which only supports DMU writes. The rationale for this change is to allow zpl_putpage() to be notified when a ZIL commit is completed without having to block on zil_commit() itself. - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage from issuing ZIL commits. zpl_writepages() will issue the commit itself instead of relying on zpl_putpage() to do it, thus nicely batching the writes. Note, however, that we still have to call write_cache_pages() again in SYNC mode because there is an edge case documented in the implementation of write_cache_pages() whereas it will not give us all dirty pages when running in non-SYNC mode. Thus we need to run it at least once in SYNC mode to make sure we honor persistency guarantees. This only happens when the pages are modified at the same time msync() is running, which should be rare. In most cases there won't be any additional pages and this second call will do nothing. Note that this change also fixes a bug related to #907 whereas calling msync() on pages that were already handed over to the DMU in a previous writepages() call would make msync() block until the next TXG sync instead of returning as soon as the ZIL commit is complete. The new callback system fixes that problem. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1849 Closes #907	2013-11-23 15:08:29 -08:00
Brian Behlendorf	e3dc14b861	Add I/O Read/Write Accounting Because ZFS bypasses the page cache we don't inherit per-task I/O accounting for free. However, the Linux kernel does provide helper functions allow us to perform our own accounting. These are most commonly used for direct IO which also bypasses the page cache, but they can be used for the common read/write call paths as well. Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #313 Closes #1275	2013-11-21 08:56:24 -08:00
Steven Hartland	e5bacf2109	Illumos #4322 4322 ZFS deadlock on dp_config_rwlock Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4322 illumos/illumos-gate@c50d56f667 Ported by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1886	2013-11-20 15:27:32 -08:00
Brian Behlendorf	64ad2b26e2	Remove the slog restriction on bootfs pools Under Linux this restriction does not apply because we have access to all the required devices. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1631	2013-11-14 14:28:35 -08:00
Matthew Thode	227bc96951	Fixes (extends) support for selinux xattrs to more inode types Properly initialize SELinux xattrs for all inode types. The initial implementation accidentally only did this for files. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1832	2013-11-14 14:28:35 -08:00
Brian Behlendorf	a168788053	Reduce stack for traverse_visitbp() recursion During pool import stack overflows may still occur due to the potentially deep recursion of traverse_visitbp(). This is most likely to occur when additional layers are added to the block device stack such as DM multipath. To minimize the stack usage for this call path the following changes were made: 1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions to prevent them from being inlined by gcc. This reduced the stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and vdev_mirror_io_start() from 144 to 128 bytes. 2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved from the stack to the heap. This reduced the stack usage of zfsdev_ioctl() from 368 to 112 bytes. 3) The major saving came from slimming down traverse_visitbp() from from 224 to 144 bytes. Since this function is called recursively the 80 bytes saved per invokation adds up. The following changes were made: a) The 'hard' local variable was replaced by a TD_HARD() macro. b) The 'pd' local variable was replaced by 'td->td_pfd' references. c) The zbookmark_t was moved to the heap. This does cost us an additional memory allocation per recursion by that cost should still be minimal. The cost could be further reduced by adding a dedicated zbookmark_t slab cache. d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were restructured to use the minimum amount of stack. This includes removing the 'cbp' local variable. Overall for the offending use case roughly 1584 of total stack space has been saved. This is enough to avoid overflowing the stack on stock kernels with 8k stacks. See #1778 for additional details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1778	2013-11-14 14:28:12 -08:00
Tim Chase	65c67ea86e	Some nvlist allocations in hold processing need to use KM_PUSHPAGE. Commit `95fd54a1c5` restructured the hold/release processing and moved some of the work into the sync task. A number of nvlist allocations now need to use KM_PUSHPAGE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1852 Closes #1855	2013-11-14 11:11:37 -08:00
Tim Chase	2008e9209f	Fix rollback of mounted filesystem regression The Illumos #3875 patch reverted a part of ZoL's `7b3e34b` which added special-case error handling for zfs_rezget(). The error handling dealt with the case in which an all-ones object number ended up being passed to dnode_hold() and causing an EINVAL to be returned from zfs_rezget(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1859 Closes #1861	2013-11-14 10:44:03 -08:00
Tim Chase	fd4f76160c	Handle concurrent snapshot automounts failing due to EBUSY. In the current snapshot automount implementation, it is possible for multiple mounts to attempted concurrently. Only one of the mounts will succeed and the other will fail. The failed mounts will cause an EREMOTE to be propagated back to the application. This commit works around the problem by adding a new exit status, MOUNT_BUSY to the mount.zfs program which is used when the underlying mount(2) call returns EBUSY. The zfs code detects this condition and treats it as if the mount had succeeded. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1819	2013-11-08 10:45:14 -08:00
Massimo Maggi	b695c34ea4	Honor CONFIG_FS_POSIX_ACL kernel option The required Posix ACL interfaces are only available for kernels with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix ACL support for these kernels. All major distribution kernels enable CONFIG_FS_POSIX_ACL by default. If your kernel does not support Posix ACLs the following warning will be printed at ZFS module load time. "ZFS: Posix ACLs disabled by kernel" Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1825	2013-11-05 16:22:05 -08:00
Matthew Ahrens	78e2739d3c	26126 panic system rather than corrupting pool if we hit bug 26100 References: delphix/delphix-os@931c8aaab7 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1650	2013-11-05 13:18:26 -08:00
Brian Behlendorf	2517c8ee08	Switch allocations from KM_SLEEP to KM_PUSHPAGE A couple of kmem_alloc() allocations were using KM_SLEEP in the sync thread context. These were accidentally introduced by the recent set of Illumos patches. The solution is to switch to KM_PUSHPAGE. dsl_dataset_promote_sync() -> promote_hold() -> snaplist_make() -> kmem_alloc(sizeof (snap), KM_SLEEP); dsl_dataset_user_hold_sync() -> dsl_onexit_hold_cleanup() -> kmem_alloc(sizeof (ca), KM_SLEEP) Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:26:14 -08:00
Saso Kiselkov	1ca546b338	Illumos #3995 3995 Memory leak of compressed buffers in l2arc_write_done References: https://illumos.org/issues/3995 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1688 Issue #1775	2013-11-05 12:26:00 -08:00
George Wilson	43a696ed38	Illumos #4168 , #4169 , #4170 4168 ztest assertion failure in dbuf_undirty 4169 verbatim import causes zdb to segfault 4170 zhack leaves pool in ACTIVE state Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4168 https://www.illumos.org/issues/4169 https://www.illumos.org/issues/4170 illumos/illumos-gate@7fdd916c47 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:25:44 -08:00
Matthew Ahrens	92bc214c2e	Illumos #4082 4082 zfs receive gets EFBIG from dmu_tx_hold_free() Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/4082 illumos/illumos-gate@5253393b09 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:25:26 -08:00
George Wilson	ac72fac3ea	Illumos #3954 , #4080 , #4081 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit 4080 zpool clear fails to clear pool 4081 need zfs_mg_noalloc_threshold Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3954 https://www.illumos.org/issues/4080 https://www.illumos.org/issues/4081 illumos/illumos-gate@22e30981d8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:25:01 -08:00
Matthew Ahrens	a169a625a6	Illumos #4046 4046 dsl_dataset_t ds_dir->dd_lock is highly contended Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4046 illumos/illumos-gate@b62969f868 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. This commit removed dsl_dataset_namelen in Illumos, but that appears to have been removed from ZFSOnLinux in an earlier commit.	2013-11-05 12:24:24 -08:00
Matthew Ahrens	b663a23d36	Illumos #4047 4047 panic from dbuf_free_range() from dmu_free_object() while doing zfs receive Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4047 illumos/illumos-gate@713d6c2088 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. The exported symbol dmu_free_object() was renamed to dmu_free_long_object() in Illumos.	2013-11-05 12:23:35 -08:00
Matthew Ahrens	46ba1e59d3	Illumos #3996 3996 want a libzfs_core API to rollback to latest snapshot Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andy Stormont <andyjstormont@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3996 illumos/illumos-gate@a7027df17f Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:23:11 -08:00
George Wilson	5d1f7fb647	Illumos #3956 , #3957 , #3958 , #3959 , #3960 , #3961 , #3962 3956 ::vdev -r should work with pipelines 3957 ztest should update the cachefile before killing itself 3958 multiple scans can lead to partial resilvering 3959 ddt entries are not always resilvered 3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth 3961 freed gang blocks are not resilvered and can cause pool to suspend 3962 ztest should print out zfs debug buffer before exiting Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3956 https://www.illumos.org/issues/3957 https://www.illumos.org/issues/3958 https://www.illumos.org/issues/3959 https://www.illumos.org/issues/3960 https://www.illumos.org/issues/3961 https://www.illumos.org/issues/3962 illumos/illumos-gate@b4952e17e8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting notes: 1. zfs_dbgmsg_print() is only used in userland. Since we do not have mdb on Linux, it does not make sense to make it available in the kernel. This means that a build failure will occur if any future kernel patch depends on it. However, that is unlikely given that this functionality was added to support zdb. 2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels. This preserves the existing behavior of minimal noise when running with -V, and -VV. 3. In vdev_config_generate() the call to nvlist_alloc() was not changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in the txg_sync context.	2013-11-05 12:23:05 -08:00
George Wilson	621dd7bb2c	Illumos #3949 , #3950 , #3952 , #3953 3949 ztest fault injection should avoid resilvering devices 3950 ztest: deadman fires when we're doing a scan 3951 ztest hang when running dedup test 3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3949 https://www.illumos.org/issues/3950 https://www.illumos.org/issues/3951 https://www.illumos.org/issues/3952 illumos/illumos-gate@2c1e2b4414 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. The deadman thread was removed from ztest during the original port because it depended on Solaris thr_create() interface. This functionality should be reintroduced using the more portable pthreads.	2013-11-05 12:17:07 -08:00
Matthew Ahrens	383fc4a997	Illumos #3955 3955 ztest failure: assertion refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3955 illumos/illumos-gate@be9000cc67 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:16:14 -08:00
Steven Hartland	9554185d90	Illumos #3973 3973 zfs_ioc_rename alters passed in zc->zc_name Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3973 illumos/illumos-gate@a0c1127b14 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:15:50 -08:00
Matthew Ahrens	ea97f8ce35	Illumos #3834 3834 incremental replication of 'holey' file systems is slow Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3834 illumos/illumos-gate@ca48f36f20 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:15:00 -08:00
Matthew Ahrens	2883cad5b7	Illumos #3836 3836 zio_free() can be processed immediately in the common case Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/3836 illumos/illumos-gate@9cb154a3c9 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-05 12:14:56 -08:00
Matthew Ahrens	498877baf5	Illumos #3112 , #3113 , #3114 3112 ztest does not honor ZFS_DEBUG 3113 ztest should use watchpoints to protect frozen arc bufs 3114 some leaked nvlists in zfsdev_ioctl Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Amdur <Matt.Amdur@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References: https://www.illumos.org/issues/3112 https://www.illumos.org/issues/3113 https://www.illumos.org/issues/3114 illumos/illumos-gate@cd1c8b85eb The /proc/self/cmd watchpoint interface is specific to Solaris. Therefore, the #3113 implementation was reworked to use the more portable mprotect(2) system call. When the pages are watched they are marked read-only for protection. Any write to the protected address range immediately trigger a SIGSEGV. The pages are marked writable again when they are unwatched. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1489	2013-11-05 12:14:48 -08:00
George Wilson	03c6040bee	Illumos #3236 3236 zio nop-write Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: illumos/illumos-gate@80901aea8e https://www.illumos.org/issues/3236 Porting Notes 1. This patch is being merged dispite an increased instance of https://www.illumos.org/issues/3113 being triggered by ztest. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1489	2013-11-05 12:14:21 -08:00
Keith M Wesolowski	831baf06ef	Illumos #3875 3875 panic in zfs_root() after failed rollback Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/3875 illumos/illumos-gate@91948b51b8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 11:27:41 -08:00
Matthew Ahrens	1958067629	Illumos #3888 3888 zfs recv -F should destroy any snapshots created since the incremental source Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Peng Dai <peng.dai@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3888 illumos/illumos-gate@34f2f8cf94 Porting notes: 1. Commit `1fde1e3720` wrapped a declaration in dsl_dataset_modified_since_lastsnap in ASSERTV(). The ASSERTV() and local variable have been removed to avoid an unused variable warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Richard Yao <ryao@gentoo.org> Issue #1775	2013-11-04 11:18:14 -08:00
Keith M Wesolowski	96c2e96193	Illumos #3894 3894 zfs should not allow snapshot of inconsistent dataset Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/3894 illumos/illumos-gate@ca48f36f20 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 11:18:14 -08:00
Matthew Ahrens	1a077756e8	Illumos #3829 3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl Reviewed by: Matt Amdur <matt.amdur@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3829 illumos/illumos-gate@bb6e70758d Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 11:18:14 -08:00
Steven Hartland	95fd54a1c5	Illumos #3740 3740 Poor ZFS send / receive performance due to snapshot hold / release processing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3740 illumos/illumos-gate@a7a845e4bf Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. `13fe019870` introduced a merge conflict in dsl_dataset_user_release_tmp where some variables were moved outside of the preprocessor directive. 2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable because this commit refactors the code, adding a new KM_SLEEP allocation. It is not clear to me whether this should be converted to KM_PUSHPAGE. 3. We had a merge conflict in libzfs_sendrecv.c because of copyright notices. 4. Several small C99 compatibility fixed were made.	2013-11-04 11:17:48 -08:00
Will Andrews	d09f25dc66	Illumos #3744 3744 zfs shouldn't ignore errors unmounting snapshots Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3744 illumos/illumos-gate@fc7a6e3fef Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. There is no clear way to distinguish between a failure when we tried to unmount the snapdir of a zvol (which does not exist) and the failure when we try to unmount a snapdir of a dataset, so the changes to zfs_unmount_snap() were dropped in favor of an altered Linux function that unconditionally returns 0.	2013-11-04 10:55:25 -08:00
Will Andrews	3a84951d7d	Illumos #3743 3743 zfs needs a refcount audit Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3743 illumos/illumos-gate@b287be1ba8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:55:25 -08:00
Will Andrews	d3cc8b152e	Illumos #3742 3742 zfs comments need cleaner, more consistent style Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3742 illumos/illumos-gate@f717074149 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. The change to zfs_vfsops.c was dropped because it involves zfs_mount_label_policy, which does not exist in the Linux port.	2013-11-04 10:55:25 -08:00
Will Andrews	e49f1e20a0	Illumos #3741 3741 zfs needs better comments Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3741 illumos/illumos-gate@3e30c24aee Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:55:25 -08:00
Martin Matuska	b1118acbb1	Illumos #3699 , #3739 3699 zfs hold or release of a non-existent snapshot does not output error 3739 cannot set zfs quota or reservation on pool version < 22 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Eric Shrock <eric.schrock@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/3699 https://www.illumos.org/issues/3739 illumos/illumos-gate@013023d4ed Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:55:25 -08:00
Adam Leventhal	63fd3c6cfd	Illumos #3582 , #3584 3582 zfs_delay() should support a variable resolution 3584 DTrace sdt probes for ZFS txg states Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3582 illumos/illumos-gate@0689f76 Ported by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:55:25 -08:00
Mark Shellenbaum	c1fabe7961	6977619 NULL pointer deference in sa_handle_get_from_db() References: illumos/illumos-gate@44bffe012c Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-11-04 10:54:48 -08:00
Mark Shellenbaum	c0ebc844c7	6939941 problem with moving files in zfs References: illumos/illumos-gate@d39ee142a9 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. This commit was so old that only two lines applied to the modern code base.	2013-11-04 10:53:18 -08:00
George Wilson	2696dfafd9	Illumos #3642 , #3643 3642 dsl_scan_active() should not issue I/O to determine if async destroying is active 3643 txg_delay should not hold the tc_lock Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/3642 https://www.illumos.org/issues/3643 illumos/illumos-gate@4a92375985 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting Notes: 1. The alignment assumptions for the tx_cpu structure assume that a kmutex_t is 8 bytes. This isn't true under Linux but tc_pad[] was adjusted anyway for consistency since this structure was never carefully aligned in ZoL. If careful alignment does impact performance significantly this should be reworked to be portable.	2013-11-01 08:55:12 -07:00
Matthew Ahrens	7ec09286b7	Illumos #3645 , #3692 3645 dmu_send_impl: possibilty of pool hold leak 3692 Panic on zfs receive of a recursive deduplicated stream Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3645 https://www.illumos.org/issues/3692 illumos/illumos-gate@de8d9cff56 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1792 Issue #1775	2013-10-31 14:58:09 -07:00
Matthew Ahrens	2e528b49f8	Illumos #3598 3598 want to dtrace when errors are generated in zfs Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3598 illumos/illumos-gate@be6fd75a69 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Porting notes: 1. include/sys/zfs_context.h has been modified to render some new macros inert until dtrace is available on Linux. 2. Linux-specific changes have been adapted to use SET_ERROR(). 3. I'm NOT happy about this change. It does nothing but ugly up the code under Linux. Unfortunately we need to take it to avoid more merge conflicts in the future. -Brian	2013-10-31 14:58:04 -07:00
Yuri Pankov	7011fb6004	Illumos #3517 3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3517 illumos/illumos-gate@efb4a871d8 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-10-31 14:57:59 -07:00
Matthew Ahrens	d1fada1e6d	Illumos #3603 , #3604 : bobj improvements 3603 panic from bpobj_enqueue_subobj() 3604 zdb should print bpobjs more verbosely 3871 GCC 4.5.3 does not like issue 3604 patch Reviewed by: Henrik Mattson <henrik.mattson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/3603 https://www.illumos.org/issues/3604 https://www.illumos.org/issues/3871 illumos/illumos-gate@d04756377d Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775 Note that the patch from Illumos issue 3871 is not accepted into Illumos at the time of this writing. It is something that I wrote when porting this. Documentation is in the Illumos issue.	2013-10-31 14:57:51 -07:00
Matthew Ahrens	24a64651b4	Illumos #3588 3588 provide zfs properties for logical (uncompressed) space used and referenced Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3588 illumos/illumos-gate@77372cb0f3 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-31 10:16:11 -07:00
George Wilson	c2e42f9d53	Illumos #3578 , #3579 3578 transferring the freed map to the defer map should be constant time 3579 ztest trips assertion in metaslab_weight() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/3578 https://www.illumos.org/issues/3579 illumos/illumos-gate@9eb57f7f3f Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-31 09:23:40 -07:00
George Wilson	23c0a1333c	Illumos #3561 , #3116 3561 arc_meta_limit should be exposed via kstats 3116 zpool reguid may log negative guids to internal SPA history Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3561 https://www.illumos.org/issues/3116 illumos/illumos-gate@20128a0826 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: 1. The spa change was accidentally included in the libzfs_core merge. 2. "Add missing arcstats" (`1834f2d8b7`) already implemented these kstats a few years ago.	2013-10-31 09:23:40 -07:00
Matthew Ahrens	330847ff36	Illumos #3537 3537 want pool io kstats Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Brendan Gregg <brendan.gregg@joyent.com> Approved by: Gordon Ross <gwr@nexenta.com> References: http://www.illumos.org/issues/3537 illumos/illumos-gate@c3a6601 Ported by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: 1. The patch was restructured to take advantage of the existing spa statistics infrastructure. To accomplish this the kstat was moved in to spa->io_stats and the init/destroy code moved to spa_stats.c. 2. The I/O kstat was simply named <pool> which conflicted with the pool directory we had already created. Therefore it was renamed to <pool>/io 3. An update handler was added to allow the kstat to be zeroed.	2013-10-31 09:16:03 -07:00
George Wilson	a117a6d66e	Illumos #3522 3522 zfs module should not allow uninitialized variables Reviewed by: Sebastien Roy <seb@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3522 illumos/illumos-gate@d5285cae91 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting notes: 1. ZFSOnLinux had already addressed many of these issues because of its use of -Wall. However, the manner in which they were addressed differed. The illumos fixes replace the ones previously made in ZFSOnLinux to reduce code differences. 2. Part of the upstream patch made a small change to arc.c that might address zfsonlinux/zfs#1334. 3. The initialization of aclsize in zfs_log_create() differs because vsecp is a NULL pointer on ZFSOnLinux. 4. The changes to zfs_register_callbacks() were dropped because it has diverged and needs to be resynced.	2013-10-30 14:51:27 -07:00
Richard Yao	495b25a91a	Add missing code to zfs_debug.{c,h} This is required to make Illumos 3962 merge. Signed-off-by: Richard Yao <ryao@gentoo.org>	2013-10-29 15:06:18 -07:00
Richard Yao	632a242e83	Add missing copyright notices from Illumos This resolves merge conflicts when merging Illumos #3588 and Illumos #4047. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-10-29 15:06:18 -07:00
Richard Yao	20f04f08aa	Fix incorrect usage of strdup() in zfs_unmount_snap() Modifying the length of a string returned by strdup() is incorrect because strfree() is allowed to use strlen() to determine which slab cache was used to do the allocation. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-10-29 15:06:18 -07:00
Richard Yao	8c8417933f	Fix order of function calls in zio_free_sync() The resolution of a merge conflict when merging Illumos #3464 caused us to invert the order couple of function calls in zio_free_sync() versus what they are in Illumos. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-10-29 15:06:18 -07:00
Richard Yao	9cac042cfe	Reintroduce uio_prefaultpages() This was accidentally removed by overzealous commenting. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1775	2013-10-29 15:06:18 -07:00
Massimo Maggi	023699cd62	Posix ACL Support This change adds support for Posix ACLs by storing them as an xattr which is common practice for many Linux file systems. Since the Posix ACL is stored as an xattr it will not overwrite any existing ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also be non-functional on other platforms although it may be visible as an xattr if that platform understands SA based xattrs. By default Posix ACLs are disabled but they may be enabled with the new 'aclmode=noacl\|posixacl' property. Set the property to 'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added an appropriate acltype will be added. This change passes the POSIX Test Suite cleanly with the exception of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too). http://www.tuxera.com/community/posix-test-suite/ Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #170	2013-10-29 14:54:26 -07:00
Brian Behlendorf	fc9e0530c9	Prevent xattr remove from creating xattr directory Attempting to remove an xattr from a file which does not contain any directory based xattrs would result in the xattr directory being created. This behavior is non-optimal because it results in write operations to the pool in addition to the expected error being returned. To prevent this the CREATE_XATTR_DIR flag is only passed in zpl_xattr_set_dir() when setting a non-NULL xattr value. In addition, zpl_xattr_set() is updated similarly such that it will return immediately if passed an xattr name which doesn't exist and a NULL value. Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #170	2013-10-29 13:23:53 -07:00
Richard Yao	c12e3a594a	Restructure zfs_readdir() to fix regressions This does the following: 1. It creates a uint8_t type value, which is initialized to DT_DIR on dot directories and ZFS_DIRENT_TYPE(zap.za_first_integer) otherwise. This resolves a regression where we return unintialized values as the directory entry type on dot directories. This was accidentally introduced by commit `8170d28126`. 2. It restructures zfs_readdir() code to use `uint64_t offset` like Illumos instead of `loff_t *pos`. This resolves a regression where negative ZAP cursors were treated as if they were dot directories. 3. It restructures the function to more closely match the structure of zfs_readdir() on Illumos and removes the unused variable outcount, which was only used on Illumos. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1750	2013-10-29 09:51:59 -07:00
Brian Behlendorf	e0b0ca983d	Add visibility in to cached dbufs Currently there is no mechanism to inspect which dbufs are being cached by the system. There are some coarse counters in arcstats by they only give a rough idea of what's being cached. This patch aims to improve the current situation by adding a new dbufs kstat. When read this new kstat will walk all cached dbufs linked in to the dbuf_hash. For each dbuf it will dump detailed information about the buffer. It will also dump additional information about the referenced arc buffer and its related dnode. This provides a more complete view in to exactly what is being cached. With this generic infrastructure in place utilities can be written to post-process the data to understand exactly how the caching is working. For example, the data could be processed to show a list of all cached dnodes and how much space they're consuming. Or a similar list could be generated based on dnode type. Many other ways to interpret the data exist based on what kinds of questions you're trying to answer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov>	2013-10-25 13:59:40 -07:00
Brian Behlendorf	2d37239a28	Add visibility in to dmu_tx_assign times This change adds a new kstat to gain some visibility into the amount of time spent in each call to dmu_tx_assign. A histogram is exported via the new dmu_tx_assign file. The information contained in this histogram is the frequency dmu_tx_assign took to complete given an interval range. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Brian Behlendorf	0b1401ee91	Add visibility in to txg sync behavior This change is an attempt to add visibility in to how txgs are being formed on a system, in real time. To do this, a list was added to the in memory SPA data structure for a pool, with each element on the list corresponding to txg. These entries are then exported through the kstat interface, which can then be interpreted in userspace. For each txg, the following information is exported: * Unique txg number (uint64_t) * The time the txd was born (hrtime_t) (not wall clock time; relative to the other entries on the list) * The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted) * The number of reserved bytes for the txg (uint64_t) * The number of bytes read during the txg (uint64_t) * The number of bytes written during the txg (uint64_t) * The number of read operations during the txg (uint64_t) * The number of write operations during the txg (uint64_t) * The time the txg was closed (hrtime_t) * The time the txg was quiesced (hrtime_t) * The time the txg was synced (hrtime_t) Note that while the raw kstat now stores relative hrtimes for the open, quiesce, and sync times. Those relative times are used to calculate how long each state took and these deltas and printed by output handlers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Prakash Surya	1421c89142	Add visibility in to arc_read This change is an attempt to add visibility into the arc_read calls occurring on a system, in real time. To do this, a list was added to the in memory SPA data structure for a pool, with each element on the list corresponding to a call to arc_read. These entries are then exported through the kstat interface, which can then be interpreted in userspace. For each arc_read call, the following information is exported: * A unique identifier (uint64_t) * The time the entry was added to the list (hrtime_t) (not wall clock time; relative to the other entries on the list) * The objset ID (uint64_t) * The object number (uint64_t) * The indirection level (uint64_t) * The block ID (uint64_t) * The name of the function originating the arc_read call (char[24]) * The arc_flags from the arc_read call (uint32_t) * The PID of the reading thread (pid_t) * The command or name of thread originating read (char[16]) From this exported information one can see, in real time, exactly what is being read, what function is generating the read, and whether or not the read was found to be already cached. There is still some work to be done, but this should serve as a good starting point. Specifically, dbuf_read's are not accounted for in the currently exported information. Thus, a follow up patch should probably be added to export these calls that never call into arc_read (they only hit the dbuf hash table). In addition, it might be nice to create a utility similar to "arcstat.py" to digest the exported information and display it in a more readable format. Or perhaps, log the information and allow for it to be "replayed" at a later time. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Brian Behlendorf	76463d4026	Revert "Add txgs-<pool> kstat file" This reverts commit `e95853a331`.	2013-10-25 13:57:25 -07:00
Brian Behlendorf	98ab38d109	Revert "Add new kstat for monitoring time in dmu_tx_assign" This reverts commit `92334b14ec`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-25 13:57:25 -07:00
Richard Yao	b3c49d3df8	Linux 3.11 compat: Rename LZ4 symbols Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the kernel's .config. We rename the symbols to avoid the conflict. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1789	2013-10-22 10:12:39 -07:00
Tim Chase	fbcb768c8f	Add missing dsl pool configuration lock The semantics introduced by the restructured sync task of illumos 3464 require this lock when calling dmu_snapshot_list_next(). The pool is locked/unlocked for each iteration to reduce the chance of long-running locks. This was accidentally missed when doing the original port because ZoL's control directory code is Linux-specific and is in a different file than in illumos. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1785	2013-10-22 08:31:20 -07:00
George Wilson	7a61440761	Illumos #3552 3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread (fix race condition) References: https://www.illumos.org/issues/3552 illumos/illumos-gate@03f8c36688 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting notes: This fixes an upstream regression that was introduced in commit zfsonlinux/zfs@e51be06697, which ported the Illumos 3552 changes. This fix was added to upstream rather quickly, but at the time of the port, no one spotted it and the race was rare enough that it passed our regression tests. I discovered this when comparing our metaslab.c to the illumos metaslab.c. Without this change it is possible for metaslab_group_alloc() to consume a large amount of cpu time. Since this occurs under a mutex in a rcu critical section the kernel will log this to the console as a self-detected cpu stall as follows: INFO: rcu_sched self-detected stall on CPU { 0} (t=60000 jiffies g=11431890 c=11431889 q=18271) Closes #1687 Closes #1720 Closes #1731 Closes #1747	2013-10-18 14:34:01 -07:00
Ned Bass	40a806df25	Export symbols dsl_pool_config_{enter,exit} These are needed by consumers (i.e. Lustre) who wish to use the dsl_prop_register() interface to register callbacks when pool properties of interest change. This interface requires that the DSL pool configuration lock is held when called. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1762	2013-10-10 16:56:51 -07:00
Brian Behlendorf	222b948059	Fix memory leak false positive in log_internal() When building the spl with --enable-debug-kmem-tracking a memory leak is detected in log_internal(). This happens to be a false positive because the memory was freed using strfree() instead of kmem_free(). All kmem_alloc()'s must be released with kmem_free() to ensure correct accounting. SPL: kmem leaked 135/5641311 bytes address size data func:line ffff8800cba7cd80 135 ZZZZZZZZZZZZZZZZ log_internal:456 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-10-09 09:16:36 -07:00
Brian Behlendorf	36342b13d9	Export addition dsl_prop_* symbols The recent sync task restructuring in `13fe019` introduced several new symbols which should be exported for use by consumers such as Lustre. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-09-25 15:44:22 -07:00
Tim Chase	8769db3966	Allocate the ioctl "output" nvlist with KM_PUSHPAGE. Some ZFS errors such as certain snapshot failures can occur in the sync task context. Because they may require additional memory allocations, the initial nvlist must be allocated with KM_PUSHPAGE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1746 Issue #1737	2013-09-25 15:44:22 -07:00
Tim Chase	c5322236ec	Fix several new KM_SLEEP warnings A handful of allocations now occur in the sync path and need to use KM_PUSHPAGE. These were introduced by commit `13fe019`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1746 Issue #1737	2013-09-25 15:44:22 -07:00
Brian Behlendorf	cbfa294de4	Fix spa_deadman() TQ_SLEEP warning The spa_deadman() and spa_sync() functions can both be run in the spa_sync context and therefore should use TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1734 Closes #1749	2013-09-25 15:38:44 -07:00
GregorKopka	f9f3f1ef98	Removing unneeded mutex for reading vq_pending_tree size Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded: * no data is modified * only vq_pending_tree is read * in case garbage is returned (eg. vq_pending_tree being updated while the read is made) the worst case would be that a single read could be queued on a mirror side which more busy than thought The benefit of this change is streamlining of the code path since it is taken for every mirror member on every read. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1739	2013-09-25 15:29:45 -07:00
Kohsuke Kawaguchi	77831e1738	Reduce the stack usage of dsl_dataset_remove_clones_key dataset_remove_clones_key does recursion, so if the recursion goes deep it can overrun the linux kernel stack size of 8KB. I have seen this happen in the actual deployment, and subsequently confirmed it by running a test workload on a custom-built kernel that uses 32KB stack. See the following stack trace as an example of the case where it would have run over the 8KB stack kernel: Depth Size Location (42 entries) ----- ---- -------- 0) 11192 72 __kmalloc+0x2e/0x240 1) 11120 144 kmem_alloc_debug+0x20e/0x500 2) 10976 72 dbuf_hold_impl+0x4a/0xa0 3) 10904 120 dbuf_prefetch+0xd3/0x280 4) 10784 80 dmu_zfetch_dofetch.isra.5+0x10f/0x180 5) 10704 240 dmu_zfetch+0x5f7/0x10e0 6) 10464 168 dbuf_read+0x71e/0x8f0 7) 10296 104 dnode_hold_impl+0x1ee/0x620 8) 10192 16 dnode_hold+0x19/0x20 9) 10176 88 dmu_buf_hold+0x42/0x1b0 10) 10088 144 zap_lockdir+0x48/0x730 11) 9944 128 zap_cursor_retrieve+0x1c4/0x2f0 12) 9816 392 dsl_dataset_remove_clones_key.isra.14+0xab/0x190 13) 9424 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 14) 9032 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 15) 8640 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 16) 8248 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 17) 7856 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 18) 7464 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 19) 7072 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 20) 6680 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 21) 6288 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 22) 5896 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 23) 5504 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 24) 5112 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 25) 4720 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 26) 4328 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 27) 3936 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 28) 3544 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 29) 3152 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 30) 2760 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 31) 2368 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 32) 1976 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 33) 1584 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190 34) 1192 232 dsl_dataset_destroy_sync+0x311/0xf60 35) 960 72 dsl_sync_task_group_sync+0x12f/0x230 36) 888 168 dsl_pool_sync+0x48b/0x5c0 37) 720 184 spa_sync+0x417/0xb00 38) 536 184 txg_sync_thread+0x325/0x5b0 39) 352 48 thread_generic_wrapper+0x7a/0x90 40) 304 128 kthread+0xc0/0xd0 41) 176 176 ret_from_fork+0x7c/0xb0 This change reduces the stack usage in dsl_dataset_remove_clones_key by allocating structures in heap, not in stack. This is not a fundamental fix, as one can create an arbitrary large data set that runs over any fixed size stack, but this will make the problem far less likely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org> Closes #1726	2013-09-25 15:18:32 -07:00
Brian Behlendorf	34d5a5fd03	Fix zpl_mknod() return values The zpl_mknod() function was incorrectly negating its return value. This doesn't cause any problems in the success case, but it does prevent us from returning the correct error code for a failure. The implementation of this function is now consistent with all the other zpl_* functions. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1717	2013-09-13 13:31:24 -07:00
Brian Behlendorf	17897ce2c8	Fix uninitialized variables When compiling on an ARM device using gcc 4.7.3 several variables in the zfs_obj_to_path_impl() function were flagged as uninitialized. To resolve the warnings explicitly initialize them to zero. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1716	2013-09-13 13:31:24 -07:00
Tim Chase	4cf652e5d4	Fix dmu_objset_find_dp() KM_SLEEP warning After the restructuring in `13fe019` The 'zfs rename' command will result in a KM_SLEEP being called in the sync context. This may deadlock due to reclaim so it was changed to KM_PUSHPAGE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1711	2013-09-11 11:49:32 -07:00
Matthew Ahrens	13fe019870	Illumos #3464 3464 zfs synctask code needs restructuring Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3464 illumos/illumos-gate@3b2aab1880 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1495	2013-09-04 16:01:24 -07:00
Matthew Ahrens	6f1ffb0665	Illumos #2882 , #2883 , #2900 2882 implement libzfs_core 2883 changing "canmount" property to "on" should not always remount dataset 2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: https://www.illumos.org/issues/2882 https://www.illumos.org/issues/2883 https://www.illumos.org/issues/2900 illumos/illumos-gate@4445fffbbb Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1293 Porting notes: WARNING: This patch changes the user/kernel ABI. That means that the zfs/zpool utilities built from master are NOT compatible with the 0.6.2 kernel modules. Ensure you load the matching kernel modules from master after updating the utilities. Otherwise the zfs/zpool commands will be unable to interact with your pool and you will see errors similar to the following: $ zpool list failed to read pool configuration: bad address no pools available $ zfs list no datasets available Add zvol minor device creation to the new zfs_snapshot_nvl function. Remove the logging of the "release" operation in dsl_dataset_user_release_sync(). The logging caused a null dereference because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the logging functions try to get the ds name via the dsl_dataset_name() function. I've got no idea why this particular code would have worked in Illumos. This code has subsequently been completely reworked in Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring). Squash some "may be used uninitialized" warning/erorrs. Fix some printf format warnings for %lld and %llu. Apply a few spa_writeable() changes that were made to Illumos in illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and 3115 fixes. Add a missing call to fnvlist_free(nvl) in log_internal() that was added in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time (zfsonlinux/zfs@9e11c73) because it depended on future work.	2013-09-04 15:49:00 -07:00
Brian Behlendorf	6a7c0ccca4	Use directory xattrs for symlinks There is currently a subtle bug in the SA implementation which can crop up which prevents us from safely using multiple variable length SAs in one object. Fortunately, the only existing use case for this are symlinks with SA based xattrs. Therefore, until the root cause in the SA code can be identified and fixed we prevent adding SA xattrs to symlinks. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1468	2013-08-22 13:30:44 -07:00
Brian Behlendorf	c273d60d80	Revert "Evict meta data from ghost lists + l2arc headers" This reverts commit `fadd0c4da1` which introduced a regression in honoring the meta limit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Close #1660	2013-08-22 12:15:37 -07:00
Richard Yao	0f37d0c8be	Linux 3.11 compat: fops->iterate() Commit torvalds/linux@2233f31aad replaced ->readdir() with ->iterate() in struct file_operations. All filesystems must now use the new ->iterate method. To handle this the code was reworked to use the new ->iterate interface. Care was taken to keep the majority of changes confined to the ZPL layer which is already Linux specific. However, minor changes were required to the common zfs_readdir() function. Compatibility with older kernels was accomplished by adding versions of the trivial dir_emit* helper functions. Also the various _readdir() functions were reworked in to wrappers which create a dir_context structure to pass to the new _iterate() functions. Unfortunately, the new dir_emit* functions prevent us from passing a private pointer to the filldir function. The xattr directory code leveraged this ability through zfs_readdir() to generate the list of xattr names. Since we can no longer use zfs_readdir() a simplified zpl_xattr_readdir() function was added to perform the same task. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1653 Issue #1591	2013-08-15 16:19:07 -07:00
Brian Behlendorf	34e143323e	Fix z_wr_iss_h zio_execute() import hang Because we need to be more frugal about our stack usage under Linux. The __zio_execute() function was modified to re-dispatch zios to a ZIO_TASKQ_ISSUE thread when we're in a context which is known to be stack heavy. Those two contexts are the sync thread and what ever thread is performing spa initialization. Unfortunately, this change introduced an unlikely bug which can result in a zio being re-dispatched indefinitely and never being executed. If during spa initialization we handle a zio with ZIO_PRIORITY_NOW it will be moved to the high priority queue. When __zio_execute() is called again for the zio it will mis- interpret the context and re-dispatch it again. The system will get stuck spinning re-dispatching the zio and making no forward progress. To fix this rare issue __zio_execute() has been updated not to re-dispatch zios on either the ZIO_TASKQ_ISSUE or ZIO_TASKQ_ISSUE_HIGH task queues. In practice this issue was rarely reported and can usually be fixed by rebooting the system and importing the pool again. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1455	2013-08-15 15:20:36 -07:00
Matthew Ahrens	cb682a173a	Illumos #3618 ::zio dcmd does not show timestamp data 3618 ::zio dcmd does not show timestamp data Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@nexenta.com> References: http://www.illumos.org/issues/3618 illumos/illumos-gate@c55e05cb35 Notes on porting to ZFS on Linux: The original changeset mostly deals with mdb ::zio dcmd. However, in order to provide the requested functionality it modifies vdev and zio structures to keep the timing data in nanoseconds instead of ticks. It is these changes that are ported over in the commit in hand. One visible change of this commit is that the default value of 'zfs_vdev_time_shift' tunable is changed: zfs_vdev_time_shift = 6 to zfs_vdev_time_shift = 29 The original value of 6 was inherited from OpenSolaris and was subotimal - since it shifted the raw tick value - it didn't compensate for different tick frequencies on Linux and OpenSolaris. The former has HZ=1000, while the latter HZ=100. (Which itself led to other interesting performance anomalies under non-trivial load. The deadline scheduler delays the IO according to its priority - the lower priority the further the deadline is set. The delay is measured in units of "shifted ticks". Since the HZ value was 10 times higher, the delay units were 10 times shorter. Thus really low priority IO like resilver (delay is 10 units) and scrub (delay is 20 units) were scheduled much sooner than intended. The overall effect is that resilver and scrub IO consumed more bandwidth at the expense of the other IO.) Now that the bookkeeping is done is nanoseconds the shift behaves correctly for any tick frequency (HZ). Ported-by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1643	2013-08-12 16:46:50 -07:00
Richard Yao	570d6edf1d	Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are replaced by kuid_t/kgid_t, which are structures instead of integral types. This causes any code that uses an integral type to fail to build. The User Namespace functionality introduced in Linux 3.8 requires CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any kernel that supported it. We resolve this by converting between the new kuid_t/kgid_t structures and the original uid_t/gid_t types. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1589	2013-08-09 15:31:52 -07:00
Brian Behlendorf	fadd0c4da1	Evict meta data from ghost lists + l2arc headers When the meta limit is exceeded the ARC evicts some meta data buffers from the mfu+mru lists. Unfortunately, for meta data heavy workloads it's possible for these buffers to accumulate on the ghost lists if arc_c doesn't exceed arc_size. To handle this case arc_adjust_meta() has been entended to explicitly evict meta data buffers from the ghost lists in proportion to what was evicted from the mfu+mru lists. If this is insufficient we request that the VFS release some inodes and dentries. This will result in the release of some dnodes which are counted as 'other' metadata. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-08-09 10:06:12 -07:00
Brian Behlendorf	68121a03da	Allow arc_evict_ghost() to only evict meta data The default behavior of arc_evict_ghost() is to start by evicting data buffers. Then only if the requested number of bytes to evict cannot be satisfied by data buffers move on to meta data buffers. This is ideal for honoring arc_c since it's preferable to keep the meta data cached. However, if we're trying to free memory from the arc to honor the meta limit it's a problem because we will need to discard all the data to get to the meta data. To avoid this issue the arc_evict_ghost() is now passed a fourth argumented describing which buffer type to start with. The arc_evict() function already behaves exactly like this for a same reason so this is consistent with the existing code. All existing callers have been updated to pass ARC_BUFC_DATA so this patch introduces no functional change. New callers may pass ARC_BUFC_METADATA to skip immediately to evicting meta data leaving the normal data untouched. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-08-09 10:06:08 -07:00
Saso Kiselkov	3a17a7a99a	Illumos #3137 L2ARC compression 3137 L2ARC compression Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@aad02571bc https://www.illumos.org/issues/3137 http://wiki.illumos.org/display/illumos/L2ARC+Compression Notes for Linux port: A l2arc_nocompress module option was added to prevent the compression of l2arc buffers regardless of how a dataset's compression property is set. This allows the legacy behavior to be preserved. Ported by: James H <james@kagisoft.co.uk> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1379	2013-08-08 13:27:21 -07:00
Richard Yao	c11a12bc3b	Return -1 from arc_shrinker_func() This is analogous to SPL commit zfsonlinux/spl@b9b3715. While we don't have clear evidence of systems getting caught here indefinately like in the SPL this ensures that it will never happen. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1579	2013-08-08 09:20:56 -07:00
Richard Yao	8170d28126	Return correct type and offset from zfs_readdir zfs_readdir() is used by getdents(), which provides a list of all files in directory, their types and an offset that be used by llseek() to seek to the next directory entry. On Solaris, the first two directory entries "." and ".." respectively have offsets 1 and 2 on ZFS while the other files have rather large numbers. Currently, ZFSOnLinux is giving "." offset 0 and all other entries large numbers. The first entry's next entry offset points to itself, which causes software that uses llseek() in conjunction with getdents() for filesystem navigation to enter an infinite loop. The offsets used for each directory entry are filesystem specific on all platforms, so we can fix this by adopting the Solaris behavior. Also, we currently report each directory entry as having type 0 (???). This is not wrong, but we can do better. getdents() on Solaris does not appear to provide this information, but it does on Linux and Mac OS X do. ZFS provides easy access to type information in zfs_readdir(), so this patch provides this as well. Reported-by: Andrey <andrey@kudinov.su> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1624	2013-08-07 16:16:43 -07:00
George Wilson	c61f97f426	Illumos #3639 zpool.cache should skip over readonly pools 3639 zpool.cache should skip over readonly pools Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Basil Crow <basil.crow@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References: illumos/illumos-gate@fb02ae0252 https://www.illumos.org/issues/3639 Normally we don't list pools that are imported read-only in the cache file, however you can accidentally get one into the cache file by importing and exporting a read-write pool while a read-only pool is imported: $ zpool import -o readonly test1 $ zpool import test2 $ zpool export test2 $ zdb -C This is a problem because if the machine reboots we import all pools in the cache file as read-write. Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-08-07 16:13:56 -07:00
Brian Behlendorf	78d7a5d780	Write dirty inodes on close When the property atime=on is set operations which only access and inode do cause an atime update. However, it turns out that dirty inodes with updated atimes are only written to disk when the inodes get evicted from the cache. Somewhat surprisingly the source suggests that this isn't a ZoL specific issue. This behavior may in part explain why zfs's reclaim logic has been observed to be slow. When reclaiming inodes its likely that they have a dirty atime which will force a write to disk. Obviously we don't want to force a write to disk for every atime update, these needs to be batched. The right way to do this is to fully implement the .dirty_inode and .write_inode callbacks. However, to do that right requires proper unification of some fields in the znode/inode. Then we could just mark the inode dirty and leave it to the VFS to call .write_inode periodically. Until that work gets done we have to settle for some middle ground. The simplest and safest thing we can do for now is to write the dirty inode on last close. This should prevent the majority of inodes in the cache from having dirty atimes and not drastically increase the number of writes. Some rudimentally testing to show how long it takes to drop 500,000 inodes from the cache shows promising results. This is as expected because we're no longer do lots of IO as part of the eviction, it was done earlier during the close. w/out patch: ~30s to drop 500,000 inodes with drop_caches. with patch: ~3s to drop 500,000 inodes with drop_caches. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-08-07 16:11:19 -07:00
Brian Behlendorf	57b650b86f	Export additional dmu symbols The dmu_prefetch, dmu_free_long_range, dmu_free_object, dmu_prealloc, dmu_write_policy, and dmu_sync symbols have been exported so they may be used by other modules. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-08-01 09:48:07 -07:00
Nathaniel Clark	7d63721118	dmu_tx: Fix possible NULL pointer dereference dmu_tx_hold_object_impl can return NULL on error. Check for this condition prior to dereferencing pointer. This can only occur if the passed object was invalid or unallocated. Signed-off-by: Nathaniel Clark <Nathaniel.Clark@misrule.us> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1610	2013-08-01 09:48:07 -07:00
Richard Yao	cb543e6b5e	Remove b_thawed from arc_buf_hdr_t The code involving b_thawed appears to be dead, so lets discard it. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1614	2013-08-01 09:48:07 -07:00
Richard Yao	3f4058cd15	Remove arc_data_buf_alloc()/arc_data_buf_free() These functions are used in neither Illumos nor ZFSOnLinux. They appear to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove them. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1614	2013-08-01 09:48:07 -07:00
Richard Yao	4edbd2f79a	Remove zio_alloc_arena We declare zio_alloc_arena using extern, but it does not appear to exist anywhere in the code. This permits undefined behavior, so lets remove it. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1614	2013-08-01 09:48:06 -07:00
Brian Behlendorf	bce45ec9fb	Make arc+l2arc module options writable The l2arc module options can be made safely writable. This allows the options to be changed without unloading/loading the modules. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-07-30 15:40:20 -07:00
Brian Behlendorf	c93504f03a	Change l2arc_norw default to zero These days modern SSDs can efficiently service concurrent reads and writes. When this flag was added that wasn't really the case for a variety of SSD controllers. But now we can set the default value to take advantage of this parallelism and only disable this as needed for specific troublesome hardware. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-07-29 22:05:32 -07:00
Ying Zhu	6e1d7276c9	Fix inaccurate arcstat_l2_hdr_size calculations Based on the comments in arc.c we know that buffers can exist both in arc and l2arc, under this circumstance both arc_buf_hdr_t and l2arc_buf_hdr_t will be allocated. However the current logic only cares for memory that l2arc_buf_hdr takes up when the buffer's state transfers from or to arc_l2c_only. This will cause obvious deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr) is larger than ZOL's. We can implement the calcuation in the following simple way: 1. When allocate a l2arc_buf_hdr_t we add its memory consumption instantly and subtract it when we free or evict the l2arc buf. 2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if the buffer only stays in l2arc we should also add the memory its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE in step 1 and the same for transfering arc bufs from l2arc only state. The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory and tests were set upped in the following way: 1. Fdisked a SATA disk into two partitions, one partition for zpool storage and the other one was used as the cache device. 2. Generated some files occupying 14GB altogether in the zpool prepared in step 1 using iozone. 3. Read them all using md5sum and watched the l2arc related statistics in /proc/spl/kstat/zfs/arcstats. After the reading ended the l2_hdr_size and l2_size were shown like this: l2_size 4 4403780608 l2_hdr_size 4 0 which was weird. 4. After applying this patch and reran step 1-3, the results were as following: l2_size 4 4306443264 l2_hdr_size 4 535600 these numbers made sense, on 64-bit systems the sizeof(l2arc_buf_hdr_t) is 16 bytes. Assue all blocks cached by l2arc are 128KB, so 535600/161281024=4387635200, since not all blocks are equal-sized, the theoretical result will be a little bigger, as we can see. Since I'm familiar with systemtap instrumentation tool I used it to examine what had happened. The script looked like this: probe module("zfs").function("arc_chage_state") { if ($new_state == $arc_l2_only) printf("change arc buf to arc_l2_only\n") } It will print out some information each time we call funciton arc_chage_state if the argument new_state is arc_l2_only. I gathered the trace logs and found that none of the arc bufs ran into arc state arc_l2_only when the tests was running, this was the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into arc_l2_only when the pool or the filesystem was offlined. Signed-off-by: Ying Zhu <casualfisher@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-07-29 22:05:26 -07:00
Brian Behlendorf	dba1d70566	Fix arc_adapt() spinning in iterate_supers_type() The iterate_supers_type() function which was introduced in the 3.0 kernel was supposed to provide a safe way to call an arbitrary function on all super blocks of a specific type. Unfortunately, because a list_head was used a bug was introduced which made it possible for iterate_supers_type() to get stuck spinning on a super block which was just deactivated. This can occur because when the list head is removed from the fs_supers list it is reinitialized to point to itself. If the iterate_supers_type() function happened to be processing the removed list_head it will get stuck spinning on that list_head. The bug was fixed in the 3.3 kernel by converting the list_head to an hlist_node. However, to resolve the issue for existing 3.0 - 3.2 kernels we detect when a list_head is used. Then to prevent the spinning from occurring the .next pointer is set to the fs_supers list_head which ensures the iterate_supers_type() function will always terminate. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1045 Closes #861 Closes #790	2013-07-17 09:28:06 -07:00
Brian Behlendorf	c9ada6d5a0	Fix read-only pool hang on unmount During mount a filesystem dataset would have the MS_RDONLY bit incorrectly cleared even if the entire pool was read-only. There is existing to code to handle this case but it was being run before the property callbacks were registered. To resolve the issue we move this read-only code after the callback registration. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1338	2013-07-17 09:22:23 -07:00
Brian Behlendorf	76351672c2	Fix zfsctl_expire_snapshot() deadlock It is possible for an automounted snapshot which is expiring to deadlock with a manual unmount of the snapshot. This can occur because taskq_cancel_id() will block if the task is currently executing until it completes. But it will never complete because zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which zfsctl_expire_snapshot() must acquire. ---------------------- z_unmount/0:2153 --------------------- mutex_lock <blocking on zsb->z_ctldir_lock> zfsctl_unmount_snapshot zfsctl_expire_snapshot taskq_thread ------------------------- zfs:10690 ------------------------- taskq_wait_id <waiting for z_unmount to exit> taskq_cancel_id __zfsctl_unmount_snapshot zfsctl_unmount_snapshot <takes zsb->z_ctldir_lock> zfs_unmount_snap zfs_ioc_destroy_snaps_nvl zfsdev_ioctl do_vfs_ioctl We resolve the deadlock by dropping the zsb->z_ctldir_lock before calling __zfsctl_unmount_snapshot(). The lock is only there to prevent concurrent modification to the zsb->z_ctldir_snaps AVL tree. Moreover, we're careful to remove the zfs_snapentry_t from the AVL tree before dropping the lock which ensures no other tasks can find it. On failure it's added back to the tree. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Closes #1527	2013-07-12 10:06:53 -07:00
Brian Behlendorf	556011dbec	Improve N-way mirror performance The read bandwidth of an N-way mirror can by increased by 50%, and the IOPs by 10%, by more carefully selecting the preferred leaf vdev. The existing algorthm selects a perferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the drives are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. Utilization can be improved by preferentially selecting the leaf vdev with the least pending IO. This prevents leaf vdevs from being starved and compensates for performance differences between disks in the mirror. Faster vdevs will be sent more work and the mirror performance will not be limitted by the slowest drive. In the common case where all the pending queues are full and there is no single least busy leaf vdev a batching stratagy is employed. Of the N least busy vdevs one is selected with equal probability to be the preferred vdev for T microseconds. Compared to randomly selecting a vdev to break the tie batching the requests greatly improves the odds of merging the requests in the Linux elevator. The testing results show a significant performance improvement for all four workloads tested. The workloads were generated using the fio benchmark and are as follows. 1) 1MB sequential reads from 16 threads to 16 files (MB/s). 2) 4KB sequential reads from 16 threads to 16 files (MB/s). 3) 1MB random reads from 16 threads to 16 files (IOP/s). 4) 4KB random reads from 16 threads to 16 files (IOP/s). \| Pristine \| With 1461 \| \| Sequential Random \| Sequential Random \| \| 1MB 4KB 1MB 4KB \| 1MB 4KB 1MB 4KB \| \| MB/s MB/s IO/s IO/s \| MB/s MB/s IO/s IO/s \| ---------------+-----------------------+------------------------+ 2 Striped \| 226 243 11 304 \| 222 255 11 299 \| 2 2-Way Mirror \| 302 324 16 534 \| 433 448 23 571 \| 2 3-Way Mirror \| 429 458 24 714 \| 648 648 41 808 \| 2 4-Way Mirror \| 562 601 36 849 \| 816 828 82 926 \| Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1461	2013-07-11 13:53:50 -07:00
Prakash Surya	92334b14ec	Add new kstat for monitoring time in dmu_tx_assign This change adds a new kstat to gain some visibility into the amount of time spent in each call to dmu_tx_assign. A histogram is exported via a new dmu_tx_assign_histogram-$POOLNAME file. The information contained in this histogram is the frequency dmu_tx_assign took to complete given an interval range. For example, given the below histogram file: $ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank 12 1 0x01 32 1536 19792068076691 20516481514522 name type data 1 us 4 859 2 us 4 252 4 us 4 171 8 us 4 2 16 us 4 0 32 us 4 2 64 us 4 0 128 us 4 0 256 us 4 0 512 us 4 0 1024 us 4 0 2048 us 4 0 4096 us 4 0 8192 us 4 0 16384 us 4 0 32768 us 4 1 65536 us 4 1 131072 us 4 1 262144 us 4 4 524288 us 4 0 1048576 us 4 0 2097152 us 4 0 4194304 us 4 0 8388608 us 4 0 16777216 us 4 0 33554432 us 4 0 67108864 us 4 0 134217728 us 4 0 268435456 us 4 0 536870912 us 4 0 1073741824 us 4 0 2147483648 us 4 0 one can see most calls to dmu_tx_assign completed in 32us or less, but a few outliers did not. Specifically, 4 of the calls took between 262144us and 131072us. This information is difficult, if not impossible, to gather without this change. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1584	2013-07-11 13:53:44 -07:00
Brian Behlendorf	bf89c19914	Log pool suspension warnings to the console In the event that a pool gets suspended log this information to the console. This is critical information and we want to make sure it gets logged. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1555	2013-07-10 15:15:52 -07:00
Brian Behlendorf	abc41ac7c7	Use GFP_NOIO in vdev_disk_io_flush() To avoid a potential deadlock when using a zvol as a swap device prevent vdev_disk_io_flush() from performing IO during the bio_alloc(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1508	2013-07-10 14:12:21 -07:00
Ying Zhu	b4f7f10527	Improve code in arc_buf_remove_ref When we remove references of arc bufs in the arc_anon state we needn't take its header's hash_lock, so postpone it to where we really need it to avoid unnecessary invocations of function buf_hash. Signed-off-by: Ying Zhu <casualfisher@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1557	2013-07-09 11:53:28 -07:00
Shen Yan	8e07b99b2f	Update zio.c The cv_wait_io is used to account io time instead of cv_wait. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1566	2013-07-09 10:41:46 -07:00
Brian Behlendorf	31455ab130	Add zfs_autoimport_disable tunable There are times when it is desirable for zfs to not automatically populate the spa namespace at module load time using the pools in the /etc/zfs/zpool.cache file. The zfs_autoimport_disable module option has been added to control this behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #330	2013-07-09 10:11:19 -07:00
Chris Dunlop	a1d9543a39	3.10 API change: block_device_operations->release() returns void Linux kernel commit torvalds/linux@db2a144 changed the return type of block_device_operations->release() to void. Detect the expected prototype and defined our callout accordingly. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1494	2013-07-08 15:41:57 -07:00
Brian Behlendorf	91604b298c	Open pools asynchronously after module load One of the side effects of calling zvol_create_minors() in zvol_init() is that all pools listed in the cache file will be opened. Depending on the state and contents of your pool this operation can take a considerable length of time. Doing this at load time is undesirable because the kernel is holding a global module lock. This prevents other modules from loading and can serialize an otherwise parallel boot process. Doing this after module inititialization also reduces the chances of accidentally introducing a race during module init. To ensure that /dev/zvol/<pool>/<dataset> devices are still automatically created after the module load completes a udev rules has been added. When udev notices that the /dev/zfs device has been create the 'zpool list' command will be run. This then will cause all the pools listed in the zpool.cache file to be opened. Because this process in now driven asynchronously by udev there is the risk of problems in downstream distributions. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #756 Issue #1020 Issue #1234	2013-07-03 09:24:38 -07:00
Richard Yao	2a3871d4bc	Cleanup zvol initialization code The following error will occur on some (possibly all) kernels because blk_init_queue() will try to take the spinlock before we initialize it. BUG: spinlock bad magic on CPU#0, zpool/4054 lock: 0xffff88021a73de60, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 Pid: 4054, comm: zpool Not tainted 3.9.3 #11 Call Trace: [<ffffffff81478ef8>] spin_dump+0x8c/0x91 [<ffffffff81478f1e>] spin_bug+0x21/0x26 [<ffffffff812da097>] do_raw_spin_lock+0x127/0x130 [<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30 [<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350 [<ffffffff812aacb8>] elevator_init+0x78/0x140 [<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0 [<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70 [<ffffffff812b271e>] blk_init_queue+0xe/0x10 [<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620 [<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30 [<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510 [<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510 [<ffffffff81253ea2>] zvol_create_minors+0x92/0x180 [<ffffffff811f8d80>] spa_open_common+0x250/0x380 [<ffffffff811f8ece>] spa_open+0xe/0x10 [<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80 [<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190 [<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0 [<ffffffff8116a950>] sys_ioctl+0x40/0x80 [<ffffffff814812c9>] ? do_page_fault+0x9/0x10 [<ffffffff81483929>] system_call_fastpath+0x16/0x1b zd0: unknown partition table We fix this by calling spin_lock_init before blk_init_queue. The manner in which zvol_init() initializes structures is suspectible to a race between initialization and a probe on a zvol. We reorganize zvol_init() to prevent that. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-07-03 09:23:35 -07:00
Pawel Jakub Dawidek	526af78550	Call zvol_create_minors() in spa_open_common() when initializing pool There is an extremely odd bug that causes zvols to fail to appear on some systems, but not others. Recently, I was able to consistently reproduce this issue over a period of 1 month. The issue disappeared after I applied this change from FreeBSD. This is from FreeBSD's pool version 28 import, which occurred in revision 219089. Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #441 Issue #599	2013-07-03 09:22:44 -07:00
George Wilson	294f68063b	Illumos #3498 panic in arc_read() 3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt) Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@1b912ec710 https://www.illumos.org/issues/3498 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1249	2013-07-02 13:34:31 -07:00
Matthew Ahrens	96b89346c0	Illumos #3122 zfs destroy filesystem should prefetch blocks 3122 zfs destroy filesystem should prefetch blocks Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: illumos/illumos-gate@b4709335aa https://www.illumos.org/issues/3122 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1565	2013-07-02 13:34:02 -07:00
Cyril Plisko	29dee3ee9a	Add zfs_sync_pass_* tunable parameters Commit `55d85d5a8c` (backport of the upstream changes) replaced three hardcoded constants: #define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass / #define SYNC_PASS_DONT_COMPRESS 4 / don't compress after this pass / #define SYNC_PASS_REWRITE 1 / rewrite new bps after this pass / with a tunable parameters: int zfs_sync_pass_deferred_free = 2; / defer frees starting in this pass / int zfs_sync_pass_dont_compress = 5; / don't compress starting in this pass / int zfs_sync_pass_rewrite = 2; / rewrite new bps starting in this pass */ This commit makes these tunables available as module parameters in Linux. They should only be used for performance analysis because changing them can result in subtle and pathological performance problems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1562	2013-07-02 09:34:18 -07:00
Li Dongyang	802e7b5feb	Add SEEK_DATA/SEEK_HOLE to lseek()/llseek() The approach taken was the rework zfs_holey() as little as possible and then just wrap the code as needed to ensure correct locking and error handling. Tested with xfstests 285 and 286. All tests pass except for 7-9 of 285 which try to reserve blocks first via fallocate(2) and fail because fallocate(2) is not yet supported. Note that the filp->f_lock spinlock did not exist prior to Linux 2.6.30, but we avoid the need for autotools check by virtue of the fact that SEEK_DATA/SEEK_HOLE support was not added until Linux 3.1. An autoconf check was added for lseek_execute() which is currently a private function but the expectation is that it will be exported perhaps as early as Linux 3.11. Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1384	2013-07-02 09:24:43 -07:00
Matthew Ahrens	cf91b2b6b2	Readd zfs_holey() from OpenSolaris This patch restores the zfs_holey() function from OpenSolaris. This was removed by commit `3558fd7` because it wasn't clear we had a use for it in ZoL. However, this functionality is a prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #1384	2013-07-02 09:24:18 -07:00
shenyan1	0a6bef26ec	kmem_zalloc(..., KM_SLEEP) will never fail By definitition these allocations will never fail. For consistency with the rest of the code remove this dead error handling code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1558	2013-07-01 14:51:48 -07:00
Tim Chase	ab68b6e5db	Fix zfs_sb_teardown/zfs_resume_fs NULL dereference Fix a pair of conditions in which a concurrent umount can cause NULL pointer dereferences: * zfs_sb_teardown - prevent a NULL dereference by not calling dmu_objset_pool with a null z_os. * zfs_resume_fs - don't try to unmount with a null z_os. This change makes the ZoL code more consistent with both Illumos and FreeBSD. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1543	2013-07-01 14:51:45 -07:00
Ying Zhu	c12936b141	Fix module probe failure on 32-bit systems Previous commit `7ef5e54e2e` caused module probe failure on 32-bit systems, dmesg showed Unknown symbol __moddi3 This was caused by the modulo operation 'gethrtime() % tqs->stqs_count' in the committed code. Instead of implementing __moddi3 for all 32-bit systems, Behlendorf advised we can just cast the return value of gethrtime() into a uint64_t, since gethrtime does not return negative value on all circumstances we need not care about the potential overflow. Signed-off-by: Ying Zhu <casualfisher@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1551	2013-06-27 10:01:25 -07:00
Brian Behlendorf	88c283952f	Return -EOPNOTSUPP for ZFS_IOC_{GET\|SET}FLAGS Until these hooks are fully implemented return the expected -EOPNOTSUPP error to indicate they are not functional. This allows test suites such as xfstests to cleanly skip testing this functionality until it's implemented. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #229	2013-06-26 15:20:13 -07:00
Brian Behlendorf	81eaf15107	Register correct handlers in nvlist_alloc() The non-blocking allocation handlers in nvlist_alloc() would be mistakenly assigned if any flags other than KM_SLEEP were passed. This meant that nvlists allocated with KM_PUSHPUSH or other KM_* debug flags were effectively always using atomic allocations. While these failures were unlikely it could lead to assertions because KM_PUSHPAGE allocations in particular are guaranteed to succeed or block. They must never fail. Since the existing API does not allow us to pass allocation flags to the private allocators the cleanest thing to do is to add a KM_PUSHPAGE allocator. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#249	2013-06-20 09:58:15 -07:00
Matthew Ahrens	df4474f92d	Illumos #3805 arc shouldn't cache freed blocks 3805 arc shouldn't cache freed blocks Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Reviewed by: Will Andrews <will@firepipe.net> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@6e6d5868f5 https://www.illumos.org/issues/3805 ZFS should proactively evict freed blocks from the cache. On dcenter, we saw that we were caching ~256GB of metadata, while the pool only had <4GB of metadata on disk. We were wasting about half the system's RAM (252GB) on blocks that have been freed. Even though these freed blocks will never be used again, and thus will eventually be evicted, this causes us to use memory inefficiently for 2 reasons: 1. A block that is freed has no chance of being accessed again, but will be kept in memory preferentially to a block that was accessed before it (and is thus older) but has not been freed and thus has at least some chance of being accessed again. 2. We partition the ARC into several buckets: user data that has been accessed only once (MRU) metadata that has been accessed only once (MRU) user data that has been accessed more than once (MFU) metadata that has been accessed more than once (MFU) The user data vs metadata split is somewhat arbitrary, and the primary control on how much memory is used to cache data vs metadata is to simply try to keep the proportion the same as it has been in the past (each bucket "evicts against" itself). The secondary control is to evict data before evicting metadata. Because of this bucketing, we may end up with one bucket mostly containing freed blocks that are very old, while another bucket has more recently accessed, still-allocated blocks. Data in the useful bucket (with still-allocated blocks) may be evicted in preference to data in the useless bucket (with old, freed blocks). On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU data bucket was 27GB and the MRU metadata bucket was 256GB. However, the vast majority of data in the MRU metadata bucket (256GB) was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB) was constantly evicting useful blocks that will be soon needed. The problem of cache segmentation is a larger problem that needs more investigation. However, if we stop caching freed blocks, it should reduce the impact of this more fundamental issue. Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1503	2013-06-20 09:55:52 -07:00
George Wilson	e51be06697	Illumos #3552 , #3564 3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread 3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing) Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@16a4a80742 https://www.illumos.org/issues/3552 https://www.illumos.org/issues/3564 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1513	2013-06-19 16:22:39 -07:00
Madhav Suresh	c99c90015e	Illumos #3006 3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument is zero Reviewed by Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by George Wilson <george.wilson@delphix.com> Approved by Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@fb09f5aad4 https://illumos.org/issues/3006 Requires: zfsonlinux/spl@1c6d149feb Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1509	2013-06-19 15:14:10 -07:00
Brian Behlendorf	0377189b88	Only check directory xattr on ENOENT When SA xattrs are enabled only fallback to checking the directory xattrs when the name is not found as a SA xattr. Otherwise, the SA error which should be returned to the caller is overwritten by the directory xattr errors. Positive return values indicating success will also be immediately returned. In the case of #1437 the ERANGE error was being correctly returned by zpl_xattr_get_sa() only to be overridden with ENOENT which was returned by the subsequent unnessisary call to zpl_xattr_get_dir(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1437	2013-05-10 12:24:56 -07:00
Cyril Plisko	4f34b3bdf4	zfs_scrub_limit tunable is not used anywhere As a part of scrub/resilver tuning zfs_scrub_limit fell out of use, but the definition of the variable remained in place. Moreover various guides still (misleadingly) mention it as a way to influence resilver/scrub behavior. This commit removes its finally. Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1444	2013-05-06 14:14:06 -07:00
Ying Zhu	ee664d4631	Fix incorrect assertions in ddt_phys_decref and ddt_sync_entry The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the reference count of zero blocks commonly available in spare files, we may mistakenly hit these assertations, so drop the type conversions here. Signed-off-by: Ying Zhu <casualfisher@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1436	2013-05-06 14:10:55 -07:00
Brian Behlendorf	044baf009a	Use taskq for dump_bytes() The vn_rdwr() function performs I/O by calling the vfs_write() or vfs_read() functions. These functions reside just below the system call layer and the expectation is they have almost the entire 8k of stack space to work with. In fact, certain layered configurations such as ext+lvm+md+multipath require the majority of this stack to avoid stack overflows. To avoid this posibility the vn_rdwr() call in dump_bytes() has been moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be performed with the majority of the stack space available. This ends up being very similiar to as if the I/O were issued via sys_write() or sys_read(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1399 Closes #1423	2013-05-06 14:05:42 -07:00
Adam Leventhal	7ef5e54e2e	Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contention 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@ec94d32 https://illumos.org/issues/3581 Notes for Linux port: Earlier commit `08d08eb` reduced contention on this taskq lock by simply reducing the number of z_fr_iss threads from 100 to one-per-CPU. We also optimized the taskq implementation in zfsonlinux/spl@3c6ed54. These changes significantly improved unlink performance to acceptable levels. This patch further reduces time spent spinning on this lock by randomly dispatching the work items over multiple independent task queues. The Illumos ZFS developers stated that this lock contention only arose after "3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()" was landed. It's not clear if 3329 affects the Linux port or not. I didn't see spa_free_sync_cb() show up in oprofile sessions while unlinking large files, but I may just not have used the right test case. I tested unlinking a 1 TB of data with and without the patch and didn't observe a meaningful difference in elapsed time. However, oprofile showed that the percent time spent in taskq_thread() was reduced from about 16% to about 5%. Aside from a possible slight performance benefit this may be worth landing if only for the sake of maintaining consistency with upstream. Ported-by: Ned Bass <bass6@llnl.gov> Closes #1327	2013-05-06 14:05:37 -07:00
George Wilson	55d85d5a8c	Illumos #3329 , #3330 , #3331 , #3335 3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb() 3330 space_seg_t should have its own kmem_cache 3331 deferred frees should happen after sync_pass 1 3335 make SYNC_PASS_* constants tunable Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@01f55e48fb https://www.illumos.org/issues/3329 https://www.illumos.org/issues/3330 https://www.illumos.org/issues/3331 https://www.illumos.org/issues/3335 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-05-06 12:39:34 -07:00
George Wilson	5853fe790d	Illumos #3306 , #3321 3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: illumos/illumos-gate@31d7e8fa33 https://www.illumos.org/issues/3306 https://www.illumos.org/issues/3321 The vdev_file.c implementation in this patch diverges significantly from the upstream version. For consistenty with the vdev_disk.c code the upstream version leverages the Illumos bio interfaces. This makes sense for Illumos but not for ZoL for two reasons. 1) The vdev_disk.c code in ZoL has been rewritten to use the Linux block device interfaces which differ significantly from those in Illumos. Therefore, updating the vdev_file.c to use the Illumos interfaces doesn't get you consistency with vdev_disk.c. 2) Using the upstream patch as is would requiring implementing compatibility code for those Solaris block device interfaces in user and kernel space. That additional complexity could lead to confusion and doesn't buy us anything. For these reasons I've opted to simply move the existing vn_rdwr() as is in to the taskq function. This has the advantage of being low risk and easy to understand. Moving the vn_rdwr() function in to its own taskq thread also neatly avoids the possibility of a stack overflow. Finally, because of the additional work which is being handled by the free taskq the number of threads has been increased. The thread count under Illumos defaults to 100 but was decreased to 2 in commit 08d08e due to contention. We increase it to 8 until the contention can be address by porting Illumos #3581. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1354	2013-05-03 16:53:52 -07:00
George.Wilson	cc92e9d0c3	3246 ZFS I/O deadman thread Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> NOTES: This patch has been reworked from the original in the following ways to accomidate Linux ZFS implementation ) Usage of the cyclic interface was replaced by the delayed taskq interface. This avoids the need to implement new compatibility code and allows us to rely on the existing taskq implementation. ) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h because declaring externs in source files as was done in the original patch is just plain wrong. ) Instead of panicing the system when the deadman triggers a zevent describing the blocked vdev and the first pending I/O is posted. If the panic behavior is desired Linux provides other generic methods to panic the system when threads are observed to hang. ) For reference, to delay zios by 30 seconds for testing you can use zinject as follows: 'zinject -d <vdev> -D30 <pool>' References: illumos/illumos-gate@283b84606b https://www.illumos.org/issues/3246 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1396	2013-05-01 17:05:52 -07:00
Brian Behlendorf	57f5a2008e	Fix txg_quiesce thread deadlock A deadlock was accidentally introduced by commit `e95853a` which can occur when the system is under memory pressure. What happens is that while the txg_quiesce thread is holding the tx->tx_cpu locks it enters memory reclaim. In the context of this memory reclaim it then issues synchronous I/O to a ZVOL swap device. Because the txg_quiesce thread is holding the tx->tx_cpu locks a new txg cannot be opened to handle the I/O. Deadlock. The fix is straight forward. Move the memory allocation outside the critical region where the tx->tx_cpu locks are held. And for good measure change the offending allocation to KM_PUSHPAGE to ensure it never attempts to issue I/O during reclaim. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1274	2013-04-26 14:42:36 -07:00
Brian Behlendorf	f706421173	Correctly return ERANGE in getxattr(2) According to the getxattr(2) man page the ERANGE errno should be returned when the size of the value buffer is to small to hold the result. Prior to this patch the implementation would just truncate the value to size bytes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1408	2013-04-24 12:35:04 -07:00
Chris Dunlop	254255f735	Trivial spelling fix Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1411	2013-04-19 15:43:16 -07:00
Caleb James DeLisle	8f1e11b610	Remove .readdir from zpl_file_operations table The zpl_readdir() function shouldn't be registered as part of the zpl_file_operations table, it must only be part of the zpl_dir_file_operations table. By removing this callback the VFS will now correctly return ENOTDIR when calling getdents() on a file. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1404	2013-04-19 15:36:47 -07:00
Martin Matuska	b28e57cb82	Allow setting a lower ashift with -o ashift Previous patches have allowed you to set an increased ashift to avoid doing 512b IO with 4k sector devices. However, it was not possible to set the ashift lower than the reported physical sector size even when a smaller logical size was supported. In practice, there are several cases where settong a lower ashift is useful: * Most modern drives now correctly report their physical sector size as 4k. This causes zfs to correctly default to using a 4k sector size (ashift=12). However, for some usage models this new default ashift value causes an unacceptable increase in space usage. Filesystems with many small files may see the total available space reduced to 30-40% which is unacceptable. * When replacing a drive in an existing pool which was created with ashift=9 a modern 4k sector drive cannot be used. The 'zpool replace' command will issue an error that the new drive has an 'incompatible sector alignment'. However, by allowing the ashift to be manual specified as smaller, non-optimal, value the device may still be safely used. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1381 Closes #1328 Issue #967 Issue #548	2013-04-12 10:50:46 -07:00
George Wilson	295304bed6	Illumos #3422 , #3425 3422 zpool create/syseventd race yield non-importable pool 3425 first write to a new zvol can fail with EFBIG Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: illumos/illumos-gate@bda8819455 https://www.illumos.org/issues/3422 https://www.illumos.org/issues/3425 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1390	2013-04-12 09:01:36 -07:00
Jan Engelhardt	ea0fcfc875	gitignore: anchor entries at their respective directory .ko is specific to module, .m4 to config, etc. Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-04-02 10:50:17 -07:00
Jan Engelhardt	4e95cc99b0	build: resolve orthographic and other grammatical errors Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-04-02 10:44:52 -07:00
Brian Behlendorf	5dc6af0eec	Add zio_ddt_free()+ddt_phys_decref() error handling The assumption in zio_ddt_free() is that ddt_phys_select() must always find a match. However, if that fails due to a damaged DDT or some other reason the code will NULL dereference in ddt_phys_decref(). While this should never happen it has been observed on various platforms. The result is that unless your willing to patch the ZFS code the pool is inaccessible. Therefore, we're choosing to more gracefully handle this case rather than leave it fatal. http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1308	2013-03-19 13:01:01 -07:00
Brian Behlendorf	30b92c1de6	Add metaslab_debug option Enabling metaslab debugging will prevent space maps from being automatically unloaded. This can significantly increase the memory footprint but being able to dynamically control this is helpful for debugging and certain performance testing. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-18 16:47:43 -07:00
Richard Yao	1c24b699b0	Linux 3.9 compat: Undefine GCC_VERSION The mainline kernel started defining GCC_VERSION with commit torvalds/linux@3f3f8d2f48. Unfortunately, LZ4 also defines this macro, but the two defintions are incompatible. We undefine GCC_VERSION in lz4.c to handle this. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1339	2013-03-06 15:48:48 -08:00
Ned Bass	92db59ca3b	Refresh links to web site A few files still refer to @behlendorf's private fork on github. Use the primary web site URL instead. Two typos are also corrected. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-06 15:46:41 -08:00
Brian Behlendorf	d09f98a9a6	Add KMODDIR to install target Provide a mechanism to control the directory name the modules are installed in. The kernel privdes INSTALL_MOD_DIR for this but it was hardcoded to be 'addon/zfs'. Add a KMODDIR variable which can be passed to 'make install' to override the default directory name. While we're here change the default from 'addon/zfs' to 'extra' which is the kernel.org default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-06 15:46:40 -08:00
Eric Dillmann	0b4d1b5853	Add snapdev=[hidden\|visible] dataset property The new snapdev dataset property may be set to control the visibility of zvol snapshot devices. By default this value is set to 'hidden' which will prevent zvol snapshots from appearing under /dev/zvol/ and /dev/<dataset>/. When set to 'visible' all zvol snapshots for the dataset will be visible. This functionality was largely added because when automatic snapshoting is enabled large numbers of read-only zvol snapshots will be created. When creating these devices the kernel will attempt to read their partition tables, and blkid will attempt to identify any filesystems on those partitions. This leads to a variety of issues: 1) The zvol partition tables will be read in the context of the `modprobe zfs` for automatically imported pools. This is undesirable and should be done asynchronously, but for now reducing the number of visible devices helps. 2) Udev expects to be able to complete its work for a new block devices fairly quickly. When many zvol devices are added at the same time this is no longer be true. It can lead to udev timeouts and missing /dev/zvol links. 3) Simply having lots of devices in /dev/ can be aukward from a management standpoint. Hidding the devices your unlikely to ever use helps with this. Any snapshot device which is needed can be made visible by changing the snapdev property. NOTE: This patch changes the default behavior for zvols which was effectively 'snapdev=visible'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1235 Closes #945 Issue #956 Issue #756	2013-03-05 12:37:54 -08:00
George Wilson	a4430fce69	Merge zvol.c changes from PSARC 2010/306 Read-only ZFS pools The changes to zvol.c were never merged from the last onnv_147 bulk update. This was because zvol.c was largely rewritten for Linux making it fairly easy to miss these sorts of changes. This causes a regression when importing a zpool with zvols read-only. This does not impact pool which only contain filesystem datasets. References: illumos/illumos-gate@f9af39b Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1332 Closes #1333	2013-03-04 09:56:13 -08:00
Richard Yao	b01615d5ac	Constify structures containing function pointers The PaX team modified the kernel's modpost to report writeable function pointers as section mismatches because they are potential exploit targets. We could ignore the warnings, but their presence can obscure actual issues. Proper const correctness can also catch programming mistakes. Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2 kernel reports 133 section mismatches prior to this patch. This patch eliminates 130 of them. The quantity of writeable function pointers eliminated by constifying each structure is as follows: vdev_opts_t 52 zil_replay_func_t 24 zio_compress_info_t 24 zio_checksum_info_t 9 space_map_ops_t 7 arc_byteswap_func_t 5 The remaining 3 writeable function pointers cannot be addressed by this patch. 2 of them are in zpl_fs_type. The kernel's sget function requires that this be non-const. The final writeable function pointer is created by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and remove_shrinker() functions also require that this be non-const. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1300	2013-03-04 08:49:32 -08:00
Brian Behlendorf	8128bd89fb	Fix hot spares The issue with hot spares in ZoL is because it opens all leaf vdevs exclusively (O_EXCL). On Linux, exclusive opens cause subsequent exclusive opens to fail with EBUSY. This could be resolved by not opening any of the devices exclusively, which is what Illumos does, but the additional protection offered by exclusive opens is desirable. It cleanly prevents you from accidentally adding an in-use non-ZFS device to your pool. To fix this we very slightly relaxed the usage of O_EXCL in the following ways. 1) Functions which open the device but only read had the O_EXCL flag removed and were updated to use O_RDONLY. 2) A common holder was added to the vdev disk code. This allow the ZFS code to internally open the device multiple times but non-ZFS callers may not. 3) An exception was added to make_disks() for hot spare when creating partition tables. For hot spare devices which are already opened exclusively we skip creating the partition table because this must already have been done when the disk was originally added as a hot spare. Additional minor changes include fixing check_in_use() to use a partition instead of a slice suffix. And is_spare() was moved above make_disks() to avoid adding a forward reference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #250	2013-03-01 13:31:02 -08:00
Brian Behlendorf	bd99a7584a	Remove wholedisk check from vdev_disk_open() As described by the comment and enforced the by assertion the v->vdev_wholedisk will never be -1. The wholedisk handling is performed by the user space utilities. To prevent confusion this dead code is being removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-02-28 12:02:59 -08:00
Brian Behlendorf	0d8103d956	Leaf vdevs should not be reopened When vdev_disk.c was implemented for Linux we failed to handle the reopen case. According to the vdev_reopen() comment leaf vdevs should not be closed or opened when v->vdev_reopening is set. Under Linux we would always close and open the device. This issue was only noticed when a 'zpool scrub' command was run while the leaf vdev device names in /dev/disk/by-vdev were missing. The scrub command calls vdev_reopen() which caused the vdevs to be closed but they couldn't be reopened due to the missing links. The result was that all the vdevs were marked unavailable and the pool was halted due to failmode=wait. This patch adds the missing functionality in a similiar fashion to to the Illumos code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-02-28 12:02:59 -08:00
Etienne Dechamps	d9b0ebbe82	Remove the bio_empty_barrier() check. To determine whether the kernel is capable of handling empty barrier BIOs, we check for the presence of the bio_empty_barrier() macro, which was introduced in 2.6.24. If this macro is defined, then we can flush disk vdevs; if it isn't, then flushing is disabled. Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37, even though the kernel is still capable of handling empty barrier BIOs. As a result, flushing is effectively disabled on kernels >= 2.6.37, meaning that starting from this kernel version, zfs doesn't use barriers to guarantee on-disk data consistency. This is quite bad and can lead to potential data corruption on power failures. This patch fixes the issue by removing the configure check for bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore. Thanks to Richard Kojedzinszky for catching this nasty bug. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1318	2013-02-24 10:22:34 -08:00
Brian Behlendorf	546c978bbd	Enable zfs_arc_memory_throttle_disable by default The zfs_arc_memory_throttle_disable module option was introduced by commit `0c5493d470` to resolve a memory miscalculation which could result in the txg_sync thread spinning. When this was first introduced the default behavior was left unchanged until enough real world usage confirmed there were no unexpected issues. We've now reached that point. Linux's direct reclaim is working as expected so we're enabling this behavior by default. This helps pave the way to retire the spl_kmem_availrmem() functionality in the SPL layer. This was the only caller. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #938	2013-02-21 13:38:24 -08:00
Richard Yao	8dca0a9a38	Make spa.c assertions catch unsupported pre-feature flag pool versions A couple of assertions in spa.c were designed to prevent the use of invalid pool versions. They were written under the assumption that all valid pools are less than SPA_VERSION. Since feature flags jumped from 28 to 5000, any numbers in the range 28 to 5000 non-inclusive will fail to trigger them. We switch to the new SPA_VERSION_IS_SUPPORTED macro to correct this. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1282	2013-02-12 10:27:44 -08:00
Brian Behlendorf	9878a89d7a	Add explicit MAXNAMELEN check It turns out that the Linux VFS doesn't strictly handle all cases where a component path name exceeds MAXNAMELEN. It does however appear to correctly handle MAXPATHLEN for us. The right way to handle this appears to be to add an explicit check to the zpl_lookup() function. Several in-tree filesystems handle this case the same way. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1279	2013-02-12 10:27:39 -08:00
Ned Bass	ed2e157605	Switch KM_SLEEP to KM_PUSHPAGE Two more locations where KM_SLEEP was used in a call which must use KM_PUSHPAGE were found while using the zpool upgrade command. See commit `b8d06fc` for additional details. Also make a small correction to the comment block above dsl_dir_open_spa(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1268	2013-02-06 11:19:58 -08:00
Brian Behlendorf	dd26aa535b	Cast 'zfs bad bloc' to ULL for x86 Explicitly case this value to an unsigned long long for 32-bit systems to inform the compiler that a long type should not be used. Otherwise we get the following compiler error: dmu_send.c:376: error: integer constant is too large for ‘long’ type Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-02-04 16:39:08 -08:00
Brian Behlendorf	0c5493d470	Add zfs_arc_memory_throttle_disable module option The way in which virtual box ab(uses) memory can throw off the free memory calculation in arc_memory_throttle(). The result is the txg_sync thread will effectively spin waiting for memory to be released even though there's lots of memory on the system. To handle this case I'm adding a zfs_arc_memory_throttle_disable module option largely for virtual box users. Setting this option disables free memory checks which allows the txg_sync thread to make progress. By default this option is disabled to preserve the current behavior. However, because Linux supports direct memory reclaim it's doubtful throttling due to perceived memory pressure is ever a good idea. We should enable this option by default once we've done enough real world testing to convince ourselve there aren't any unexpected side effects. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #938	2013-02-01 11:17:14 -08:00
Brian Behlendorf	1f7c30df8f	Add zfs_disable_dup_eviction module option Commit `1eb5bfa` introduced a new zfs_disable_dup_eviction tunable. It should have been made available as a module option in the original patch but was overlooked. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-02-01 09:57:57 -08:00
Ned Bass	36f86f73f6	Fix mismatch between SA header size and layout When a system attribute layout is created an inconsistency may occur between the system attribute header (sa_hdr_phys_t) size and the variable-sized attribute count stored in the layout. The inconsistency results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT returns false: SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab()) ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr, tb)) \|\| !IS_SA_BONUSTYPE(bonustype) \|\| (IS_SA_BONUSTYPE(bonustype) && hdr->sa_layout_info == 0)) failed The bug originates in this snippet from sa_find_sizes(). if (is_var_sz && var_size > 1) { if (P2ROUNDUP(hdrsize + sizeof (uint16_t), *total < full_space) { hdrsize += sizeof (uint16_t); This assumes that the current variable-sized attribute will be stored in the current buffer and accounts for the space needed to store its size in the sa_hdr_phys_t. However if the next attribute spills over we need to store a blkptr_t at the end of the bonus buffer to point to the spill block. If the current attribute is in the way of the blkptr_t then it too will be relocated into the spill block. But since we've already accounted for it in the header size we get the inconsistency described above. To avoid this, record the index of the last variable-sized attribute that prompted a hdrsize increase, and reverse the increase if we later determine that that attribute will be relocated to the spill block. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1250	2013-01-31 10:31:19 -08:00
Ned Bass	67629d0f08	Fix rounding discrepancy in sa_find_sizes() A rounding discrepancy exists between how sa_build_layouts() and sa_find_sizes() calculate when the spill block needs to be kicked in. This results in a narrow size range where sa_build_layouts() believes there must be a spill block allocated but due to the discrepancy there isn't. A panic then occurs when the hdl->sa_spill NULL pointer is dereferenced. The following reproducer for this bug was isolated: truncate -s 128m /tmp/tank zpool create tank /tmp/tank zfs create -o xattr=sa tank/fish ln -s `perl -e 'print "z" x 41'` /tank/fish/z setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z This test results in roughly the following system attribute (SA) layout: 176 bytes - "standard" SA's 41 bytes - name of symbolic link target 100 bytes - XDR encoded nvlist for xattr --- 317 bytes - total Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes() decides no spill block is needed. But sa_build_layouts() rounds 41 up to 48 when computing the space requirements so it tries to switch to the spill block. Note that we were only able to reproduce this bug using a combination of symbolic links and the Linux-specific xattr=sa dataset property. So while this issue is not technically Linux-specific, it may be difficult or impossible to hit the narrow size range needed to reproduce it on other platforms. To fix the discrepancy, round the running total in sa_find_sizes() up to an 8-byte boundary before accounting for each SA, since this is how they will be stored in the bonus and (possibly) spill buffers. To make the intent of the code more clear, explicitly assert key assumptions about expected alignment of data and whether spill-over will occur. Signed-off-by: Matthew Ahrens <mahrens@delphix.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1240	2013-01-31 10:31:13 -08:00
Adam H. Leventhal	89103a2643	Illumos #3447 improve the comment in txg.c 3447 improve the comment in txg.c Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Richard Elling <richard.elling@dey-sys.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@adbbcfface https://www.illumos.org/issues/3447 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-30 08:55:20 -08:00
Eric Dillmann	9759c60f1a	Illumos #3035 LZ4 compression support in ZFS and GRUB 3035 LZ4 compression support in ZFS and GRUB Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Christopher Siden <csiden@delphix.com> References: illumos/illumos-gate@a6f561b4ae https://www.illumos.org/issues/3035 http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS This patch has been slightly modified from the upstream Illumos version to be compatible with Linux. Due to the very limited stack space in the kernel a lz4 workspace kmem cache is used. Since we are using gcc we are also able to take advantage of the gcc optimized __builtin_ctz functions. Support for GRUB has been dropped from this patch. That code is available but those changes will need to made to the upstream GRUB package. Lastly, several hunks of dead code were dropped for clarity. They include the functions real_LZ4_uncompress(), LZ4_compressBound() and the Visual Studio specific hunks wrapped in _MSC_VER. Ported-by: Eric Dillmann <eric@jave.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1217	2013-01-29 09:28:20 -08:00
Chris Wedgwood	ddc07fa57a	Avoid gcc -Werror=maybe-uninitialized warnings Explicitly set acl details to zero to silence gcc (zfs_acl_node_read can't be sure zfs_acl_znode_info will set acl_count and aclsize). Normally suppressing these warnings by setting this to zero at declaration time is a bad idea but in this instance it's hard to avoid and should be fairly safe. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1244	2013-01-28 09:10:29 -08:00
Brian Behlendorf	6772fb679a	Use dsl_dataset_snap_lookup() Retire the dmu_snapshot_id() function which was introduced in the initial .zfs control directory implementation. There is already an existing dsl_dataset_snap_lookup() which does exactly what we need, and the dmu_snapshot_id() function as implemented is racy. https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1238	2013-01-25 15:07:40 -08:00
Brian Behlendorf	bf01b5e616	Add d_clear_d_op() compatibility Added d_clear_d_op() helper function which clears some flags and the registered dentry->d_op table. This is required because d_set_d_op() issues a warning when the dentry operations table is already set. For the .zfs control directory to work properly we must be able to override the default operations table and register custom .d_automount and .d_revalidate callbacks. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1230	2013-01-23 16:33:29 -08:00
Ned Bass	1305d33a4b	fzap_cursor_move_to_key() should drop l_rwlock Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock since that function returns with the lock held on success. All other callers drop the lock correctly but it seems fzap_cursor_move_to_key() does not. This may block writers or cause VERIFY failures when the lock is freed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1215 Closes zfsonlinux/spl#143 Closes zfsonlinux/spl#97	2013-01-23 16:31:16 -08:00
Brian Behlendorf	09a661e960	Fix zpl_revalidate() NULL deref In zpl_revalidate() it's possible for the nameidata to be NULL for kernels which still accept the parameter. In particular, lookup_one_len() calls d_revalidate() with a NULL nameidata. Resolve the issue by checking for a NULL nameidata in which case just set the flags to 0. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1226	2013-01-22 09:38:17 -08:00
Brian Behlendorf	ee93035378	Use sb->s_d_op default dentry operations As of Linux 2.6.37 the right way to register custom dentry operations is to use the super block's ->s_d_op field. For older kernels they should be registered as part of the lookup operation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1223	2013-01-18 15:04:23 -08:00
Massimo Maggi	babf3f9b6d	Fix zpool on zvol deadlock Commit `65d56083b4` fixes the lock inversion between spa_namespace_lock and bdev->bd_mutex but only for the first user of spa_namespace_lock: dmu_objset_own(). Later spa_namespace_lock gets acquired by dsl_prop_get_integer() though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()-> spa_open()->spa_open_common() without this "protection". By moving the mutex release after this second use, even this acquisition of the lock is "protected" by the ERESTARTSYS trick. Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1220	2013-01-18 09:44:55 -08:00
Brian Behlendorf	7973e464de	Revert "Revert "Fix unlink/xattr deadlock"" This reverts commit `53c7411919` effectively reinstating the asynchronous xattr cleanup code. These Linux changes were reverted because after testing and careful contemplation I was convinced that due to the 89260a1c8851ce05ea04b23606ba438b271d890 commit they were no longer required. Unfortunately, the deadlock described in #1176 was a case which wasn't considered. At mount zfs_unlinked_drain() can occur which will unlink a list of znodes in effectively a random order which isn't safe. The only reason it was safe to originally revert this change was the we could guarantee that the VFS would always prune the xattr leaves before the parents. Therefore, until we can cleanly resolve this deadlock for all cases we need to keep this change in spite of the xattr unlink performance penalty associated with it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1176 Issue #457	2013-01-17 11:24:20 -08:00
Brian Behlendorf	7b3e34ba5a	Fix 'zfs rollback' on mounted file systems Rolling back a mounted filesystem with open file handles and cached dentries+inodes never worked properly in ZoL. The major issue was that Linux provides no easy mechanism for modules to invalidate the inode cache for a file system. Because of this it was possible that an inode from the previous filesystem would not get properly dropped from the cache during rolling back. Then a new inode with the same inode number would be create and collide with the existing cached inode. Ideally this would trigger an VERIFY() but in practice the error wasn't handled and it would just NULL reference. Luckily, this issue can be resolved by sprucing up the existing Solaris zfs_rezget() functionality for the Linux VFS. The way it works now is that when a file system is rolled back all the cached inodes will be traversed and refetched from disk. If a version of the cached inode exists on disk the in-core copy will be updated accordingly. If there is no match for that object on disk it will be unhashed from the inode cache and marked as stale. This will effectively make the inode unfindable for lookups allowing the inode number to be immediately recycled. The inode will then only be accessible from the cached dentries. Subsequent dentry lookups which reference a stale inode will result in the dentry being invalidated. Once invalidated the dentry will drop its reference on the inode allowing it to be safely pruned from the cache. Special care is taken for negative dentries since they do not reference any inode. These dentires will be invalidate based on when they were added to the dentry cache. Entries added before the last rollback will be invalidate to prevent them from masking real files in the dataset. Two nice side effects of this fix are: * Removes the dependency on spl_invalidate_inodes(), it can now be safely removed from the SPL when we choose to do so. * zfs_znode_alloc() no longer requires a dentry to be passed. This effectively reverts this portition of the code to its upstream counterpart. The dentry is not instantiated more correctly in the Linux ZPL layer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #795	2013-01-17 09:51:20 -08:00
Ned Bass	f1a05fa114	Fix false ENOENT on snapshot control dentries Lookups in the snapshot control directory for an existing snapshot fail with ENOENT if an earlier lookup failed before the snapshot was created. This is because the earlier lookup causes a negative dentry to be cached which is never invalidated. The bug can be reproduced as follows (the second ls should succeed): $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory $ zfs snap tank@s $ ls /tank/.zfs/snapshot/s ls: cannot access /tank/.zfs/snapshot/s: No such file or directory To remedy this, always invalidate cached dentries in the snapshot control directory. Since these entries never exist on disk there is no significant performance penalty for the extra lookups. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1192	2013-01-16 16:28:54 -08:00
Ned Bass	94a9bb4709	Fix quoting error in unmount command A misplaced single quote caused the umount command to fail with a syntax error when unmounting snapshots under the .zfs/snapshot control directory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1210	2013-01-16 15:30:47 -08:00
Christopher Siden	b077fd4c4e	Illumos #3189 kernel panic in test hotspare_onoffline_004_neg 3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@8f0b538d1d changeset: 13818:e9ad0a945d45 https://www.illumos.org/issues/3189 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-14 10:34:53 -08:00
Arne Jansen	ff80d9b142	Illumos #1862 incremental zfs receive fails for sparse file > 8PB 1862 incremental zfs receive fails for sparse file > 8PB Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Simon Klinkert <klinkert@webgods.de> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@31495a1e56 illumos changeset: 13789:f0c17d471b7a https://www.illumos.org/issues/1862 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-14 10:34:41 -08:00
Matthew Ahrens	a94addd974	Illumos #3208 cross-endian incorrect user/group accounting 3208 moving zpool cross-endian results in incorrect user/group accounting Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@e828a46d29 illumos changeset: 13835:eea81edc4f14 https://www.illumos.org/issues/3208 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #627 Closes #1136	2013-01-14 09:32:22 -08:00
Bart Coddens	5c83989071	Illumos #2618 arc.c mistypes in the comments 2618 arc.c mistypes in the comments Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed by: Josef Sipek <jeffpc@josefsipek.net> Approved by: Richard Lowe <richlowe@richlowe.net> References: illumos/illumos-gate@fc98fea58e illumos changeset: 13721:5b51a16a186f https://www.illumos.org/issues/2618 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-11 09:16:59 -08:00
Ned Bass	761394b3af	call_usermodehelper() should wait for process As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the process). A number of call sites used the number 1 instead of the constant name, so the behavior was not as expected on kernels with this change. One visible consequence of this change was that processes accessing automounted snapshots received an ELOOP error because they failed to wait for zfs.mount to complete. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #816	2013-01-09 16:54:52 -08:00
Brian Behlendorf	1c50c992ba	Revert "Avoid ELOOP on auto-mounted snapshots" This reverts commit `7afcf5b1da` which accidentally introduced a regression with the .zfs snapshot directory. While the updated code still does correctly mount the requested snapshot. It updates the vfsmount such that it references the original dataset vfsmount. The result is that the snapshot itself isn't visible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #816	2013-01-09 11:24:47 -08:00
Brian Behlendorf	4cec9b2dc7	Only reduce __zio_execute() stack usage in kernel space Related to `91579709fc` we need to be very careful about not overrunning the stack in kernel space. However, in user space we're already allowing slightly larger stacks so this stack usage optimization is not required there. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-09 10:34:35 -08:00
George Wilson	1eb5bfa3dc	Illumos #3145 , #3212 3145 single-copy arc 3212 ztest: race condition between vdev_online() and spa_vdev_remove() Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos-gate/commit/9253d63df408bb48584e0b1abfcc24ef2472382e illumos changeset: 13840:97fd5cdf328a https://www.illumos.org/issues/3145 https://www.illumos.org/issues/3212 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #989 Closes #1137	2013-01-08 10:35:44 -08:00
Matthew Ahrens	753c38392d	Illumos #3104 : eliminate empty bpobjs 3104 eliminate empty bpobjs Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <eric.schrock@delphix.com> References: illumos/illumos-gate@f174573681 illumos changeset: 13782:8f78aae28a63 https://www.illumos.org/issues/3104 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
Brian Behlendorf	91579709fc	Fix __zio_execute() asynchronous dispatch To save valuable stack all zio's were made asynchronous when in the tgx_sync_thread context or during pool initialization. See commit `2fac4c2` for the original patch and motivation. Unfortuantely, the changes to dsl_pool_sync_context() made by the feature flags broke this logic causing in __zio_execute() to dispatch itself infinitely when called during pool initialization. This commit refines the existing logic to specificly target only the two cases we care about. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
George Wilson	ea0b2538cd	Illumos #3349 : zpool upgrade -V bumps the on disk version number 3349 zpool upgrade -V bumps the on disk version number, but leaves the in core version Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@nexenta.com> References: illumos/illumos-gate@25345e4666 https://www.illumos.org/issues/3349 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
Matthew Ahrens	29809a6cba	Illumos #3086 : unnecessarily setting DS_FLAG_INCONSISTENT on async 3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets Reviewed by: Christopher Siden <chris.siden@delphix.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@ce636f8b38 illumos changeset: 13776:cd512c80fd75 https://www.illumos.org/issues/3086 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
Christopher Siden	b9b24bb4ca	Illumos #2762 : zpool command should have better support for feature flags 2762 zpool command should have better support for feature flags Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@57221772c3 https://www.illumos.org/issues/2762 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:43 -08:00
George Wilson	3bc7e0fb0f	Illumos #3090 and #3102 3090 vdev_reopen() during reguid causes vdev to be treated as corrupt 3102 vdev_uberblock_load() and vdev_validate() may read the wrong label Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@dfbb943217 illumos changeset: 13777:b1e53580146d https://www.illumos.org/issues/3090 https://www.illumos.org/issues/3102 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #939	2013-01-08 10:35:42 -08:00
Christopher Siden	9ae529ec5d	Illumos #2619 and #2747 2619 asynchronous destruction of ZFS file systems 2747 SPA versioning with zfs feature flags Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: illumos/illumos-gate@53089ab7c8 illumos/illumos-gate@ad135b5d64 illumos changeset: 13700:2889e2596bd6 https://www.illumos.org/issues/2619 https://www.illumos.org/issues/2747 NOTE: The grub specific changes were not ported. This change must be made to the Linux grub packages. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 10:35:35 -08:00
Ned Bass	37f000c5aa	Fix gcc array subscript above bounds warning In a debug build, certain GCC versions flag an array bounds warning in the below code from dnode_sync.c } else { int i; ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr); /* the blkptrs we are losing better be unallocated */ for (i = dn->dn_next_nblkptr[txgoff]; i < dnp->dn_nblkptr; i++) ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i])); This usage is in fact safe, since the ASSERT ensures the index does not exceed to maximum possible number of block pointers. However gcc can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];' falls within the array bounds so it issues a warning. To avoid this, initialize i to zero to make gcc happy but skip the elements before dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains at most 3 block pointers this overhead should be negligible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #950	2013-01-07 11:21:52 -08:00
Matt Johnston	72938d6905	Use cv_wait_io() which will will account for iowait Update zio_wait() to use cv_wait_io() to ensure the iowait time is properly accounted for. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-07 10:52:52 -08:00
Matt Johnston	72f53c5694	Revert part of "Log I/Os longer than zio_delay_max (30s default)" This reverts commit `9dcb971983` which was originally introduced to debug occasional slow I/Os. These I/Os would complete eventually but were observed to take several 100 seconds. The root cause of this issue was the CFQ scheduler which can, under certain conditions, excessively delay an I/O from being issued to the device. This issue was mitigated somewhat by commit `84daaddedb` which ensures the I/O elevator gets changed even for DM style devices. This change isn't in any way harmful but it does conflict with a required change to properly account from I/O wait time. Because Linux does not export the io_schedule_timeout() function we must instead rely on io_schedule() via cv_wait_io(). The additional debugging information which was added to the delay event has been intentionally left in place. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-07 10:51:04 -08:00
Brian Behlendorf	65d56083b4	Fix zpool on zvol lock inversion deadlock In all but one case the spa_namespace_lock is taken before the bdev->bd_mutex lock. But Linux __blkdev_get() function calls fops->open() with the bdev->bd_mutex lock held and we must somehow still safely acquire the spa_namespace_lock. To avoid a potential lock inversion deadlock we preemptively try to take the spa_namespace_lock(). Normally it will not be contended and this is safe because spa_open_common() handles the case where the caller already holds the spa_namespace_lock. When it is contended we risk a lock inversion if we were to block waiting for the lock. Luckily, the __blkdev_get() function allows us to return -ERESTARTSYS which will result in bdev->bd_mutex being dropped, reacquired, and fops->open() being called again. This process can be repeated safely until both locks are acquired. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #612	2012-12-20 09:57:39 -08:00
Brian Behlendorf	d5446cfc52	Revert "Remove TSD zfs_fsyncer_key" This reverts commit `31f2b5abdf` back to the original code until the fsync(2) performance regression can be addressed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-20 09:56:28 -08:00
Brian Behlendorf	31f2b5abdf	Remove TSD zfs_fsyncer_key It's my understanding that the zfs_fsyncer_key TSD was added as a performance omtimization to reduce contention on the zl_lock from zil_commit(). This issue manifested itself as very long (100+ms) fsync() system call times for fsync() heavy workloads. However, under Linux I'm not seeing the same contention that was originally described. Therefore, I'm removing this code in order to ween ourselves off any dependence on TSD. If the original performance issue reappears on Linux we can revisit fixing it without resorting to TSD. This just leaves one small ZFS TSD consumer. If it can be cleanly removed from the code we'll be able to shed the SPL TSD implementation entirely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#174	2012-12-19 09:08:01 -08:00
Prakash Surya	84daaddedb	Set elevator for DM devices despite vdev_wholedisk The current state of udev and devicer-mapper devices makes it difficult to construct a mapping of DM partitions and their underlying DM device. For example, with a /dev directory with the following contents: $ ls -d /dev/dm-* /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 it is not immediately apparent if these are completely separate devices, or partitions and real devices intermixed. In contrast, SCSI devices would appear as so: $ ls -d /dev/sd* /dev/sda /dev/sda1 /dev/sdb /dev/sdb1 Here, one can immediately determine that there are two devices (sda and sdb), each containing a single partition. The lack of a predictable and consistent mapping from DM devices to DM device partitions makes it difficult for user space to process these devices the same way it does SCSI devices. As a result, the ZFS utilities do not partition DM devices, and instead set the "vdev_wholedisk" label to 0 and treat them as partitions. This has the side effect that, even if ZFS has sole ownership of the device, the IO scheduler will not be modified because it is treated as a partition. This change adds an exception for DM devices in vdev_elevator_switch, allowing the elevator to be modified even though the "vdev_wholedisk" property is not set. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1149	2012-12-18 15:12:40 -08:00
Jorgen Lundman	6c2856726f	Fix using zvol as slog device During the original ZoL port the vdev_uses_zvols() function was disabled until it could be properly implemented. This prevented a zpool from use a zvol for its slog device. This patch implements that missing functionality by adding a zvol_is_zvol() function to zvol.c. Given the full path to a device it will lookup the device and verify its major number against the registered zvol major number for the system. If they match we know the device is a zvol. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1131	2012-12-18 11:02:28 -08:00
Brian Behlendorf	8780c53961	Update SAs when an inode is dirtied Revert the portion of commit `d3aa3ea` which always resulted in the SAs being update when an mmap()'ed file was closed. That change accidentally resulted in unexpected ctime updates which upset tools like git. That was always a horrible hack and I'm happy it will never make it in to a tagged release. The right fix is something I initially resisted doing because I was worried about the additional overhead. However, in hindsight the overhead isn't as bad as I feared. This patch implemented the sops->dirty_inode() callback which is unsurprisingly called when an inode is dirtied. We leverage this callback to keep the znode SAs strictly in sync with the inode. However, for now we're going to go slowly to avoid introducing any new unexpected issues by only updating the atime, mtime, and ctime. This will cover the callpath of most concern to us. ->filemap_page_mkwrite->file_update_time->update_time-> mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #764 Closes #1140	2012-12-14 12:18:54 -08:00
Ned Bass	7afcf5b1da	Avoid ELOOP on auto-mounted snapshots Ensure that the path member pointers are associated with the newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise the follow_automount() function may be called repeatedly, leading to an incorrect ELOOP error return. This problem was observed as a 'Too many levels of symbolic links' error from user-space commands accessing an unmounted snapshot in the .zfs/snapshot directory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #816	2012-12-13 08:57:11 -08:00
Brian Behlendorf	2ae1031962	Linux 3.7 compat, schedule_delayed_work() Linux kernel commit d8e794d accidentally broke the delayed work APIs for non-GPL callers. While the APIs to schedule a delayed work item are still available to all callers, it is no longer possible to initialize the delayed work item. I'm cautiously optimistic we could get the delayed_work_timer_fn exported for all callers in the upstream kernel. But frankly the compatibility code to use this kernel interface has always been problematic. Therefore, this patch abandons direct use the of the Linux kernel interface in favor of the new delayed taskq interface. It provides roughly the same functionality as delayed work queues but it's a stable interface under our control. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1053	2012-12-12 10:47:05 -08:00
Richard Yao	e4d89e9cfc	Switch KM_SLEEP to KM_PUSHPAGE When writes to zvols invoke ZIL, zfs_range_new_proxy() is called, which allocates memory using KM_SLEEP, triggering a warning. Switch to KM_PUSHPAGE to silence that warning. See commit `b8d06fca08` for additional details. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1138	2012-12-10 09:44:45 -08:00
Brian Behlendorf	53c7411919	Revert "Fix unlink/xattr deadlock" This reverts commit `b00131d43c` which is no longer needed due to `e89260a1c8`. This change forces all xattr znodes to hold a reference on their parent which ensures prune_icache() will never attempt to evict both the parent and child concurrently. This effectively prevents the deadlock condition from ever occuring. Therefore we can safely revert back to the upstream synchronous cleanup code. This is nice because it keeps our code base closer to upstream and resolves the performance issues introduced by the original deadlock fix. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #457	2012-12-05 13:41:30 -08:00
Brian Behlendorf	d3aa3ea96e	Preserve inode mtime/ctime in .writepage() When updating a file via mmap()'ed I/O preserve the mtime/ctime which were updated when the page was made writable by the generic callback filemap_page_mkwrite(). But more importantly than preserving the exact time add the missing call to sa_bulk_update(). This ensures that the znode modifications are written to disk as part of the transaction. Without this the inode may mistaken rollback to the previous on-disk znode state. Additionally, for mmap()'ed znodes explicitly set the atime, mtime, and ctime on close using the up to date values in the inode. This is critical because writepage() may occur after close and on close we need to ensure the values are correct. Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #764	2012-12-05 13:00:25 -08:00
Brian Behlendorf	e89260a1c8	Directory xattr znodes hold a reference on their parent Unlike normal file or directory znodes, an xattr znode is guaranteed to only have a single parent. Therefore, we can take a refernce on that parent if it is provided at create time and cache it. Additionally, we take care to cache it on any subsequent zfs_zaccess() where the parent is provided as an optimization. This allows us to avoid needing to do a zfs_zget() when setting up the SELinux security xattr in the create path. This is critical because a hash lookup on the directory will deadlock since it is locked. The zpl_xattr_security_init() call has also been moved up to the zpl layer to ensure TXs to create the required xattrs are performed after the create TX. Otherwise we run the risk of deadlocking on the open create TX. Ideally the security xattr should be fully constructed before the new inode is unlocked. However, doing so would require far more extensive changes to ZFS. This change may also have the benefitial side effect of ensuring xattr directory znodes are evicted from the cache before normal file or directory znodes due to the extra reference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #671	2012-12-03 12:10:46 -08:00
Brian Behlendorf	c3275b56a1	Add load_nvlist() error handling Add the missing error handling to load_nvlist(). There's no good reason this needs to be fatal. All callers of load_nvlist() do correctly handle an error condition and it is preferable that an error be returned. This will allow 'zpool import -FX' to safely attempt to rollback through previous txgs looking for a good one. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1120	2012-11-30 13:48:17 -08:00
Brian Behlendorf	004324ecc6	Disable page allocation warnings for super block Due to the slightly increased size of the ZFS super block caused by `30315d2` there are now allocation warnings. The allocation size is still small (just over 8k) and super blocks are rarely allocated so we suppress the warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1101	2012-11-30 11:04:44 -08:00
Brian Behlendorf	f74a147c02	Fix NULL deref when zvol_alloc() fails If zvol_alloc() fails zv will be set to NULL and dereferenced in out_dmu_objset_disown. To avoid this entirely the zv->objset line is moved up in to the success block. Original-patch-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1109	2012-11-27 14:10:31 -08:00
George Wilson	32a9872bba	Illumos #2671 : zpool import should not fail if vdev ashift has increased Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Richard Lowe <richlowe@richlowe.net> Refererces to Illumos issue: https://www.illumos.org/issues/2671 This patch has been slightly modified from the upstream Illumos version. In the upstream implementation a warning message is logged to the console. To prevent pointless console noise this notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift" event. The event indicates a non-optimial (but entirely safe) ashift value was used to create the pool. Depending on your workload this may impact pool performance. Unfortunately, the only way to correct the issue is to recreate the pool with a new ashift. NOTE: The unrelated fix to the comment in zpool_main.c appears in the upstream commit and was preserved for consistnecy. Ported-by: Cyril Plisko <cyril.plisko@mountall.com> Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #955	2012-11-15 11:05:59 -08:00
Brian Behlendorf	4c837f0d93	Fix "allocating allocated segment" panic Gunnar Beutner did all the hard work on this one by correctly identifying that this issue is a race between dmu_sync() and dbuf_dirty(). Now in all cases the caller is responsible for preventing this race by making sure the zfs_range_lock() is held when dirtying a buffer which may be referenced in a log record. The mmap case which relies on zfs_putpage() was not taking the range lock. This code was accidentally dropped when the function was rewritten for the Linux VFS. This patch adds the required range locking to zfs_putpage(). It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which aren't strictly required due to the VFS holding a reference. However, this makes the code more consistent with the upsteam code and there's no harm in being extra careful here. Original-patch-by: Gunnar Beutner <gunnar@beutner.name> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #541	2012-11-09 19:01:09 -08:00
Brian Behlendorf	e26ade5101	Fix zvol+btrfs hang When using a zvol to back a btrfs filesystem the btrfs mount would hang. This was due to the bio completion callback used in btrfs assuming that lower level drivers would never modify the bio->bi_io_vecs after they were submitted via bio_submit(). If they are modified btrfs will miscalculate which pages need to be unlocked resulting in a hang. It's worth mentioning that other file systems such as ext[234] and xfs work fine because they do not make the same assumption in the bio completion callback. The most straight forward way to fix the issue is to present the semantics expected by btrfs. This is done by cloning the bios attached to each request and then using the clones bvecs to perform the required accounting. The clones are freed after each read/write and the original unmodified bios are linked back in to the request. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #469	2012-11-09 12:24:51 -08:00
Brian Behlendorf	9dcb971983	Log I/Os longer than zio_delay_max (30s default) There have been reports of ZFS deadlocking due to what appears to be a lost IO. This patch addes some debugging to determine the exact state of the IO which neither 1) completed, 2) failed, or 3) timed out after zio_delay_max (30) seconds. This information will be logged using the ZFS FMA infrastructure as a 'delay' event and posted to the internal zevent log. By default the last 64 events will be kept in the log but the limit is configurable via the zfs_zevent_len_max module option. To dump the contents of the log use the 'zpool events -v' command and look for the resource.fs.zfs.delay event. It will include various information about the pool, vdev, and zio which may shed some light on the issue. In the context of this change the 120 second kernel blocked thread watchdog has been disabled for synchronous IOs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #930	2012-11-02 15:45:59 -07:00
Brian Behlendorf	e95853a331	Add txgs-<pool> kstat file Create a kstat file which contains useful statistics about the last N txgs processed. This can be helpful when analyzing pool performance. The new KSTAT_TYPE_TXG type was added for this purpose and it tracks the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth; - Creation time nread - Bytes read nwritten; - Bytes written reads - IOPs read writes - IOPs write open_time; - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time; - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-02 15:45:56 -07:00
Brian Behlendorf	e8fd45a0f9	Add ddt_object_count() error handling The interface for the ddt_zap_count() function assumes it can never fail. However, internally ddt_zap_count() is implemented with zap_count() which can potentially fail. Now because there was no way to return the error to the caller a VERIFY was used to ensure this case never happens. Unfortunately, it has been observed that pools can be damaged in such a way that zap_count() fails. The result is that the pool can not be imported without hitting the VERIFY and crashing the system. This patch reworks ddt_object_count() so the error can be safely caught and returned to the caller. This allows a pool which has be damaged in this way to be safely rewound for import. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #910	2012-10-29 08:57:45 -07:00
Brian Behlendorf	178e73b376	Revert "Don't ashift-align vdev read requests." This reverts commit `a5c20e2a0a` which accidentally introduced a regression for real 4k sector devices. See issue #1065 for details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1065	2012-10-24 15:25:33 -07:00
Brian Behlendorf	f21e5c6a17	Remove 'Resized bio's/dio' warning The following warning was originally added to provide visibility in to how often a dio gets heavily fragmented in to over 16 bios. This can happen due to constraints imposed by the block device and may have a negitive impact on performance but is otherwise harmless. To prevent needless confusion and worry the message has been removed. kernel: WARNING: Resized bio's/dio to 32 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-22 10:17:10 -07:00
Brian Behlendorf	c7dfc08629	Quote snapshot and mountpoint for .zfs automount When automounting a snapshot in the .zfs/snapshot directory make sure to quote both the dataset name and the mount point. This ensures that if either component contains spaces, which are allowed, they get handled correctly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1027	2012-10-17 13:26:18 -07:00
Etienne Dechamps	5d7a86d114	Use the slog even with logbias=throughput. In the current code, logbias=throughput implies the following: 1) All synchronous writes are logged in indirect mode. 2) The slog is not used. (1) makes sense because it avoids writing the data twice, which is obviously a good thing when the user wants maximum pool throughput. (2), however, is a surprising decision. Considering all writes are indirect, the log record doesn't contain the actual data, only pointers to DMU blocks. As a result, log records written in logbias=throughput mode are quite small, and as such, it doesn't make any sense to write them to the main pool since slogs are usually optimized for small synchronous writes. In fact, the current behavior is actually harmful for performance, because log blocks and data blocks from dmu_sync() seldom have the same allocation size and as a result are usually allocated from different metaslabs. This means that if a spindle has to write both log blocks and DMU blocks (which is likely to happen under heavy load), it will have to seek between the two. Allocating the log blocks from the slog pool instead of the main pool avoids these unnecessary seeks. This commit makes ZFS use the slog on datasets with logbias=throughput. Real-life performance testing shows a 50% synchronous write performance increase with some large commit sizes, and no negative effect in other cases. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1013	2012-10-17 08:56:46 -07:00
Etienne Dechamps	920dd524fb	Add FASTWRITE algorithm for synchronous writes. Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks accross vdevs is important for performance: indeed, using mutliple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the least amount of pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implictly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that linguering fastwrites will get pruned at each sync event. The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1013	2012-10-17 08:56:41 -07:00
Brian Behlendorf	a298dbde92	Condition variable usage, zp->r_{rd,wr}_cv The following incorrect usage of cv_broadcast() was caught by code inspection. The cv_broadcast() function must be called under the associated mutex to preventing racing with cv_wait(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-15 16:02:03 -07:00
Brian Behlendorf	8c0712fd88	Condition variable usage, zilog->zl_cv_batch The following incorrect usage of cv_signal and cv_broadcast() was caught by code inspection. The cv_signal and cv_broadcast() functions must be called under the associated mutex to preventing racing with cv_wait(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-15 16:01:58 -07:00
Brian Behlendorf	99db9bfde7	Condition variable usage, zevent_cv The following incorrect usage of cv_broadcast() was caught by code inspection. The cv_broadcast() function must be called under the associated mutex to preventing racing with cv_wait(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-15 16:01:54 -07:00
Massimo Maggi	6f53a6a229	Switch KM_SLEEP to KM_PUSHPAGE In this particular instance the allocation occurred in the context of sys_msync()->...->zpl_putpage() where we must be careful not to initiate additional I/O. Signed-off-by: Massimo Maggi <massimo@mmmm.it> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1038	2012-10-15 09:32:38 -07:00
Brian Behlendorf	c418410393	Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE Prevent users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #520	2012-10-15 09:28:43 -07:00
Yuxuan Shui	45ca2d91cb	Return positive error number in zfsctl_shares_lookup. Otherwise it will cause zpl_shares_lookup() to return a invalid pointer when an error occurs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Closes #626 #885 #947 #977	2012-10-15 09:11:56 -07:00
Yuxuan Shui	558ef6d080	Linux 3.6 compat, iops->create() As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the struct nameidata is no longer passed to iops->create. Instead only the result of (inamedata->flags & LOOKUP_EXCL) is passed. ZFS like almost all Linux fileystems never made use of this so only the prototype needs to be wrapped for compatibility. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #873	2012-10-14 14:42:25 -07:00
Yuxuan Shui	8f195a908f	Linux 3.6 compat, iops->lookup() As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the struct nameidata is no longer passed to iops->lookup. Instead only the inamedata->flags are passed. ZFS like almost all Linux fileystems never made use of this so only the prototype needs to be wrapped for compatibility. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #873	2012-10-14 13:06:54 -07:00
Yuxuan Shui	3c20361075	Linux 3.6 compat, sget() As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the mount flags are now passed to sget() so they can be used when initializing a new superblock. ZFS never uses sget() in this fashion so we can simply pass a zero and add a zpl_sget() compatibility wrapper. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #873	2012-10-14 13:06:48 -07:00
Yuxuan Shui	af26c4d4ab	Linux 3.6 compat, sops->write_super() removed The .write_super callback was removed the the super_operations structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e. All file systems are now expected to self manage writing any dirty state assoicated with their super block. ZFS never made use of this callback so it can simply be removed from the super_operations structure. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #873	2012-10-14 11:33:56 -07:00
Etienne Dechamps	a5c20e2a0a	Don't ashift-align vdev read requests. Currently, the size of read and write requests on vdevs is aligned according to the vdev's ashift, allocating a new ZIO buffer and padding if need be. This makes sense for write requests to prevent read/modify/write if the write happens to be smaller than the device's internal block size. For reads however, the rationale is less clear. It seems that the original code aligns reads because, on Solaris, device drivers will outright refuse unaligned requests. We don't have that issue on Linux. Indeed, Linux block devices are able to accept requests of any size, and take care of alignment issues themselves. As a result, there's no point in enforcing alignment for read requests on Linux. This is a nice optimization opportunity for two reasons: - We remove a memory allocation in a heavily-used code path; - The request gets aligned in the lowest layer possible, which shrinks the path that the additional, useless padding data has to travel. For example, when using 4k-sector drives that lie about their sector size, using 512b read requests instead of 4k means that there will be less data traveling down the ATA/SCSI interface, even though the drive actually reads 4k from the platter. The only exception is raidz, because raidz needs to read the whole allocated block for parity. This patch removes alignment enforcement for read requests, except on raidz. Note that we also remove an assertion that checks that we're aligning a top-level vdev I/O, because that's not the case anymore for repair writes that results from failed reads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1022	2012-10-12 12:01:56 -07:00
Richard Yao	b68503fb30	Remove vmem_size() consumers There are currently three vmem_size() consumers all of which are part of the ARC implemention. However, since the expected behavior of the Linux and Solaris virtual memory subsystems are so different the behavior in each of these instances needs to be reevaluated. * arc_evict_needed() - This is actually dead code. Arena support was never added to the SPL and zio_arena is always NULL. This support isn't needed so we simply remove this dead code. * arc_memory_throttle() - On Solaris where virtual memory constitutes almost all of the address space we can reasonably expect there to be a fairly large amount free. However, on Linux by default we only have about 100MB total and that's heavily used by the ARC. So the expectation on Linux is that this will usually be a small value. Therefore we remove the vmem_size() check for i386 systems because the expectation is that it will be less than the zfs_write_limit_max. * arc_init() - Here vmem_size() is used to initially size the ARC. Since the ARC is currently backed by the virtual address space it makes sense to use this as a limit on the ARC for 32-bit systems. This code can be removed when the ARC is backed by the page cache. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #831	2012-10-12 10:03:03 -07:00
Brian Behlendorf	87d98efe9e	Fix zfs_txg_timeout module parameter Allow the zfs_txg_timeout variable to be dynamically tuned at run time. By pulling it down out of the variable declaration it will be evaluted each time through the loop. The zfs_txg_timeout variable is now declared extern in a the common sys/txg.h header rather than locally in dsl_scan.c. This prevents potential type mismatches if the global variable needs to be used elsewhere. Move the module_param() code in to the same source file where zfs_txg_timeout is declared. This is the most logical location. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-11 15:07:09 -07:00
Richard Yao	7df05a4266	Fix zfs_write_limit_max integer size mismatch on 32-bit systems Commit `c409e4647f` introduced a number of module parameters. This required several types to be changed to accomidate the required module parameters Linux macros. Unfortunately, arc.c contained its own extern definition of the zfs_write_limit_max variable and its type was not updated to be consistent with its dsl_pool.c counterpart. If the variable had been properly marked extern in a common header, then gcc would have generated a warning and this would not have slipped through. The result of this was that the ARC unconditionally expected zfs_write_limit_max to be 64-bit. Unfortunately, the largest size integer module parameter that Linux supports is unsigned long, which varies in size depending on the host system's native word size. The effect was that on 32-bit systems, ARC incorrectly performed 64-bit operations on a 32-bit value by reading the neighboring 32 bits as the upper 32 bits of the 64-bit value. We correct that by changing the extern declaration to use the unsigned long type and move these extern definitions in to the common arc.h header. This should make ARC correctly treat zfs_write_limit_max as a 32-bit value on 32-bit systems. Reported-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #749	2012-10-11 11:09:25 -07:00
Cyril Plisko	15fd274973	Make zfs_immediate_write_sz a module paramater zfs_immediate_write_sz variable is a tunable, but lacks proper module_param() instrumentation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1032	2012-10-11 11:09:21 -07:00
Cyril Plisko	5b7e5b5ab9	txg is spelled as tgx in places Term 'transaction group' is commonly abbreviated as txg in ZFS sources. There are some places (Linux specific MODULE_PARAM_DESC() macros) where it is incorrectly spelled as 'tgx'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1030	2012-10-11 09:19:08 -07:00
Massimo Maggi	beb999445a	Switch KM_SLEEP to KM_PUSHPAGE Prevent snapshot_check to initiate I/O during memory allocation. Signed-off-by: Massimo Maggi <massimo@mmmm.it> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1023	2012-10-08 10:19:05 -07:00
Brian Behlendorf	7bd04f2d7d	Set default zvol elevator to noop It doesn't make sense for a zvol to use the default system I/O scheduler because it is a virtual device. Therefore, we change the default scheduler to 'noop' for zvols provided that the elevator_change() function is available. This interface has been available since Linux 2.6.36 and appears in the RHEL 6.x kernels. We deliberately do not implement the method for older kernels because it was racy and could result in system crashes. It's better to simply manually tune the scheduler for these kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1017	2012-10-05 12:39:59 -07:00
Etienne Dechamps	089fa91bc5	Align DISCARD requests on zvols. Currently, when processing DISCARD requests, zvol_discard() calls dmu_free_long_range() with the precise offset and size of the request. Unfortunately, this is not optimal for requests that are not aligned to the zvol block boundaries. Indeed, in the case of an unaligned range, dnode_free_range() will zero out the unaligned parts. Not only is this useless since we are not freeing any space by doing so, it is also slow because it translates to a read-modify-write operation. This patch fixes the issue by rounding up the discard start offset to the next volume block boundary, and rounding down the discard end offset to the previous volume block boundary. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1010	2012-10-04 16:01:44 -07:00
Chris Dunlop	d75d6f294e	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fc` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1002	2012-10-04 10:44:09 -07:00
Matthew Ahrens	04434775b7	Illumos #3100 : zvol rename fails with EBUSY when dirty. illumos/illumos-gate@2e2c135528 Illumos changeset: 13780:6da32a929222 3100 zvol rename fails with EBUSY when dirty Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Adam H. Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <eric.schrock@delphix.com> Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #995	2012-10-03 13:59:02 -07:00
George Wilson	65947351e7	Illumos #3129 , #3130 illumos/illumos-gate@d6afdce20f Illumos changeset: 13794:7c5e0e746b2c 3129 'zpool reopen' restarts resilvers 3130 ztest failure: Assertion failed: 0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10) Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/3129 https://www.illumos.org/issues/3130 Ported by: Etienne Dechamps <etienne.dechamps@ovh.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #994	2012-10-03 13:59:02 -07:00
Brian Behlendorf	6d1d976b2c	Modify vdev_elevator_switch() to use elevator_change() As of Linux 2.6.36 an elevator_change() interface was added. This commit updates vdev_elevator_switch() to use this interface when available, otherwise it falls back to the usermodehelper method. Original-patch-by: foobarz <sysop@xeon.(none)> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #906	2012-10-03 13:31:44 -07:00
Cyril Plisko	393b44c711	Implement .commit_metadata hook for NFS export In order to implement synchronous NFS metadata semantics ZFS needs to provide the .commit_metadata hook. All it takes there is to make sure changes are committed to ZIL. Fortunately zfs_fsync() does just that, so simply calling it from zpl_commit_metadata() does the trick. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #969	2012-10-03 10:49:45 -07:00
Chris Wedgwood	23a61ccc1b	zvol_probe should return NULL when the device isn't found. Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel doesn't expect and as such we can oops. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #949 Closes #931 Closes #789 Closes #743 Closes #730	2012-10-03 10:39:12 -07:00
Bill Pijewski	37abac6d55	Illumos #2703 : add mechanism to report ZFS send progress Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Eric Schrock <Eric.Schrock@delphix.com> References: https://www.illumos.org/issues/2703 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-19 13:39:06 -07:00
Chris Siden	1bd201e70d	Illumos #1948 : zpool list should show more detailed pool info Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Eric Schrock <eric.schrock@delphix.com> References: https://www.illumos.org/issues/1948 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #685	2012-09-19 13:39:05 -07:00
Brian Behlendorf	95fd8c9a7f	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #973	2012-09-19 11:52:36 -07:00
Brian Behlendorf	ba367276d8	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-09-17 11:22:23 -07:00
Cyril Plisko	49d39798f2	ZFS replay transaction error 5 When zfs_replay_write() replays TX_WRITE records from ZIL it calls zpl_write_common() to perform the actual write. zpl_write_common() returns the number of bytes written (similar to write() system call) or an (negative) error. However, the code expects the positive return value to be a residual counter. Thus when zpl_write_common() successfully completes it is mistakenly considered to be a partial write and the error code delivered further. At this point the ZIL processing is aborted with famous "ZFS replay transaction error 5" error message given to the message buffer. The fix is to compare the zpl_write_commmon() return value with the buffer size and flag error only when they disagree. Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #933	2012-09-17 11:06:58 -07:00
Brian Behlendorf	8312c6df55	Clear PG_writeback for sync I/O error case Commit `2b2861362f` accidentally introduced this issue by only conditionally registering the commit callback in the async case. The error handing code for the dmu_tx_assign() failure case relied on there always being a registered commit callback to clear the PG_writeback bit. Since that is no longer strictly true for the synchronous case we must explicitly invoke the callback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #961	2012-09-14 15:53:47 -07:00
Brian Behlendorf	5915791096	Move iput() after zfs_inode_update() When replaying an unlink/remove operation via zfs_rmdir() the object being removed will be instantiated by a call to zfs_dirent_lock(). This means that there is a single reference protecting the object. Right before the call to zfs_inode_update() this reference is dropped which may cause the object to be destroyed. This will result in a NULL dereference as shown by the stack trace is issue #782. This likely isn't an issue during normal operation because there is always an additional reference held on the object by the VFS. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #782	2012-09-12 14:22:52 -07:00
Brian Behlendorf	4ca9a43644	Remove zvol device node The 'zfs destroy' changes in `330d06f` disrupted how zvol devices get removed on ZoL. However, it basically boils down to the fact that we are no longer reliably calling zvol_remove_minor() via zfs_ioc_destroy_snaps(). Therefore we add the missing call and handle things similarly to the existing zfs_unmount_snap() case. Ideally we would check if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the right thing as in zfs_ioc_destroy(). However, it looks like it would be fairly expensive to get the type, and it's harmless to simply attempt the umount and minor removal. This is also an issue in the latest FreeBSD and Illumos code. It was being tracked under the following issue, and we may want to refresh our code when they settle on what they want to do about it upstream. https://www.illumos.org/issues/3170 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #903	2012-09-10 10:25:08 -07:00
Cyril Plisko	04f9432d3b	Make ZFS filesystem id persistent across different machines Use ZFS dataset fsid guid as a unique file system id, similar to what is done on Illumos/OpenSolaris. Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #888	2012-09-06 12:47:11 -07:00
Brian Behlendorf	ebcfc8a534	Disable page allocation warnings for ARC buffers Buffers for the ARC are normally backed by the SPL virtual slab. However, if memory is low, AND no slab objects are available, AND a new slab cannot be quickly constructed a new emergency object will be directly allocated. These objects can be as large as order 5 on a system with 4k pages. And because they are allocated with KM_PUSHPAGE, to avoid a potential deadlock, they are not allowed to initiate I/O to satisfy the allocation. This can result in the occasional allocation failure. However, since these allocations are allowed to block and perform operations such as memory compaction they will eventually succeed. Since this is not unexpected (just unlikely) behavior this patch disables the warning for the allocation failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #465	2012-09-06 11:53:08 -07:00
Brian Behlendorf	cafa9709f3	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-09-05 08:44:58 -07:00
Brian Behlendorf	0ef0ff546e	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-09-04 16:00:06 -07:00
Brian Behlendorf	594b4dd82a	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-09-04 08:41:12 -07:00
Chris Dunlop	20a083cbe2	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-09-02 10:15:49 -07:00
Brian Behlendorf	b404a3f07f	Switch KM_SLEEP to KM_PUSHPAGE This warning indicates the incorrect use of KM_SLEEP in a call path which must use KM_PUSHPAGE to avoid deadlocking in direct reclaim. See commit `b8d06fca08` for additional details. SPL: Fixing allocation for task txg_sync (6093) which used GFP flags 0x297bda7c with PF_NOFS set Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #917	2012-08-31 17:39:29 -07:00
Brian Behlendorf	2b2861362f	Clear PG_writeback after zil_commit() for sync I/O When writing via ->writepage() the writeback bit was always cleared as part of the txg commit callback. However, when the I/O is also being written synchronsously to the zil we can immediately clear this bit. There is no need to wait for the subsequent TXG sync since the data is already safe on stable storage. This has been observed to reduce the msync(2) delay from up to 5 seconds down 10s of miliseconds. One workload which is expected to benefit from this are the intermittent samba hands described in issue #700. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #700 Closes #907	2012-08-30 20:16:28 -07:00
Richard Yao	b8d06fca08	Switch KM_SLEEP to KM_PUSHPAGE Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must us KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was acheived by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This is patch lays the ground work for being able to safely revert the following commits which used PF_MEMALLOC: `21ade34` Disable direct reclaim for z_wr_* threads `cfc9a5c` Fix zpl_writepage() deadlock `eec8164` Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Brian Behlendorf	991fc1d7ae	mzap_upgrade() must use kmem_alloc() These allocations in mzap_update() used to be kmem_alloc() but were changed to vmem_alloc() due to the size of the allocation. However, since it turns out this function may be called in the context of the txg_sync thread they must be changed back to use a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:01:37 -07:00
Brian Behlendorf	8630650a8d	Annotate KM_PUSHPAGE call paths with PF_NOFS The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard() call paths must only use KM_PUSHPAGE to avoid potential deadlocks during direct reclaim. This patch annotates these call paths so any accidental use of KM_SLEEP will be quickly detected. In the interest of stability if debugging is disabled the offending allocation will have its GFP flags automatically corrected. When debugging is enabled any misuse will be treated as a fatal error. This patch is entirely for debugging. We should be careful to NOT become dependant on it fixing up the incorrect allocations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:01:37 -07:00
Brian Behlendorf	86dd0fd922	Pre-allocate vdev I/O buffers The vdev queue layer may require a small number of buffers when attempting to create aggregate I/O requests. Rather than attempting to allocate them from the global zio buffers, which is slow under memory pressure, it makes sense to pre-allocate them because... 1) These buffers are short lived. They are only required for the life of a single I/O at which point they can be used by the next I/O. 2) The maximum number of concurrent buffers needed by a vdev is small. It's roughly limited by the zfs_vdev_max_pending tunable which defaults to 10. By keeping a small list of these buffer per-vdev we can ensure one is always available when we need it. This significantly reduces contention on the vq->vq_lock, because we no longer need to perform a slow allocation under this lock. This is particularly important when memory is already low on the system. It would probably be wise to extend the use of these buffers beyond aggregate I/O and in to the raidz implementation. The inability to quickly allocate buffer for the parity stripes could result in similiar problems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:01:37 -07:00
Richard Yao	44f21da41c	Revert Disable direct reclaim for z_wr_* threads This commit used PF_MEMALLOC to prevent a memory reclaim deadlock. However, commit `49be0ccf1f` eliminated the invocation of __cv_init(), which was the cause of the deadlock. PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA to be allocated. The use of PF_MEMALLOC was found to cause stability problems when doing swap on zvols. Since this technique is known to cause problems and no longer fixes anything, we revert it. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Richard Yao	62c4165a1b	Revert Fix zpl_writepage() deadlock The commit, `cfc9a5c88f`, to fix deadlocks in zpl_writepage() relied on PF_MEMALLOC. That had the effect of disabling the direct reclaim path on all allocations originating from calls to this function, but it failed to address the actual cause of those deadlocks. This led to the same deadlocks being observed with swap on zvols, but not with swap on the loop device, which exercises this code. The use of PF_MEMALLOC also had the side effect of permitting allocations to be made from ZONE_DMA in instances that did not require it. This contributes to the possibility of panics caused by depletion of pages from ZONE_DMA. As such, we revert this patch in favor of a proper fix for both issues. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Richard Yao	b876dac776	Revert Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Commit `eec8164771` worked around an issue involving direct reclaim through the use of PF_MEMALLOC. Since we are reworking thing to use KM_PUSHPAGE so that swap works, we revert this patch in favor of the use of KM_PUSHPAGE in the affected areas. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Brian Behlendorf	cd38ac58a3	rmdir(2) should return ENOTEMPTY Under Solaris the behavior for rmdir(2) is to return EEXIST when a directory still contains entries. However, on Linux ENOTEMPTY is the expected return value with EEXIST being technically allowed. According to rmdir(2): ENOTEMPTY pathname contains entries other than . and .. ; or, pathname has .. as its final component. POSIX.1-2001 also allows EEXIST for this condition. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #895	2012-08-26 13:55:45 -07:00
Christopher Siden	9e11c7eee2	Illumos #3085 : zfs diff panics, then panics in a loop on booting Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/3085 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-25 12:32:25 -07:00
Simon Klinkert	c578f007ff	Illumos #2901 : zfs receive fails for exabyte sparse files Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/2901 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-25 12:28:29 -07:00
Javen Wu	a47587389e	Drop spill buffer reference When calling sa_update() and friends it is possible that a spill buffer will be needed to accomidate the update. When this happens a hold is taken on the new dbuf and that hold must be released before calling dmu_tx_commit(). Failing to release the hold will cause a copy of the dbuf to be made in dbuf_sync_leaf(). This is done to ensure further updates to the dbuf never sneak in to the syncing txg. This could be left to the sa_update() caller. But then the caller would need to be aware of this internal SA implementation detail. It is therefore preferable to handle this all internally in the SA implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #503 Closes #513	2012-08-25 09:26:10 -07:00
Brian Behlendorf	f828e63a0d	Revert "Use SA_HDL_PRIVATE for SA xattrs" This reverts commit `ec2626ad3f` which caused consistency problems between the shared and private handles. Reverting this change should resolve issues #709 and #727. It will also reintroduce an arc_anon memory leak which is addressed by the next commit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #709 Closes #727	2012-08-25 09:25:56 -07:00
Prakash Surya	15a9e03368	Wrap smp_processor_id in kpreempt_[dis\|en]able After surveying the code, the few places where smp_processor_id is used were deemed to be safe to use with a preempt enabled kernel. As such, no core logic had to be changed. These smp_processor_id call sites are simply are wrapped in kpreempt_disable and kpreempt_enabled to prevent the Linux kernel from emitting scary warnings. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Issue #83	2012-08-24 13:19:06 -07:00
Brian Behlendorf	4047414a6a	Export dmu_buf_rele() symbol While I'd like to remove the various pragmas in module/zfs/dbuf.c. There are consumers such as Lustre which still depend on dmu_buf_* versions of the symbols. Until all consumers can be converted to use only the dbuf_* names leave this symbol exported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-14 08:38:19 -07:00
Brian Behlendorf	bafc4e9e2a	Suppress 'zfs_sb_create' memory warning When mutex debugging is enabled in your kernel the increased size of the mutex structures can push the zfs_sb_t type beyond the 8k warning threshold. This isn't harmful so we suppress the warning for this case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #628	2012-08-10 16:43:32 -07:00
Brian Behlendorf	8f576c2321	Export dbuf_* symbols Export these symbols so they may be used by other ZFS consumers besides the ZPL. Remove three stale prototype definites from dbuf.h. The actual implementations of these functions were removed/renamed long ago. It would be good in the long term to remove the existing pragmas we inherited from Solaris and simply use the dbuf_* names. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-10 16:45:13 -07:00
Dan McDonald	d96eb2b153	Illumos #1693 : persistent 'comment' field for a zpool Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/1693 Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #678	2012-08-08 11:49:37 -07:00
Etienne Dechamps	ee5fd0bb80	Set zvol discard_granularity to the volblocksize. Currently, zvols have a discard granularity set to 0, which suggests to the upper layer that discard requests of arbirarily small size and alignment can be made efficiently. In practice however, ZFS does not handle unaligned discard requests efficiently: indeed, it is unable to free a part of a block. It will write zeros to the specified range instead, which is both useless and inefficient (see dnode_free_range). With this patch, zvol block devices expose volblocksize as their discard granularity, so the upper layer is aware that it's not supposed to send discard requests smaller than volblocksize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #862	2012-08-07 14:55:31 -07:00
Etienne Dechamps	7c0e570888	Limit the number of blocks to discard at once. The number of blocks that can be discarded in one BLKDISCARD ioctl on a zvol is currently unlimited. Some applications, such as mkfs, discard the whole volume at once and they use the maximum possible discard size to do that. As a result, several gigabytes discard requests are not uncommon. Unfortunately, if a large amount of data is allocated in the zvol, ZFS can be quite slow to process discard requests. This is especially true if the volblocksize is low (e.g. the 8K default). As a result, very large discard requests can take a very long time (seconds to minutes under heavy load) to complete. This can cause a number of problems, most notably if the zvol is accessed remotely (e.g. via iSCSI), in which case the client has a high probability of timing out on the request. This patch solves the issue by adding a new tunable module parameter: zvol_max_discard_blocks. This indicates the maximum possible range, in zvol blocks, of one discard operation. It is set by default to 16384 blocks, which appears to be a good tradeoff. Using the default volblocksize of 8K this is equivalent to 128 MB. When using the maximum volblocksize of 128K this is equivalent to 2 GB. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #858	2012-07-31 09:46:09 -07:00
Matthew Ahrens	330d06f90d	Illumos #1644 , #1645 , #1646 , #1647 , #1708 1644 add ZFS "clones" property 1645 add ZFS "written" and "written@..." properties 1646 "zfs send" should estimate size of stream 1647 "zfs destroy" should determine space reclaimed by destroying multiple snapshots 1708 adjust size of zpool history data References: https://www.illumos.org/issues/1644 https://www.illumos.org/issues/1645 https://www.illumos.org/issues/1646 https://www.illumos.org/issues/1647 https://www.illumos.org/issues/1708 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. This change has reordered all of the new ioctl()s to the end of the list. This should help minimize this issue in the future. Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Albert Lee <trisk@opensolaris.org> Approved by: Garrett D'Amore <garret@nexenta.com> Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #826 Closes #664	2012-07-31 09:25:30 -07:00
Etienne Dechamps	2ee4a18b2a	Add script for builtin module building. This commit introduces a "copy-builtin" script designed to prepare a kernel source tree for building ZFS as a builtin module. The script makes a full copy of all needed files, thus making the kernel source tree fully independent of the zfs source package. To achieve that, some compilation flags (-include, -I) have been moved to module/Makefile. This Makefile is only used when compiling external modules; when compiling builtin modules, a Kbuild file generated by the configure-builtin script is used instead. This makes sure Makefiles inside the kernel source tree does not contain references to the zfs source package. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #851	2012-07-26 13:45:09 -07:00
Richard Yao	739a1a82e0	Linux 3.5 compat, end_writeback() changed to clear_inode() The end_writeback() function was changed by moving the call to inode_sync_wait() earlier in to evict(). This effecitvely changes the ordering of the sync but it does not impact the details of the zfs implementation. However, as part of this change end_writeback() was renamed to clear_inode() to reflect the new semantics. This change does impact us and clear_inode() now maps to end_writeback() for kernels prior to 3.5. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #784	2012-07-23 12:29:36 -07:00
Richard Yao	ea1fdf46e2	Linux 3.5 compat, iops->truncate_range() removed The vmtruncate_range() support has been removed from the kernel in favor of using the fallocate method in the file_operations table. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #784	2012-07-23 12:29:32 -07:00
Richard Yao	756c3e5a9c	Linux 3.5 compat, eops->encode_fh() takes inodes The export_operations member ->encode_fh() has been updated to take both the child and parent inodes. This interface used to take the child dentry and a bool describing if the parent is needed. NOTE: While updating this code I noticed that we do not currently cleanly handle the case where we're passed a connectable parent. This code should be audited to make sure we're doing the right thing. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #784	2012-07-23 12:29:23 -07:00
Brian Behlendorf	fc173c8589	Disable .zfs directory on 32-bit systems The .zfs control directory implementation currently relies on the fact that there is a direct 1:1 mapping from an object id to its inode number. This works well as long as the system uses a 64-bit value to store the inode number. Unfortunately, the Linux kernel defines the inode number as an 'unsigned long' type. This means that for 32-bit systems will only have 32-bit inode numbers but we still have 64-bit object ids. This problem is particularly acute for the .zfs directories which leverage those upper 32-bits. This is done to avoid conflicting with object ids which are allocated monotonically starting from 0. This is likely to also be a problem for datasets on 32-bit systems with more than ~2 billion files. The right long term fix must remove the simple 1:1 mapping. Until that's done the only safe thing to do is to disable the .zfs directory on 32-bit systems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-07-20 12:20:57 -07:00
Brian Behlendorf	2a4a9dc2f0	Add ddt_object_load() error handling Add the missing error handling to ddt_object_load(). There's no good reason this needs to be fatal. It is preferable that an error be returned. This will allow 'zpool import -FX' to safely attempt to rollback through previous txgs looking for a good one. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-07-20 10:36:21 -07:00
Brian Behlendorf	10be533e33	Add 'inline' keyword The '__attribute__((always_inline))' does not strictly imply 'inline'. Newer versions of gcc detect this misuse and issue the following warning. Including the missing 'inline' resolves the build warning. ./module/zfs/dsl_scan.c:758:1:error: always_inline function might not be inlinable [-Werror=attributes] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-07-19 13:41:00 -07:00
Richard Yao	0a6b03d3b8	Fix build failures on PaX/GRSecurity patched kernels Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a dialect of C that relies on a GCC plugin. In particular, struct file_operations has been marked do_const in the PaX/GRSecurity dialect, which causes GCC to consider all instances of it as const. This caused failures in the autotools checks and the ZFS source code. To address this, we modify the autotools checks to take into account differences between the PaX C dialect and the regular C dialect. We also modify struct zfs_acl's z_ops member to be a pointer to a function pointer table. Lastly, we modify zpl_put_link() to address a PaX change to the function prototype of nd_get_link(). This avoids compiler errors in the PaX/GRSecurity dialect. Note that the change in zpl_put_link() causes a warning that becomes a build failure when debugging is enabled. Fixing that warning requires ryao/spl@5ca50ef459. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #484	2012-07-17 09:22:43 -07:00
Etienne Dechamps	b5a28807cd	Move partition scanning from userspace to module. Currently, zpool online -e (dynamic vdev expansion) doesn't work on whole disks because we're invoking ioctl(BLKRRPART) from userspace while ZFS still has a partition open on the disk, which results in EBUSY. This patch moves the BLKRRPART invocation from the zpool utility to the module. Specifically, this is done just before opening the device in vdev_disk_open() which is called inside vdev_reopen(). This requires jumping through some hoops to get to the disk device from the partition device, and to make sure we can still open the partition after the BLKRRPART call. Note that this new code path is triggered on dynamic vdev expansion only; other actions, like creating a new pool, are unchanged and still call BLKRRPART from userspace. This change also depends on API changes which are available in 2.6.37 and latter kernels. The build system has been updated to detect this, but there is no compatibility mode for older kernels. This means that online expansion will NOT be available in older kernels. However, it will still be possible to expand the vdev offline. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #808	2012-07-17 09:17:31 -07:00
George Wilson	c7f2d69de3	Illumos #1949 , #1953 1949 crash during reguid causes stale config 1953 allow and unallow missing from zpool history since removal of pyzfs Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Garrett D'Amore <garrett.damore@gmail.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Steve Gonczi <gonczi@comcast.net> Approved by: Eric Schrock <eric.schrock@delphix.com> References: https://www.illumos.org/issues/1949 https://www.illumos.org/issues/1953 Ported by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #665	2012-07-11 13:33:31 -07:00
Garrett D'Amore	3541dc6d02	Illumos #1748 : desire support for reguid in zfs Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com> Reviewed by: Alexander Stetsenko <ams@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/1748 This commit modifies the user to kernel space ioctl ABI. Extra care should be taken when updating to ensure both the kernel modules and utilities are updated. If only the user space component is updated both the 'zpool events' command and the 'zpool reguid' command will not work until the kernel modules are updated. Ported by: Martin Matuska <martin@matuska.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #665	2012-07-11 13:08:56 -07:00
Brian Behlendorf	42d3b990cf	Update incorrect ddt_zap_lookup() assertion When the ddt_zap_lookup() function was updated to dynamically allocate memory for the cbuf variable, to save stack space, the 'csize <= sizeof (cbuf)' assertion was not updated. The result of this was that the size of the pointer was being used in the comparison rather than the buffer size. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov>	2012-07-03 15:14:34 -07:00
Etienne Dechamps	b6ad9671ac	Add ZIL statistics. The performance of the ZIL is usually the main bottleneck when dealing with synchronous, write-heavy workloads (e.g. databases). Understanding the behavior of the ZIL is required to diagnose performance issues for these workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly. This commit adds a new kstat page dedicated to the ZIL with some counters which, hopefully, scheds some light into what the ZIL is doing, and how it is doing it. Currently, these statistics are available in /proc/spl/kstat/zfs/zil. A description of the fields can be found in zil.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #786	2012-06-29 09:56:51 -07:00
Pawel Jakub Dawidek	0cee24064a	Speed up 'zfs list -t snapshot -o name -s name' FreeBSD #xxx: Dramatically optimize listing snapshots when user requests only snapshot names and wants to sort them by name, ie. when executes: # zfs list -t snapshot -o name -s name Because only name is needed we don't have to read all snapshot properties. Below you can find how long does it take to list 34509 snapshots from a single disk pool before and after this change with cold and warm cache: before: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 525s warm cache: 218s after: # time zfs list -t snapshot -o name -s name > /dev/null cold cache: 1.7s warm cache: 1.1s NOTE: This patch only appears in FreeBSD. If/when Illumos picks up the change we may want to drop this patch and adopt their version. However, for now this addresses a real issue. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #450	2012-06-14 09:49:04 -07:00
Darik Horn	74497b7ab6	Add zvol_inhibit_dev module option. ZoL can create more zvols at runtime than can be configured during system start, which hangs the init stack at reboot. When a slow system has more than a few hundred zvols, udev will fork bomb during system start and spend too much time in device detection routines, so upstart kills it. The zfs_inhibit_dev option allows an affected system to be rescued by skipping /dev/zd* creation and thereby avoiding the udev overload. All zvols are made inaccessible if this option is set, but the `zfs destroy` and `zfs send` commands still work, and ZFS filesystems can be mounted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-06-13 17:05:16 -07:00
Etienne Dechamps	ee191e802c	Make zil_slog_limit a tunable module parameter. zil_slog_limit specifies the maximum commit size to be written to the separate log device. Larger commits bypass the separate log device and go directly to the data devices. The optimal value for zil_slog_limit directly depends on the latency and throughput characteristics of both the separate log device and the data disks. Small synchronous writes are faster on low-latency separate log devices (e.g. SSDs) whereas large synchronous writes are faster on high-latency data disks (e.g. spindles) because of higher throughput, especially with a large array. The point is, the line between "small" and "large" synchronous writes in this scenario is heavily dependent on the hardware used. That's why it should be made configurable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #783	2012-06-12 08:45:53 -07:00
Richard Yao	6a0936babc	Linux 3.4 compat, d_make_root() replaces d_alloc_root() torvalds/linux@adc0e91ab1 introduced introduced d_make_root() as a replacement for d_alloc_root(). Further commits appear to have removed d_alloc_root() from the Linux source tree. This causes the following failure: error: implicit declaration of function 'd_alloc_root' [-Werror=implicit-function-declaration] To correct this we update the code to use the current d_make_root() interface for readability. Then we introduce an autotools check to determine if d_make_root() is available. If it isn't then we define some compatibility logic which used the older d_alloc_root() interface. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #776	2012-06-11 10:04:49 -07:00
Etienne Dechamps	ab85f8455b	Honor logbias when writing to ZVOLs. The logbias option is not taken into account when writing to ZVOLs. We fix that by using the same logic as in the zfs filesystem write code (see zfs_log.c). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #774	2012-06-11 09:43:48 -07:00
Brian Behlendorf	710114089f	Revert "Disable direct reclaim on zvols" This reverts commit `ce90208cf9`. This change was observed to cause problems when using a zvol to back a VM under 2.6.32.59 kernels. This issue was filed as #710. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #342 Issue #710	2012-04-30 14:26:49 -07:00
Brian Behlendorf	b39d3b9f7b	Linux 3.3 compat, iops->create()/mkdir()/mknod() The mode argument of iops->create()/mkdir()/mknod() was changed from an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf check was added to detect the API change and then correctly set a zpl_umode_t typedef. There is no functional change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #701	2012-04-30 12:52:38 -07:00
Richard Yao	ce90208cf9	Disable direct reclaim on zvols Previously, it was possible for the direct reclaim path to be invoked when a write to a zvol was made. When a zvol is used as a swap device, this often causes swap requests to depend on additional swap requests, which deadlocks. We address this by disabling the direct reclaim path on zvols. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #342	2012-04-30 11:25:36 -07:00
Richard Yao	518b487602	Update ARC memory limits to account for SLUB internal fragmentation `23bdb07d4e` updated the ARC memory limits to be 1/2 of memory or all but 4GB. Unfortunately, these values assume zero internal fragmentation in the SLUB allocator, when in reality, the internal fragmentation could be as high as 50%, effectively doubling memory usage. This poses clear safety issues, because it permits the size of ARC to exceed system memory. This patch changes this so that the default value of arc_c_max is always 1/2 of system memory. This effectively limits the ARC to the memory that the system has physically installed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #660	2012-04-30 10:04:34 -07:00
Brian Behlendorf	302f753f16	Integrate ARC more tightly with Linux Under Solaris the ARC was designed to stay one step ahead of the VM subsystem. It would attempt to recognize low memory situtions before they occured and evict data from the cache. It would also make assessments about if there was enough free memory to perform a specific operation. This was all possible because Solaris exposes a fairly decent view of the memory state of the system to other kernel threads. Linux on the other hand does not make this information easily available. To avoid extensive modifications to the ARC the SPL attempts to provide these same interfaces. While this works it is not ideal and problems can arise when the ARC and Linux have different ideas about when your out of memory. This has manifested itself in the past as a spinning arc_reclaim_thread. This patch abandons the emulated Solaris interfaces in favor of the prefered Linux interface. That means moving the bulk of the memory reclaim logic out of the arc_reclaim_thread and in to the evict driven shrinker callback. The Linux VM will call this function when it needs memory. The ARC is then responsible for attempting to free the requested amount of memory if possible. Several interfaces have been modified to accomidate this approach, however the basic user space implementation remains the same. The following changes almost exclusively just apply to the kernel implementation. * Removed the hdr_recl() reclaim callback which is redundant with the broader arc_shrinker_func(). * Reduced arc_grow_retry to 5 seconds from 60. This is now used internally in the ARC with arc_no_grow to indicate that direct reclaim was recently performed. This typically indicates a rapid change in memory demands which the kswapd threads were unable to keep ahead of. As long as direct reclaim is happening once every 5 seconds arc growth will be paused to avoid further contributing to the existing memory pressure. The more common indirect reclaim paths will not set arc_no_grow. * arc_shrink() has been extended to take the number of bytes by which arc_c should be reduced. This allows for a more granual reduction of the arc target. Since the kernel provides a reclaim value to the arc_shrinker_func() this value is used instead of 1<<arc_shrink_shift. * arc_reclaim_needed() has been removed. It was used to determine if the system was under memory pressure and relied extensively on Solaris specific VM interfaces. In most case the new code just checks arc_no_grow which indicates that within the last arc_grow_retry seconds direct memory reclaim occurred. * arc_memory_throttle() has been updated to always include the amount of evictable memory (arc and page cache) in its free space calculations. This space is largely available in most call paths due to direct memory reclaim. * The Solaris pageout code was also removed to avoid confusion. It has always been disabled due to proc_pageout being defined as NULL in the Linux port. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-30 10:03:05 -07:00
Brian Behlendorf	afec56b43f	Add zfs_mdcomp_disable module option Expose the zfs_mdcomp_disable variable as a module option. This can be used to disable compression of zfs meta data which is enabled by default. This shouldn't need to be tuned but for most workloads, however there may be very specific instances where it makes sense to trade disk capacity for extra cpu cycles. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-27 16:28:02 -07:00
Brian Behlendorf	ebf8e3a237	Illumos #1909 : disk sync write perf regression when slog is used post oi_148 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Steve Gonczi <gonczi@comcast.net> Reviewed by: Garrett D'Amore <garrett.damore@gmail.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Albert Lee <trisk@nexenta.com> Approved by: Eric Schrock <eric.schrock@delphix.com> Refererces to Illumos issue: https://www.illumos.org/issues/1909 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #680	2012-04-19 16:26:29 -07:00
Prakash Surya	409dc1a570	Use KM_PUSHPAGE in l2arc_write_buffers There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE is not used for the allocations made in l2arc_write_buffers. Specifically, if KM_PUSHPAGE is not used for these allocations, it is possible for reclaim to be triggered which can cause the l2arc_feed thread to deadlock itself on the ARC_mru mutex. An example of this is demonstrated in the following backtrace of the l2arc_feed thread: crash> bt 4123 PID: 4123 TASK: ffff88062f8c1500 CPU: 6 COMMAND: "l2arc_feed" 0 [ffff88062511d610] schedule at ffffffff814eeee0 1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e 2 [ffff88062511d748] mutex_lock at ffffffff814f041b 3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs] 4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs] 5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs] 6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs] 7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs] 8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a 9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf 10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed 11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d 12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632 13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a 14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9 15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9 16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl] 17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs] 18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl] 19 [ffff88062511dee8] kthread at ffffffff81090a86 20 [ffff88062511df48] kernel_thread at ffffffff8100c14a Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-17 11:56:21 -07:00
Martin Matuska	7d5cd71da6	Illumos #1346 : zfs incremental receive may leave behind temporary clones 1356 zfs dataset prefetch code not working Reviewed by: Matthew Ahrens <matt@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1346 https://www.illumos.org/issues/1356 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #647	2012-04-11 12:02:27 -07:00
Albert Lee	22cd4a4653	Illumos #1475 : zfs spill block hold can access invalid spill blkptr Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1475 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #648	2012-04-11 11:46:30 -07:00
George Wilson	5ffb9d1d05	Illumos #1951 : leaking a vdev when removing an l2cache device 1952 memory leak when adding a file-based l2arc device 1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Bill Pijewski <wdp@joyent.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Eric Schrock <eric.schrock@delphix.com> References to Illumos issues: https://www.illumos.org/issues/1951 https://www.illumos.org/issues/1952 https://www.illumos.org/issues/1954 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #650	2012-04-11 11:32:06 -07:00
Martin Matuska	b129c6590e	OS-926: zfs panic in zfs_fill_zplprops_impl() This change appears to be exclusive to SmartOS. It is not present in illumos-gate but it just adds some needed error handling. This is clearly preferable to simply ASSERTING which is what would occur prior to the patch. Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #652	2012-04-11 11:29:19 -07:00
Andriy Gapon	3adfc400f5	Illumos #1680 : zfs vdev_file_io_start: validate vdev before using vdev_tsd vdev_tsd can be NULL for certain vdev states. At least in userland testing with ztest. References to Illumos issue: https://www.illumos.org/issues/1680 Ported-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #655	2012-04-11 11:23:18 -07:00
Brian Behlendorf	f0fd83be65	Export additional dsl symbols Principly these symbols were exported to get access to the dsl_prop_register/dsl_prop_unregister functions. They allow us to cleanly register a callback which is called when a dataset property is modified. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-11 09:26:55 -07:00
Gunnar Beutner	1f0d8a566f	Fixed a NULL pointer dereference bug in zfs_preumount When zpl_fill_super -> zfs_domount fails (e.g. because the dataset was destroyed before it could be successfully mounted) the subsequent call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer. This bug can be reproduced using this shell script: #!/bin/sh ( while true; do zfs create -o mountpoint=legacz tank/bar zfs destroy tank/bar done ) & ( while true; do mount -t zfs tank/bar /mnt umount /mnt done ) & Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #639	2012-04-05 11:29:42 -07:00
Brian Behlendorf	fc41c6402b	Properly expose the mfu ghost list kstats Due to a typo the mru ghost lists stats were accidentally being exposed as the mfu ghost list stats. This was harmless but confusing since memory usage could be over reported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-27 15:08:22 -07:00
Brian Behlendorf	1c5de20ae2	Add --enable-debug-dmu-tx configure option Allow rigorous (and expensive) tx validation to be enabled/disabled indepentantly from the standard zfs debugging. When enabled these checks ensure that all txs are constructed properly and that a dbuf is never dirtied without taking the correct tx hold. This checking is particularly helpful when adding new dmu consumers like Lustre. However, for established consumers such as the zpl with no known outstanding tx construction problems this is just overhead. --enable-debug-dmu-tx - Enable/disable validation of each tx as --disable-debug-dmu-tx it is constructed. By default validation is disabled due to performance concerns. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:25:17 -07:00
Brian Behlendorf	99ea23c583	Enhance a dmu_tx_dirty_buf() assertion The following assertion is good to validate the correctness of new DMU consumers, but it doesn't quite provide enough information. Slightly rework the assertion so that when it is hit the actual offending values will be included in the output. SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf()) ASSERTION(dn == NULL \|\| dn->dn_assigned_txg == tx->tx_txg) failed Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:24:05 -07:00
Brian Behlendorf	4b5d425f14	Add ZFS_META_RELEASE to module load/unload messages Include the ZFS_META_RELEASE in the module load/unload messages to more clearly indidcate exactly what version of ZFS has been loaded. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:14:35 -07:00
Brian Behlendorf	9ed86e7cc7	Account for .zfs ctldir inodes Because the .zfs ctldir inodes are not backed by physical storage they use a different create path which was not properly accounting for them as used. This could result in ->nr_cached_objects() returning 0 and cause a divide by zero error in prune_super(). In my option there's a kernel bug here too which allows this to happen. They should either be checking for 0 or adding +1 like they correctly do earlier in the function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #617	2012-03-22 15:43:55 -07:00
Brian Behlendorf	ebe7e575ea	Add .zfs control directory Add support for the .zfs control directory. This was accomplished by leveraging as much of the existing ZFS infrastructure as posible and updating it for Linux as required. The bulk of the core functionality is now all there with the following limitations. ) The .zfs/snapshot directory automount support requires a 2.6.37 or newer kernel. The exception is RHEL6.2 which has backported the d_automount patches. ) Creating/destroying/renaming snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory works as expected. However, this functionality is only available to root until zfs delegations are finished. * mkdir - create a snapshot * rmdir - destroy a snapshot * mv - rename a snapshot The following issues are known defeciences, but we expect them to be addressed by future commits. ) Add automount support for kernels older the 2.6.37. This should be possible using follow_link() which is what Linux did before. ) Accessing the .zfs/snapshot directory via NFS is not yet possible. The majority of the ground work for this is complete. However, finishing this work will require resolving some lingering integration issues with the Linux NFS kernel server. *) The .zfs/shares directory exists but no futher smb functionality has yet been implemented. Contributions-by: Rohan Puri <rohan.puri15@gmail.com> Contributiobs-by: Andrew Barnes <barnes333@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #173	2012-03-22 13:03:47 -07:00
Brian Behlendorf	49be0ccf1f	Add zio constructor/destructor Add a standard zio constructor and destructor. Normally, this is done to reduce to cost of allocating a new structure by reducing expensive operations such as memory allocations. However, in this case none of the operations moved out of zio_create() were really very expensive. This change was principly made as a debug patch (and workaround) for a zio_destroy() race. The is good evidence that zio_create() is reinitializing a mutex which is really still in use by another thread. This would completely explain the observed symptoms in the issue report. This patch doesn't fix the root cause of the race, but it should make it less likely by only initializing the mutex once in the constructor. Also, this particular flaw might have gone unnoticed in other zfs implementations due to the specific implementation details of Linux ticket spinlocks. Once the real root cause is determined and resolved this change can be safely reverted. Until then this should help workaround the issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #496	2012-03-21 14:51:44 -07:00
Brian Behlendorf	c8df41538d	Revert "Add zio constructor/destructor" This patch was slightly flawed and allowed for zio->io_logical to potentially not be reinitialized for a new zio. This could lead to assertion failures in specific cases when debugging is enabled (--enable-debug) and I/O errors are encountered. It may also have caused problems when issues logical I/Os. Since we want to make sure this workaround can be easily removed in the future (when we have the real fix). I'm reverting this change and applying a new version of the patch which includes the zio->io_logical fix. This reverts commit `2c6d0b1e07`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #602 Issue #604	2012-03-21 14:51:01 -07:00
Brian Behlendorf	77a405ae52	Add missing NULL in zpl_xattr_handlers The xattr_resolve_name() helper function expects the registered list of xattr handlers to be NULL terminated. This NULL was accidentally missing which could result in a NULL dereference. Interestingly this issue only manifested itself on certain 32-bit systems. Presumably on 64-bit kernels we just always happen to get lucky and the memory following the structure is zeroed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #594	2012-03-15 15:18:29 -07:00
Brian Behlendorf	0ece356db5	Add sa_spill_rele() interface Add a SA interface which allows us to release the spill block from a SA handle without destroying the handle. This is useful because we can then ensure that a copy of the dirty spill block is not made at sync time due to the extra hold. Susequent calls to sa_update() or sa_lookup() with transparently refetch the spill block dbuf from the ARC hash. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-07 16:28:00 -08:00
Brian Behlendorf	2c6d0b1e07	Add zio constructor/destructor Add a standard zio constructor and destructor. Normally, this is done to reduce to cost of allocating a new structure by reducing expensive operations such as memory allocations. However, in this case none of the operations moved out of zio_create() were really very expensive. This change was principly made as a debug patch (and workaround) for a zio_destroy() race. The is good evidence that zio_create() is reinitializing a mutex which is really still in use by another thread. This would completely explain the observed symptoms in the issue report. This patch doesn't fix the root cause of the race, but it should make it less likely by only initializing the mutex once in the constructor. Also, this particular flaw might have gone unnoticed in other zfs implementations due to the specific implementation details of Linux ticket spinlocks. Once the real root cause is determined and resolved this change can be safely reverted. Until then this should help workaround the issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #496	2012-03-07 16:06:23 -08:00
Brian Behlendorf	ec2626ad3f	Use SA_HDL_PRIVATE for SA xattrs A private SA handle must be used to ensure we can drop the dbuf hold on the spill block prior to calling dmu_tx_commit(). If we call dmu_tx_commit() before sa_handle_destroy(), then our hold will trigger a copy of the dbuf to be made. This is done to prevent data from leaking in to the syncing txg. As a result the original dirty spill block will remain cached. Additionally, relying on the shared zp->z_sa_hdl is unsafe in the xattr context because the znode may be asynchronously dropped from the cache. It's far safer and simpler just to use a private handle for xattrs. Plus any additional overhead is offset by the avoidance of the previously mentioned memory copy. These forever dirty buffers can be noticed in the arcstats under the anon_size. On a quiescent system the value should be zero. Without this fix and a SA xattr write workload you will see anon_size increase. Eventually, if enough dirty data builds up your system it will appear to hang. This occurs because the dmu won't allow new txs to be assigned until that dirty data is flushed, and it won't be because it's not part of an assigned tx. As an aside, I typically see anon_size lurk around 16k so I think there is another place in the code which needs a similar fix. However, this value doesn't grow over time so it isn't critical. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #503 Issue #513	2012-03-02 13:20:48 -08:00
Brian Behlendorf	570827e129	Add 'dmu_tx' kstats entry Keep counters for the various reasons that a thread may end up in txg_wait_open() waiting on a new txg. This can be useful when attempting to determine why a particular workload is under performing. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-27 08:59:10 -08:00
Brian Behlendorf	13be560d89	Add arc_state_t stats to arcstats To ensure the arc is behaving properly we need greater visibility in to exactly how it's managing the systems memory. This patch takes one step in that direction be adding the current arc_state_t for the anon, mru, mru_ghost, mfu, and mfs_ghost lists. The l2 arc_state_t is already well represented in the arcstats. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-27 08:58:59 -08:00
Alex Zhuravlev	a473d90cee	Export symbols for zero-copy Export additional symbols to make use of the DMU's zero-copy API. This allows external modules to move data in to and out of the ARC without incurring the cost of a memory copy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-17 12:43:02 -08:00
Richard Yao	b41c9906dc	Support ashift=13 for 8KB SSD block sizes New SSDs are now available which use an internal 8k block size. To make sure ZFS can get the maximum performance out of these devices we're increasing the maximum ashift to 13 (8KB). This value is still small enough that we can fit 16 uberblocks in the vdev ring label. However, I don't want to increase this any futher or it will limit the ability the safely roll back a pool to recover it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #565	2012-02-13 12:25:27 -08:00
Brian Behlendorf	b10c77f70a	Export symbols for zero-copy Exported the required symbols to make use of the DMU's zero-copy API. This allows external modules to move data in to and out of the ARC without incurring the cost of a memory copy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-10 11:56:55 -08:00
Brian Behlendorf	a31acb462d	Use spl_debug_* helpers When configuring the spl debug log support use the provided wrapper functions. This ensures that if --disable-debug-log was used when buiding the spl the functions will have no effect. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-09 16:37:48 -08:00
Etienne Dechamps	30930fba21	Add support for DISCARD to ZVOLs. DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus optimizing disk space usage by allowing a ZVOL to shrink instead of just grow. We can't use zfs_space() or zfs_freesp() here, since these functions only work on regular files, not volumes. Fortunately we can use the low-level function dmu_free_long_range() which does exactly what we want. Currently the discard operation is not added to the log. That's not a big deal since losing discard requests cannot result in data corruption. It would however result in disk space usage higher than it should be. Thus adding log support to zvol_discard() is probably a good idea for a future improvement. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-09 16:19:38 -08:00
Etienne Dechamps	cb2d19010d	Support the fallocate() file operation. Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported, since it's the only one that matches the behavior of zfs_space(). This makes it pretty much useless in its current form, but it's a start. To support other flag combinations we would need to modify zfs_space() to make it more flexible, or emulate the desired functionality in zpl_fallocate(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #334	2012-02-09 16:19:32 -08:00
Etienne Dechamps	aec69371a6	Check permissions in zfs_space(). This isn't done on Solaris because on this OS zfs_space() can only be called with an opened file handle. Since the addition of zpl_truncate_range() this isn't the case anymore, so we need to enforce access rights. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #334	2012-02-09 15:20:37 -08:00
Etienne Dechamps	5cb63a57f8	Implement the truncate_range() inode operation. This operation allows "hole punching" in ZFS files. On Solaris this is done via the vop_space() system call, which maps to the zfs_space() function. So we just need to write zpl_truncate_range() as a wrapper around zfs_space(). Note that this only works for regular files, not ZVOLs. This is currently an insecure implementation without permission checking, although this isn't that big of a deal since truncate_range() isn't even callable from userspace. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #334	2012-02-09 15:20:32 -08:00
Etienne Dechamps	dde9380a1b	Use 32 as the default number of zvol threads. Currently, the `zvol_threads` variable, which controls the number of worker threads which process items from the ZVOL queues, is set to the number of available CPUs. This choice seems to be based on the assumption that ZVOL threads are CPU-bound. This is not necessarily true, especially for synchronous writes. Consider the situation described in the comments for `zil_commit()`, which is called inside `zvol_write()` for synchronous writes: > itxs are committed in batches. In a heavily stressed zil there will be a > commit writer thread who is writing out a bunch of itxs to the log for a > set of committing threads (cthreads) in the same batch as the writer. > Those cthreads are all waiting on the same cv for that batch. > > There will also be a different and growing batch of threads that are > waiting to commit (qthreads). When the committing batch completes a > transition occurs such that the cthreads exit and the qthreads become > cthreads. One of the new cthreads becomes he writer thread for the batch. > Any new threads arriving become new qthreads. We can easily deduce that, in the case of ZVOLs, there can be a maximum of `zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is typically between 1 and 8, which is way too low in this case. This means there will be a lot of small commits to the ZIL, which is very inefficient compared to a few big commits, especially since we have to wait for the data to be on stable storage. Increasing the number of threads will increase the amount of data waiting to be commited and thus the size of the individual commits. On my system, in the context of VM disk image storage (lots of small synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50% increase in sequential synchronous write performance. We should choose a more sensible default for `zvol_threads`. Unfortunately the optimal value is difficult to determine automatically, since it depends on the synchronous write latency of the underlying storage devices. In any case, a hardcoded value of 32 would probably be better than the current situation. Having a lot of ZVOL threads doesn't seem to have any real downside anyway. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #392	2012-02-08 13:58:10 -08:00
Etienne Dechamps	34037afe24	Improve ZVOL queue behavior. The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering. Detailed rationale: - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide. - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide. - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow. - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-07 16:23:06 -08:00
Etienne Dechamps	b18019d2d8	Fix synchronicity for ZVOLs. zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does not mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h: WRITE: A normal async write. Device will be plugged. WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down the hint that someone will be waiting on this IO shortly. WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush. WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on non-volatile media on completion. In other words, SYNC does not mean that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs. The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set. Also, we must inform the block layer that we actually support FLUSH and FUA. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-07 16:23:06 -08:00
Etienne Dechamps	56c34bac44	Support "sync=always" for ZVOLs. Currently the "sync=always" property works for regular ZFS datasets, but not for ZVOLs. This patch remedies that. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #374.	2012-02-07 16:23:06 -08:00
Brian Behlendorf	47621f3d76	Linux 3.3 compat, sops->show_options() The second argument of sops->show_options() was changed from a 'struct vfsmount ' to a 'struct dentry '. Add an autoconf check to detect the API change and then conditionally define the expected interface. In either case we are only interested in the zfs_sb_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #549	2012-02-03 10:02:01 -08:00
Brian Behlendorf	d7e398ce1a	Cleanup ZFS debug infrastructure Historically the internal zfs debug infrastructure has been scattered throughout the code. Since we expect to start making more use of this code this patch performs some cleanup. * Consolidate the zfs debug infrastructure in the zfs_debug.[ch] files. This includes moving the zfs_flags and zfs_recover variables, plus moving the zfs_panic_recover() function. * Remove the existing unused functionality in zfs_debug.c and replace it with code which correctly utilized the spl logging infrastructure. * Remove the __dprintf() function from zfs_ioctl.c. This is dead code, the dprintf() functionality in the kernel relies on the spl log support. * Remove dprintf() from hdr_recl(). This wasn't particularly useful and was missing the required format specifier anyway. * Subsequent patches should unify the dprintf() and zfs_dbgmsg() functions. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:24:30 -08:00
Brian Behlendorf	0c5dde492f	Allow multiple values per directory entry When using zfs to back a Lustre filesystem it's advantageous to to store a fid with the object id in the directory zap. The only technical impediment to doing this is that the zpl code expects a single value in the zap per directory entry. This change relaxes that requirement such that multiple entries are allowed provided the first one is the object id. The zpl code will just ignore additional entries. This allows the ZoL count to mount datasets which are being used as Lustre server backends. Once the upstream feature flags support is merged in this change should be updated to a read-only feature. Until this occurs other zfs implementations will not be able to read the zfs filesystems created by Lustre. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:22:08 -08:00
Brian Behlendorf	e29be02e46	Export symbol zfs_attr_table Export the zfs_attr_table symbol so it may be used by non-zpl consumers which are still interested in writing a zpl compatible dataset (e.g. Lustre). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-27 09:23:36 -08:00
Richard Laager	57a4eddc4d	Allow setting bootfs on any pool The vdev_is_bootable() restrictions are no longer necessary with recent GRUB2 code. FreeBSD has implemented the same change, except that I moved the Solaris comment to be inside the #ifdef __sun__ block. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #317	2012-01-17 13:49:07 -08:00
Ned Bass	08d08ebba2	Reduce number of zio free threads As described in Issue #458 and #258, unlinking large amounts of data can cause the threads in the zio free wait queue to start spinning. Reducing the number of z_fr_iss threads from a fixed value of 100 to 1 per cpu signficantly reduces contention on the taskq spinlock and improves throughput. Instrumenting the taskq code showed that __taskq_dispatch() can spend a long time holding tq->tq_lock if there are a large number of threads in the queue. It turns out the time spent in wake_up() scales linearly with the number of threads in the queue. When a large number of short work items are dispatched, as seems to be the case with unlink, the worker threads drain the queue faster than the dispatcher can fill it. They then all pile into the work wait queue to wait for new work items. So if 100 threads are in the queue, wake_up() takes about 100 times as long, and the woken threads have to spin until the dispatcher releases the lock. Reducing the number of threads helps with the symptoms, but doesn't get to the root of the problem. It would seem that wake_up() shouldn't scale linearly in time with queue depth, particularly if we are only trying to wake up one thread. In that vein, I tried making all of the waiting processes exclusive to prevent the scheduler from iterating over the entire list, but I still saw the linear time scaling. So further investigation is needed, but in the meantime reducing the thread count is an easy workaround. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258 Issue #458	2012-01-17 08:54:00 -08:00
Darik Horn	96b91ef0d6	Apply the ZoL coding standard to zpl_xattr.c Make the indenting in the zpl_xattr.c file consistent with the Sun coding standard by removing soft tabs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-12 15:12:03 -08:00
Brian Behlendorf	166dd49de0	Linux 3.2 compat, security_inode_init_security() The security_inode_init_security() API has been changed to include a filesystem specific callback to write security extended attributes. This was done to support the initialization of multiple LSM xattrs and the EVM xattr. This change updates the code to use the new API when it's available. Otherwise it falls back to the previous implementation. In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY autoconf test has been made more rigerous by passing the expected types. This is done to ensure we always properly the detect the correct form for the security_inode_init_security() API. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #516	2012-01-12 15:06:39 -08:00
Brian Behlendorf	ab26409db7	Linux 3.1 compat, super_block->s_shrink The Linux 3.1 kernel has introduced the concept of per-filesystem shrinkers which are directly assoicated with a super block. Prior to this change there was one shared global shrinker. The zfs code relied on being able to call the global shrinker when the arc_meta_limit was exceeded. This would cause the VFS to drop references on a fraction of the dentries in the dcache. The ARC could then safely reclaim the memory used by these entries and honor the arc_meta_limit. Unfortunately, when per-filesystem shrinkers were added the old interfaces were made unavailable. This change adds support to use the new per-filesystem shrinker interface so we can continue to honor the arc_meta_limit. The major benefit of the new interface is that we can now target only the zfs filesystem for dentry and inode pruning. Thus we can minimize any impact on the caching of other filesystems. In the context of making this change several other important issues related to managing the ARC were addressed, they include: * The dnlc_reduce_cache() function which was called by the ARC to drop dentries for the Posix layer was replaced with a generic zfs_prune_t callback. The ZPL layer now registers a callback to drop these dentries removing a layering violation which dates back to the Solaris code. This callback can also be used by other ARC consumers such as Lustre. arc_add_prune_callback() arc_remove_prune_callback() * The arc_reduce_dnlc_percent module option has been changed to arc_meta_prune for clarity. The dnlc functions are specific to Solaris's VFS and have already been largely eliminated already. The replacement tunable now represents the number of bytes the prune callback will request when invoked. * Less aggressively invoke the prune callback. We used to call this whenever we exceeded the arc_meta_limit however that's not strictly correct since it results in over zeleous reclaim of dentries and inodes. It is now only called once the arc_meta_limit is exceeded and every effort has been made to evict other data from the ARC cache. * More promptly manage exceeding the arc_meta_limit. When reading meta data in to the cache if a buffer was unable to be recycled notify the arc_reclaim thread to invoke the required prune. * Added arcstat_prune kstat which is incremented when the ARC is forced to request that a consumer prune its cache. Remember this will only occur when the ARC has no other choice. If it can evict buffers safely without invoking the prune callback it will. * This change is also expected to resolve the unexpect collapses of the ARC cache. This would occur because when exceeded just the arc_meta_limit reclaim presure would be excerted on the arc_c value via arc_shrink(). This effectively shrunk the entire cache when really we just needed to reclaim meta data. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #466 Closes #292	2012-01-11 11:46:02 -08:00
Darik Horn	28eb9213d8	Linux 3.2 compat: set_nlink() Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170 Use the new set_nlink() kernel function instead. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #462	2011-12-16 20:02:52 -08:00
Garrett D'Amore	a38718a63d	Illumos #734 : Use taskq_dispatch_ent() interface It has been observed that some of the hottest locks are those of the zio taskqs. Contention on these locks can limit the rate at which zios are dispatched which limits performance. This upstream change from Illumos uses new interface to the taskqs which allow them to utilize a prealloc'ed taskq_ent_t. This removes the need to perform an allocation at dispatch time while holding the contended lock. This has the effect of improving system performance. Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com> Reviewed by: Jason Brian King <jason.brian.king@gmail.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/734 Ported-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #482	2011-12-14 09:19:30 -08:00
Brian Behlendorf	30a9524e45	Set zvol_major/zvol_threads permissions The zvol_major and zvol_threads module options were being created with 0 permission bits. This prevented them from being listed in the /sys/module/zfs/parameters/ directory, although they were visible in `modinfo zfs`. This patch fixes the issue by updating the permission bits to 0444. For the moment these options must be read-only because they are used during module initialization. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #392	2011-12-07 09:27:50 -08:00
Brian Behlendorf	23bdb07d4e	Update default ARC memory limits In the upstream OpenSolaris ZFS code the maximum ARC usage is limited to 3/4 of memory or all but 1GB, whichever is larger. Because of how Linux's VM subsystem is organized these defaults have proven to be too large which can lead to stability issues. To avoid making everyone manually tune the ARC the defaults are being changed to 1/2 of memory or all but 4GB. The rational for this is as follows: * Desktop Systems (less than 8GB of memory) Limiting the ARC to 1/2 of memory is desirable for desktop systems which have highly dynamic memory requirements. For example, launching your web browser can suddenly result in a demand for several gigabytes of memory. This memory must be reclaimed from the ARC cache which can take some time. The user will experience this reclaim time as a sluggish system with poor interactive performance. Thus in this case it is preferable to leave the memory as free and available for immediate use. * Server Systems (more than 8GB of memory) Using all but 4GB of memory for the ARC is preferable for server systems. These systems often run with minimal user interaction and have long running daemons with relatively stable memory demands. These systems will benefit most by having as much data cached in memory as possible. These values should work well for most configurations. However, if you have a desktop system with more than 8GB of memory you may wish to further restrict the ARC. This can still be accomplished by setting the 'zfs_arc_max' module option. Additionally, keep in mind these aren't currently hard limits. The ARC is based on a slab implementation which can suffer from memory fragmentation. Because this fragmentation is not visible from the ARC it may believe it is within the specified limits while actually consuming slightly more memory. How much more memory get's consumed will be determined by how badly fragmented the slabs are. In the long term this can be mitigated by slab defragmentation code which was OpenSolaris solution. Or preferably, using the page cache to back the ARC under Linux would be even better. See issue #75 for the benefits of more tightly integrating with the page cache. This change also fixes a issue where the default ARC max was being set incorrectly for machines with less than 2GB of memory. The constant in the arc_c_max comparison must be explicitly cast to a uint64_t type to prevent overflow and the wrong conditional branch being taken. This failure was typically observed in VMs which are commonly created with less than 2GB of memory. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #75	2011-12-05 12:02:12 -08:00
Brian Behlendorf	f31b3ebe6e	Allow xattrs on symlinks The Solaris version of ZFS does not allow xattrs to be set on symlinks due to the way they implemented the attropen() system call. Linux however implements xattrs through the lgetxattr() and lsetxattr() system calls which do not have this limitation. The only reason this hasn't always worked under ZFS on Linux is that the xattr handlers were not registered for symlink type inodes. This was done simply to be consistent with the Solaris behavior. Upon futher reflection I believe this should be allowed under Linux. The only ill effect would be that the xattrs on symlinks will not be visible when the pool is imported on a Solaris system. This also has the benefit that it allows for SELinux style security xattr labeling which expects to be able to set xattrs on all inode types. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #272	2011-11-29 10:24:24 -08:00
Brian Behlendorf	82a37189aa	Implement SA based xattrs The current ZFS implementation stores xattrs on disk using a hidden directory. In this directory a file name represents the xattr name and the file contexts are the xattr binary data. This approach is very flexible and allows for arbitrarily large xattrs. However, it also suffers from a significant performance penalty. Accessing a single xattr can requires up to three disk seeks. 1) Lookup the dnode object. 2) Lookup the dnodes's xattr directory object. 3) Lookup the xattr object in the directory. To avoid this performance penalty Linux filesystems such as ext3 and xfs try to store the xattr as part of the inode on disk. When the xattr is to large to store in the inode then a single external block is allocated for them. In practice most xattrs are small and this approach works well. The addition of System Attributes (SA) to zfs provides us a clean way to make this optimization. When the dataset property 'xattr=sa' is set then xattrs will be preferentially stored as System Attributes. This allows tiny xattrs (~100 bytes) to be stored with the dnode and up to 64k of xattrs to be stored in the spill block. If additional xattr space is required, which is unlikely under Linux, they will be stored using the traditional directory approach. This optimization results in roughly a 3x performance improvement when accessing xattrs which brings zfs roughly to parity with ext4 and xfs (see table below). When multiple xattrs are stored per-file the performance improvements are even greater because all of the xattrs stored in the spill block will be cached. However, by default SA based xattrs are disabled in the Linux port to maximize compatibility with other implementations. If you do enable SA based xattrs then they will not be visible on platforms which do not support this feature. ---------------------------------------------------------------------- Time in seconds to get/set one xattr of N bytes on 100,000 files ------+--------------------------------+------------------------------ \| setxattr \| getxattr bytes \| ext4 xfs zfs-dir zfs-sa \| ext4 xfs zfs-dir zfs-sa ------+--------------------------------+------------------------------ 1 \| 2.33 31.88 21.50 4.57 \| 2.35 2.64 6.29 2.43 32 \| 2.79 30.68 21.98 4.60 \| 2.44 2.59 6.78 2.48 256 \| 3.25 31.99 21.36 5.92 \| 2.32 2.71 6.22 3.14 1024 \| 3.30 32.61 22.83 8.45 \| 2.40 2.79 6.24 3.27 4096 \| 3.57 317.46 22.52 10.73 \| 2.78 28.62 6.90 3.94 16384 \| n/a 2342.39 34.30 19.20 \| n/a 45.44 145.90 7.55 65536 \| n/a 2941.39 128.15 131.32* \| n/a 141.92 256.85 262.12* Legend: * ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'. * xfs - Stock RHEL6.1 xfs mounted with default options. * zfs-dir - Directory based xattrs only. * zfs-sa - Prefer SAs but spill in to directories as needed, a trailing * indicates overflow in to directories occured. NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file. NOTE: XFS and ZFS have no limit on xattr name/value pairs per file. NOTE: Linux limits individual name/value pairs to 65536 bytes. NOTE: All setattr/getattr's were done after dropping the cache. NOTE: All tests were run against a single hard drive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #443	2011-11-28 15:45:51 -08:00
Brian Behlendorf	ca5fd24984	Limit maximum ashift value to 12 While we initially allowed you to set your ashift as large as 17 (SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered at the time is that each uberblock written to the vdev label ring buffer will be of this size. Now the buffer is statically sized to 128k and we need to be able to fit several uberblocks in it. With a large ashift that becomes a problem. Therefore I'm reducing the maximum configurable ashift value to 12. This is large enough for the 4k sector drives and small enough that we can still keep the most recent 32 uberblock in the vdev label ring buffer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #425	2011-11-11 14:50:48 -08:00
Brian Behlendorf	adcd70bd1a	Linux 3.1 compat, fops->fsync() The Linux 3.1 kernel updated the fops->fsync() callback yet again. They now pass the requested range and delegate the responsibility for calling filemap_write_and_wait_range() to the callback. In addition imutex is no longer held by the caller and the callback is responsible for taking the lock if required. This commit updates the code to provide a zpl_fsync() function for the updated API. Implementations for the previous two APIs are also maintained for compatibility. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #445	2011-11-10 10:03:08 -08:00
Brian Behlendorf	5547c2f1bf	Simplify BDI integration Update the code to use the bdi_setup_and_register() helper to simplify the bdi integration code. The updated code now just registers the bdi during mount and destroys it during unmount. The only complication is that for 2.6.32 - 2.6.33 kernels the helper wasn't available so in these cases the zfs code must provide it. Luckily the bdi_setup_and_register() function is trivial. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #367	2011-11-08 10:19:03 -08:00
Brian Behlendorf	591fb62f19	Disown dataset in zfs_sb_create() Fix an unlikely failure cause in zfs_sb_create() which could leave the dataset owned on error and thus unavailable until after a reboot. Disown the dataset if SA are expected but are in fact missing. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-11-08 10:18:40 -08:00
Brian Behlendorf	ae6ba3dbe6	Improve meta data performance Profiling the system during meta data intensive workloads such as creating/removing millions of files, revealed that the system was cpu bound. A large fraction of that cpu time was being spent waiting on the virtual address space spin lock. It turns out this was caused by certain heavily used kmem_caches being backed by virtual memory. By default a kmem_cache will dynamically determine the type of memory used based on the object size. For large objects virtual memory is usually preferable and for small object physical memory is a better choice. See the spl_slab_alloc() function for a longer discussion on this. However, there is a certain amount of gray area when defining a 'large' object. For the following caches it turns out they were just over the line: * dnode_cache * zio_cache * zio_link_cache * zio_buf_512_cache * zfs_data_buf_512_cache Now because we know there will be a lot of churn in these caches, and because we know the slabs will still be reasonably sized. We can safely request with the KMC_KMEM flag that the caches be backed with physical memory addresses. This entirely avoids the need to serialize on the virtual address space lock. As a bonus this also reduces our vmalloc usage which will be good for 32-bit kernels which have a very small virtual address space. It will also probably be good for interactive performance since unrelated processes could also block of this same global lock. Finally, we may see less cpu time being burned in the arc_reclaim and txg_sync_threads. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #258	2011-11-03 10:19:21 -07:00
Brian Behlendorf	6a95d0b74c	Fix NULL deref in balance_pgdat() Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering zpl_putpage() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #287	2011-11-03 10:15:39 -07:00
Gunnar Beutner	a7b125e9a5	Fix a race condition in zfs_getattr_fast() zfs_getattr_fast() was missing a lock on the ZFS superblock which could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member while zfs_getattr_fast() was accessing the znode. The result of this would usually be a panic. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #431	2011-11-03 10:13:09 -07:00
Xin Li	c475167627	Illumos #1661 : Fix flaw in sa_find_sizes() calculation When calculating space needed for SA_BONUS buffers, hdrsize is always rounded up to next 8-aligned boundary. However, in two places the round up was done against sum of 'total' plus hdrsize. On the other hand, hdrsize increments by 4 each time, which means in certain conditions, we would end up returning with will_spill == 0 and (total + hdrsize) larger than full_space, leading to a failed assertion because it's invalid for dmu_set_bonus. Reviewed by: Matthew Ahrens <matt@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue: https://www.illumos.org/issues/1661 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #426	2011-10-24 09:57:52 -07:00
Darik Horn	3cee2262a6	Change sun.com URLs to zfsonlinux.org ZFS contains error messages that point to the defunct www.sun.com domain, which is currently offline. Change these error messages to use the zfsonlinux.org mirror instead. This commit depends on: zfsonlinux/zfsonlinux.github.com@8e10ead3dc Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-24 09:52:21 -07:00
Brian Behlendorf	6f2255ba8a	Set mtime on symbolic links Register the setattr/getattr callbacks for symlinks. Without these the generic inode_setattr() and generic_fillattr() functions will be used. In the setattr case this will only result in the inode being updated in memory, the dirty_inode callback would also normally run but none is registered for zfs. The straight forward fix is to set the setattr/getattr callbacks for symlinks so they are handled just like files and directories. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #412	2011-10-18 15:49:31 -07:00
Alexander Stetsenko	8d35c1499d	Illumos #755 : dmu_recv_stream builds incomplete guid_to_ds_map An incomplete guid_to_ds_map would cause restore_write_byref() to fail while receiving a de-duplicated backup stream. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D`Amore <garrett@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/755 - https://github.com/illumos/illumos-gate/commit/ec5cf9d53a Signed-off-by: Gunnar Beutner <gunnar@beutner.name> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #372	2011-10-18 11:18:14 -07:00
Brian Behlendorf	86f35f34f4	Export symbols for the VFS API Export all symbols already marked extern in the zfs_vfsops.h header. Several non-static symbols have also been added to the header and exportewd. This allows external modules to more easily create and manipulate properly created ZFS filesystem type datasets. Rename zfsvfs_teardown() to zfs_sb_teardown and export it. This is done simply for consistency with the rest of the code base. All other zfsvfs_* functions have already been renamed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-11 10:25:59 -07:00
Brian Behlendorf	e45aa45298	Export symbols for the full SA API Export all the symbols for the system attribute (SA) API. This allows external module to cleanly manipulate the SAs associated with a dnode. Documention for the SA API can be found in the module/zfs/sa.c source. This change also removes the zfs_sa_uprade_pre, and zfs_sa_uprade_post prototypes. The functions themselves were dropped some time ago. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-05 15:59:56 -07:00
Andreas Dilger	baab063016	zpl: Fix "df -i" to have better free inodes value Due to the confusion in Linux statfs between f_frsize and f_bsize the blocks counts were changed to be in units of z_max_blksize instead of SPA_MINBLOCKSIZE as it is on other platforms. However, the free files calculation in zfs_statvfs() is limited by the free blocks count, since each dnode consumes one block/sector. This provided a reasonable estimate of free inodes, but on Linux this meant that the free inodes count was underestimated by a large amount, since 256 512-byte dnodes can fit into a 128kB block, and more if the max blocksize is increased to 1MB or larger. Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and may even change per dataset, and devices with large sectors setting ashift will also use a larger blocksize. Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT) to more accurately compute the maximum number of dnodes that can be created. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #413 Closes #400	2011-09-28 11:27:10 -07:00
Brian Behlendorf	dee28b0700	Export symbols for the full ZAP API Export all the symbols for the ZAP API. This allows external modules to cleanly interface with ZAP type objects. Previously only a subset of the functionality was exposed. Documention for the ZAP API can be found in the sys/zap.h header. This change also removes a duplicate zap_increment_int() prototype. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-27 16:12:36 -07:00
Brian Behlendorf	fa6e5ced2f	Suppress kmem_alloc() warning in zfs_prop_set_special() Suppress the warning for this large kmem_alloc() because it is not that far over the warning threshhold (8k) and it is short lived. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-15 20:26:51 -07:00
Brian Behlendorf	2708f716c0	Fix usage of zsb after free Caught by code inspection, the variable zsb was referenced after being freed. Move the kmem_free() to the end of the function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-09-09 10:29:48 -07:00
Brian Behlendorf	95d9fd028b	Fix incompatible pointer type warning This warning was accidentally introduced by commit `f3ab88d646` which updated the .readpages() implementation. The fix is to simply cast the helper function to the appropriate type when passed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-19 15:16:30 -07:00
Brian Behlendorf	f3ab88d646	Correctly lock pages for .readpages() Unlike the .readpage() callback which is passed a single locked page to be populated. The .readpages() callback is passed a list of unlocked pages which are all marked for read-ahead (PG_readahead set). It is the responsibly of .readpages() to ensure to pages are properly locked before being populated. Prior to this change the requested read-ahead pages would be updated outside of the page lock which is unsafe. The unlocked pages would then be unlocked again which is harmless but should have been immediately detected as bug. Unfortunately, newer kernels failed detect this issue because the check is done with a VM_BUG_ON which is disabled by default. Luckily, the old Debian Lenny 2.6.26 kernel caught this because it simply uses a BUG_ON. The straight forward fix for this is to update the .readpages() callback to use the read_cache_pages() helper function. The helper function will ensure that each page in the list is properly locked before it is passed to the .readpage() callback. In addition resolving the bug, this results in a nice simplification of the existing code. The downside to this change is that instead of passing one large read request to the dmu multiple smaller ones are submitted. All of these requests however are marked for readahead so the lower layers should issue a large I/O regardless. Thus most of the request should hit the ARC cache. Futher optimization of this code can be done in the future is a perform analysis determines it to be worthwhile. But for the moment, it is preferable that code be correct and understandable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #355	2011-08-08 13:24:52 -07:00
Brian Behlendorf	76659dc110	Add backing_device_info per-filesystem For a long time now the kernel has been moving away from using the pdflush daemon to write 'old' dirty pages to disk. The primary reason for this is because the pdflush daemon is single threaded and can be a limiting factor for performance. Since pdflush sequentially walks the dirty inode list for each super block any delay in processing can slow down dirty page writeback for all filesystems. The replacement for pdflush is called bdi (backing device info). The bdi system involves creating a per-filesystem control structure each with its own private sets of queues to manage writeback. The advantage is greater parallelism which improves performance and prevents a single filesystem from slowing writeback to the others. For a long time both systems co-existed in the kernel so it wasn't strictly required to implement the bdi scheme. However, as of Linux 2.6.36 kernels the pdflush functionality has been retired. Since ZFS already bypasses the page cache for most I/O this is only an issue for mmap(2) writes which must go through the page cache. Even then adding this missing support for newer kernels was overlooked because there are other mechanisms which can trigger writeback. However, there is one critical case where not implementing the bdi functionality can cause problems. If an application handles a page fault it can enter the balance_dirty_pages() callpath. This will result in the application hanging until the number of dirty pages in the system drops below the dirty ratio. Without a registered backing_device_info for the filesystem the dirty pages will not get written out. Thus the application will hang. As mentioned above this was less of an issue with older kernels because pdflush would eventually write out the dirty pages. This change adds a backing_device_info structure to the zfs_sb_t which is already allocated per-super block. It is then registered when the filesystem mounted and unregistered on unmount. It will not be registered for mounted snapshots which are read-only. This change will result in flush-<pool> thread being dynamically created and destroyed per-mounted filesystem for writeback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #174	2011-08-04 13:37:38 -07:00
Brian Behlendorf	3c0e5c0f45	Cleanup mmap(2) writes While the existing implementation of .writepage()/zpl_putpage() was functional it was not entirely correct. In particular, it would move dirty pages in to a clean state simply after copying them in to the ARC cache. This would result in the pages being lost if the system were to crash enough though the Linux VFS believed them to be safe on stable storage. Since at the moment virtually all I/O, except mmap(2), bypasses the page cache this isn't as bad as it sounds. However, as hopefully start using the page cache more getting this right becomes more important so it's good to improve this now. This patch takes a big step in that direction by updating the code to correctly move dirty pages through a writeback phase before they are marked clean. When a dirty page is copied in to the ARC it will now be set in writeback and a completion callback is registered with the transaction. The page will stay in writeback until the dmu runs the completion callback indicating the page is on stable storage. At this point the page can be safely marked clean. This process is normally entirely asynchronous and will be repeated for every dirty page. This may initially sound inefficient but most of these pages will end up in a few txgs. That means when they are eventually written to disk they should be nicely batched. However, there is room for improvement. It may still be desirable to batch up the pages in to larger writes for the dmu. This would reduce the number of callbacks and small 4k buffer required by the ARC. Finally, if the caller requires that the I/O be done synchronously by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O will trigger a zil_commit() to flush the data to stable storage. At which point the registered callbacks will be run leaving the date safe of disk and marked clean before returning from .writepage. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-02 10:34:55 -07:00
Martin Matuska	cddafdcbc5	Illumos #1313 : Integer overflow in txg_delay() The function txg_delay() is used to delay txg (transaction group) threads in ZFS. The timeout value for this function is calculated using: int timeout = ddi_get_lbolt() + ticks; Later, the actual wait is performed: while (ddi_get_lbolt() < timeout && tx->tx_syncing_txg < txg-1 && !txg_stalled(dp)) (void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock, timeout - ddi_get_lbolt()); The ddi_get_lbolt() function returns current uptime in clock ticks and is typed as clock_t. The clock_t type on 64-bit architectures is int64_t. The "timeout" variable will overflow depending on the tick frequency (e.g. for 1000 it will overflow in 28.855 days). This will make the expression "ddi_get_lbolt() < timeout" always false - txg threads will not be delayed anymore at all. This leads to a slowdown in ZFS writes. The attached patch initializes timeout as clock_t to match the return value of ddi_get_lbolt(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #352	2011-08-01 12:09:43 -07:00
Alexander Stetsenko	0b7936d5c2	Illumos #278 : get rid zfs of python and pyzfs dependencies Remove all python and pyzfs dependencies for consistency and to ensure full functionality even in a mimimalist environment. Reviewed by: gordon.w.ross@gmail.com Reviewed by: trisk@opensolaris.org Reviewed by: alexander.r.eremin@gmail.com Reviewed by: jerry.jelinek@joyent.com Approved by: garrett@nexenta.com References to Illumos issue and patch: - https://www.illumos.org/issues/278 - https://github.com/illumos/illumos-gate/commit/1af68beac3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340 Issue #160 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-08-01 12:09:36 -07:00
Martin Matuska	ca5252204a	Illumos #1043 : Recursive zfs snapshot destroy fails Prior to revision 11314 if a user was recursively destroying snapshots of a dataset the target dataset was not required to exist. The zfs_secpolicy_destroy_snaps() function introduced the security check on the target dataset, so since then if the target dataset does not exist, the recursive destroy is not performed. Before 11314, only a delete permission check on the snapshot's master dataset was performed. Steps to reproduce: zfs create pool/a zfs snapshot pool/a@s1 zfs destroy -r pool@s1 Therefore I suggest to fallback to the old security check, if the target snapshot does not exist and continue with the destroy. References to Illumos issue and patch: - https://www.illumos.org/issues/1043 - https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Eric Schrock	3e31d2b080	Illumos #883 : ZIL reuse during remount corruption Moving the zil_free() cleanup to zil_close() prevents this problem from occurring in the first place. There is a very good description of the issue and fix in Illumus #883. Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Garrett D'Amore <garrett@nexenta.com> Reivewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/883 - https://github.com/illumos/illumos-gate/commit/c9ba2a43cb Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Matt Ahrens	f5fc4acaa7	Illumos #1092 : zfs refratio property Add a "REFRATIO" property, which is the compression ratio based on data referenced. For snapshots, this is the same as COMPRESSRATIO, but for filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie, includes blocks in children, but not blocks shared with the origin). This is needed to figure out how much space a filesystem would use if it were not compressed (ignoring snapshots). Reviewed by: George Wilson <George.Wilson@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Mark Musante <Mark.Musante@oracle.com> Reviewed by: Garrett D'Amore <garrett@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/1092 - https://github.com/illumos/illumos-gate/commit/187d6ac08a Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
George Wilson	6d974228ef	Illumos #1051 : zfs should handle imbalanced luns Today zfs tries to allocate blocks evenly across all devices. This means when devices are imbalanced zfs will use lots of CPU searching for space on devices which tend to be pretty full. It should instead fail quickly on the full LUNs and move onto devices which have more availability. Reviewed by: Eric Schrock <Eric.Schrock@delphix.com> Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Approved by: Garrett D'Amore <garrett@nexenta.com> References to Illumos issue and patch: - https://www.illumos.org/issues/510 - https://github.com/illumos/illumos-gate/commit/5ead3ed965 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Garrett D'Amore	2cc6c8db12	Illumos #175 : zfs vdev cache consumes excessive memory Note that with the current ZFS code, it turns out that the vdev cache is not helpful, and in some cases actually harmful. It is better if we disable this. Once some time has passed, we should actually remove this to simplify the code. For now we just disable it by setting the zfs_vdev_cache_size to zero. Note that Solaris 11 has made these same changes. References to Illumos issue and patch: - https://www.illumos.org/issues/175 - https://github.com/illumos/illumos-gate/commit/b68a40a845 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Gordon Ross	ef3c1dea70	Illumos #764 : panic in zfs:dbuf_sync_list Hypothesis about what's going on here. At some time in the past, something, i.e. dnode_reallocate() calls one of: dbuf_rm_spill(dn, tx); These will do: dbuf_rm_spill(dnode_t dn, dmu_tx_t tx) dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx) dbuf_undirty(db, tx) Currently dbuf_undirty can leave a spill block in dn_dirty_records[], (it having been put there previously by dbuf_dirty) and free it. Sometime later, dbuf_sync_list trips over this reference to free'd (and typically reused) memory. Also, dbuf_undirty can call dnode_clear_range with a bogus block ID. It needs to test for DMU_SPILL_BLKID, similar to how dnode_clear_range is called in dbuf_dirty(). References to Illumos issue and patch: - https://www.illumos.org/issues/764 - https://github.com/illumos/illumos-gate/commit/3f2366c2bb Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: Mark.Maybe@oracle.com Reviewed by: Albert Lee <trisk@nexenta.com Approved by: Garrett D'Amore <garrett@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:11 -07:00
Tim Haley	7b8518cb8d	Illumos #xxx: zdb -vvv broken after zfs diff integration References to Illumos issue and patch: - https://github.com/illumos/illumos-gate/commit/163eb7ff Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #340	2011-08-01 12:09:02 -07:00
Brian Behlendorf	beb9826902	Fix txg_sync_thread deadlock Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE. Because these functions are called from txg_sync_thread we must ensure they don't reenter the zfs filesystem code via the .writepage callback. This would result in a deadlock. This deadlock is rare and has only been observed once under an abusive mmap() write workload. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-22 15:24:57 -07:00
Brian Behlendorf	22872ff5da	Use zfs_mknode() to create dataset root Long, long, long ago when the effort to port ZFS was begun the zfs_create_fs() function was heavily modified to remove all of its VFS dependencies. This allowed Lustre to use the dataset without us having to spend the time porting all the required VFS code. Fast-forward several years and we now have all the VFS code in place but are still relying on the modified zfs_create_fs(). This isn't required anymore and we can now use zfs_mknode() to create the root znode for the filesystem. This commit reverts the contents of zfs_create_fs() to largely match the upstream OpenSolaris code. There have been minor modifications to accomidate the Linux VFS but that is all. This code fixes issue #116 by bootstraping enough of the VFS data structures so we can rely on zfs_mknode() to create the root directory. This ensures it is created properly with support for system attributes. Previously it wasn't which is why it behaved differently that all other directories when modified. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #116	2011-07-20 19:52:26 -07:00
Brian Behlendorf	9fd91daeef	Honor setgit bit on directories Newly created files were always being created with the fsuid/fsgid in the current users credentials. This is correct except in the case when the parent directory sets the 'setgit' bit. In this case according to posix the newly created file/directory should inherit the gid of the parent directory. Additionally, in the case of a subdirectory it should also inherit the 'setgit' bit. Finally, this commit performs a little cleanup of the vattr_t initialization by moving it to a common helper function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #262	2011-07-20 14:07:13 -07:00
Brian Behlendorf	fe0ed8f910	Fix 'make install' overly broad 'rm' When running 'make install' without DESTDIR set the module install rules would mistakenly destroy the 'modules.*' files for ALL of your installed kernels. This could lead to a non-functional system for the alternate kernels because 'depmod -a' will only be run for the kernel which was compiled against. This issue would not impact anyone using the 'make <deb\|rpm\|pkg>' build targets to build and install packages. The fix for this issue is to only remove extraneous build products when DESTDIR is set. This almost exclusively indicates we are building packages and installed the build products in to a temporary staging location. Additionally, limit the removal the unneeded build products to the target kernel version. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #328	2011-07-20 09:38:51 -07:00
Brian Behlendorf	cfc9a5c88f	Fix zpl_writepage() deadlock Disable the normal reclaim path for zpl_putpage(). This ensures that all memory allocations under this call path will never enter direct reclaim. If this were to happen the VM might try to write out additional pages by calling zpl_putpage() again resulting in a deadlock. This sitution is typically handled in Linux by marking each offending allocation GFP_NOFS. However, since much of the code used is common it makes more sense to use PF_MEMALLOC to flag the entire call tree. Alternately, the code could be updated to pass the needed allocation flags but that's a more invasive change. The following example of the above described deadlock was triggered by test 074 in the xfstest suite. Call Trace: [<ffffffff814dcdb2>] down_write+0x32/0x40 [<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs] [<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs] [<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs] [<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs] [<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff8115464a>] alloc_pages_current+0xaa/0x110 [<ffffffff8111e36e>] __get_free_pages+0xe/0x50 [<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl] [<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl] [<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs] [<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs] [<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs] [<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs] [<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs] [<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs] [<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs] [<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs] [<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0 [<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs] [<ffffffff81122521>] do_writepages+0x21/0x40 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0 [<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240 [<ffffffff811308ea>] bdi_forker_task+0x6a/0x310 [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #327	2011-07-19 16:17:01 -07:00
Brian Behlendorf	abd39a8289	Fix zio_execute() deadlock To avoid deadlocking the system it is crucial that all memory allocations performed in the zio_execute() call path are marked KM_PUSHPAGE (GFP_NOFS). This ensures that while a z_wr_iss thread is processing the syncing transaction group it does not re-enter the filesystem code and deadlock on itself. Call Trace: [<ffffffffa02580e8>] cv_wait_common+0x78/0xe0 [spl] [<ffffffffa0347bab>] txg_wait_open+0x7b/0xa0 [zfs] [<ffffffffa030e73d>] dmu_tx_wait+0xed/0xf0 [zfs] [<ffffffffa0376a49>] zfs_putpage+0x219/0x360 [zfs] [<ffffffffa038d75e>] zpl_putpage+0x1e/0x60 [zfs] [<ffffffffa038d7b2>] zpl_writepage+0x12/0x20 [zfs] [<ffffffff8115f907>] writeout+0xa7/0xd0 [<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170 [<ffffffff8115fed4>] migrate_pages+0x434/0x4c0 [<ffffffff811559ab>] compact_zone+0x4fb/0x780 [<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0 [<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190 [<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0 [<ffffffff81159932>] kmem_getpages+0x62/0x170 [<ffffffff8115a54a>] fallback_alloc+0x1ba/0x270 [<ffffffff8115a2c9>] ____cache_alloc_node+0x99/0x160 [<ffffffff8115b059>] __kmalloc+0x189/0x220 [<ffffffffa02539fb>] kmem_alloc_debug+0xeb/0x130 [spl] [<ffffffffa031454a>] dnode_hold_impl+0x46a/0x550 [zfs] [<ffffffffa0314649>] dnode_hold+0x19/0x20 [zfs] [<ffffffffa03042e3>] dmu_read+0x33/0x180 [zfs] [<ffffffffa034729d>] space_map_load+0xfd/0x320 [zfs] [<ffffffffa03300bc>] metaslab_activate+0x10c/0x170 [zfs] [<ffffffffa0330ad9>] metaslab_alloc+0x469/0x800 [zfs] [<ffffffffa038963c>] zio_dva_allocate+0x6c/0x2f0 [zfs] [<ffffffffa038a249>] zio_execute+0x99/0xf0 [zfs] [<ffffffffa0254b1c>] taskq_thread+0x1cc/0x330 [spl] [<ffffffff8108ddf6>] kthread+0x96/0xa0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #291	2011-07-19 11:55:42 -07:00
Brian Behlendorf	a140dc5469	Fix mmap(2)/write(2)/read(2) deadlock When modifing overlapping regions of a file using mmap(2) and write(2)/read(2) it is possible to deadlock due to a lock inversion. The zfs_write() and zfs_read() hooks first take the zfs range lock and then lock the individual pages. Conversely, when using mmap'ed I/O the zpl_writepage() hook is called with the individual page locks already taken and then zfs_putpage() takes the zfs range lock. The most straight forward fix is to simply not take the zfs range lock in the mmap(2) case. The individual pages will still be locked thus serializing access. Updating the same region of a file with write(2) and mmap(2) has always been a dodgy thing to do. This change at a minimum ensures we don't deadlock and is consistent with the existing Linux semantics enforced by the VFS. This isn't an issue under Solaris because the only range locking performed will be with the zfs range locks. It's up to each filesystem to perform its own file locking. Under Linux the VFS provides many of these services. It may be possible/desirable at a latter date to entirely dump the existing zfs range locking and rely on the Linux VFS page locks. However, for now its safest to perform both layers of locking until zfs is more tightly integrated with the page cache. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #302	2011-07-19 11:55:42 -07:00
Brian Behlendorf	61f218b090	Fix send/recv 'dataset is busy' errors This commit fixes a regression which was accidentally introduced by the Linux 2.6.39 compatibility chanages. As part of these changes instead of holding an active reference on the namepsace (which is no longer posible) a reference is taken on the super block. This reference ensures the super block remains valid while it is in use. To handle the unlikely race condition of the filesystem being unmounted concurrently with the start of a 'zfs send/recv' the code was updated to only take the super block reference when there was an existing reference. This indicates that the filesystem is active and in use. Unfortunately, in the 'zfs recv' case this is not the case. The newly created dataset will not have a super block without an active reference which results in the 'dataset is busy' error. The most straight forward fix for this is to simply update the code to always take the reference even when it's zero. This may expose us to very very unlikely concurrent umount/send/recv case but the consequences of that are minor. Closes #319	2011-07-15 16:37:19 -07:00
Brian Behlendorf	057e8eee35	Improve fstat(2) performance There is at most a factor of 3x performance improvement to be had by using the Linux generic_fillattr() helper. However, to use it safely we need to ensure the values in a cached inode are kept rigerously up to date. Unfortunately, this isn't the case for the blksize, blocks, and atime fields. At the moment the authoritative values are still stored in the znode. This patch introduces an optimized zfs_getattr_fast() call. The idea is to use the up to date values from the inode and the blksize, block, and atime fields from the znode. At some latter date we should be able to strictly use the inode values and further improve performance. The remaining overhead in the zfs_getattr_fast() call can be attributed to having to take the znode mutex. This overhead is unavoidable until the inode is kept strictly up to date. The the careful reader will notice the we do not use the customary ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to ensure the filesystem is not torn down in the middle of an operation. However, in this case the VFS is holding a reference on the active inode so we know this is impossible. =================== Performance Tests ======================== This test calls the fstat(2) system call 10,000,000 times on an open file description in a tight loop. The test results show the zfs stat(2) performance is now only 22% slower than ext4. This is a 2.5x improvement and there is a clear long term plan to get to parity with ext4. filesystem \| test-1 test-2 test-3 \| average \| times-ext4 --------------+-------------------------+---------+----------- ext4 \| 7.785s 7.899s 7.284s \| 7.656s \| 1.000x zfs-0.6.0-rc4 \| 24.052s 22.531s 23.857s \| 23.480s \| 3.066x zfs-faststat \| 9.224s 9.398s 9.485s \| 9.369s \| 1.223x The second test is to run 'du' of a copy of the /usr tree which contains 110514 files. The test is run multiple times both using both a cold cache (/proc/sys/vm/drop_caches) and a hot cache. As expected this change signigicantly improved the zfs hot cache performance and doesn't quite bring zfs to parity with ext4. A little surprisingly the zfs cold cache performance is better than ext4. This can probably be attributed to the zfs allocation policy of co-locating all the meta data on disk which minimizes seek times. By default the ext4 allocator will spread the data over the entire disk only co-locating each directory. filesystem \| cold \| hot --------------+---------+-------- ext4 \| 13.318s \| 1.040s zfs-0.6.0-rc4 \| 4.982s \| 1.762s zfs-faststat \| 4.933s \| 1.345s	2011-07-11 09:11:22 -07:00
Brian Behlendorf	abd8610cd5	Add L2ARC tunables The performance of the L2ARC can be tweaked by a number of tunables, which may be necessary for different workloads: l2arc_write_max max write bytes per interval l2arc_write_boost extra write bytes during device warmup l2arc_noprefetch skip caching prefetched buffers l2arc_headroom number of max device writes to precache l2arc_feed_secs seconds between L2ARC writing l2arc_feed_min_ms min feed interval in milliseconds l2arc_feed_again turbo L2ARC warmup l2arc_norw no reads during writes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #316	2011-07-08 12:44:11 -07:00
Gunnar Beutner	3c9609b322	Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE. The remaining code that is guarded by HAVE_SHARE ifdefs is related to the .zfs/shares functionality which is currently not available on Linux. On Solaris the .zfs/shares directory can be used to set permissions for SMB shares. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Gunnar Beutner	46e18b3f0f	Implemented sharing datasets via NFS using libshare. The sharenfs and sharesmb properties depend on the libshare library to export datasets via NFS and SMB. This commit implements the base libshare functionality as well as support for managing NFS shares. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-06 09:20:28 -07:00
Brian Behlendorf	285226eff3	Always allow non-user xattrs Under Linux you may only disable USER xattrs. The SECURITY, SYSTEM, and TRUSTED xattr namespaces must always be available if xattrs are supported by the filesystem. The enforcement of USER xattrs is performed in the zpl_xattr_user_* handlers. Under Solaris there is only a single xattr namespace which is managed globally.	2011-07-01 13:39:48 -07:00
Rohan Puri	a89c3e0bd5	Support mandatory locks (nbmand) The Linux kernel already has support for mandatory locking. This change just replaces the Solaris mandatory locking calls with the Linux equivilants. In fact, it looks like this code could be removed entirely because this checking is already done generically in the Linux VFS. However, for now we'll leave it in place even if it is redundant just in case we missed something. The original patch to update the code to support mandatory locking was done by Rohan Puri. This patch is an updated version which is compatible with the previous mount option handling changes. Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #222 Closes #253	2011-07-01 13:39:40 -07:00
Brian Behlendorf	2cf7f52bc4	Linux compat 2.6.39: mount_nodev() The .get_sb callback has been replaced by a .mount callback in the file_system_type structure. When using the new interface the caller must now use the mount_nodev() helper. Unfortunately, the new interface no longer passes the vfsmount down to the zfs layers. This poses a problem for the existing implementation because we currently save this pointer in the super block for latter use. It provides our only entry point in to the namespace layer for manipulating certain mount options. This needed to be done originally to allow commands like 'zfs set atime=off tank' to work properly. It also allowed me to keep more of the original Solaris code unmodified. Under Solaris there is a 1-to-1 mapping between a mount point and a file system so this is a fairly natural thing to do. However, under Linux they many be multiple entries in the namespace which reference the same filesystem. Thus keeping a back reference from the filesystem to the namespace is complicated. Rather than introduce some ugly hack to get the vfsmount and continue as before. I'm leveraging this API change to update the ZFS code to do things in a more natural way for Linux. This has the upside that is resolves the compatibility issue for the long term and fixes several other minor bugs which have been reported. This commit updates the code to remove this vfsmount back reference entirely. All modifications to filesystem mount options are now passed in to the kernel via a '-o remount'. This is the expected Linux mechanism and allows the namespace to properly handle any options which apply to it before passing them on to the file system itself. Aside from fixing the compatibility issue, removing the vfsmount has had the benefit of simplifying the code. This change which fairly involved has turned out nicely. Closes #246 Closes #217 Closes #187 Closes #248 Closes #231	2011-07-01 13:36:39 -07:00
Brian Behlendorf	5c03efc379	Linux compat 2.6.39: security_inode_init_security() The security_inode_init_security() function now takes an additional qstr argument which must be passed in from the dentry if available. Passing a NULL is safe when no qstr is available the relevant security checks will just be skipped. Closes #246 Closes #217 Closes #187	2011-07-01 12:40:08 -07:00
Brian Behlendorf	e2e7aa2df8	Add ZFS specific mmap() checks Under Linux the VFS handles virtually all of the mmap() access checks. Filesystem specific checks are left to be handled in the .mmap() hook and normally there arn't any. However, ZFS provides a few attributes which can influence the mmap behavior and should be honored. Note, currently the code to modify these attributes has not been implemented under Linux. * ZFS_IMMUTABLE \| ZFS_READONLY \| ZFS_APPENDONLY: when any of these attributes are set a file may not be mmaped with write access. * ZFS_AV_QUARANTINED: when set a file file may not be mmaped with read or exec access. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-07-01 12:23:46 -07:00
Brian Behlendorf	f0b2486034	Remove unused MMAP functions The following functions were required for the OpenSolaris mmap implementation. Because the Linux VFS does most the most heavy lifting for us they are not required and are being removed to keep the code clean and easy to understand. * zfs_null_putapage() * zfs_frlock() * zfs_no_putpage() Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>	2011-07-01 12:22:57 -07:00
Prasad Joshi	dde471ef5a	MMAP Optimization Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions. The functions have been modified to make them Linux friendly. ZFS uses these functions to read/write the mmapped pages. Using them from readpage/writepage results in clear code. The patch also adds readpages and writepages interface functions to read/write list of pages in one function call. The code change handles the first mmap optimization mentioned on https://github.com/behlendorf/zfs/issues/225 Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov> Issue #255	2011-07-01 12:22:52 -07:00
Prasad Joshi	218b8eafbd	Use truncate_setsize in zfs_setattr According to Linux kernel commit 2c27c65e, using truncate_setsize in setattr simplifies the code. Therefore, the patch replaces the call to vmtruncate() with truncate_setsize(). zfs_setattr uses zfs_freesp to free the disk space belonging to the file. As truncate_setsize may release the page cache and flushing the dirty data to disk, it must be called before the zfs_freesp. Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:52 -07:00
Prasad Joshi	b312979252	Tear down and flush the mmap region The inode eviction should unmap the pages associated with the inode. These pages should also be flushed to disk to avoid the data loss. Therefore, use truncate_setsize() in evict_inode() to release the pagecache. The API truncate_setsize() was added in 2.6.35 kernel. To ensure compatibility with the old kernel, the patch defines its own truncate_setsize function. Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com> Closes #255	2011-06-27 09:59:19 -07:00
Brian Behlendorf	7e7baecaa3	Linux 3.0 compat, shrinker compatibility To accomindate the updated Linux 3.0 shrinker API the spl shrinker compatibility code was updated. Unfortunately, this couldn't be done cleanly without slightly adjusting the comapt API. See spl commit `a55bcaad18`. This commit updates the ZFS code to use the slightly modified API. You must use the latest SPL if your building ZFS.	2011-06-21 14:36:39 -07:00
Gunnar Beutner	b00131d43c	Fix unlink/xattr deadlock The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens: 1. File is created. 2. xattr is created for that file (behind the scenes a xattr directory and a file in that xattr directory are created) 3. File is deleted. 4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on the inodes When the xattr directory inode is evicted zfs_zinactive attempts to delete the xattr files contained in that directory. While enumerating these files zfs_zget() is called to obtain a reference to the xattr file znode - which tries to lock the xattr inode. However that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock. This can be reliably reproduced like this: $ touch test $ attr -s a -V b test $ rm test $ echo 3 > /proc/sys/vm/drop_caches This patch fixes the deadlock by moving the zfs_purgedir() call to zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the xattr dir is empty and leaves the xattr dir in the unlinked set if it finds any xattrs. To ensure zfs_unlinked_drain() never accesses a stale super block zfsvfs_teardown() has been update to block until the iput taskq has been drained. This avoids a potential race where a file with an xattr directory is removed and the file system is immediately unmounted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #266	2011-06-20 13:47:03 -07:00
Gunnar Beutner	6f0cf71e0d	Removed erroneous zfs_inode_destroy() calls from zfs_rmnode(). iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy() for us after zfs_zinactive(), thus making sure that the inode is properly cleaned up. The zfs_inode_destroy() calls in zfs_rmnode() would lead to a double-free. Fixes #282	2011-06-20 10:30:17 -07:00
Christian Kohlschütter	df30f56639	Add "ashift" property to zpool create Some disks with internal sectors larger than 512 bytes (e.g., 4k) can suffer from bad write performance when ashift is not configured correctly. This is caused by the disk not reporting its actual sector size, but a sector size of 512 bytes. The drive may behave this way for compatibility reasons. For example, the WDC WD20EARS disks are known to exhibit this behavior. When creating a zpool, ZFS takes that wrong sector size and sets the "ashift" property accordingly (to 9: 1<<9=512), whereas it should be set to 12 for 4k sectors (1<<12=4096). This patch allows an adminstrator to manual specify the known correct ashift size at 'zpool create' time. This can significantly improve performance in certain cases. However, it will have an impact on your total pool capacity. See the updated ashift property description in the zpool.8 man page for additional details. Valid values for the ashift property range from 9 to 17 (512B-128KB). Additionally, you may set the ashift to 0 if you wish to auto-detect the sector size based on what the disk reports, this is the default behavior. The most common ashift values are 9 and 12. Example: zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd Closes #280 Original-patch-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-06-17 16:35:49 -07:00
Brian Behlendorf	96801d2906	Linux 2.6.37 compat, WRITE_FLUSH_FUA The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been introduced as a replacement for WRITE_BARRIER. This was done to allow richer semantics to be expressed to the block layer. It is the block layers responsibility to choose the correct way to implement these semantics. This change simply updates the bio's to use the new kernel API which should be absolutely safe. However, since ZFS depends entirely on this working as designed for correctness we do want to be careful. Closes #281	2011-06-17 14:37:26 -07:00
Brian Behlendorf	e95b3bdcbb	Fix stack ddt_class_contains() Stack usage for ddt_class_contains() reduced from 524 bytes to 68 bytes. This large stack allocation significantly contributed to the likelyhood of a stack overflow when scrubbing/resilvering dedup pools.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	5b8c7bbcea	Fix stack ddt_zap_lookup() Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120 bytes. This large stack allocation significantly contributed to the likelyhood of a stack overflow when scrubbing/resilvering dedup pools.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	c7f8f831a4	Revert "Fix stack traverse_visitbp()" This abomination is no longer required because the zio's issued during this recursive call path will now be handled asynchronously by the taskq thread pool. This reverts commit `6656bf5621`.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	2fac4c2a74	Make tgx_sync_thread zio's async The majority of the recursive operations performed by the dsl are done either in the context of the tgx_sync_thread or during pool import. It is these recursive operations which contribute greatly to the stack depth. When this recursion is coupled with a synchronous I/O in the same context overflow becomes possible. Previously to handle this case I have focused on keeping the individual stack frames as light as possible. This is a good idea as long as it can be done in a way which doesn't overly complicate the code. However, there is a better solution. If we treat all zio's issued by the tgx_sync_thread as async then we can use the tgx_sync_thread stack for the recursive parts, and the zio_* threads for the I/O parts. This effectively doubles our available stack space with the only drawback being a small delay to schedule the I/O. However, in practice the scheduling time is so much smaller than the actual I/O time this isn't an issue. Another benefit of making the zio async is that the zio pipeline is now parallel. That should mean for CPU intensive pipelines such as compression or dedup performance may be improved. With this change in place the worst case stack usage observed so far is 6902 bytes. This is still higher than I'd like but significantly improved. Additional changes to specific functions should improve this further. This change allows us to revent commit `6656bf5` which did some horrible things to the recursive traverse_visitbp() callpath in the name of saving stack.	2011-05-31 12:17:27 -07:00
Brian Behlendorf	f74fae8b30	Fix 4K sector support Yesterday I ran across a 3TB drive which exposed 4K sectors to Linux. While I thought I had gotten this support correct it turns out there were 2 subtle bugs which prevented it from working. sudo ./cmd/zpool/zpool create -f large-sector /dev/sda cannot create 'large-sector': one or more devices is currently unavailable 1) The first issue was that it was possible that bdev_capacity() would return the number of 512 byte sectors rather than the number of 4096 sectors. Internally, certain Linux functions only operate with 512 byte sectors so you need to be careful. To avoid any confusion in the future I've updated bdev_capacity() to simply return the device (or partition) capacity in bytes. The higher levels of ZFS want the value in bytes anyway so this is cleaner. 2) When creating a bio the ->bi_sector count must always be expressed in 512 byte sectors. The existing code would scale the byte offset by the logical sector size. Until now this was always 512 so it never caused problems. Trying a 4K sector drive clearly exposed the issue. The problem has been fixed by hard coding the 512 byte sector which is exactly what the bio code does internally. With these changes I'm now able to create ZFS pools using 4K sector drives. No issues were observed during fairly extensive testing. This is also a low risk change if your using 512b sectors devices because none of the logic changes. Closes #256	2011-05-27 11:38:53 -07:00
Brian Behlendorf	2b8cad6159	Use vmem_alloc() for zfs_ioc_userspace_many() The default buffer size when requesting multiple quota entries is 100 times the zfs_useracct_t size. In practice this works out to exactly 27200 bytes. Since this will be a short lived buffer in a non-performance critical path it is preferable to vmem_alloc() the needed memory.	2011-05-20 14:23:18 -07:00
Brian Behlendorf	f01b360e67	Pass caller's credential in zfsdev_ioctl() Initially when zfsdev_ioctl() was ported to Linux we didn't have any credential support implemented. So at the time we simply passed NULL which wasn't much of a problem since most of the secpolicy code was disabled. However, one exception is quota handling which does require the credential. Now that proper credentials are supported we can safely start passing the callers credential. This is also an initial step towards fully implemented the zfs secpolicy.	2011-05-20 10:12:25 -07:00
Brian Behlendorf	3fd70ee6b0	Fix 'negative objects to delete' warning Normally when the arc_shrinker_func() function is called the return value should be: >=0 - To indicate the number of freeable objects in the cache, or -1 - To indicate this cache should be skipped However, when the shrinker callback is called with 'nr_to_scan' equal to zero. The caller simply wants the number of freeable objects in the cache and we must never return -1. This patch reorders the first two conditionals in arc_shrinker_func() to ensure this behavior. This patch also now explictly casts arc_size and arc_c_min to signed int64_t types so MAX(x, 0) works as expected. As unsigned types we would never see an negative value which defeated the purpose of the MAX() lower bound and broke the shrinker logic. Finally, when nr_to_scan is non-zero we explictly prevent all reclaim below arc_c_min. This is done to prevent the Linux page cache from completely crowding out the ARC. This limit is tunable and some experimentation is likely going to be required to set it exactly right. For now we're sticking with the OpenSolaris defaults. Closes #218 Closes #243	2011-05-18 10:29:22 -07:00
Brian Behlendorf	e814770f2e	Update synchronous open zfs_close() comment The comment in zfs_close() pertaining to decrementing the synchronous open count needs to be updated for Linux. The code was already updated to be correct, but the comment was missed and is now misleading. Under Linux the zfs_close() hook is only called once when the final reference is dropped. This differs from Solaris where zfs_close() is called for each close. Closes #237	2011-05-13 08:20:06 -07:00
Brian Behlendorf	c91d229809	Merge pull request #235 from nedbass/rdev Don't store rdev in SA for FIFOs and sockets	2011-05-09 16:41:28 -07:00
Ned A. Bass	aa6d8c1086	Don't store rdev in SA for FIFOs and sockets Update the handling of named pipes and sockets to be consistent with other platforms with regard to the rdev attribute. While all ZFS ipmlementations store the rdev for device files in a system attribute (SA), this is not the case for FIFOs and sockets. Indeed, Linux always passes rdev=0 to mknod() for FIFOs and sockets, so the value is not needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if the expected behavior ever changes. Closes #216	2011-05-09 13:35:07 -07:00
Brian Behlendorf	21ade34764	Disable direct reclaim for z_wr_* threads The direct reclaim path in the z_wr_* threads must be disabled to ensure forward progress is always maintained for txg processing. This ensures that a txg will never get stuck waiting on itself because it entered the following memory reclaim callpath. ->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive() ->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open() It would be preferable to target this exact code path but the kernel offers no way to do this without custom patches. To avoid this we are forced to disable all reclaim for these threads. It should not be necessary to do this for other other z_* threads because they will not hold a txg open. Closes #232	2011-05-06 15:26:26 -07:00
Brian Behlendorf	3117dd0b90	Handle NULL in nfsd .fsync() hook How nfsd handles .fsync() has been changed a couple of times in the recent kernels. But basically there are three cases we need to consider. Linux 2.6.12 - 2.6.33 * The .fsync() hook takes 3 arguments * The nfsd will call .fsync() with a NULL file struct pointer. Linux 2.6.34 * The .fsync() hook takes 3 arguments * The nfsd no longer calls .fsync() but instead used sync_inode() Linux 2.6.35 - 2.6.x * The .fsync() hook takes 2 arguments * The nfsd no longer calls .fsync() but instead used sync_inode() For once it looks like we've gotten lucky. The first two cases can actually be collased in to one if we stop using the file struct pointer entirely. Since the dentry is still passed in both cases this is possible. The last case can then be safely handled by unconditionally using the dentry in the file struct pointer now that we know the nfsd caller has been removed. Closes #230	2011-05-06 12:33:45 -07:00
Brian Behlendorf	34b84cb831	Use vmem_alloc() for zfs_ioc_pool_get_history() The default buffer size when requesting history is 128k. This is far to large for a kmem_alloc() so instead use the slower vmem_alloc(). This path has no performance concerns and the buffer is immediately free'd after its contents are copied to the user space buffer.	2011-05-06 09:59:52 -07:00
Brian Behlendorf	c409e4647f	Add missing ZFS tunables This commit adds module options for all existing zfs tunables. Ideally the average user should never need to modify any of these values. However, in practice sometimes you do need to tweak these values for one reason or another. In those cases it's nice not to have to resort to rebuilding from source. All tunables are visable to modinfo and the list is as follows: $ modinfo module/zfs/zfs.ko filename: module/zfs/zfs.ko license: CDDL author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory description: ZFS srcversion: 8EAB1D71DACE05B5AA61567 depends: spl,znvpair,zcommon,zunicode,zavl vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions parm: zvol_major:Major number for zvol device (uint) parm: zvol_threads:Number of threads for zvol device (uint) parm: zio_injection_enabled:Enable fault injection (int) parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int) parm: zio_delay_max:Max zio millisec delay before posting event (int) parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool) parm: zil_replay_disable:Disable intent logging replay (int) parm: zfs_nocacheflush:Disable cache flushes (bool) parm: zfs_read_chunk_size:Bytes to read per chunk (long) parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int) parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int) parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int) parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int) parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int) parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int) parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int) parm: zfs_vdev_scheduler:I/O scheduler (charp) parm: zfs_vdev_cache_max:Inflate reads small than max (int) parm: zfs_vdev_cache_size:Total size of the per-disk cache (int) parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int) parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int) parm: zfs_recover:Set to attempt to recover from fatal errors (int) parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp) parm: zfs_zevent_len_max:Max event queue length (int) parm: zfs_zevent_cols:Max event column width (int) parm: zfs_zevent_console:Log events to the console (int) parm: zfs_top_maxinflight:Max I/Os per top-level (int) parm: zfs_resilver_delay:Number of ticks to delay resilver (int) parm: zfs_scrub_delay:Number of ticks to delay scrub (int) parm: zfs_scan_idle:Idle window in clock ticks (int) parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int) parm: zfs_free_min_time_ms:Min millisecs to free per txg (int) parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int) parm: zfs_no_scrub_io:Set to disable scrub I/O (bool) parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool) parm: zfs_txg_timeout:Max seconds worth of delta per txg (int) parm: zfs_no_write_throttle:Disable write throttling (int) parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int) parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int) parm: zfs_write_limit_min:Min tgx write limit (ulong) parm: zfs_write_limit_max:Max tgx write limit (ulong) parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong) parm: zfs_write_limit_override:Override tgx write limit (ulong) parm: zfs_prefetch_disable:Disable all ZFS prefetching (int) parm: zfetch_max_streams:Max number of streams per zfetch (uint) parm: zfetch_min_sec_reap:Min time before stream reclaim (uint) parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint) parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong) parm: zfs_pd_blks_max:Max number of blocks to prefetch (int) parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int) parm: zfs_arc_min:Min arc size (ulong) parm: zfs_arc_max:Max arc size (ulong) parm: zfs_arc_meta_limit:Meta limit for arc size (ulong) parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int) parm: zfs_arc_grow_retry:Seconds before growing arc size (int) parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int) parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)	2011-05-04 10:02:37 -07:00
Brian Behlendorf	5f35b19007	Fully update inode when created When a new znode/inode pair is created both the znode and the inode should be immediately updated to the correct values. This was done for the znode and for most of the values in the inode, but not all of them. This normally wasn't a problem because most subsequent operations would cause the inode to be immediately updated. This change ensures the inode is now fully updated before it is inserted in to the inode hash. Closes #116 Closes #146 Closes #164	2011-05-02 14:04:19 -07:00
Brian Behlendorf	df554c148e	Fix 'zfs set volsize=N pool/dataset' This change fixes a kernel panic which would occur when resizing a dataset which was not open. The objset_t stored in the zvol_state_t will be set to NULL when the block device is closed. To avoid this issue we pass the correct objset_t as the third arg. The code has also been updated to correctly notify the kernel when the block device capacity changes. For 2.6.28 and newer kernels the capacity change will be immediately detected. For earlier kernels the capacity change will be detected when the device is next opened. This is a known limitation of older kernels. Online ext3 resize test case passes on 2.6.28+ kernels: $ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023 $ zpool create tank /tmp/zvol $ zfs create -V 500M tank/zd0 $ mkfs.ext3 /dev/zd0 $ mkdir /mnt/zd0 $ mount /dev/zd0 /mnt/zd0 $ df -h /mnt/zd0 $ zfs set volsize=800M tank/zd0 $ resize2fs /dev/zd0 $ df -h /mnt/zd0 Original-patch-by: Fajar A. Nugraha <github@fajar.net> Closes #68 Closes #84	2011-05-02 08:54:40 -07:00
Gunnar Beutner	055656d4f4	Implemented NFS export_operations. Implemented the required NFS operations for exporting ZFS datasets using the in-kernel NFS daemon.	2011-04-29 12:36:13 -07:00
Brian Behlendorf	5476e6952c	Suppress 'vdev_metaslab_init' memory warning The vdev_metaslab_init() function has been observed to allocate larger than 8k chunks. However, they are not much larger than 8k and it does this infrequently so it is allowed and the warning is supressed.	2011-04-27 09:35:18 -07:00
Brian Behlendorf	40a39e1103	Conserve stack in dsl_scan_visit() The dsl_scan_visit() function is a little heavy weight taking 464 bytes on the stack. This can be easily reduced for little cost by moving zap_cursor_t and zap_attribute_t off the stack and on to the heap. After this change dsl_scan_visit() has been reduced in size by 320 bytes. This change was made to reduce stack usage in the dsl_scan_sync() callpath which is recursive and has been observed to overflow the stack. Issue #174	2011-04-26 15:48:00 -07:00
Brian Behlendorf	b81c4ac9af	Conserve stack in dsl_scan_visitbp() This function is called recursively so everything possible must be done to limit its stack consumption. The dprintf_bp() debugging function adds 30 bytes of local variables to the function we cannot afford. By commenting out this debugging we save 30 bytes per recursion and depths of 13 are not uncommon. This yeilds a total stack saving of 390 bytes on our 8k stack. Issue #174	2011-04-26 15:48:00 -07:00
Brian Behlendorf	7a060636b0	Conserve stack in dsl_scan_visitbp() The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() -> dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume considerable stack resulting in a stack overflow (>8k). The cleanest way I see to fix this with minimal impact to the existing flow of code, and with the fewest performance concerns, is to always inline dsl_scan_recurse() and dsl_scan_visitdnode(). While this will increase the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces the stack requirements by removing the function call overhead. Issue #174	2011-04-26 13:37:35 -07:00
Brian Behlendorf	701b1f8168	Fix zvol deadlock It's possible for a zvol_write thread to enter direct memory reclaim while holding open a transaction group. This results in the system attempting to write out data to the disk to free memory. Unfortunately, this can't succeed because the the thread doing reclaim is holding open the txg which must be closed to be synced to disk. To prevent this the offending allocation is marked KM_PUSHPAGE which will prevent it from attempting writeback. Closes #191	2011-04-26 12:56:35 -07:00
Brian Behlendorf	e2448b0e62	Fix spurious -EFAULT when setting I/O scheduler Occasionally we would see an -EFAULT returned when setting the I/O scheduler on a vdev. This was caused an improperly formatted user mode helper command. This commit restructures the command to something simpler, allocates space for it dynamically to save stack, and removes the retry logic which is no longer needed. Closes #169	2011-04-22 14:55:35 -07:00
Brian Behlendorf	6a8f9b6bf0	Enforce ARC meta-data limits This change ensures the ARC meta-data limits are enforced. Without this enforcement meta-data can grow to consume all of the ARC cache pushing out data and hurting performance. The cache is aggressively reclaimed but this is a soft and not a hard limit. The cache may exceed the set limit briefly before being brought under control. By default 25% of the ARC capacity can be used for meta-data. This limit can be tuned by setting the 'zfs_arc_meta_limit' module option. Once this limit is exceeded meta-data reclaim will occur in 3 percent chunks, or may be tuned using 'arc_reduce_dnlc_percent'. Closes #193	2011-04-21 13:49:31 -07:00
Gunnar Beutner	36df284366	Fixed a use-after-free bug in zfs_zget(). Fixed a bug where zfs_zget could access a stale znode pointer when the inode had already been removed from the inode cache via iput -> iput_final -> ... -> zfs_zinactive but the corresponding SA handle was still alive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #180	2011-04-21 13:48:01 -07:00
Brian Behlendorf	d247f2a3cc	Suppress 'zfs receive' memory warning As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel which is 17808 bytes in size. This sort of thing in general should be avoided. However, since this should be an infrequent event for now we allow it and simply suppress the warning with the KM_NODEBUG flag. This can be revisited latter if/when it becomes an issue. Closes #178	2011-04-20 10:22:31 -07:00
Gunnar Beutner	bec30953cd	Truncate the xattr znode when updating existing attributes. If the attribute's new value was shorter than the old one the old code would leave parts of the old value in the xattr znode. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #203	2011-04-19 14:14:40 -07:00
Gunnar Beutner	274b7e79f3	Added missing initialization for va.va_dentry in zfs_get_xattrdir. Without this we may mistakenly believe we have a dentry and try to d_instantiate() it. This will result in the following BUG. It's important to note that while the xattr directory has an inode assoicated with it we never create a dentry for it. kernel BUG at fs/dcache.c:1418! Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #202	2011-04-19 13:57:41 -07:00
Brian Behlendorf	0fe3d820f5	Fix gcc compiler warning, dsl_pool_create() When compiling ZFS in user space gcc-4.6.0 correctly identifies the variable 'os' as being set but never used. This generates a warning and a build failure when using --enable-debug. However, the code is correct we only want to use 'os' for the kernel space builds. To suppress the warning the call was wrapped with a VERIFY() which has the nice side effect of ensuring the 'os' actually never is NULL. This was observed under Fedora 15. module/zfs/dsl_pool.c: In function ‘dsl_pool_create’: module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used [-Werror=unused-but-set-variable]	2011-04-19 09:04:51 -07:00
Brian Behlendorf	e30c0ada6d	Linux 2.6.39 compat, invalidate_inodes() Update code to use the spl_invalidate_inodes() wrapper. This hides some of the complexity of determining if invalidate_inodes() was exported, and if so what is its prototype. The second argument of spl_invalidate_inodes() determined the behavior of how dirty inodes are handled. By passing a zero we are indicated that we want those inodes to be treated as busy and skipped.	2011-04-19 08:57:23 -07:00
Brian Behlendorf	0d3ac5e735	Linux 2.6.29 compat, credentials The .sync_fs fix as applied did not use the updated SPL credential API. This broke builds on Debian Lenny, this change applies the needed fix to use the portable API. The original credential changes are part of commit `81e97e2187`.	2011-04-07 14:27:09 -07:00
Brian Behlendorf	eec8164771	Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Disable the normal reclaim path for the txg_sync thread. This ensures the thread will never enter dmu_tx_assign() which can otherwise occur due to direct reclaim. If this is allowed to happen the system can deadlock. Direct reclaim call path: ->shrink_icache_memory->prune_icache->dispose_list-> clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign	2011-04-07 09:52:16 -07:00
Brian Behlendorf	7cb67b45f3	Add direct+indirect ARC reclaim Under OpenSolaris all memory reclaim is done asyncronously. Under Linux memory reclaim is done asynchronously _and_ synchronously. When a process allocates memory with GFP_KERNEL it explicitly allows the kernel to do reclaim on its behalf to satify the allocation. If that GFP_KERNEL allocation fails the kernel may take more drastic measures to reclaim the memory such as killing user space processes. This was observed to happen with ZFS because the ARC could consume a large fraction of the system memory but no synchronous reclaim could be performed on it. The result was GFP_KERNEL allocations could fail resulting in OOM events, and only moments latter the arc_reclaim thread would free unused memory from the ARC. This change leaves the arc_thread in place to manage the fundamental ARC behavior. But it adds a synchronous (direct) reclaim path for the ARC which can be called when memory is badly needed. It also adds an asynchronous (indirect) reclaim path which is called much more frequently to prune the ARC slab caches.	2011-04-07 09:52:10 -07:00
Brian Behlendorf	1834f2d8b7	Add missing arcstats The following useful values were missing the arcstats. This change adds them in to provide greater visibility in to the arcs behavior. arc_no_grow 4 0 arc_tempreserve 4 0 arc_loaned_bytes 4 0 arc_meta_used 4 624774592 arc_meta_limit 4 400785408 arc_meta_max 4 625594176	2011-04-07 09:52:05 -07:00
Brian Behlendorf	c85b224faf	Call d_instantiate before unlocking inode Under Linux a dentry referencing an inode must be instantiated before the inode is unlocked. To accomplish this without overly modifing the core ZFS code the dentry it passed via the vattr_t. There are cases such as replay when a dentry is not available. In which case it is obviously not initialized at inode creation time, if a dentry is needed it will be spliced as when required via d_lookup().	2011-04-07 09:51:57 -07:00
Brian Behlendorf	d433c20651	Fix `make distclean` for `./configure --with-config=user Making distclean in module make[1]: Entering directory `/zfs/module' make -C SUBDIRS=`pwd` clean make: Entering an unknown directory make: *** SUBDIRS=/zfs/module: No such file or directory. Stop. When using --with-config=user the 'distclean' target would fail because it assumes the kernel configuration infrastrure is set up. This is not the case, nor does it need to be, because the '--with-config=user' option will prune the entire ./module subtree from SUBDIRS. This prevents most build rules from operating in the ./module directory. However, the 'dist*' rules will still traverse this directory because it is listed in DIST_SUBDIRS. This is correct because we need to ensure the dist rules package the directory contents regardless of the configuration for the 'dist' rule. The correct way to handle this is to only invoke the kernel build system as part of the 'clean' rule when CONFIG_KERNEL_TRUE is set. Initial fix provided by Darik Horn <dajhorn@vanadac.com>. This commit is a slightly refined form of the original.	2011-04-05 13:33:28 -07:00
Brian Behlendorf	bfd214af01	Fix inflated load average Kernel threads which sleep uninterruptibly on Linux are marked in the (D) state. These threads are usually in the process of performing IO and are thus counted against the load average. The txg_quiesce and txg_sync threads were always sleeping uninterruptibly and thus inflating the load average. This change makes them sleep interruptibly. Some care is required however because these threads may now be woken early by signals. In this case the callers are all careful to check that the required conditions are met after waking up. If we're woken early due to a signal they will simply go back to sleep. In this case these changes are safe. Closes #175	2011-03-31 17:07:12 -07:00
Brian Behlendorf	7a1cdc0775	Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29 Since these hooks are currently unused they are being removed to allow support of older kernels.	2011-03-22 12:17:24 -07:00
Brian Behlendorf	81e97e2187	Linux 2.6.29 compat, credentials As of Linux 2.6.29 a clean credential API was added to the Linux kernel. Previously the credential was embedded in the task_struct. Because the SPL already has considerable support for handling this API change the ZPL code has been updated to use the Solaris credential API.	2011-03-22 12:15:54 -07:00
Brian Behlendorf	d6bd8eaae4	Fix evict() deadlock Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility of synchronous reclaim deadlocks. These deadlocks never existed in the original OpenSolaris code because all memory reclaim on Solaris is done asyncronously. Linux does both synchronous (direct) and asynchronous (indirect) reclaim. This commit addresses a deadlock caused by inode eviction. A KM_SLEEP allocation may trigger direct memory reclaim and shrink the inode cache. This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()-> zfs_zinactive() call path the same mutex may be reacquired resulting in a deadlock. To avoid this deadlock the process must not reacquire the mutex when it is already holding it. This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD mutex locking should be reevaluated. This infrastructure already prevents us from ever using the Linux lock dependency analysis tools, and it may limit scalability.	2011-03-22 12:14:55 -07:00
Brian Behlendorf	691f6ac4c2	Use KM_PUSHPAGE instead of KM_SLEEP It used to be the case that all KM_SLEEP allocations were GFS_NOFS. Unfortunately this often resulted in the kernel being unable to reclaim the ARC, inode, and dentry caches in a timely manor. The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL. However, this increases the posibility of deadlocking the system on a zfs write thread. If a zfs write thread attempts to perform an allocation it may trigger synchronous reclaim. This reclaim may attempt to flush dirty data/inode to disk to free memory. Unforunately, this write cannot finish because the write thread which would handle it is holding the previous transaction open. Deadlock. To avoid this all allocations in the zfs write thread path must use KM_PUSHPAGE which prohibits synchronous reclaim for that thread. In this way forward progress in ensured. The risk with this change is I missed updating an allocation for the write threads leaving an increased posibility of deadlock. If any deadlocks remain they will be unlikely but we'll have to make sure they all get fixed.	2011-03-22 12:14:55 -07:00
Brian Behlendorf	0de19dad9c	Register .remount_fs handler Register the missing .remount_fs handler. This handler isn't strictly required because the VFS does a pretty good job updating most of the MS_* flags. However, there's no harm in using the hook to call the registered zpl callback for various MS_* flags. Additionaly, this allows us to lay the ground work for more complicated argument parsing in the future.	2011-03-15 13:33:29 -07:00
Brian Behlendorf	03f9ba9d99	Register .sync_fs handler Register the missing .sync_fs handler. This is a noop in most cases because the usual requirement is that sync just be initiated. As part of the DMU's normal transaction processing txgs will be frequently synced. However, when the 'wait' flag is set the requirement is that .sync_fs must not return until the data is safe on disk. With the addition of the .sync_fs handler this is now properly implemented.	2011-03-15 13:33:29 -07:00
Brian Behlendorf	04516a45b2	Don't set I/O Scheduler for Partitions ZFS should only change the i/o scheduler for a disk when it has ownership of the whole disk. This is basically the same logic as adjusting the write cache behavior on a disk. This change updates the vdev disk code to skip partitions when setting the i/o scheduler. Closes #152	2011-03-10 13:34:17 -08:00
Brian Behlendorf	adf2e8778e	Fix O_APPEND Corruption Due to an uninitialized variable files opened with O_APPEND may overwrite the start of the file rather than append to it. This was introduced accidentally when I removed the Solaris vnodes. The zfs_range_lock_writer() function used to key off zf->z_vnode to determine if a znode_t was for a zvol of zpl object. With the removal of vnodes this was replaced by the flag zp->z_is_zvol. This flag was used to control the append behavior for range locks. Unfortunately, this value was never properly initialized after the vnode removal. However, because most of memory is usually zeros it happened to be set correctly most of the time making the bug appear racy. Properly initializing zp->z_is_zvol to zero completely resolves the problem with O_APPEND. Closes #126	2011-03-09 13:31:00 -08:00
Brian Behlendorf	17c37660a1	Conserve stack in zfs_setattr() Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize stack space usage. These two arrays consumed 448 bytes on the stack and have been replaced by two 8 byte points for a total stack space saving of 432 bytes. The zfs_setattr() path had been previously observed to overrun the stack in certain circumstances.	2011-03-09 13:30:03 -08:00
Brian Behlendorf	450dc149bd	Range lock performance improvements The original range lock implementation had to be modified by commit `8926ab7` because it was unsafe on Linux. In particular, calling cv_destroy() immediately after cv_broadcast() is dangerous because the waiters may still be asleep. Thus the following cv_destroy() will free memory which may still be in use. This was fixed by updating cv_destroy() to block on waiters but this in turn introduced a deadlock. The deadlock was resolved with the use of a taskq to move the offending free outside the range lock. This worked well but using the taskq for the free resulted in a serious performace hit. This is somewhat ironic because at the time I felt using the taskq might improve things by making the free asynchronous. This patch refines the original fix and moves the free from the taskq to a private free list. Then items which must be free'd are simply inserted in to the list. When the range lock is dropped it's safe to free the items. The list is walked and all rl_t entries are freed. This change improves small cached read performance by 26x. This was expected because for small reads the number of locking calls goes up significantly. More surprisingly this change significantly improves large cache read performance. This probably attributable to better cpu/memory locality. Very likely the same processor which allocated the memory is now freeing it. bs ext3 zfs zfs+fix faster ---------------------------------------------- 512 435 3 79 26x 1k 820 7 160 22x 2k 1536 14 305 21x 4k 2764 28 572 20x 8k 3788 50 1024 20x 16k 4300 86 1843 21x 32k 4505 138 2560 18x 64k 5324 252 3891 15x 128k 5427 276 4710 17x 256k 5427 413 5017 12x 512k 5427 497 5324 10x 1m 5427 521 5632 10x Closes #142	2011-03-08 12:44:06 -08:00
Brian Behlendorf	126400a1ca	Add zfs_open()/zfs_close() In the original implementation the zfs_open()/zfs_close() hooks were dropped for simplicity. This was functional but not 100% correct with the expected ZFS sematics. Updating and re-adding the zfs_open()/zfs_close() hooks resolves the following issues. 1) The ZFS_APPENDONLY file attribute is once again honored. While there are still no Linux tools to set/clear these attributes once there are it should behave correctly. 2) Minimal virus scan file attribute hooks were added. Once again this support in disabled but the infrastructure is back in place. 3) Most importantly correctly handle assigning files which were opened syncronously to the intent log. Without this change O_SYNC modifications could be lost during a system crash even though they were marked synchronous.	2011-03-08 11:04:51 -08:00
Brian Behlendorf	53cf50e081	Set stat->st_dev and statfs->f_fsid Filesystems like ZFS must use what the kernel calls an anonymous super block. Basically, this is just a filesystem which is not backed by a single block device. Normally this block device's dev_t is stored in the super block. For anonymous super blocks a unique reserved dev_t is assigned as part of get_sb(). This sb->s_dev must then be set in the returned stat structures as stat->st_dev. This allows userspace utilities to easily detect the boundries of a specific filesystem. Tools such as 'du' depend on this for proper accounting. Additionally, under OpenSolaris the statfs->f_fsid is set to the device id. To preserve consistency with OpenSolaris we also set the fsid to the device id. Other Linux filesystem (ext) set the fsid to a unique value determined by the filesystems uuid. This value is unique but maintains no relationship to the device id. This may be desirable when exporting NFS filesystem because it minimizes to chance of a client observing the same fsid from two different servers. Closes #140	2011-03-07 16:06:22 -08:00
Brian Behlendorf	6742abf9ec	Use Linux ATTR_ versions The AT_ versions of these macros are used on Solaris and while they map to their Linux equivilants the code has been updated to use the ATTR_ versions.	2011-03-03 11:29:15 -08:00
Brian Behlendorf	f4ea75d492	Conserve stack in zfs_setattr() Move 'tmpxvattr' from the stack to the heap to minimize stack space usage. This is enough to get us below the 1024 byte stack frame warning. That however is still a large stack frame and it should be further reduced by moving the 'bulk' and 'xattr_bulk' sa_bulk_attr_t variables to the heap in a future patch.	2011-03-02 14:18:58 -08:00
Brian Behlendorf	5484965ab6	Drop HAVE_XVATTR macros When I began work on the Posix layer it immediately became clear to me that to integrate cleanly with the Linux VFS certain Solaris specific things would have to go. One of these things was to elimate as many Solaris specific types from the ZPL layer as possible. They would be replaced with their Linux equivalents. This would not only be good for performance, but for the general readability and health of the code. The Solaris and Linux VFS are different beasts and should be treated as such. Most of the code remains common for constructing transactions and such, but there are subtle and important differenced which need to be repsected. This policy went quite for for certain types such as the vnode_t, and it initially seemed to be working out well for the vattr_t. There was a relatively small amount of related xvattr_t code I was forced to comment out with HAVE_XVATTR. But it didn't look that hard to come back soon and replace it all with a native Linux type. However, after going doing this path with xvattr some distance it clear that this code was woven in the ZPL more deeply than I thought. In particular its hooks went very deep in to the ZPL replay code and replacing it would not be as easy as I originally thought. Rather than continue persuing replacing and removing this code I've taken a step back and reevaluted things. This commit reverts many of my previous commits which removed xvattr related code. It restores much of the code to its original upstream state and now relies on improved xvattr_t support in the zfs package itself. The result of this is that much of the code which I had commented out, which accidentally broke things like replay, is now back in place and working. However, there may be a small performance impact for getattr/setattr operations because they now require a translation from native Linux to Solaris types. For now that's a price I'm willing to pay. Once everything is completely functional we can revisting the issue of removing the vattr_t/xvattr_t types. Closes #111	2011-03-02 11:44:34 -08:00
Brian Behlendorf	9623f736d9	Remove caller_context_t Remove the remaining callers of caller_context_t. This type has been removed because it is not needed for the Linux port.	2011-03-02 11:35:35 -08:00
Darik Horn	a23cc0a443	Add the zpool and filesystem versions Print the supported zpool and filesystem versions at module load time. This change removes an ambiguity and adds information that system administrators care about. The phrase "ZFS pool version %s" is the same as zpool upgrade -v so that the operator is familiar with the message. ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5 ZFS: Unloaded module v0.6.0 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-02-28 09:46:23 -08:00
Brian Behlendorf	fdcd952b4d	Fix set block scheduler warnings There were two cases when attempting to set the vdev block device scheduler which would causes console warnings. The first case was when the vdev used a loop, ram, dm, or other such device which doesn't support a configurable scheduler. In these cases attempting to set a scheduler is pointless and can be safely skipped. The secord case is slightly more troubling. We were seeing transient cases where setting the elevator would return -EFAULT. On retry everything is fine so there appears to be a small window where this is possible. To handle that case we silently retry up to three times before reporting the warning. In all of the above cases the warning is harmless and at worse you may see slightly different performance characteristics from one or more of your vdevs.	2011-02-25 11:37:11 -08:00
Fajar A. Nugraha	4c0d8e50b9	Use udev to create /dev/zvol/[dataset_name] links This commit allows zvols with names longer than 32 characters, which fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102. Changes include: - use /dev/zd* device names for zvol, where * is the device minor (include/sys/fs/zfs.h, module/zfs/zvol.c). - add BLKZNAME ioctl to get dataset name from userland (include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id). - add udev rule to create /dev/zvol/[dataset_name] and the legacy /dev/[dataset_name] symlink. For partitions on zvol, it will create /dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules, cmd/zvol_id). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-02-25 09:43:19 -08:00
Brian Behlendorf	dc1d7665c5	Remove rdev packing Remove custom code to pack/unpack dev_t's. Under Linux all dev_t's are an unsigned 32-bit value even on 64-bit platforms. The lower 20 bits are used for the minor number and the upper 12 for the major number. This means if your importing a pool from Solaris you may get strange major/minor numbers. But it doesn't really matter because even if we add compatibility code to translate the encoded Solaris major/minor they won't do you any good under Linux. You will still need to recreate the dev_t with a major/minor which maps to reserved major numbers used under Linux. Dropping this code also resolves 32-bit builds by removing the offending 32-bit compatibility code.	2011-02-23 15:13:03 -08:00

... 13 14 15 16 17 ...

1600 Commits