Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Tim Chase	ad4af89561	Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches" This reverts only the l2arc_hdr part of commit `ecf3d9b8e6` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which does the same thing but uses the newer two-level ARC structure following the Illumos 5408 "managing ZFS cache devices requires lots of RAM" patch. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	97639d0a52	Revert "Allow arc_evict_ghost() to only evict meta data" Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates the need for this. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	f6b3b1f5d6	Revert "fix l2arc compression buffers leak" This reverts commit `037763e44e` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which includes a fix for this very problem. ZoL had picked up a subset of the illumos 5497 patch to deal with the l2arc compression buffer leak. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Tim Chase	7807028ccd	Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc" This reverts commit `16fcdea363` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates "marker" within the ARC code. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Brian Behlendorf	44de2f02d6	Remove unused variable in vdev_add_child() Commit `c3520e7` restructured vdev_add_child() in such a way that the spa variable was unused during non-debug builds. This is consistent with the upstream illumos code but because ZoL, unlike illumos, is built with all compiler warnings enabled this causes a legitimate warning. Revert this hunk of the patch to keep the build clean. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3432	2015-06-11 10:22:38 -07:00
Matthew Ahrens	c3520e7f1f	Illumos 5818 - zfs {ref}compressratio is incorrect with 4k sector size 5818 zfs {ref}compressratio is incorrect with 4k sector size Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Albert Lee <trisk@omniti.com> References: https://www.illumos.org/issues/5818 https://github.com/illumos/illumos-gate/commit/81cd5c5 Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3432	2015-06-10 16:24:01 -07:00
Arne Jansen	9c43027b3f	Illumos 5269 - zpool import slow 5269 zpool import slow Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5269 https://github.com/illumos/illumos-gate/commit/12380e1e Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3396	2015-06-09 13:48:02 -07:00
Ned Bass	5f8e1e8505	dmu_objset_userquota_get_ids uses dn_bonus unsafely The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus outside of dn_struct_rwlock. If the dnode is being freed then the bonus dbuf may be in the process of getting evicted. In this case there is a race that may cause dmu_objset_userquota_get_ids() to access the dbuf after it has been destroyed. To prevent this, ensure that when we are using the bonus dbuf we are either holding a reference on it or have taken dn_struct_rwlock. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3443	2015-06-05 12:41:17 -07:00
Ned Bass	d617648c7f	dbuf_try_add_ref minor bug fixes - Don't check db->bb_blkid, but use the blkid argument instead. Checking db->db_blkid may be unsafe since we doesn't yet have a hold on the dbuf so its validity is unknown. - Call mutex_exit() on found_db, not db, since it's not certain that they point to the same dbuf, and the mutex was taken on found_db. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3443	2015-06-05 12:40:38 -07:00
Matthew Ahrens	e5fd1dd682	Illumos 5243 - zdb -b could be much faster 5243 zdb -b could be much faster Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5243 https://github.com/illumos/illumos-gate/commit/f7950bf Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3414	2015-05-15 11:14:54 -07:00
Jan Sanislo	79065ed5a4	Return -ESTALE to force lookup for missing NFS file handles There seems to be a annoying problem using NFSv4 to access ZFS file systems under certain circumstances. It's easily reproduced: nfs_client1: mount server:/export /mnt nfs_client1: cd /mnt nfs_client1: echo foo >junk nfs_client1: cat junk foo Now on a different NFSv4 client: nfs_client2: mount server:/export /mnt nfs_client2: cd /mnt nfs_client2: vi junk # Make some changes to /mnt/junk and save # This change the inode associated with /mnt/junk Now back to the original client: nfs_client1: cat junk cat: junk: No such file or directory Admittedly NFSv4 is not advertised as a cluster file system that maintains a completely coherent view of data across multiple nodes. But it does have some mechanisms built in that try to deal with situations like the above. Namely, it employs specialized file handle lookup routines that return ESTALE when a file handle contains a non-existant inode value. The ESTALE return triggers a return full file path lookup from the client to determine if the file has actually gone away or if the cached file handle is no longer valid. ZFS behavior can be brought into line with other file systems (e.g., ext4) by applying the following patch: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3224	2015-05-14 11:16:52 -07:00
Antonio Russo	7290cd3c4e	Relax restriction on zfs_ioc_next_obj() iteration Per the documentation for dnode_next_offset in dnode.c, the "txg" parameter specifies a lower bound on which transaction the dnode can be found in. We are interested in all dnodes that are removed between the first and last transaction in the snapshot. It doesn't need to be created in that snapshot to correspond to a removed file. In fact, the behavior of zfs diff in the test case exactly matches this: the transaction that created the data that was deleted in snapshot "2" was produced before, in snapshot "1", definitely predating the first transaction in snapshot "2". Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com> Closes #2081	2015-05-14 11:16:08 -07:00
Brian Behlendorf	fd0fd6467b	Remove unused 'dsl_pool_t dp' variable When ASSERTs are compiled out by using the --disable-debug configure option. Then the local variable 'dsl_pool_t dp' will be unused and generate a compiler warning. Since this variable is only used once in the ASSERT replace it with 'ds->ds_dir->dd_pool'. This has the additional advantage of potentially saving a few bytes on the stack depending on how gcc decides to compile the function. This issue was not noticed immediately because the automated builders use --enable-debug to make the testing as rigorous as possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Closes #3410	2015-05-14 11:09:47 -07:00
Max Grossman	5dc8b7365f	Illumos 5765 - add support for estimating send stream size with lzc_send_space when source is a bookmark 5765 add support for estimating send stream size with lzc_send_space when source is a bookmark Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@nexenta.com> References: https://www.illumos.org/issues/5765 https://github.com/illumos/illumos-gate/commit/643da460 Porting notes: * Unused variable 'recordsize' in dmu_send_estimate() dropped Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3397	2015-05-13 09:03:59 -07:00
Justin T. Gibbs	19b3b1d2a2	Illumos 5393 - spurious failures from dsl_dataset_hold_obj() 5393 spurious failures from dsl_dataset_hold_obj() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5393 https://github.com/illumos/illumos-gate/commit/e1f3c20 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3403	2015-05-13 08:56:04 -07:00
Justin T. Gibbs	63b33e878a	Illumos 5562 - ZFS sa_handle's violate kmem invariants, debug kernels panic on boot 5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Robert Mustacchi <rm@fingolfin.org> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5562 https://github.com/illumos/illumos-gate/commit/0fda3cc5 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3388	2015-05-11 15:10:57 -07:00
Matthew Ahrens	252e1a54ab	Illumos 5810 - zdb should print details of bpobj 5810 zdb should print details of bpobj Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5810 https://github.com/illumos/illumos-gate/commit/732885fc Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3387	2015-05-11 15:10:24 -07:00
Matthew Ahrens	10400bfeac	Illumos 5351, 5352 - scrub pauses 5351 scrub goes for an extra second each txg 5352 scrub should pause when there is some dirty data Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5351 https://github.com/illumos/illumos-gate/commit/6f6a76a Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3383	2015-05-11 15:09:56 -07:00
Matthew Ahrens	08dc1b2ddd	Illumos 5350 - clean up code in dnode_sync() Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5350 https://github.com/illumos/illumos-gate/commit/e651831 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3382	2015-05-11 15:09:51 -07:00
Alex Reece	7224c67fea	Illumos 5422 - preserve AVL invariants in dn_dbufs Author: Alex Reece <alex@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Albert Lee <trisk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5422 https://github.com/illumos/illumos-gate/commit/a846f19 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3381	2015-05-11 15:09:29 -07:00
David Lamparter	214806c7e9	Safely handle security / ACL failures The security and ACL operations should all be performed atomically. To accomplish this there would need to significant invasive changes made to the common code base. For the moment it's desirable for compatibility reasons to avoid this. Therefore the code has been updated to attempt to unwind the operation in case of failure rather than panic. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2445	2015-05-11 14:35:14 -07:00
Matthew Ahrens	f1512ee61e	Illumos 5027 - zfs large block support 5027 zfs large block support Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #354	2015-05-11 12:23:16 -07:00
Brian Behlendorf	f9cab37291	Remove metaslab_min_alloc_size module option The metaslab_min_alloc_size option is no longer used in the code. This functionality was removed by commit `f3a7f66` and the module options should have been dropped at that time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-11 09:51:57 -07:00
Chris Dunlop	16fcdea363	arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc With debugging enabled and depending on your kernel config, the size of arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost() to greater than 1024 bytes. Let's avoid this. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3377	2015-05-08 14:14:35 -07:00
Matthew Ahrens	63e3a8616b	Illumos 5349 - verify that block pointer is plausible before reading 5349 verify that block pointer is plausible before reading Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Xin Li <delphij@FreeBSD.org> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5349 https://github.com/illumos/illumos-gate/commit/f63ab3d5 Porting notes: * Several variable declarations were moved due to C style needs Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3373	2015-05-08 14:09:15 -07:00
Chris Dunlop	f0da4d1508	Wait for all znodes to be released before tearing down the superblock By the time we're tearing down our superblock the VFS has started releasing all our inodes/znodes. Some of this work may have been handed off to our iput taskq so we need to wait for that work to complete. However the iput from the taskq can itself result in additional work being added to the taskq: dsl_pool_iput_taskq iput iput_final evict destroy_inode zpl_inode_destroy zfs_inode_destroy zfs_iput_async(ZTOI(zp->z_xattr_parent)) taskq_dispatch(dsl_pool_iput_taskq..., iput, ...) Let's wait until all our znodes have been released. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3281	2015-05-06 14:13:19 -07:00
Matthew Ahrens	7a3066ffdd	Illumos 5348 - zio_checksum_error() only fills in info if ECKSUM 5348 zio_checksum_error() only fills in info if ECKSUM Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5348 https://github.com/illumos/illumos-gate/commit/373dc1cf Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3372	2015-05-05 11:05:37 -07:00
Matthew Ahrens	f3c517d814	Illumos 5820 - verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5820 https://github.com/illumos/illumos-gate/commit/34e8acef00 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3364	2015-05-04 10:49:49 -07:00
Matthew Ahrens	36c6ffb6b6	Illumos 5808 - spa_check_logs is not necessary on readonly pools 5808 spa_check_logs is not necessary on readonly pools Reviewed by: George Wilson <george@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5808 https://github.com/illumos/illumos-gate/commit/23367a2f Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3369	2015-05-04 10:45:42 -07:00
Will Andrews	50f9ea0149	Illumos 5814 - bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. 5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5814 https://github.com/illumos/illumos-gate/commit/b67dde11 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3368	2015-05-04 10:28:18 -07:00
Christopher Siden	0c60cc326b	Illumos 4951 - ZFS administrative commands (fix) 4951 ZFS administrative commands should use reserved space, not fail with ENOSPC Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/4951 https://github.com/illumos/illumos-gate/commit/c39f2c8 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:10 -07:00
Matthew Ahrens	3d45fdd6c0	Illumos 4951 - ZFS administrative commands should use reserved space 4951 ZFS administrative commands should use reserved space, not with ENOSPC Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4373 https://github.com/illumos/illumos-gate/commit/7d46dc6 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:10 -07:00
Jerry Jelinek	a0c9a17aef	Illumos 4901 - zfs filesystem/snapshot limit leaks 4901 zfs filesystem/snapshot limit leaks Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4901 https://github.com/illumos/illumos-gate/commit/adf3407 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:09 -07:00
Matthew Ahrens	83017311e4	Illumos 3654,3656 3654 zdb should print number of ganged blocks 3656 remove unused function zap_cursor_move_to_key() Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3654 https://www.illumos.org/issues/3656 https://github.com/illumos/illumos-gate/commit/d5ee8a1 Porting Notes: 3655 and 3657 were part of this commit but those hunks were dropped since they apply to mdb. Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:09 -07:00
tuxoko	6102d0376e	Add cond_resched to zfs_zget to prevent infinite loop It's been reported that threads would loop infinitely inside zfs_zget. The speculated cause for this is that if an inode is marked for evict, zfs_zget would see that and loop. However, if the looping thread doesn't yield, the inode may not have a chance to finish evict, thus causing a infinite loop. This patch solve this issue by add cond_resched to zfs_zget, making the looping thread to yield when needed. Tested-by: jlavoy <jalavoy@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3349	2015-05-04 09:12:41 -07:00
Jason Zaman	c9520ecc0f	dmu: fix integer overflows The params to the functions are uint64_t, but the offsets to memcpy / bcopy are calculated using 32bit ints. This patch changes them to also be uint64_t so there isnt an overflow. PaX's Size Overflow caught this when formatting a zvol. Gentoo bug: #546490 PAX: offset: 1ffffb000 db->db_offset: 1ffffa000 db->db_size: 2000 size: 5000 PAX: size overflow detected in function dmu_read /var/tmp/portage/sys-fs/zfs-kmod-0.6.3-r1/work/zfs-zfs-0.6.3/module/zfs/../../module/zfs/dmu.c:781 cicus.366_146 max, count: 15 CPU: 1 PID: 2236 Comm: zvol/10 Tainted: P O 3.17.7-hardened-r1 #1 Call Trace: [<ffffffffa0382ee8>] ? dsl_dataset_get_holds+0x9d58/0x343ce [zfs] [<ffffffff81a59c88>] dump_stack+0x4e/0x7a [<ffffffffa0393c2a>] ? dsl_dataset_get_holds+0x1aa9a/0x343ce [zfs] [<ffffffff81206696>] report_size_overflow+0x36/0x40 [<ffffffffa02dba2b>] dmu_read+0x52b/0x920 [zfs] [<ffffffffa0373ad1>] zrl_is_locked+0x7d1/0x1ce0 [zfs] [<ffffffffa0364cd2>] zil_clean+0x9d2/0xc00 [zfs] [<ffffffffa0364f21>] zil_commit+0x21/0x30 [zfs] [<ffffffffa0373fe1>] zrl_is_locked+0xce1/0x1ce0 [zfs] [<ffffffff81a5e2c7>] ? __schedule+0x547/0xbc0 [<ffffffffa01582e6>] taskq_cancel_id+0x2a6/0x5b0 [spl] [<ffffffff81103eb0>] ? wake_up_state+0x20/0x20 [<ffffffffa0158150>] ? taskq_cancel_id+0x110/0x5b0 [spl] [<ffffffff810f7ff4>] kthread+0xc4/0xe0 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170 [<ffffffff81a62fa4>] ret_from_fork+0x74/0xa0 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170 Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3333	2015-05-04 09:12:00 -07:00
George Wilson	98b254188a	Illumos #5244 - zio pipeline callers should explicitly invoke next stage 5244 zio pipeline callers should explicitly invoke next stage Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5244 https://github.com/illumos/illumos-gate/commit/738f37b Porting Notes: 1. The unported "2932 support crash dumps to raidz, etc. pools" caused a merge conflict due to a copyright difference in module/zfs/vdev_raidz.c. 2. The unported "4128 disks in zpools never go away when pulled" and additional Linux-specific changes caused merge conflicts in module/zfs/vdev_disk.c. Ported-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2828	2015-04-30 15:07:47 -07:00
Matthew Ahrens	8dd86a10cf	Illumos 5812 - assertion failed in zrl_tryenter(): zr_owner==NULL 5812 assertion failed in zrl_tryenter(): zr_owner==NULL Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5812 https://github.com/illumos/illumos-gate/commit/8df1730 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3357	2015-04-30 14:43:40 -07:00
Justin T. Gibbs	6186e29753	Illumos 5592 - NULL pointer dereference in dsl_prop_notify_all_cb() 5592 NULL pointer dereference in dsl_prop_notify_all_cb() Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5592 https://github.com/illumos/illumos-gate/commit/9d47dec Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:58 -07:00
Justin T. Gibbs	6ebebaceb1	Illumos 5531 - NULL pointer dereference in dsl_prop_get_ds() 5531 NULL pointer dereference in dsl_prop_get_ds() Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5531 https://github.com/illumos/illumos-gate/commit/e57a022 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:44 -07:00
Justin T. Gibbs	0c66c32d1d	Illumos 5056 - ZFS deadlock on db_mtx and dn_holds 5056 ZFS deadlock on db_mtx and dn_holds Author: Justin Gibbs <justing@spectralogic.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5056 https://github.com/illumos/illumos-gate/commit/bc9014e Porting Notes: sa_handle_get_from_db(): - the original patch includes an otherwise unmentioned fix for a possible usage of an uninitialised variable dmu_objset_open_impl(): - Under Illumos list_link_init() is the same as filling a list_node_t with NULLs, so they don't notice if they miss doing list_link_init() on a zero'd containing structure (e.g. allocated with kmem_zalloc as here). Under Linux, not so much: an uninitialised list_node_t goes "Boom!" some time later when it's used or destroyed. dmu_objset_evict_dbufs(): - reduce stack usage using kmem_alloc() Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:34 -07:00
Justin T. Gibbs	d683ddbb72	Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFS 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5314 https://github.com/illumos/illumos-gate/commit/c137962 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:20 -07:00
Justin T. Gibbs	945dd93525	Illumos 5310 - Remove always true tests for non-NULL ds->ds_phys 5310 Remove always true tests for non-NULL ds->ds_phys Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5310 https://github.com/illumos/illumos-gate/commit/d808a4f Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:08 -07:00
Alex Reece	9925c28cde	Illumos 5095 - panic when adding a duplicate dbuf to dn_dbufs 5095 panic when adding a duplicate dbuf to dn_dbufs Author: Alex Reece <alex@delphix.com> Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Josef Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5095 https://github.com/illumos/illumos-gate/commit/86bb58a Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:49 -07:00
Justin T. Gibbs	5aea3644d6	Illumos 5038 - Remove "old-style" flexible array usage in ZFS. 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5038 https://github.com/illumos/illumos-gate/commit/7f18da4 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:24 -07:00
Alex Reece	8951cb8dfb	Illumos 4873 - zvol unmap calls can take a very long time for larger datasets 4873 zvol unmap calls can take a very long time for larger datasets Author: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4873 https://github.com/illumos/illumos-gate/commit/0f6d88a Porting Notes: dbuf_free_range(): - reduce stack usage using kmem_alloc() - the sorted AVL tree will handle the spill block case correctly without all the special handling in the for() loop Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:03 -07:00
Jorgen Lundman	58c4aa00c6	Illumos 4975 - missing mutex_destroy() calls in zfs 4975 missing mutex_destroy() calls in zfs Author: Jorgen Lundman <lundman@lundman.net> Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Reviewed by: Seth Nimbosa <darth.Serious@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4975 https://github.com/illumos/illumos-gate/commit/d2b3cbb Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:23:38 -07:00
Alex Reece	ca227e54a8	Illumos 3897 - zfs filesystem and snapshot limits (fix leak) 3897 zfs filesystem and snapshot limits (fix leak) Author: Alex Reece <alex.reece@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/fb7001f Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:23:14 -07:00
Jerry Jelinek	788eb90c4c	Illumos 3897 - zfs filesystem and snapshot limits 3897 zfs filesystem and snapshot limits Author: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/a2afb61 Porting Notes: dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc(). Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:22:51 -07:00
tuxoko	ecfb0b5f42	Fix misuse of input argument in traverse_visitbp In traverse_visitbp(), the input argument dnp is modified in the middle to point to a temporary buffer. Originally this doesn't matter, because no user of TRAVERSE_POST dereferences it. However, in `fbeddd6` a piece of code is added dereferencing dnp after the modification, creating a possible bug. We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case, so we don't modify the input argument. Also we introduce different local variables in the DMU_OT_OBJSET case to prevent confusion between the input argument. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2060	2015-04-28 09:43:50 -07:00
Isaac Huang	0336f3d001	Remove useless variable spa_active_count This isn't required for the Linux port because the kernel tracks if a module is busy. The prototype for spa_busy() is also removed since its definition was already removed. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3262	2015-04-27 09:22:05 -07:00
Justin T. Gibbs	ec8501ee12	5313 Allow I/Os to be aggregated across ZIO priority classes Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@SpectraLogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5313 https://github.com/illumos/illumos-gate/commit/fe319232 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3280	2015-04-24 15:16:56 -07:00
Ned Bass	4eb30c6864	Serialize access to spa->spa_feat_stats nvlist The function spa_add_feature_stats() manipulates the shared nvlist spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to protect the list. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3335	2015-04-24 15:04:43 -07:00
Chunwei Chen	07012da668	Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb) The following panic would occur under certain heavy load: [ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held [ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P OE 3.18.10 #7 The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which would purge all tsd data for the thread. However, ZFS_ENTER is designed to be reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data. Instead, we rely on the new behavior of tsd_set. When NULL is passed as the new value to tsd_set, it will automatically remove the tsd entry specified the the key for the current thread. rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to do clean up. Now we explicitly call tsd_set(key, NULL) on them. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3247	2015-04-24 14:57:54 -07:00
Brian Behlendorf	a438ff0e85	Extend PF_FSTRANS critical regions Additional testing has shown that the region covered by PF_FSTRANS needs to be extended to cover the zpl_xattr_security_init() and init_acl() functions. The zpl_mark_dirty() function can also recurse and therefore must always be protected. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3331	2015-04-24 09:54:22 -07:00
Brian Behlendorf	7fad6290eb	Mark additional functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all NFS, xattr, ctldir, and super function calls. This is related to `40d06e3`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3225	2015-04-17 09:35:24 -07:00
Tim Chase	5074bfe8ad	Allocate zfs_znode_cache on the Linux slab The Linux slab, in general, performs better than the SPl slab in cases where a lot of objects are allocated and fragmentation is likely present. This patch fixes pathologically bad behavior in cases where the ARC is filled with mostly metadata and a user program needs to allocate and dirty enough memory which would require an insignificant amount of the ARC to be reclaimed. If zfs_znode_cache is on the SPL slab, the system may spin for a very long time trying to reclaim sufficient memory. If it is on the Linux slab, the behavior has been observed to be much more predictible; the memory is reclaimed more efficiently. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3283	2015-04-14 12:19:22 -07:00
Brian Behlendorf	f42d7f4111	Use vmem_alloc() in spa_config_write() The packed nvlist allocated in spa_config_write() may exceed the warning threshold for large configurations. Use the vmem interfaces for this short lived allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3251	2015-04-07 15:10:19 -07:00
Tim Chase	40d06e3c78	Mark all ZPL and ioctl functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl calls as well as the l2arc and adapt ARC threads. This obviates the need for MUTEX_FSTRANS so its previous uses and definition have been eliminated. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3225	2015-04-03 11:38:59 -07:00
Matthew Ahrens	0f7d2a4b3d	Illumus 5693 - ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override 5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5693 https://github.com/illumos/illumos-gate/commit/7f7ace3 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3231	2015-03-27 15:02:56 -07:00
George Wilson	b738bc5a0f	Illumos 5694 - traverse_prefetcher does not prefetch enough 5694 traverse_prefetcher does not prefetch enough Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5694 https://github.com/illumos/illumos-gate/commit/34d7ce05 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3230	2015-03-27 15:02:50 -07:00
Chris Dunlop	ee2f17aa2a	Align code with Illumos Align code in traverse_visitbp() with that in Illumos in preparation for applying Illumos-5694. No functional change: use a temporary variable pd to replace multiple occurrences of td->td_pfd. This increases our stack use slightly more then normal because the function is called recursively. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3230	2015-03-27 14:52:45 -07:00
Prakash Surya	a4069eef2e	Illumos 5695 - dmu_sync'ed holes do not retain birth time 5695 dmu_sync'ed holes do not retain birth time Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5695 https://github.com/illumos/illumos-gate/commit/70163ac Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3229	2015-03-27 14:51:34 -07:00
Ned Bass	58806b4cdc	dbuf_free_range() overzealously frees dbufs When called to free a spill block from a dnode, dbuf_free_range() has a bug that results in all dbufs for the dnode getting freed. A variety of problems may result from this bug, but a common one was a zap lookup tripping an ASSERT because the zap buffers had been zeroed out. This could happen on a dataset with xattr=sa set when extended attributes are written and removed on a directory concurrently with I/O to files in that directory. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #3195 Fixes #3204 Fixes #3222	2015-03-25 14:48:22 -07:00
Tim Chase	ded576e28f	Set the maximum ZVOL transfer size correctly ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS). This cap was removed in torvalds/linux@34b48db. This patch changes it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in dmu_tx_hold_write() to allow the maximum transfer size. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3212	2015-03-25 11:58:42 -07:00
Isaac Huang	e89bd69775	zio_injection_enabled should not be a module option The zio_inject.c keeps zio_injection_enabled as a counter of fault handlers, so it should not be exported to user space as a module option. Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c, where the symbols are defined. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3199	2015-03-24 13:22:03 -07:00
Chris Dunlop	d07b7c7f21	Reduce size of zfs_sb_t: allocate z_hold_mtx separately zfs_sb_t has grown to the point where using kmem_zalloc() for allocations is triggering the 32k warning threshold. We can't safely convert this entire allocation to use vmem_alloc() instead of kmem_alloc() because the backing_dev_info structure is embedded here. It depends on the bit_waitqueue() function which won't behave properly when given a virtual address. Instead, use vmem_alloc() to allocate the z_hold_mtx array separately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #3178	2015-03-24 13:17:44 -07:00
Brian Behlendorf	bc88866657	Fix arc_adjust_meta() behavior The goal of this function is to evict enough meta data buffers from the ARC in order to enforce the arc_meta_limit. Achieving this is slightly more complicated than it appears because it is common for data buffers to have holds on meta data buffers. In addition, dnode meta data buffers will be held by the dnodes in the block preventing them from being freed. This means we can't simply traverse the ARC and expect to always find enough unheld meta data buffer to release. Therefore, this function has been updated to make alternating passes over the ARC releasing data buffers and then newly unheld meta data buffers. This ensures forward progress is maintained and arc_meta_used will decrease. Normally this is sufficient, but if required the ARC will call the registered prune callbacks causing dentry and inodes to be dropped from the VFS cache. This will make dnode meta data buffers available for reclaim. The number of total restarts in limited by zfs_arc_meta_adjust_restarts to prevent spinning in the rare case where all meta data is pinned. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	2cbb06b561	Restructure per-filesystem reclaim Originally when the ARC prune callback was introduced the idea was to register a single callback for the ZPL. The ARC could invoke this call back if it needed the ZPL to drop dentries, inodes, or other cache objects which might be pinning buffers in the ARC. The ZPL would iterate over all ZFS super blocks and perform the reclaim. For the most part this design has worked well but due to limitations in 2.6.35 and earlier kernels there were some problems. This patch is designed to address those issues. 1) iterate_supers_type() is not provided by all kernels which makes it impossible to safely iterate over all zpl_fs_type filesystems in a single callback. The most straight forward and portable way to resolve this is to register a callback per-filesystem during mount. The arc_*_prune_callback() functions have always supported multiple callbacks so this is functionally a very small change. 2) Commit `050d22b` removed the non-portable shrink_dcache_memory() and shrink_icache_memory() functions and didn't replace them with equivalent functionality. This meant that for Linux 3.1 and older kernels the ARC had no mechanism to drop dentries and inodes from the caches if needed. This patch adds that missing functionality by calling shrink_dcache_parent() to release dentries which may be pinning inodes. This will result in all unused cache entries being dropped which is a bit heavy handed but it's the only interface available for old kernels. 3) A zpl_drop_inode() callback is registered for kernels older than 2.6.35 which do not support the .evict_inode callback. This ensures that when the last reference on an inode is dropped it is immediately removed from the cache. If this isn't done than inode can end up on the global unused LRU with no mechanism available to ZFS to drop them. Since the ARC buffers are not dropped the hottest inodes can still be recreated without performing disk IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	596a8935a1	Fix arc_meta_max accounting The arc_meta_max value should be increased when space it consumed not when it is returned. This ensure's that arc_meta_max is always up to date. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Chunwei Chen	40749aa7a6	Use MUTEX_FSTRANS on l2arc_buflist_mtx Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock scenario: 1. arc_release() -> hash_lock -> l2arc_buflist_mtx 2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) -> arc_buf_remove_ref() -> hash_lock Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3160	2015-03-18 09:29:38 -07:00
Justin T. Gibbs	4c7b7eedcd	Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption 5630 stale bonus buffer in recycled dnode_t leads to data corruption Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5630 https://github.com/illumos/illumos-gate/commit/cd485b4 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3172	2015-03-12 15:40:39 -07:00
Josef 'Jeff' Sipek	73ad4a9f3c	Illumos 5047 - don't use atomic__nv if you discard the return value 5047 don't use atomic__nv if you discard the return value Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5047 https://github.com/illumos/illumos-gate/commit/640c167 Porting Notes: Several hunks from the original patch where not specific to ZFS and thus were dropped. Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3172	2015-03-12 15:40:33 -07:00
Brian Behlendorf	7f3e466283	Mark zfs_inactive() with PF_FSTRANS Allowing direct reclaim to re-enter the VFS in the zfs_inactive() call path has historically been problematic for ZoL. Therefore, in order to avoid an entire class of current and future issues caused by this PF_FSTRANS is set for all zfs_inactive() callers. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3163	2015-03-10 09:21:48 -07:00
Ned Bass	417104bdd3	Use cached feature info in spa_add_feature_stats() Avoid issuing I/O to the pool when retrieving feature flags information. Trying to read the ZAPs from disk means that zpool clear would hang if the pool is suspended and recovery would require a reboot. To keep the feature stats resident in memory, we hang a cached nvlist off of the spa. It is built up from disk the first time spa_add_feature_stats() is called, and refreshed thereafter using the cached feature reference counts. spa_add_feature_stats() gets called at pool import time so we can be sure the cached nvlist will be available if the pool is later suspended. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3082	2015-03-05 14:11:10 -08:00
Brian Behlendorf	989fd514b1	Change ASSERT(!"...") to cmn_err(CE_PANIC, ...) There are a handful of ASSERT(!"...")'s throughout the code base for cases which should be impossible. This patch converts them to use cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that additional debugging is logged if they were to occur. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1445	2015-03-03 13:22:21 -08:00
Brian Behlendorf	8c45def24a	Linux 4.0 compat: bdi_setup_and_register() The 'capabilities' argument which was passed to bdi_setup_and_register() has been removed. File systems should no longer pass BDI_CAP_MAP_COPY. For our purposes this means there are now three different interfaces which must be handled. A zpl_bdi_setup_and_register() wrapper function has been introduced to provide a single interface to the ZPL code. * 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported. * 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments. * 4.0 - x.y, bdi_setup_and_register() takes 2 arguments. I've also taken this opportunity to remove HAVE_BDI because kernels older then 2.6.32 are no longer supported. All kernels newer than this will have one of the above interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #3128	2015-03-03 10:49:45 -08:00
Brian Behlendorf	4ec15b8dcf	Use MUTEX_FSTRANS mutex type There are regions in the ZFS code where it is desirable to be able to be set PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to set/clear this flag in all the correct places, but this is undesirable for a few reasons. 1) It would require changes to a significant amount of the ZFS code. This would complicate applying patches from upstream. 2) It would be easy to accidentally miss a critical region in the initial patch or to have an future change introduce a new one. Both of these concerns can be addressed by using a new mutex type which is responsible for managing PF_FSTRANS, support for which was added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch 'kmem-rework'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3050 Closes #3055 Closes #3062 Closes #3132 Closes #3142 Closes #2983	2015-03-03 10:46:40 -08:00
Isaac Huang	d14cfd83da	Fix deadlock between zpool export and zfs list Pool reference count is NOT checked in spa_export_common() if the pool has been imported readonly=on, i.e. spa->spa_sync_on is FALSE. Then zpool export and zfs list may deadlock: 1. Pool A is imported readonly. 2. zpool export A and zfs list are run concurrently. 3. zfs command gets reference on the spa, which holds a dbuf on on the MOS meta dnode. 4. zpool command grabs spa_namespace_lock, and tries to evict dbufs of the MOS meta dnode. The dbuf held by zfs command can't be evicted as its reference count is not 0. 5. zpool command blocks in dnode_special_close() waiting for the MOS meta dnode reference count to drop to 0, with spa_namespace_lock held. 6. zfs command tries to get the spa_namespace_lock with a reference on the spa held, which holds a dbuf on the MOS meta dnode. 7. Now zpool command and zfs command deadlock each other. Also any further zfs/zpool command will block on spa_namespace_lock forever. The fix is to always check pool reference count in spa_export_common(), no matter whether the pool was imported readonly or not. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2034	2015-03-02 11:50:06 -08:00
Brian Behlendorf	87a63dd702	Prevent "zpool destroy\|export" when suspended Cleanly destroying or exporting a pool requires that the pool not be suspended. Therefore, set the POOL_CHECK_SUSPENDED flag for these ioctls so the utilities will output a descriptive error message rather than block. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2878	2015-03-02 11:50:06 -08:00
Brian Behlendorf	b4f3666a16	Retire spl_module_init()/spl_module_fini() In the original implementation of the SPL wrappers were provided for module initialization and cleanup. This was done to abstract away any compatibility code which might be needed for the SPL. As it turned out the only significant compatibility issue was that the default pwd during module load differed under Illumos and Linux. Since this is such as minor thing and the wrappers complicate the code they are being retired. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2985	2015-02-24 11:37:44 -08:00
Brian Behlendorf	1efdc45ea8	Fix O_APPEND open(2) flag As described in flags section of open(2): O_APPEND: The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). O_APPEND may lead to corrupted files on NFS filesys- tems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition. This issue was originally overlooked because normally the generic VFS code handles this for a filesystem. However, because ZFS explictly registers a zpl_write() function it's responsible for the seek. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3124	2015-02-24 11:21:54 -08:00
Dan Swartzendruber	1611bb7b4f	Set zfs_autoimport_disable default value to 1 When loading the ZFS kernel modules they should not populate the spa namespace using the cache file. This behavior isn't consistent with other Linux kernel modules and we need to move away from it. Removing this makes the whole startup process predictable with four basic steps which are driven by the init system. 1) modprobe 2) zpool import 3) zfs mount 4) zfs share This change also helps lay the groundwork for eventually removing the kobj_* compatibility code on the kernel side. It may need to be preserved in userspace because libzfs_init() depends on it. This is why the conditional must be wrapped with an #ifdef _KERNEL. Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2820	2015-02-17 16:09:41 -08:00
Brian Behlendorf	7d2868d5fc	Skip bad DVAs during free by setting zfs_recover=1 When a bad DVA is encountered in metaslab_free_dva() the system should treat it as fatal. This indicates that somehow a damaged DVA was written to disk and that should be impossible. However, we have seen a handful of reports over the years of pools somehow being damaged in this way. Since this damage can render otherwise intact pools unimportable, and the consequence of skipping the bad DVA is only leaked free space, it makes sense to provide a mechanism to ignore the bad DVA. Setting the zfs_recover=1 module option will cause the DVA to be ignored which may allow the pool to be imported. Since zfs_recover=0 by default any pool attempting to free a bad DVA will treat it as a fatal error preserving the current behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3099 Issue #3090 Issue #2720	2015-02-13 16:02:04 -08:00
Andrey Vesnovaty	5f15fa2216	Fix readdir for .zfs/snapshot directory dmu_snapshot_list_next stores the index of the next snapshot entry to the offp argument, which zpl_snapdir_iterate then uses for the dir_emit. This result in an off-by-one error. Therefore a temporary variable should be used. This was a regression introduced in commit zfsonlinux/zfs@0f37d0c. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2930	2015-02-10 16:34:30 -08:00
Brian Behlendorf	3941503c0a	Retire zio_cons()/zio_dest() The zio_cons() constructor and zio_dest() destructor don't exist in the upstream Illumos code. They were introduced as a workaround to avoid issue #2523. Since this issue has now been resolved this code is being reverted to bring ZoL back in sync with Illumos. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:09:49 -08:00
Brian Behlendorf	6442f3cfe3	Retire zio_bulk_flags Long ago the zio_bulk_flags module parameter was introduced to facilitate debugging and profiling the zio_buf_caches. Today this code works well and there's no compelling reason to keep this functionality. In fact it's preferable to revert this so the code is more consistent with other ZFS implementations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:08:49 -08:00
Jörg Thalheim	534759fad3	Linux 3.19 compat: file_inode was added struct access f->f_dentry->d_inode was replaced by accessor function file_inode(f) Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3084	2015-02-10 11:24:51 -08:00
Brian Behlendorf	77aef6f60e	Use vmem_alloc() for nvlists Several of the nvlist functions may perform allocations larger than the 32k warning threshold. Convert them to use vmem_alloc() so the best allocator is used. Commit `efcd79a` retired KM_NODEBUG which was used to suppress large allocation warnings. Concurrently the large allocation warning threshold was increased from 8k to 32k. The goal was to identify the remaining locations, such as this one, where the allocation can be larger than 32k. This patch is expected fine tuning resulting for the kmem-rework changes, see commit `6e9710f`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3057 Closes #3079 Closes #3081	2015-02-10 11:00:08 -08:00
Brian Behlendorf	afe373260e	Revert "Don't read space maps during import for readonly pools" This reverts commit `7fc8c33ede` which accidentally introduced a ztest failure. ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2 child exited with code 3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-02-09 16:56:59 -08:00
Brian Behlendorf	7fc8c33ede	Don't read space maps during import for readonly pools Normally when importing a pool the space maps for all top level vdevs are read from disk. The space maps will be required latter when an allocation is performed and free blocks need to be located. However, if the pool is imported readonly then we are guaranteed that no allocations can occur. In this case the space maps need not be loaded.. A similar argument can be made for the DTLs (dirty time logs). Because a pool import will fail if the space maps cannot be read. The ability to safely ignore them makes it more likely that a damaged pool can be imported readonly to recover its contents. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2831	2015-02-09 16:43:03 -08:00
Justin T. Gibbs	33b4de513e	Illumos 5311 - traverse_dnode may report success when it should not 5311 traverse_dnode may report success when it should not Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@spectralogic.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/2a89c2c https://www.illumos.org/issues/5311 Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2970	2015-02-06 12:07:15 -08:00
Ned Bass	a62d1b02e3	Fix SA header size accounting The functions sa_find_sizes() and sa_build_layout() fail to account for the additional 2 bytes of SA header space when calculating whether a variable size attribute might spill over. They may consequently determine that an attribute will fit in the bonus buffer along with a spill block pointer, when in reality the attribute would be partially overwritten by the spill block pointer if spill over occurs. This also causes an inconsistency between the SA header size and the number of variable size attributes in the layout, tripping an assertion when debugging is on. The following reproducer demonstrates the problem. ln -s $(perl -e 'print "z" x 20') file setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file Even though sa_find_sizes() computes the index of the attribute where spill-over will occur, sa_build_layouts() discards the result and recomputes it itself. As it turns out, both functions get it wrong. Since this computation is awkward and, as history has shown, easy to screw up, let's just do it in one place. This patch fixes the bug in sa_find_sizes() and updates sa_build_layout() to use the result computed there. Also improve the comments in sa_find_sizes(). Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3070	2015-02-06 09:26:46 -08:00
Brian Behlendorf	e2c4acde55	Skip evicting dbufs when walking the dbuf hash When a dbuf is in the DB_EVICTING state it may no longer be on the dn_dbufs list. In which case it's unsafe to call DB_DNODE_ENTER. Therefore, any dbuf which is found in this safe must be skipped. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2553 Closes #2495	2015-02-06 09:24:28 -08:00
Tim Chase	aa2ef419e4	Spurious ENOMEM returns when reading dbufs kstat Commit `7b2d78a046` fixed some improper uses of snprintf(), however, in __dbuf_stats_hash_table_data() the return value of snprintf is propagated to the caller. This caused spurious ENOMEM errors when reading the dbufs kstat. This commit causes the actual number of characters written to be returned. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3072	2015-02-04 16:35:26 -08:00
avg	037763e44e	fix l2arc compression buffers leak Commit log from FreeBSD: We have observed that arc_release() can be called concurrently with a l2arc in-flight write. Also, we have observed that arc_hdr_destroy() can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar circumstances. Previously the l2arc headers would be freed while leaking their associated compression buffers. Now the buffers are placed on l2arc_free_on_write list for delayed freeing. This is similar to what was already done to arc buffers that were supposed to be freed concurrently with in-flight writes of those buffers. In addition to fixing the discovered leaks this change also adds some protective code to assert that a compression buffer associated with a l2arc header is never leaked. A new kstat l2_cdata_free_on_write is added. It keeps a count of delayed compression buffer frees which previously would have been leaks. Tested by: Vitalij Satanivskij <satan@ukr.net> et al Requested by: many MFC after: 2 weeks Sponsored by: HybridCluster / ClusterHQ References: https://illumos.org/issues/5222 https://github.com/freebsd/freebsd/commit/b98f85d http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781 http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3029	2015-02-03 16:54:16 -08:00
Brian Behlendorf	19ea3d25df	Use zio buffers in zil_itx_create() The zil_itx_create() function uses the vmem_alloc() allocator for its buffers because when logging a write that buffer may be as large as 64K. This is non-optimal because we may need to allocate many of of these buffers and this interface has the potential to be slow. Instead, use zio_data_buf_alloc() which is specifically designed to be able to efficiently allocate a wide range of buffer sizes. In addition, do some cleanup and use the zil_itx_destroy() function to always free an itx structure. This way we're always sure the right allocation functions are used. Notice that in the current code kmem_free() and vmem_free() were both used. This happened to work because these wrappers map to the same internal SPL function. This was identified as a potential problem when a low-end memory constrained system began logging the following warnings. There was no deadlock here just repeated allocation failures resulting in increased latency. Possible memory allocation deadlock: size=65792 lflags=0x42d0 Pid: 20118, comm: kvm Tainted: P O 3.2.0-0.bpo.4-amd64 Call Trace: [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl] [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl] [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs] [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs] [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs] [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs] [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3059	2015-02-02 11:20:41 -08:00
Brian Behlendorf	0365064a97	Handle closing an unopened ZVOL Thank to commit `a4430fce69` we're now correctly returning EROFS when opening a zvol on a read-only pool. Unfortunately, it looks like this causes us to trigger some unexpected behavior by __blkdev_get(). In the failure case it's possible __blkdev_get() will call __blkdev_put() for a bdev which was never successfully opened. This results in us trying to close the device again and hitting the NULL dereference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1343	2015-01-30 14:44:14 -08:00
Brian Behlendorf	a127e841de	Add zvol_open() error handling for readonly property Rather than ASSERT when for some reason the readonly property of a zvol can't be read cleanly handle the failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1343	2015-01-30 14:44:06 -08:00
Tim Chase	b0cf0676c0	Fix removal of SA in sa_modify_attrs() The sa_modify_attrs() function can add, remove or replace an SA. The main loop in the function uses the index "i" to iterate over the existing SAs and uses the index "j" for writing them into a new buffer via SA_ADD_BULK_ATTR(). The write index, "j" is incremented on remove (SA_REMOVE) operations which leads to a corruption in the new SA buffer. This patch remove the increment for SA_REMOVE operations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3028	2015-01-21 16:35:14 -08:00
Richard Yao	841c9d43c7	Use kmem_vasprintf() in log_internal() An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be simplified by using kmem_asprintf(). It is not clear that switching to kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to kmem_asprintf() is cleanup that simplifies debugging such that it would become clear that this is a bug in glibc should the issue persist. It also brings this function almost back in sync with Illumos. This was possible due to the recently reworked kmem code which allows us to use KM_SLEEP in the same fashion as Illumos. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2791 Issue #2781	2015-01-21 15:30:24 -08:00
Tim Chase	3c832b8cc1	Linux 3.12 compat: split shrinker has s_shrink The split count/scan shrinker callbacks introduced in 3.12 broke the test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers. This patch re-enables the per-superblock shrinkers when the split shrinker callbacks have been detected. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2975	2015-01-20 14:07:59 -08:00
Brian Behlendorf	81971b137a	Revert "SA spill block cache" The SA spill_cache was originally introduced to avoid the need to perform large kmem or vmem allocations. Instead a small dedicated cache of preallocated SA buffers was kept. This solution was viable while the maximum block size was limited to 128K. But with the planned increase of the maximum block size to 16M callers need to migrate to the zio_buf_alloc(). However, they should be aware this interface is expected to change again once the zio buffers are fully backed by scatter-gather lists. Alternately, if the callers know these buffers will never be large or be infrequently accessed they may kmem_alloc() or vmem_alloc() the needed temporary space. This change has the additional benegit of bringing the code back inline with the upstream Illumos source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:28 -08:00
Brian Behlendorf	285b29d959	Revert "Pre-allocate vdev I/O buffers" Commit `86dd0fd` added preallocated I/O buffers. This is no longer required after the recent kmem changes designed to make our memory allocation interfaces behave more like those found on Illumos. A deadlock in this situation is no longer possible. However, these allocations still have the potential to be expensive. So a potential future optimization might be to perform then KM_NOSLEEP so that they either succeed of fail quicky. Either case is acceptable here because we can safely abort the aggregation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:28 -08:00
Brian Behlendorf	79c76d5b65	Change KM_PUSHPAGE -> KM_SLEEP By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. For others fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc() which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers which allows us to dip in to reserved memory. This is again the same as upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:26 -08:00
Brian Behlendorf	efcd79a883	Retire KM_NODEBUG Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress the large allocation warning have been replaced by vmem_alloc() as appropriate. The updated vmem_alloc() call will not print a warning regardless of the size of the allocation. A careful reader will notice that not all callers have been changed to vmem_alloc(). Some have only had the KM_NODEBUG flag removed. This was possible because the default warning threshold has been increased to 32k. This is desirable because it minimizes the need for Linux specific code changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:40:32 -08:00
Richard Yao	71f8548ea4	Use is_vmalloc_addr() in vdev_disk.c The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to cleanup the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:28:05 -08:00
Brian Behlendorf	92119cc259	Mark IO pipeline with PF_FSTRANS In order to avoid deadlocking in the IO pipeline it is critical that pageout be avoided during direct memory reclaim. This ensures that the pipeline threads can always make forward progress and never end up blocking on a DMU transaction. For this very reason Linux now provides the PF_FSTRANS flag which may be set in the process context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:28:05 -08:00
Brian Behlendorf	d958324f97	Fix zfs_putpage() lock inversion (again) This is a follow up commit to `74328ee` which correctly resolved a lock inversion between zfs_putpage() and zfs_free_range(). Unfortunately, in the process it accidentally introduced another inversion between zfs_putpage() and zfs_read(). The page must be unlocked before taking the range lock. This patch corrects that issue. In addition, because the locking rules here are subtle a block comment has been added clearly explaining why the ordering here is critical. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #2976	2015-01-08 16:09:41 -08:00
Ned Bass	33b6dbbc51	Document zfs_flags module parameter Add a table describing the debugging flags that can be set in the zfs_flags module parameter. Also change the module_param type to 'uint' so users aren't shown a negative value. The updated man page text is reproduced below for convenience. zfs_flags (int) Set additional debugging flags. The following flags may be bitwise-or'd together. +-------------------------------------------------------+ \|Value Symbolic Name \| \| Description \| +-------------------------------------------------------+ \| 1 ZFS_DEBUG_DPRINTF \| \| Enable dprintf entries in the debug log. \| +-------------------------------------------------------+ \| 2 ZFS_DEBUG_DBUF_VERIFY * \| \| Enable extra dbuf verifications. \| +-------------------------------------------------------+ \| 4 ZFS_DEBUG_DNODE_VERIFY * \| \| Enable extra dnode verifications. \| +-------------------------------------------------------+ \| 8 ZFS_DEBUG_SNAPNAMES \| \| Enable snapshot name verification. \| +-------------------------------------------------------+ \| 16 ZFS_DEBUG_MODIFY \| \| Check for illegally modified ARC buffers. \| +-------------------------------------------------------+ \| 32 ZFS_DEBUG_SPA \| \| Enable spa_dbgmsg entries in the debug log. \| +-------------------------------------------------------+ \| 64 ZFS_DEBUG_ZIO_FREE \| \| Enable verification of block frees. \| +-------------------------------------------------------+ \| 128 ZFS_DEBUG_HISTOGRAM_VERIFY \| \| Enable extra spacemap histogram verifications. \| +-------------------------------------------------------+ * Requires debug build. Default value: 0. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2988	2015-01-07 15:50:49 -08:00
Ned Bass	49ee64e5e6	Remove duplicate typedefs from trace.h Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate typedef declarations with the same type. The trace.h header contains some typedefs to avoid 'unknown type' errors for C files that haven't declared the type in question. But this causes build failures for C files that have already declared the type. Newer versions of GCC (e.g. v4.6) allow duplicate typedefs with the same type unless pedantic error checking is in force. To support the older versions we need to remove the duplicate typedefs. Removal of the typedefs means we can't built tracepoints code using those types unless the required headers have been included. To facilitate this, all tracepoint event declarations have been moved out of trace.h into separate headers. Each new header is explicitly included from the C file that uses the events defined therein. The trace.h header is still indirectly included form zfs_context.h and provides the implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces. This makes those interfaces readily available throughout the code base. The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also still provided by trace.h, so it is a prerequisite for the other trace_*.h headers. These new Linux implementation-specific headers do introduce a small divergence from upstream ZFS in several core C files, but this should not present a significant maintenance burden. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2953	2015-01-06 16:53:24 -08:00
Brian Behlendorf	74328ee18f	Fix zfs_putpage() lock inversion There exists a lock inversions involving the zfs range lock and the individual page writeback bits which can result in a deadlock. To prevent this we must always manipulate the writeback bit while holding the range lock. The exact deadlock is as follows: ------ Process A ------ ------ Process B ------ zpl_writepages zpl_fallocate write_cache_pages zpl_fallocate_common zpl_putpage zfs_space zfs_putpage (set bit) zfs_freesp zfs_range_lock (wait on lock) zfs_free_range (take lock) [has not yet initiated I/O, truncate_inode_pages_range the bit will not be cleared] wait_on_page_writeback (wait on bit) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <richard.yao@clusterhq.com> Issue #2976	2014-12-22 09:31:56 -08:00
Boris Protopopov	9063f65476	Correct error returns to unify cross-pool operation error handling Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2911	2014-12-19 10:51:24 -08:00
Brian Behlendorf	c944be5d7e	Fix snapshots with dirty inodes Filesystems which are mounted read-only or are immutable because they are snapshots must not be allowed to dirty and inode. This will result in a write which will correctly cause a kernel panic because these filesystem are (and must be) immutable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2812	2014-11-20 10:38:16 -08:00
Isaac Huang	29b763cd2c	bio_alloc() with __GFP_WAIT never returns NULL Mark the error handling branch as unlikely() because the current kernel interface can never return NULL. However, we want to keep the error handling in case this behavior changes in the futre. Plus fix a small style issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #2703	2014-11-19 12:50:49 -05:00
Ned Bass	aaed7c408c	Explicitly include SPL compat headers Inclusion of SPL compatibility headers was moved out of the public header sys/types.h to avoid conflicts with external packages. Include a few compatiblity headers explicitly to cope with that change. Also, sort some linux-specific inclusions alphabetically. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2898	2014-11-19 12:30:39 -05:00
Ned Bass	7b2d78a046	Fix improper null-byte termination handling Fix a few cases where null-byte termination of strings was done unnecessarily or incorrectly. - The snprintf() function always produces a null-byte terminated string for non-negative return values, so it is not necessary to write out a null-byte as a separate step. - Also, it is unsafe to use the return value of snprintf() as an offset for placing a null-byte, because if the output was truncated the return value is the number of bytes that _would_ have been written had enough space been available. Therefore the return value may index beyond the array boundaries. - Finally, snprintf() accounts for the null-byte when limiting its output size, so there is no need to pass it a size parameter that is one less than the buffer size. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2875	2014-11-17 15:28:59 -08:00
smh	89b1cd6581	Prevent ZFS leaking pool free space When processing async destroys ZFS would leak space every txg timeout (5 seconds by default), if no writes occurred, until the pool is totally full. At this point it would be unfixable without a pool recreation. In addition if the machine was rebooted with the pool in this situation would fail to import on boot, hanging indefinitely, as the import process requires the ability to write data to the pool. Any attempts to query the pool status during the hung import would not return as the import holds the pool lock. The only way to import such a pool would be to specify -o readonly=on to the zpool import. zdb -bb <pool> can be used to check for "deferred free" size which is where this lost space will be counted. References: https://github.com/freebsd/freebsd/commit/48431b7 http://svnweb.freebsd.org/base?view=revision&revision=273158 https://reviews.csiden.org/r/132/ Porting notes: This issue was filed as illumos 5347 and a more comprehensive fix is under review. Once that change is finalized it will be integrated, in the meanwhile the FreeBSD fix has been merged to prevent the issue. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Ahrens mahrens@delphix.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2896	2014-11-17 11:35:38 -08:00
Tim Chase	4254acb057	Undirty freed spill blocks. If a spill block's dbuf hasn't yet been written when a spill block is freed, the unwritten version will still be written. This patch handles the case in which a spill block's dbuf is freed and undirties it to prevent it from being written. The most common case in which this could happen is when xattr=sa is being used and a long xattr is immediately replaced by a short xattr as in: setfattr -n user.test -v very_very_very..._long_value <file> setfattr -n user.test -v short_value <file> The first value must be sufficiently long that a spill block is generated and the second value must be short enough to not require a spill block. In practice, this would typically happen due to internal xattr operations as a result of setting acltype=posixacl. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2663 Closes #2700 Closes #2701 Closes #2717 Closes #2863 Closes #2884	2014-11-17 11:25:48 -08:00
Prakash Surya	0b39b9f96f	Swap DTRACE_PROBE* with Linux tracepoints This patch leverages Linux tracepoints from within the ZFS on Linux code base. It also refactors the debug code to bring it back in sync with Illumos. The information exported via tracepoints can be used for a variety of reasons (e.g. debugging, tuning, general exploration/understanding, etc). It is advantageous to use Linux tracepoints as the mechanism to export this kind of information (as opposed to something else) for a number of reasons: * A number of external tools can make use of our tracepoints "automatically" (e.g. perf, systemtap) * Tracepoints are designed to be extremely cheap when disabled * It's one of the "accepted" ways to export this kind of information; many other kernel subsystems use tracepoints too. Unfortunately, though, there are a few caveats as well: * Linux tracepoints appear to only be available to GPL licensed modules due to the way certain kernel functions are exported. Thus, to actually make use of the tracepoints introduced by this patch, one might have to patch and re-compile the kernel; exporting the necessary functions to non-GPL modules. * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux tracepoints are not available for unsigned kernel modules (tracepoints will get disabled due to the module's 'F' taint). Thus, one either has to sign the zfs kernel module prior to loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or newer. Assuming the above two requirements are satisfied, lets look at an example of how this patch can be used and what information it exposes (all commands run as 'root'): # list all zfs tracepoints available $ ls /sys/kernel/debug/tracing/events/zfs enable filter zfs_arc__delete zfs_arc__evict zfs_arc__hit zfs_arc__miss zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write zfs_new_state__mfu zfs_new_state__mru # enable all zfs tracepoints, clear the tracepoint ring buffer $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable $ echo 0 > /sys/kernel/debug/tracing/trace # import zpool called 'tank', inspect tracepoint data (each line was # truncated, they're too long for a commit message otherwise) $ zpool import tank $ cat /sys/kernel/debug/tracing/trace \| head -n35 # tracer: nop # # entries-in-buffer/entries-written: 1219/1219 #P:8 # # _-----=> irqs-off # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / delay # TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\| \| \| lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr... z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr... z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr... z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr... z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr... z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr... z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr... z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ... To highlight the kind of detailed information that is being exported using this infrastructure, I've taken the first tracepoint line from the output above and reformatted it such that it fits in 80 columns: lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr { dva 0x1:0x40082 birth 15491 cksum0 0x163edbff3a flags 0x640 datacnt 1 type 1 size 2048 spa 3133524293419867460 state_type 0 access 0 mru_hits 0 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 1 } bp { dva0 0x1:0x40082 dva1 0x1:0x3000e5 dva2 0x1:0x5a006e cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00 lsize 2048 } zb { objset 0 object 0 level -1 blkid 0 } For the specific tracepoint shown here, 'zfs_arc__miss', data is exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!). This kind of precise and detailed information can be extremely valuable when trying to answer certain kinds of questions. For anybody unfamiliar but looking to build on this, I found the XFS source code along with the following three web links to be extremely helpful: * http://lwn.net/Articles/379903/ * http://lwn.net/Articles/381064/ * http://lwn.net/Articles/383362/ I should also node the more "boring" aspects of this patch: * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to support a sixth paramter. This parameter is used to populate the contents of the new conftest.h file. If no sixth parameter is provided, conftest.h will be empty. * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced. This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro, except it has support for a fifth option that is then passed as the sixth parameter to ZFS_LINUX_COMPILE_IFELSE. These autoconf changes were needed to test the availability of the Linux tracepoint macros. Due to the odd nature of the Linux tracepoint macro API, a separate ".h" must be created (the path and filename is used internally by the kernel's define_trace.h file). * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This is to determine if we can safely enable the Linux tracepoint functionality. We need to selectively disable the tracepoint code due to the kernel exporting certain functions as GPL only. Without this check, the build process will fail at link time. In addition, the SET_ERROR macro was modified into a tracepoint as well. To do this, the 'sdt.h' file was moved into the 'include/sys' directory and now contains a userspace portion and a kernel space portion. The dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as well. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:55 -08:00
Ned Bass	29e57d15c8	Fix dprintf format specifiers Fix a few dprintf format specifiers that disagreed with their argument types. These came to light as compiler errors when converting dprintf to use the Linux trace buffer. Previously this wasn't a problem, presumably because the SPL debug logging uses vsnprintf which must perform automatic type conversion. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:45 -08:00
Ned Bass	59ec819a0c	Move a few internal ARC strucutres to arc_impl.h Add a new file named arc_impl.h and move a few internal ARC structure definitions into this file. This is needed in order to allow the Linux tracepoint functions to grub around in the internals of these structures. Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:27 -08:00
Prakash Surya	fb42a49328	Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO 5213 panic in metaslab_init due to space_map_open returning ENXIO Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com References: https://www.illumos.org/issues/5213 https://reviews.csiden.org/r/110 Porting notes: For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2745	2014-11-14 15:37:45 -08:00
Chris Wedgwood	b31d8ea77c	Reduce buf/dbuf mutex contention Due to evidence of contention both the buf_hash_table and the dbuf_hash_table sizes have been increased from 256 to 8192. This increase in hash table size adds approximating 0.5M to our fixed memory footprint. This relatively small increase is not expected to cause problems even on low memory machines. This footprint will also become dynamic when the persistent L2ARC support is finalized. In the meanwhile, this small change significantly reduces contention for certain workloads. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #1291	2014-11-14 14:59:21 -08:00
Alex Zhuravlev	0f69910833	Export symbols for ZIL interface These symbols are needed by consumers (i.e. Lustre) who wish to integrate with the ZIL. In addition the zil_rollback_destroy() prototype was removed because the implementation of this function was removed long ago. Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2892	2014-11-14 14:39:43 -08:00
Alexander Pyhalov	bb9d808c5a	Fix modules installation directory When building zfs modules with kernel, compiled from deb.src, the packaging process ends up installing the modules in the wrong place. Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2822	2014-10-28 09:46:14 -07:00
Tim Chase	ed6e9cc235	Linux 3.12 compat: shrinker semantics The new shrinker API as of Linux 3.12 modifies "struct shrinker" by replacing the @shrink callback with the pair of @count_objects and @scan_objects. It also requires the return value of @count_objects to return the number of objects actually freed whereas the previous @shrink callback returned the number of remaining freeable objects. This patch adds support for the new @scan_objects return value semantics. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2837	2014-10-28 09:34:51 -07:00
Matthew Ahrens	9635861742	Illumos 5164-5165 - space map fixes 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled space map_histogram feature Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5164 https://www.illumos.org/issues/5165 https://github.com/illumos/illumos-gate/commit/b1be289 Porting Notes: The metaslab_fragmentation() hunk was dropped from this patch because it was already resolved by commit `8b0a084`. The comment modified in metaslab.c was updated to use the correct variable name, space_map_blksz. The upstream commit incorrectly used space_map_blksize. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Alex Reece	b02fe35d37	Illumos 4958 zdb trips assert on pools with ashift >= 0xe 4958 zdb trips assert on pools with ashift >= 0xe Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4958 https://github.com/illumos/illumos-gate/commit/2a104a5 Porting notes: Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present in Linux but not yet in *BSD. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Brian Behlendorf	5f6d0b6f5a	Handle block pointers with a corrupt logical size The general strategy used by ZFS to verify that blocks are valid is to checksum everything. This has the advantage of being extremely robust and generically applicable regardless of the contents of the block. If a blocks checksum is valid then its contents are trusted by the higher layers. This system works exceptionally well as long as bad data is never written with a valid checksum. If this does somehow occur due to a software bug or a memory bit-flip on a non-ECC system it may result in kernel panic. One such place where this could occur is if somehow the logical size stored in a block pointer exceeds the maximum block size. This will result in an attempt to allocate a buffer greater than the maximum block size causing a system panic. To prevent this from happening the arc_read() function has been updated to detect this specific case. If a block pointer with an invalid logical size is passed it will treat the block as if it contained a checksum error. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2678	2014-10-23 09:20:52 -07:00
Ned Bass	bc151f7b31	Remove checks for mandatory locks The Linux VFS handles mandatory locks generically so we shouldn't need to check for conflicting locks in zfs_read(), zfs_write(), or zfs_freesp(). Linux 3.18 removed the lock_may_read() and lock_may_write() interfaces which we were relying on for this purpose. Rather than emulating those interfaces we remove the redundant checks. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2804	2014-10-22 11:06:53 -07:00
Matthew Ahrens	88904bb3e3	Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy 5162 zfs recv should use loaned arc buffer to avoid copy Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5162 https://github.com/illumos/illumos-gate/commit/8a90470 Porting notes: Fix spelling error 's/arena/area/' in dmu.c. In restore_write() declare bonus and abuf at the top of the function. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2696	2014-10-21 16:32:11 -07:00
Matthew Ahrens	4b20a6f509	Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness When a clone is created of a snapshot that has been marked for deferred destroy (with "zfs destroy -d"), the clone "inherits" the defer_destroy flag from the origin, and any snapshots of the clone "inherit" the defer_destroy flag from the clone. This causes a strange situation where the clone's snapshots are marked for defer_destroy but they have no holds or clones. If the clone's snapshot gets a hold or clone, which is then deleted, we will honor the incorrectly-set defer_destroy flag and delete the snapshot! Steps to reproduce: * zpool create test c1t1d0 * zfs create test/fs * zfs snapshot test/fs@a * zfs clone test/fs@a test/clone * zfs destroy -d test/fs@a * zfs clone test/fs@a test/clone2 * zfs snapshot test/clone2@a * zfs hold hld test/clone2@a * zfs release hld test/clone2@a * zfs list -r -t all test <test/clone2@a has been destroyed> We noticed that this causes dcenter to get very confused, because it treats snapshots that are marked defer_destroy as not existing. So it won't see any snapshots of the clone that's marked defer_destroy. 5150 - zfs clone of a defer_destroy snapshot causes strangeness Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/projects/illumos-gate//issues/5150 https://github.com/illumos/illumos-gate/commit/42fcb65 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2690	2014-10-21 15:26:58 -07:00
Matthew Ahrens	6c59307a3c	Illumos 3693 - restore_object uses at least two transactions to restore an object Restore_object should not use two transactions to restore an object: * one transaction is used for dmu_object_claim * another transaction is used to set compression, checksum and most importantly bonus data * furthermore dmu_object_reclaim internally uses multiple transactions * dmu_free_long_range frees chunks in separate transactions * dnode_reallocate is executed in a distinct transaction The fact the dnode_allocate/dnode_reallocate are executed in one transaction and bonus (re-)population is executed in a different transaction may lead to violation of ZFS consistency assertions if the transactions are assigned to different transaction groups. Also, if the first transaction group is successfully written to a permanent storage, but the second transaction is lost, then an invalid dnode may be created on the stable storage. 3693 restore_object uses at least two transactions to restore an object Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com> Approved by: Robert Mustacchi <rm@joyent.com> Original authors: Matthew Ahrens and Andriy Gapon References: https://www.illumos.org/issues/3693 https://github.com/illumos/illumos-gate/commit/e77d42e Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2689	2014-10-21 15:26:50 -07:00
Tim Chase	356d9ed4c8	Don't perform ACL-to-mode translation on empty ACL In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to create a traditional mode value based on an ACL. If no ACL exists, this processing shouldn't be done. Problems caused by this were most evident on version 4 filesystems which not only don't have system attributes, and also frequently have empty ACLs. On such filesystems, performing a chown() operation could have the effect of dirtying the mode bits in memory but not on the file system as follows: # create a file with typical mode of 664 echo test > test chown anyuser test ls -l test and the mode will show up as all zeroes. Unmounting/mounting and/or exporting/importing the filesystem will reveal the proper mode again. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1264	2014-10-21 09:23:27 -07:00
Daniil Lunev	62bdd5eb7a	Illumos 4924 - LZ4 Compression for metadata Reviewed by Matthew Ahrens <mahrens@delphix.com> Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://github.com/illumos/illumos-gate/commit/b8289d2 https://www.illumos.org/issues/3756 Porting notes: The static function zfs_prop_activate_feature() was removed because this change removes the only caller. The function was not removed from Illumos but instead left as dead code. However, to keep gcc happy it was removed from Linux and may be easily restored if needed. Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1540	2014-10-20 16:17:49 -07:00
Brian Behlendorf	ba232d8aea	Suppress AIO kmem warnings The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc() to allocate enough memory to hold the vectorized IO. While this allocation will be small it's been observed in practice to sometimes slightly exceed the 8K warning threshold by a few kilobytes. Therefore, the KM_NODEBUG flag has been added to suppress warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2774	2014-10-20 16:10:25 -07:00
Brian Behlendorf	33074f2254	Handle NULL mirror child vdev When selecting a mirror child it's possible that map allocated by vdev_mirror_map_allc() contains a NULL for the child vdev. In this case the child should be skipped and the read issues to another member of the mirror. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #1744	2014-10-17 14:59:05 -07:00
Brian Behlendorf	f0e324f25d	Update utsname support Modify the code to use the utsname() kernel function rather than a global variable. This results is cleaner more portable code because utsname() is already provided by the kernel and can be easily emulated in user space via uname(2). This means that it will behave consistently in both contexts. This is also has the benefit that it allows the removal of a few _KERNEL pre-processor conditions. And it also is a pre-requisite for a proper FUSE port because we need to provide a valid utsname. Finally, it allows us to remove this functionality from the SPL and all the related compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:57 -07:00
Brian Behlendorf	050d22b068	Remove shrink_dcache_memory() and shrink_icache_memory() This functionality is optional and until Linux 3.0, which provided per-filesystem shinkers, they was never a reasonable interface. Therefore, this functionality is being dropped for earlier kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:50 -07:00
Brian Behlendorf	60bba62814	Update code to use misc_register()/misc_deregister() When ZPIOS was originally written it was designed to use the device_create() and device_destroy() functions. Unfortunately, these functions changed considerably over the years making them difficult to rely on. As it turns out a better choice would have been to use the misc_register()/misc_deregister() functions. This interface for registering character devices has remained stable, is simple, and provides everything we need. Therefore the code has been reworked to use this interface. The higher level ZFS code has always depended on these same interfaces so this is also as a step towards minimizing our kernel dependencies. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2757	2014-10-17 14:58:44 -07:00
Brian Behlendorf	e6659763c6	Improve VERIFY() error in dmu_write() This is a debug patch designed to ensure an error code is logged to the console when this VERIFY() is hit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #1440	2014-10-08 09:18:14 -07:00
Brian Behlendorf	8878261fc9	Fix CPU_SEQID use in preemptible context Commit `e022864` introduced a regression for kernels which are built with CONFIG_DEBUG_PREEMPT. The use of CPU_SEQID in a preemptible context causes zio_nowait() to trigger the BUG. Since CPU_SEQID is simply being used as a random index the usage here is safe. To resolve the issue preempt is disable while calling CPU_SEQID. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2769	2014-10-07 16:40:29 -07:00
Matthew Ahrens	e022864d19	Illumos 5176 - lock contention on godfather zio 5176 lock contention on godfather zio Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5176 https://github.com/illumos/illumos-gate/commit/6f834bc Porting notes: Under Linux max_ncpus is defined as num_possible_cpus(). This is largest number of cpu ids which might be available during the life time of the system boot. This value can be larger than the number of present cpus if CONFIG_HOTPLUG_CPU is defined. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2711	2014-10-07 11:24:24 -07:00
Richard Yao	83e9986f6e	Implement -t option to zpool create for temporary pool names Creating virtual machines that have their rootfs on ZFS on hosts that have their rootfs on ZFS causes SPA namespace collisions when the standard name rpool is used. The solution is either to give each guest pool a name unique to the host, which is not always desireable, or boot a VM environment containing an ISO image to install it, which is cumbersome. `26b42f3f9d` introduced `zpool import -t ...` to simplify situations where a host must access a guest's pool when there is a SPA namespace conflict. We build upon that to introduce `zpool import -t tname ...`. That allows us to create a pool whose in-core name is tname, but whose on-disk name is the normal name specified. This simplifies the creation of machine images that use a rootfs on ZFS. That benefits not only real world deployments, but also ZFSOnLinux development by decreasing the time needed to perform rootfs on ZFS experiments. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2417	2014-09-30 10:46:59 -07:00
Tim Chase	cb08f06307	Perform whole-page page truncation for hole-punching under a range lock As an attempt to perform the page truncation more optimally, the hole-punching support added in `223df0161f` truncated performed the operation in two steps: first, sub-page "stubs" were zeroed under the range lock in zfs_free_range() using the new zfs_zero_partial_page() function and then the whole pages were truncated within zfs_freesp(). This left a window of opportunity during which the full pages could be touched. This patch closes the window by moving the whole-page truncation into zfs_free_range() under the range lock. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2733	2014-09-29 09:22:03 -07:00
Max Grossman	36283ca233	Illumos 5138 - add tunable for maximum number of blocks freed in one txg Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5138 https://github.com/illumos/illumos-gate/commit/af3465d Porting notes: Because support for exposing a uint64_t parameter wasn't added until v3.17-rc1 the zfs_free_max_blocks variable has been declared as a unsigned long. This is already far larger than required and it allows us to avoid additional autoconf compatibility code. The default value has been set to 100,000 on Linux instead of ULONG_MAX which is used on Illumos. This was done to limit the number of outstanding IOs in the system when snapshots are destroyed. This helps ensure individual TXG sync times are kept reasonable and memory isn't wasted managing a huge backlog of outstanding IOs. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2675 Closes #2581	2014-09-23 14:26:34 -07:00
Alex Reece	acbad6ff67	Illumos 4753 - increase number of outstanding async writes when sync task is waiting Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4753 https://github.com/illumos/illumos-gate/commit/73527f4 Comments by Matt Ahrens from the issue tracker: When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number). Initially we might just have a tunable for "minimum async writes while a synctask is waiting" and set it to 3. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2716	2014-09-23 13:50:55 -07:00
Matthew Ahrens	d97aa48f7c	Illumos 5139 - SEEK_HOLE failed to report a hole at end of file 5139 SEEK_HOLE failed to report a hole at end of file Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Peng Dai <peng.dai@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5139 https://github.com/illumos/illumos-gate/commit/0fbc0cd Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2714	2014-09-23 10:38:45 -07:00
Richard Yao	485c581c41	Fix function call with uninitialized value in vdev_inuse LLVM's static analyzer reported that we could pass an uninitialized pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct. An attempt to repurpose a spare or L2ARC drive from an exported pool will cause the pool_guid passed to spa_by_guid() to be unintialized information from the stack. This will cause non-deterministic behavior. Since there is no reason why we cannot repurpose such disks, we modify vdev_inuse() to avoid calling spa_by_guid() when they are detected. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2330	2014-09-23 10:32:45 -07:00

1 2 3 4 5 ...

1013 Commits