Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	8fe1fb14cb	Handle block pointers with a corrupt logical size Commit `5f6d0b6` was originally added to gracefully handle block pointers with a damaged logical size. However, it incorrectly assumed that all passed arc_done_func_t could handle a NULL arc_buf_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4069 Closes #4080	2016-09-09 13:21:10 -07:00
Brian Behlendorf	bf8b4a9fd5	Linux 4.8 compat: Fix removal of bio->bi_rw member All users of bio->bi_rw have been replaced with compatibility wrappers. This allows the kernel specific logic to be abstracted away, and for each of the supported cases to be documented with the wrapper. The updated interfaces are as follows: * void blk_queue_set_write_cache(struct request_queue , bool, bool) boolean_t bio_is_flush(struct bio ) boolean_t bio_is_fua(struct bio ) boolean_t bio_is_discard(struct bio ) boolean_t bio_is_secure_erase(struct bio ) VDEV_WRITE_FLUSH_FUA Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-09-09 13:21:10 -07:00
Brian Behlendorf	39a78fe9d4	Linux 4.8 compat: posix_acl_valid() The posix_acl_valid() function has been updated to require a user namespace. Filesystem callers should normally provide the user_ns from the super block associcated with the ACL; the zpl_posix_acl_valid() wrapper has been added for this purpose. See https://github.com/torvalds/linux/commit/0d4d717f for complete details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-09-09 13:21:10 -07:00
Chunwei Chen	6ae0dbdc8a	Linux 4.8 compat: REQ_OP and bio_set_op_attrs() New REQ_OP_* definitions have been introduced to separate the WRITE, READ, and DISCARD operations from the flags. This included changing the encoding of bi_rw. It places REQ_OP_* in high order bits and other stuff in low order bits. This encoding is done through the new helper function bio_set_op_attrs. For complete details refer to: https://github.com/torvalds/linux/commit/f215082 https://github.com/torvalds/linux/commit/4e1b2d5 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4892 Closes #4899	2016-09-09 13:21:10 -07:00
Brian Behlendorf	68b8d22c6e	Linux 4.8 compat: submit_bio() The rw argument has been removed from submit_bio/submit_bio_wait. Callers are now expected to set bio->bi_rw instead of passing it in. See https://github.com/torvalds/linux/commit/4e49ea4a for complete details. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4892 Issue #4899	2016-09-09 13:21:09 -07:00
smh	3d824a8878	FreeBSD rS271776 - Persist vdev_resilver_txg changes Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. Authored-by: smh <smh@FreeBSD.org> Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5154 FreeBSD-issue: https://reviews.freebsd.org/rS271776 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf Closes #4790	2016-09-09 13:21:09 -07:00
GeLiXin	c23686524f	Fix incorrect pool state after import Import a raidz pool which has a vdev with a bad label, zpool status shows the right state of the dev, but the wrong state of the pool. The pool state should be DEGRADED, not ONLINE. We examine the label in vdev_validate while in spa_load_impl, the bad label can be detected but doesn't propagate its state to the parent. There are other chances to propagate state in the following vdev_load if we failed to load DTL, but our pool is raidz1 which can tolerate a faulted disk. So we lost the last chance to correct the pool state. Propagate the leaf vdev's state to parent if its label was corrupted, as is done elsewhere in vdev_validate. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #4948	2016-09-09 13:21:09 -07:00
GeLiXin	74acdfc682	Fix self-healing IO prior to dsl_pool_init() completion Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes(). George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized. However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently. Signed-off-by: GeLiXin <47034221@qq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4652	2016-09-09 13:21:09 -07:00
Paul Dagnelie	d9e1eec9a2	OpenZFS 6876 - Stack corruption after importing a pool with a too-long name Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. OpenZFS-issue: https://www.illumos.org/issues/6876 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e	2016-09-09 13:21:09 -07:00
Matthew Ahrens	1421562a0d	OpenZFS 7263 - deeply nested nvlist can overflow stack nvlist_pack() and nvlist_unpack are implemented recursively, which can cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist which contains an nvlist, which contains an nvlist, which... Unprivileged users can pass an nvlist to the kernel via certain ioctls on /dev/zfs, which the kernel will unpack without additional permission checking or validation. Therefore, an unprivileged user can cause the kernel's stack to overflow and panic. Ideally, these functions would be implemented non-recursively. As a quick fix, this patch limits the depth of the recursion and returns an error when attempting to pack and unpack a deeply-nested nvlist. Signed-off-by: Adam Leventhal <ahl@delphix.com> Signed-off-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Prakash Surya <prakash.surya@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7263 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d -	2016-09-09 13:21:09 -07:00
Chunwei Chen	58000c3ec7	Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4846	2016-09-09 13:21:09 -07:00
Tim Chase	e871059bc4	Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4837	2016-09-09 13:21:09 -07:00
Chunwei Chen	91f81c42f0	fh_to_dentry should return ESTALE when generation mismatch When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-09-09 13:21:09 -07:00
Chunwei Chen	2ab9247411	Don't allow accessing XATTR via export handle Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-09-09 13:21:09 -07:00
Chunwei Chen	af4e50750b	Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4705 Issue #4708	2016-09-09 13:21:09 -07:00
Chunwei Chen	3602878ff7	Fix memleak in zpl_parse_options strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything. unreferenced object 0xffff8800883976c0 (size 64): comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s) hex dump (first 32 bytes): 72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a rw.strictatime.z 66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d fsutil.mntpoint= backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811f9cac>] __kmalloc+0x16c/0x250 [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl] [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs] [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs] [<ffffffff81222dc8>] mount_fs+0x38/0x160 [<ffffffff81240097>] vfs_kern_mount+0x67/0x110 [<ffffffff812428e0>] do_mount+0x250/0xe20 [<ffffffff812437d5>] SyS_mount+0x95/0xe0 [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4706 Issue #4708	2016-09-09 13:21:09 -07:00
Chunwei Chen	9f5f758d77	Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by force the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4687 Closes #4690	2016-09-09 13:21:09 -07:00
Chunwei Chen	d5b0e7fcf1	Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will holds an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-09-09 13:21:09 -07:00
Chunwei Chen	ec9b8fae06	Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts `e89260a`. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-09-09 13:21:09 -07:00
Chunwei Chen	f7923f4ada	xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-09-09 13:21:09 -07:00
Rich Ercolani	3a8e13688b	Add tunable to ignore hole_birth (enabled by default) Adds a module option which disables the hole_birth optimization which has been responsible for several recent bugs, including issue #4050. Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4833	2016-09-09 13:20:54 -07:00
Peng	4f96e68fad	Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3937	2016-09-05 16:07:09 -07:00
Chunwei Chen	a77cea5f0f	Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4752	2016-09-05 16:07:09 -07:00
Tim Chase	db3f5edcf1	Linux 4.6 compat: Fall back to d_prune_aliases() if necessary As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliaes() method in case the kernel's per-superblock shrinker is not able to free anything. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes: #4726	2016-09-05 16:07:09 -07:00
Chunwei Chen	f3f0c589c3	Skip ctldir znode in zfs_rezget to fix snapdir issues Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone automount it again as long as someone is still using the detached mount. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4514 Closes #4661 Closes #4672	2016-09-05 16:07:08 -07:00
Chunwei Chen	26e2bfa770	Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointer Starting from Linux 4.7, get_acl will set acl cache pointer to temporary sentinel value before calling i_op->get_acl. Therefore we can't compare against ACL_NOT_CACHED and return. Since from Linux 3.14, get_acl already check the cache for us, so we disable this in zpl_get_acl. Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4944 Closes #4946	2016-09-05 16:07:08 -07:00
Brian Behlendorf	97a1bbd4ea	Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING configure checks, all supported kernel provide this functionality. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-09-05 16:07:08 -07:00
Chunwei Chen	7043281906	Linux 4.7 compat: use iterate_shared for concurrent readdir Register iterate_shared if it exists so the kernel will used shared lock and allowing concurrent readdir. Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE to allow concurrent seeking. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4664 Closes #4665	2016-09-05 16:07:08 -07:00
Chunwei Chen	1aff4bb235	Linux 4.7 compat: replace blk_queue_flush with blk_queue_write_cache Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4665	2016-09-05 16:07:08 -07:00
Chunwei Chen	703c9f5893	Remove dummy znode from zvol_state struct zvol_state contains a dummy znode, which is around 1KB on x64, only for zfs_range_lock. But in reality, other than z_range_lock and z_range_avl, zfs_range_lock only need znode on regular file, which means we add 1KB on a structure and gain nothing. In this patch, we remove the dummy znode for zvol_state. In order to do that, we also need to refactor zfs_range_lock a bit. We move z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t. This new struct replaces znode_t as the main handle inside the range lock functions. We also add pointers to z_size, z_blksz, and z_max_blksz so range lock code doesn't depend on znode_t. This allows non-ZPL consumers like Lustre to use the range locks with their equivalent znode_t structure. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4510	2016-09-05 16:07:08 -07:00
Brian Behlendorf	efde19487c	Fix ztest truncated cache file Commit `efc412b` updated spa_config_write() for Linux 4.2 kernels to truncate and overwrite rather than rename the cache file. This is the correct fix but it should have only been applied for the kernel build. In user space rename(2) is needed because ztest depends on the cache file. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4129	2016-09-05 16:07:08 -07:00
AndCycle	347cdb6e61	Obey arc_meta_limit default size when changing arc_max When decreasing the maximum ARC size preserve the 3/4 default ratio for the arc_meta_limit. Otherwise, the arc_meta_limit may be set the same as arc_max. Signed-off-by: AndCycle <andcycle@andcycle.idv.tw> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4001	2016-09-05 16:07:08 -07:00
Tim Chase	52475b507a	Enable PF_FSTRANS for ioctl secpolicy callbacks (#4571 ) At the very least, the zfs_secpolicy_write_perms ioctl security policy callback, which calls dsl_dataset_hold(), can require freeing memory and, therefore, re-enter ZFS. This patch enables PF_FSTRANS for all of the security policy callbacks similarly to the manner in which it's enabled for the actual ioctl callback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4554	2016-05-06 18:22:34 -04:00
Brian Behlendorf	2cb77346cb	Use udev for partition detection When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #4523 Closes #3708 Closes #4077 Closes #4144 Closes #4214 Closes #4517	2016-05-06 18:22:34 -04:00
Chunwei Chen	21ea9460fa	Remove wrong ASSERT in annotate_ecksum When using large blocks like 1M, there will be more than UINT16_MAX qwords in one block, so this ASSERT would go off. Also, it is possible for the histogram to overflow. We cap them to UINT16_MAX to prevent this. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4257	2016-05-01 18:29:51 -04:00
Colin Ian King	354424de5a	Add support 32 bit FS_IOC32_{GET\|SET}FLAGS compat ioctls We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS compat ioctls for systems such as powerpc64. We use the normal compat ioctl idiom as used by a variety of file systems to provide this support. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4477	2016-05-01 18:26:09 -04:00
Brian Behlendorf	d746e2ea0e	Linux 4.6 compat: PAGE_CACHE_SIZE removal As described in torvalds/linux@4a2d057e the macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced to make it possible to add bigger chunks to the page cache. This never panned out and it has therefore been removed from the kernel. ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros and calls to page_cache_release() have been replaced with put_page(). There was no need to introduce a configure check for this because these interfaces have existed for a very long time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4489	2016-05-01 18:24:30 -04:00
Colin Ian King	60a4ea3f94	Fix inverted logic on none elevator comparison Commit `d1d7e2689d` ("cstyle: Resolve C style issues") inverted the logic on the none elevator comparison. Fix this and make it cstyle warning clean. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4507	2016-05-01 18:23:29 -04:00
Ned Bass	6400ae85ee	Fix ZPL miswrite of default POSIX ACL Commit `4967a3e` introduced a typo that caused the ZPL to store the intended default ACL as an access ACL. Due to caching this problem may not become visible until the filesystem is remounted or the inode is evicted from the cache. Fix the typo. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4520	2016-05-01 18:19:54 -04:00
Chunwei Chen	31dbe4b404	Linux 4.5 compat: Use xattr_handler->name for acl Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to whole name rather than prefix should use "name" instead of "prefix". Otherwise, kernel will return with EINVAL when it tries to resolve handlers. Also, we remove the strcmp checks when xattr_handler has name, because xattr_resolve_name will do the check for us. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4549 Closes #4537	2016-05-01 18:17:57 -04:00
Boris Protopopov	d0337e80ca	Fix lock order inversion with zvol_open() zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zfs_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3681	2016-03-22 18:08:04 -07:00
Boris Protopopov	9d02f9557f	Add support for asynchronous zvol minor operations zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations address issue #3681. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2217 Closes #3678 Closes #3681	2016-03-22 18:08:04 -07:00
Boris Protopopov	682be6e0c9	Make zvol minor functionality more robust Close the race window in zvol_open() to prevent removal of zvol_state in the 'first open' code path. Move the call to check_disk_change() under zvol_state_lock to make sure the zvol_media_changed() and zvol_revalidate_disk() called by check_disk_change() are invoked with positive zv_open_count. Skip opened zvols when removing minors and set private_data to NULL for zvols that are not in use whose minors are being removed, to indicate to zvol_open() that the state is gone. Skip opened zvols when renaming minors to avoid modifying zv_name that might be in use, e.g. in zvol_ioctl(). Drop zvol_state_lock before calling add_disk() when creating minors to avoid deadlocks with zvol_open(). Wrap dmu_objset_find() with spl_fstran_mark()/unmark(). Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #4344	2016-03-22 17:54:07 -07:00
Richard Sharpe	82ff881071	Handling negative dentries in a CI file system. For a Case Insensitive file system we must avoid creating negative entries in the dentry cache. We must also pass the FIGNORECASE into zfs_lookup so that special files are handled correctly. We must also prevent negative dentries from being created when files are unlinked. Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.) Also tested with printks (now removed) to ensure that lookups come to zpl_lookup when negative should not exist. Tests: 1. ls Some-file.txt; touch some-file.txt; ls Some-file.txt and ensure no errors. 2. touch Some-file.txt; rm some-file.txt; ls Some-file.txt and ensure that the last ls shows log messages showing the lookup went all the way to zpl_lookup. Thanks to tuxoko for helping me get this correct. Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4243	2016-03-22 17:54:07 -07:00
Richard Sharpe	a99c845fdc	Fix casesensitivity=insensitive deadlock When casesensitivity=insensitive is set for the file system, we can deadlock in a rename if the user uses different case for each path. For example rename("A/some-file.txt", "a/some-file.txt"). The simple test for this is: 1. mkdir some-dir in a ZFS file system 2. touch some-dir/some-file.txt 3. mv Some-dir/some-file.txt some-dir/some-other-file.txt This last request deadlocks trying to relock the i_mutex on the inode for the parent directory. The solution is to use d_add_ci in zpl_lookup if we are on a file system that has the casesensitivity=insensitive attribute set. This patch checks if we are working on a case insensitive file system and if so, allocates storage for the case insensitive name and passes it to zfs_lookup and then calls d_add_ci instead of d_splice_alias. The performance impact seems to be minimal even though we have introduced a kmalloc and kfree in the lookup path. The problem was found when running Microsoft's FSCT against Samba on top of ZFS On Linux. Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4136	2016-03-22 17:54:07 -07:00
Paul Dagnelie	63ce7b6fcf	Illumos 6370 - ZFS send fails to transmit some holes 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Stefan Ring <stefanrin@gmail.com> Reviewed by: Steven Burgess <sburgess@datto.com> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6370 https://github.com/illumos/illumos-gate/commit/286ef71 In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Ported-by: Steven Burgess <sburgess@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4369 Closes #4050	2016-03-15 10:01:48 -07:00
Brian Behlendorf	9842008fc0	Linux 4.5 compat: xattr list handler The registered xattr .list handler was simplified in the 4.5 kernel to only perform a permission check. Given a dentry for the file it must return a boolean indicating if the name is visible. This differs slightly from the previous APIs which also required the function to copy the name in to the provided list and return its size. That is now all the responsibility of the caller. This should be straight forward change to make to ZoL since we've always required the caller to make the copy. However, this was slightly complicated by the need to support 3 older APIs. Yes, between 2.6.32 and 4.5 there are 4 versions of this interface! Therefore, while the functional change in this patch is small it includes significant cleanup to make the code understandable and maintainable. These changes include: - Improved configure checks for .list, .get, and .set interfaces. - Interfaces checked from newest to oldest. - Strict checking for each possible known interface. - Configure fails when no known interface is available. - HAVE__XATTR_LIST renamed HAVE_XATTR_LIST_ for consistency with similar iops and fops configure checks. - POSIX_ACL_XATTR_{DEFAULT\|ACCESS} were removed forcing callers to move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT\|ACCESS}. Compatibility wrapper were added for old kernels. - ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing ZPL_XATTR_{GET\|SET} WRAPPERs. Only the inode is guaranteed to be a valid pointer, passing NULL for the 'list' and 'name' variables is allowed and must be checked for. All .list functions were updated to use the wrapper to aid readability. - zpl_xattr_filldir() updated to use the .list function for its permission check which is consistent with the updated Linux 4.5 interface. If a .list function is registered it should return 0 to indicate a name should be skipped, if there is no registered function the name will be added. - Additional documentation from xattr(7) describing the correct behavior for each namespace was added before the relevant handlers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Issue #4228	2016-01-29 09:52:13 -08:00
Brian Behlendorf	b3c9e2caf5	Linux 4.5 compat: get_link() / put_link() The follow_link() interface was retired in favor of get_link(). In the process of phasing in get_link() the Linux kernel went through two different versions. The first of which depended on put_link() and the final version on a delayed done function. - Improved configure checks for .follow_link, .get_link, .put_link. - Interfaces checked from newest to oldest. - Strict checking for each possible known interface. - Configure fails when no known interface is available. - Both versions .get_link are detected and supported as well two previous versions of .follow_link. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Issue #4228	2016-01-29 09:52:13 -08:00
Tim Chase	6f7acfc9c9	Prevent arc_c collapse Adjusting arc_c directly is racy because it can happen in the context of multiple threads. It should always be >= 2 * maxblocksize. Set it to a known valid value rather than adjusting it directly. In addition refactor arc_shrink() to a simpler structure, protect against underflow in the calculation of the new arc_c value. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reverts: `935434ef` Closes: #3904 Closes: #4161	2016-01-29 09:52:12 -08:00
Chunwei Chen	a08add3067	Prevent duplicated xattr between SA and dir When replacing an xattr would cause overflowing in SA, we would fallback to xattr dir. However, current implementation don't clear the one in SA, so we would end up with duplicated SA. For example, running the following script on an xattr=sa filesystem would cause duplicated "user.1". -- dup_xattr.sh begin -- randbase64() { dd if=/dev/urandom bs=1 count=$1 2>/dev/null \| openssl enc -a -A } file=$1 touch $file setfattr -h -n user.1 -v `randbase64 5000` $file setfattr -h -n user.2 -v `randbase64 20000` $file setfattr -h -n user.3 -v `randbase64 20000` $file setfattr -h -n user.1 -v `randbase64 20000` $file getfattr -m. -d $file -- dup_xattr.sh end -- Also, when a filesystem is switch from xattr=sa to xattr=on, it will never modify those in SA. This would cause strange behavior like, you cannot delete an xattr, or setxattr would cause duplicate and the result would not match when you getxattr. For example, the following shell sequence. -- shell begin -- $ sudo zfs set xattr=sa pp/fs0 $ touch zzz $ setfattr -n user.test -v asdf zzz $ sudo zfs set xattr=on pp/fs0 $ setfattr -x user.test zzz setfattr: zzz: No such attribute $ getfattr -d zzz user.test="asdf" $ setfattr -n user.test -v zxcv zzz $ getfattr -d zzz user.test="asdf" user.test="asdf" -- shell end -- We fix this behavior, by first finding where the xattr resides before setxattr. Then, after we successfully updated the xattr in one location, we will clear the other location. Note that, because update and clear are not in single tx, we could still end up with duplicated xattr. But by doing setxattr again, it can be fixed. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3472 Closes #4153	2016-01-29 09:52:12 -08:00

1 2 3 4 5 ...

1084 Commits