Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Brian Behlendorf	f95e647891	Speed up zvol import and export speed Speed up import and export speed by: * Add system delay taskq * Parallel prefetch zvol dnodes during zvol_create_minors * Parallel zvol_free during zvol_remove_minors * Reduce list linear search using ida and hash Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5433	2016-12-08 14:05:02 -07:00
Chunwei Chen	f200b83673	Add system_delay_taskq for long delay Add a dedicated system_delay_taskq for long delay like spa_deadman and zpl_posix_acl_free. This will allow us to use system_taskq in the manner of dispatch multiple tasks and call taskq_wait_outstanding. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #588	2016-12-08 14:00:20 -07:00
Brian Behlendorf	27f2b90d3e	Revert "Disable zio_dva_throttle_enabled by default" Enable zio_dva_throttle_enabled=1 by default. Subsequent testing has been unable to reproduce the suspected regression. Tested-by: kernelOfTruth kerneloftruth@gmail.com Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov Reverts #5335 Closes #5289 Closes #5457	2016-12-08 13:57:42 -07:00
Gvozden Neskovic	e8a2014436	Cache ddt_get_dedup_dspace() value if there was no ddt changes Save and reuse ddt dspace calculation when there have been no ddt changes. This avoids unnecessary traversal of 168KiB of ddt histograms. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5425	2016-12-02 16:59:35 -07:00
Brian Behlendorf	baf67d15a5	Refactor txg history kstat It was observed that even when the txg history is disabled by setting `zfs_txg_history=0` the txg_sync thread still fetches the vdev stats unnecessarily. This patch refactors the code such that vdev_get_stats() is no longer called when `zfs_txg_history=0`. And it further reduces the differences between upstream and the ZoL txg_sync_thread() function. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5412	2016-12-02 16:57:49 -07:00
Chunwei Chen	899662e344	zvol_remove_minors do parallel zvol_free On some kernel version, blk_cleanup_queue and put_disk will wait for more then 10ms. So a pool with a lot of zvols will easily wait for more then 1 min if we do zvol_free sequentially. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Requires-spl: refs/pull/588/head	2016-12-02 11:13:35 -08:00
Chunwei Chen	7ac557cef1	zpool_create_minors parallel prefetch Do parallel prefetch all zvol dnodes before actually creating each individual. This will greatly reduce the import time when having a lot of zvols and disk is slow. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-02 11:13:35 -08:00
Brian Behlendorf	b0319c1faa	OpenZFS 7143 - dbuf_read() creates unnecessary zio_root() for bonus buf dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. The fix is to only create/destroy the zio_root() in dbuf_read() if the blkptr is not NULL and not a hole. Changes sponsored by Intel Corp. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs/openzfs#137 Closes #4803 Closes #5382	2016-12-01 16:50:11 -07:00
luozhengzheng	ba712624d6	Fix incorrect operator in abd_alloc_sametype() This should be & and not \| so is_metadata is set correctly. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5438	2016-12-01 16:45:16 -07:00
cao	e2c7d3785a	Remove unused sa_update_from_cb() It looks like this was functionality which was added in the original SA implementation and then never needed. It can be safely removed now and easily added back if we find a use for it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5440	2016-12-01 16:39:06 -07:00
cao	6a8fd57fa7	Compile zio.h and zio_impl.h mutual include zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the header files to contain each other. Get rid of the zio_impl.h include in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5439	2016-12-01 16:36:25 -07:00
Chunwei Chen	d45e010dce	zvol: reduce linear list search Use kernel ida to generate minor number, and use hash table to find zvol with name. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-01 14:53:11 -08:00
Chunwei Chen	57ddcda164	Use system_delay_taskq for long delay tasks Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire. This free system_taskq from the above long delay tasks, and allow us to do taskq_wait_outstanding on system_taskq without being blocked forever, making system_taskq more generic and useful. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-01 14:52:48 -08:00
Chunwei Chen	493492559e	Limit number of tasks shown in taskq proc To prevent holding tq_lock for too long. Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq would easily cause a lockup. While that bug has been fixed. It's probably still a good idea to do this just in case task lists grow too large. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #586	2016-12-01 11:06:27 -07:00
Brian Behlendorf	a3fd9d9e15	Convert zio_buf_alloc() consumers In multiple cases zio_buf_alloc() was used instead of kmem_alloc() or vmem_alloc(). This was often done because the allocations could be large and it was easy to use zfs_buf_alloc() for them. But this isn't ideal for allocations which are small or short lived. In these cases it is better to use kmem_alloc() or vmem_alloc(). If possible we want to avoid the case where we have slabs allocated for kmem caches which are rarely used. Note for small allocations vmem_alloc() will be internally converted to kmem_alloc(). Therefore as long as large allocations are infrequent and short lived the penalty for using vmem_alloc() is small. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5409	2016-11-30 16:18:20 -07:00
Chunwei Chen	9829574834	ABD optimized page allocation code * Convert ABD to use the Linux Kernel scatterlist implementation instead of the hand rolled one from illumos. * Scatter ABDs are preferentially populated with higher order compound pages from a single zone. Allocation size is progressively decreased until it can be satisfied without performing reclaim or compaction. * An alternate page allocator is provided for kernels older than 3.6 and for CONFIG_HIGHMEM systems. This allocator is designed as a fallback for maximum compatibility. * Extended abdstats to provide visibility in the the allocator. * Add cached value for PAGESIZE in userspace. Contributions-by: Chunwei Chen <david.chen@osnexus.com> Gvozden Neskovic <neskovic@gmail.com> Jinshan Xiong <jinshan.xiong@intel.com> Isaac Huang <he.huang@intel.com> David Quigley <david.quigley@intel.com> Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-29 14:34:33 -08:00
Chunwei Chen	4f60152910	ABD kmap to kmap_atomic Convert usage of kmap to kmap_atomic while correctly saving off irq state.	2016-11-29 14:34:33 -08:00
Romain Dolbeau	88cc2352ea	ABD raidz NEON support Port NEON implementation of RAID-Z functions to ABD. Signed-off-by: Roomain Dolbeau <romain.dolbeau@atos.net>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	65d71d4212	ABD raidz avx512f support Implement shift based multiplication for 512f. Higher IPC over lookup based methods yields up to 40% better performance on the current hardware. Results on Xeon Phi(TM) CPU 7210: implementation gen_p gen_pq gen_pqr rec_p rec_q rec_r rec_pq rec_pr rec_qr rec_pqr original 142232671 24411492 12948205 283053705 22348167 4215911 9171609 2265548 2378370 1648495 scalar 295711162 49851491 33253815 293198109 88179448 61866752 27941684 25764416 17384442 12138153 sse2 410055998 199642658 117973654 406240463 152688682 121092250 84968180 79291076 47473657 20779719 ssse3 411641595 199669571 117937647 406211024 137638508 117050346 81263322 76120405 46281559 32696722 avx2 616485806 311515332 188595628 605455115 260602390 230554476 148198817 138800254 92273356 62937819 avx512f 832191523 408509425 253599522 810094481 404325734 317590971 218235687 197204920 133101937 94001219 fastest avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	cbf484f8ad	ABD Vectorized raidz Enable vectorized raidz code on ABD buffers. The avx512f, avx512bw, neon and aarch64_neonx2 are disabled in this commit. With the exception of avx512bw these implementations are updated for ABD in the subsequent commits. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	a206522c4f	ABD changes for vectorized RAIDZ * userspace: aligned buffers. Minimum of 32B alignment is needed for AVX2. Kernel buffers are aligned 512B or more. * add abd_get_offset_size() interface * abd_iter_map(): fix calculation of iter_mapsize * add abd_raidz_gen_iterate() and abd_raidz_rec_iterate() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Isaac Huang	b0be93e81a	ABD page support to vdev_disk.c Signed-off-by: Isaac Huang <he.huang@intel.com>	2016-11-29 14:34:32 -08:00
David Quigley	a6255b7fce	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
DeHackEd	7ca25051b6	Kernel 4.9 compat: file_operations->aio_fsync removal Linux kernel commit 723c038475b78 removed this field. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #5393	2016-11-15 09:20:46 -08:00
luozhengzheng	32dec7bd1a	Fix coverity defects: CID 147503 CID 147503: Dereference after null check (FORWARD_NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5326	2016-11-10 08:50:32 -08:00
cao	3bfd95d589	Fix coverity defects: CID 147540, 147542 CID 147540: unsigned_compare - Cast nsec to a int32_t to properly detect the expected overflow. CID 147542: unsigned_compare - intval can never be less than ZIO_FAILURE_MODE_WAIT which is defined to be zero. Remove this useless check. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5379	2016-11-09 17:35:26 -08:00
jxiong	126ae9f4e9	Export symbol dmu_objset_userobjspace_upgradable It's used by Lustre to determine if the objset can be upgraded. The inline version doesn't work because dmu_objset_is_snapshot() is not exported. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Closes #5385	2016-11-09 13:51:12 -08:00
tuxoko	0420c126ce	Linux 3.14 compat: assign inode->set_acl Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come from setxattr, which will handle by the acl xattr_handler, and we already handles that well. However, nfsd will directly calls inode->set_acl or return error if it doesn't exists. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Massimo Maggi <me@massimo-maggi.eu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5371 Closes #5375	2016-11-09 10:37:17 -08:00
cao	a36cc8d242	Fix coverity defects: CID 147626, 147628 CID 147626: Type:Dereference before null check CID 147628: Type:Dereference before null check Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5304	2016-11-08 14:28:17 -08:00
Don Brady	976246fadd	Add illumos FMD ZFS logic to ZED -- phase 2 The phase 2 work primarily entails the Diagnosis Engine and the Retire Agent modules. It also includes infrastructure to support a crude FMD environment to host these modules. The Diagnosis Engine consumes I/O and checksum ereports and feeds them into a SERD engine which will generate a corres- ponding fault diagnosis when the SERD engine fires. All the diagnosis state data is collected into cases, one case per vdev being tracked. The Retire Agent responds to diagnosed faults by isolating the faulty VDEV. It will notify the ZFS kernel module of the new VDEV state (degraded or faulted). This agent is also responsible for managing hot spares across pools. When it encounters a device fault or a device removal it replaces the device with an appropriate spare if available. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #5343	2016-11-07 15:01:38 -08:00
cao	f4bae2ed63	Fix coverity defects: CID 147575, 147577, 147578, 147579 CID 147575, Type:Unintentional integer overflow CID 147577, Type:Unintentional integer overflow CID 147578, Type:Unintentional integer overflow CID 147579, Type:Unintentional integer overflow Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5365	2016-11-07 14:54:32 -08:00
Chunwei Chen	8e71ab99dc	Batch free zpl_posix_acl_release Currently every calls to zpl_posix_acl_release will schedule a delayed task, and each delayed task will add a timer. This used to be fine except for possibly bad performance impact. However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In this new implementation, the larger the delay, the less accuracy the timer is. So when we have a flood of timer from zpl_posix_acl_release, they will expire at the same time. Couple with the fact that task_expire will do linear search with lock held. This causes an extreme amount of contention inside interrupt and would actually lockup the system. We fix this by doing batch free to prevent a flood of delayed task. Every call to zpl_posix_acl_release will put the posix_acl to be freed on a lockless list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free every posix_acl that passed the grace period on the list. This way, we only have one delayed task every second. [1] https://lwn.net/Articles/646950/ Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-07 11:04:44 -08:00
Brian Behlendorf	34328f3cf8	Allow 16M zio buffers in user space Only restrict the maximum zio alloc size to 32-bit kernel space. The same virtual address space limitations don't apply to user space. This resolves a memory allocation failure in raidz_test where it expects to be able to exercises all valid zio sizes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-07 10:26:17 -08:00
Romain Dolbeau	7f3194932d	Add superscalar fletcher4 This is the Fletcher4 algorithm implemented in pure C, but using multiple counters using algorithms identical to those used for SSE/NEON and AVX2. This allows for faster execution on core with strong superscalar capabilities but weak SIMD capabilities. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #5317	2016-11-04 10:53:03 -07:00
Chunwei Chen	ace1eae84c	Add support for O_TMPFILE Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on supported filesystem. It's basically doing open(2) and unlink(2) atomically. The filesystem support is added through i_op->tmpfile. We basically copy the create operation except we get rid of the link and name related stuff and add the new node to unlinked set. We also add support for linkat(2) to link tmpfile. However, since all previous file operation will skip ZIL, we force a txg_wait_synced to make sure we are sync safe. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Chunwei Chen	987014903f	Fix unlinked file cannot do xattr operations Currently, doing things like fsetxattr(2) on an unlinked file will result in ENODATA. There's two places that cause this: zfs_dirent_lock and zfs_zget. The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we need it to not return error when the zp is unlinked. This is a pretty big change in behavior, but skimming through all the callers, I don't think this change would cause any problem. Also there's nothing preventing z_unlinked from being set after the z_lock mutex is dropped before but before zfs_zget returns anyway. The rest of the stuff is to make sure we don't log xattr stuff when owner is unlinked. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Romain Dolbeau	7f547f85fe	Add parity generation/rebuild using AVX-512 for x86-64 avx512f should work on all AVX512 hardware, since it only uses Foundation instructions. avx512bw should be faster on hardware supporting the AVW512BW extension. We can use full-width pshufb (instead of relying on the 256 bits AVX2 pshufb). As a side-effect, the code is also unrolled more. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name> Closes #5219	2016-11-02 12:40:23 -07:00
BearBabyLiu	6d4210052b	Fix dsl_prop_get_all_dsl() memory leak On error dsl_prop_get_all_ds() does not free the nvlist it allocates. This behavior may have been intentional when originally written but is atypical and often confusing. Since no callers rely on this behavior the function has been updated to always free the nvlist on error. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5320	2016-11-02 12:34:10 -07:00
Brian Behlendorf	9edb36954a	Use vmem_size() for 32-bit systems On 32-bit Linux systems use vmem_size() to correctly size the ARC and better determine when IO should be throttle due to low memory. On 64-bit systems this change has no effect since the virtual address space available far exceeds the physical memory available. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
Brian Behlendorf	82ec9d41d8	Fix 32-bit maximum volume size A limit of 1TB exists for zvols on 32-bit systems. Update the code to correctly reflect this limitation in a similar manor as the OpenZFS implementation. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
Brian Behlendorf	4990e576c6	Enable .zfs/snapshot for 32-bit systems Originally the .zfs/snapshot directory was disabled for 32-bit systems because 64-bit inode numbers were not supported. This is no longer the case and this functionality can be enabled by default. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347 Closes #2002	2016-11-02 12:14:45 -07:00
Brian Behlendorf	48d3eb40c7	Add TASKQID_INVALID Add the TASKQID_INVALID macros and update callers to use the macro instead of testing against 0. There is no functional change even though the functions in zfs_ctldir.c incorrectly used -1 instead of 0. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
Ubuntu	cbba714667	Add TASKQID_INVALID and TASKQID_INITIAL macros Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the taskq implementation and test cases to use them. This is solely for the purposes of readability and introduces no functional change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-02 10:34:19 -07:00
Ubuntu	1b457bcbe5	Fix vmem_size() Add a minimal implementation of vmem_size() which accounts for the virtual memory usage of the SPL's kmem cache. This functionality is only useful on 32-bit systems with a small virtual address space. The following assumptions are made: 1) The major SPL consumer of virtual memory is the kmem cache. 2) Memory allocated with vmem_alloc() is short lived and can be ignored. 3) Allow a 4MB floor as a generous pad given normal consumption. 4) The spl_kmem_cache_sem only contends with cache create/destroy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-02 10:34:19 -07:00
cao	51c9163f98	Fix sa_legacy_attr_count to use ARRAY_SIZE Replace magic value 16 with ARRAY_SIZE() to correctly handle when the sa_legacy_attrs array size changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5354	2016-11-02 10:26:12 -07:00
cao	981b21260e	Fix coverity defects: CID 147553 CID 147553: Type:Dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5305	2016-11-01 10:20:24 -07:00
cao	b182ac00aa	Fix coverity defects: CID 152975 CID 152975: Type:Dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5322	2016-10-31 16:23:56 -07:00
GeLiXin	4aafab91c5	Fix coverity defects: CID 147509 CID 147509: Explicit null dereferenced - l2arc_sublist_lock is fragile as relied on caller too much. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Closes #5319	2016-10-31 16:04:01 -07:00
Hajo Möller	e02aaf17f1	Fix lookup_bdev() on Ubuntu Ubuntu added support for checking inode permissions to lookup_bdev() in kernel commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21). Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517 This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and calls the function in the correct way. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Hajo Möller <dasjoe@gmail.com> Closes #5336	2016-10-26 10:30:43 -07:00
Brian Behlendorf	76a87a902e	Disable zio_dva_throttle_enabled by default Until it can be determined definitively that a performance regression wasn't introduced accidentally by `3dfb57a` this functionality is being disabled by default. It can be re- enabled by setting zio_dva_throttle_enabled=1. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5335 Issue #5289	2016-10-26 09:13:43 -07:00
Tony Hutter	6568379eea	Fix statechange-led.sh & unnecessary libdevmapper warning - Fix autoreplace behaviour on statechange-led.sh script. ZED sends the following events on an auto-replace: 1. statechange: Disk goes UNAVAIL->ONLINE 2. statechange: Disk goes ONLINE->UNAVAIL 3. vdev_attach: Disk goes ONLINE Events 1-2 happen when ZED first attempts to do an auto-online. When that fails, ZED then tries an auto-replace, generating the vdev_attach event in #3. In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition, assuming it was just the "old" VDEV going offline. This is problematic, as a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want to ignore that. This new patch correctly turns on the fault LED every time a drive becomes UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED when an auto-replaced disk comes online. - Remove unnecessary libdevmapper warning with --with-config=kernel This fixes an unnecessary libdevmapper warning when building --with-config=kernel. Kernel code does not use libdevmapper, so the warning is not needed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #2375 Closes #5312 Closes #5331	2016-10-25 11:05:30 -07:00
Jason Zaman	402c7c27b0	icp: mark asm files with noexec stack Similar to commit `a3600a106`. Asm files need an explicit note that they do not require an executable stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason Zaman <jason@perfinion.com> Closes #5332	2016-10-25 10:44:09 -07:00
tuxoko	9fa4db44b7	Fix cred leak in zpl_fallocate_common This is caught by kmemleak when running compress_004_pos Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5244 Closes #5330	2016-10-24 16:41:56 -07:00
Brian Behlendorf	13d9a004fe	Fix taskq creation failure in vdev_open_children() When creating and destroying pools in tight loop it's possible to exhaust the number of allowed threads on a system. This results in taskq_create() failling and a NULL dereference. Resolve the issue by falling back to opening the vdevs all synchronously. Reviewed-by: Denys Rtveliashvili <denys@rtveliashvili.name> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#521 Closes #4637	2016-10-24 13:28:58 -07:00
Tony Hutter	1bbd877049	Turn on/off enclosure slot fault LED even when disk isn't present Previously when a drive faulted, the statechange-led.sh script would lookup the drive's LED sysfs entry in /sys/block/sd/device/enclosure_device, and turn it on. During testing we noticed that if you pulled out a drive, or if the drive was so badly broken that it no longer appeared to Linux, that the /sys/block/sd path would be removed, and the script could not lookup the LED entry. To fix this, this patch looks up the disks's more persistent "/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then passes that path to the statechange-led script to use, rather than having the script look it up on the fly. This allows the script to turn on/off the slot LEDs even when the drive is missing. Closes #5309 Closes #2375	2016-10-24 10:45:59 -07:00
Romain Dolbeau	24cdeaf12e	Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bits This is not useful on micro-architecture with a weak NEON implementation (only 64 bits); the native version is slower & the byteswap barely faster than scalar. On A53 or A57, it's a small improvement on scalar but OK for byteswap. Results from an A53 system: 0 0 0x01 -1 0 1499068294333000 1499101101878000 implementation native byteswap scalar 1008227510 755880264 aarch64_neon 1198098720 1044818671 fastest aarch64_neon aarch64_neon Results from a A57 system: 0 0 0x01 -1 0 4407214734807033 4407233933777404 implementation native byteswap scalar 2302071241 1124873346 aarch64_neon 2542214946 2245570352 fastest aarch64_neon aarch64_neon Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #5248	2016-10-21 10:55:49 -07:00
Brian Behlendorf	e4ffa98dca	Fix userquota_compare() function The AVL tree compare function requires that either -1, 0, or 1 be returned. However the strcmp() function only guarantees that a negative, zero, or positive value is returned. Therefore, the return value of strcmp() needs to be sanitized with AVL_ISIGN. This was initially overlooked because the x86_64 implementation of strcmp() happens to only returns the allowed values. This was observed on an aarch64 platform which behaves correctly but differently as described above. Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5311 Closes #5313	2016-10-21 08:23:27 -07:00
luozhengzheng	9523b15ac1	Fix coverity defects: CID 153459 CID 153459: Null pointer dereferences (FORWARD_NULL) Accidentally introduced by #5159. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5310	2016-10-20 11:54:02 -07:00
cao	9d01680430	Fix coverity defects: CID 147551, 147552 CID 147551: Type:dereference null return value CID 147552: Type:dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5279	2016-10-20 11:49:50 -07:00
cao	5a6765cf8c	Fix coverity defects: CID 147472 CID 147472: Type: 'Constant' variable guards dead code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5288	2016-10-20 11:24:01 -07:00
luozhengzheng	1f72394443	Fix coverity defects: CID 150919, 150923 CID 150919: Buffer not null terminated (BUFFER_SIZE_WARNING) CID 150923: Buffer not null terminated (BUFFER_SIZE_WARNING) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5298	2016-10-20 11:09:39 -07:00
Brian Behlendorf	3b0ba3ba99	Linux 4.9 compat: inode_change_ok() renamed setattr_prepare() In torvalds/linux@31051c8 the inode_change_ok() function was renamed setattr_prepare() and updated to take a dentry ratheri than an inode. Update the code to call the setattr_prepare() and add a wrapper function which call inode_change_ok() for older kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Requires-spl: refs/pull/581/head	2016-10-20 09:39:09 -07:00
Chunwei Chen	0fedeedd30	Linux 4.9 compat: remove iops->{set,get,remove}xattr In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and generic_{set,get,remove}xattr are removed. xattr operations will directly go through sb->s_xattr. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-10-20 09:39:09 -07:00
Chunwei Chen	b8d9e26440	Linux 4.9 compat: iops->rename() wants flags In Linux 4.9, torvalds/linux@2773bf0, iops->rename() and iops->rename2() are merged together into iops->rename(), it now wants flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-10-20 09:39:09 -07:00
Chunwei Chen	8ba3f2bf6a	Remove dir inode operations from zpl_inode_operations These operations are dir specific, there's no point putting them in zpl_inode_operations which is for regular files. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-10-20 09:39:09 -07:00
Chunwei Chen	ae7eda1dde	Linux 4.9 compat: group_info changes In Linux 4.9, torvalds/linux@81243ea, group_info changed from 2d array via ->blocks to 1d array via ->gid. We change the spl cred functions accordingly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #581	2016-10-20 09:33:28 -07:00
Chunwei Chen	87063d7dc3	Fix splat-cred.c cred usage No need to crhold current_cred(), fix possible leak in splat_cred_test2 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #556	2016-10-20 09:33:22 -07:00
Chunwei Chen	9ba3c01923	Fix crgetgroups out-of-bound and misc cred fix init_groups has 0 nblocks, therefore calling the current crgetgroups with init_groups would result in out-of-bound access. We fix this by returning NULL when nblocks is 0. Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return blocks[0]. Also, remove all get_group_info. The cred already holds reference on the group_info, and cred is not mutable. So there's no reason to hold extra reference, if we hold cred. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #556	2016-10-20 09:33:01 -07:00
Tony Hutter	6078881aa1	Multipath autoreplace, control enclosure LEDs, event rate limiting 1. Enable multipath autoreplace support for FMA. This extends FMA autoreplace to work with multipath disks. This requires libdevmapper to be installed at build time. 2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must be supported by the Linux SES driver for this to work. The enclosure LED scripts work for multipath devices as well. The scripts will clear the LED when the fault is cleared. 3. Rate limit ZIO delay and checksum events so as not to flood ZED ZIO delay and checksum events are rate limited to 5/sec in the zfs module. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #2449 Closes #3017 Closes #5159	2016-10-19 12:55:59 -07:00
luozhengzheng	7c502b0b1d	Fix coverity defects: CID 150926 CID 150926: Unchecked return value (CHECKED_RETURN) - This case cannot occur given the existing taskq implementation and flags passed to task_dispatch(). Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5272	2016-10-18 11:32:59 -07:00
Brian Behlendorf	6d00b5e136	Fix unused variable Accidentally introduced by `3dfb57a`, when building with debugging disabled several variables are unused. Resolve this by wrapping them in ASSERTV to remove them for non-debug builds. Reviewed by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5284	2016-10-18 10:44:44 -07:00
cao	1b81ab46d0	Fix coverity defects: CID 49339, 153393 CID 49339: Type:Buffer not null terminated CID 153393: Type:Buffer not null terminated Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: <cao.xuewen cao.xuewen@zte.com.cn> Closes #5296	2016-10-18 10:31:57 -07:00
luozhengzheng	b60eac3d1a	Fix coverity defects: CID 150924 CID 150924: Unchecked return value (CHECKED_RETURN) - On taskq_dispatch failure the reference must be dropped and this entry can be safely skipped. This case should be impossible in the existing implementation but should be handled regardless. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5278	2016-10-17 12:03:52 -07:00
cao	b6ca6193f7	Fix coverity defects: CID 147488, 147490 CID 147488, Type:explicit null dereferenced CID 147490, Type:dereference null return value Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5237	2016-10-14 11:00:47 -07:00
Don Brady	3dfb57a35e	OpenZFS 7090 - zfs should throttle allocations OpenZFS 7090 - zfs should throttle allocations Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Ported-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. OpenZFS-issue: https://www.illumos.org/issues/7090 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7 Closes #5258 Porting Notes: - Maintained minimal stack in zio_done - Preserve linux-specific io sizes in zio_write_compress - Added module params and documentation - Updated to use optimize AVL cmp macros	2016-10-13 17:59:18 -07:00
cao	3f93077b02	Fix coverity defects: CID 150943, 150938 CID:150943, Type:Unintentional integer overflow CID:150938, Type:Explicit null dereferenced Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5255	2016-10-13 14:30:50 -07:00
luozhengzheng	05852b3467	Fix coverity defects: CID 147571, 147574 CID 147571: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) CID 147574: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5268	2016-10-13 14:25:05 -07:00
luozhengzheng	1f51b525ff	Fix coverity defects: CID 153394 coverity scan CID 153394, Type:String overflow Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5263	2016-10-12 13:24:03 -07:00
Tom Caputi	ef78750d98	Fix ICP memleak introduced in #4760 The ICP requires destructors to for each crypto module that is added. These do not necessarily exist in Illumos because they assume that these modules can never be unloaded from the kernel. Some of this cleanup code was missed when #4760 was merged, resulting in leaks. This patch simply fixes that. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #4760 Closes #5265	2016-10-12 12:52:30 -07:00
Brian Behlendorf	1697d2dcf1	Fix zfsctl_snapshot_{,un}mount() issues Fix use after free in zfsctl_snapshot_unmount(). Use /usr/bin/env instead of /bin/sh to fix a shell code injection flaw and allow use with grsecurity. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Stian Ellingsen <stian@plaimi.net> Closes #5250 Closes #4377	2016-10-11 09:56:28 -07:00
Tim Chase	d33931a83a	Write issue taskq shouldn't be dynamic This is as much an upstream compatibility as it's a bit of a performance gain. The illumos taskq implemention doesn't allow a TASKQ_THREADS_CPU_PCT type to be dynamic and in fact enforces as much with an ASSERT. As to performance, if this taskq is dynamic, it can cause excessive contention on tq_lock as the threads are created and destroyed because it can see bursts of many thousands of tasks in a short time, particularly in heavy high-concurrency zvol write workloads. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5236	2016-10-10 15:19:14 -07:00
Tom Caputi	57f16600b9	Porting over some ICP code that was missed in #4760 When #4760 was merged tests were added to ensure that the new checksums were working properly. However, some of the functionality for sha2 functions were not ported over, resulting in some Coverity defects and code that would be unstable when needed in the future. This patch simply ports over the missing code and fixes the defects in the process. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #4760 Closes #5251	2016-10-10 11:34:57 -07:00
Brian Behlendorf	7515f8f63d	Fix file permissions The following new test cases need to have execute permissions set: userquota/groupspace_003_pos.ksh userquota/userquota_013_pos.ksh userquota/userspace_003_pos.ksh upgrade/upgrade_userobj_001_pos.ksh upgrade/setup.ksh upgrade/cleanup.ksh The following source files accidentally were marked executable: lib/libzpool/kernel.c lib/libshare/nfs.c lib/libzfs/libzfs_dataset.c lib/libzfs/libzfs_util.c tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c cmd/zed/zed_exec.c module/icp/core/kcf_sched.c module/zfs/dsl_pool.c module/zfs/arc.c module/nvpair/nvpair.c man/man5/zfs-module-parameters.5 Reviewed-by: GeLiXin <ge.lixin@zte.com.cn> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5241	2016-10-08 14:57:56 -07:00
Stian Ellingsen	5dc1ff29ec	Use env, not sh in zfsctl_snapshot_{,un}mount() Call mount and umount via /usr/bin/env instead of /bin/sh in zfsctl_snapshot_mount() and zfsctl_snapshot_unmount(). This change fixes a shell code injection flaw. The call to /bin/sh passed the mountpoint unescaped, only surrounded by single quotes. A mountpoint containing one or more single quotes would cause the command to fail or potentially execute arbitrary shell code. This change also provides compatibility with grsecurity patches. Grsecurity only allows call_usermodehelper() to use helper binaries in certain paths. /usr/bin/* is allowed, /bin/* is not.	2016-10-08 17:43:29 +02:00
Stian Ellingsen	00b65db711	Fix use after free in zfsctl_snapshot_unmount()	2016-10-08 17:42:52 +02:00
Brian Behlendorf	690fe6479e	Rename hole_birth tunable to match OpenZFS OpenZFS decided that ignore_hole_birth was too imprecise and incorrect a name (and went with send_holes_without_birth_time). Rename it in ZoL too, while keeping the name "ignore_hole_birth" pointing to the same variable for existing consumers. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #5239	2016-10-07 21:02:24 -07:00
tuxoko	0d26756665	Fix out-of-bound in per_cpu in spl_random_init When iterating per_cpu values, we need to use for_each_possible_cpu. While NR_CPUS indicates the number of CPU supported by the kernel, it might not initialize all of them if the kernel decides it's not possible to use them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #578	2016-10-07 20:59:46 -07:00
Håkan Johansson	4770aa0643	Fix vdev_open_child() race on updating vdev_parent->vdev_nonrot Updating vd->vdev_parent->vdev_nonrot in vdev_open_child() is a race when vdev_open_child is called for many children from a task queue. vdev_open_child() is only called by vdev_open_children(), let the latter update the parent vdev_nonrot member. The update was already there, so done twice previously. Thus using the same logic at the end in vdev_open_children() to update vdev_nonrot, either we are vdev_uses_zvols() or not. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5162	2016-10-07 13:25:35 -07:00
cao	ccc92611b1	Fix coverity defects: CID 147565-147567 coverity scan CID:147567, Type:dereference null return value coverity scan CID:147566, Type:dereference null return value coverity scan CID:147565, Type:dereference null return value Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5166	2016-10-07 13:19:43 -07:00
Brian Behlendorf	482cd9ee69	Fletcher4: Incremental updates and ctx calculation Fixes ABI issues with fletcher4 code, adds support for incremental updates, and adds ztest method for testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5164	2016-10-07 12:44:12 -07:00
Jinshan Xiong	9b7a83cbb6	OpenZFS 6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates(). The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT. We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds. Upstream bugs: DLPX-44799 Ported by: Ned Bass <bass6@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6988 ZFSonLinux-issue: https://github.com/zfsonlinux/zfs/issues/4642 OpenZFS-commit: unmerged Porting notes: - Added curly braces around declaration of userquota_cache_t cache to quiet compiler warning; - Handled the userobj accounting the same way it proposed in this path. Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>	2016-10-07 09:45:13 -07:00
Jinshan Xiong	1de321e626	Add support for user/group dnode accounting & quota This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting). A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnode as dirty. ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500 Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>	2016-10-07 09:45:13 -07:00
lorddoskias	64c688d716	Refactor updating of immutable/appendonly flags Move the synchronization of inode/znode i_flgas/pflags into the respective internal zfs function. This is mostly mechanical work and shouldn't introduce any functional changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Issue #227 Closes #5223	2016-10-05 14:47:29 -07:00
Gvozden Neskovic	5bf703b8f3	Fletcher4: save/reload implementation context Init, compute, and fini methods are changed to work on internal context object. This is necessary because ABI does not guarantee that SIMD registers will be preserved on function calls. This is technically the case in Linux kernel in between `kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that don't require disabling preemption for using SIMD (osx). Use scalar compute methods in-place for small buffers, and when the buffer size does not meet SIMD size alignment. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-10-05 16:41:46 +02:00
Gvozden Neskovic	37f520db2d	Fletcher4: Incremental using SIMD Combine incrementally computed fletcher4 checksums. Checksums are combined a posteriori, allowing for parallel computation on chunks to be implemented if required. The algorithm is general, and does not add changes in each SIMD implementation. New test in ztest verifies incremental fletcher computations. Checksum combining matrix for two buffers `a` and `b`, where `Ca` and `Cb` are respective fletcher4 checksums, `Cab` is combined checksum, `s` is size of buffer `b` (divided by sizeof(uint32_t)) is: Cab[A] = Cb[A] + Ca[A] Cab[B] = Cb[B] + Ca[B] + s * Ca[A] Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A] Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A] NOTE: this calculation overflows for larger buffers. Thus, internally, the calculation is performed on 8MiB chunks. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-10-05 16:41:46 +02:00
luozhengzheng	e2c292bbfc	Fix coverity defects: CID 150953, 147603, 147610 coverity scan CID:150953,type: uninitialized scalar variable coverity scan CID:147603,type: Resource leak coverity scan CID:147610,type: Resource leak Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5209	2016-10-04 18:15:57 -07:00
Brian Behlendorf	341dfdb3fd	Fix p0 initializer Due to changes in the task_struct the following warning is occurs when initializing the global p0. Since this structure only exists for it's address to be taken initialize it in a manor which isn't sensitive to internal changes to the structure. module/spl/spl-generic.c:58:1: error: missing braces around initializer [-Werror=missing-braces] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #576	2016-10-04 17:26:36 -07:00
ilovezfs	125a406e24	OpenZFS 6585 - sha512, skein, and edonr have an unenforced dependency on extensible dataset Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Tony Hutter <hutter2@llnl.gov> In any pool without the extensible dataset feature flag already enabled, creating a dataset with dedup set to use one of the new checksums would result in the following panic as soon as any data was added: panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature, &refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390 Inpsection showed that feature->fi_feature was 7, which is the value of SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum. This commit adds extensible dataset as a dependency for the sha512, edonr, and skein feature flags, which prevents the panic. OpenZFS-issue: https://www.illumos.org/issues/6585 OpenZFS-commit: `892586e8a1` Porting Notes: This code was originally from Illumos, but I actually ported it from: openzfsonosx/zfs@b62a652	2016-10-03 14:51:21 -07:00
ilovezfs	4a2e9a17d5	OpenZFS 6541 - Pool feature-flag check defeated if "verify" is included in the dedup property value Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Tony Hutter <hutter2@llnl.gov> zio_checksum_to_feature() expects a zio_checksum enum not a raw property intval, so the new checksums weren't being detected when the ZIO_CHECKSUM_VERIFY flag got in the way. Given a pool without feature@sha512, zfs create -o dedup=sha512 naughty/fivetwelve_noverify_ds would fail as expected since the raw intval would indeed be equal to SPA_FEATURE_SHA512. However, zfs create -o dedup=sha512,verify naughty/fivetwelve_verify_ds would incorrectly succeed because ZIO_CHECKSUM_VERIFY would be in the way, the raw intval would not be a member of the enum, and zio_checksum_to_feature() would return SPA_FEATURE_NONE, with the result that spa_feature_is_enabled() would never be called. This was first detected with edonr, since in that case verify is required. This commit clears the ZIO_CHECKSUM_VERIFY flag before calling zio_checksum_to_feature() using the ZIO_CHECKSUM_MASK and verifies in zio_checksum_to_feature() that ZIO_CHECKSUM_MASK has been applied by the caller to attempt to prevent the same bug from occurring again in the future. OpenZFS-issue: https://www.illumos.org/issues/6541 OpenZFS-commit: `971640e6aa` Porting notes: This code was originally from Illumos, but I actually ported it from: openzfsonosx/zfs@bef06e1	2016-10-03 14:51:21 -07:00
Tony Hutter	3c67d83a8a	OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> Ported by: Tony Hutter <hutter2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Porting Notes: This code is ported on top of the Illumos Crypto Framework code: `b5e030c8db` The list of porting changes includes: - Copied module/icp/include/sha2/sha2.h directly from illumos - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init) - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter. - Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c - Added skein & edonr to libicp/Makefile.am - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos. - Updated ztest.c with new fletcher_4_() args; used NULL for new CTX argument. - In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section to not #include the non-existant endian.h. - In skein_test.c, renane NULL to 0 in "no test vector" array entries to get around a compiler warning. - Fixup test files: - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>, - Remove <note.h> and define NOTE() as NOP. - Define u_longlong_t - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p" - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Remove "for isa in $($ISAINFO); do" stuff - Add/update Makefiles - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h. - EXPORT_SYMBOL _Init/_Update/_Final... routines in ICP modules. - Update scripts/zfs2zol-patch.sed - include <sys/sha2.h> in sha2_impl.h - Add sha2.h to include/sys/Makefile.am - Add skein and edonr dirs to icp Makefile - Add new checksums to zpool_get.cfg - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable() - Fix -Wuninitialized error in edonr_byteorder.h on PPC - Fix stack frame size errors on ARM32 - Don't unroll loops in Skein on 32-bit to save stack space - Add memory barriers in sha2.c on 32-bit to save stack space - Add filetest_001_pos.ksh checksum sanity test - Add option to write psudorandom data in file_write utility	2016-10-03 14:51:15 -07:00
Romain Dolbeau	62a65a654e	Add parity generation/rebuild using 128-bits NEON for Aarch64 This re-use the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitly disable vector registers, they have to be locally re-enabled explicitly. As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique. Even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fails in debug mode. This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro. It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge. The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unroll some functions some more. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #4801	2016-10-03 09:44:00 -07:00
luozhengzheng	aecdc70604	Fix coverity defects: CID 147448, 147449, 147450, 147453, 147454 coverity scan CID:147448,type: unchecked return value coverity scan CID:147449,type: unchecked return value coverity scan CID:147450,type: unchecked return value coverity scan CID:147453,type: unchecked return value coverity scan CID:147454,type: unchecked return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5206	2016-10-02 11:24:54 -07:00
Brian Behlendorf	6c2a66bfa8	Fix aarch64 type warning Explicitly cast type in splat-rwlock.c test case to silence the following warning. warning: format ‘%ld’ expects argument of type ‘long int’, but argument N has type ‘int’ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #574	2016-10-01 18:33:01 -07:00
candychencan	0ca5261be4	Fix NULL deref in kcf_remove_mech_provider In the default case the function must return to avoid dereferencing 'prov_mech' which will be NULL. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: candychencan <chen.can2@zte.com.cn> Closes #5134	2016-09-30 16:04:43 -07:00
cao	0a8f18f932	Fix coverity defects: CID 147563, 147560 coverity scan CID:147563, Type:dereference null return value coverity scan CID:147560, Type:dereference null return value Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5168	2016-09-30 15:56:17 -07:00
GeLiXin	470f12d631	Fix coverity defects: CID 147531 147532 147533 147535 coverity scan CID:147531,type: Argument cannot be negative - may copy data with negative size coverity scan CID:147532,type: resource leaks - may close a fd which is negative coverity scan CID:147533,type: resource leaks - may call pwrite64 with a negative size coverity scan CID:147535,type: resource leaks - may call fdopen with a negative fd Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Closes #5176	2016-09-30 15:47:57 -07:00
Brian Behlendorf	2db28197fe	Fix cppcheck warning in buf_init() Cppcheck 1.63 erroneously complains about an uninitialized value in buf_init(). Newer versions of cppcheck (1.72) handle this correctly but we'll initialize the value anyway to silence the warning. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5203	2016-09-30 15:04:21 -07:00
Gvozden Neskovic	6ca636a152	Avoid undefined shift overflow in fzap_cursor_retrieve() Avoid calculating (1<<64) if lh_prefix_len == 0. Semantics of the method remain the same. Assert (lh_prefix_len > 0) in zap_expand_leaf() to detect possibly the same problem. Issue #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Gvozden Neskovic	4ca9c1de12	Explicit integer promotion for bit shift operations Explicitly promote variables to correct type. Undefined behavior is reported because length of int is not well defined by C standard. Issue #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Gvozden Neskovic	031d7c2fe6	fix: Shift exponent too large Undefined operation is reported by running ztest (or zloop) compiled with GCC UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection with large enough offset values. Logically, left shift operation would work, but bit shift semantics in C, and limitation of uint64_t, do not produce desired result. Issue #5059, #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Isaac Huang	e8ac4557af	Explicit block device plugging when submitting multiple BIOs Without plugging, the default 'noop' scheduler will not merge the BIOs which are part of a large ZIO. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5181	2016-09-29 13:13:31 -07:00
cao	c9d61adbf8	Fix coverity defects: 147658, 147652, 147651 coverity scan CID:147658, Type:copy into fixed size buffer. coverity scan CID:147652, Type:copy into fixed size buffer. coverity scan CID:147651, Type:copy into fixed size buffer. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5160	2016-09-29 12:06:14 -07:00
lorddoskias	12fa7f3436	Refactor inode->i_mode management Refactor the code in such a way so that inode->i_mode is being set at the same time zp->z_mode is being changed. This has the effect of keeping both in sync without relying on zfs_inode_update. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5158	2016-09-27 14:08:52 -07:00
cao	680eada9b0	Fix coverity defects: CID 147650, 147649, 147647, 147646 coverity scan CID:147650, Type:copy into fixed size buffer. coverity scan CID:147649, Type:copy into fixed size buffer. coverity scan CID:147647, Type:copy into fixed size buffer. coverity scan CID:147646, Type:copy into fixed size buffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5161	2016-09-25 15:08:28 -07:00
Brian Behlendorf	7571033285	Fix multilist_create() memory leak In arc_state_fini() the `arc_l2c_only->arcs_list[*]` multilists must be destroyed. This accidentally regressed in `d3c2ae1c`. Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5151 Closes #5152	2016-09-23 10:55:10 -07:00
tuxoko	d5b897a6a1	Linux 4.7 compat: Fix deadlock during lookup on case-insensitive We must not use d_add_ci if the dentry already has the real name. Otherwise, d_add_ci()->d_alloc_parallel() will find itself on the lookup hash and wait on itself causing deadlock. Tested-by: satmandu Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5124 Closes #5141 Closes #5147 Closes #5148	2016-09-22 19:09:16 -07:00
kernelOfTruth aka. kOT, Gentoo user	51907a31bc	OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records Authored by: Matt Krantz <matt.krantz@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/7230 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2 Closes #5112	2016-09-22 16:01:19 -07:00
luozhengzheng	160987b576	Fix coverity defects coverity scan CID:147633,type: sizeof not portable coverity scan CID:147637,type: sizeof not portable coverity scan CID:147638,type: sizeof not portable coverity scan CID:147640,type: sizeof not portable In these particular cases sizeof (XX *) happens to be equal to sizeof (X ), but this is not a portable assumption. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5144	2016-09-21 18:09:00 -07:00
Isaac Huang	da8d57488b	Reduce noise in tracing logs dbuf_read_impl() returns (SET_ERROR(err)) when err can be 0, which adds lots of noise in tracing logs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #4430 Closes #5146	2016-09-21 13:37:20 -07:00
BearBabyLiu	609603a5d3	Fix coverity defects coverity scan CID:147504 Type: Explicit null dereferenced Reason: passing null pointer dl to zfs_dirent_unlock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5131	2016-09-20 19:09:22 -07:00
Tim Chase	25e2ab16be	Fix arc_adjust_meta_balanced() The type of "adjustmnt" was erroneously changed to unsigned when the compressed ARC code was ported in `d3c2ae1c08`. As a result of it being unsigned, the balanced metadata eviction logic would evict all of the non-metadata. Reviewed-by: Chris Severance <github.severach@spamgourmet.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: David Quigley <david.quigley@intel.com> Signed-off-by: Tim Chase <tim@onlight.com> Closes #5128 Closes #5129	2016-09-19 09:28:35 -07:00
luozhengzheng	30f3f2e13c	Fix Coverity defects CID 147659, 150952 and 147645 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5103	2016-09-17 15:08:54 -07:00
Brian Behlendorf	cb81c0c588	Increase spl_kmem_alloc_warn limit In order to support ABD with large blocks the spl_kmem_alloc_warn limit needs to be increased to 64K. A 16M block requires that pointers be stored for 4096 4K-pages on an x86_64 system. Each of these pointers is 8 bytes requiring an allocation of 8*4096=32,768 bytes. The addition of a small header to this structure pushes the allocation over the default 32K warning threshold. In addition, fix a small bug where MAX was used instead of MIN when setting the default. This ensures a reasonable limit is still set on systems with page sizes larger then 4K. Reviewed-by: David Quigley <david.quigley@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #571	2016-09-16 17:10:36 -07:00
Brian Behlendorf	9ea9e0b9a1	Enable ignore_hole_birth module option by default Enable ignore_hole_birth by default until all known hole birth bugs have been resolved and relevant test cases added. Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4809 Closes #5099	2016-09-16 14:05:30 -07:00
Nikolay Borisov	87f9371aef	Simplify time handling logic in zfs_settattr Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #4916	2016-09-13 12:00:18 -07:00
Nikolay Borisov	9f5f0019ab	Refactor generic inode time updating ZFS doesn't provide a custom update_time method meaning it delegates this job to the generic VFS layer. The only time when it needs to set the various *time values is when the inode is being marshalled to/from the disk. Do this by moving the relevant code from zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result from this change it is no longer necessary to have multiple versions of the zfs_inode_update function - so just nuke them and leave only one. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Issue #227 Closes #4916	2016-09-13 11:57:37 -07:00
Dan Kimmel	524b4217b8	DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone() Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:59:13 -07:00
Tom Caputi	c17bcf83da	Enable raw writes to perform dedup with verification Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: David Quigley <david.quigley@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #5078	2016-09-13 09:59:04 -07:00
Dan Kimmel	2aa34383b9	DLPX-40252 integrate EP-476 compressed zfs send/receive Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:58:58 -07:00
George Wilson	d3c2ae1c08	OpenZFS 6950 - ARC should cache compressed data Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently access data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is inuse it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled that the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078	2016-09-13 09:58:33 -07:00
Tim Chase	43924bfeaa	Remove redundant assignments to arc_c Several assignments to arc_c had no effect because it is ultimately initialized to arc_c_max. This aligns ZoL better with the upstream code which removed these assignments some time ago. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@onlight.com> Closes #5081	2016-09-12 12:40:30 -07:00
Nikolay Borisov	67d6082494	Refactor spa_load_l2cache to make build happy In case sav->sav_config was NULL the body of the function would skip the iteration of the l2 cache devices and will just cleanup the old devices. However, this wasn't very obvious since the null check was performed after the loop body and after the old devices were cleaned. Refactor the code so that it's now obvious when the iteration of the l2cache devices is skipped. This fixes the following cppcheck warning: [module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5087	2016-09-12 12:40:03 -07:00
Tim Chase	20aa7a4e31	Free property names with spa_strfree() rather than strfree() Since they're allocated with spa_strdup(), they should be freed with spa_strfree() so the proper length buffer is freed. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5082 Closes #5086	2016-09-12 09:45:26 -07:00
Don Brady	d02ca37979	Bring over illumos ZFS FMA logic -- phase 1 This first phase brings over the ZFS SLM module, zfs_mod.c, to handle auto operations in response to disk events. Disk event monitoring is provided from libudev and generates the expected payload schema for zfs_mod. This work leverages the recently added devid and phys_path strings in the vdev label. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4673	2016-09-01 11:39:45 -07:00
luozhengzheng	0b284702b7	Delete unreferenced function zfs_ereport_send_interim_checksum Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5055	2016-09-01 11:39:45 -07:00
luozhengzheng	ca8587a517	kmem_zalloc with KM_SLEEP will never return NULL These allocations can never fail. Leaving the error handling code here gives the impression they can so it has been removed. Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5048	2016-09-01 11:39:45 -07:00
Gvozden Neskovic	ee36c709c3	Performance optimization of AVL tree comparator functions perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033	2016-08-31 14:35:34 -07:00
Hajo Möller	82ab6848cc	Fix "zpool get guid,freeing,leaked" source `zpool get guid,freeing,leaked` shows SOURCE as `default`, it should be `-` as those props are not editable. Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes Signed-off-by: Hajo Möller <dasjoe@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4170	2016-08-30 15:57:15 -07:00
cao	8f50bafb04	Delete unused zfsctl_snapdir_inactive declaration zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7 this is declaration remains even though the implementation was removed in commit `278bee93`. Removed fastreboot_disable_highpil which is also unused. Signed-off-by: caoxuewen cao.xuewen@zte.com.cn Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5042	2016-08-30 14:33:40 -07:00
Simon Klinkert	db707ad094	OpenZFS 6940 - Cannot unlink directories when over quota From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Authored by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6940 OpenZFS-issue: https://www.illumos.org/issues/6334 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916 Closes #5044	2016-08-30 14:33:04 -07:00
Alexander Motin	755065f3dc	OpenZFS 6322 - ZFS indirect block predictive prefetch For quite some time I was thinking about possibility to prefetch ZFS indirection tables while doing sequential reads or writes. Recent changes in predictive prefetcher made that much easier to do. My tests on zvol with 16KB block size on 5x striped and 2x mirrored pool of 10 disks show almost double throughput on sequential read, and almost tripple on sequential rewrite. While for read alike effect can be received from increasing maximal prefetch distance (though at higher memory cost), for rewrite there is no other solution so far. Authored by: Alexander Motin <mav@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6322 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413 Closes #5040 Porting notes: - Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due to commit `5f6d0b6` 'Handle block pointers with a corrupt logical size' - Difference from upstream in module/zfs/dmu_zfetch.c, uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance - Variables have been initialized at the beginning of the function (void dmu_zfetch) to resemble the order of occurrence and account for C99, C11 mode errors.	2016-08-30 14:26:55 -07:00
Matthew Ahrens	98ace739bd	OpenZFS 7086 - ztest attempts dva_get_dsize_sync on an embedded blockpointer In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7086 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317 Closes #5039	2016-08-30 14:25:50 -07:00
GeLiXin	c40db193a5	Fix: Build warnings with different gcc optimization levels in debug mode This fix resolves warnings reported during compiling with different gcc optimization levels in debug mode, Test tools: gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago) List of warnings: CFLAGS=-O1 ./configure --enable-debug ;make ../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’: ../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function ../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here CFLAGS=-Os ./configure --enable-debug ; make libzfs_dataset.c: In function ‘zfs_prop_set_list’: libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5022	2016-08-29 12:46:18 -07:00
GeLiXin	9907cc1cc8	Add zfs_arc_meta_limit_percent tunable ARC will evict meta buffers that exceed the arc_meta_limit. Before a further investigating on whether we should take special protection on meta buffers, this tunable make arc_meta_limit adjustable for different workloads. People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko, so some range check is added to guarantee a suitable arc_meta_limit. Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style tunable as well. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4957	2016-08-23 13:03:01 -07:00
Tim Chase	3e635ac15c	Prevent reclaim in send_traverse_thread() As is the case with traverse_prefetch_thread(), the deep stacks caused by traversal require disabling reclaim in the send traverse thread. Also, do the same for receive_writer_thread() in which similar problems have been observed. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4912 Closes #4998	2016-08-22 16:12:05 -07:00
Gvozden Neskovic	9cc1844a1d	Linux compat: Grsecurity kernel API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4997 Closes #5001	2016-08-22 10:05:45 -07:00
Matthew Ahrens	2bce8049c3	OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7004 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109 Closes #4641 Closes #4972	2016-08-19 12:48:03 -07:00
Matthew Ahrens	8bea981504	OpenZFS 7003 - zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7003 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108 Closes #4972	2016-08-19 12:35:23 -07:00
heary-cao	ee6370a7a4	Fix spa config generate memory leak in spa_load_best function When spa retry load succeeds and spa recovery is requested it may leak in spa_load_best function. Always free the generated config when it is not assigned to the spa. Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4940	2016-08-19 11:17:12 -07:00
GeLiXin	aeb9baa618	Fix: handle NULL case in spl_kmem_free_track() When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all the buffers alloced by kmem_alloc() and kmem_zalloc(). If a NULL pointer which indicates no track info in SPL is passed to spl_kmem_free_track, we just ignore it. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#4967 Closes #567	2016-08-19 09:14:24 -07:00
Paul Dagnelie	32d41fb73a	OpenZFS 7176 - Yet another hole birth issue This is another bug in the long line of hole-birth related issues. In this particular case, it was discovered that a previous hole-birth fix (illumos bug 6513, commit `bc77ba73`) did not cover as many cases as we thought it did. While the issue worked in the case of hole-punching (writing zeroes to a large part of a file), it did not deal with truncation, and then writing beyond the new end of the file. The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7176 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad Upstream-bugs: DLPX-46009 Porting notes: - Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ; keep previous position of the initialization	2016-08-18 09:26:44 -07:00
Matthew Ahrens	d9eea113f8	It is not necessary to zero struct dbuf_hold_impl_data Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to `kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`, which is around 2KiB. This structure is used as a stack, to limit the size of the C stack as dbuf_hold() calls itself recursively. We make a recursive call to hold the parent's dbuf when the requested dbuf is not found. The vast majority of the time, the parent or grandparent indirect dbuf is cached, so the number of recursive calls is very low. However, we initialize this entire array for every call to dbuf_hold(). To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()` instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all members of the struct before they are used. I observed ~5% performance improvement on a workload which creates many files. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4974	2016-08-16 15:27:17 -07:00
Gvozden Neskovic	fc897b24b2	Rework of fletcher_4 module - Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function. - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced. - Implementation mutex lock is replaced by atomic variable. - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'. - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B). - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer. - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running. Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`: implementation native byteswap scalar 4768120823 3426105750 sse2 7947841777 4318964249 ssse3 7951922722 6112191941 avx2 13269714358 11043200912 fastest avx2 avx2 Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`: implementation native byteswap scalar 1291115967 1031555336 sse2 2539571138 1280970926 ssse3 2537778746 1080016762 avx2 4950749767 1078493449 avx512f 9581379998 4010029046 fastest avx512f avx512f Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4952	2016-08-16 14:11:55 -07:00
Gvozden Neskovic	70b258fc96	Fletcher4 implementation using avx512f instruction set Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop iteration. Size alignment of main fletcher4 methods is adjusted accordingly. New implementation is called 'avx512f'. Note: byteswap method can be implemented more efficiently when avx512bw hardware becomes available. Currently, it is ~ 2x slower than native method. Table shows result of full (native) fletcher4 calculation for different buffer size: fletcher4 4KB 16KB 64KB 128KB 256KB 1MB 16MB -------------------------------------------------------------------- [scalar] 1213 1228 1231 1231 1225 1200 1160 [sse2] 2374 2442 2459 2456 2462 2250 2220 [avx2] 4288 4753 4871 4893 4900 4050 3882 [avx512f] 5975 8445 9196 9221 9262 6307 5620 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4952	2016-08-16 14:11:14 -07:00
Rich Ercolani	6d836e6f8b	Add tunable to ignore hole_birth Adds a module option which disables the hole_birth optimization which has been responsible for several recent bugs, including issue #4050. Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4833	2016-08-15 09:52:56 -07:00
GeLiXin	e35c5a8265	Fix incorrect pool state after import Import a raidz pool which has a vdev with a bad label, zpool status shows the right state of the dev, but the wrong state of the pool. The pool state should be DEGRADED, not ONLINE. We examine the label in vdev_validate while in spa_load_impl, the bad label can be detected but doesn't propagate its state to the parent. There are other chances to propagate state in the following vdev_load if we failed to load DTL, but our pool is raidz1 which can tolerate a faulted disk. So we lost the last chance to correct the pool state. Propagate the leaf vdev's state to parent if its label was corrupted, as is done elsewhere in vdev_validate. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #4948	2016-08-12 13:46:51 -07:00
Hans Rosenfeld	fb390aafc8	OpenZFS 5997 - FRU field not set during pool creation and never updated Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Josef Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Signed-off-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283 Porting Notes: In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS. Events which would be logged as sysevents on illumos have been been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass. The scripts provided by ZoL have been updated, however users which provide scripts for any of the following events will need to rename them based on the new subclass names. ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach	2016-08-12 13:06:48 -07:00
Jason Zaman	a3600a106d	icp: mark asm files with noexec stack If there is no explicit note in the .S files, the obj file will mark it as requiring an executable stack. This is unneeded and causes issues on hardened systems. More info: https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4947 Closes #4962	2016-08-12 09:51:26 -07:00
Jason Zaman	a9947ce771	icp: add no_const for PaX Compat The constify plugin will automatically constify a class of types that contain only function pointers. The icp structs fail to build if this is enabled with the following error. The no_const attribute makes the plugin skip those structs. module/icp/spi/kcf_spi.c: In function ‘copy_ops_vector_v1’: module/icp/spi/kcf_spi.c:61:16: error: assignment of read-only location ‘dst_ops->cou.cou_v1.co_control_ops’ ((dst)->ops) = *((src)->ops); ^ module/icp/spi/kcf_spi.c:74:2: note: in expansion of macro ‘KCF_SPI_COPY_OPS’ KCF_SPI_COPY_OPS(src_ops, dst_ops, co_control_ops); ^ Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4947 Closes #4962	2016-08-12 09:51:19 -07:00
Matthew Ahrens	169ab07cc8	OpenZFS 7263 - deeply nested nvlist can overflow stack nvlist_pack() and nvlist_unpack are implemented recursively, which can cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist which contains an nvlist, which contains an nvlist, which... Unprivileged users can pass an nvlist to the kernel via certain ioctls on /dev/zfs, which the kernel will unpack without additional permission checking or validation. Therefore, an unprivileged user can cause the kernel's stack to overflow and panic. Ideally, these functions would be implemented non-recursively. As a quick fix, this patch limits the depth of the recursion and returns an error when attempting to pack and unpack a deeply-nested nvlist. Signed-off-by: Adam Leventhal <ahl@delphix.com> Signed-off-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Prakash Surya <prakash.surya@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7263 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d -	2016-08-11 15:58:03 -07:00
Chen Haiquan	d9c97ec08b	Use file_dentry and file_inode wrappers Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay"). This problem crashes system when use zfs as a layer of overlayfs. Signed-off-by: Chen Haiquan <oc@yunify.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4914 Closes #4935	2016-08-11 12:06:37 -07:00
GeLiXin	d5884c3453	Fix indefinite article The indefinite article before nvlist should be "an", not "a". We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should stay the same as we are such a strict filesystem. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4941	2016-08-11 11:23:49 -07:00
Brian Behlendorf	e5fe9ddeec	Remove custom root pool import code Non-Linux OpenZFS implementations require additional support to be used a root pool. This code should simply be removed to avoid confusion and improve readability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-08-11 11:19:34 -07:00
Brian Behlendorf	cf41432c70	Linux 4.8 compat: Fix removal of bio->bi_rw member All users of bio->bi_rw have been replaced with compatibility wrappers. This allows the kernel specific logic to be abstracted away, and for each of the supported cases to be documented with the wrapper. The updated interfaces are as follows: * void blk_queue_set_write_cache(struct request_queue , bool, bool) boolean_t bio_is_flush(struct bio ) boolean_t bio_is_fua(struct bio ) boolean_t bio_is_discard(struct bio ) boolean_t bio_is_secure_erase(struct bio ) VDEV_WRITE_FLUSH_FUA Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-08-11 11:19:34 -07:00
Gvozden Neskovic	689f093ebc	Build user-space with different gcc optimization levels This fix resolves warnings reported during compiling of user-space libraries with different gcc optimization levels. Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora). The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast. List of warnings: [GCC 4.9.2 -Os] libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 4.9.2 -Og] fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 4.9.2 -Ofast] u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized] u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized] vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized] kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:1584:13: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 6.1.1] Resolved in PR #4907 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4937	2016-08-09 14:40:35 -07:00
Chunwei Chen	afb6c031e8	Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointer Starting from Linux 4.7, get_acl will set acl cache pointer to temporary sentinel value before calling i_op->get_acl. Therefore we can't compare against ACL_NOT_CACHED and return. Since from Linux 3.14, get_acl already check the cache for us, so we disable this in zpl_get_acl. Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4944 Closes #4946	2016-08-09 10:03:04 -07:00
Brian Behlendorf	4b908d3220	Linux 4.8 compat: posix_acl_valid() The posix_acl_valid() function has been updated to require a user namespace. Filesystem callers should normally provide the user_ns from the super block associcated with the ACL; the zpl_posix_acl_valid() wrapper has been added for this purpose. See https://github.com/torvalds/linux/commit/0d4d717f for complete details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-08-08 11:46:40 -07:00
Brian Behlendorf	e85a6396b0	Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING configure checks, all supported kernel provide this functionality. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-08-08 11:46:32 -07:00
Nikolay Borisov	64aefee1b8	Fix interaction between userns uid/gid and SA * When the uid/gid change is handled in zfs_setattr we want to actually adjust the user passed uid to a KUID and write that to disk. * In trace points use the i_uid member without doing translation, since it has already been performed. * Use kuid in zfs_aclset_common Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4928	2016-08-08 10:47:43 -07:00
Gaurav Kumar	cf2731e65b	arc_meta_limit should be updated when arc_max is changed. When arc_max is increased, arc_meta_limit will not be updated to 3/4 of the new arc_c_max value. This was done originally to preserve any existing maximum value. This turned out to be counter intuitive to users and this fix changes that behavior. If zfs_arc_meta_limit is non-default, it will be picked up later in the ARC tuning function. Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4893	2016-08-02 13:43:36 -07:00
Brian Behlendorf	efe7978d89	Fix gcc self-comparison warning As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is detected by gcc in metaslab_alloc(). Resolve the warning by passing a physical size of 0 to BP_SET_BIRTH() as it done by other callers. module/zfs/metaslab.c: In function ‘metaslab_alloc’: module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates to true [-Werror=tautological-compare] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Issue #4907	2016-08-02 13:14:18 -07:00
Tony Hutter	4eb0db42d3	Fix possible VDEV stats array overflow Fix a possible VDEV statistics array overflow when ZIOs with ZIO_PRIORITY_NOW complete. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4883 Closes #4917	2016-08-02 08:45:24 -07:00
Nikolay Borisov	ba2fe6affb	Move assignment of i_blkbits field Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time zfs_inode_update_impl is called. Since this value never changes move its assignment to at inode creation time. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4906	2016-07-29 15:34:12 -07:00
Nikolay Borisov	e334e828a6	Unify license of icp module with the rest of zfs The newly added icp module uses a hardcoded value of CDDL for the license, however in local development one might want to change that to something else in order to facilitate compiling against lock debugging enabled kernel. All modules of the zfs use the ZFS_META_LICNSE string which is replaced with the value held in the META file. One can modify the value in the META file once and then rerun the configure to have all modules' licenses changed. Change the icp module license string to be ZFS_META_LICENSE so that it falls under the same paradigm. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4905	2016-07-29 15:34:12 -07:00
heary-cao	9f3d1407dc	Fix zfs_allow_log_destroy() NULL dereference In zfs_ioc_log_history() function the tsd_set() function is called with NULL which causes the zfs_allow_log_destroy() to be run. In this case the passed value will be NULL. This is normally entirely safe because strfree() maps directly to kfree() which may be passed a NULL. However, since alternate implementations of strfree() may not handle this gracefully add a check for NULL. Observed under an embedded Linux 2.6.32.41 kernel running the automated testing while running the ZFS Test Suite. Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4872	2016-07-29 15:34:12 -07:00
Chunwei Chen	3b86aeb295	Linux 4.8 compat: REQ_OP and bio_set_op_attrs() New REQ_OP_* definitions have been introduced to separate the WRITE, READ, and DISCARD operations from the flags. This included changing the encoding of bi_rw. It places REQ_OP_* in high order bits and other stuff in low order bits. This encoding is done through the new helper function bio_set_op_attrs. For complete details refer to: https://github.com/torvalds/linux/commit/f215082 https://github.com/torvalds/linux/commit/4e1b2d5 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4892 Closes #4899	2016-07-29 14:48:19 -07:00
Brian Behlendorf	bbb1b6cea7	Linux 4.8 compat: submit_bio() The rw argument has been removed from submit_bio/submit_bio_wait. Callers are now expected to set bio->bi_rw instead of passing it in. See https://github.com/torvalds/linux/commit/4e49ea4a for complete details. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4892 Issue #4899	2016-07-29 14:48:00 -07:00
Brian Behlendorf	b7c7008ba2	Linux 4.8 compat: rw_semaphore atomic_long_t count For non-rwsem-spinlocks the "count" member was changed from a "long" to "atomic_long_t" type. A configure check has been added to detect this change along with new versions of the _rwsem_tryupgrade() function and RWSEM_COUNT() macro. See https://github.com/torvalds/linux/commit/8ee62b18 for complete details. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #563	2016-07-29 14:17:53 -07:00
Richard Yao	f26b4b3c8a	txg visibility code should not execute under tc_open_lock The memory allocation and locking in `spa_txg_history_*()` can potentially block txg_hold_open for arbitrarily long periods of time. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4333	2016-07-27 14:11:13 -07:00
Brian Behlendorf	fcf64f45d9	Fix zdb crash with 4K-only devices Here's the problem - on 4K native devices in userland on Linux using O_DIRECT, buffers must be 4K aligned or I/O will fail with EINVAL, causing zdb (and others) to coredump. Since userland probably doesn't need optimized buffer caches, we just force 4K alignment on everything. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #4479	2016-07-27 13:38:46 -07:00
Colin Ian King	bf18fd89f9	void integer overflow on computation of refquota_slack DMU_MAX_ACCESS should be cast to a uint64_t otherwise the multiplication of DMU_MAX_ACCESS with spa_asize_inflation will be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will lead to an overflow. Found by static analysis with CoverityScan 0.8.5 CID 150942 (#1 of 1): Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) overflow_before_widen: Potentially overflowing expression 67108864 * spa_asize_inflation with type int (32 bits, signed) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type uint64_t (64 bits, unsigned). Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4889	2016-07-27 13:38:46 -07:00
Gvozden Neskovic	a64f903b06	Fixes for issues found with cppcheck tool The patch fixes small number of errors/false positives reported by `cppcheck`, static analysis tool for C/C++. cppcheck 1.72 $ cppcheck . --force --quiet [cmd/zfs/zfs_main.c:4444]: (error) Possible null pointer dereference: who_perm [cmd/zfs/zfs_main.c:4445]: (error) Possible null pointer dereference: who_perm [cmd/zfs/zfs_main.c:4446]: (error) Possible null pointer dereference: who_perm [cmd/zpool/zpool_iter.c:317]: (error) Uninitialized variable: nvroot [cmd/zpool/zpool_vdev.c:1526]: (error) Memory leak: child [lib/libefi/rdwr_efi.c:1118]: (error) Memory leak: efi_label [lib/libuutil/uu_misc.c:207]: (error) va_list 'args' was opened but not closed by va_end(). [lib/libzfs/libzfs_import.c:1554]: (error) Dangerous usage of 'diskname' (strncpy doesn't always null-terminate it). [lib/libzfs/libzfs_sendrecv.c:3279]: (error) Dereferencing 'cp' after it is deallocated / released [tests/zfs-tests/cmd/file_write/file_write.c:154]: (error) Possible null pointer dereference: operation [tests/zfs-tests/cmd/randfree_file/randfree_file.c:90]: (error) Memory leak: buf [cmd/zinject/zinject.c:1068]: (error) Uninitialized variable: dataset [module/icp/io/sha2_mod.c:698]: (error) Uninitialized variable: blocks_per_int64 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1392	2016-07-27 13:31:22 -07:00
Tim Chase	25458cbef9	Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4345 Issue #4512 Issue #4773 Closes #4858	2016-07-25 15:26:38 -07:00
Tim Chase	e6603b7c1f	Fix sync behavior for disk vdevs Prior to `b39c22b`, which was first generally available in the 0.6.5 release as `b39c22b`, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In `b39c22b`, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits `5592404` and `aa159af` fixed several problems introduced by `b39c22b`. In particular, `5592404` introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by `aa159af`. The original rationale for introducing synchronous operations in `b39c22b` was to hurry certains requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and ise used for the same purpose. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4858	2016-07-25 14:24:47 -07:00
Brian Behlendorf	273ff9b5cc	Fix uninitialized variable in avl_add() Silence the following warning when compiling with gcc 5.4.0. Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609. module/avl/avl.c: In function ‘avl_add’: module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized in this function [-Wmaybe-uninitialized] avl_insert(tree, new_node, where); Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-07-25 14:21:34 -07:00
Nikolay Borisov	2c6abf15ff	Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid\|gid)_(read\|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227	2016-07-25 13:21:49 -07:00
Tom Caputi	77943bc1dc	Fix for metaslab_fastwrite_unmark() assert failure Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4267	2016-07-25 13:21:43 -07:00
Tom Caputi	f4bc1bbe11	Fix for compilation error when using the kernel's CONFIG_LOCKDEP Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4329	2016-07-21 18:49:50 -07:00
Tom Caputi	0b04990a5d	Illumos Crypto Port module added to enable native encryption in zfs A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written). Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4329	2016-07-20 10:43:30 -07:00
Chunwei Chen	be88e733a6	Fix NULL pointer in zfs_preumount from `1d9b3bd` When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4867 Issue #4854	2016-07-20 09:05:43 -07:00
Gvozden Neskovic	26a08b5ca9	RAIDZ parity kstat rework Print table with speed of methods for each implementation. Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4860	2016-07-19 16:43:07 -07:00
Gvozden Neskovic	c9187d867f	Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4813 Issue #1392	2016-07-19 16:43:07 -07:00
Chunwei Chen	1d9b3bd8fb	Wait iput_async before evict_inodes to prevent race Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4854	2016-07-19 09:23:58 -07:00
Tyler J. Stachecki	3d11ecbddd	Prevent segfaults in SSE optimized Fletcher-4 In some cases, the compiler was not respecting the GNU aligned attribute for stack variables in `35a76a0`. This was resulting in a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue was fixed in gcc 4.6. To prevent this from occurring, use unaligned loads and stores for all stack and global memory references in the SSE optimized Fletcher-4 code. Disable zimport testing against master where this flaw exists: TEST_ZIMPORT_VERSIONS="installed" Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4862	2016-07-19 09:03:44 -07:00
Roman Strashkin	1b87e0f532	Fix filesystem destroy with receive_resume_token It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receieve'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned. Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4818	2016-07-15 15:34:46 -07:00
Tyler J. Stachecki	35a76a0366	Implementation of SSE optimized Fletcher-4 Builds off of `1eeb4562` (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4789	2016-07-15 10:42:35 -07:00
Chris Dunlop	dfbc86309f	Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (`ddae16a`) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4838 Issue #227	2016-07-14 16:25:34 -07:00
Chunwei Chen	02de3e3c5d	Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4846	2016-07-14 16:25:04 -07:00
Tim Chase	8887c7d778	Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4837	2016-07-14 16:25:03 -07:00
Gvozden Neskovic	ae25d22235	Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4783	2016-07-13 10:24:55 -07:00
Gvozden Neskovic	1bf3bf0e29	Fix handling of errors nvlist in zfs_ioc_recv_new() zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4829	2016-07-13 10:20:08 -07:00
Peng	81edd3e834	Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3937	2016-07-12 16:47:44 -07:00
Chunwei Chen	31b6111fd9	Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts `e89260a`. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-07-12 14:18:10 -07:00
Chunwei Chen	ddae16a9cf	xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-07-12 14:04:30 -07:00
Chunwei Chen	6c2530647c	fh_to_dentry should return ESTALE when generation mismatch When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-07-12 13:34:15 -07:00
Chunwei Chen	bffb68a2b8	Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4752	2016-07-12 13:34:15 -07:00
Chunwei Chen	7938c2aca7	Don't allow accessing XATTR via export handle Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-07-12 13:34:14 -07:00
Chunwei Chen	061460dfe2	Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will holds an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-07-12 13:34:14 -07:00
Gvozden Neskovic	590c9a0994	Allow building with `CFLAGS="-O0"` If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning. Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4799	2016-07-11 16:53:02 -07:00
Brian Behlendorf	0dab2e84fc	Vectorized fletcher_4 must be 128-bit aligned The fletcher_4_native() and fletcher_4_byteswap() functions may only safely use the vectorized implementations when the buffer is 128-bit aligned. This is because both the AVX2 and SSE implementations process four 32-bit words per iterations. Fallback to the scalar implementation which only processes a single 32-bit word for unaligned buffers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Issue #4330	2016-06-29 11:22:22 -07:00
Paul Dagnelie	d1d19c7854	OpenZFS 6876 - Stack corruption after importing a pool with a too-long name Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. OpenZFS-issue: https://www.illumos.org/issues/6876 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e	2016-06-28 13:47:04 -07:00
Igor Kozhukhov	eca7b76001	OpenZFS 6314 - buffer overflow in dsl_dataset_name Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee	2016-06-28 13:47:03 -07:00
Brian Behlendorf	43e52eddb1	Implement zfs_ioc_recv_new() for OpenZFS 2605 Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-06-28 13:47:03 -07:00
Dan McDonald	8c62a0d0f3	OpenZFS 6562 - Refquota on receive doesn't account for overage Authored by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6562 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6	2016-06-28 13:47:03 -07:00
Dan McDonald	671c93546c	OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota Authored by: Dan McDonald <danmcd@omniti.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4986 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad	2016-06-28 13:47:03 -07:00
Eli Rosenthal	f8866f8ae3	OpenZFS 6738 - zfs send stream padding needs documentation Authored by: Eli Rosenthal <eli.rosenthal@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6738 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff	2016-06-28 13:47:03 -07:00
Andrew Stormont	b607405fea	OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Authored by: Andrew Stormont <astormont@racktopsystems.com> Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com> Reviewed by: Kim Shrier <kshrier@racktopsystems.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6536 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b	2016-06-28 13:47:03 -07:00
Paul Dagnelie	e6d3a843d6	OpenZFS 6393 - zfs receive a full send as a clone Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e	2016-06-28 13:47:03 -07:00
Matthew Ahrens	47dfff3b86	OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f Porting notes: - All rsend and snapshop tests enabled and updated for Linux. - Fix misuse of input argument in traverse_visitbp(). - Fix ISO C90 warnings and errors. - Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning. - Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported. - Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream. - Fix mktree xattr creation, 'user.' prefix required. - Minor fixes to newly enabled test cases - Long holds for volumes allowed during receive for minor registration.	2016-06-28 13:47:02 -07:00
Ned Bass	50c957f702	Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced to from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill bocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by an large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3542	2016-06-24 13:13:21 -07:00
Ned Bass	68cbd56e18	Backfill metadnode more intelligently Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-06-24 13:13:12 -07:00
smh	d14fa5dba1	FreeBSD rS271776 - Persist vdev_resilver_txg changes Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. Authored-by: smh <smh@FreeBSD.org> Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5154 FreeBSD-issue: https://reviews.freebsd.org/rS271776 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf Closes #4790	2016-06-24 13:02:42 -07:00
Nav Ravindranath	784d15c14c	OpenZFS 6878 - Add scrub completion info to "zpool history" Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Authored by: Nav Ravindranath <nav@delphix.com> Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6878 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5 Closes #4787	2016-06-24 13:01:31 -07:00
Paul Dagnelie	bc77ba73fe	OpenZFS 6513 - partially filled holes lose birth time Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net>a Ported by: Boris Protopopov <bprotopopov@actifio.com> Signed-off-by: Boris Protopopov <bprotopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6513 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data.	2016-06-21 10:55:13 -07:00
Chunwei Chen	100a91aa3e	Fix NFS credential The commit `f74b821` caused a regression where creating file through NFS will always create a file owned by root. This is because the patch enables the KSID code in zfs_acl_ids_create, which it would use euid and egid of the current process. However, on Linux, we should use fsuid and fsgid for file operations, which is the original behaviour. So we revert this part of code. The patch also enables secpolicy_vnode_*, since they are also used in file operations, we change them to use fsuid and fsgid. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4772 Closes #4758	2016-06-21 09:58:37 -07:00
Gvozden Neskovic	ab9f4b0b82	SIMD implementation of vdev_raidz generate and reconstruct routines This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ level. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. Original implementation is still present and can be selected via module parameter. Patch contains: - specialized gen/rec routines for all RAIDZ levels, - new scalar raidz implementation (unrolled), - two x86_64 SIMD implementations (SSE and AVX2 instructions sets), - fastest routines selected on module load (benchmark). - cmd/raidz_test - verify and benchmark all implementations - added raidz_test to the ZFS Test Suite New zfs module parameters: - zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept first 3 options, and the other implementations can be set once module is finished loading. Possible values for this option are: "fastest" - use the fastest math available "original" - use the original raidz code "scalar" - new scalar impl "sse" - new SSE impl if available "avx2" - new AVX2 impl if available See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4328	2016-06-21 09:27:26 -07:00
Tim Chase	09fb30e5e9	Linux 4.6 compat: Fall back to d_prune_aliases() if necessary As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliaes() method in case the kernel's per-superblock shrinker is not able to free anything. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes: #4726	2016-06-17 13:33:49 -07:00
Brian Behlendorf	f74b821a66	Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non- privileged users should be able to run all of the following commands: * zpool [list \| iostat \| status \| get] * zfs [list \| get] Historically this functionality was not available on Linux. In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used. Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future. This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensures only an authorized user can perform it. Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using the both the .ksh and .sh suffixes. Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #362 Closes #434 Closes #4100 Closes #4394 Closes #4410 Closes #4487	2016-06-07 09:16:52 -07:00
Jinshan Xiong	1eeb4562a7	Implementation of AVX2 optimized Fletcher-4 New functionality: - Preserves existing scalar implementation. - Adds AVX2 optimized Fletcher-4 computation. - Fastest routines selected on module load (benchmark). - Test case for Fletcher-4 added to ztest. New zcommon module parameters: - zfs_fletcher_4_impl (str): selects the implementation to use. "fastest" - use the fastest version available "cycle" - cycle trough all available impl for ztest "scalar" - use the original version "avx2" - new AVX2 implementation if available Performance comparison (Intel i7 CPU, 1MB data buffers): - Scalar: 4216 MB/s - AVX2: 14499 MB/s See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Andreas Dilger <andreas.dilger@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4330	2016-06-02 14:30:51 -07:00
Jinshan Xiong	16fc1ec3ba	Improve spl slab cache alloc The policy is to try to allocate with KM_NOSLEEP, which will lead to memory allocation with GFP_ATOMIC, and if it fails, it will launch an taskq to expand slab space. This way it should be able to get better NUMA memory locality and reduce the overhead of context switch. Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #551	2016-06-01 10:26:42 -07:00
Chunwei Chen	6a79672530	Fix memleak in vdev_config_generate_stats fnvlist_add_nvlist will copy the contents of nvx, so we need to free it here. unreferenced object 0xffff8800a6934e80 (size 64): comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s) hex dump (first 32 bytes): 60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff `..s.....\|.s.... 00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff ........@.p..... backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310 [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl] [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl] [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair] [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair] [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair] [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair] [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs] [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs] [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs] [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs] [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs] [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs] [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0 [<ffffffff812333b9>] SyS_ioctl+0x79/0x90 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4707 Issue #4708	2016-05-31 16:05:21 -07:00
Chunwei Chen	06ee0031a6	Fix memleak in zpl_parse_options strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything. unreferenced object 0xffff8800883976c0 (size 64): comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s) hex dump (first 32 bytes): 72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a rw.strictatime.z 66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d fsutil.mntpoint= backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811f9cac>] __kmalloc+0x16c/0x250 [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl] [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs] [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs] [<ffffffff81222dc8>] mount_fs+0x38/0x160 [<ffffffff81240097>] vfs_kern_mount+0x67/0x110 [<ffffffff812428e0>] do_mount+0x250/0xe20 [<ffffffff812437d5>] SyS_mount+0x95/0xe0 [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4706 Issue #4708	2016-05-31 16:04:26 -07:00
Chunwei Chen	540c392793	Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4705 Issue #4708	2016-05-31 16:01:27 -07:00
Chunwei Chen	ea5f1a200b	Fix use-after-free in splat_taskq_test7 This splat_vprint is using tq_arg->name after tq_arg is freed. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #557	2016-05-31 11:58:42 -07:00
Chunwei Chen	f58040c0fc	Implement a proper rw_tryupgrade Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then does rw_enter(RW_READER) if it fails. This violate the assumption that rw_tryupgrade should be atomic and could cause extra contention or even lock inversion. This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we use cmpxchg on rwsem->count to change the value from single reader to single writer. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes zfsonlinux/zfs#4692 Closes #554	2016-05-31 11:44:15 -07:00
GeLiXin	b7faa7aabd	Fix self-healing IO prior to dsl_pool_init() completion Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes(). George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized. However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently. Signed-off-by: GeLiXin <47034221@qq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4652	2016-05-27 14:11:25 -07:00
Tony Hutter	26ef0cc7db	OpenZFS 6531 - Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130 Porting notes: - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file. - Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function. - Updated zinject man page - Updated zpool_scrub test files	2016-05-26 10:11:51 -07:00
Tony Hutter	7e945072d1	Add request size histograms (-r) to zpool iostat, minor man page fix Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working. $ zpool iostat -r mypool sync_read sync_write async_read async_write scrub req_size ind agg ind agg ind agg ind agg ind agg ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 512 0 0 0 0 0 0 530 0 0 0 1K 0 0 260 0 0 0 116 246 0 0 2K 0 0 0 0 0 0 0 431 0 0 4K 0 0 0 0 0 0 3 107 0 0 8K 15 0 35 0 0 0 0 6 0 0 16K 0 0 0 0 0 0 0 39 0 0 32K 0 0 0 0 0 0 0 0 0 0 64K 20 0 40 0 0 0 0 0 0 0 128K 0 0 20 0 0 0 0 0 0 0 256K 0 0 0 0 0 0 0 0 0 0 512K 0 0 0 0 0 0 0 0 0 0 1M 0 0 0 0 0 0 0 0 0 0 2M 0 0 0 0 0 0 0 0 0 0 4M 0 0 0 0 0 0 155 19 0 0 8M 0 0 0 0 0 0 0 811 0 0 16M 0 0 0 0 0 0 0 68 0 0 -------------------------------------------------------------------------------- Also rename the stray "-G" in the man page to be "-w" for latency histograms. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4659	2016-05-25 15:49:35 -07:00
Chunwei Chen	4442f60d8e	Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by force the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4687 Closes #4690	2016-05-25 14:11:53 -07:00
Chunwei Chen	b3a22a0a00	Fix taskq_wait_outstanding re-evaluate tq_next_id wait_event is a macro, so the current implementation will cause re- evaluation of tq_next_id every time it wakes up. This would cause taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #553	2016-05-24 13:02:10 -07:00
Chunwei Chen	5ce028b0d4	Fix race between taskq_destroy and dynamic spawning thread While taskq_destroy would wait for dynamic_taskq to finish its tasks, but it does not implies the thread being spawned is up and running. This will cause taskq to be freed before the thread can exit. We fix this by using tq_nspawn to indicate how many threads are being spawned before they are inserted to the thread list. And have taskq_destroy to wait for it to drop to zero. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #553 Closes #550	2016-05-24 13:00:17 -07:00
Chunwei Chen	872e0cc9c7	Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires In `39cd90e`, I mistakenly disabled the ability of using absolute expire time in cv_timedwait_hires. I don't quite sure why I did that, so let's restore it. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #553	2016-05-24 12:58:49 -07:00
Chunwei Chen	cbecb4fb22	Skip ctldir znode in zfs_rezget to fix snapdir issues Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone automount it again as long as someone is still using the detached mount. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4514 Closes #4661 Closes #4672	2016-05-23 11:06:56 -07:00
Chunwei Chen	9baaa7deae	Linux 4.7 compat: use iterate_shared for concurrent readdir Register iterate_shared if it exists so the kernel will used shared lock and allowing concurrent readdir. Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE to allow concurrent seeking. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4664 Closes #4665	2016-05-20 11:09:16 -07:00
Chunwei Chen	68e8f59afb	Linux 4.7 compat: replace blk_queue_flush with blk_queue_write_cache Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4665	2016-05-20 11:08:55 -07:00
Chunwei Chen	fdbc1ba99d	Linux 4.7 compat: inode_lock() and friends Linux 4.7 changes i_mutex to i_rwsem, and we should used inode_lock and inode_lock_shared to do exclusive and shared lock respectively. We use spl_inode_lock{,_shared}() to hide the difference. Note that on older kernel you'll always take an exclusive lock. We also add all other inode_lock friends. And nested users now should explicitly call spl_inode_lock_nested with correct subclass. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#4665 Closes #549	2016-05-20 11:00:14 -07:00
Nikolay Borisov	278f223668	Kill znode->z_gen field This field is a duplicate of the inode->i_generation, so just kill it. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4538 Closes #4654	2016-05-19 13:06:14 -07:00
Brian Behlendorf	ada8258141	Revert "zhack: Add 'feature disable' command" This reverts commit `8302528617` and `ebecfcd699` which broke the build. While these patches do apply cleanly and passed previous test runs they need to be updated to account for the changes made in commit `241b541574`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3878	2016-05-17 11:52:07 -07:00
Marcel Huber	2587cd8f93	Fixes subtle bug in zio_handle_io_delay() Fixed bug introduced in commit #c35b1882. Hinted by gcc: zio_inject.c: In function ‘zio_handle_io_delay’: zio_inject.c:382:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] if (handler->zi_record.zi_freq != 0 && ^~ zio_inject.c:384:4: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’ continue; ^~~~~~~~ Signed-off-by: Marcel Huber <marcelhuberfoo@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4632	2016-05-17 11:03:11 -07:00
Brian Behlendorf	8302528617	zhack: Add 'feature disable' command Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3878	2016-05-17 11:00:21 -07:00
Chunwei Chen	d88895a069	Remove dummy znode from zvol_state struct zvol_state contains a dummy znode, which is around 1KB on x64, only for zfs_range_lock. But in reality, other than z_range_lock and z_range_avl, zfs_range_lock only need znode on regular file, which means we add 1KB on a structure and gain nothing. In this patch, we remove the dummy znode for zvol_state. In order to do that, we also need to refactor zfs_range_lock a bit. We move z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t. This new struct replaces znode_t as the main handle inside the range lock functions. We also add pointers to z_size, z_blksz, and z_max_blksz so range lock code doesn't depend on znode_t. This allows non-ZPL consumers like Lustre to use the range locks with their equivalent znode_t structure. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4510	2016-05-17 10:29:02 -07:00
Chunwei Chen	a9bb2b6827	Use cv_timedwait_sig_hires in arc_reclaim_thread The was originally using interruptible cv_timedwait_sig, but was changed to uninterruptible cv_timedwait_hires in `ae6d0c6`. Use _sig_hires instead to allow interruptible sleep. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4633 Closes #4634	2016-05-12 14:56:47 -07:00
Chunwei Chen	39cd90ef08	Add cv_timedwait_sig_hires to allow interruptible sleep Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #548	2016-05-12 14:54:15 -07:00
Dan McDonald	8adb798aa5	OpenZFS 6093 - zfsctl_shares_lookup 6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6093 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0f92170 Closes #4630 This function was always implemented slightly differently under Linux and therefore never suffered from this issue. The patch has been updated and applied as cleanup in order to minimize differences with the upstream OpenZFS code.	2016-05-12 13:39:06 -07:00
Brian Behlendorf	c15706490e	Revert "Kill znode->z_gen field" This reverts commit `4cd77889b6`. The i_generation field in the inode is 32-bit and the SA code expects 64-bit fixed values. Revert this optimization for now until this is cleanly addressed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4538	2016-05-12 13:36:22 -07:00
Tony Hutter	193a37cb24	Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues Update the zfs module to collect statistics on average latencies, queue sizes, and keep an internal histogram of all IO latencies. Along with this, update "zpool iostat" with some new options to print out the stats: -l: Include average IO latencies stats: total_wait disk_wait syncq_wait asyncq_wait scrub read write read write read write read write wait ----- ----- ----- ----- ----- ----- ----- ----- ----- - 41ms - 2ms - 46ms - 4ms - - 5ms - 1ms - 1us - 4ms - - 5ms - 1ms - 1us - 4ms - - - - - - - - - - - 49ms - 2ms - 47ms - - - - - - - - - - - - - 2ms - 1ms - - - 1ms - ----- ----- ----- ----- ----- ----- ----- ----- ----- 1ms 1ms 1ms 413us 16us 25us - 5ms - 1ms 1ms 1ms 413us 16us 25us - 5ms - 2ms 1ms 2ms 412us 26us 25us - 5ms - - 1ms - 413us - 25us - 5ms - - 1ms - 460us - 29us - 5ms - 196us 1ms 196us 370us 7us 23us - 5ms - ----- ----- ----- ----- ----- ----- ----- ----- ----- -w: Print out latency histograms: sdb total disk sync_queue async_queue latency read write read write read write read write scrub ------- ------ ------ ------ ------ ------ ------ ------ ------ ------ 1ns 0 0 0 0 0 0 0 0 0 ... 33us 0 0 0 0 0 0 0 0 0 66us 0 0 107 2486 2 788 12 12 0 131us 2 797 359 4499 10 558 184 184 6 262us 22 801 264 1563 10 286 287 287 24 524us 87 575 71 52086 15 1063 136 136 92 1ms 152 1190 5 41292 4 1693 252 252 141 2ms 245 2018 0 50007 0 2322 371 371 220 4ms 189 7455 22 162957 0 3912 6726 6726 199 8ms 108 9461 0 102320 0 5775 2526 2526 86 17ms 23 11287 0 37142 0 8043 1813 1813 19 34ms 0 14725 0 24015 0 11732 3071 3071 0 67ms 0 23597 0 7914 0 18113 5025 5025 0 134ms 0 33798 0 254 0 25755 7326 7326 0 268ms 0 51780 0 12 0 41593 10002 10002 0 537ms 0 77808 0 0 0 64255 13120 13120 0 1s 0 105281 0 0 0 83805 20841 20841 0 2s 0 88248 0 0 0 73772 14006 14006 0 4s 0 47266 0 0 0 29783 17176 17176 0 9s 0 10460 0 0 0 4130 6295 6295 0 17s 0 0 0 0 0 0 0 0 0 34s 0 0 0 0 0 0 0 0 0 69s 0 0 0 0 0 0 0 0 0 137s 0 0 0 0 0 0 0 0 0 ------------------------------------------------------------------------------- -h: Help -H: Scripted mode. Do not display headers, and separate fields by a single tab instead of arbitrary space. -q: Include current number of entries in sync & async read/write queues, and scrub queue: syncq_read syncq_write asyncq_read asyncq_write scrubq_read pend activ pend activ pend activ pend activ pend activ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 0 0 78 29 0 0 0 0 0 0 0 0 78 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 227 394 0 19 0 0 0 0 0 0 227 394 0 19 0 0 0 0 0 0 108 98 0 19 0 0 0 0 0 0 19 98 0 0 0 0 0 0 0 0 78 98 0 0 0 0 0 0 0 0 19 88 0 0 0 0 0 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -p: Display numbers in parseable (exact) values. Also, update iostat syntax to allow the user to specify specific vdevs to show statistics for. The three options for choosing pools/vdevs are: Display a list of pools: zpool iostat ... [pool ...] Display a list of vdevs from a specific pool: zpool iostat ... [pool vdev ...] Display a list of vdevs from any pools: zpool iostat ... [vdev ...] Lastly, allow zpool command "interval" value to be floating point: zpool iostat -v 0.5 Signed-off-by: Tony Hutter <hutter2@llnl.gov Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4433	2016-05-12 12:36:32 -07:00
Nikolay Borisov	04bc461062	Reduce stack usage of dmu_recv_stream function The receive_writer_arg and receive_arg structures become large when ZFS is compiled with debugging enabled. This results in gcc throwing an error about excessive stack usage: module/zfs/dmu_send.c: In function ‘dmu_recv_stream’: module/zfs/dmu_send.c:2502:1: error: the frame size of 1256 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Fix this by allocating those functions on the heap, rather than on the stack. With patch: dmu_send.c:2350:1:dmu_recv_stream 240 static Without patch: dmu_send.c:2350:1:dmu_recv_stream 1336 static Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4620	2016-05-11 12:14:24 -07:00
Chunwei Chen	32c8c946ea	OpenZFS 6842 - Fix empty xattr dir causing lockup Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Denys Rtveliashvili <denys@rtveliashvili.name> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> An initial version of this patch was applied in commit `29572cc` and subsequently refined upstream. Since the implementations do not conflict with each other both are left applied for now. OpenZFS-issue: https://www.illumos.org/issues/6842 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/02525cd Closes #4615	2016-05-10 10:38:21 -07:00
Brian Behlendorf	33cf67cd9a	Wrap vdev_count_verify_zaps() with ZFS_DEBUG Commit `e0ab3ab` introduced two blocks of code which are only needed when debugging is enabled. These blocks should be wrapped with ZFS_DEBUG for clarity and to prevent unused variable warnings in a production build. Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4515	2016-05-06 18:14:03 -07:00
David Quigley	ae6d0c601e	OpenZFS 6672 - arc_reclaim_thread() should use gethrtime() 6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: David Quigley <dpquigl@davequigley.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6672 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/571be5c Closes #4600	2016-05-06 09:35:52 -07:00
Brian Behlendorf	4b2a3e0c9d	OpenZFS 6286 - ZFS internal error when set large block on bootfs 6286 ZFS internal error when set large block on bootfs Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6286 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6de9bb5 Closes #4585	2016-05-05 16:19:12 -07:00
Joe Stein	e0ab3ab553	OpenZFS 6736 - ZFS per-vdev ZAPs 6736 ZFS per-vdev ZAPs Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6736 https://github.com/openzfs/openzfs/commit/215198a Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4515	2016-05-02 14:27:45 -07:00
Nikolay Borisov	4cd77889b6	Kill znode->z_gen field This field is a duplicate of the inode->i_generation, so just kill it Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4538	2016-05-02 11:22:31 -07:00
Tim Chase	ddab862d4c	Enable PF_FSTRANS for ioctl secpolicy callbacks (#4571 ) At the very least, the zfs_secpolicy_write_perms ioctl security policy callback, which calls dsl_dataset_hold(), can require freeing memory and, therefore, re-enter ZFS. This patch enables PF_FSTRANS for all of the security policy callbacks similarly to the manner in which it's enabled for the actual ioctl callback. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4554	2016-05-02 10:00:50 -07:00
Vitaut Bajaryn	ef1c27117b	module/.gitignore: Add *.dwo (#4580 ) These files get generated when CONFIG_DEBUG_INFO_DWARF4 is enabled in Linux .config. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4580	2016-05-02 09:07:04 -07:00
Alex Reece	463a8cfe2b	Illumos 6844 - dnode_next_offset can detect fictional holes 6844 dnode_next_offset can detect fictional holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. References: https://www.illumos.org/issues/6844 https://github.com/openzfs/openzfs/pull/82 DLPX-35372 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4548	2016-04-27 16:24:15 -07:00
Josef 'Jeff' Sipek	8a5fc74880	Illumos 6659 - nvlist_free(NULL) is a no-op 6659 nvlist_free(NULL) is a no-op Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6659 https://github.com/illumos/illumos-gate/commit/aab83bb Ported-by: David Quigley <dpquigl@davequigley.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4566	2016-04-27 15:58:23 -07:00
Tim Chase	ea2633ad26	Clear PF_FSTRANS over spl_filp_fallocate() The problem described in `2a5d574` also applies to XFS's file or inode fallocate method. Both paths may trigger writeback and expose this issue, see the full stack below. When layered on XFS a warning will be emitted under CentOS7 when entering either the file or inode fallocate method with PF_FSTRANS already set. To avoid triggering this error PF_FSTRANS is cleared and then reset in vn_space(). WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0 Call Trace: [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs] [<ffffffff81173ed7>] __writepage+0x17/0x40 [<ffffffff81176f81>] write_cache_pages+0x251/0x530 [<ffffffff811772b1>] generic_writepages+0x51/0x80 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs] [<ffffffff81177300>] do_writepages+0x20/0x30 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs] [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs] [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl] [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs] [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs] [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs] [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl] [<ffffffff810c1777>] kthread+0xd7/0xf0 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com> Closes zfsonlinux/zfs#4529	2016-04-26 11:22:43 -07:00
Brian Behlendorf	2d82ea8b11	Use udev for partition detection When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #4523 Closes #3708 Closes #4077 Closes #4144 Closes #4214 Closes #4517	2016-04-25 11:13:20 -07:00
Chunwei Chen	232604b58e	Linux 4.5 compat: Use xattr_handler->name for acl Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to whole name rather than prefix should use "name" instead of "prefix". Otherwise, kernel will return with EINVAL when it tries to resolve handlers. Also, we remove the strcmp checks when xattr_handler has name, because xattr_resolve_name will do the check for us. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4549 Closes #4537	2016-04-25 08:42:08 -07:00
Brian Behlendorf	da5e151f20	Add pn_alloc()/pn_free() functions In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and pn_free() functions must be implemented. The existing illumos implementation were used for this purpose. The `flags` argument which was used in places wrapped by the HAVE_PN_UTILS condition has beed added back to zfs_remove() and zfs_link() functions. This removes a small point of divergence between the ZoL code and upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4522	2016-04-21 09:49:25 -07:00
Ned Bass	98f03691a4	Fix ZPL miswrite of default POSIX ACL Commit `4967a3e` introduced a typo that caused the ZPL to store the intended default ACL as an access ACL. Due to caching this problem may not become visible until the filesystem is remounted or the inode is evicted from the cache. Fix the typo and add a regression test. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4520	2016-04-18 11:26:55 -07:00
Colin Ian King	4903926f89	Fix inverted logic on none elevator comparison Commit `d1d7e2689d` ("cstyle: Resolve C style issues") inverted the logic on the none elevator comparison. Fix this and make it cstyle warning clean. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4507	2016-04-15 12:18:08 -07:00
Chunwei Chen	704cd0758a	Enable lazytime semantic for atime Linux 4.0 introduces lazytime. The idea is that when we update the atime, we delay writing it to disk for as long as it is reasonably possible. When lazytime is enabled, dirty_inode will be called with only I_DIRTY_TIME flag whenever i_atime is updated. So under such condition, we will set z_atime_dirty. We will only write it to disk if file is closed, inode is evicted or setattr is called. Ideally, we should also write it whenever SA is going to be updated, but it is left for future improvement. There's one thing that we should take care of now that we allow i_atime to be dirty. In original implementation, whenever SA is modified, zfs_inode_update will be called to overwrite every thing in inode. This will cause dirty i_atime to be discarded. We fix this by don't overwrite i_atime in zfs_inode_update. We only overwrite i_atime when allocating new inode or doing zfs_rezget with zfs_inode_update_new. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4482	2016-04-05 18:55:51 -07:00
Chunwei Chen	0df9673f01	Fix atime handling and relatime The problem for atime: We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its handling is a mess. A huge part of mess regarding atime comes from zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave inconsistently with those three values. zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you don't pass ATTR_ATIME. Which means every write(2) operation which only updates ctime and mtime will cause atime changes to not be written to disk. Also zfs_inode_update from write(2) will replace inode->i_atime with what's inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2). You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0. Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new), SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll leave with a stale atime. The problem for relatime: We do have a relatime config inside ZFS dataset, but how it should interact with the mount flag MS_RELATIME is not well defined. It seems it wanted relatime mount option to override the dataset config by showing it as temporary in `zfs get`. But at the same time, `zfs set relatime=on\|off` would also seems to want to override the mount option. Not to mention that MS_RELATIME flag is actually never passed into ZFS, so it never really worked. How Linux handles atime: The Linux kernel actually handles atime completely in VFS, except for writing it to disk. So if we remove the atime handling in ZFS, things would just work, no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever VFS updates the i_atime, it will notify the underlying filesystem via sb->dirty_inode(). And also there's one thing to note about atime flags like MS_RELATIME and other flags like MS_NODEV, etc. They are mount point flags rather than filesystem(sb) flags. Since native linux filesystem can be mounted at multiple places at the same time, they can all have different atime settings. So these flags are never passed down to filesystem drivers. What this patch tries to do: We remove znode->z_atime, since we won't gain anything from it. We remove most of the atime handling and leave it to VFS. The only thing we do with atime is to write it when dirty_inode() or setattr() is called. We also add file_accessed() in zpl_read() since it's not provided in vfs_read(). After this patch, only the MS_RELATIME flag will have effect. The setting in dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME set according to the setting in dataset in future patch. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4482	2016-04-05 18:54:55 -07:00
Brian Behlendorf	8b1899d3fb	Linux 4.6 compat: PAGE_CACHE_SIZE removal As described in torvalds/linux@4a2d057e the macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced to make it possible to add bigger chunks to the page cache. This never panned out and it has therefore been removed from the kernel. ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros and calls to page_cache_release() have been replaced with put_page(). There was no need to introduce a configure check for this because these interfaces have existed for a very long time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4489	2016-04-05 17:26:56 -07:00
Colin Ian King	f7b939bdbc	Add support 32 bit FS_IOC32_{GET\|SET}FLAGS compat ioctls We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS compat ioctls for systems such as powerpc64. We use the normal compat ioctl idiom as used by a variety of file systems to provide this support. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4477	2016-03-31 17:56:12 -07:00
DHE	e3e7cf6026	gcc build error: -Wbool-compare in metaslab.c When debugging is enabled on a very recent version of gcc (tested with 5.3.0), DVA_SET_GANG(dva, !!(flags)) fails because an assertion causes a comparison between what is technically a boolean and an integer. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4465	2016-03-30 09:36:51 -07:00
Brian Behlendorf	c35b188246	Fix zpool_scrub_* test cases The zpool_scrub_002, zpool_scrub_003, zpool_scrub_004 test cases fail reliably when running against small pools or fast storage. This occurs because the scrub/resilver operation completes before subsequent commands can be run. A one second delay has been added to 10% of zio's in order to ensure the scrub/resilver operation will run for at least several seconds. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4450	2016-03-30 09:30:34 -07:00
Tim Chase	7bb5d92de8	Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another thread to be spawned. This will cause TQ_NOQUEUE to behave similarly as it does with non-dynamic taskqs. Add support for TQ_NOQUEUE to taskq_dispatch_ent(). Signed-off-by: Tim Chase <tim@onlight.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #530	2016-03-17 09:52:35 -07:00
Alex Wilson	887d1e60ef	Illumos 6681 - zfs list burning lots of time in dodefault() via dsl_prop_* 6681 zfs list burning lots of time in dodefault() via dsl_prop_* Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6681 https://github.com/illumos/illumos-gate/commit/d09e447 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4406	2016-03-15 18:46:44 -07:00
Paul Dagnelie	c352ec27d5	Illumos 6370 - ZFS send fails to transmit some holes 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Stefan Ring <stefanrin@gmail.com> Reviewed by: Steven Burgess <sburgess@datto.com> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6370 https://github.com/illumos/illumos-gate/commit/286ef71 In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Ported-by: Steven Burgess <sburgess@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4369 Closes #4050	2016-03-10 14:25:22 -08:00
Brian Behlendorf	a6ae97caed	Add rw_tryupgrade() This implementation of rw_tryupgrade() behaves slightly differently from its counterparts on other platforms. It drops the RW_READER lock and then acquires the RW_WRITER lock leaving a small window where no lock is held. On other platforms the lock is never released during the upgrade process. This is necessary under Linux because the kernel does not provide an upgrade function. There are currently no callers in the ZFS code where this change in behavior is a problem. In fact, in most cases the code is already written such that if the upgrade fails the RW_READER lock is dropped and the caller blocks waiting to acquire the lock as RW_WRITER. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Thode <prometheanfire@gentoo.org> Closes zfsonlinux/zfs#4388 Closes #534	2016-03-10 13:05:25 -08:00
Boris Protopopov	1ee159f423	Fix lock order inversion with zvol_open() zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Remove trylock of spa_namespace_lock as it is no longer needed when zvol minor operations are performed in a separate context with no prior locking state; the spa_namespace_lock is no longer held when bdev->bd_mutex or zfs_state_lock might be taken in the code paths originating from the zvol minor operation callbacks. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3681	2016-03-10 09:53:36 -08:00
Boris Protopopov	a0bd735adb	Add support for asynchronous zvol minor operations zfsonlinux issue #2217 - zvol minor operations: check snapdev property before traversing snapshots of a dataset zfsonlinux issue #3681 - lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() Create a per-pool zvol taskq for asynchronous zvol tasks. There are a few key design decisions to be aware of. * Each taskq must be single threaded to ensure tasks are always processed in the order in which they were dispatched. * There is a taskq per-pool in order to keep the pools independent. This way if one pool is suspended it will not impact another. * The preferred location to dispatch a zvol minor task is a sync task. In this context there is easy access to the spa_t and minimal error handling is required because the sync task must succeed. Support for asynchronous zvol minor operations address issue #3681. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2217 Closes #3678 Closes #3681	2016-03-10 09:49:22 -08:00
Tim Chase	f764edf016	Change KM_SLEEP to TQ_SLEEP in spa_deadman() Since they both evaluate to zero, this is a semi-cosmetic change but the latter is the proper value to use as an argument to taskq_dispatch_delay(). Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4393	2016-03-09 10:41:31 -08:00
ab-oe	513168abd2	Make zvol update volsize operation synchronous. There is a race condition when new transaction group is added to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize. Meanwhile, before these dirty data are synchronized, the receive process can cause that dmu_recv_end_sync is executed. Then finally dirty data are going to be synchronized but the synchronization ends with the NULL pointer dereference error. Signed-off-by: ab-oe <arkadiusz.bubala@open-e.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4116	2016-02-29 09:07:27 -08:00
smh	9f500936c8	FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized. The new algorithm takes into the following additional factors: * Load of the vdevs (number outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered such as; is the device backing the vdev a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parrallel dd's. With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates. The following additional sysctls where added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes where based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay References: https://github.com/freebsd/freebsd@5c7a6f5d https://github.com/freebsd/freebsd@31b7f68d https://github.com/freebsd/freebsd@e186f564 Performance Testing: https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141 Porting notes: - The tunables were adjusted to have ZoL-style names. - The code was modified to use ZoL's vd_nonrot. - Fixes were done to make cstyle.pl happy - Merge conflicts were handled manually - freebsd/freebsd@e186f564bc by my collegue Andriy Gapon has been included. It applied perfectly, but added a cstyle regression. - This replaces `556011dbec` entirely. - A typo "IO'a" has been corrected to say "IO's" - Descriptions of new tunables were added to man/man5/zfs-module-parameters.5. Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4334	2016-02-26 11:24:35 -08:00
Brian Behlendorf	8a09d5fd46	Add l2arc_max_block_size tunable Set a limit for the largest compressed block which can be written to an L2ARC device. By default this limit is set to 16M so there is no change in behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4323	2016-02-25 09:44:00 -08:00
Boris Protopopov	5428dc51fb	Make zvol minor functionality more robust Close the race window in zvol_open() to prevent removal of zvol_state in the 'first open' code path. Move the call to check_disk_change() under zvol_state_lock to make sure the zvol_media_changed() and zvol_revalidate_disk() called by check_disk_change() are invoked with positive zv_open_count. Skip opened zvols when removing minors and set private_data to NULL for zvols that are not in use whose minors are being removed, to indicate to zvol_open() that the state is gone. Skip opened zvols when renaming minors to avoid modifying zv_name that might be in use, e.g. in zvol_ioctl(). Drop zvol_state_lock before calling add_disk() when creating minors to avoid deadlocks with zvol_open(). Wrap dmu_objset_find() with spl_fstran_mark()/unmark(). Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #4344	2016-02-24 11:54:24 -08:00
Richard Yao	19a47cb1c2	Call dmu_read_uio_dbuf() in zvol_read() The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is that the former takes a hold while the latter uses an existing hold. `zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no apparent reason, we inherited a `zvol_read()` function from OpenSolaris that does `dmu_read_uio()`. illumos-gate also still uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`, which is more performant. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4316	2016-02-18 10:24:33 -08:00
Richard Yao	a765a34a31	Clean up zvol request processing to pass uio and fix porting regressions In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t rather than bio_t. Since we are translating from bio to uio for both, we might as well unify the logic and have code more similar to its illumos counterpart. At the same time, we can fix some regressions that occurred versus the original code from illumos-gate. We refactor zvol_write to take uio and also correct the following problems: 1. We did `dnode_hold()` on each IO when we already had a hold. 2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the DMU. 3. We could call `zil_commit()` twice. In this case, this is because Linux uses the `->write` function to send flushes and can aggregate the flush with a write. If a synchronous write occurred with the flush, we effectively flushed twice when there is no need to do that. zvol_read also suffers from the first two problems. Other platforms suffer from the first, so we leave that for a second patch so that there is a discrete patch for them to cherry-pick. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4316	2016-02-18 10:23:30 -08:00
Chunwei Chen	093911f194	Remove wrong ASSERT in annotate_ecksum When using large blocks like 1M, there will be more than UINT16_MAX qwords in one block, so this ASSERT would go off. Also, it is possible for the histogram to overflow. We cap them to UINT16_MAX to prevent this. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4257	2016-02-17 10:43:02 -08:00
Richard Yao	0b43696e66	random_get_pseudo_bytes() need not provide cryptographic strength entropy Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast: http://xorshift.di.unimi.it/#speed The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom. Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly. We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids: 1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future. 2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective. 3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`. 4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. There is simply too much code to read and given the drawbacks versus this per-cpu PRNG, there is no point in being certain. 5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`. 6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG: https://en.wikipedia.org/wiki/Seqlock The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPU == 1`). In either case, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #372	2016-02-17 09:49:09 -08:00
Paul Dagnelie	6b42ea8590	Illumos 5809 - Blowaway full receive in v1 pool causes kernel panic 5809 Blowaway full receive in v1 pool causes kernel panic Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5809 https://github.com/illumos/illumos-gate/commit/f40b29c Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-08 09:37:55 -08:00
Chunwei Chen	8f3b403a73	Allow kicking a taskq to spawn more threads This patch add a module parameter spl_taskq_kick. When writing non-zero value to it, it will scan all the taskq, if a taskq contains a task pending for more than 5 seconds, it will be forced to spawn a new thread. This is use as an emergency recovery from deadlock, not a general solution. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #529	2016-02-05 14:08:31 -08:00
Gary Mills	7c9abfa7ab	Illumos 6537 - Panic on zpool scrub with DEBUG kernel 6537 Panic on zpool scrub with DEBUG kernel Reviewed by: Steve Gonczi <gonczi@comcast.net> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6537 https://github.com/illumos/illumos-gate/commit/8c04a1f Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-05 11:29:32 -08:00
Dan McDonald	a4d179efa9	Illumos 6096 - ZFS_SMB_ACL_RENAME needs to cleanup better 6096 ZFS_SMB_ACL_RENAME needs to cleanup better Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6096 https://github.com/illumos/illumos-gate/commit/8f5190a5 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-05 11:28:53 -08:00
Matthew Ahrens	b77222c873	Illumos 6450 - scrub/resilver unnecessarily traverses snapshots 6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/6450 https://github.com/illumos/illumos-gate/commit/38d6103 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-02-02 14:00:24 -08:00
Richard Sharpe	9d36cdb650	Handling negative dentries in a CI file system. For a Case Insensitive file system we must avoid creating negative entries in the dentry cache. We must also pass the FIGNORECASE into zfs_lookup so that special files are handled correctly. We must also prevent negative dentries from being created when files are unlinked. Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.) Also tested with printks (now removed) to ensure that lookups come to zpl_lookup when negative should not exist. Tests: 1. ls Some-file.txt; touch some-file.txt; ls Some-file.txt and ensure no errors. 2. touch Some-file.txt; rm some-file.txt; ls Some-file.txt and ensure that the last ls shows log messages showing the lookup went all the way to zpl_lookup. Thanks to tuxoko for helping me get this correct. Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4243	2016-02-02 13:59:00 -08:00
Jorgen Lundman	4b9ed698b4	Illumos 6527 - Possible access beyond end of string in zpool comment 6527 Possible access beyond end of string in zpool comment Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/6527 https://github.com/illumos/illumos-gate/commit/2bd7a8d Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>	2016-01-28 12:46:04 -05:00
Brian Behlendorf	e56766360b	Illumos 6495 - Fix mutex leak in dmu_objset_find_dp 6495 Fix mutex leak in dmu_objset_find_dp Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Albert Lee <trisk@omniti.com> References: https://www.illumos.org/issues/6495 https://github.com/illumos/illumos-gate/commit/2bad225 Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>	2016-01-28 12:45:39 -05:00
Brian Behlendorf	b6fcb792ca	Illumos 6414 - vdev_config_sync could be simpler 6414 vdev_config_sync could be simpler Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6414 https://github.com/illumos/illumos-gate/commit/eb5bb58 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>	2016-01-28 12:44:39 -05:00
Simon Klinkert	1a04bab348	llumos 6334 - Cannot unlink files when over quota 6334 Cannot unlink files when over quota Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6334 https://github.com/illumos/illumos-gate/commit/6575bca Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:27:08 -08:00
kernelOfTruth	a966c5640e	Reintroduce zfs_remove() synchronous deletes Reintroduce a slightly adapted version of the Illumos logic for synchronous unlinks. The basic idea here is that only files smaller than zfs_delete_blocks (20480) blocks should be deleted synchronously. Unlinking larger files should be handled asynchronously to minimize impact to the caller. To accomplish this iput() which is responsible for calling zfs_znode_delete() on Linux is only called in the delete_now path. Otherwise zfs_async_iput() is used which allows the last reference to be dropped by a taskq thread effectively making the removal asynchronous. Porting notes: - Add zfs_delete_blocks module option for performance analysis. The default value is DMU_MAX_DELETEBLKCNT which is the same as upstream. Reducing this value means that smaller files will be unlinked asynchronously like large files. - All occurrences of zfsvfs changes to zsb. Ported-by: KernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 15:26:02 -08:00
Dan McDonald	460a021391	Log zvol truncate/discard operations As the comments in zvol_discard() suggested, the discard operation could be logged to the zil. This is a port of the relevant code from Nexenta as it was added in "701 UNMAP support for COMSTAR" and has been attributed to the author of that commit. References: https://github.com/Nexenta/illumos-nexenta/commit/b77b923 https://github.com/zfsonlinux/zfs/blob/089fa91b/module/zfs/zvol.c#L637 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-26 14:16:03 -08:00
George Wilson	ba5ad9a48d	Illumos 6251 - add tunable to disable free_bpobj processing 6251 - add tunable to disable free_bpobj processing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Albert Lee <trisk@omniti.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/6251 https://github.com/illumos/illumos-gate/commit/139510f Porting notes: - Added as module option declaration. - Added to zfs-module-parameters.5 man page. Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-25 13:15:17 -08:00
Tim Chase	0a1f8cd999	Set arc_c_min properly in userland builds Since it's set to arc_c_max / 2, it must be set after arc_c_max is set. Also added protection against it falling below 2 * maxblocksize in userland builds. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4268	2016-01-25 10:25:10 -08:00
Tim Chase	1b8951b319	Prevent arc_c collapse Adjusting arc_c directly is racy because it can happen in the context of multiple threads. It should always be >= 2 * maxblocksize. Set it to a known valid value rather than adjusting it directly. In addition refactor arc_shrink() to a simpler structure, protect against underflow in the calculation of the new arc_c value. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reverts: `935434ef` Closes: #3904 Closes: #4161	2016-01-25 10:25:04 -08:00
Brian Behlendorf	6b38e7510f	Remove RLIM64_INFINITY assert in vn_rdwr() Previous commit `be29e6a` updated kobj_read_file() so it no longer unconditionally passes RLIM64_INFINITY. The vn_rdwr() function needs to be updated accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #513	2016-01-23 11:16:23 -08:00
Richard Yao	be29e6a6e6	kobj_read_file: Return -1 on vn_rdwr() error I noticed that the SPL implementation of kobj_read_file is not correct after comparing it with the userland implementation of kobj_read_file() in zfsonlinux/zfs#4104. Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr implementation did not support it anyway, so there is no difference. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #513	2016-01-23 10:10:44 -08:00
Matthew Ahrens	19d55079ae	Illumos 4950 - files sometimes can't be removed from a full filesystem 4950 files sometimes can't be removed from a full filesystem Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4950 https://github.com/illumos/illumos-gate/commit/4bb7380 Porting notes: - ZoL currently does not log discards to zvols, so the portion of this patch that modifies the discard logging to mark it as freeing space has been discarded. 2. may_delete_now had been removed from zfs_remove() in ZoL. It has been reintroduced. 3. We do not try to emulate vnodes, so the following lines are not valid on Linux: mutex_enter(&vp->v_lock); may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp); mutex_exit(&vp->v_lock); This has been replaced with: mutex_enter(&zp->z_lock); may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped); mutex_exit(&zp->z_lock); Ported-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-21 16:59:30 -08:00
Brian Behlendorf	37c56346cc	Close possible zfs_znode_held() race Check if the lock is held while holding the z_hold_locks() lock. This prevents a possible use-after-free bug for callers which are not holding the lock. There currently are no such callers so this can't cause a problem today but it has been fixed regardless. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4244 Issue #4124	2016-01-20 13:36:15 -08:00
Chunwei Chen	16522ac290	Use tsd to store tq for taskq_member To prevent taskq_member holding tq_lock and doing linear search, thus causing contention. We store the taskq pointer to which the thread belongs in tsd. This way taskq_member will not need to touch tq_lock, and tsd has per slot spinlock. So the contention should be reduced greatly. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #500 Closes #504 Closes #505	2016-01-20 13:07:45 -08:00
Brian Behlendorf	4967a3eb9d	Linux 4.5 compat: xattr list handler The registered xattr .list handler was simplified in the 4.5 kernel to only perform a permission check. Given a dentry for the file it must return a boolean indicating if the name is visible. This differs slightly from the previous APIs which also required the function to copy the name in to the provided list and return its size. That is now all the responsibility of the caller. This should be straight forward change to make to ZoL since we've always required the caller to make the copy. However, this was slightly complicated by the need to support 3 older APIs. Yes, between 2.6.32 and 4.5 there are 4 versions of this interface! Therefore, while the functional change in this patch is small it includes significant cleanup to make the code understandable and maintainable. These changes include: - Improved configure checks for .list, .get, and .set interfaces. - Interfaces checked from newest to oldest. - Strict checking for each possible known interface. - Configure fails when no known interface is available. - HAVE__XATTR_LIST renamed HAVE_XATTR_LIST_ for consistency with similar iops and fops configure checks. - POSIX_ACL_XATTR_{DEFAULT\|ACCESS} were removed forcing callers to move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT\|ACCESS}. Compatibility wrapper were added for old kernels. - ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing ZPL_XATTR_{GET\|SET} WRAPPERs. Only the inode is guaranteed to be a valid pointer, passing NULL for the 'list' and 'name' variables is allowed and must be checked for. All .list functions were updated to use the wrapper to aid readability. - zpl_xattr_filldir() updated to use the .list function for its permission check which is consistent with the updated Linux 4.5 interface. If a .list function is registered it should return 0 to indicate a name should be skipped, if there is no registered function the name will be added. - Additional documentation from xattr(7) describing the correct behavior for each namespace was added before the relevant handlers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Issue #4228	2016-01-20 11:36:56 -08:00
Brian Behlendorf	beeed4596b	Linux 4.5 compat: get_link() / put_link() The follow_link() interface was retired in favor of get_link(). In the process of phasing in get_link() the Linux kernel went through two different versions. The first of which depended on put_link() and the final version on a delayed done function. - Improved configure checks for .follow_link, .get_link, .put_link. - Interfaces checked from newest to oldest. - Strict checking for each possible known interface. - Configure fails when no known interface is available. - Both versions .get_link are detected and supported as well two previous versions of .follow_link. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Issue #4228	2016-01-20 11:36:00 -08:00
Josef 'Jeff' Sipek	bc89ac8479	Illumos 5045 - use atomic_{inc,dec}_* instead of atomic_add_* 5045 use atomic_{inc,dec}_* instead of atomic_add_* Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5045 https://github.com/illumos/illumos-gate/commit/1a5e258 Porting notes: - All changes to non-ZFS files dropped. - Changes to zfs_vfsops.c dropped because they were Illumos specific. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4220	2016-01-15 15:38:36 -08:00
Marcel Telka	812e91a7e3	Illumos 4039 - zfs_rename()/zfs_link() needs stronger test for XDEV 4039 zfs_rename()/zfs_link() needs stronger test for XDEV Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Reviewed by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4039 https://github.com/illumos/illumos-gate/commit/18e6497 Porting notes: - This check was updated in Linux in a similar fashion early on in the port. Therefore, this patch just reorders the function and updates the comment so it flows the same way as the upstream code. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4218	2016-01-15 15:38:35 -08:00
George Wilson	59d4c71cca	Illumos 3557, 3558, 3559, 3560 3557 dumpvp_size is not updated correctly when a dump zvol's size is changed 3558 setting the volsize on a dump device does not return back ENOSPC 3559 setting a volsize larger than the space available sometimes succeeds 3560 dumpadm should be able to remove a dump device Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Albert Lee <trisk@nexenta.com> References: https://www.illumos.org/issues/3559 https://github.com/illumos/illumos-gate/commit/c61ea56 Porting notes: - Internal zvol.c changes not applied due to implementation differences. The external interface and behavior was already consistent with the latest upstream code. - Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All supported kernels (2.6.32 and newer) provide this interface. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4217	2016-01-15 15:38:35 -08:00
Chunwei Chen	21f604d460	Prevent duplicated xattr between SA and dir When replacing an xattr would cause overflowing in SA, we would fallback to xattr dir. However, current implementation don't clear the one in SA, so we would end up with duplicated SA. For example, running the following script on an xattr=sa filesystem would cause duplicated "user.1". -- dup_xattr.sh begin -- randbase64() { dd if=/dev/urandom bs=1 count=$1 2>/dev/null \| openssl enc -a -A } file=$1 touch $file setfattr -h -n user.1 -v `randbase64 5000` $file setfattr -h -n user.2 -v `randbase64 20000` $file setfattr -h -n user.3 -v `randbase64 20000` $file setfattr -h -n user.1 -v `randbase64 20000` $file getfattr -m. -d $file -- dup_xattr.sh end -- Also, when a filesystem is switch from xattr=sa to xattr=on, it will never modify those in SA. This would cause strange behavior like, you cannot delete an xattr, or setxattr would cause duplicate and the result would not match when you getxattr. For example, the following shell sequence. -- shell begin -- $ sudo zfs set xattr=sa pp/fs0 $ touch zzz $ setfattr -n user.test -v asdf zzz $ sudo zfs set xattr=on pp/fs0 $ setfattr -x user.test zzz setfattr: zzz: No such attribute $ getfattr -d zzz user.test="asdf" $ setfattr -n user.test -v zxcv zzz $ getfattr -d zzz user.test="asdf" user.test="asdf" -- shell end -- We fix this behavior, by first finding where the xattr resides before setxattr. Then, after we successfully updated the xattr in one location, we will clear the other location. Note that, because update and clear are not in single tx, we could still end up with duplicated xattr. But by doing setxattr again, it can be fixed. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3472 Closes #4153	2016-01-15 15:38:35 -08:00
Richard Yao	b10695c8f1	Remove fastwrite mutex The fast write mutex is intended to protect accounting, but it is redundant because all accounting is performed through atomic operations. It also serializes all metaslab IO behind a mutex, which introduces a theoretical scaling regression that the Illumos developers did not like when we showed this to them. Removing it makes the selection of the metaslab_group lock free as it is on Illumos. The selection is not quite the same without the lock because the loop races with IO completions, but any imbalances caused by this are likely to be corrected by subsequent metaslab group selections. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3643	2016-01-15 15:38:35 -08:00
Brian Behlendorf	c96c36fa22	Fix zsb->z_hold_mtx deadlock The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to serialize access to a znode and its SA buffer while the object is being created or destroyed. This kind of locking would normally reside in the znode itself but in this case that's impossible because the znode and SA buffer may not yet exist. Therefore the locking is handled externally with an array of mutexs and AVLs trees which contain per-object locks. In zfs_znode_hold_enter() a per-object lock is created as needed, inserted in to the correct AVL tree and finally the per-object lock is held. In zfs_znode_hold_exit() the process is reversed. The per-object lock is released, removed from the AVL tree and destroyed if there are no waiters. This scheme has two important properties: 1) No memory allocations are performed while holding one of the z_hold_locks. This ensures evict(), which can be called from direct memory reclaim, will never block waiting on a z_hold_locks which just happens to have hashed to the same index. 2) All locks used to serialize access to an object are per-object and never shared. This minimizes lock contention without creating a large number of dedicated locks. On the downside it does require znode_lock_t structures to be frequently allocated and freed. However, because these are backed by a kmem cache and very short lived this cost is minimal. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4106	2016-01-15 15:33:45 -08:00
Brian Behlendorf	0720116d4d	Add zfs_object_mutex_size module option Add a zfs_object_mutex_size module option to facilitate resizing the the per-dataset znode mutex array. Increasing this value may help make the deadlock described in #4106 less common, but this is not a proper fix. This patch is primarily to aid debugging and analysis. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #4106	2016-01-15 15:33:44 -08:00
Brian Behlendorf	89666a8e1c	Increase default user space stack size Under RHEL6/CentOS6 the default stack size must be increased to 32K to prevent overflowing the stack when running ztest. This isn't an issue for other distributions due to either the version of pthreads or perhaps the compiler. Doubling the stack size resolves the issue safely for all distribution and leaves us some headroom. $ sudo -E ztest -V -T 300 -f /var/tmp/ 5 vdevs, 7 datasets, 23 threads, 300 seconds... loading space map for vdev 0 of 1, metaslab 0 of 30 ... ... loading space map for vdev 0 of 1, metaslab 14 of 30 ... child died with signal 11 Exited ztest with error 3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4215	2016-01-13 13:55:12 -08:00
Chunwei Chen	e843553d03	Don't hold mutex until release cv in cv_wait If a thread is holding mutex when doing cv_destroy, it might end up waiting a thread in cv_wait. The waiter would wake up trying to aquire the same mutex and cause deadlock. We solve this by move the mutex_enter to the bottom of cv_wait, so that the waiter will release the cv first, allowing cv_destroy to succeed and have a chance to free the mutex. This would create race condition on the cv_mutex. We use xchg to set and check it to ensure we won't be harmed by the race. This would result in the cv_mutex debugging becomes best-effort. Also, the change reveals a race, which was unlikely before, where we call mutex_destroy while test threads are still holding the mutex. We use kthread_stop to make sure the threads are exit before mutex_destroy. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue zfsonlinux/zfs#4166 Issue zfsonlinux/zfs#4106	2016-01-12 15:18:44 -08:00
Will Andrews	e6cfd633be	Illumos 3749 - zfs event processing should work on R/O root filesystems 3749 zfs event processing should work on R/O root filesystems Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3749 https://github.com/illumos/illumos-gate/commit/3cb69f7 Porting notes: - [include/sys/spa_impl.h] - `ffe9d38` Add generic errata infrastructure - `1421c89` Add visibility in to arc_read - [include/sys/fm/fs/zfs.h] - `2668527` Add linux events - `6283f55` Support custom build directories and move includes - [module/zfs/spa_config.c] - Updated spa_config_sync() to match illumos with the exception of a Linux specific block. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 14:42:32 -08:00
Justin Gibbs	ee3a23b84e	Illumos 5438 - zfs_blkptr_verify should continue after zfs_panic_recover 5438 zfs_blkptr_verify should continue after zfs_panic_recover Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5438 https://github.com/illumos/illumos-gate/commit/5897eb4 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 13:54:05 -08:00
Josef 'Jeff' Sipek	fc581e0507	Illumos 5515 - dataset user hold doesn't reject empty tags 5515 dataset user hold doesn't reject empty tags Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/5515 https://github.com/illumos/illumos-gate/commit/752fd8d Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 13:52:26 -08:00
George Wilson	a6fb32b85a	Illumos 6281 - prefetching should apply to 1MB reads 6281 prefetching should apply to 1MB reads Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Alexander Motin <mav@freebsd.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Justin Gibbs <gibbs@scsiguy.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/6281 https://github.com/illumos/illumos-gate/commit/6328027 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 13:51:27 -08:00
Saso Kiselkov	adfe9d932b	Illumos 6367 - spa_config_tryenter incorrectly handles the multiple-lock case 6367 spa_config_tryenter incorrectly handles the multiple-lock case Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6367 https://github.com/illumos/illumos-gate/commit/e495b6e Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 11:05:28 -08:00
Joe Stein	5f3d9c69d1	Illumos 6295 - metaslab_condense's dbgmsg should include vdev id 6295 metaslab_condense's dbgmsg should include vdev id Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Justin Gibbs <gibbs@scsiguy.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/6295 https://github.com/illumos/illumos-gate/commit/daec38e Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 11:02:07 -08:00
Justin T. Gibbs	0eb21616fa	Illumos 6171 - dsl_prop_unregister() slows down dataset eviction. 6171 dsl_prop_unregister() slows down dataset eviction. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6171 https://github.com/illumos/illumos-gate/commit/03bad06 Porting notes: - Conflicts - `3558fd7` Prototype/structure update for Linux - `2cf7f52` Linux compat 2.6.39: mount_nodev() - `13fe019` Illumos #3464 - `241b541` Illumos 5959 - clean up per-dataset feature count code - dsl_prop_unregister() preserved until out of tree consumers like Lustre can transition to dsl_prop_unregister_all(). - Fixing 'space or tab at end of line' in include/sys/dsl_dataset.h Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 10:53:12 -08:00
Matthew Ahrens	5a28a9737a	Illumos 6288 - dmu_buf_will_dirty could be faster 6288 dmu_buf_will_dirty could be faster Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Justin Gibbs <gibbs@scsiguy.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/6288 https://github.com/illumos/illumos-gate/commit/0f2e7d0 Porting notes: - [module/zfs/dbuf.c] - Fix 'warning: ISO C90 forbids mixed declarations and code' by moving 'dbuf_dirty_record_t *dr' to start of code block. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 09:13:52 -08:00
George Wilson	2e8efe1bef	Illumos 6292 - exporting a pool while an async destroy 6292 exporting a pool while an async destroy is running can leave entries in the deferred tree Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Fabian Keil <fk@fabiankeil.de> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/6292 https://github.com/illumos/illumos-gate/commit/a443cc8 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 09:10:52 -08:00
Matthew Ahrens	5511754b4f	Illumos 6319 - assertion failed in zio_ddt_write: bp->blk_birth == txg 6319 assertion failed in zio_ddt_write: bp->blk_birth == txg Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/6319 https://github.com/illumos/illumos-gate/commit/b39b744 Porting notes: - Re-enabled ztest for CentOS test slaves. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3449	2016-01-12 09:10:52 -08:00
Matthew Ahrens	7f60329a26	Illumos 5987 - zfs prefetch code needs work 5987 zfs prefetch code needs work Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/5987 zfs prefetch code needs work illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work Porting notes: - [module/zfs/dbuf.c] - `5f6d0b6` Handle block pointers with a corrupt logical size - [module/zfs/dmu_zfetch.c] - `c65aa5b` Fix gcc missing parenthesis warnings - `428870f` Update core ZFS code from build 121 to build 141. - `79c76d5` Change KM_PUSHPAGE -> KM_SLEEP - `b8d06fc` Switch KM_SLEEP to KM_PUSHPAGE - Account for ISO C90 - mixed declarations and code - warnings - Module parameters (new/changed): - Replaced zfetch_block_cap with zfetch_max_distance (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024)) - Preserved zfs_prefetch_disable as 'int' for consistency with existing Linux module options. - [include/sys/trace_arc.h] - Added new tracepoints - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async); - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch); - [man/man5/zfs-module-parameters.5] - Updated man page Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 09:02:33 -08:00
Brian Behlendorf	ab5cbbd107	Illumos 6293 - ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign() 6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign() Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/6293 https://github.com/illumos/illumos-gate/commit/8fe00bf Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 14:10:31 -08:00
Brian Behlendorf	b870c7e5f4	Revert "Illumos 3749 - zfs event processing should work on R/O root filesystems" This reverts commit `b47637ecdc` which introduced a regression in ztest. $ ./cmd/ztest/ztest -V 5 vdevs, 7 datasets, 23 threads, 300 seconds... * Error in `/rpool/home/behlendo/src/git/zfs/cmd/ztest/.libs/lt-ztest': double free or corruption (fasttop): 0x0000000000d339f0 *	2016-01-11 14:10:30 -08:00
Brian Behlendorf	b858767a31	Fix 'prevsnap property' build failure Fix build failure accidentally introduced by `1715493`. This only results in a failure when debugging is disabled. dsl_dataset.c: In function 'dsl_dataset_stats': dsl_dataset.c:1698:45: error: 'dp' undeclared (first use in this function) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 13:35:40 -08:00
Matthew Ahrens	1715493f38	Illumos 4929 - want prevsnap property 4929 want prevsnap property Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Matt Amdur <matt.amdur@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4929 https://github.com/illumos/illumos-gate/commit/b461c74 Porting notes: - [include/sys/fs/zfs.h] - f67d70 Create an 'overlay' property - 11b9ec Add full SELinux support - [fs/zfs/dsl_dataset.c] - This increases the stack size of dsl_dataset_stats() but nothing has been changed until this is shown to be an issue. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 11:58:26 -08:00
Marcel Telka	f3c9dca093	Illumos 4638 - Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot! 4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot! Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4638 https://github.com/illumos/illumos-gate/commit/2144b12 Porting notes: - [module/zfs/zfs_vnops.c] - `3558fd7` Prototype/structure update for Linux - `2cf7f52` Linux compat 2.6.39: mount_nodev() - Use zfs_is_readonly() wrapper - Remove first line of comment which doesn't apply Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 10:29:48 -08:00
Will Andrews	b47637ecdc	Illumos 3749 - zfs event processing should work on R/O root filesystems 3749 zfs event processing should work on R/O root filesystems Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3749 https://github.com/illumos/illumos-gate/commit/3cb69f7 Porting notes: - [include/sys/spa_impl.h] - `ffe9d38` Add generic errata infrastructure - `1421c89` Add visibility in to arc_read - [include/sys/fm/fs/zfs.h] - `2668527` Add linux events - `6283f55` Support custom build directories and move includes - [module/zfs/spa_config.c] - Updated spa_config_sync() to match illumos with the exception of a Linux specific block. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 09:23:37 -08:00
Brian Behlendorf	e9e3d31d2c	Allow 16M send/recv blocks Fix an off by one error introduced by `fcff0f3` which triggers an assertion when 16M blocks are used with send/recv. This fix was intentionally not folder in to the Illumos commit so it can be easily cherry-picked by upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-08 20:23:23 -05:00
Paul Dagnelie	fcff0f35bd	Illumos 5960, 5925 5960 zfs recv should prefetch indirect blocks 5925 zfs receive -o origin= Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/5960 https://www.illumos.org/issues/5925 https://github.com/illumos/illumos-gate/commit/a2cdcdd Porting notes: - [lib/libzfs/libzfs_sendrecv.c] - `b8864a2` Fix gcc cast warnings - `325f023` Add linux kernel device support - `5c3f61e` Increase Linux pipe buffer size on 'zfs receive' - [module/zfs/zfs_vnops.c] - `3558fd7` Prototype/structure update for Linux - `c12e3a5` Restructure zfs_readdir() to fix regressions - [module/zfs/zvol.c] - Function @zvol_map_block() isn't needed in ZoL - `9965059` Prefetch start and end of volumes - [module/zfs/dmu.c] - Fixed ISO C90 - mixed declarations and code - Function dmu_prefetch() 'int i' is initialized before the following code block (c90 vs. c99) - [module/zfs/dbuf.c] - `fc5bb51` Fix stack dbuf_hold_impl() - `9b67f60` Illumos 4757, 4913 - `34229a2` Reduce stack usage for recursive traverse_visitbp() - [module/zfs/dmu_send.c] - Fixed ISO C90 - mixed declarations and code - `b58986e` Use large stacks when available - `241b541` Illumos 5959 - clean up per-dataset feature count code - `77aef6f` Use vmem_alloc() for nvlists - `00b4602` Add linux kernel memory support Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-08 15:08:19 -08:00
Richard Sharpe	c5d0287011	Fix casesensitivity=insensitive deadlock When casesensitivity=insensitive is set for the file system, we can deadlock in a rename if the user uses different case for each path. For example rename("A/some-file.txt", "a/some-file.txt"). The simple test for this is: 1. mkdir some-dir in a ZFS file system 2. touch some-dir/some-file.txt 3. mv Some-dir/some-file.txt some-dir/some-other-file.txt This last request deadlocks trying to relock the i_mutex on the inode for the parent directory. The solution is to use d_add_ci in zpl_lookup if we are on a file system that has the casesensitivity=insensitive attribute set. This patch checks if we are working on a case insensitive file system and if so, allocates storage for the case insensitive name and passes it to zfs_lookup and then calls d_add_ci instead of d_splice_alias. The performance impact seems to be minimal even though we have introduced a kmalloc and kfree in the lookup path. The problem was found when running Microsoft's FSCT against Samba on top of ZFS On Linux. Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4136	2016-01-08 11:05:07 -08:00
Jeremy Jones	b23ad7f350	Illumos 3139 - zdb dies when it tries to determine path of unlinked file 3139 zdb dies when it tries to determine path of unlinked file Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed by: Christopher Siden <chris.siden@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://github.com/illumos/illumos-gate/commit/1ce39b5 https://www.illumos.org/issues/3139 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-05 11:25:41 -08:00
Matthew Ahrens	37f8a8835a	Illumos 5746 - more checksumming in zfs send 5746 more checksumming in zfs send Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@omniti.com> References: https://www.illumos.org/issues/5746 https://github.com/illumos/illumos-gate/commit/98110f0 https://github.com/zfsonlinux/zfs/issues/905 Porting notes: - Minor conflicts due to: - https://github.com/zfsonlinux/zfs/commit/2024041 - https://github.com/zfsonlinux/zfs/commit/044baf0 - https://github.com/zfsonlinux/zfs/commit/88904bb - Fix ISO C90 warnings (-Werror=declaration-after-statement) - arc_buf_t abuf; - dmu_buf_t bonus; - zio_cksum_t cksum_orig; - zio_cksum_t *cksump; - Fix format '%llx' format specifier warning - Align message in zstreamdump safe_malloc() with upstream Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3611	2015-12-30 14:24:14 -08:00
Ned Bass	43b4935e53	Prevent SA length overflow The function sa_update() accepts a 32-bit length parameter and assigns it to a 16-bit field in sa_bulk_attr_t, potentially truncating the passed-in value. This could lead to corrupt system attribute (SA) records getting written to the pool. Add a VERIFY to sa_update() to detect cases where overflow would occur. The SA length is limited to 16-bit values by the on-disk format defined by sa_hdr_phys_t. The function zfs_sa_set_xattr() is vulnerable to this bug if the unpacked nvlist of xattrs is less than 64k in size but the packed size is greater than 64k. Fix this by appropriately checking the size of the packed nvlist before calling sa_update(). Add error handling to zpl_xattr_set_sa() to keep the cached list of SA-based xattrs consistent with the data on disk. Lastly, zfs_sa_set_xattr() calls dmu_tx_abort() on an assigned transaction if sa_update() returns an error, but the DMU only allows unassigned transactions to be aborted. Wrap the sa_update() call in a VERIFY0, remove the transaction abort, and call dmu_tx_commit() unconditionally. This is consistent practice with other callers of sa_update(). Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #4150	2015-12-30 13:20:12 -08:00
Chunwei Chen	f5f087eb88	Make xattr dir truncate and remove in one tx We need truncate and remove be in the same tx when doing zfs_rmnode on xattr dir. Otherwise, if we truncate and crash, we'll end up with inconsistent zap object on the delete queue. We do this by skipping dmu_free_long_range and let zfs_znode_delete to do the work. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4114 Issue #4052 Issue #4006 Issue #3018 Issue #2861	2015-12-28 09:48:26 -08:00
Chunwei Chen	29572ccdef	Fix empty xattr dir causing lockup During zfs_rmnode on a xattr dir, if the system crash just after dmu_free_long_range, we would get empty xattr dir in delete queue. This would cause blkid=0 be passed into zap_get_leaf_byblk when doing zfs_purgedir during mount, and would try to do rw_enter on a wrong structure and cause system lockup. We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4114 Closes #4052 Closes #4006 Closes #3018 Closes #2861	2015-12-28 09:41:30 -08:00
Brian Behlendorf	2ebc7b72b3	Fix z_xattr_lock/z_teardown_lock inversion There exists a lock inversion between the z_xattr_lock and the z_teardown_lock. Resolve this by taking the z_teardown_lock in all registered xattr callbacks prior to taking the z_xattr_lock. This ensures the locks are always taken is the same order thus preventing a deadlock. Note the z_teardown_lock is taken again in zfs_lookup() and this is safe because the z_teardown lock is a re-entrant read reader/writer lock. * process-1 zpl_xattr_get -> Takes zp->z_xattr_lock __zpl_xattr_get zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro * process-2 zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs() zfs_resume_fs zfs_rezget -> Takes zp->z_xattr_lock Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #3943 Closes #3969 Closes #4121	2015-12-22 16:59:20 -08:00
Brian Behlendorf	228b461b56	Revert "Fix z_xattr_lock/z_teardown_lock lock inversion" This reverts commit `6b32ef572f`.	2015-12-22 16:58:43 -08:00
Brian Behlendorf	151f84e2c3	Fix ztest truncated cache file Commit `efc412b` updated spa_config_write() for Linux 4.2 kernels to truncate and overwrite rather than rename the cache file. This is the correct fix but it should have only been applied for the kernel build. In user space rename(2) is needed because ztest depends on the cache file. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4129	2015-12-22 10:40:40 -08:00
Brian Behlendorf	2a552736b7	Fix do_div() types in condvar:timeout The do_div() macro expects unsigned types and this is detected in powerpc implementation of do_div(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #516	2015-12-22 10:32:35 -08:00
Olaf Faaland	448d7aaabc	Identify locks flagged by lockdep When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible recursive locking in some cases and possible circular locking dependency in others, within the SPL and ZFS modules. This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark such mutexes when they are initialized. This mutex type causes attempts to take or release those locks to be wrapped in lockdep_off() and lockdep_on() calls to silence the dependency checker and allow the use of lock_stats to examine contention. For RW locks, it uses an analogous lock type, RW_NOLOCKDEP. The goal is that these locks are ultimately changed back to type MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect their relationship (e.g. z_name_lock below) or any real problem with the lock dependencies are fixed. Some of the affected locks are: tc_open_lock: ============= This is an array of locks, all with same name, which txg_quiesce must take all of in order to move txg to next state. All default to the same lockdep class, and so to lockdep appears recursive. zp->z_name_lock: ================ In zfs_rmdir, dzp = znode for the directory (input to zfs_dirent_lock) zp = znode for the entry being removed (output of zfs_dirent_lock) zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp zfs_rmdir() takes z_name_lock in zp Since both dzp and zp are type znode_t, the locks have the same default class, and lockdep considers it a possible recursive lock attempt. l->l_rwlock: ============ zap_expand_leaf() sometimes creates two new zap leaf structures, via these call paths: zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open() zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf() Because both zap_leaf_open() and zap_create_leaf() initialize l->l_rwlock in their (separate) leaf structures, the lockdep class is the same, and the linux kernel believes these might both be the same lock, and emits a possible recursive lock warning. Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3895	2015-12-22 10:21:33 -08:00
DHE	dcb6bed1df	Make zio_taskq_batch_pct user configurable Adds zio_taskq_batch_pct as an exported module parameter, allowing users to modify it at module load time. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4110	2015-12-18 13:46:23 -08:00
Brian Behlendorf	a58df6f536	Fix zfs_vdev_aggregation_limit bounds checking Update the bounds checking for zfs_vdev_aggregation_limit so that it has a floor of zero and a maximum value of the supported block size for the pool. Additionally add an early return when zfs_vdev_aggregation_limit equals zero to disable aggregation. For very fast solid state or memory devices it may be more expensive to perform the aggregation than to issue the IO immediately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-12-18 13:32:06 -08:00
Brian Behlendorf	6fe53787f3	Fix vdev_queue_aggregate() deadlock This deadlock may manifest itself in slightly different ways but at the core it is caused by a memory allocation blocking on file- system reclaim in the zio pipeline. This is normally impossible because zio_execute() disables filesystem reclaim by setting PF_FSTRANS on the thread. However, kmem cache allocations may still indirectly block on file system reclaim while holding the critical vq->vq_lock as shown below. To resolve this issue zio_buf_alloc_flags() is introduced which allocation flags to be passed. This can then be used in vdev_queue_aggregate() with KM_NOSLEEP when allocating the aggregate IO buffer. Since aggregating the IO is purely a performance optimization we want this to either succeed or fail quickly. Trying too hard to allocate this memory under the vq->vq_lock can negatively impact performance and result in this deadlock. * z_wr_iss zio_vdev_io_start vdev_queue_io -> Takes vq->vq_lock vdev_queue_io_to_issue vdev_queue_aggregate zio_buf_alloc -> Waiting on spl_kmem_cache process * z_wr_int zio_vdev_io_done vdev_queue_io_done mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss * txg_sync spa_sync dsl_pool_sync zio_wait -> Waiting on zio being handled by z_wr_int * spl_kmem_cache spl_cache_grow_work kv_alloc spl_vmalloc ... evict zpl_evict_inode zfs_inactive dmu_tx_wait txg_wait_open -> Waiting on txg_sync Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3808 Closes #3867	2015-12-18 13:27:12 -08:00
Chunwei Chen	b4ad50ac5f	Use spl_fstrans_mark instead of memalloc_noio_save For earlier versions of the kernel with memalloc_noio_save, it only turns off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This would cause threads to direct reclaim into ZFS and cause deadlock. Instead, we should stick to using spl_fstrans_mark. Since we would explicitly turn off both __GFP_IO and __GFP_FS before allocation, it will work on every version of the kernel. This impacts kernel versions 3.9-3.17, see upstream kernel commit torvalds/linux@934f307 for reference. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #515 Issue zfsonlinux/zfs#4111	2015-12-18 13:24:52 -08:00
Brian Behlendorf	a8ad3bf02c	Fix z_xattr_lock/z_teardown_lock lock inversion There exists a lock inversion between the z_xattr_lock and the z_teardown_lock. Detect this case and return EBUSY so zfs_resume_fs() will mark the inode stale and it can be safely revalidated on next access. * process-1 zpl_xattr_get -> Takes zp->z_xattr_lock __zpl_xattr_get zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro * process-2 zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs() zfs_resume_fs zfs_rezget -> Takes zp->z_xattr_lock Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3969	2015-12-18 13:17:44 -08:00
Tim Chase	200366f23f	Provide kstat for taskqs This patch provides 2 new kstats to display task queues: /proc/spl/taskqs-all - Display all task queues /proc/spl/taskqs - Display only "active" task queues A task queue is considered to be "active" if it currently has active (running) threads or if any of its pending, priority, delay or waitq lists are not empty. If the task queue has running threads, displays each thread function's address (symbolically, if possibly) and its argument. If the task queue has a non-empty list of pending, priority or delayed task queue entries (taskq_ent_t), displays each entry's thread function address and arguemnt. If the task queue has any waiters, displays each waiting task's pid. Note: This patch also updates some comments in taskq.h which referred to "taskq_t" when they should have referred to "taskq_ent_t". Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #491	2015-12-16 09:35:22 -08:00
Chunwei Chen	2727b9d3b6	Use uio for zvol_{read,write} Since uio now supports bvec, we can convert bio into uio and reuse dmu_{read,write}_uio. This way, we can remove some duplicate code. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4078	2015-12-15 16:21:43 -08:00
Chunwei Chen	502923bb44	Fix uio_prefaultpages for 0 length iovec Userspace can freely pass in whatever iovec it feels like, and it's perfectly legal to pass an iovec which contains a zero length segment. In the current implementation, uio_prefaultpages would touch an out of bound byte in the "last byte" logic. While this probably wouldn't cause any critical error, we would like uio_prefaultpages to be able to continue gracefully. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4078	2015-12-15 16:19:55 -08:00
Brian Behlendorf	eba9e745dc	Handle damaged blk_birth in dsl_deadlist_insert() If a bit were cleared in `bp->blk_birth` such that the txg birth was now lower than any other txg_birth in the deadlist, then there will be no entry before this in the tree. This should be impossible but regardless error handling code has been added for this case. By default this is left as a fatal case and the blk_birth is logged. However, setting `zfs_recover=1` will cause the bp to be placed at the start of the deadlist even though it contains an invalid blk_birth. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4086 Closes #4089	2015-12-15 16:12:31 -08:00
Brian Behlendorf	1cdb86cba2	Handle block pointers with a corrupt logical size Commit `5f6d0b6` was originally added to gracefully handle block pointers with a damaged logical size. However, it incorrectly assumed that all passed arc_done_func_t could handle a NULL arc_buf_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4069 Closes #4080	2015-12-15 16:11:44 -08:00
Brian Behlendorf	245b7ab3d1	Hold the zfs_snapentry_t before dispatch While exceptionally unlikely to cause a problem the zfs_snapentry_t hold should be taken before the dispatch to prevent any possibility of the task being processed before the hold. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2015-12-14 12:06:31 -08:00
Chunwei Chen	1997660170	Fix snapshot automount race cause EREMOTE When a concorrent mount finishes just before calling to zfsctl_snapshot_ismounted, if we return EISDIR, the VFS will return with EREMOTE. We should instead just return 0, so VFS may retry and would likely notice the dentry is alreadly mounted. This will be inline with when usermode helper return EBUSY. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-12-14 12:06:31 -08:00
Brian Behlendorf	5ed27c572c	Change zfs_snapshot_lock from mutex to rw lock By changing the zfs_snapshot_lock from a mutex to a rw lock the zfsctl_lookup_objset() function can be allowed to run concurrently. This should reduce the latency of fh_to_dentry lookups in ZFS snapshots which are being accessed over NFS. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2015-12-14 12:06:31 -08:00
Brian Behlendorf	f22f900f15	Fix zfsctl_lookup_objset() deadlock The zfsctl_snapshot_unmount_delay() function must not be called from zfsctl_lookup_objset() while it is currently holding the zfs_snapshot_lock. This will result in a deadlock. It is safe to call zfsctl_snapshot_unmount_delay_impl() directly because the function already has a reference on the zfs_snapentry_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3997	2015-12-14 12:05:52 -08:00
Brian Behlendorf	5e94284fe5	Set 'zfs_expire_snapshot=0' to disable auto-unmount There are cases where it's desirable that auto-mounted snapshots not expire after a fixed duration. They should be unmounted only when the filesystem they are a snapshot of is unmounted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2015-12-14 11:02:32 -08:00
Brian Behlendorf	2c4332cf79	Fix cstyle issues in spl-taskq.c and taskq.h This patch only addresses the issues identified by the style checker. It contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-12-11 16:20:22 -08:00
Chunwei Chen	066b89e685	Don't use tq->tq_lock_flags The flags argument in spin_lock_irqsave is modified out side of spin_lock context. We cannot use a shared variable like tq->tq_lock_flags for them. This patch removes it and uses local variable for the flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #506	2015-12-11 16:20:03 -08:00
Olaf Faaland	326172d854	Subclass tq_lock to eliminate a lockdep warning When taskq_dispatch() calls taskq_thread_spawn() to create a new thread for a taskq, linux lockdep warns of possible recursive locking. This is a false positive. One such call chain is as follows, when a taskq needs more threads: taskq_dispatch->taskq_thread_spawn->taskq_dispatch The initial taskq_dispatch() holds tq_lock on the taskq that needed more worker threads. The later call into taskq_dispatch() takes dynamic_taskq->tq_lock. Without subclassing, lockdep believes these could potentially be the same lock and complains. A similar case occurs when taskq_dispatch() then calls task_alloc(). This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one of two new lock subclasses: subclass taskq TQ_LOCK_DYNAMIC dynamic_taskq TQ_LOCK_GENERAL any other Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #480	2015-12-11 16:19:56 -08:00
Brian Behlendorf	c5a8b1e163	Revert "Make taskq_member() use ->journal_info" This reverts commit `a430c11f0b`. Using journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425! Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #500	2015-12-08 17:12:36 -08:00
Chunwei Chen	24ef51f660	Use spa as key besides objsetid for snapentry objsetid is not unique across pool, so using it solely as key would cause panic when automounting two snapshot on different pools with the same objsetid. We fix this by adding spa pointer as additional key. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3948 Issue #3786 Issue #3887	2015-12-08 16:38:56 -08:00
Richard Yao	a430c11f0b	Make taskq_member() use ->journal_info The ->journal_info pointer in the task_struct is reserved for use by filesystems and because the kernel can have multiple file systems on the same stack due to direct reclaim, each filesystem that touches ->journal_info in a callback function will save the value at the start of its frame and restore it at the end of its frame. This allows us to safely use ->journal_info to store a pointer to the taskq's struct in taskq threads so that ZFS code paths can detect the presence of a taskq. This could break if the ZFS code were to use taskq_member from the context of direct reclaim. However, there are no such uses of it in that manner, so this is safe. This eliminates an O(N) list traversal under a spinlock with an O(1) unlocked pointer comparison. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: tuxoko <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #500	2015-12-08 13:24:47 -08:00
Brian Behlendorf	b58986eebf	Use large stacks when available While stack size will vary by architecture it has historically defaulted to 8K on x86_64 systems. However, as of Linux 3.15 the default thread stack size was increased to 16K. These kernels are now the default in most non- enterprise distributions which means we no longer need to assume 8K stacks. This patch takes advantage of that fact by appropriately reverting stack conservation changes which were made to ensure stability. Changes which may have had a negative impact on performance for certain workloads. This also has the side effect of bringing the code slightly more in line with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #4059	2015-12-07 12:20:43 -08:00
Matthew Ahrens	241b541574	Illumos 5959 - clean up per-dataset feature count code 5959 clean up per-dataset feature count code Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5959 https://github.com/illumos/illumos-gate/commit/ca0cc39 Porting notes: illumos code doesn't check for feature_get_refcount() returning ENOTSUP (which means feature is disabled) in zdb. zfsonlinux added a check in https://github.com/zfsonlinux/zfs/commit/784652c due to #3468. The check was reintroduced here. Ported-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3965	2015-12-04 14:20:20 -08:00
Brian Behlendorf	072484504f	Add zap_prefetch() interface Provide a generic interface to prefetch ZAP entries by name. This functionality is being added for external consumers such as Lustre. It is based of the existing zap_prefetch_uint64() version which is used by the deduplication code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #4061	2015-12-04 09:39:20 -08:00
Richard Yao	1683e75edc	Fix race between getf() and areleasef() If a vnode is released asynchronously through areleasef(), it is possible for the user process to reuse the file descriptor before areleasef is called. When this happens, getf() will return a stale reference, any operations in the kernel on that file descriptor will fail (as it is closed) and the operations meant for that fd will never occur from userspace's perspective. We correct this by detecting this condition in getf(), doing a putf on the old file handle, updating the file descriptor and proceeding as if everything was fine. When the areleasef() is done, it will harmlessly decrement the reference counter on the Illumos file handle. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #492	2015-12-03 15:44:47 -08:00
tuxoko	b0fe1adeb1	Prevent rm modules.* when make install This was originally in `fe0ed8f910`, but somehow was changed and not working anymore. And it will cause the following error: modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin' Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4027	2015-12-02 14:39:12 -08:00
tuxoko	d28c5c4f04	Prevent rm modules.* when make install This was originally in `e80cd06b8e`, but somehow was changed and not working anymore. And it will cause the following error: modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin' Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #501	2015-12-02 14:38:20 -08:00
Dimitri John Ledkov	9f456111ab	spl-kmem-cache: include linux/prefetch.h for prefetchw() This is needed for architectures that do not have a builtin prefetchw() Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #502	2015-12-02 12:45:06 -08:00
Chunwei Chen	61d482f7cd	Linux 4.4 compat: xattr operations takes xattr_handler The xattr_hander->{list,get,set} were changed to take a xattr_handler, and handler_flags argument was removed and should be accessed by handler->flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4021	2015-12-01 16:48:25 -08:00
Chunwei Chen	1a09371678	Linux 4.4 compat: make_request_fn returns blk_qc_t As part of block polling support in Linux 4.4, make_request_fn should return a cookie value of type blk_qc_t. For now, we make zvol_request always return BLK_QC_T_NONE until we assess whether and how we want to support block polling. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4021	2015-12-01 16:48:08 -08:00
tuxoko	43518d92fd	Fix zfs_dirty_data_max overflow on 32-bit On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow, causing it to be smaller than zfs_dirty_data_sync, and will cause txg being delayed while no one write to disk. The end result is horrendous write speed. On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it can reach speed on par with 64-bit VM. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3973	2015-11-19 16:02:47 -08:00
tuxoko	d0c614ecf9	Fix null pointer in arc_kmem_reap_now on 32-bit On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL. So we need to avoid reaping them. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3973	2015-11-19 16:01:47 -08:00
Chunwei Chen	d287880afd	Fix snapshot automount behavior when concurrent or fail When concurrent threads accessing the snapdir, one will succeed the user helper mount while others will get EBUSY. However, the original code treats those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which causes repeated avl_add and thus panic. Also, if the snapshot is already mounted somewhere else, a thread accessing the snapdir will also get EBUSY from user helper mount. And it will cause strange things as doing follow_down_one will fail and then follow_up will jump up to the mountpoint of the filesystem and confuse the hell out of VFS. The patch fix both behavior by returning 0 immediately for the EBUSY threads. Note, this will have a side effect for the second case where the VFS will retry several times before returning ELOOP. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4018	2015-11-19 15:36:59 -08:00
Brian Behlendorf	3d8d245fb3	Follow 0/-E convention for module load errors Because errors during module load are so rare it went unnoticed that it was possible that a positive errno was returned. This would result in the module being loaded, nothing being initialized, and a system panic shortly thereafter. This is what was causing the hard failures in the automated testing. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-11-16 16:10:06 -08:00
Brian Behlendorf	e7b75d9b46	Limit maximum object size in kmem tests Limit the maximum object size to 1/128 of total system memory for the kmem cache tests. Large values can result in out of memory errors for systems with less the 512M of memory. Additionally, use the known number of objects per-slab for calculating the number of objects to use for a test. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-11-16 15:02:24 -08:00
AndCycle	256fa983f4	Obey arc_meta_limit default size when changing arc_max When decreasing the maximum ARC size preserve the 3/4 default ratio for the arc_meta_limit. Otherwise, the arc_meta_limit may be set the same as arc_max. Signed-off-by: AndCycle <andcycle@andcycle.idv.tw> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4001	2015-11-13 15:45:22 -08:00
loli10K	31f24932a4	Remove superfluous `newline` character Remove superfluous `newline` character from spl_kmem_cache_magazine_size module parameter description. Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #499	2015-11-13 15:27:45 -08:00
tuxoko	f5f2b87df0	Fix taskq dynamic spawning Currently taskq_dispatch() will spawn new task with a condition that the caller is also a member of the taskq. However, under this condition, it will still cause deadlock where a task on tq1 is waiting another thread, who is trying to dispatch a task on tq1. So this patch removes the check. For example when you do: zfs send pp/fs0@001 \| zfs recv pp/fs0_copy This will easily deadlock before this patch. Also, move the seq_task check from taskq_thread_spawn() to taskq_thread() because it's not used by the caller from taskq_dispatch(). Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #496	2015-11-13 15:02:55 -08:00
Chunwei Chen	3e7e6f34d0	Don't call kmem_cache_shrink from shrinker Linux slab will automatically free empty slab when number of partial slab is over min_partial, so we don't need to explicitly shrink it. In fact, calling kmem_cache_shrink from shrinker will cause heavy contention on kmem_cache_node->list_lock, to the point that it might cause __slab_free to livelock (see zfsonlinux/zfs#3936) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#3936 Closes #487	2015-11-11 13:48:31 -08:00
Chunwei Chen	07d63f0cb9	Fix fail path in zfs_znode_alloc When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive(). Since we never call insert_inode_locked in fail path, I_NEW is never set, the inode is never hashed. So unlock_new_inode() can be safely remove it. We set z_sa_hdl to NULL in fail path so that iput path will stop at zfs_inactive() without entering zfs_zinactive(). This way we can avoid the deadlock and prevent double sa_handle_destroy(). Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3899	2015-10-13 15:57:17 -07:00
Chunwei Chen	aa159afb56	Fix use-after-free in vdev_disk_physio_completion Currently, vdev_disk_physio_completion will try to wake up an waiter without first checking the existence. This creates a race window in which complete is called after dr is freed. We add dr_wait in dio_request to indicate the existence of waiter. Also, remove dr_rw since no one is using it, and reorder dr_ref to make the struct more compact in 64bit. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3917 Issue #3880	2015-10-13 15:25:33 -07:00
Justin T. Gibbs	bc4501f75a	Illumos 6267 - dn_bonus evicted too early 6267 dn_bonus evicted too early Reviewed by: Richard Yao <ryao@gentoo.org> Reviewed by: Xin LI <delphij@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/6267 https://github.com/illumos/illumos-gate/commit/d205810 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Ned Bass bass6@llnl.gov Issue #3865 Issue #3443	2015-10-13 14:12:02 -07:00
Brian Behlendorf	9b13f65d28	Fix CPU hotplug Allocate a kmem cache magazine for every possible CPU which might be added to the system. This ensures that when one of these CPUs is enabled it can be safely used immediately. For many systems the number of online CPUs is identical to the number of present CPUs so this does imply an increased memory footprint. In fact, dynamically allocating the array of magazine pointers instead of using the worst case NR_CPUS can end up decreasing our memory footprint. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #482	2015-10-13 09:50:40 -07:00
Brian Behlendorf	935434ef01	Fix 'arc_c < arc_c_min' panic Strictly enforce keeping 'arc_c >= arc_c_min'. The ASSERTs are left in place to catch this in a debug build but logic has been added to gracefully handle in a production build. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3904	2015-10-13 09:23:35 -07:00
Richard Yao	919efe93cb	zfs_inode_update should not call dmu_object_size_from_db under spinlock We should never block when holding a spin lock, but zfs_inode_update can block in the critical section of a spin lock in zfs_inode_update: zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3858	2015-09-30 10:47:40 -07:00
Richard Yao	bc8ffb2d08	Remove obsolete zv_lock All users of zv_lock were removed by `37f9dac`, but we forgot to remove it. Lets remove it as clean up. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3858	2015-09-30 10:43:19 -07:00
Chunwei Chen	45838e3a41	Fix uioskip crash when skip to end When doing uioskip to skip an iovec to the very end, the current loop condition will falsely check pass the end of iovec. We fix this checking uio_iovcnt first. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3806 Closes #3850	2015-09-29 10:06:58 -07:00
Brian Behlendorf	2ebe396046	Fix PAX Patch/Grsec SLAB_USERCOPY panic Support grsecurity/PaX kernel configurations where CONFIG_PAX_USERCOPY_SLABS are enabled. When this kernel option is enabled slabs which are used to copy between user and kernel space must be created with SLAB_USERCOPY. Stock Linux kernels do not have a SLAB_USERCOPY definition so this causes no change in behavior for non-PAX-enabled kernels. Verified-by: Wuffleton <null@wuffleton.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2977 Issue #3796	2015-09-28 09:18:29 -07:00
Richard Yao	b815ec32b3	Userspace can pass zero length segments via writev/readv Userspace can trigger an assertion by passing a zero-length segment when assertions are enabled: [27961.614792] VERIFY3(skip < iov->iov_len) failed (0 < 0) [27961.614795] PANIC at zfs_uio.c:187:uio_prefaultpages() [27961.614805] Call Trace: [27961.614811] dump_stack+0x45/0x57 [27961.614830] spl_dumpstack+0x44/0x50 [spl] [27961.614834] spl_panic+0xbb/0x100 [spl] [27961.614908] uio_prefaultpages+0x134/0x140 [zcommon] [27961.614930] zfs_write+0x1fd/0xe80 [zfs] [27961.615014] zpl_write_common_iovec+0x7f/0x110 [zfs] [27961.615035] zpl_iter_write+0xa0/0xd0 [zfs] [27961.615037] do_iter_readv_writev+0x59/0x80 [27961.615063] do_readv_writev+0x11b/0x260 [27961.615098] vfs_writev+0x39/0x50 [27961.615100] SyS_writev+0x4a/0xe0 [27961.615103] system_call_fastpath+0x16/0x6e The solution is to delete the assertion. This could potentially occur in uiomove as well, which contains analogous assertions that appear similarly unnecessary, so we remove those as well. Reported-by: Jonathan Vasquez <jvasquez1011@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3792	2015-09-25 12:51:16 -07:00
Brian Behlendorf	a3000f9358	Revert "dmu_objset_userquota_get_ids uses dn_bonus unsafely" This reverts commit `5f8e1e8505`. It was determined that this patch introduced the quota regression described in #3789. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3443 Issue #3789	2015-09-25 12:50:24 -07:00
Brian Behlendorf	5592404784	Fix synchronous behavior in __vdev_disk_physio() Commit `b39c22b` set the READ_SYNC and WRITE_SYNC flags for a bio based on the ZIO_PRIORITY_* flag passed in. This had the unnoticed side-effect of making the vdev_disk_io_start() synchronous for certain I/Os. This in turn resulted in vdev_disk_io_start() being able to re-dispatch zio's which would result in a RCU stalls when a disk was removed from the system. Additionally, this could negatively impact performance and explains the performance regressions reported in both #3829 and #3780. This patch resolves the issue by making the blocking behavior dependent on a 'wait' flag being passed rather than overloading the passed bio flags. Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to non-rotational devices where there is no benefit to queuing to aggregate the I/O. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3652 Issue #3780 Issue #3785 Issue #3817 Issue #3821 Issue #3829 Issue #3832 Issue #3870	2015-09-25 12:47:31 -07:00
Brian Behlendorf	ef5b2e1048	Avoid blocking in arc_reclaim_thread() As described in the comment above arc_reclaim_thread() it's critical that the reclaim thread be careful about blocking. Just like it must never wait on a hash lock, it must never wait on a task which can in turn wait on the CV in arc_get_data_buf(). This will deadlock, see issue #3822 for full backtraces showing the problem. To resolve this issue arc_kmem_reap_now() has been updated to use the asynchronous arc prune function. This means that arc_prune_async() may now be called while there are still outstanding arc_prune_tasks. However, this isn't a problem because arc_prune_async() already keeps a reference count preventing multiple outstanding tasks per registered consumer. Functionally, this behavior is the same as the counterpart illumos function dnlc_reduce_cache(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3808 Issue #3834 Issue #3822	2015-09-25 12:45:47 -07:00
Brian Behlendorf	04870568e6	Disable zpl_nr_cached_objects() callback The zpl_nr_cached_objects() function has been disabled because in the current code it doesn't provide any critical functionality and it may result in a deadlock under certain circumstances. However, because we expect to need these hooks in the future this code has not been entirely removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3719	2015-09-25 12:45:42 -07:00
Brian Behlendorf	d4787d55ad	Allow NFS activity to defer snapshot unmounts Accessing a snapshot via NFS should cause an auto-unmount of that snapshot to be deferred until such as time as the snapshot is idle. This is analogous to the zpl_revalidate logic employed by locally mounted snapshots. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3794	2015-09-25 12:45:38 -07:00
Lukas Wunner	784a7fe5d9	Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE Commit torvalds/linux@4246a0b63b ("block: add a bi_error field to struct bio") dropped the error argument from bio_endio in favor of newly introduced bio->bi_error. This also replaces bio->bi_flags value BIO_UPTODATE. bio_endio was a 3 argument function until Linux 2.6.24, which made it a 2 argument function, and now the prototype has changed yet again to a 1 argument function. Support for pre 2.6.24 kernels was already dropped with `37f9dac592` ("zvol processing should use struct bio") which assumed the 2 argument version in zvol_request(). Remaining code to support the 3 argument version is hereby removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Lukas Wunner <lukas@wunner.de> Issue #3799	2015-09-25 12:44:54 -07:00
Don Brady	56b3986316	Add large block support to zpios(1) benchmark As part of the large block support effort, it makes sense to add support for large blocks to zpios(1). The specifying of a zfs block size for zpios is optional and will default to 128K if the block size is not specified. `zpios ... -S size \| --blocksize size ...` This will use size ZFS blocks for each test, specified as a comma delimited list with an optional unit suffix. The supported range is powers of two from 128K through 16M. A range of block sizes can be tested as follows: `-S 128K,256K,512K,1M` Example run below (non realistic results from a VM and output abbreviated for space) ``` --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K --threaddelay=0 --cleanup --human-readable --verbose --cleanup --blocksize=128K,256K,512K,1M th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw --------------------------------------------------------------------- 4 750 8m 1m 128k 5g 90.06m 5g 93.37m 4 750 8m 1m 256k 5g 79.71m 5g 99.81m 4 750 8m 1m 512k 5g 42.20m 5g 93.14m 4 750 8m 1m 1m 5g 35.51m 5g 89.36m 8 750 8m 1m 128k 5g 85.49m 5g 90.81m 8 750 8m 1m 256k 5g 61.42m 5g 99.24m 8 750 8m 1m 512k 5g 49.09m 5g 108.78m 16 750 8m 1m 128k 5g 86.28m 5g 88.73m 16 750 8m 1m 256k 5g 64.34m 5g 93.47m 16 750 8m 1m 512k 5g 68.84m 5g 124.47m 16 750 8m 1m 1m 5g 53.97m 5g 97.20m --------------------------------------------------------------------- ``` Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3795 Closes #2071	2015-09-22 09:13:20 -07:00
Ned Bass	3af56fd95f	Honor xattr=sa dataset property ZFS incorrectly uses directory-based extended attributes even when xattr=sa is specified as a dataset property or mount option. Support to honor temporary mount options including "xattr" was added in commit `0282c4137e`. There are two issues with the mount option handling: * Libzfs has historically included "xattr" in its list of default mount options. This overrides the dataset property, so the dataset is always configured to use directory-based xattrs even when the xattr dataset property is set to off or sa. Address this by removing "xattr" from the set of default mount options in libzfs. * There was no way to enable system attribute-based extended attributes using temporary mount options. Add the mount options "saxattr" and "dirxattr" which enable the xattr behavior their names suggest. This approach has the advantages of mirroring the valid xattr dataset property values and following existing conventions for mount option names. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3787	2015-09-19 14:04:14 -07:00
Brian Behlendorf	66aad10ce8	Fix NULL as mount(2) syscall data parameter Passing NULL for the mount data should not result in EINVAL. It should be treated as if an empty string were passed and succeed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3771	2015-09-19 14:03:01 -07:00
Richard Yao	f52ebcb3eb	Discard on zvols should not exceed the length of a block `37f9dac592` replaced the end-start calculation with a cached value, but neglected to update it on discard operations. This can cause us to discard data not requested, causing data loss on zvols. Reported-by: Richard Connon <richard.connon@zynstra.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3798	2015-09-19 14:00:14 -07:00
Arne Jansen	4e0f33ffe0	Illumos 6214 - zpools going south 6214 zpools going south Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> References: https://www.illumos.org/issues/6214 http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/ Porting Notes: Reintroduce b_compress to the l2arc_buf_hdr_t. In commit `b9541d6` the compression flags were moved to the generic b_flags in the arc_buf_hdr_t. This is a problem because l2arc_compress_buf() may manipulate the compression flags and this can only be done safely under the hash lock which is not held. See Illumos 6214 for a detailed analysis of the race. HDR_GET_COMPRESS() macro was removed from arc_buf_info(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3757	2015-09-11 11:14:38 -07:00
Brian Behlendorf	9965059ab9	Prefetch start and end of volumes When adding a zvol to the system prefetch zvol_prefetch_bytes from the start and end of the volume. Prefetching these regions of the volume is desirable because they are likely to be accessed immediately by blkid(8), the kernel scanning for a partition table, or another task which probes the devices. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3659	2015-09-09 14:38:29 -07:00
Richard Yao	d4bf6d8429	Disable direct reclaim in taskq worker threads on Linux 3.9+ Illumos does not have direct reclaim and code run inside taskq worker threads is not designed to deal with it. Allowing direct reclaim inside a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through memalloc_noio_save() to indicate to the kernel's reclaim code that we are inside a context where memory allocations cannot be allowed to block on filesystem activity. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#1274 Issue zfsonlinux/zfs#2390 Closes #474	2015-09-09 14:30:47 -07:00
Richard Yao	8198d18ca7	Reintroduce IO accounting on zvols on Linux 3.19+ zfsonlinux/zfs@e20cd6f7a8 caused us to lose IO accounting on zvols. When I originally wrote that last year, the symbols we needed to maintain IO accounting were GPL exported, but torvalds/linux@394ffa503b provided suitable symbols for restoring this functionality 4 months later. We can call them to restore the IO accounting on Linux 3.19 and later as well as any older kernels where that patch is backported. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3741	2015-09-09 09:29:24 -07:00
Brian Behlendorf	3b36f8319d	Add dbgmsg kstat Internally ZFS keeps a small log to facilitate debugging. By default the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file. Writing 0 to this proc file clears the log. $ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable $ echo 0 >/proc/spl/kstat/zfs/dbgmsg $ zpool import tank $ cat /proc/spl/kstat/zfs/dbgmsg 1 0 0x01 -1 0 2492357525542 2525836565501 timestamp message 1441141408 spa=tank async request task=1 1441141408 txg 70 open pool version 5000; software version 5000/5; ... 1441141409 spa=tank async request task=32 1441141409 txg 72 import pool version 5000; software version 5000/5; ... 1441141414 command: lt-zpool import tank Note the zfs_dbgmsg() and dprintf() functions are both now mapped to the same log. As mentioned above the kernel debug log can be accessed though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers log messages are immediately written to stdout after applying the ZFS_DEBUG environment variable. $ ZFS_DEBUG=on ./cmd/ztest/ztest -V Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3728	2015-09-04 16:08:14 -07:00
Brian Behlendorf	0500e835af	Support accessing .zfs/snapshot via NFS This patch is based on the previous work done by @andrey-ve and @yshui. It triggers the automount by using kern_path() to traverse to the known snapshout mount point. Once the snapshot is mounted NFS can access the contents of the snapshot. Allowing NFS clients to access to the .zfs/snapshot directory would normally mean that a root user on a client mounting an export with 'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate snapshots on the server. To prevent configuration mistakes a zfs_admin_snapshot module option was added which disables the mkdir/rmdir/mv functionally. System administators desiring this functionally must explicitly enable it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2797 Closes #1655 Closes #616	2015-09-04 13:23:53 -07:00
Andrey Vesnovaty	aa9b27080b	Fix invalid fileid for snapshot root dentry Prevents NFS client from detection of different fileids of snapshot root dentry before & after snapshot mount. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-09-04 13:23:06 -07:00
Brian Behlendorf	e20cd6f7a8	Merge branch 'zvol' Performance improvements for zvols. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3720	2015-09-04 13:14:21 -07:00
Richard Yao	fa56567630	Support secure discard on zvols Linux 2.6.36 introduced REQ_SECURE to indicate when discards must be processed, such that we cannot do optimizations like block alignment. Consequently, the discard semantics prior to 2.6.36 require us to always process unaligned discards. Previously, we would do this optimization regardless. This patch changes things to correctly restrict this optimization to situations where REQ_SECURE exists, but is not included in the flags. Signed-off-by: Richard Yao <ryao@gentoo.org>	2015-09-04 15:37:24 -04:00
Richard Yao	37f9dac592	zvol processing should use struct bio Internally, zvols are files exposed through the block device API. This is intended to reduce overhead when things require block devices. However, the ZoL zvol code emulates a traditional block device in that it has a top half and a bottom half. This is an unnecessary source of overhead that does not exist on any other OpenZFS platform does this. This patch removes it. Early users of this patch reported double digit performance gains in IOPS on zvols in the range of 50% to 80%. Comments in the code suggest that the current implementation was done to obtain IO merging from Linux's IO elevator. However, the DMU already does write merging while arc_read() should implicitly merge read IOs because only 1 thread is permitted to fetch the buffer into ARC. In addition, commercial ZFSOnLinux distributions report that regular files are more performant than zvols under the current implementation, and the main consumers of zvols are VMs and iSCSI targets, which have their own elevators to merge IOs. Some minor refactoring allows us to register zfs_request() as our ->make_request() handler in place of the generic_make_request() function. This eliminates the layer of code that broke IO requests on zvols into a top half and a bottom half. This has several benefits: 1. No per zvol spinlocks. 2. No redundant IO elevator processing. 3. Interrupts are disabled only when actually necessary. 4. No redispatching of IOs when all taskq threads are busy. 5. Linux's page out routines will properly block. 6. Many autotools checks become obsolete. An unfortunate consequence of eliminating the layer that generic_make_request() is that we no longer calls the instrumentation hooks for block IO accounting. Those hooks are GPL-exported, so we cannot call them ourselves and consequently, we lose the ability to do IO monitoring via iostat. Since zvols are internally files mapped as block devices, this should be okay. Anyone who is willing to accept the performance penalty for the block IO layer's accounting could use the loop device in between the zvol and its consumer. Alternatively, perf and ftrace likely could be used. Also, tools like latencytop will still work. Tools such as latencytop sometimes provide a better view of performance bottlenecks than the traditional block IO accounting tools do. Lastly, if direct reclaim occurs during spacemap loading and swap is on a zvol, this code will deadlock. That deadlock could already occur with sync=always on zvols. Given that swap on zvols is not yet production ready, this is not a blocker. Signed-off-by: Richard Yao <ryao@gentoo.org>	2015-09-04 15:30:24 -04:00
Tim Chase	dca8c34da4	Prevent reclaim in the traverse prefetch thread Reclaim in the traverse prefetch thread, which is run on the system taskq, can overrun the stack. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3733	2015-09-04 08:43:28 -07:00
Brian Behlendorf	0282c4137e	Add temporary mount options Add the required kernel side infrastructure to parse arbitrary mount options. This enables us to support temporary mount options in largely the same way it is handled on other platforms. See the 'Temporary Mount Point Properties' section of zfs(8) for complete details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #985 Closes #3351	2015-09-03 14:14:55 -07:00
Tim Chase	69de34219a	Dbuf hash table should be sized as is the arc hash table Commit `49ddb31506` added the zfs_arc_average_blocksize parameter to allow control over the size of the arc hash table. The dbuf hash table's size should be determined similarly. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3721	2015-09-02 09:33:02 -07:00
Brian Behlendorf	6cde64351e	Add spa_slop_shift module option Allow for easy turning of a pools reserved free space. Previous versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity in reserve. Commits `3d45fdd` and `0c60cc3` increased this to 1/32. Setting spa_slop_shift=6 will restore the previous default setting. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3724	2015-09-02 09:30:18 -07:00
Richard Yao	fb40095f5f	Disable LBA weighting on files and SSDs The LBA weighting makes sense on rotational media where the outer tracks have twice the bandwidth of the inner tracks. However, it is detrimental on nonrotational media such as solid state disks, where the only effect is to ensure that metaslabs enter the best-fit allocation behavior sooner, which is detrimental to performance. It also makes no sense on files where the underlying filesystem can arrange things however it wants. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3712	2015-09-01 15:22:07 -07:00
tuxoko	cafbd2aca3	Check for RW_WRITE_HELD in zfs_inactive Before read locking z_teardown_inactive_lock, we need to check if we have already had write lock on it. Otherwise, we would deadlock on ourself when doing rollback: zfs_ioc_rollback ->zfs_suspend_fs (z_teardown_inactive_lock, RW_WRITER) ->zfs_resume_fs->zfs_rezget->zfs_iput_async->iput-> ... ->zfs_inactive (z_teardown_inactive_lock, RW_READER) Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2869	2015-09-01 10:17:57 -07:00
Brian Behlendorf	324dcd3733	Linux 4.2 compat: misc_deregister() The misc_deregister() function was changed to a void return type. Rather than add compatibility code to detect this change simply ignore the return code on all kernels. It was only used to log an informational error message of no real value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-09-01 09:33:18 -07:00
Brian Behlendorf	4fa4cab972	Linux 4.2 compat: misc_deregister() The misc_deregister() function was changed to a void return type. Rather than add compatibility code to detect this change simply ignore the return code on all kernels. It was only used to log an informational error message of no real value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-09-01 09:20:45 -07:00
Tim Chase	a64e55752f	Create a new thread during recursive taskq dispatch if necessary When dynamic taskq is enabled and all threads for a taskq are occupied, a recursive dispatch can cause a deadlock if calling thread depends on the recursively-dispatched thread for its return condition. This patch attempts to create a new thread for recursive dispatch when none are available. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #472	2015-09-01 08:46:41 -07:00
Brian Behlendorf	801b56090b	Revert "Create a new thread during recursive taskq dispatch if necessary" This reverts commit `076821e` due to a locking issue uncovered in subsequent testing. An ASSERT is hit due to tq->tq_nspawn being updated outside the lock. The patch will need to be reworked. VERIFY3(0 == tq->tq_nspawn) failed (0 == -1) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #472	2015-08-31 17:03:01 -07:00
Tim Chase	076821eaff	Create a new thread during recursive taskq dispatch if necessary When dynamic taskq is enabled and all threads for a taskq are occupied, a recursive dispatch can cause a deadlock if calling thread depends on the recursively-dispatched thread for its return condition. This patch attempts to create a new thread for recursive dispatch when none are available. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #472	2015-08-31 15:52:06 -07:00
Brian Behlendorf	278bee9319	Linux 3.18 compat: Snapshot auto-mounting Re-factor the .zfs/snapshot auto-mouting code to take in to account changes made to the upstream kernels. And to lay the groundwork for enabling access to .zfs snapshots via NFS clients. This patch makes the following core improvements. * All actively auto-mounted snapshots are now tracked in two global trees which are indexed by snapshot name and objset id respectively. This allows for fast lookups of any auto-mounted snapshot regardless without needing access to the parent dataset. * Snapshot entries are added to the tree in zfsctl_snapshot_mount(). However, they are now removed from the tree in the context of the unmount process. This eliminates the need complicated error logic in zfsctl_snapshot_unmount() to handle unmount failures. * References are now taken on the snapshot entries in the tree to ensure they always remain valid while a task is outstanding. * The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right after the auto-mount succeeds. This allows to kernel to unmount idle auto-mounted snapshots if needed removing the need for the zfsctl_unmount_snapshots() function. * Snapshots in active use will not be automatically unmounted. As long as at least one dentry is revalidated every zfs_expire_snapshot/2 seconds the auto-unmount expiration timer will be extended. * Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS to be immediately unmounted when the dentry was revalidated. This was a consequence of ZFS invaliding all snapdir dentries to ensure that negative dentries didn't mask new snapshots. This patch modifies the behavior such that only negative dentries are invalidated. This solves the issue and may result in a performance improvement. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3589 Closes #3344 Closes #3295 Closes #3257 Closes #3243 Closes #3030 Closes #2841	2015-08-31 13:54:39 -07:00
Andrey Vesnovaty	b23975cbe0	zfsctl: No need to sync ctldir inodes There's no metadata to write to disk for ctldir inodes. So we check if a inode belongs to the ctldir in zpl_commit_metadata, and returns immediately if it is. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2797	2015-08-31 13:54:39 -07:00
Richard Yao	c6a3a222d3	Clear QUEUE_FLAG_ADD_RANDOM on zvols zvols should not be an entropy source for the kernel. Disable it to be consistent with the upstream kernel. torvalds/linux@b277da0a8a Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3713	2015-08-30 10:11:57 -07:00
loli10K	3757bff3b1	Fix small typo Add a missing space to the zfs_vdev_sync_write_min_active module parameter description. Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3714	2015-08-30 10:10:16 -07:00
Tim Chase	36b454ab4c	Initialize the taskq entry embedded within struct vdev As part of the stack reduction effort in `50b25b2187`, a zio_t containing a taskq_ent was added to struct vdev_queue which itself is part of struct vdev. The taskq entry should be initialized as is currently done in zio_create() for newly-created bare zio_t object. The rationale is the same as is described in `f467b05a26`. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3709	2015-08-30 10:04:56 -07:00
Tim Chase	d439f63ff5	Allow recovery from corrupted snapshot maps If the ZAP object containing a snapshot map is corrupted due to an unrecoverable checksum error or otherwise, dsl_dataset_name() will normally panic the system due to its VERIFY. This patch attempts to allow a recovery avenue from such situations by manufacturing a descriptive snapshot name and then ignoring the error. Scrubbing a pool with this type of corruption will then show the affected object in the error list rather than panicking. The recovery code is only enabled when the zfs_recover module parameter is set. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3705	2015-08-28 11:56:32 -07:00
Brian Behlendorf	4cb7b9c5d4	Check large block feature flag on volumes Since ZoL allows large blocks to be used by volumes, unlike upstream illumos, the feature flag must be checked prior to volume creation. This is critical because unlike filesystems, volumes will create a object which uses large blocks as part of the create. Therefore, it cannot be safely checked in zfs_check_settable() after the dataset can been created. In addition this patch updates the relevant error messages to use zfs_nicenum() to print the maximum blocksize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3591	2015-08-28 09:25:03 -07:00
Brian Behlendorf	c495fe2c1c	Limit max_hw_sectors_kb to 16M When support for large blocks was added DMU_MAX_ACCESS was increased to allow for blocks of up to 16M to fit in a transaction handle. This had the side effect of increasing the max_hw_sectors_kb for volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M. This is an issue for volumes which by default use an 8K block size because it results in dmu_buf_hold_array_by_dnode() allocating a 64K array for the dbufs. The solution is to restore the maximum size to ~10M. This patch specifically changes it to 16M which is close enough. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3684	2015-08-28 09:16:59 -07:00
Chunwei Chen	5475aada94	Linux 4.1 compat: loop device on ZFS Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into iter_read/iter_write. Notably, the loop device will pass bio_vec to backend filesystem. However, current ZFS code assumes iovec without any check, so it will always crash when using loop device. With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC set. The uio* functions are modified to handle bio_vec case separately. The const uio_iov causes some warning in xuio related stuff, so explicit convert them to non const. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3511 Closes #3640	2015-08-24 10:17:06 -07:00
Brian Behlendorf	efc412b645	Linux 4.2 compat: vfs_rename() The spa_config_write() function relies on the classic method of making sure updates to the /etc/zfs/zpool.cache file are atomic. It writes out a temporary version of the file and then uses vn_rename() to switch it in to place. This way there can never exist a partial version of the file, it's all or nothing. Conceptually this is a good strategy and it makes good sense for platforms where it's easy to do a rename within the kernel. Unfortunately, Linux is not one of those platforms. Even doing basic I/O to a file system from within the kernel is strongly discouraged. In order to support this at all the vn_rename() implementation ends up being complex and fragile. So fragile that recent Linux 4.2 changes have broken it. While it is possible to update vn_rename() to work with the latest kernels a better long term strategy is to stop using vn_rename() entirely. Then all this complex, fragile code can be removed. Achieving this is straight forward because config_write() is the only consumer of vn_rename(). This patch reworks spa_config_write() to update the cache file in place. The file will be truncated, written out, and then synced to disk. If an error is encountered the file will be unlinked leaving the system in a consistent state. This does expose a tiny tiny tiny window where a system could crash at exactly the wrong moment could leave a partially written cache file. However, this is highly unlikely because the cache file is 1) infrequently updated, 2) only a few kilobytes in size, and 3) written with a single vn_rdwr() call. If this were to somehow happen it poses no risk to pool. Simply removing the cache file will allow the pool to be imported cleanly. Going forward this will be even less of an issue as we intend to disable the use of a cache file by default. Bottom line not using vn_rename() allows us to make ZoL more robust against upstream kernel changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3653	2015-08-19 16:04:33 -07:00
Brian Behlendorf	ebc2c9374b	Linux 4.2 compat: vfs_rename() Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels results in an EACCES error. Rather than attempting to add and maintain more ugly compatibility code it's best to just retire this interface. As a first step the SPLAT test is disabled for Linux 4.2 and newer kernels. vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#3653	2015-08-19 16:03:29 -07:00
Brian Behlendorf	ff9b1d0725	Handle zap_lookup() failure in ddt_object_load() Failing to lookup a name in the spa_ddt_stat_object should not result in a panic in ddt_object_load(). The error can be safely returned to the caller for handling resulting in a useful user error message. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3370	2015-08-19 14:32:50 -07:00
tuxoko	6d79eabf9f	Add parenthesis to the ternary operator Without the parenthesis, this particular ASSERT will evaluate to "(RW_READER == (!zap->zap_ismicro && fatreader)) ? RW_READER : lti" Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3685	2015-08-19 11:28:41 -07:00
Brian Behlendorf	7e8bddd019	Update arc_memory_throttle() to check pageout This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:52:12 -07:00
Brian Behlendorf	11f552fa90	Update arc_available_memory() to check freemem While Linux doesn't provide detailed information about the state of the VM it does provide us total free pages. This information should be incorporated in to the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can checked in /proc/spl/kstat/zfs/arcstats and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:50:22 -07:00
Brian Behlendorf	6339c1b9dc	Bound zvol_threads module option The zvol_threads module option should be bounded to a reasonable range. The taskq must have at least 1 thread and shouldn't have more than 1,024 at most. The default value of 32 is a reasonable default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3614	2015-07-29 07:42:11 -07:00
Chunwei Chen	21a96fb635	Fix "BUG: Bad page state" caused by writeback flag Commit d958324 fixed the deadlock between page lock and range lock by unlocking the page lock before acquiring the range lock. However, this created a new issue #3075. The problem is that if we can't set the write back bit before releasing the page lock. Then other processes will be unaware that the page is under active write back. They may therefore truncate the page, invalidate the page, or not honor the sync semantics. To workaround this problem we re-dirty the page before dropping the page lock. While this doesn't prevent the page from being truncated it does ensure it won't be invalidated. Then the range lock and the page lock are reacquired in the correct deadlock-free order. Once both locks are safely held the page state can be rechecked. If all is well and the page is in the expect state the dirty bit can be removed, the write back bit set, and the page removed from the skip count. If not the page will be handled as appropriate. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3075	2015-07-29 07:38:15 -07:00
Brian Behlendorf	9dc5ffbec8	Invert minclsyspri and maxclsyspri On Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. Note this change also reverts some of the priorities changes in prior commit `62aa81a`. The rational is as follows: spl_kmem_cache - High priority may result in blocked memory allocs spl_system_taskq - May perform I/O for file backed VDEVs spl_dynamic_taskq - New taskq threads should be spawned promptly Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue zfsonlinux/zfs#3607	2015-07-28 13:59:03 -07:00
Brian Behlendorf	1229323d5f	Align thread priority with Linux defaults Under Linux filesystem threads responsible for handling I/O are normally created with the maximum priority. Non-I/O filesystem processes run with the default priority. ZFS should adopt the same priority scheme under Linux to maintain good performance and so that it will complete fairly when other Linux filesystems are active. The priorities have been updated to the following: $ ps -eLo rtprio,cls,pid,pri,nice,cmd \| egrep 'z_\|spl_\|zvol\|arc\|dbu\|meta' - TS 10743 19 -20 [spl_kmem_cache] - TS 10744 19 -20 [spl_system_task] - TS 10745 19 -20 [spl_dynamic_tas] - TS 10764 19 0 [dbu_evict] - TS 10765 19 0 [arc_prune] - TS 10766 19 0 [arc_reclaim] - TS 10767 19 0 [arc_user_evicts] - TS 10768 19 0 [l2arc_feed] - TS 10769 39 0 [z_unmount] - TS 10770 39 -20 [zvol] - TS 11011 39 -20 [z_null_iss] - TS 11012 39 -20 [z_null_int] - TS 11013 39 -20 [z_rd_iss] - TS 11014 39 -20 [z_rd_int_0] - TS 11022 38 -19 [z_wr_iss] - TS 11023 39 -20 [z_wr_iss_h] - TS 11024 39 -20 [z_wr_int_0] - TS 11032 39 -20 [z_wr_int_h] - TS 11033 39 -20 [z_fr_iss_0] - TS 11041 39 -20 [z_fr_int] - TS 11042 39 -20 [z_cl_iss] - TS 11043 39 -20 [z_cl_int] - TS 11044 39 -20 [z_ioctl_iss] - TS 11045 39 -20 [z_ioctl_int] - TS 11046 39 -20 [metaslab_group_] - TS 11050 19 0 [z_iput] - TS 11121 38 -19 [z_wr_iss] Note that under Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. This patch depends on https://github.com/zfsonlinux/spl/pull/466. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3607	2015-07-28 13:36:47 -07:00
Brian Behlendorf	c97d30691c	Check for NULL in dmu_free_long_range_impl() A NULL should never be passed as the dnode_t pointer to the function dmu_free_long_range_impl(). Regardless, because we have a reported occurrence of this let's add some error handling to catch this. Better to report a reasonable error to caller than panic the system. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3445	2015-07-28 13:30:53 -07:00
Brian Behlendorf	4699d76d19	Remove skc_ref from alloc/free paths As described in spl_kmem_cache_destroy() the ->skc_ref count was added to address the case of a cache reap or grow racing with a destroy. They are not strictly needed in the alloc/free paths because consumers of the cache are responsible for not using it while it's being destroyed. Removing this code is desirable because there is some evidence that contention on this atomic negative impacts performance on large-scale NUMA systems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #463	2015-07-24 11:11:45 -07:00
Brian Behlendorf	62aa81a577	Add defclsyspri macro Add a new defclsyspri macro which can be used to request the default Linux scheduler priority. Neither the minclsyspri or maxclsyspri map to the default Linux kernel thread priority. This makes it awkward to create taskqs which run with the same priority as the rest of the kernel threads on the system which can lead to performance issues. All SPL callers which previously used minclsyspri or maxclsyspri have been changed to use defclsyspri. The vast majority of callers were part of the test suite which won't have an external impact. The few places where it could impact performance the change was from maxclsyspri to defclsyspri. This makes it more likely the process will be scheduled which may help performance. To facilitate further performance analysis the spl_taskq_thread_priority module option has been added. When disabled (0) all newly created kernel threads will use the default kernel thread priority. When enabled (1) the specified taskq priority will be used. By default this value is enabled (1). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-23 13:25:49 -07:00
Brian Behlendorf	96c080cb9c	Minor style cleanup Address minor differences in style between upstream and ZoL. This patch contains no functional differences and is solely designed to minimize the delta from upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:54 -07:00
Brian Behlendorf	3056818343	Remove double counting HDR_L2ONLY_SIZE Commit `d962d5d` didn't quite properly resolve the HDR_L2ONLY_SIZE accounting. Accounting is now performed only in the constructor and destructor which is a nice simplification. It should have been removed the from create and destroy functions. This brings up back in sync with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:44 -07:00
Brian Behlendorf	8c8af9d807	Add hdr_recl() reclaim callback Originally removed because it wasn't required under Linux. However, there may still be some utility in signaling the arc reclaim thread under Linux via reclaim. This should already have happened by other means but it's not harmless and reduces another point of divergence with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:40 -07:00
Brian Behlendorf	728d6ae91e	Reinstate zfs_arc_p_min_shift Commit `f521ce1` removed the minimum value for "arc_p" allowing it to drop to zero or grow to "arc_c". This was done to improve specific workload which constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). This change may still be desirable but it needs to be re-investigated. in the context of the recent ARC changes from upstream. Therefore this code is being restored to facilitate benchmarking. By setting "zfs_arc_p_min_shift=64" we easily compare the performance. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:32 -07:00
Prakash Surya	36da08ef9b	Illumos 5817 - change type of arcs_size from uint64_t to refcount_t 5817 change type of arcs_size from uint64_t to refcount_t Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5817 https://github.com/illumos/illumos-gate/commit/2fd872a Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:28 -07:00
Prakash Surya	500445c046	Illumos 5445 - Add more visibility via arcstats 5445 Add more visibility via arcstats; specifically arc_state_t stats and differentiate between "data" and "metadata" Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Bayard Bell <bayard.bell@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5445 https://github.com/illumos/illumos-gate/commit/4076b1b Porting Notes: This patch is an improved version of `cc7f677` which was previously merged in ZoL. This patch incorporates the additional improvements which were made upstream. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:06 -07:00
Matthew Ahrens	ca67b33aba	Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5376 https://github.com/illumos/illumos-gate/commit/2ec99e3 Porting Notes: The good news is that many of the recent changes made upstream to the ARC tackled issues previously observed by ZoL with similar solutions. The bad news is those solution weren't identical to the ones we applied. This patch is designed to split the difference and apply as much of the upstream work as possible. * The arc_available_memory() function was removed previous in ZoL but due to the upstream changes it makes sense to add it back. This function has been customized for Linux so that it can be used to determine a low memory. This provides the same basic functionality as the illumos version allowing us to minimize changes through the rest of the code base. The exact mechanism used to detect a low memory state remains unchanged so this change isn't a significant as it might first appear. * This patch includes the long standing fix for arc_shrink() which was originally proposed in #2167. Since there were related changes to this function it made sense to include that work. * The arc_init() function has been re-factored. As before it sets sane default values for the ARC but then calls arc_tuning_update() to apply user specific tuning made via module options. The arc_tuning_update() function is then called periodically by the arc_reclaim_thread() to apply changes to the tunings made during normal operation. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3616 Closes #2167	2015-07-23 09:41:28 -07:00
Brian Behlendorf	9eb361aaa5	Default to --disable-debug-kmem The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging is: Enabled: READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s, Disabled: READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s, Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issues #463	2015-07-21 11:47:10 -07:00
Brian Behlendorf	53b1d9794e	Add logic to try and recover an inode with an invalid mode When an inode is detected with invalid mode bits the safe thing to do is panic the system. This indicates a problem with the contents of a dnode and it should never be possible. This is the default behavior. Unfortunately, due to flaws in the system attribute (SA) implementation (on all platforms) it was possible that ZFS could create a damaged dnode. This was a rare issue which only impacted dnodes which used a spill block. Normally only symlinks and files with ACLs would require a spill block. However, if the dataset had the xattr=sa property set and extended attributes were used this problem could occur. As of the 0.6.4 tag the root cause of this issue has been fixed. For pools which are exhibiting this damage the 'zfs_recover=1' module option may be set. This will cause ZFS to interpret the dnode with invalid mode bits as a normal file. This may allow the files to be accessed for recovery purposes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3548	2015-07-17 15:33:35 -07:00
Turbo Fredriksson	47a4a6fd5f	Support parallel build trees (VPATH builds) Build products from an out of tree build should be written relative to the build directory. Sources should be referred to by their locations in the source directory. This is accomplished by adding the 'src' and 'obj' variables for the module Makefile.am, using relative paths to reference source files, and by setting VPATH when source files are not co-located with the Makefile. This enables the following: $ mkdir build $ cd build $ ../configure \ --with-spl=$HOME/src/git/spl/ \ --with-spl-obj=$HOME/src/git/spl/build $ make -s This change also has the advantage of resolving the following warning which is generated by modern versions of automake. Makefile.am:00: warning: source file 'xxx' is in a subdirectory, Makefile.am:00: but option 'subdir-objects' is disabled Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1082	2015-07-17 13:42:51 -07:00
Turbo Fredriksson	37d7cd94f3	Support parallel build trees (VPATH builds) Build products from an out of tree build should be written relative to the build directory. Sources should be referred to by their locations in the source directory. This is accomplished by adding the 'src' and 'obj' variables for the module Makefile.am, using relative paths to reference source files, and by setting VPATH when source files are not co-located with the Makefile. This enables the following: $ mkdir build $ cd build $ ../configure $ make -s This change also has the advantage of resolving the following warning which is generated by modern versions of automake. Makefile.am:00: warning: source file 'xxx' is in a subdirectory, Makefile.am:00: but option 'subdir-objects' is disabled Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#1082	2015-07-17 12:53:11 -07:00
Brian Behlendorf	2a53e2dacc	Update inode under range lock After a successful write the inode must be updated under the range lock. If it is updated after dropping the lock there exists a race where the znode and inode wile disagree about the file size. This could result in narrow window of time where read(2) is able to access data beyond what fstat(2) reports as the file size. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3601	2015-07-17 09:18:22 -07:00
Brian Behlendorf	bd29109f1a	Linux 4.2 compat: follow_link() / put_link() As of Linux 4.2 the kernel has completely retired the nameidata structure. One of the few remaining consumers of this interface were the follow_link() and put_link() callbacks. This patch adds the required checks to configure to detect the interface change and updates the functions accordingly. Migrating to the simple_follow_link() interface was considered but was decided against ironically due to the increased complexity. It also should be noted that the kernel follow_link() and put_link() interfaces changes several times after 4.1 and but before 4.2. This means there is a narrow range of kernel commits which never appear in an official tag of the Linux kernel which ZoL will not build. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3596	2015-07-17 09:18:16 -07:00
Brian Behlendorf	7eb333fbdd	Linux 4.2 compat: remove bio->bi_cnt access Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to bio->__bi_cnt. Because this value is only used once in a block of debug code it simplest just to remove the PANIC. To my knowledge this debugging has never been hit or proved useful so this is no great loss. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3596	2015-07-17 09:16:08 -07:00
Matthew Ahrens	905edb405d	Illumos 5347 - idle pool may run itself out of space 5347 idle pool may run itself out of space Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/231aab8 https://github.com/illumos/illumos-gate/commit/4a92375 3642 https://www.illumos.org/issues/5347 https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix) https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390 https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643 Porting notes: This is completing the partial fix from FreeBSD Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3586	2015-07-14 10:35:21 -07:00
Alexander Eremin	1cddb8c9ff	Illumos 5610 - zfs clone from different source and target pools produces coredump 5610 zfs clone from different source and target pools produces coredump Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/03b1c29 https://www.illumos.org/issues/5610 https://www.illumos.org/issues/5824 https://github.com/zfsonlinux/zfs/issues/2911 https://github.com/zfsonlinux/zfs/commit/9063f65 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3584	2015-07-14 10:27:46 -07:00
Richard Yao	0de7c552b6	Failure of userland copy should return EFAULT Many key internal functions pass system return codes that are safe to return to userland. In the case of ddi_copyin(9F), an error passes -1 and the documentation states very clearly that drivers should pass EFAULT to userland when this happens. http://illumos.org/man/9F/ddi_copyin This does not happen in the ZFS source code. I believe it should be changed to pass EFAULT. I caught this when writing man pages for the libzfs_core API. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3575	2015-07-14 10:20:35 -07:00
Boris Protopopov	b39c22b73c	Translate sync zio to sync bio Translate zio requests with ZIO_PRIORITY_SYNC_READ and ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns out to have a pronounced effect when writing to an SSD-based SLOG. When WRITE_SYNC is not set (WRITE is set instead), the block trace for a SLOG device looks as follows: ... 130,96 0 3 0.008968390 0 C W 830464 + 136 [0] 130,96 0 4 0.011999161 0 C W 830720 + 136 [0] 130,96 0 5 0.023955549 0 C W 831744 + 136 [0] 130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952 130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6] 130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6] 130,96 0 9 0.024343187 19775 P N [z_wr_iss/6] 130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6] 130,96 0 11 0.026784405 0 UT N [swapper] 1 130,96 0 12 0.026805339 202 U N [kblockd/0] 1 130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0] 130,96 0 14 0.026966948 0 C W 832000 + 136 [0] 130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904 130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19] 130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19] 130,96 3 4 0.000455956 19788 P N [z_wr_iss/19] 130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19] 130,96 3 6 0.002786349 0 UT N [swapper] 1 ... Here the 130,197 is the partition created on the log device when adding it to the pool, whereas the base device is 130,96. As one can see, the writes to the SLOG are not marked synchronous (the S is missing next to W), and the queue unplugs occur based on the timer (UT event) resulting in slightly over 2 msec latency of writes. This results in a sub-par performance of single stream synchronous writes (limited by latency of the SLOG). When the WRITE_SYNC is set, a similar trace looks as follows: ... 130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528 130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)] 130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)] 130,96 4 4 0.000003394 70714 P N [(null)] 130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)] 130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)] 130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784 130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)] 130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)] 130,96 5 4 0.000357635 70713 P N [(null)] 130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)] 130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)] 130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0] 130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040 130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15] 130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15] 130,96 16 88 0.000745817 70718 P N [z_wr_iss/15] 130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15] 130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15] 130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0] 130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0] As one can see, all the writes are synchronous (WS), and I/O completions (e.g. from issue I to completion C) take 160-250 usec, or about 10x faster. Since WRITE_SYNC or READ_SYNC flags are among several factors that are considered when processing bio requests, it seems prudent to mark all the zio requests of synchronous priority with the READ/WRITE_SYNC flags to make them eligible for consideration as such by the Linux block I/O layer. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3529	2015-07-13 14:28:50 -07:00
Brian Behlendorf	2b7b78fa5d	Fix switch-bool warning As of gcc version 5.1.1 a new warning has been added to detect the use of a boolean in a switch statement (-Wswitch-bool). Resolve the warning by explicitly casting the value to an integer type. zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request': error: switch condition has boolean value [-Werror=switch-bool] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-13 13:03:01 -07:00
Justin T. Gibbs	99197f034e	Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled 5661 ZFS: "compression = on" should use lz4 if feature is enabled Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://github.com/illumos/illumos-gate/commit/db1741f https://www.illumos.org/issues/5661 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3571	2015-07-10 12:11:45 -07:00
Josef 'Jeff' Sipek	411bf201f5	Illumos 4745 - fix AVL code misspellings 4745 fix AVL code misspellings Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://github.com/illumos/illumos-gate/commit/6907ca4 https://www.illumos.org/issues/4745 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3565	2015-07-10 11:58:37 -07:00
Tim Chase	1cd777340b	Prevent reclaim in metaslab preload threads Reclaim during metaslab preloading can cause deadlocks involving znode z_lock and ARC buffer header ht_lock. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3532.	2015-07-06 09:36:13 -07:00
Alexander Motin	e16b3fcc61	Illumos 5008 - lock contention (rrw_exit) while running a read only load 5008 lock contention (rrw_exit) while running a read only load Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Yao <ryao@gentoo.org> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> Porting notes: This patch ported perfectly cleanly to ZoL. During testing 100% cached small-block reads, extreme contention was noticed on rrl->rr_lock from rrw_exit() due to the frequent entering and leaving ZPL. Illumos picked up this patch from FreeBSD and it also helps under Linux. On a 1-minute 4K cached read test with 10 fio processes pinned to a single socket on a 4-socket (10 thread per socket) NUMA system, contentions on rrl->rr_lock were reduced from 508799 to 43085. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3555	2015-07-06 09:34:13 -07:00
Matthew Ahrens	4bda3bd0e7	Illumos 5911 - ZFS "hangs" while deleting file 5911 ZFS "hangs" while deleting file Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5911 https://github.com/illumos/illumos-gate/commit/46e1baa Porting notes: Resolved ISO C90 forbids mixed declarations and code wanting in the dnode_free_range() function. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3554	2015-07-06 09:31:42 -07:00
Arne Jansen	5e8cd5d17f	Illumos 5981 - Deadlock in dmu_objset_find_dp 5981 Deadlock in dmu_objset_find_dp Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5981 https://github.com/illumos/illumos-gate/commit/1d3f896 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3553	2015-07-06 09:31:35 -07:00
Andriy Gapon	71e2fe41be	Illumos 5946, 5945 5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots 5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/5946 https://www.illumos.org/issues/5945 https://github.com/illumos/illumos-gate/commit/24218be Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3552	2015-07-06 09:31:30 -07:00
Andriy Gapon	b6640117f0	Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch 5870 dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5870 https://github.com/illumos/illumos-gate/commit/beddaa9 Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3551	2015-07-06 09:22:18 -07:00
Andriy Gapon	fec417097b	Illumos 5909 - ensure that shared snap names don't become too long after promotion 5909 ensure that shared snap names don't become too long after promotion Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5909 https://github.com/illumos/illumos-gate/commit/cb5842f Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3550	2015-07-06 09:21:30 -07:00
Andriy Gapon	cf50a2b08f	Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot 5912 full stream can not be force-received into a dataset if it has a snapshot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5912 https://github.com/illumos/illumos-gate/commit/5bae108 Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3549	2015-07-06 09:20:18 -07:00
Alek Pinchuk	a7b10a9319	Illumos 6033 - arc_adjust() should search MFU lists 6033 arc_adjust() should search MFU lists for oldest buffer when adjusting MFU size Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Xin Li <delphij@delphij.net> Reviewed by: Prakash Surya <me@prakashsurya.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6033 https://github.com/illumos/illumos-gate/commit/31c46cf Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3545	2015-07-01 11:09:15 -07:00
Matthew Ahrens	804e050457	Illumos 5175 - implement dmu_read_uio_dbuf() to improve cached read performance 5175 implement dmu_read_uio_dbuf() to improve cached read performance Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5175 https://github.com/illumos/illumos-gate/commit/f8554bb Porting notes: This patch doesn't include the changes for the COMSTAR (Common Multiprotocol SCSI Target) - since it's not available for ZoL. http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html Ported by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3392	2015-06-29 14:33:23 -07:00
Matthew Ahrens	c52fca13a0	Illumos 5368 - ARC should cache more metadata 5368 ARC should cache more metadata Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5368 https://github.com/illumos/illumos-gate/commit/3a5286a Porting Notes: The vast majority of this patch was already merged in the context of the `06358ea` changes. This is just a small hunk which was missed. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:17 -07:00
George Wilson	669dedb33f	Illumos 5163 - arc should reap range_seg_cache 5163 arc should reap range_seg_cache Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5163 https://github.com/illumos/illumos-gate/commit/83803b5 Porting Notes: Added umem_cache_reap_now() wrapped to suppress unused variable warning for user space build in arc_kmem_reap_now(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:16 -07:00
Brian Behlendorf	aa9af22cdf	Update all default taskq settings Over the years the default values for the taskqs used on Linux have differed slightly from illumos. In the vast majority of cases this was done to avoid creating an obnoxious number of idle threads which would pollute the process listing. With the addition of support for dynamic taskqs all multi-threaded queues should be created as dynamic taskqs. This allows us to get the best of both worlds. * The illumos default values for the I/O pipeline can be restored. These values are known to work well for most workloads. The only exception is the zio write interrupt taskq which is changed to ZTI_P(12, 8). At least under Linux more threads has been shown to improve performance, see commit `7e55f4e`. * Reduces the number of idle threads on the system when it's not under heavy load. The maximum number of threads will only be created when they are required. * Remove the vdev_file_taskq and rely on the system_taskq instead which is now dynamic and may have up to 64-threads. Again this brings us back inline with upstream. * Tasks dispatched with taskq_dispatch_ent() are allowed to use dynamic taskqs. The Linux taskq implementation supports this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3507	2015-06-25 08:58:16 -07:00
Andriy Gapon	ef56b0780c	Account for ashift when gathering buffers to be written to l2arc device If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit `3a17a7a9`. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit `3a17a7a9` and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3400 Closes #3433 Closes #3451	2015-06-25 08:57:16 -07:00
Prakash Surya	d962d5dad9	Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices 5701 zpool list reports incorrect "alloc" value for cache devices Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5701 https://github.com/illumos/illumos-gate/commit/a52fc31 Porting Notes: arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr). Ported by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:51:44 -07:00
Brian Behlendorf	3c82160ff2	Set TASKQ_DYNAMIC for kmem and system taskqs Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs to reduce the number of idle threads on the system. Additional threads will be created on demand up to the previous maximum thread counts. This should have minimal, if any, impact on performance. This makes the system taskq consistent with illumos which is always created as a dynamic taskq with up to 64 threads. The task limits for the kmem_cache have been increased to avoid any unnessisary throttling and to keep a larger reserve of task_t structures on the free list. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #458	2015-06-24 15:14:25 -07:00
Brian Behlendorf	f7a973d99b	Add TASKQ_DYNAMIC feature Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic semantics. Initially only a single worker thread will be created to service tasks dispatched to the queue. As additional threads are needed they will be dynamically spawned up to the max number specified by 'nthreads'. When the threads are no longer needed, because the taskq is empty, they will automatically terminate. Due to the low cost of creating and destroying threads under Linux by default new threads and spawned and terminated aggressively. There are two modules options which can be tuned to adjust this behavior if needed. * spl_taskq_thread_sequential - The number of sequential tasks, without interruption, which needed to be handled by a worker thread before a new worker thread is spawned. Default 4. * spl_taskq_thread_dynamic - Provides the ability to completely disable the use of dynamic taskqs on the system. This is provided for the purposes of debugging and troubleshooting. Default 1 (enabled). This behavior is fundamentally consistent with the dynamic taskq implementation found in both illumos and FreeBSD. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #458	2015-06-24 15:14:18 -07:00
Richard Yao	72540ea314	zfsdev_getminor() should check for invalid file handles Unit testing at ClusterHQ found that passing an invalid file handle to zfs_ioc_hold results in a NULL pointer dereference on a system without assertions: IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs] Call Trace: [<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs] [<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs] [<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs] An assertion would have caught this had they been enabled, but this is something that the kernel module should handle without failing. We resolve this by searching the linked list to ensure that the file handle's private_data points to a valid zfsdev_state_t. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3506	2015-06-22 17:02:13 -07:00
Etienne Dechamps	99b14de421	Make metaslab_aliquot a module parameter. This seems generally useful. metaslab_aliquot is the ZFS allocation granularity, which is roughly equivalent to what is called the stripe size in traditional RAID arrays. It seems relevant to performance tuning. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:38 -07:00
Etienne Dechamps	e8fe6684a5	Document metaslab_aliquot. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:31 -07:00
Etienne Dechamps	bb3250d07e	Allocate disk space fairly in the presence of vdevs of unequal size. The metaslab allocator device selection algorithm contains a bias mechanism whose goal is to achieve roughly equal disk space usage across all top-level vdevs. It seems that the initial rationale for this code was to allow newly added (empty) vdevs to "come up to speed" faster in an attempt to make the pool quickly converge to a steady state where all vdevs are equally utilized. While the code seems to work reasonably well for this use case, there is another scenario in which this algorithm fails miserably: the case where top-level vdevs don't have the same sizes (capacities). ZFS allows this, and it is a good feature to have, so that users who simply want to build a pool with the disks they happen to have lying around can do so even if the disks have heteregenous sizes. Here's a script that simulates a pool with two vdevs, with one 4X larger than the other: dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728 dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912 zpool create testspace /tmp/d1 /tmp/d2 dd if=/dev/zero of=/testspace/foobar bs=1M count=256 zpool iostat -v testspace Before this commit, the script would output the following: capacity pool alloc free ---------- ----- ----- testspace 252M 375M /tmp/d1 104M 18.5M /tmp/d2 148M 356M ---------- ----- ----- This demonstrates that the current code handles this situation very poorly: d1 shows 85% usage despite the pool itself being only 40% full. d1 is quite saturated at this point, and is slowing down the entire pool due to saturation, fragmentation and the like. In contrast, here's the result with the code in this commit: capacity pool alloc free ---------- ----- ----- testspace 252M 375M /tmp/d1 56.7M 66.3M /tmp/d2 195M 309M ---------- ----- ------ This looks much better. d1 is 46% used, which is close to the overall pool utilization (40%). The code still doesn't result in perfectly balanced allocation, probably because of the way mg_bias is applied which does not guarantee perfect accuracy, but this is still much better than before. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3389	2015-06-22 14:18:29 -07:00
Brian Behlendorf	218b4e0a76	Add zfs_sb_prune_aliases() function For kernels which do not implement a per-suberblock shrinker, those older than Linux 3.1, the shrink_dcache_parent() function was used to attempt to reclaim dentries. This was found not be entirely reliable and could lead to performance issues on older kernels running meta-data heavy workloads. To address this issue a zfs_sb_prune_aliases() function has been added to implement this functionality. It relies on traversing the list of znodes for a filesystem and adding them to a private list with a reference held. The private list can then be safely walked outside the z_znodes_lock to prune dentires and drop the last reference so the inode can be freed. This provides the same synchronous behavior as the per-filesystem shrinker and has the advantage of depending on only long standing interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3501	2015-06-22 10:22:49 -07:00
Brian Behlendorf	4c6a700910	Increase the number of iput taskq threads The number of threads in the iput taskq has been increased to speed up the number of iputs which can be handled. This has been observed to improve the meta data reclaim regardless of zfs_sb_prune() implementation in use. The taskq has also been renamed z_iput to for consistency with the rest of the I/O pipeline taskqs which are all named z_*. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>	2015-06-22 10:22:10 -07:00
Matus Kral	57ae840077	Linux 4.1 compat: use read_iter() / write_iter() Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods. The ->read_iter() and ->write_iter() methods were designed to replace the ->aio_read() and ->aio_write() interfaces. Both interfaces were preserved for several kernel releases in order to migrate all existing consumers to the new interfaces. But as of Linux 4.1 the legacy interface has been retired and the ZFS code must be updated to use the new interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3352	2015-06-18 12:06:59 -07:00
Tim Chase	90947b2357	3.12 compat, NUMA-aware per-superblock shrinker Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in ZoL by zfs_sb_prune(). This patch calls the shrinker for each on-line NUMA node in order that memory be freed for each one. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3495	2015-06-17 10:43:13 -07:00
Brian Behlendorf	8e70975f90	Wait interruptibly in prefetch thread The Linux kernel watchdog will automatically dump a backtrace for any process while sleeps for over 120s in an uninterruptible state. The solution is for the prefetch thread to sleep in an interruptible state. The way the existing code was written this is safe because when woken it will always reevaluate its conditional. As a general rule it is preferable to sleep in an interruptible when possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3450 Closes #3402	2015-06-16 16:18:11 -07:00
Brian Behlendorf	b64ccd6c52	Rename cv_wait_interruptible() to cv_wait_sig() This is the counterpart to zfsonlinux/spl@2345368 which replaces the cv_wait_interruptible() function with cv_wait_sig(). There is no functional change to patch merely brings the function names in to sync to maximize portability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3450 Issue #3402	2015-06-11 10:50:47 -07:00
Tim Chase	121b3cae74	Increase arc_c_min to allow safe operation of arc_adapt() ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny systems such as the raspberry pi, however, as of addition of large block support, the arc_adapt() function depends on arc_c being >= 32MiB (2 * SPA_MAXBLOCKSIZE). This patch raises the minimum ARC size to 32MiB and adds a VERIFY test to arc_adapt() for future-proofing. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Brian Behlendorf	f604673836	Make arc_prune() asynchronous As described in the comment above arc_adapt_thread() it is critical that the arc_adapt_thread() function never sleep while holding a hash lock. This behavior was possible in the Linux implementation because the arc_prune() logic was implemented to be synchronous. Under illumos the analogous dnlc_reduce_cache() function is asynchronous. To address this the arc_do_user_prune() function is has been reworked in to two new functions as follows: * arc_prune_async() is an asynchronous implementation which dispatches the prune callback to be run by the system taskq. This makes it suitable to use in the context of the arc_adapt_thread(). * arc_prune() is a synchronous implementation which depends on the arc_prune_async() implementation but blocks until the outstanding callbacks complete. This is used in arc_kmem_reap_now() where it is safe, and expected, that memory will be freed. This patch additionally adds the zfs_arc_meta_strategy module option while allows the meta reclaim strategy to be configured. It defaults to a balanced strategy which has been proved to work well under Linux but the illumos meta-only strategy can be enabled. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Brian Behlendorf	c5528b9ba6	Use taskq_wait_outstanding() function Replace taskq_wait() with taskq_wait_oustanding(). This way callers will only block until previously submitted tasks have been completed. This was the previous behavior of task_wait() prior to the introduction of taskq_wait_outstanding() so this isn't really a functionalty change for these callers. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Prakash Surya	ca0bf58d65	Illumos 5497 - lock contention on arcs_mtx Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Porting notes and other significant code changes: The illumos 5368 patch (ARC should cache more metadata), which was never picked up by ZoL, is mostly reverted by this patch. Since ZoL relies on the kernel asynchronously calling the shrinker to actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every time it runs. The arc_adapt_thread() function no longer calls arc_do_user_evicts() since the newly-added arc_user_evicts_thread() calls it periodically. Notable conflicting ZoL commits which conflicted with this patch or whose effects are either duplicated or un-done by this patch: `302f753` - Integrate ARC more tightly with Linux `39e055c` - Adjust arc_p based on "bytes" in arc_shrink `f521ce1` - Allow "arc_p" to drop to zero or grow to "arc_c" `77765b5` - Remove "arc_meta_used" from arc_adjust calculation `94520ca` - Prune metadata from ghost lists in arc_adjust_meta Trace support for multilist_insert() and multilist_remove() has been added and produces the following output: fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 } fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 } The following arcstats have been removed: recycle_miss - Used by arcstat.py and arc_summary.py, both of which have been updated appropriately. l2_writes_hdr_miss The following arcstats have been added: evict_not_enough - Number of times arc_evict_state() was unable to evict enough buffers to reach its target amount. evict_l2_skip - Number of times arc_evict_hdr() skipped eviction because it was being written to the l2arc. l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times l2arc_write_done() failed to acquire hash_lock (and re-tries). arc_meta_min - Shows the value of the zfs_arc_meta_min module parameter (see below). The "index" column of the "dbuf" kstat has been removed since it doesn't have a direct analog in the new multilist scheme. Additional multilist- related stats could be added in the future but would likely require extensions to the mulilist API. The following module parameters have been added: zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list before moving on to the next sub-list. zfs_arc_meta_min - Enforce a floor on the amount of metadata in the ARC. zfs_arc_num_sublists_per_state - Number of multilist sub-lists per ARC state. zfs_arc_overflow_shift - Controls amount by which the ARC must exceed the target size to be considered "overflowing". Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov	2015-06-11 10:27:25 -07:00
Chris Williamson	b9541d6b7d	Illumos 5408 - managing ZFS cache devices requires lots of RAM 5408 managing ZFS cache devices requires lots of RAM Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> Porting notes: Due to the restructuring of the ARC-related structures, this patch conflicts with at least the following existing ZoL commits: `6e1d7276c9` Fix inaccurate arcstat_l2_hdr_size calculations The ARC_SPACE_HDRS constant no longer exists and has been somewhat equivalently replaced by HDR_L2ONLY_SIZE. `e0b0ca983d` Add visibility in to cached dbufs The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr struct requires additional structure member names to be used when referencing the inner items. Also, the presence of L1 or L2 inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR macros. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
George Wilson	2a4324141f	Illumos 5369 - arc flags should be an enum 5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Porting notes: ZoL has moved some ARC definitions into arc_impl.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: Tim Chase <tim@chase2k.com>	2015-06-11 10:27:25 -07:00
Tim Chase	ad4af89561	Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches" This reverts only the l2arc_hdr part of commit `ecf3d9b8e6` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which does the same thing but uses the newer two-level ARC structure following the Illumos 5408 "managing ZFS cache devices requires lots of RAM" patch. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	97639d0a52	Revert "Allow arc_evict_ghost() to only evict meta data" Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates the need for this. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	f6b3b1f5d6	Revert "fix l2arc compression buffers leak" This reverts commit `037763e44e` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which includes a fix for this very problem. ZoL had picked up a subset of the illumos 5497 patch to deal with the l2arc compression buffer leak. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Tim Chase	7807028ccd	Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc" This reverts commit `16fcdea363` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates "marker" within the ARC code. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Brian Behlendorf	44de2f02d6	Remove unused variable in vdev_add_child() Commit `c3520e7` restructured vdev_add_child() in such a way that the spa variable was unused during non-debug builds. This is consistent with the upstream illumos code but because ZoL, unlike illumos, is built with all compiler warnings enabled this causes a legitimate warning. Revert this hunk of the patch to keep the build clean. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3432	2015-06-11 10:22:38 -07:00
Brian Behlendorf	2345368646	Rename cv_wait_interruptible() to cv_wait_sig() Commit `f752b46e` added the cv_wait_interruptible() function to allow condition variables to be woken by signals. This function and its timed wait counterpart should have been named cv_wait_sig() to match the illumos interface which provides the same functionality. This patch renames the symbol but leaves a #define compatibility wrapper in place until the ZFS code can be moved to the correct name. This patch also makes a small number of cosmetic changes to make the condvar source and header cstyle clean. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #456	2015-06-10 16:36:12 -07:00
Matthew Ahrens	c3520e7f1f	Illumos 5818 - zfs {ref}compressratio is incorrect with 4k sector size 5818 zfs {ref}compressratio is incorrect with 4k sector size Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Albert Lee <trisk@omniti.com> References: https://www.illumos.org/issues/5818 https://github.com/illumos/illumos-gate/commit/81cd5c5 Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3432	2015-06-10 16:24:01 -07:00
Arne Jansen	9c43027b3f	Illumos 5269 - zpool import slow 5269 zpool import slow Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5269 https://github.com/illumos/illumos-gate/commit/12380e1e Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3396	2015-06-09 13:48:02 -07:00
Chris Dunlop	a876b0305e	Make taskq_wait() block until the queue is empty Under Illumos taskq_wait() returns when there are no more tasks in the queue. This behavior differs from ZoL and FreeBSD where taskq_wait() returns when all the tasks in the queue at the beginning of the taskq_wait() call are complete. New tasks added whilst taskq_wait() is running will be ignored. This difference in semantics makes it possible that new subtle issues could be introduced when porting changes from Illumos. To avoid that possibility the taskq_wait() function is being updated such that it blocks until the queue in empty. The previous behavior remains available through the taskq_wait_outstanding() interface. Note that this function was previously called taskq_wait_all() but has been renamed to avoid confusion. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #455	2015-06-09 12:20:12 -07:00
Ned Bass	5f8e1e8505	dmu_objset_userquota_get_ids uses dn_bonus unsafely The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus outside of dn_struct_rwlock. If the dnode is being freed then the bonus dbuf may be in the process of getting evicted. In this case there is a race that may cause dmu_objset_userquota_get_ids() to access the dbuf after it has been destroyed. To prevent this, ensure that when we are using the bonus dbuf we are either holding a reference on it or have taken dn_struct_rwlock. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3443	2015-06-05 12:41:17 -07:00
Ned Bass	d617648c7f	dbuf_try_add_ref minor bug fixes - Don't check db->bb_blkid, but use the blkid argument instead. Checking db->db_blkid may be unsafe since we doesn't yet have a hold on the dbuf so its validity is unknown. - Call mutex_exit() on found_db, not db, since it's not certain that they point to the same dbuf, and the mutex was taken on found_db. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3443	2015-06-05 12:40:38 -07:00
Matthew Ahrens	e5fd1dd682	Illumos 5243 - zdb -b could be much faster 5243 zdb -b could be much faster Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5243 https://github.com/illumos/illumos-gate/commit/f7950bf Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3414	2015-05-15 11:14:54 -07:00
Jan Sanislo	79065ed5a4	Return -ESTALE to force lookup for missing NFS file handles There seems to be a annoying problem using NFSv4 to access ZFS file systems under certain circumstances. It's easily reproduced: nfs_client1: mount server:/export /mnt nfs_client1: cd /mnt nfs_client1: echo foo >junk nfs_client1: cat junk foo Now on a different NFSv4 client: nfs_client2: mount server:/export /mnt nfs_client2: cd /mnt nfs_client2: vi junk # Make some changes to /mnt/junk and save # This change the inode associated with /mnt/junk Now back to the original client: nfs_client1: cat junk cat: junk: No such file or directory Admittedly NFSv4 is not advertised as a cluster file system that maintains a completely coherent view of data across multiple nodes. But it does have some mechanisms built in that try to deal with situations like the above. Namely, it employs specialized file handle lookup routines that return ESTALE when a file handle contains a non-existant inode value. The ESTALE return triggers a return full file path lookup from the client to determine if the file has actually gone away or if the cached file handle is no longer valid. ZFS behavior can be brought into line with other file systems (e.g., ext4) by applying the following patch: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3224	2015-05-14 11:16:52 -07:00
Antonio Russo	7290cd3c4e	Relax restriction on zfs_ioc_next_obj() iteration Per the documentation for dnode_next_offset in dnode.c, the "txg" parameter specifies a lower bound on which transaction the dnode can be found in. We are interested in all dnodes that are removed between the first and last transaction in the snapshot. It doesn't need to be created in that snapshot to correspond to a removed file. In fact, the behavior of zfs diff in the test case exactly matches this: the transaction that created the data that was deleted in snapshot "2" was produced before, in snapshot "1", definitely predating the first transaction in snapshot "2". Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com> Closes #2081	2015-05-14 11:16:08 -07:00
Brian Behlendorf	fd0fd6467b	Remove unused 'dsl_pool_t dp' variable When ASSERTs are compiled out by using the --disable-debug configure option. Then the local variable 'dsl_pool_t dp' will be unused and generate a compiler warning. Since this variable is only used once in the ASSERT replace it with 'ds->ds_dir->dd_pool'. This has the additional advantage of potentially saving a few bytes on the stack depending on how gcc decides to compile the function. This issue was not noticed immediately because the automated builders use --enable-debug to make the testing as rigorous as possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Closes #3410	2015-05-14 11:09:47 -07:00
Max Grossman	5dc8b7365f	Illumos 5765 - add support for estimating send stream size with lzc_send_space when source is a bookmark 5765 add support for estimating send stream size with lzc_send_space when source is a bookmark Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@nexenta.com> References: https://www.illumos.org/issues/5765 https://github.com/illumos/illumos-gate/commit/643da460 Porting notes: * Unused variable 'recordsize' in dmu_send_estimate() dropped Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3397	2015-05-13 09:03:59 -07:00
Justin T. Gibbs	19b3b1d2a2	Illumos 5393 - spurious failures from dsl_dataset_hold_obj() 5393 spurious failures from dsl_dataset_hold_obj() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5393 https://github.com/illumos/illumos-gate/commit/e1f3c20 Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3403	2015-05-13 08:56:04 -07:00
Justin T. Gibbs	63b33e878a	Illumos 5562 - ZFS sa_handle's violate kmem invariants, debug kernels panic on boot 5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Robert Mustacchi <rm@fingolfin.org> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5562 https://github.com/illumos/illumos-gate/commit/0fda3cc5 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3388	2015-05-11 15:10:57 -07:00
Matthew Ahrens	252e1a54ab	Illumos 5810 - zdb should print details of bpobj 5810 zdb should print details of bpobj Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5810 https://github.com/illumos/illumos-gate/commit/732885fc Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3387	2015-05-11 15:10:24 -07:00
Matthew Ahrens	10400bfeac	Illumos 5351, 5352 - scrub pauses 5351 scrub goes for an extra second each txg 5352 scrub should pause when there is some dirty data Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5351 https://github.com/illumos/illumos-gate/commit/6f6a76a Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3383	2015-05-11 15:09:56 -07:00
Matthew Ahrens	08dc1b2ddd	Illumos 5350 - clean up code in dnode_sync() Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5350 https://github.com/illumos/illumos-gate/commit/e651831 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3382	2015-05-11 15:09:51 -07:00
Alex Reece	7224c67fea	Illumos 5422 - preserve AVL invariants in dn_dbufs Author: Alex Reece <alex@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Albert Lee <trisk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5422 https://github.com/illumos/illumos-gate/commit/a846f19 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3381	2015-05-11 15:09:29 -07:00
David Lamparter	214806c7e9	Safely handle security / ACL failures The security and ACL operations should all be performed atomically. To accomplish this there would need to significant invasive changes made to the common code base. For the moment it's desirable for compatibility reasons to avoid this. Therefore the code has been updated to attempt to unwind the operation in case of failure rather than panic. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2445	2015-05-11 14:35:14 -07:00
Matthew Ahrens	f1512ee61e	Illumos 5027 - zfs large block support 5027 zfs large block support Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #354	2015-05-11 12:23:16 -07:00
Brian Behlendorf	f9cab37291	Remove metaslab_min_alloc_size module option The metaslab_min_alloc_size option is no longer used in the code. This functionality was removed by commit `f3a7f66` and the module options should have been dropped at that time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-11 09:51:57 -07:00
Chris Dunlop	16fcdea363	arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc With debugging enabled and depending on your kernel config, the size of arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost() to greater than 1024 bytes. Let's avoid this. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3377	2015-05-08 14:14:35 -07:00
Matthew Ahrens	63e3a8616b	Illumos 5349 - verify that block pointer is plausible before reading 5349 verify that block pointer is plausible before reading Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Xin Li <delphij@FreeBSD.org> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5349 https://github.com/illumos/illumos-gate/commit/f63ab3d5 Porting notes: * Several variable declarations were moved due to C style needs Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3373	2015-05-08 14:09:15 -07:00
Chris Dunlop	f0da4d1508	Wait for all znodes to be released before tearing down the superblock By the time we're tearing down our superblock the VFS has started releasing all our inodes/znodes. Some of this work may have been handed off to our iput taskq so we need to wait for that work to complete. However the iput from the taskq can itself result in additional work being added to the taskq: dsl_pool_iput_taskq iput iput_final evict destroy_inode zpl_inode_destroy zfs_inode_destroy zfs_iput_async(ZTOI(zp->z_xattr_parent)) taskq_dispatch(dsl_pool_iput_taskq..., iput, ...) Let's wait until all our znodes have been released. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3281	2015-05-06 14:13:19 -07:00
Matthew Ahrens	7a3066ffdd	Illumos 5348 - zio_checksum_error() only fills in info if ECKSUM 5348 zio_checksum_error() only fills in info if ECKSUM Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5348 https://github.com/illumos/illumos-gate/commit/373dc1cf Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3372	2015-05-05 11:05:37 -07:00
Matthew Ahrens	f3c517d814	Illumos 5820 - verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5820 https://github.com/illumos/illumos-gate/commit/34e8acef00 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3364	2015-05-04 10:49:49 -07:00
Matthew Ahrens	36c6ffb6b6	Illumos 5808 - spa_check_logs is not necessary on readonly pools 5808 spa_check_logs is not necessary on readonly pools Reviewed by: George Wilson <george@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5808 https://github.com/illumos/illumos-gate/commit/23367a2f Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3369	2015-05-04 10:45:42 -07:00
Will Andrews	50f9ea0149	Illumos 5814 - bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. 5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5814 https://github.com/illumos/illumos-gate/commit/b67dde11 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3368	2015-05-04 10:28:18 -07:00
Christopher Siden	0c60cc326b	Illumos 4951 - ZFS administrative commands (fix) 4951 ZFS administrative commands should use reserved space, not fail with ENOSPC Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/4951 https://github.com/illumos/illumos-gate/commit/c39f2c8 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:10 -07:00
Matthew Ahrens	3d45fdd6c0	Illumos 4951 - ZFS administrative commands should use reserved space 4951 ZFS administrative commands should use reserved space, not with ENOSPC Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4373 https://github.com/illumos/illumos-gate/commit/7d46dc6 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:10 -07:00
Jerry Jelinek	a0c9a17aef	Illumos 4901 - zfs filesystem/snapshot limit leaks 4901 zfs filesystem/snapshot limit leaks Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4901 https://github.com/illumos/illumos-gate/commit/adf3407 Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:09 -07:00
Matthew Ahrens	83017311e4	Illumos 3654,3656 3654 zdb should print number of ganged blocks 3656 remove unused function zap_cursor_move_to_key() Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/3654 https://www.illumos.org/issues/3656 https://github.com/illumos/illumos-gate/commit/d5ee8a1 Porting Notes: 3655 and 3657 were part of this commit but those hunks were dropped since they apply to mdb. Ported by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-04 09:41:09 -07:00
tuxoko	6102d0376e	Add cond_resched to zfs_zget to prevent infinite loop It's been reported that threads would loop infinitely inside zfs_zget. The speculated cause for this is that if an inode is marked for evict, zfs_zget would see that and loop. However, if the looping thread doesn't yield, the inode may not have a chance to finish evict, thus causing a infinite loop. This patch solve this issue by add cond_resched to zfs_zget, making the looping thread to yield when needed. Tested-by: jlavoy <jalavoy@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3349	2015-05-04 09:12:41 -07:00
Jason Zaman	c9520ecc0f	dmu: fix integer overflows The params to the functions are uint64_t, but the offsets to memcpy / bcopy are calculated using 32bit ints. This patch changes them to also be uint64_t so there isnt an overflow. PaX's Size Overflow caught this when formatting a zvol. Gentoo bug: #546490 PAX: offset: 1ffffb000 db->db_offset: 1ffffa000 db->db_size: 2000 size: 5000 PAX: size overflow detected in function dmu_read /var/tmp/portage/sys-fs/zfs-kmod-0.6.3-r1/work/zfs-zfs-0.6.3/module/zfs/../../module/zfs/dmu.c:781 cicus.366_146 max, count: 15 CPU: 1 PID: 2236 Comm: zvol/10 Tainted: P O 3.17.7-hardened-r1 #1 Call Trace: [<ffffffffa0382ee8>] ? dsl_dataset_get_holds+0x9d58/0x343ce [zfs] [<ffffffff81a59c88>] dump_stack+0x4e/0x7a [<ffffffffa0393c2a>] ? dsl_dataset_get_holds+0x1aa9a/0x343ce [zfs] [<ffffffff81206696>] report_size_overflow+0x36/0x40 [<ffffffffa02dba2b>] dmu_read+0x52b/0x920 [zfs] [<ffffffffa0373ad1>] zrl_is_locked+0x7d1/0x1ce0 [zfs] [<ffffffffa0364cd2>] zil_clean+0x9d2/0xc00 [zfs] [<ffffffffa0364f21>] zil_commit+0x21/0x30 [zfs] [<ffffffffa0373fe1>] zrl_is_locked+0xce1/0x1ce0 [zfs] [<ffffffff81a5e2c7>] ? __schedule+0x547/0xbc0 [<ffffffffa01582e6>] taskq_cancel_id+0x2a6/0x5b0 [spl] [<ffffffff81103eb0>] ? wake_up_state+0x20/0x20 [<ffffffffa0158150>] ? taskq_cancel_id+0x110/0x5b0 [spl] [<ffffffff810f7ff4>] kthread+0xc4/0xe0 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170 [<ffffffff81a62fa4>] ret_from_fork+0x74/0xa0 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170 Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3333	2015-05-04 09:12:00 -07:00
George Wilson	98b254188a	Illumos #5244 - zio pipeline callers should explicitly invoke next stage 5244 zio pipeline callers should explicitly invoke next stage Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5244 https://github.com/illumos/illumos-gate/commit/738f37b Porting Notes: 1. The unported "2932 support crash dumps to raidz, etc. pools" caused a merge conflict due to a copyright difference in module/zfs/vdev_raidz.c. 2. The unported "4128 disks in zpools never go away when pulled" and additional Linux-specific changes caused merge conflicts in module/zfs/vdev_disk.c. Ported-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2828	2015-04-30 15:07:47 -07:00
Matthew Ahrens	8dd86a10cf	Illumos 5812 - assertion failed in zrl_tryenter(): zr_owner==NULL 5812 assertion failed in zrl_tryenter(): zr_owner==NULL Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5812 https://github.com/illumos/illumos-gate/commit/8df1730 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3357	2015-04-30 14:43:40 -07:00
Justin T. Gibbs	6186e29753	Illumos 5592 - NULL pointer dereference in dsl_prop_notify_all_cb() 5592 NULL pointer dereference in dsl_prop_notify_all_cb() Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5592 https://github.com/illumos/illumos-gate/commit/9d47dec Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:58 -07:00
Justin T. Gibbs	6ebebaceb1	Illumos 5531 - NULL pointer dereference in dsl_prop_get_ds() 5531 NULL pointer dereference in dsl_prop_get_ds() Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5531 https://github.com/illumos/illumos-gate/commit/e57a022 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:44 -07:00
Justin T. Gibbs	0c66c32d1d	Illumos 5056 - ZFS deadlock on db_mtx and dn_holds 5056 ZFS deadlock on db_mtx and dn_holds Author: Justin Gibbs <justing@spectralogic.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5056 https://github.com/illumos/illumos-gate/commit/bc9014e Porting Notes: sa_handle_get_from_db(): - the original patch includes an otherwise unmentioned fix for a possible usage of an uninitialised variable dmu_objset_open_impl(): - Under Illumos list_link_init() is the same as filling a list_node_t with NULLs, so they don't notice if they miss doing list_link_init() on a zero'd containing structure (e.g. allocated with kmem_zalloc as here). Under Linux, not so much: an uninitialised list_node_t goes "Boom!" some time later when it's used or destroyed. dmu_objset_evict_dbufs(): - reduce stack usage using kmem_alloc() Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:34 -07:00
Justin T. Gibbs	d683ddbb72	Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFS 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5314 https://github.com/illumos/illumos-gate/commit/c137962 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:20 -07:00
Justin T. Gibbs	945dd93525	Illumos 5310 - Remove always true tests for non-NULL ds->ds_phys 5310 Remove always true tests for non-NULL ds->ds_phys Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5310 https://github.com/illumos/illumos-gate/commit/d808a4f Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:25:08 -07:00
Alex Reece	9925c28cde	Illumos 5095 - panic when adding a duplicate dbuf to dn_dbufs 5095 panic when adding a duplicate dbuf to dn_dbufs Author: Alex Reece <alex@delphix.com> Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Josef Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5095 https://github.com/illumos/illumos-gate/commit/86bb58a Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:49 -07:00
Justin T. Gibbs	5aea3644d6	Illumos 5038 - Remove "old-style" flexible array usage in ZFS. 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5038 https://github.com/illumos/illumos-gate/commit/7f18da4 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:24 -07:00
Alex Reece	8951cb8dfb	Illumos 4873 - zvol unmap calls can take a very long time for larger datasets 4873 zvol unmap calls can take a very long time for larger datasets Author: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4873 https://github.com/illumos/illumos-gate/commit/0f6d88a Porting Notes: dbuf_free_range(): - reduce stack usage using kmem_alloc() - the sorted AVL tree will handle the spill block case correctly without all the special handling in the for() loop Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:24:03 -07:00
Jorgen Lundman	58c4aa00c6	Illumos 4975 - missing mutex_destroy() calls in zfs 4975 missing mutex_destroy() calls in zfs Author: Jorgen Lundman <lundman@lundman.net> Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Reviewed by: Seth Nimbosa <darth.Serious@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4975 https://github.com/illumos/illumos-gate/commit/d2b3cbb Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:23:38 -07:00
Alex Reece	ca227e54a8	Illumos 3897 - zfs filesystem and snapshot limits (fix leak) 3897 zfs filesystem and snapshot limits (fix leak) Author: Alex Reece <alex.reece@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/fb7001f Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:23:14 -07:00
Jerry Jelinek	788eb90c4c	Illumos 3897 - zfs filesystem and snapshot limits 3897 zfs filesystem and snapshot limits Author: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/a2afb61 Porting Notes: dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc(). Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-28 16:22:51 -07:00
tuxoko	ecfb0b5f42	Fix misuse of input argument in traverse_visitbp In traverse_visitbp(), the input argument dnp is modified in the middle to point to a temporary buffer. Originally this doesn't matter, because no user of TRAVERSE_POST dereferences it. However, in `fbeddd6` a piece of code is added dereferencing dnp after the modification, creating a possible bug. We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case, so we don't modify the input argument. Also we introduce different local variables in the DMU_OT_OBJSET case to prevent confusion between the input argument. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2060	2015-04-28 09:43:50 -07:00
Isaac Huang	0336f3d001	Remove useless variable spa_active_count This isn't required for the Linux port because the kernel tracks if a module is busy. The prototype for spa_busy() is also removed since its definition was already removed. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3262	2015-04-27 09:22:05 -07:00
Justin T. Gibbs	ec8501ee12	5313 Allow I/Os to be aggregated across ZIO priority classes Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@SpectraLogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5313 https://github.com/illumos/illumos-gate/commit/fe319232 Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3280	2015-04-24 15:16:56 -07:00
Ned Bass	4eb30c6864	Serialize access to spa->spa_feat_stats nvlist The function spa_add_feature_stats() manipulates the shared nvlist spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to protect the list. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3335	2015-04-24 15:04:43 -07:00
Chunwei Chen	07012da668	Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb) The following panic would occur under certain heavy load: [ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held [ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P OE 3.18.10 #7 The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which would purge all tsd data for the thread. However, ZFS_ENTER is designed to be reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data. Instead, we rely on the new behavior of tsd_set. When NULL is passed as the new value to tsd_set, it will automatically remove the tsd entry specified the the key for the current thread. rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to do clean up. Now we explicitly call tsd_set(key, NULL) on them. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3247	2015-04-24 14:57:54 -07:00
Brian Behlendorf	62e2eb2329	Fix cstyle issues in spl-tsd.c This patch only addresses the issues identified by the style checker in spl-tsd.c. It contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-24 14:23:07 -07:00
Chunwei Chen	3d39d0afab	Make tsd_set(key, NULL) remove the tsd entry for current thread To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd entry for the current thread. This is alright since tsd_get() returns NULL when the entry doesn't exist. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #443	2015-04-24 14:15:22 -07:00
Richard Yao	d3c677bcd3	Implement areleasef() Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #449	2015-04-24 13:02:37 -07:00
Richard Yao	313b1ea622	vn_getf/vn_releasef should not accept negative file descriptors C type coercion rules require that negative numbers be converted into positive numbers via wraparound such that a negative -1 becomes a positive 1. This causes vn_getf to return a file handle when it should return NULL whenever a positive file descriptor existed with the same value. We should check for a negative file descriptor and return NULL instead. This was caught by ClusterHQ's unit testing. Reference: http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #450	2015-04-24 13:02:00 -07:00
Brian Behlendorf	a438ff0e85	Extend PF_FSTRANS critical regions Additional testing has shown that the region covered by PF_FSTRANS needs to be extended to cover the zpl_xattr_security_init() and init_acl() functions. The zpl_mark_dirty() function can also recurse and therefore must always be protected. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3331	2015-04-24 09:54:22 -07:00
Brian Behlendorf	7fad6290eb	Mark additional functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all NFS, xattr, ctldir, and super function calls. This is related to `40d06e3`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3225	2015-04-17 09:35:24 -07:00
Tim Chase	5074bfe8ad	Allocate zfs_znode_cache on the Linux slab The Linux slab, in general, performs better than the SPl slab in cases where a lot of objects are allocated and fragmentation is likely present. This patch fixes pathologically bad behavior in cases where the ARC is filled with mostly metadata and a user program needs to allocate and dirty enough memory which would require an insignificant amount of the ARC to be reclaimed. If zfs_znode_cache is on the SPL slab, the system may spin for a very long time trying to reclaim sufficient memory. If it is on the Linux slab, the behavior has been observed to be much more predictible; the memory is reclaimed more efficiently. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3283	2015-04-14 12:19:22 -07:00
Brian Behlendorf	f42d7f4111	Use vmem_alloc() in spa_config_write() The packed nvlist allocated in spa_config_write() may exceed the warning threshold for large configurations. Use the vmem interfaces for this short lived allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3251	2015-04-07 15:10:19 -07:00
Brian Behlendorf	2a5d574eca	Clear PF_FSTRANS over vfs_sync() When layered on XFS the following warning will be emitted under CentOS7 when entering vfs_fsync() with PF_FSTRANS already set. This is not an issue for other stock Linux file systems and the warning was removed for newer kernels. However, to avoid triggering this error PF_FSTRANS is cleared and then reset in vn_fsync(). WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0 Call Trace: [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80 [<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs] [<ffffffff8114b833>] __writepage+0x13/0x50 [<ffffffff8114c341>] write_cache_pages+0x251/0x4d0 [<ffffffff8114c60d>] generic_writepages+0x4d/0x80 [<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs] [<ffffffff8114d68e>] do_writepages+0x1e/0x40 [<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80 [<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70 [<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs] [<ffffffff811df54b>] vfs_fsync+0x2b/0x40 [<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl] [<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs] [<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs] [<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs] [<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs] [<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs] [<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs] [<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs] [<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0 [<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0 [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-07 15:03:47 -07:00
Tim Chase	40d06e3c78	Mark all ZPL and ioctl functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl calls as well as the l2arc and adapt ARC threads. This obviates the need for MUTEX_FSTRANS so its previous uses and definition have been eliminated. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3225	2015-04-03 11:38:59 -07:00
Tim Chase	ae26dd0039	Don't allow shrinking a PF_FSTRANS context Avoid deadlocks when entering the shrinker from a PF_FSTRANS context. This patch also reverts commit `d0d5dd7` which added MUTEX_FSTRANS. Its use has been deprecated within ZFS as it was an ineffective mechanism to eliminate deadlocks. Among other things, it introduced the need for strict ordering of mutex locking and unlocking in order that the PF_FSTRANS flag wouldn't set incorrectly. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #446	2015-04-03 11:32:31 -07:00
Matthew Ahrens	0f7d2a4b3d	Illumus 5693 - ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override 5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5693 https://github.com/illumos/illumos-gate/commit/7f7ace3 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3231	2015-03-27 15:02:56 -07:00
George Wilson	b738bc5a0f	Illumos 5694 - traverse_prefetcher does not prefetch enough 5694 traverse_prefetcher does not prefetch enough Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5694 https://github.com/illumos/illumos-gate/commit/34d7ce05 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3230	2015-03-27 15:02:50 -07:00
Chris Dunlop	ee2f17aa2a	Align code with Illumos Align code in traverse_visitbp() with that in Illumos in preparation for applying Illumos-5694. No functional change: use a temporary variable pd to replace multiple occurrences of td->td_pfd. This increases our stack use slightly more then normal because the function is called recursively. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3230	2015-03-27 14:52:45 -07:00
Prakash Surya	a4069eef2e	Illumos 5695 - dmu_sync'ed holes do not retain birth time 5695 dmu_sync'ed holes do not retain birth time Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5695 https://github.com/illumos/illumos-gate/commit/70163ac Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3229	2015-03-27 14:51:34 -07:00
Ned Bass	58806b4cdc	dbuf_free_range() overzealously frees dbufs When called to free a spill block from a dnode, dbuf_free_range() has a bug that results in all dbufs for the dnode getting freed. A variety of problems may result from this bug, but a common one was a zap lookup tripping an ASSERT because the zap buffers had been zeroed out. This could happen on a dataset with xattr=sa set when extended attributes are written and removed on a directory concurrently with I/O to files in that directory. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #3195 Fixes #3204 Fixes #3222	2015-03-25 14:48:22 -07:00
Tim Chase	ded576e28f	Set the maximum ZVOL transfer size correctly ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS). This cap was removed in torvalds/linux@34b48db. This patch changes it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in dmu_tx_hold_write() to allow the maximum transfer size. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3212	2015-03-25 11:58:42 -07:00
Isaac Huang	e89bd69775	zio_injection_enabled should not be a module option The zio_inject.c keeps zio_injection_enabled as a counter of fault handlers, so it should not be exported to user space as a module option. Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c, where the symbols are defined. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3199	2015-03-24 13:22:03 -07:00
Chris Dunlop	d07b7c7f21	Reduce size of zfs_sb_t: allocate z_hold_mtx separately zfs_sb_t has grown to the point where using kmem_zalloc() for allocations is triggering the 32k warning threshold. We can't safely convert this entire allocation to use vmem_alloc() instead of kmem_alloc() because the backing_dev_info structure is embedded here. It depends on the bit_waitqueue() function which won't behave properly when given a virtual address. Instead, use vmem_alloc() to allocate the z_hold_mtx array separately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #3178	2015-03-24 13:17:44 -07:00
Brian Behlendorf	bc88866657	Fix arc_adjust_meta() behavior The goal of this function is to evict enough meta data buffers from the ARC in order to enforce the arc_meta_limit. Achieving this is slightly more complicated than it appears because it is common for data buffers to have holds on meta data buffers. In addition, dnode meta data buffers will be held by the dnodes in the block preventing them from being freed. This means we can't simply traverse the ARC and expect to always find enough unheld meta data buffer to release. Therefore, this function has been updated to make alternating passes over the ARC releasing data buffers and then newly unheld meta data buffers. This ensures forward progress is maintained and arc_meta_used will decrease. Normally this is sufficient, but if required the ARC will call the registered prune callbacks causing dentry and inodes to be dropped from the VFS cache. This will make dnode meta data buffers available for reclaim. The number of total restarts in limited by zfs_arc_meta_adjust_restarts to prevent spinning in the rare case where all meta data is pinned. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	2cbb06b561	Restructure per-filesystem reclaim Originally when the ARC prune callback was introduced the idea was to register a single callback for the ZPL. The ARC could invoke this call back if it needed the ZPL to drop dentries, inodes, or other cache objects which might be pinning buffers in the ARC. The ZPL would iterate over all ZFS super blocks and perform the reclaim. For the most part this design has worked well but due to limitations in 2.6.35 and earlier kernels there were some problems. This patch is designed to address those issues. 1) iterate_supers_type() is not provided by all kernels which makes it impossible to safely iterate over all zpl_fs_type filesystems in a single callback. The most straight forward and portable way to resolve this is to register a callback per-filesystem during mount. The arc_*_prune_callback() functions have always supported multiple callbacks so this is functionally a very small change. 2) Commit `050d22b` removed the non-portable shrink_dcache_memory() and shrink_icache_memory() functions and didn't replace them with equivalent functionality. This meant that for Linux 3.1 and older kernels the ARC had no mechanism to drop dentries and inodes from the caches if needed. This patch adds that missing functionality by calling shrink_dcache_parent() to release dentries which may be pinning inodes. This will result in all unused cache entries being dropped which is a bit heavy handed but it's the only interface available for old kernels. 3) A zpl_drop_inode() callback is registered for kernels older than 2.6.35 which do not support the .evict_inode callback. This ensures that when the last reference on an inode is dropped it is immediately removed from the cache. If this isn't done than inode can end up on the global unused LRU with no mechanism available to ZFS to drop them. Since the ARC buffers are not dropped the hottest inodes can still be recreated without performing disk IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	596a8935a1	Fix arc_meta_max accounting The arc_meta_max value should be increased when space it consumed not when it is returned. This ensure's that arc_meta_max is always up to date. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Chunwei Chen	40749aa7a6	Use MUTEX_FSTRANS on l2arc_buflist_mtx Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock scenario: 1. arc_release() -> hash_lock -> l2arc_buflist_mtx 2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) -> arc_buf_remove_ref() -> hash_lock Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3160	2015-03-18 09:29:38 -07:00
Justin T. Gibbs	4c7b7eedcd	Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption 5630 stale bonus buffer in recycled dnode_t leads to data corruption Author: Justin T. Gibbs <justing@spectralogic.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5630 https://github.com/illumos/illumos-gate/commit/cd485b4 Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3172	2015-03-12 15:40:39 -07:00
Josef 'Jeff' Sipek	73ad4a9f3c	Illumos 5047 - don't use atomic__nv if you discard the return value 5047 don't use atomic__nv if you discard the return value Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5047 https://github.com/illumos/illumos-gate/commit/640c167 Porting Notes: Several hunks from the original patch where not specific to ZFS and thus were dropped. Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3172	2015-03-12 15:40:33 -07:00
Brian Behlendorf	7f3e466283	Mark zfs_inactive() with PF_FSTRANS Allowing direct reclaim to re-enter the VFS in the zfs_inactive() call path has historically been problematic for ZoL. Therefore, in order to avoid an entire class of current and future issues caused by this PF_FSTRANS is set for all zfs_inactive() callers. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3163	2015-03-10 09:21:48 -07:00
Ned Bass	417104bdd3	Use cached feature info in spa_add_feature_stats() Avoid issuing I/O to the pool when retrieving feature flags information. Trying to read the ZAPs from disk means that zpool clear would hang if the pool is suspended and recovery would require a reboot. To keep the feature stats resident in memory, we hang a cached nvlist off of the spa. It is built up from disk the first time spa_add_feature_stats() is called, and refreshed thereafter using the cached feature reference counts. spa_add_feature_stats() gets called at pool import time so we can be sure the cached nvlist will be available if the pool is later suspended. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3082	2015-03-05 14:11:10 -08:00
Brian Behlendorf	989fd514b1	Change ASSERT(!"...") to cmn_err(CE_PANIC, ...) There are a handful of ASSERT(!"...")'s throughout the code base for cases which should be impossible. This patch converts them to use cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that additional debugging is logged if they were to occur. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1445	2015-03-03 13:22:21 -08:00
Brian Behlendorf	8c45def24a	Linux 4.0 compat: bdi_setup_and_register() The 'capabilities' argument which was passed to bdi_setup_and_register() has been removed. File systems should no longer pass BDI_CAP_MAP_COPY. For our purposes this means there are now three different interfaces which must be handled. A zpl_bdi_setup_and_register() wrapper function has been introduced to provide a single interface to the ZPL code. * 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported. * 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments. * 4.0 - x.y, bdi_setup_and_register() takes 2 arguments. I've also taken this opportunity to remove HAVE_BDI because kernels older then 2.6.32 are no longer supported. All kernels newer than this will have one of the above interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #3128	2015-03-03 10:49:45 -08:00
Brian Behlendorf	4ec15b8dcf	Use MUTEX_FSTRANS mutex type There are regions in the ZFS code where it is desirable to be able to be set PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to set/clear this flag in all the correct places, but this is undesirable for a few reasons. 1) It would require changes to a significant amount of the ZFS code. This would complicate applying patches from upstream. 2) It would be easy to accidentally miss a critical region in the initial patch or to have an future change introduce a new one. Both of these concerns can be addressed by using a new mutex type which is responsible for managing PF_FSTRANS, support for which was added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch 'kmem-rework'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3050 Closes #3055 Closes #3062 Closes #3132 Closes #3142 Closes #2983	2015-03-03 10:46:40 -08:00
Brian Behlendorf	6ab08667a4	Reduce splat_taskq_test2_impl() stack frame size Slightly increasing the size of a kmutex_t has caused us to exceed the stack frame warning size in splat_taskq_test2_impl(). To address this the tq_args have been moved to the heap. cc1: warnings being treated as errors spl-0.6.3/module/splat/splat-taskq.c:358: error: the frame size of 1040 bytes is larger than 1024 bytes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:18:31 -08:00
Brian Behlendorf	d0d5dd7144	Add MUTEX_FSTRANS mutex type There are regions in the ZFS code where it is desirable to be able to be set PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to set/clear this flag in all the correct places, but this is undesirable for a few reasons. 1) It would require changes to a significant amount of the ZFS code. This would complicate applying patches from upstream. 2) It would be easy to accidentally miss a critical region in the initial patch or to have an future change introduce a new one. Both of these concerns can be addressed by adding a new mutex type which is responsible for managing PF_FSTRANS, support for which was added to the SPL in commit `9099312` - Merge branch 'kmem-rework'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:18:24 -08:00
Isaac Huang	d14cfd83da	Fix deadlock between zpool export and zfs list Pool reference count is NOT checked in spa_export_common() if the pool has been imported readonly=on, i.e. spa->spa_sync_on is FALSE. Then zpool export and zfs list may deadlock: 1. Pool A is imported readonly. 2. zpool export A and zfs list are run concurrently. 3. zfs command gets reference on the spa, which holds a dbuf on on the MOS meta dnode. 4. zpool command grabs spa_namespace_lock, and tries to evict dbufs of the MOS meta dnode. The dbuf held by zfs command can't be evicted as its reference count is not 0. 5. zpool command blocks in dnode_special_close() waiting for the MOS meta dnode reference count to drop to 0, with spa_namespace_lock held. 6. zfs command tries to get the spa_namespace_lock with a reference on the spa held, which holds a dbuf on the MOS meta dnode. 7. Now zpool command and zfs command deadlock each other. Also any further zfs/zpool command will block on spa_namespace_lock forever. The fix is to always check pool reference count in spa_export_common(), no matter whether the pool was imported readonly or not. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2034	2015-03-02 11:50:06 -08:00
Brian Behlendorf	87a63dd702	Prevent "zpool destroy\|export" when suspended Cleanly destroying or exporting a pool requires that the pool not be suspended. Therefore, set the POOL_CHECK_SUSPENDED flag for these ioctls so the utilities will output a descriptive error message rather than block. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2878	2015-03-02 11:50:06 -08:00
Brian Behlendorf	c1bc8e610b	Retire spl_module_init()/spl_module_fini() In the original implementation of the SPL wrappers were provided for module initialization and cleanup. This was done to abstract away any compatibility code which might be needed for the SPL. As it turned out the only significant compatibility issue was that the default pwd during module load differed under Illumos and Linux. Since this is such as minor thing and the wrappers complicate the code they are being retired. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#2985	2015-02-27 13:43:39 -08:00
Brian Behlendorf	b4f3666a16	Retire spl_module_init()/spl_module_fini() In the original implementation of the SPL wrappers were provided for module initialization and cleanup. This was done to abstract away any compatibility code which might be needed for the SPL. As it turned out the only significant compatibility issue was that the default pwd during module load differed under Illumos and Linux. Since this is such as minor thing and the wrappers complicate the code they are being retired. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2985	2015-02-24 11:37:44 -08:00
Brian Behlendorf	1efdc45ea8	Fix O_APPEND open(2) flag As described in flags section of open(2): O_APPEND: The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). O_APPEND may lead to corrupted files on NFS filesys- tems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition. This issue was originally overlooked because normally the generic VFS code handles this for a filesystem. However, because ZFS explictly registers a zpl_write() function it's responsible for the seek. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3124	2015-02-24 11:21:54 -08:00
Dan Swartzendruber	1611bb7b4f	Set zfs_autoimport_disable default value to 1 When loading the ZFS kernel modules they should not populate the spa namespace using the cache file. This behavior isn't consistent with other Linux kernel modules and we need to move away from it. Removing this makes the whole startup process predictable with four basic steps which are driven by the init system. 1) modprobe 2) zpool import 3) zfs mount 4) zfs share This change also helps lay the groundwork for eventually removing the kobj_* compatibility code on the kernel side. It may need to be preserved in userspace because libzfs_init() depends on it. This is why the conditional must be wrapped with an #ifdef _KERNEL. Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2820	2015-02-17 16:09:41 -08:00
Brian Behlendorf	7d2868d5fc	Skip bad DVAs during free by setting zfs_recover=1 When a bad DVA is encountered in metaslab_free_dva() the system should treat it as fatal. This indicates that somehow a damaged DVA was written to disk and that should be impossible. However, we have seen a handful of reports over the years of pools somehow being damaged in this way. Since this damage can render otherwise intact pools unimportable, and the consequence of skipping the bad DVA is only leaked free space, it makes sense to provide a mechanism to ignore the bad DVA. Setting the zfs_recover=1 module option will cause the DVA to be ignored which may allow the pool to be imported. Since zfs_recover=0 by default any pool attempting to free a bad DVA will treat it as a fatal error preserving the current behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3099 Issue #3090 Issue #2720	2015-02-13 16:02:04 -08:00
Andrey Vesnovaty	5f15fa2216	Fix readdir for .zfs/snapshot directory dmu_snapshot_list_next stores the index of the next snapshot entry to the offp argument, which zpl_snapdir_iterate then uses for the dir_emit. This result in an off-by-one error. Therefore a temporary variable should be used. This was a regression introduced in commit zfsonlinux/zfs@0f37d0c. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2930	2015-02-10 16:34:30 -08:00
Brian Behlendorf	3941503c0a	Retire zio_cons()/zio_dest() The zio_cons() constructor and zio_dest() destructor don't exist in the upstream Illumos code. They were introduced as a workaround to avoid issue #2523. Since this issue has now been resolved this code is being reverted to bring ZoL back in sync with Illumos. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:09:49 -08:00
Brian Behlendorf	6442f3cfe3	Retire zio_bulk_flags Long ago the zio_bulk_flags module parameter was introduced to facilitate debugging and profiling the zio_buf_caches. Today this code works well and there's no compelling reason to keep this functionality. In fact it's preferable to revert this so the code is more consistent with other ZFS implementations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #3063	2015-02-10 16:08:49 -08:00
Jörg Thalheim	534759fad3	Linux 3.19 compat: file_inode was added struct access f->f_dentry->d_inode was replaced by accessor function file_inode(f) Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3084	2015-02-10 11:24:51 -08:00
Brian Behlendorf	77aef6f60e	Use vmem_alloc() for nvlists Several of the nvlist functions may perform allocations larger than the 32k warning threshold. Convert them to use vmem_alloc() so the best allocator is used. Commit `efcd79a` retired KM_NODEBUG which was used to suppress large allocation warnings. Concurrently the large allocation warning threshold was increased from 8k to 32k. The goal was to identify the remaining locations, such as this one, where the allocation can be larger than 32k. This patch is expected fine tuning resulting for the kmem-rework changes, see commit `6e9710f`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3057 Closes #3079 Closes #3081	2015-02-10 11:00:08 -08:00
Brian Behlendorf	afe373260e	Revert "Don't read space maps during import for readonly pools" This reverts commit `7fc8c33ede` which accidentally introduced a ztest failure. ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2 child exited with code 3 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-02-09 16:56:59 -08:00
Brian Behlendorf	7fc8c33ede	Don't read space maps during import for readonly pools Normally when importing a pool the space maps for all top level vdevs are read from disk. The space maps will be required latter when an allocation is performed and free blocks need to be located. However, if the pool is imported readonly then we are guaranteed that no allocations can occur. In this case the space maps need not be loaded.. A similar argument can be made for the DTLs (dirty time logs). Because a pool import will fail if the space maps cannot be read. The ability to safely ignore them makes it more likely that a damaged pool can be imported readonly to recover its contents. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2831	2015-02-09 16:43:03 -08:00
Justin T. Gibbs	33b4de513e	Illumos 5311 - traverse_dnode may report success when it should not 5311 traverse_dnode may report success when it should not Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@spectralogic.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/2a89c2c https://www.illumos.org/issues/5311 Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2970	2015-02-06 12:07:15 -08:00
Ned Bass	a62d1b02e3	Fix SA header size accounting The functions sa_find_sizes() and sa_build_layout() fail to account for the additional 2 bytes of SA header space when calculating whether a variable size attribute might spill over. They may consequently determine that an attribute will fit in the bonus buffer along with a spill block pointer, when in reality the attribute would be partially overwritten by the spill block pointer if spill over occurs. This also causes an inconsistency between the SA header size and the number of variable size attributes in the layout, tripping an assertion when debugging is on. The following reproducer demonstrates the problem. ln -s $(perl -e 'print "z" x 20') file setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file Even though sa_find_sizes() computes the index of the attribute where spill-over will occur, sa_build_layouts() discards the result and recomputes it itself. As it turns out, both functions get it wrong. Since this computation is awkward and, as history has shown, easy to screw up, let's just do it in one place. This patch fixes the bug in sa_find_sizes() and updates sa_build_layout() to use the result computed there. Also improve the comments in sa_find_sizes(). Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3070	2015-02-06 09:26:46 -08:00
Brian Behlendorf	e2c4acde55	Skip evicting dbufs when walking the dbuf hash When a dbuf is in the DB_EVICTING state it may no longer be on the dn_dbufs list. In which case it's unsafe to call DB_DNODE_ENTER. Therefore, any dbuf which is found in this safe must be skipped. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2553 Closes #2495	2015-02-06 09:24:28 -08:00
Chunwei Chen	086476f920	Fix spl_hostid module parameter Currently, spl_hostid module parameter doesn't do anything, because it will always be overwritten when calling into hostid_read(). Instead, we should only call into hostid_read() when spl_hostid is not zero, just as the comment describes. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #427	2015-02-04 16:42:25 -08:00
Tim Chase	aa2ef419e4	Spurious ENOMEM returns when reading dbufs kstat Commit `7b2d78a046` fixed some improper uses of snprintf(), however, in __dbuf_stats_hash_table_data() the return value of snprintf is propagated to the caller. This caused spurious ENOMEM errors when reading the dbufs kstat. This commit causes the actual number of characters written to be returned. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3072	2015-02-04 16:35:26 -08:00
avg	037763e44e	fix l2arc compression buffers leak Commit log from FreeBSD: We have observed that arc_release() can be called concurrently with a l2arc in-flight write. Also, we have observed that arc_hdr_destroy() can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar circumstances. Previously the l2arc headers would be freed while leaking their associated compression buffers. Now the buffers are placed on l2arc_free_on_write list for delayed freeing. This is similar to what was already done to arc buffers that were supposed to be freed concurrently with in-flight writes of those buffers. In addition to fixing the discovered leaks this change also adds some protective code to assert that a compression buffer associated with a l2arc header is never leaked. A new kstat l2_cdata_free_on_write is added. It keeps a count of delayed compression buffer frees which previously would have been leaks. Tested by: Vitalij Satanivskij <satan@ukr.net> et al Requested by: many MFC after: 2 weeks Sponsored by: HybridCluster / ClusterHQ References: https://illumos.org/issues/5222 https://github.com/freebsd/freebsd/commit/b98f85d http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781 http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3029	2015-02-03 16:54:16 -08:00
Brian Behlendorf	19ea3d25df	Use zio buffers in zil_itx_create() The zil_itx_create() function uses the vmem_alloc() allocator for its buffers because when logging a write that buffer may be as large as 64K. This is non-optimal because we may need to allocate many of of these buffers and this interface has the potential to be slow. Instead, use zio_data_buf_alloc() which is specifically designed to be able to efficiently allocate a wide range of buffer sizes. In addition, do some cleanup and use the zil_itx_destroy() function to always free an itx structure. This way we're always sure the right allocation functions are used. Notice that in the current code kmem_free() and vmem_free() were both used. This happened to work because these wrappers map to the same internal SPL function. This was identified as a potential problem when a low-end memory constrained system began logging the following warnings. There was no deadlock here just repeated allocation failures resulting in increased latency. Possible memory allocation deadlock: size=65792 lflags=0x42d0 Pid: 20118, comm: kvm Tainted: P O 3.2.0-0.bpo.4-amd64 Call Trace: [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl] [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl] [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs] [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs] [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs] [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs] [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3059	2015-02-02 11:20:41 -08:00
Brian Behlendorf	c7db36a3c4	Optimize vmem_alloc() retry path For performance reasons the reworked kmem code maps vmem_alloc() to kmalloc_node() for allocations less than spa_kmem_alloc_max. This allows for more concurrency in the system and less contention of the virtual address space. Generally, this is a good thing. However, in the case when the kmalloc_node() fails it makes little sense to retry it using kmalloc_node() again. It will likely fail in exactly the same way. A smarter strategy is to abandon this optimization and retry using spl_vmalloc() which is very likely to succeed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #428	2015-02-02 10:57:56 -08:00
Brian Behlendorf	0365064a97	Handle closing an unopened ZVOL Thank to commit `a4430fce69` we're now correctly returning EROFS when opening a zvol on a read-only pool. Unfortunately, it looks like this causes us to trigger some unexpected behavior by __blkdev_get(). In the failure case it's possible __blkdev_get() will call __blkdev_put() for a bdev which was never successfully opened. This results in us trying to close the device again and hitting the NULL dereference. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1343	2015-01-30 14:44:14 -08:00
Brian Behlendorf	a127e841de	Add zvol_open() error handling for readonly property Rather than ASSERT when for some reason the readonly property of a zvol can't be read cleanly handle the failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1343	2015-01-30 14:44:06 -08:00
Tim Chase	b0cf0676c0	Fix removal of SA in sa_modify_attrs() The sa_modify_attrs() function can add, remove or replace an SA. The main loop in the function uses the index "i" to iterate over the existing SAs and uses the index "j" for writing them into a new buffer via SA_ADD_BULK_ATTR(). The write index, "j" is incremented on remove (SA_REMOVE) operations which leads to a corruption in the new SA buffer. This patch remove the increment for SA_REMOVE operations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3028	2015-01-21 16:35:14 -08:00
Richard Yao	841c9d43c7	Use kmem_vasprintf() in log_internal() An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be simplified by using kmem_asprintf(). It is not clear that switching to kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to kmem_asprintf() is cleanup that simplifies debugging such that it would become clear that this is a bug in glibc should the issue persist. It also brings this function almost back in sync with Illumos. This was possible due to the recently reworked kmem code which allows us to use KM_SLEEP in the same fashion as Illumos. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2791 Issue #2781	2015-01-21 15:30:24 -08:00
Brian Behlendorf	54cccfc2e3	Fix GFP_KERNEL allocations flags The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat() functions should all use the kmem_flags_convert() function to generate the GFP_* flags. This ensures that they can be safely called in any context and the correct flags will be used. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #426	2015-01-21 15:25:19 -08:00
Tim Chase	3c832b8cc1	Linux 3.12 compat: split shrinker has s_shrink The split count/scan shrinker callbacks introduced in 3.12 broke the test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers. This patch re-enables the per-superblock shrinkers when the split shrinker callbacks have been detected. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2975	2015-01-20 14:07:59 -08:00
Brian Behlendorf	81971b137a	Revert "SA spill block cache" The SA spill_cache was originally introduced to avoid the need to perform large kmem or vmem allocations. Instead a small dedicated cache of preallocated SA buffers was kept. This solution was viable while the maximum block size was limited to 128K. But with the planned increase of the maximum block size to 16M callers need to migrate to the zio_buf_alloc(). However, they should be aware this interface is expected to change again once the zio buffers are fully backed by scatter-gather lists. Alternately, if the callers know these buffers will never be large or be infrequently accessed they may kmem_alloc() or vmem_alloc() the needed temporary space. This change has the additional benegit of bringing the code back inline with the upstream Illumos source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:28 -08:00
Brian Behlendorf	285b29d959	Revert "Pre-allocate vdev I/O buffers" Commit `86dd0fd` added preallocated I/O buffers. This is no longer required after the recent kmem changes designed to make our memory allocation interfaces behave more like those found on Illumos. A deadlock in this situation is no longer possible. However, these allocations still have the potential to be expensive. So a potential future optimization might be to perform then KM_NOSLEEP so that they either succeed of fail quicky. Either case is acceptable here because we can safely abort the aggregation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:28 -08:00
Brian Behlendorf	79c76d5b65	Change KM_PUSHPAGE -> KM_SLEEP By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. For others fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc() which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers which allows us to dip in to reserved memory. This is again the same as upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:41:26 -08:00
Brian Behlendorf	efcd79a883	Retire KM_NODEBUG Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress the large allocation warning have been replaced by vmem_alloc() as appropriate. The updated vmem_alloc() call will not print a warning regardless of the size of the allocation. A careful reader will notice that not all callers have been changed to vmem_alloc(). Some have only had the KM_NODEBUG flag removed. This was possible because the default warning threshold has been increased to 32k. This is desirable because it minimizes the need for Linux specific code changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:40:32 -08:00
Richard Yao	71f8548ea4	Use is_vmalloc_addr() in vdev_disk.c The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to cleanup the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:28:05 -08:00
Brian Behlendorf	92119cc259	Mark IO pipeline with PF_FSTRANS In order to avoid deadlocking in the IO pipeline it is critical that pageout be avoided during direct memory reclaim. This ensures that the pipeline threads can always make forward progress and never end up blocking on a DMU transaction. For this very reason Linux now provides the PF_FSTRANS flag which may be set in the process context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 14:28:05 -08:00
Brian Behlendorf	ee33517452	Use __get_free_pages() for emergency objects The __get_free_pages() function must be used in place of kmalloc() to ensure the __GFP_COMP is strictly honored. This is due to kmalloc() being layered on the generic Linux slab caches. It wasn't until recently that all caches were created using __GFP_COMP. This means that it is possible for a kmalloc() which passed the __GFP_COMP flag to be returned a non-compound allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:58:11 -08:00
Brian Behlendorf	436ad60faa	Fix kmem cache deadlock logic The kmem cache implementation always adds new slabs by dispatching a task to the spl_kmem_cache taskq to perform the allocation. This is done because large slabs must be allocated using vmalloc(). It is possible these allocations will block on IO because the GFP_NOIO flag is not honored. This can result in a deadlock. Therefore, a deadlock detection strategy was implemented to deal with this case. When it is determined, by timeout, that the spl_kmem_cache thread has deadlocked attempting to add a new slab. Then all callers attempting to allocate from the cache fall back to using kmalloc() which does honor all passed flags. This logic was correct but an optimization in the code allowed for a deadlock. Because only slabs backed by vmalloc() can deadlock in the way described above. An optimization was made to only invoke this deadlock detection code for vmalloc() backed caches. This had the advantage of making it easy to distinguish these objects when they were freed. But this isn't strictly safe. If all the spl_kmem_cache threads end up deadlocked than we can't grow any of the other caches either. This can once again result in a deadlock if memory needs to be allocated from one of these other caches to ensure forward progress. The fix here is to remove the optimization which limits this fall back allocation stratagy to vmalloc() backed caches. Doing this means we may need to take the cache lock in spl_kmem_cache_free() call path. But this small cost can be mitigated by ignoring objects with virtual addresses. For good measure the default number of spl_kmem_cache threads has been increased from 1 to 4, and made tunable. This alone wouldn't resolve the original issue since it's still possible for all the threads to be deadlocked. However, it does help responsiveness by ensuring that a single deadlocked spl_kmem_cache thread doesn't block allocations from other caches until the timeout is reached. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	3018bffa9b	Refine slab cache sizing This change is designed to improve the memory utilization of slabs by more carefully setting their size. The way the code currently works is problematic for slabs which contain large objects (>1MB). This is due to slabs being unconditionally rounded up to a power of two which may result in unused space at the end of the slab. The reason the existing code rounds up every slab is because it assumes it will backed by the buddy allocator. Since the buddy allocator can only performs power of two allocations this is desirable because it avoids wasting any space. However, this logic breaks down if slab is backed by vmalloc() which operates at a page level granularity. In this case, the optimal thing to do is calculate the minimum required slab size given certain constraints (object size, alignment, objects/slab, etc). Therefore, this patch reworks the spl_slab_size() function so that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs. KMC_KMEM slabs are rounded up to the nearest power of two, and KMC_VMEM slabs are allowed to be the minimum required size. This change also reduces the default number of objects per slab. This reduces how much memory a single cache object can pin, which can result in significant memory saving for highly fragmented caches. But depending on the workload it may result in slabs being allocated and freed more frequently. In practice, this has been shown to be a better default for most workloads. Also the maximum slab size has been reduced to 4MB on 32-bit systems. Due to the limited virtual address space it's critical the we be as frugal as possible. A limit of 4M still lets us reasonably comfortably allocate a limited number of 1MB objects. Finally, the kmem:slab_small and kmem:slab_large SPLAT tests were extended to provide better test coverage of various object sizes and alignments. Caches are created with random parameters and their basic functionality is verified by allocating several slabs worth of objects. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	e50e6cc958	Reduce kmem cache deadlock threshold Reduce the threshold for detecting a kmem cache deadlock by 10x from HZ to HZ/10. The reduced value is still several orders of magnitude large enough to avoid being triggered incorrectly. By reducing it we allow the system to resolve the issue more quickly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	1a20496834	Make slab reclaim more aggressive Many people have noticed that the kmem cache implementation is slow to release its memory. This patch makes the reclaim behavior more aggressive by immediately freeing a slab once it is empty. Unused objects which are cached in the magazines will still prevent a slab from being freed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Richard Yao	a988a35a93	Enforce architecture-specific barriers around clear_bit() The comment above the Linux 3.16 kernel's clear_bit() states: /** * clear_bit - Clears a bit in memory * @nr: Bit to clear * @addr: Address to start counting from * * clear_bit() is atomic and may not be reordered. However, it does * not contain a memory barrier, so if it is used for locking purposes, * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic() * in order to ensure changes are visible on other processors. */ This comment does not make sense in the context of x86 because x86 maps the operations to barrier(), which is a compiler barrier. However, it does make sense to me when I consider architectures that reorder around atomic instructions. In such situations, a processor is allowed to execute the wake_up_bit() before clear_bit() and we have a race. There are a few architectures that suffer from this issue. In such situations, the other processor would wake-up, see the bit is still taken and go to sleep, while the one responsible for waking it up will assume that it did its job and continue. This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to smp_mb__{before,after}_clear_bit() on older kernels and changes our code to leverage it in a manner consistent with the mainline kernel. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Richard Yao	c2fa09454e	Add hooks for disabling direct reclaim The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit that is used to mark contexts which are processing transactions. When set, allocations in this context can dip into kernel memory reserves to avoid deadlocks during writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS began using in 3.15. This patch implements hooks for marking transactions via PF_FSTRANS. When an allocation is performed in the context of PF_FSTRANS, any KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation. Additionally, when using a Linux 3.9 or newer kernel, it will set PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on on any KM_PUSHPAGE or KM_NOSLEEP allocation. This effectively allows the spl_vmalloc() helper function to be used safely in a thread which is responsible for IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	c3eabc75b1	Refactor generic memory allocation interfaces This patch achieves the following goals: 1. It replaces the preprocessor kmem flag to gfp flag mapping with proper translation logic. This eliminates the potential for surprises that were previously possible where kmem flags were mapped to gfp flags. 2. It maps vmem_alloc() allocations to kmem_alloc() for allocations sized less than or equal to the newly-added spl_kmem_alloc_max parameter. This ensures that small allocations will not contend on a single global lock, large allocations can still be handled, and potentially limited virtual address space will not be squandered. This behavior is entirely different than under Illumos due to different memory management strategies employed by the respective kernels. However, this functionally provides the semantics required. 3. The --disable-debug-kmem, --enable-debug-kmem (default), and --enable-debug-kmem-tracking allocators have been unified in to a single spl_kmem_alloc_impl() allocation function. This was done to simplify the code and make it more maintainable. 4. Improve portability by exposing an implementation of the memory allocations functions that can be safely used in the same way they are used on Illumos. Specifically, callers may safely use KM_SLEEP in contexts which perform filesystem IO. This allows us to eliminate an entire class of Linux specific changes which were previously required to avoid deadlocking the system. This change will be largely transparent to existing callers but there are a few caveats: 1. Because the headers were refactored and extraneous includes removed callers may find they need to explicitly add additional #includes. In particular, kmem_cache.h must now be explicitly includes to access the SPL's kmem cache implementation. This behavior is different from Illumos but it was done to avoid always masking the Linux slab functions when kmem.h is included. 2. Callers, like Lustre, which made assumptions about the definitions of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated. Other callers such as ZFS which did not will not require changes. 3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It retains its original meaning of allowing allocations to access reserved memory. KM_PUSHPAGE callers can be converted back to KM_SLEEP. 4. The KM_NODEBUG flags has been retired and the default warning threshold increased to 32k. 5. The kmem_virt() functions has been removed. For callers which need to distinguish between a physical and virtual address use is_vmalloc_addr(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	b34b95635a	Fix kmem cstyle issues Address all cstyle issues in the kmem, vmem, and kmem_cache source and headers. This will done to make it easier to review subsequent changes which will rework the kmem/vmem implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	e5b9b344c7	Refactor existing code This change introduces no functional changes to the memory management interfaces. It only restructures the existing codes by separating the kmem, vmem, and kmem cache implementations in the separate source and header files. Splitting this functionality in to separate files required the addition of spl_vmem_{init,fini}() and spl_kmem_cache_{initi,fini}() functions. Additionally, several minor changes to the #include's were required to accommodate the removal of extraneous header from kmem.h. But again, while large this patch introduces no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:08 -08:00
Brian Behlendorf	d958324f97	Fix zfs_putpage() lock inversion (again) This is a follow up commit to `74328ee` which correctly resolved a lock inversion between zfs_putpage() and zfs_free_range(). Unfortunately, in the process it accidentally introduced another inversion between zfs_putpage() and zfs_read(). The page must be unlocked before taking the range lock. This patch corrects that issue. In addition, because the locking rules here are subtle a block comment has been added clearly explaining why the ordering here is critical. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #2976	2015-01-08 16:09:41 -08:00
Ned Bass	33b6dbbc51	Document zfs_flags module parameter Add a table describing the debugging flags that can be set in the zfs_flags module parameter. Also change the module_param type to 'uint' so users aren't shown a negative value. The updated man page text is reproduced below for convenience. zfs_flags (int) Set additional debugging flags. The following flags may be bitwise-or'd together. +-------------------------------------------------------+ \|Value Symbolic Name \| \| Description \| +-------------------------------------------------------+ \| 1 ZFS_DEBUG_DPRINTF \| \| Enable dprintf entries in the debug log. \| +-------------------------------------------------------+ \| 2 ZFS_DEBUG_DBUF_VERIFY * \| \| Enable extra dbuf verifications. \| +-------------------------------------------------------+ \| 4 ZFS_DEBUG_DNODE_VERIFY * \| \| Enable extra dnode verifications. \| +-------------------------------------------------------+ \| 8 ZFS_DEBUG_SNAPNAMES \| \| Enable snapshot name verification. \| +-------------------------------------------------------+ \| 16 ZFS_DEBUG_MODIFY \| \| Check for illegally modified ARC buffers. \| +-------------------------------------------------------+ \| 32 ZFS_DEBUG_SPA \| \| Enable spa_dbgmsg entries in the debug log. \| +-------------------------------------------------------+ \| 64 ZFS_DEBUG_ZIO_FREE \| \| Enable verification of block frees. \| +-------------------------------------------------------+ \| 128 ZFS_DEBUG_HISTOGRAM_VERIFY \| \| Enable extra spacemap histogram verifications. \| +-------------------------------------------------------+ * Requires debug build. Default value: 0. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2988	2015-01-07 15:50:49 -08:00
Brian Behlendorf	03a783534a	Fix debug object on stack warning When running the SPLAT tests on a kernel with CONFIG_DEBUG_OBJECTS=y enabled the following warning is generated. ODEBUG: object is on stack, but not annotated WARNING: at lib/debugobjects.c:300 __debug_object_init+0x221/0x480() This is caused by the test cases placing a debug object on the stack rather than the heap. This isn't harmful since they are small objects but to make CONFIG_DEBUG_OBJECTS=y happy the objects have been relocated to the heap. This impacted taskq tests 1, 3, and 7. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #424	2015-01-07 13:52:20 -08:00
Ned Bass	49ee64e5e6	Remove duplicate typedefs from trace.h Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate typedef declarations with the same type. The trace.h header contains some typedefs to avoid 'unknown type' errors for C files that haven't declared the type in question. But this causes build failures for C files that have already declared the type. Newer versions of GCC (e.g. v4.6) allow duplicate typedefs with the same type unless pedantic error checking is in force. To support the older versions we need to remove the duplicate typedefs. Removal of the typedefs means we can't built tracepoints code using those types unless the required headers have been included. To facilitate this, all tracepoint event declarations have been moved out of trace.h into separate headers. Each new header is explicitly included from the C file that uses the events defined therein. The trace.h header is still indirectly included form zfs_context.h and provides the implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces. This makes those interfaces readily available throughout the code base. The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also still provided by trace.h, so it is a prerequisite for the other trace_*.h headers. These new Linux implementation-specific headers do introduce a small divergence from upstream ZFS in several core C files, but this should not present a significant maintenance burden. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2953	2015-01-06 16:53:24 -08:00
Brian Behlendorf	74328ee18f	Fix zfs_putpage() lock inversion There exists a lock inversions involving the zfs range lock and the individual page writeback bits which can result in a deadlock. To prevent this we must always manipulate the writeback bit while holding the range lock. The exact deadlock is as follows: ------ Process A ------ ------ Process B ------ zpl_writepages zpl_fallocate write_cache_pages zpl_fallocate_common zpl_putpage zfs_space zfs_putpage (set bit) zfs_freesp zfs_range_lock (wait on lock) zfs_free_range (take lock) [has not yet initiated I/O, truncate_inode_pages_range the bit will not be cleared] wait_on_page_writeback (wait on bit) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <richard.yao@clusterhq.com> Issue #2976	2014-12-22 09:31:56 -08:00
Boris Protopopov	9063f65476	Correct error returns to unify cross-pool operation error handling Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2911	2014-12-19 10:51:24 -08:00
Brian Behlendorf	c944be5d7e	Fix snapshots with dirty inodes Filesystems which are mounted read-only or are immutable because they are snapshots must not be allowed to dirty and inode. This will result in a write which will correctly cause a kernel panic because these filesystem are (and must be) immutable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2812	2014-11-20 10:38:16 -08:00
Ned Bass	52479ecf58	Remove compat includes from sys/types.h Don't include the compatibility code in linux/_compat.h in the public header sys/types.h. This causes problems when an external code base includes the ZFS headers and has its own conflicting compatibility code. Lustre, in particular, defined SHRINK_STOP for compatibility with pre-3.12 kernels in a way that conflicted with the SPL's definition. Because Lustre ZFS OSD includes ZFS headers it fails to build due to a '"SHRINK_STOP" redefined' compiler warning. To avoid such conflicts only include the compat headers from .c files or private headers. Also, for consistency, include sys/.h before linux/*.h then sort by header name. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #411	2014-11-19 10:35:12 -08:00
Brian Behlendorf	8d9a23e82c	Retire legacy debugging infrastructure When the SPL was originally written Linux tracepoints were still in their infancy. Therefore, an entire debugging subsystem was added to facilite tracing which served us well for many years. Now that Linux tracepoints have matured they provide all the functionality of the previous tracing subsystem. Rather than maintain parallel functionality it makes sense to fully adopt tracepoints. Therefore, this patch retires the legacy debugging infrastructure. See zfsonlinux/zfs@bc9f413 for the tracepoint changes. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #408	2014-11-19 10:35:07 -08:00
Isaac Huang	29b763cd2c	bio_alloc() with __GFP_WAIT never returns NULL Mark the error handling branch as unlikely() because the current kernel interface can never return NULL. However, we want to keep the error handling in case this behavior changes in the futre. Plus fix a small style issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #2703	2014-11-19 12:50:49 -05:00
Ned Bass	aaed7c408c	Explicitly include SPL compat headers Inclusion of SPL compatibility headers was moved out of the public header sys/types.h to avoid conflicts with external packages. Include a few compatiblity headers explicitly to cope with that change. Also, sort some linux-specific inclusions alphabetically. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2898	2014-11-19 12:30:39 -05:00
Ned Bass	7b2d78a046	Fix improper null-byte termination handling Fix a few cases where null-byte termination of strings was done unnecessarily or incorrectly. - The snprintf() function always produces a null-byte terminated string for non-negative return values, so it is not necessary to write out a null-byte as a separate step. - Also, it is unsafe to use the return value of snprintf() as an offset for placing a null-byte, because if the output was truncated the return value is the number of bytes that _would_ have been written had enough space been available. Therefore the return value may index beyond the array boundaries. - Finally, snprintf() accounts for the null-byte when limiting its output size, so there is no need to pass it a size parameter that is one less than the buffer size. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2875	2014-11-17 15:28:59 -08:00
smh	89b1cd6581	Prevent ZFS leaking pool free space When processing async destroys ZFS would leak space every txg timeout (5 seconds by default), if no writes occurred, until the pool is totally full. At this point it would be unfixable without a pool recreation. In addition if the machine was rebooted with the pool in this situation would fail to import on boot, hanging indefinitely, as the import process requires the ability to write data to the pool. Any attempts to query the pool status during the hung import would not return as the import holds the pool lock. The only way to import such a pool would be to specify -o readonly=on to the zpool import. zdb -bb <pool> can be used to check for "deferred free" size which is where this lost space will be counted. References: https://github.com/freebsd/freebsd/commit/48431b7 http://svnweb.freebsd.org/base?view=revision&revision=273158 https://reviews.csiden.org/r/132/ Porting notes: This issue was filed as illumos 5347 and a more comprehensive fix is under review. Once that change is finalized it will be integrated, in the meanwhile the FreeBSD fix has been merged to prevent the issue. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Ahrens mahrens@delphix.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2896	2014-11-17 11:35:38 -08:00
Tim Chase	4254acb057	Undirty freed spill blocks. If a spill block's dbuf hasn't yet been written when a spill block is freed, the unwritten version will still be written. This patch handles the case in which a spill block's dbuf is freed and undirties it to prevent it from being written. The most common case in which this could happen is when xattr=sa is being used and a long xattr is immediately replaced by a short xattr as in: setfattr -n user.test -v very_very_very..._long_value <file> setfattr -n user.test -v short_value <file> The first value must be sufficiently long that a spill block is generated and the second value must be short enough to not require a spill block. In practice, this would typically happen due to internal xattr operations as a result of setting acltype=posixacl. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2663 Closes #2700 Closes #2701 Closes #2717 Closes #2863 Closes #2884	2014-11-17 11:25:48 -08:00
Prakash Surya	0b39b9f96f	Swap DTRACE_PROBE* with Linux tracepoints This patch leverages Linux tracepoints from within the ZFS on Linux code base. It also refactors the debug code to bring it back in sync with Illumos. The information exported via tracepoints can be used for a variety of reasons (e.g. debugging, tuning, general exploration/understanding, etc). It is advantageous to use Linux tracepoints as the mechanism to export this kind of information (as opposed to something else) for a number of reasons: * A number of external tools can make use of our tracepoints "automatically" (e.g. perf, systemtap) * Tracepoints are designed to be extremely cheap when disabled * It's one of the "accepted" ways to export this kind of information; many other kernel subsystems use tracepoints too. Unfortunately, though, there are a few caveats as well: * Linux tracepoints appear to only be available to GPL licensed modules due to the way certain kernel functions are exported. Thus, to actually make use of the tracepoints introduced by this patch, one might have to patch and re-compile the kernel; exporting the necessary functions to non-GPL modules. * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux tracepoints are not available for unsigned kernel modules (tracepoints will get disabled due to the module's 'F' taint). Thus, one either has to sign the zfs kernel module prior to loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or newer. Assuming the above two requirements are satisfied, lets look at an example of how this patch can be used and what information it exposes (all commands run as 'root'): # list all zfs tracepoints available $ ls /sys/kernel/debug/tracing/events/zfs enable filter zfs_arc__delete zfs_arc__evict zfs_arc__hit zfs_arc__miss zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write zfs_new_state__mfu zfs_new_state__mru # enable all zfs tracepoints, clear the tracepoint ring buffer $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable $ echo 0 > /sys/kernel/debug/tracing/trace # import zpool called 'tank', inspect tracepoint data (each line was # truncated, they're too long for a commit message otherwise) $ zpool import tank $ cat /sys/kernel/debug/tracing/trace \| head -n35 # tracer: nop # # entries-in-buffer/entries-written: 1219/1219 #P:8 # # _-----=> irqs-off # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / delay # TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\| \| \| lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr... z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr... z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr... z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr... z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr... z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr... z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr... z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ... To highlight the kind of detailed information that is being exported using this infrastructure, I've taken the first tracepoint line from the output above and reformatted it such that it fits in 80 columns: lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr { dva 0x1:0x40082 birth 15491 cksum0 0x163edbff3a flags 0x640 datacnt 1 type 1 size 2048 spa 3133524293419867460 state_type 0 access 0 mru_hits 0 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 1 } bp { dva0 0x1:0x40082 dva1 0x1:0x3000e5 dva2 0x1:0x5a006e cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00 lsize 2048 } zb { objset 0 object 0 level -1 blkid 0 } For the specific tracepoint shown here, 'zfs_arc__miss', data is exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!). This kind of precise and detailed information can be extremely valuable when trying to answer certain kinds of questions. For anybody unfamiliar but looking to build on this, I found the XFS source code along with the following three web links to be extremely helpful: * http://lwn.net/Articles/379903/ * http://lwn.net/Articles/381064/ * http://lwn.net/Articles/383362/ I should also node the more "boring" aspects of this patch: * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to support a sixth paramter. This parameter is used to populate the contents of the new conftest.h file. If no sixth parameter is provided, conftest.h will be empty. * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced. This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro, except it has support for a fifth option that is then passed as the sixth parameter to ZFS_LINUX_COMPILE_IFELSE. These autoconf changes were needed to test the availability of the Linux tracepoint macros. Due to the odd nature of the Linux tracepoint macro API, a separate ".h" must be created (the path and filename is used internally by the kernel's define_trace.h file). * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This is to determine if we can safely enable the Linux tracepoint functionality. We need to selectively disable the tracepoint code due to the kernel exporting certain functions as GPL only. Without this check, the build process will fail at link time. In addition, the SET_ERROR macro was modified into a tracepoint as well. To do this, the 'sdt.h' file was moved into the 'include/sys' directory and now contains a userspace portion and a kernel space portion. The dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as well. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:55 -08:00
Ned Bass	29e57d15c8	Fix dprintf format specifiers Fix a few dprintf format specifiers that disagreed with their argument types. These came to light as compiler errors when converting dprintf to use the Linux trace buffer. Previously this wasn't a problem, presumably because the SPL debug logging uses vsnprintf which must perform automatic type conversion. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:45 -08:00
Ned Bass	59ec819a0c	Move a few internal ARC strucutres to arc_impl.h Add a new file named arc_impl.h and move a few internal ARC structure definitions into this file. This is needed in order to allow the Linux tracepoint functions to grub around in the internals of these structures. Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-17 11:13:27 -08:00
Prakash Surya	fb42a49328	Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO 5213 panic in metaslab_init due to space_map_open returning ENXIO Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com References: https://www.illumos.org/issues/5213 https://reviews.csiden.org/r/110 Porting notes: For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2745	2014-11-14 15:37:45 -08:00
Chris Wedgwood	b31d8ea77c	Reduce buf/dbuf mutex contention Due to evidence of contention both the buf_hash_table and the dbuf_hash_table sizes have been increased from 256 to 8192. This increase in hash table size adds approximating 0.5M to our fixed memory footprint. This relatively small increase is not expected to cause problems even on low memory machines. This footprint will also become dynamic when the persistent L2ARC support is finalized. In the meanwhile, this small change significantly reduces contention for certain workloads. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #1291	2014-11-14 14:59:21 -08:00
Alex Zhuravlev	0f69910833	Export symbols for ZIL interface These symbols are needed by consumers (i.e. Lustre) who wish to integrate with the ZIL. In addition the zil_rollback_destroy() prototype was removed because the implementation of this function was removed long ago. Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2892	2014-11-14 14:39:43 -08:00
Brian Behlendorf	917fef2732	Lower minimum objects/slab threshold As long as we can fit a minimum of one object/slab there's no reason to prevent the creation of the cache. This effectively pushes the maximum object size up to 32MB. The splat cache tests were extended accordingly to verify this functionality. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-11-05 10:08:21 -08:00
Alexander Pyhalov	3f4a13c497	Fix modules installation directory When building zfs modules with kernel, compiled from deb.src, the packaging process ends up installing the modules in the wrong place. Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#2822	2014-10-28 09:49:24 -07:00
Alexander Pyhalov	bb9d808c5a	Fix modules installation directory When building zfs modules with kernel, compiled from deb.src, the packaging process ends up installing the modules in the wrong place. Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2822	2014-10-28 09:46:14 -07:00
Tim Chase	ed6e9cc235	Linux 3.12 compat: shrinker semantics The new shrinker API as of Linux 3.12 modifies "struct shrinker" by replacing the @shrink callback with the pair of @count_objects and @scan_objects. It also requires the return value of @count_objects to return the number of objects actually freed whereas the previous @shrink callback returned the number of remaining freeable objects. This patch adds support for the new @scan_objects return value semantics. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2837	2014-10-28 09:34:51 -07:00
Richard Yao	ad9863e80b	kmem_cache: Call constructor/destructor on each alloc/free This has a few benefits. First, it fixes a regression that "Rework generic memory allocation interfaces" appears to have triggered in splat's slab_reap and slab_age tests. Second, it makes porting code from Illumos to ZFSOnLinux easier. Third, it has the side effect of making reclaim from slab caches that specify reclaim functions an order of magnitude faster. The splat slab_reap test usually took 30 to 40 seconds. With this change, it takes 3 to 4. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #369	2014-10-28 09:21:08 -07:00
Tim Chase	802a4a2ad5	Linux 3.12 compat: shrinker semantics The new shrinker API as of Linux 3.12 modifies "struct shrinker" by replacing the @shrink callback with the pair of @count_objects and @scan_objects. It also requires the return value of @count_objects to return the number of objects actually freed whereas the previous @shrink callback returned the number of remaining freeable objects. This patch adds support for the new @scan_objects return value semantics and updates the splat shrinker test case appropriately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #403	2014-10-28 09:20:13 -07:00
Matthew Ahrens	9635861742	Illumos 5164-5165 - space map fixes 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled space map_histogram feature Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5164 https://www.illumos.org/issues/5165 https://github.com/illumos/illumos-gate/commit/b1be289 Porting Notes: The metaslab_fragmentation() hunk was dropped from this patch because it was already resolved by commit `8b0a084`. The comment modified in metaslab.c was updated to use the correct variable name, space_map_blksz. The upstream commit incorrectly used space_map_blksize. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Alex Reece	b02fe35d37	Illumos 4958 zdb trips assert on pools with ashift >= 0xe 4958 zdb trips assert on pools with ashift >= 0xe Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4958 https://github.com/illumos/illumos-gate/commit/2a104a5 Porting notes: Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present in Linux but not yet in *BSD. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2697	2014-10-23 15:30:32 -07:00
Brian Behlendorf	5f6d0b6f5a	Handle block pointers with a corrupt logical size The general strategy used by ZFS to verify that blocks are valid is to checksum everything. This has the advantage of being extremely robust and generically applicable regardless of the contents of the block. If a blocks checksum is valid then its contents are trusted by the higher layers. This system works exceptionally well as long as bad data is never written with a valid checksum. If this does somehow occur due to a software bug or a memory bit-flip on a non-ECC system it may result in kernel panic. One such place where this could occur is if somehow the logical size stored in a block pointer exceeds the maximum block size. This will result in an attempt to allocate a buffer greater than the maximum block size causing a system panic. To prevent this from happening the arc_read() function has been updated to detect this specific case. If a block pointer with an invalid logical size is passed it will treat the block as if it contained a checksum error. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2678	2014-10-23 09:20:52 -07:00
Ned Bass	bc151f7b31	Remove checks for mandatory locks The Linux VFS handles mandatory locks generically so we shouldn't need to check for conflicting locks in zfs_read(), zfs_write(), or zfs_freesp(). Linux 3.18 removed the lock_may_read() and lock_may_write() interfaces which we were relying on for this purpose. Rather than emulating those interfaces we remove the redundant checks. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2804	2014-10-22 11:06:53 -07:00
Matthew Ahrens	88904bb3e3	Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy 5162 zfs recv should use loaned arc buffer to avoid copy Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5162 https://github.com/illumos/illumos-gate/commit/8a90470 Porting notes: Fix spelling error 's/arena/area/' in dmu.c. In restore_write() declare bonus and abuf at the top of the function. Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2696	2014-10-21 16:32:11 -07:00
Matthew Ahrens	4b20a6f509	Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness When a clone is created of a snapshot that has been marked for deferred destroy (with "zfs destroy -d"), the clone "inherits" the defer_destroy flag from the origin, and any snapshots of the clone "inherit" the defer_destroy flag from the clone. This causes a strange situation where the clone's snapshots are marked for defer_destroy but they have no holds or clones. If the clone's snapshot gets a hold or clone, which is then deleted, we will honor the incorrectly-set defer_destroy flag and delete the snapshot! Steps to reproduce: * zpool create test c1t1d0 * zfs create test/fs * zfs snapshot test/fs@a * zfs clone test/fs@a test/clone * zfs destroy -d test/fs@a * zfs clone test/fs@a test/clone2 * zfs snapshot test/clone2@a * zfs hold hld test/clone2@a * zfs release hld test/clone2@a * zfs list -r -t all test <test/clone2@a has been destroyed> We noticed that this causes dcenter to get very confused, because it treats snapshots that are marked defer_destroy as not existing. So it won't see any snapshots of the clone that's marked defer_destroy. 5150 - zfs clone of a defer_destroy snapshot causes strangeness Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/projects/illumos-gate//issues/5150 https://github.com/illumos/illumos-gate/commit/42fcb65 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2690	2014-10-21 15:26:58 -07:00
Matthew Ahrens	6c59307a3c	Illumos 3693 - restore_object uses at least two transactions to restore an object Restore_object should not use two transactions to restore an object: * one transaction is used for dmu_object_claim * another transaction is used to set compression, checksum and most importantly bonus data * furthermore dmu_object_reclaim internally uses multiple transactions * dmu_free_long_range frees chunks in separate transactions * dnode_reallocate is executed in a distinct transaction The fact the dnode_allocate/dnode_reallocate are executed in one transaction and bonus (re-)population is executed in a different transaction may lead to violation of ZFS consistency assertions if the transactions are assigned to different transaction groups. Also, if the first transaction group is successfully written to a permanent storage, but the second transaction is lost, then an invalid dnode may be created on the stable storage. 3693 restore_object uses at least two transactions to restore an object Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com> Approved by: Robert Mustacchi <rm@joyent.com> Original authors: Matthew Ahrens and Andriy Gapon References: https://www.illumos.org/issues/3693 https://github.com/illumos/illumos-gate/commit/e77d42e Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2689	2014-10-21 15:26:50 -07:00
Tim Chase	356d9ed4c8	Don't perform ACL-to-mode translation on empty ACL In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to create a traditional mode value based on an ACL. If no ACL exists, this processing shouldn't be done. Problems caused by this were most evident on version 4 filesystems which not only don't have system attributes, and also frequently have empty ACLs. On such filesystems, performing a chown() operation could have the effect of dirtying the mode bits in memory but not on the file system as follows: # create a file with typical mode of 664 echo test > test chown anyuser test ls -l test and the mode will show up as all zeroes. Unmounting/mounting and/or exporting/importing the filesystem will reveal the proper mode again. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1264	2014-10-21 09:23:27 -07:00
Daniil Lunev	62bdd5eb7a	Illumos 4924 - LZ4 Compression for metadata Reviewed by Matthew Ahrens <mahrens@delphix.com> Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Christopher Siden <christopher.siden@delphix.com> References: https://github.com/illumos/illumos-gate/commit/b8289d2 https://www.illumos.org/issues/3756 Porting notes: The static function zfs_prop_activate_feature() was removed because this change removes the only caller. The function was not removed from Illumos but instead left as dead code. However, to keep gcc happy it was removed from Linux and may be easily restored if needed. Ported by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1540	2014-10-20 16:17:49 -07:00
Brian Behlendorf	ba232d8aea	Suppress AIO kmem warnings The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc() to allocate enough memory to hold the vectorized IO. While this allocation will be small it's been observed in practice to sometimes slightly exceed the 8K warning threshold by a few kilobytes. Therefore, the KM_NODEBUG flag has been added to suppress warning. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2774	2014-10-20 16:10:25 -07:00
Brian Behlendorf	599662c538	Remove kern_path() wrapper The kern_path() function has been available since Linux 2.6.28. There is no longer a need to maintain this compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:52 -07:00
Brian Behlendorf	3d5392cefa	Remove kvasprintf() wrapper The kvasprintf() function has been available since Linux 2.6.22. There is no longer a need to maintain this compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:52 -07:00
Brian Behlendorf	0fac9c9e6d	Remove proc_handler() wrapper As of Linux 2.6.32 the proc handlers where updated to expect only five arguments. Therefore there is no longer a need to maintain this compatibility code and this infrastructure can be simplified. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:52 -07:00
Brian Behlendorf	68a829b29d	Remove credential configure checks. The groups_search() function was never exported by a mainline kernel therefore we drop this compatibility code and always provide our own implementation. Additionally, the cred_t structure has been available since 2.6.29 so there is no longer a need to maintain compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	137af025f6	Remove set_fs_pwd() configure check This function has never been exported by any mainline and was only briefly available under RHEL5. Therefore this check is being removed and the code update to always use the wrapper function. The next step will be to eliminate all this code. If ZFS were updated not to assume that it's pwd was / there would be no need for this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	3c49a16989	Remove user_path_dir() wrapper The user_path_dir() function has been available since Linux 2.6.27. There is no longer a need to maintain this compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	44778f4110	Remove kallsyms_lookup_name() wrapper After the removable of get_vmalloc_info(), the unused global memory variables, and the optional dcache/icache shrinkers there is no longer a need for the kallsyms compatibility code. This allows us to eliminate another brittle area of the code by removing the kernel upcall this functionality depended on for older kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	89a461e70c	Remove shrink_{i,d}node_cache() wrappers This is optional functionality which may or may not be useful to ZFS when using older kernels. It is never a hard requirement. Therefore this functionality is being removed from the SPL and a simpler slimmed down version will be added to ZFS. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	8bbbe46f86	Remove global memory variables Platforms such as Illumos and FreeBSD have historically provided global variables which summerize the memory state of a system. Linux on the otherhand doesn't expose any of this information to kernel modules and uses entirely different mechanisms for memory management. In order to simplify the original ZFS port to Linux these global variables were emulated by the SPL for the benefit of ZFS. As ZoL has matured over the years it has moved steadily away from these interfaces and now no longer depends on them at all. Therefore, this patch completely removes the global variables availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree, and swapfs_reserve. This greatly simplifies the memory management code and eliminates a common area of confusion. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	e1310afae3	Remove get_vmalloc_info() wrapper The get_vmalloc_info() function was used to back the vmem_size() function. This was always problematic and resulted in brittle code because the kernel never provided a clean interface for modules. However, it turns out that the only caller of this function in ZFS uses it to determine the total virtual address space size. This can be determined easily without get_vmalloc_info() so vmem_size() has been updated to take this approach which allows us to shed the get_vmalloc_info() dependency. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	50e41ab1e1	Remove on_each_cpu() wrapper The on_each_cpu() function has been available since Linux 2.6.27. There is no longer a need to maintain this compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	2bc5666f53	Remove i_mutex() configure check The inode structure has used i_mutex as its internal locking primitive since 2.6.16. The compatibility code to check for the previous semaphore primitive has been removed. However, the wrapper function itself is being kept because it's entirely possible this primitive will change again to allow finer grained locking. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:51 -07:00
Brian Behlendorf	82f2f1a3af	Simplify the time compatibility wrappers Many of the time functions had grown overly complex in order to handle kernel compatibility issues. However, as of Linux 2.6.26 all the required functionality is available. This allows us to retire numerous configure checks and greatly simplify the time compatibility wrappers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:50 -07:00
Brian Behlendorf	87f8055a91	Map highbit64() to fls64() The fls64() function has been available since Linux 2.6.16 and it should be used to implemented highbit64(). This allows us to provide an optimized implementation and simplify the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:50 -07:00
Brian Behlendorf	9c91800d19	Remove CTL_UNNUMBERED sysctl interface Support for the CTL_UNNUMBERED sysctl interface was removed in Linux 2.6.19. There is no longer any reason to maintain this compatibility code. There also issue any reason to keep around the CTL_NAME macro and helpers so they have been retired. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:50 -07:00
Brian Behlendorf	b38bf6a4e3	Remove register_sysctl() compatibility code The register_sysctl() interface has been stable since Linux 2.6.21. There is no longer a need to maintain compatibility code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:50 -07:00
Brian Behlendorf	bb4dee3df2	Remove utsname() wrapper There is no longer a need to wrap this because utsname() is provided by the kernel and can be called directly. This will require a small change in the ZFS code because utsname is expected to be a global structure and not a function. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:11:41 -07:00
Brian Behlendorf	aa363c5c05	Remove sysctl_vfs_cache_pressure assumption The generic SPL cache shrinkers make the assumption that the caches only contain VFS cache data and therefore should be scaled based on vfs_cache_pressure. This is not strictly true and it should not be assumed. Removing this tuning should not have any impact on the stock behavior because vfs_cache_pressure=100 by default. This means that no scaling will take place. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00
Brian Behlendorf	a80d69caf0	Remove adaptive mutex implementation Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs. There is no longer any point in keeping this code so it is being removed to simplify the code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00
Brian Behlendorf	3a92530563	Update code to use misc_register()/misc_deregister() When the SPL was originally written it was designed to use the device_create() and device_destroy() functions. Unfortunately, these functions changed considerably over the years making them difficult to rely on. As it turns out a better choice would have been to use the misc_register()/misc_deregister() functions. This interface for registering character devices has remained stable, is simple, and provides everything we need. Therefore the code has been reworked to use this interface. The higher level ZFS code has always depended on these same interfaces so this is also as a step towards minimizing our kernel dependencies. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00
Brian Behlendorf	0cb3dafccd	Update SPLAT to use kmutex_t for portability For consistency throughout the code update the SPLAT infrastructure to use the wrapped mutex interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-10-17 15:07:28 -07:00

... 12 13 14 15 16 ...

2457 Commits