Archive-Team/zfs - zfs - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
George Wilson	07ce5d7390	OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio PROBLEM ======= It's possible for a parent zio to complete even though it has children which have not completed. This can result in the following panic: > $C ffffff01809128c0 vpanic() ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80) ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80) ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908) ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0) ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0) ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140) ffffff0180912b40 thread_start+8() > ::status debugging crash dump vmcore.2 (64-bit) from batfs0390 operating system: 5.11 joyent_20170911T171900Z (i86pc) image uuid: (not set) panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80 owner=ffffff3c59b39480 thread=ffffff0180912c40 dump content: kernel pages only The problem is that dbuf_prefetch along with l2arc can create a zio tree which confuses the parent zio and allows it to complete with while children still exist. Here's the scenario: zio tree: pio \|--- lio The parent zio, pio, has entered the zio_done stage and begins to check its children to see there are still some that have not completed. In zio_done(), the children are checked in the following order: zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE) If pio, finds any child which has not completed then it stops executing and goes to sleep. Each call to zio_wait_for_children() will grab the io_lock while checking the particular child. In this scenario, the pio has completed the first call to zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since the only zio in the zio tree right now is the logical zio, lio, then it completes that call and prepares to check the next child type. In the meantime, the lio completes and in its callback creates a child vdev zio, cio. The zio tree looks like this: zio tree: pio \|--- lio \|--- cio The lio then grabs the parent's io_lock and removes itself. zio tree: pio \|--- cio The pio continues to run but has already completed its check for ZIO_CHILD_VDEV and will erroneously complete. When the child zio, cio, completes it will panic the system trying to reference the parent zio which has been destroyed. SOLUTION ======== The fix is to rework the zio_wait_for_children() logic to accept a bitfield for all the children types that it's interested in checking. The io_lock will is held the entire time we check all the children types. Since the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided to allow for the conversion between a ZIO_CHILD type and the bitfield used by the zio_wiat_for_children logic. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Youzhong Yang <youzhong@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8857 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c Issue #5918 Closes #7168	2018-03-14 16:10:37 -07:00
sanjeevbagewadi	d85011ed69	mmp should use a fixed tag for spa_config locks mmp_write_uberblock() and mmp_write_done() should the same tag for spa_config_locks. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #6530 Closes #7155	2018-03-14 16:10:37 -07:00
sanjeevbagewadi	b3da003ebf	Handle zap_add() failures in mixed case mode With "casesensitivity=mixed", zap_add() could fail when the number of files/directories with the same name (varying in case) exceed the capacity of the leaf node of a Fatzap. This results in a ASSERT() failure as zfs_link_create() does not expect zap_add() to fail. The fix is to handle these failures and rollback the transactions. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #7011 Closes #7054	2018-03-14 16:10:37 -07:00
Chunwei Chen	5e566c5772	Fix zle_decompress out of bound access Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-03-14 16:10:36 -07:00
Chunwei Chen	23227313a2	Fix zdb -c traverse stop on damaged objset root If a corruption happens to be on a root block of an objset, zdb -c will not correctly report the error, and it will not traverse the datasets that come after. This is because traverse_visitbp, which does the callback and reset error for TRAVERSE_HARD, is skipped when traversing zil is failed in traverse_impl. Here's example of what 'zdb -eLcc' command looks like on a pool with damaged objset root: == before patch: Traversing all blocks to verify checksums ... Error counts: errno count block traversal size 379392 != alloc 33987072 (unreachable 33607680) bp count: 172 ganged count: 0 bp logical: 1678336 avg: 9757 bp physical: 130560 avg: 759 compression: 12.85 bp allocated: 379392 avg: 2205 compression: 4.42 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 71 Dittoed blocks on same vdev: 101 == after patch: Traversing all blocks to verify checksums ... zdb_blkptr_cb: Got error 52 reading <54, 0, -1, 0> -- skipping Error counts: errno count 52 1 block traversal size 33963520 != alloc 33987072 (unreachable 23552) bp count: 447 ganged count: 0 bp logical: 36093440 avg: 80745 bp physical: 33699840 avg: 75391 compression: 1.07 bp allocated: 33963520 avg: 75981 compression: 1.06 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 76 Dittoed blocks on same vdev: 115 == Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-03-14 16:10:36 -07:00
Brian Behlendorf	310e63dfd1	Linux 4.16 compat: inode_set_iversion() A new interface was added to manipulate the version field of an inode. Add a inode_set_iversion() wrapper for older kernels and use the new interface when available. The i_version field was dropped from the trace point due to the switch to an atomic64_t i_version type. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7148	2018-03-14 16:10:36 -07:00
WHR	a196b3bc3d	OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable Authored by: WHR <msl0000023508@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8966 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95549fcdc Closes #7141	2018-03-14 16:10:36 -07:00
Richard Elling	a58e1284d8	Remove deprecated zfs_arc_p_aggressive_disable zfs_arc_p_aggressive_disable is no more. This PR removes docs and module parameters for zfs_arc_p_aggressive_disable. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Richard Elling <Richard.Elling@RichardElling.com> Closes #7135	2018-03-14 16:10:36 -07:00
wli5	5f38142e7b	Bug fix in qat_compress.c for vmalloc addr check Remove the unused vmalloc address check, and function mem_to_page will handle the non-vmalloc address when map it to a physical address. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #7125	2018-03-14 16:10:36 -07:00
John L. Hammond	ecc972c7f0	Emit an error message before MMP suspends pool In mmp_thread(), emit an MMP specific error message before calling zio_suspend() so that the administrator will understand why the pool is being suspended. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John L. Hammond <john.hammond@intel.com> Closes #7048	2018-03-14 16:10:36 -07:00
Brian Behlendorf	5d62588032	Remove vn_rename and vn_remove dependency The only place vn_rename and vn_remove are used is when writing out an updated pool configuration file. By truncating the file instead of renaming and removing it we can avoid having to implement these interfaces entirely. Functionally an empty cache file is treated the same as a missing cache file. This is particularly advantageous because the Linux kernel has never provided a way to reliably implement vn_rename and vn_remove. The cachefile_004_pos.ksh test case was updated to understand that an empty cache file is the same as a missing one. The zfs-import-* systemd service files were not updated to use ConditionFileNotEmpty in place of ConditionPathExists. This means that after exporting all pools and rebooting new pools will not the scanned for on the next boot. This small change should not impact normal usage since pools are not exported as part of a normal shutdown. Documentation was updated accordingly. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#648 Closes #6753	2018-03-14 16:10:36 -07:00
Alexander Motin	701ebd014a	OpenZFS 8835 - Speculative prefetch in ZFS not working for misaligned reads In case of misaligned I/O sequential requests are not detected as such due to overlaps in logical block sequence: dmu_zfetch(fffff80198dd0ae0, 27347, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27355, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27363, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27371, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27379, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27387, 9, 1) This patch makes single block overlap to be counted as a stream hit, improving performance up to several times. Authored by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8835 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aab6dd482a Closes #7062	2018-01-30 10:27:31 -06:00
Alex Zhuravlev	129e3e8dc3	Use zap_count instead of cached z_size for unlink As a performance optimization Lustre does not strictly update the SA_ZPL_SIZE when adding/removing from non-directory entries. This results in entries which cannot be removed through the ZPL layer even though the ZAP is empty and safe to remove. Resolve this issue by checking the zap_count() directly instead on relying on the cached SA_ZPL_SIZE. Micro-benchmarks show no significant performance impact due to the additional overhead of using zap_count(). Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7019	2018-01-30 10:27:31 -06:00
Brian Behlendorf	9c1a8eaa51	Fix ARC hit rate When the compressed ARC feature was added in commit `d3c2ae1` the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely. This would have be fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC. This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6171 Closes #6852 Closes #6989	2018-01-30 10:27:31 -06:00
lidongyang	8d82a19def	Call commit callbacks from the tail of the list Our zfs backed Lustre MDT had soft lockups while under heavy metadata workloads while handling transaction callbacks from osd_zfs. The problem is zfs is not taking advantage of the fast path in Lustre's trans callback handling, where Lustre will skip the calls to ptlrpc_commit_replies() when it already saw a higher transaction number. This patch corrects this, it also has a positive impact on metadata performance on Lustre with osd_zfs, plus some cleanup in the headers. A similar issue for ext4/ldiskfs is described on: https://jira.hpdd.intel.com/browse/LU-6527 Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au> Closes #6986	2018-01-30 10:27:31 -06:00
Brian Behlendorf	aebc5df418	Update for cppcheck v1.80 Resolve new warnings and errors from cppcheck v1.80. * [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol * [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype * [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link * [module/zfs/abd.c:353]: (error) Uninitialized variable: sg * [module/zfs/abd.c:353]: (error) Uninitialized variable: i * [module/zfs/abd.c:385]: (error) Uninitialized variable: sg * [module/zfs/abd.c:385]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: sg * [module/zfs/abd.c:763]: (error) Uninitialized variable: i * [module/zfs/abd.c:763]: (error) Uninitialized variable: sg * [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page * [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value * [module/zfs/zvol.c:208]: (error) Uninitialized variable: p Convert the following suppression to inline. * [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these macro's will never be defined until this functionality is implemented. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6879	2018-01-30 10:27:31 -06:00
Gvozden Neskovic	a94447ddf3	dmu_objset: release bonus buffer in failure path Reported by kmemleak during testing of a new patch: ``` unreferenced object 0xffff9f1c12e38800 (size 1024): comm "z_upgrade", pid 17842, jiffies 4296870904 (age 8746.268s) backtrace: kmemleak_alloc+0x7a/0x100 __kmalloc_node+0x26c/0x510 range_tree_create+0x39/0xa0 [zfs] dmu_zfetch_init+0x73/0xe0 [zfs] dnode_create+0x12c/0x3b0 [zfs] dnode_hold_impl+0x1096/0x1130 [zfs] dnode_hold+0x23/0x30 [zfs] dmu_bonus_hold_impl+0x6b/0x370 [zfs] dmu_bonus_hold+0x1e/0x30 [zfs] dmu_objset_space_upgrade+0x114/0x310 [zfs] dmu_objset_userobjspace_upgrade_cb+0xd8/0x150 [zfs] dmu_objset_upgrade_task_cb+0x136/0x1e0 [zfs] kthread+0x119/0x150 ``` Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #6575	2018-01-30 10:27:30 -06:00
Chunwei Chen	7192ec7942	Fix zfs_ioc_pool_sync should not use fnvlist Use fnvlist on user input would allow user to easily panic zfs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6529	2018-01-30 10:27:30 -06:00
Gvozden Neskovic	06acbbc429	vdev_mirror: load balancing fixes vdev_queue: - Track the last position of each vdev, including the io size, in order to detect linear access of the following zio. - Remove duplicate `vq_lastoffset` vdev_mirror: - Correctly calculate the zio offset (signedness issue) - Deprecate `vdev_queue_register_lastoffset()` - Add `VDEV_LABEL_START_SIZE` to zio offset of leaf vdevs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #6461	2018-01-30 10:27:30 -06:00
Brian Behlendorf	504bfc8b49	Fix multihost stale cache file import When the multihost property is enabled it should be impossible to import an active pool even using the force (-f) option. This patch prevents a forced import from succeeding when importing with a stale cache file. The root cause of the problem is that the kernel modules trusted the hostid provided in configuration. This is always correct when the configuration is generated by scanning for the pool. However, when using an existing cache file the hostid could be stale which would result in the activity check being skipped. Resolve the issue by always using the hostid read from the label configuration where the best uberblock was found. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6933 Closes #6971	2017-12-18 10:31:01 -08:00
Tony Hutter	36e0ddb744	Revert "Long hold the dataset during upgrade" This reverts commit `a5c8119eba`. The commit (which was modified to remove encryption) was hitting ASSERT(dsl_pool_config_held(dmu_objset_pool(os))) in dmu_objset_upgrade() during automated testing. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2017-12-06 13:25:40 -06:00
Brian Behlendorf	1030f807ba	Fix NFS sticky bit permission denied error When zfs_sticky_remove_access() was originally adapted for Linux a typo was made which altered the intended behavior. As described in the block comment, the intended behavior is that permission should be granted when the entry is a regular file and you have write access. That is, S_ISREG should have been used instead of S_ISDIR. Restricting permission to regular files made good sense for older systems where setting the bit on executable files would instruct the system to save the program's text segment on the swap device. On modern systems this behavior has been replaced by the sticky bit acting as a restricted deletion flag and the plain file restriction has been relaxed. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6889 Closes #6910	2017-12-04 17:22:47 -08:00
Brian Behlendorf	4a98780933	Preserve itx alloc size for zio_data_buf_free() Using zio_data_buf_alloc() to allocate the itx's may be unsafe because the itx->itx_lr.lrc_reclen field is not constant from allocation to free. Using a different itx->itx_lr.lrc_reclen size in zio_data_buf_free() can result in the allocation being returned to the wrong kmem cache. This issue can be avoided entirely by storing the allocation size in itx->itx_size and using that for zio_data_buf_free(). Reviewed by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6912	2017-12-04 17:21:39 -08:00
LOLi	6db8f1a0d1	Fix 'zfs get {user\|group}objused@' functionality Fix a regression accidentally introduced in `1b81ab4` that prevents 'zfs get {user\|group}objused@' from correctly reporting the requested value. Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify this functionality. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6908	2017-12-04 17:21:39 -08:00
Mark Wright	e06711412b	Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT Fix build errors with gcc 7.2.0 on Gentoo with kernel 4.14 built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as: module/nvpair/nvpair.c:2810:2:error: positional initialization of field in ?struct? declared with 'designated_init' attribute [-Werror=designated-init] nvs_native_nvlist, ^~~~~~~~~~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Wright <gienah@gentoo.org> Closes #5390 Closes #6903	2017-12-04 17:21:39 -08:00
Brian Behlendorf	954516cec1	Emit history events for 'zpool create' History commands and events were being suppressed for the 'zpool create' command since the history object did not yet exist. Create the object earlier so this history doesn't get lost. Split the pool_destroy event in to pool_destroy and pool_export so they may be distinguished. Updated events_001_pos and events_002_pos test cases. They now check for the expected history events and were reworked to be more reliable. Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6712 Closes #6486 Conflicts: tests/zfs-tests/tests/functional/events/events_002_pos.ksh	2017-12-04 17:21:03 -08:00
Brian Behlendorf	841cb5ee2a	Fix dirty check in dmu_offset_next() The correct way to determine if a dnode is dirty is to check if any of the dn->dn_dirty_link's are active. Relying solely on the dn->dn_dirtyctx can result in the dnode being mistakenly reported as clean. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3125 Closes #6867	2017-11-21 13:11:29 -06:00
LOLi	fedc1d96a8	Fix truncate(2) mtime and ctime handling On Linux, ftruncate(2) always changes the file timestamps, even if the file size is not changed. However, in case of a successfull truncate(2), the timestamps are updated only if the file size changes. This translates to the VFS calling the ZFS Posix Layer "setattr" function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally set on the iattr mask only when doing a ftruncate(2), while the truncate(2) is left to the filesystem implementation to be dealt with. This behaviour is consistent with POSIX:2004/SUSv3 specifications where there's no explicit requirement for file size changes to update the timestamps only for ftruncate(2): http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html This has been later updated in POSIX:2008/SUSv4 where, for both truncate(2)/ftruncate(2), there's no mention of this size change requirement: http://austingroupbugs.net/view.php?id=489 http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html Unfortunately the Linux VFS is still calling into the ZPL without ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by explicitly updating the timestamps when detecting the ATTR_SIZE bit, which is always set in do_truncate(), on the iattr mask. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6811 Closes #6819	2017-11-21 13:11:29 -06:00
benrubson	59511072b4	OpenZFS 7531 - Assign correct flags to prefetched buffers Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Authored by: abraunegg <alex.braunegg@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/468008cb	2017-11-21 13:11:29 -06:00
Arkadiusz Bubała	a5c8119eba	Long hold the dataset during upgrade If the receive or rollback is performed while filesystem is upgrading the objset may be evicted in `dsl_dataset_clone_swap_sync_impl`. This will lead to NULL pointer dereference when upgrade tries to access evicted objset. This commit adds long hold of dataset during whole upgrade process. The receive and rollback will return an EBUSY error until the upgrade is not finished. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com> Closes #5295 Closes #6837	2017-11-21 13:03:21 -06:00
Tim Chase	d7881a6dca	Handle compressed buffers in __dbuf_hold_impl() In __dbuf_hold_impl(), if a buffer is currently syncing and is still referenced from db_data, a copy is made in case it is dirtied again in the txg. Previously, the buffer for the copy was simply allocated with arc_alloc_buf() which doesn't handle compressed or encrypted buffers (which are a special case of a compressed buffer). The result was typically an invalid memory access because the newly-allocated buffer was of the uncompressed size. This commit fixes the problem by handling the 2 compressed cases, encrypted and unencrypted, respectively, with arc_alloc_raw_buf() and arc_alloc_compressed_buf(). Although using the proper allocation functions fixes the invalid memory access by allocating a buffer of the compressed size, another unrelated issue made it impossible to properly detect compressed buffers in the first place. The header's compression flag was set to ZIO_COMPRESS_OFF in arc_write() when it was possible that an attached buffer was actually compressed. This commit adds logic to only set ZIO_COMPRESS_OFF in the non-ZIO_RAW case which wil handle both cases of compressed buffers (encrypted or unencrypted). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5742 Closes #6797	2017-11-21 13:01:30 -06:00
wli5	9add19b37d	Bug fix in qat_compress.c when compressed size is < 4KB When the 128KB block is compressed to less than 4KB, the pointer to the Footer is not in the end of the compressed buffer, that's because the Header offset was added twice for this case. So there is a gap between the Footer and the compressed buffer. 1. Always compute the Footer pointer address from the start of the last page. 2. Remove the un-used workaroud code which has been verified fixed with the latest driver and this fix. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #6827	2017-11-20 16:48:26 -06:00
wli5	318fdeb51f	Support integration with new QAT products Support integration with new QAT products: Intel(R) C62x Chipset, or Atom(R) C3000 Processor Product Family SoC: 1. Detect new file name in auto-conf. 2. Change MAX_INSTANCES to 48. 3. Change "num_inst" to U16 to clean a build warning. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #6767	2017-11-20 16:19:23 -06:00
Olaf Faaland	d3d20bf442	Reimplement vdev_random_leaf and rename it Rename it as mmp_random_leaf() since it is defined in mmp.c. The earlier implementation could end up spinning forever if a pool had a vdev marked writeable, none of whose children were writeable. It also did not guarantee that if a writeable leaf vdev existed, it would be found. Reimplement to recursively walk the device tree to select the leaf. It searches the entire tree, so that a return value of (NULL) indicates there were no usable leaves in the pool; all were either not writeable or had pending mmp writes. It still chooses the starting child randomly at each level of the tree, so if the pool's devices are healthy, the mmp writes go to random leaves with an even distribution. This was verified by testing using zfs_multihost_history enabled. Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6631 Closes #6665	2017-11-20 16:19:23 -06:00
Brian Behlendorf	e0eaaf8144	Fixes for SPARC support The current code base almost compiles on SPARC, but a few fixes are required for the code to compile (and work efficiently). Code in this PR comes from OpenZFS project which was initially dropped when porting the crypto framework. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pengcheng Xu <i@jsteward.moe> Closes #6733 Closes #6738 Closes #6750	2017-10-16 10:57:55 -07:00
Fabian Grünbichler	8d688ce66a	Skip FREEOBJECTS for objects which can't exist When sending an incremental stream based on a snapshot, the receiving side must have the same base snapshot. Thus we do not need to send FREEOBJECTS records for any objects past the maximum one which exists locally. This allows us to send incremental streams (again) to older ZFS implementations (e.g. ZoL < 0.7) which actually try to free all objects in a FREEOBJECTS record, instead of bailing out early. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Closes #5699 Closes #6507 Closes #6616	2017-10-16 10:57:55 -07:00
Fabian Grünbichler	b544fe4123	Free objects when receiving full stream as clone All objects after the last written or freed object are not supposed to exist after receiving the stream. Free them accordingly, as if a freeobjects record for them had been included in the stream. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Closes #5699 Closes #6507 Closes #6616	2017-10-16 10:57:55 -07:00
Brian Behlendorf	91b2f6ab1c	Fix ARC behavior on 32-bit systems With the addition of the ABD changes consumption of the virtual address space has been greatly reduced. This exposed an issue on CONFIG_HIGHMEM systems where free memory was being calculated incorrectly. Functionally this didn't cause any major problems prior to ABD because a lack of available virtual address space was used as an indicator of low memory. This patch makes the following changes to address the issue and in the process realigns the code further with OpenZFS. There are no substantive changes in behavior for 64-bit systems. * Added CONFIG_HIGHMEM case to the arc_all_memory() and arc_free_memory() functions to only consider low memory pages on CONFIG_HIGHMEM systems. * The arc_free_memory() function was updated to return bytes instead of pages to be consistent with the other helper functions. In user space we make up some reasonable values since currently only testing is performed in this context. * Adds three new values to the arcstats kstat to provide visibility in to the ARC's assessment of the memory situation: memory_all_bytes, memory_free_bytes, and memory_available_bytes. * Added kmem_reap() call to arc_available_memory() for 32-bit builds to realign code with OpenZFS. * Reduced size of test file in /async_destroy_001_pos.ksh to speed up test case. Multiple txgs are still required. * Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos to TEST_BASE_DIR location to speed up test cases. Reviewed-by: David Quigley <david.quigley@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5352 Closes #6734	2017-10-16 10:57:55 -07:00
Tobin Harding	80cc2f6111	Remove unnecessary equality check Currently `if` statement includes an assignment (from a function return value) and a equality check. The parenthesis are in the incorrect place, currently the code clobbers the function return value because of this. We can fix this by simplifying the `if` statement. `if (foo != 0)` can be more succinctly expressed as `if (foo)` Remove the equality check, add parenthesis to correct the statement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tobin C. Harding <me@tobin.cc> Closes #6685 Close #6719	2017-10-16 10:57:55 -07:00
Isaac Huang	b97948276d	Use linear abd in vdev_copy_uberblocks() The vdev_copy_uberblocks() function should use abd_alloc_linear() to allocate ub_abd, because abd_to_buf(ub_abd)) is used later. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #6718 Closes #6713	2017-10-16 10:57:55 -07:00
Ned Bass	4cfc086e4d	receive_freeobjects() skips freeing some objects When receiving a FREEOBJECTS record, receive_freeobjects() incorrectly skips a freed object in some cases. Specifically, this happens when the first object in the range to be freed doesn't exist, but the second object does. This leaves an object allocated on disk on the receiving side which is unallocated on the sending side, which may cause receiving subsequent incremental streams to fail. The bug was caused by an incorrect increment of the object index variable when current object being freed doesn't exist. The increment is incorrect because incrementing the object index is handled by a call to dmu_object_next() in the increment portion of the for loop statement. Add test case that exposes this bug. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6694 Closes #6695	2017-10-16 10:57:55 -07:00
chrisrd	25d232f407	Scale the dbuf cache with arc_c Commit `d3c2ae1` introduced a dbuf cache with a default size of the minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache is counted as metadata for the purposes of ARC size calculations. On a 1GB box the ARC maximum size defaults to c_max 493M which gives a dbuf cache default minimum size of 15.4M, and the ARC metadata defaults to minimum 16M. I.e. the dbuf cache is an significant proportion of the minimum metadata size. With other overheads involved this actually means the ARC metadata doesn't get down to the minimum. This patch dynamically scales the dbuf cache to the target ARC size instead of statically scaling it to the maximum ARC size. (The scale is still set by dbuf_cache_max_shift and the maximum size is still fixed by dbuf_cache_max_bytes.) Using the target ARC size rather than the current ARC size is done to help the ARC reach the target rather than simply focusing on the current size. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #6506 Closes #6561	2017-10-16 10:57:54 -07:00
Giuseppe Di Natale	bef6a8bc3a	Correct cppcheck errors (#6662 ) ZFS buildbot STYLE builder was moved to Ubuntu 17.04 which has a newer version of cppcheck. Handle the new cppcheck errors. uu_* functions removed in this commit were unused and effectively dead code. They are now retired. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6653	2017-09-20 12:59:21 -07:00
Brian Behlendorf	266b181e75	Increase default arc_c_min Increase the default arc_c_min value to which whichever is larger, either 32M or 1/32 of total system memory. This is advantageous for systems with more than 1G of memory where performance issues may occur when the ARC is allowed to collapse below a minimum size. At the same time we want to use the bare minimum value which is still functional so the filesystem can be used in very low memory environments. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6659	2017-09-20 10:25:54 -07:00
Brian Behlendorf	c474f5e9a7	Export symbol dmu_tx_mark_netfree() This symbol is needed by Lustre for the same reason it was needed by the ZPL. It should have been exported when the original patch was merged. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6660	2017-09-20 10:25:54 -07:00
Brian Behlendorf	661907e6bc	Linux 4.14 compat: IO acct, global_page_state, etc (#6655 ) generic_start_io_acct/generic_end_io_acct in the master branch of the linux kernel requires that the request_queue be provided. Move the logic from freemem in the spl to arc_free_memory in arc.c. Do this so we can take advantage of global_page_state interface checks in zfs. Upstream kernel replaced struct block_device with struct gendisk in struct bio. Determine if the function bio_set_dev exists during configure and have zfs use that if it exists. bio_set_dev https://github.com/torvalds/linux/commit/74d4699 global_node_page_state https://github.com/torvalds/linux/commit/75ef718 io acct https://github.com/torvalds/linux/commit/d62e26b Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6635 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-09-19 14:24:34 -07:00
Gaurav Kumar	d3e7d981d4	Modifying XATTRs doesnt change the ctime Changing any metadata, should modify the ctime. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: gaurkuma <gauravk.18@gmail.com> Closes #3644 Closes #6586	2017-09-13 16:05:18 -07:00
Brian Behlendorf	a2a0440918	Fix volume WR_INDIRECT log replay (#6620 ) The portion of the zvol_replay_write() handler responsible for replaying indirect log records for some reason never existed. As a result indirect log records were not being correctly replayed. This went largely unnoticed since the majority of zvol log records were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578. This patch updates zvol_replay_write() to correctly handle these log records and adds a new test case which verifies volume replay to prevent any regression. The existing test case which verified replay on filesystem was renamed slog_replay_fs.ksh for clarity. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6603	2017-09-13 16:04:16 -07:00
Giuseppe Di Natale	45d1abc74d	Improved dnode allocation and dmu_hold_impl() (#6611 ) Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the code, fix errors introduced by commit `dbeb879` (PR #6117) interacting badly with large dnodes, and improve performance. * When allocating a new dnode in dmu_object_alloc_dnsize(), update the percpu object ID for the core's metadnode chunk immediately. This eliminates most lock contention when taking the hold and creating the dnode. * Correct detection of the chunk boundary to work properly with large dnodes. * Separate the dmu_hold_impl() code for the FREE case from the code for the ALLOCATED case to make it easier to read. * Fully populate the dnode handle array immediately after reading a block of the metadnode from disk. Subsequently the dnode handle array provides enough information to determine which dnode slots are in use and which are free. * Add several kstats to allow the behavior of the code to be examined. * Verify dnode packing in large_dnode_008_pos.ksh. Since the test is purely creates, it should leave very few holes in the metadnode. * Add test large_dnode_009_pos.ksh, which performs concurrent creates and deletes, to complement existing test which does only creates. With the above fixes, there is very little contention in a test of about 200,000 racing dnode allocations produced by tests 'large_dnode_008_pos' and 'large_dnode_009_pos'. name type data dnode_hold_dbuf_hold 4 0 dnode_hold_dbuf_read 4 0 dnode_hold_alloc_hits 4 3804690 dnode_hold_alloc_misses 4 216 dnode_hold_alloc_interior 4 3 dnode_hold_alloc_lock_retry 4 0 dnode_hold_alloc_lock_misses 4 0 dnode_hold_alloc_type_none 4 0 dnode_hold_free_hits 4 203105 dnode_hold_free_misses 4 4 dnode_hold_free_lock_misses 4 0 dnode_hold_free_lock_retry 4 0 dnode_hold_free_overflow 4 0 dnode_hold_free_refcount 4 57 dnode_hold_free_txg 4 0 dnode_allocate 4 203154 dnode_reallocate 4 0 dnode_buf_evict 4 23918 dnode_alloc_next_chunk 4 4887 dnode_alloc_race 4 0 dnode_alloc_next_block 4 18 The performance is slightly improved for concurrent creates with 16+ threads, and unchanged for low thread counts. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>	2017-09-13 15:46:15 -07:00
dbavatar	89950722c6	Linux 4.8+ compatibility fix for vm stats vm_node_stat must be used instead of vm_zone_stat. Unfortunately the old code still compiles potentially leading to silent failure of arc_evictable_memory() AKAMAI: CR 3816601: Regression in zfs dropcache test Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6528	2017-09-13 14:21:59 -07:00

1 2 3 4 5 ...

1637 Commits